7 tips on how to fix a large amount of data in Hadoop

I have been working on an international big data project. The cluster is huge, with TBs of memory and hundreds of CPUs – another great experience in the real world of big data and parallel computing for our team. In this short article, I’d like to share a story about some lessons learned with a happy ending (and data fixes 🙂 ).

About two years ago, we found a strange bug in our cluster. Wherever there was a % sign in the input data, it was duplicated and changed to %%. This may not seem critical, but if you use pseudonymization to fulfil all security and data privacy requirements, it is. Our former colleagues had attempted to fix it using a purpose-built utility based on Java and MapReduce. It worked, but it was extremely slow, at around 200 GB per day: definitely not enough to make it feasible in real life. Then the issue was closed with the statement that someone would fix it, and as there was no pressure, nobody did.

After approximately one year, we had a data issue, and I realized that it was still the same bug. I had expected that someone would have tried the Hive function regexp_replace as a baseline, but nobody had. Testing showed that it was a surprisingly fast solution, at around 25 TB per day.
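
To make the transformation concrete, here is a minimal Python sketch of what the fix does to a single value. It is an illustration only, not the statement we ran: in Hive the equivalent would be an expression along the lines of regexp_replace(some_column, '%%', '%'), where some_column is a hypothetical column name. The function name fix_percent is also just an assumption for this example.

```python
import re


def fix_percent(value: str) -> str:
    """Collapse each doubled percent sign back into a single one.

    The bug duplicated every '%', so replacing pairs from left to right
    restores the original text, including values that legitimately
    contained '%%' before the bug (they arrive as '%%%%').
    """
    return re.sub(r"%%", "%", value)


if __name__ == "__main__":
    print(fix_percent("discount 10%% off"))
    print(fix_percent("already clean"))
```

Because the replacement works on non-overlapping pairs, it is the exact inverse of the duplication, which is why a plain pattern is enough here.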

Initially, it seemed that we needed to fix approximately 400 TB. We wanted to solve the issue with minimal user impact, and at 25 TB per day, 400 TB would mean 16 days of work, which was a no-go. After a discussion with some business users, the plan was changed to fix only what was really necessary: around 18 TB.
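
The estimate behind that decision is simple division. The sketch below uses the figures quoted above; the helper name days_needed is only an assumption for illustration.

```python
def days_needed(volume_tb: float, throughput_tb_per_day: float) -> float:
    """Return how many days a fix takes at a given daily throughput."""
    if throughput_tb_per_day <= 0:
        raise ValueError("throughput must be positive")
    return volume_tb / throughput_tb_per_day


if __name__ == "__main__":
    # Full scope at the throughput measured in the first test
    print(days_needed(400, 25))  # 16.0
    # Reduced scope agreed with the business users
    print(days_needed(18, 25))   # 0.72
```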

Tip #1: Ask business users for their exact requirements. You can save yourself a lot of work.

Our latest performance test said that the real speed was around 14 TB per day, depending on the file type and compression. The goal was to avoid or limit downtime as much as possible. So, we decided to compute the fixed data next to the original data and then prepare everything for the day it would be exchanged. The main fix ran smoothly and fixed around 17 TB per day. That is the full size of many DWH systems, but this is a big data system.

On the day of our “big data exchange”, the estimated time to complete was approximately 3 hours, so we announced that it would take 8 hours, just to be sure that we would have enough time to deal with potential issues along the way. But, as usual, we missed some risks.

At 6 am, the transfer to our system stopped. We had to wait until all the data arrived. Then it was necessary to finish all the transformations and start our fix. After a while, we realized that something was wrong. Some transformations had failed because the disk was full.
That was fixed, yet some transformations were still not running, and had not been for quite some time. This was something we could handle within our plan.

Tip #2: The day before the operation, check to see if all processes are running smoothly

Once that was resolved, we started our fix. But it was slow: each table took 3-4 minutes, and with around 150 tables to fix, running them one after another would have taken 7.5 to 10 hours. We had overlooked that the overhead for small jobs was so large that we had to run them in parallel.

Tip #3: Think about the possibility of parallelization and the differences between small and large jobs
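
As a rough illustration of Tip #3, the following Python sketch runs many small per-table jobs concurrently instead of one by one. It is not our actual tooling: fix_table is a hypothetical placeholder for whatever submits the real job, and the worker count of 4 and the table names are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fix_table(table: str) -> str:
    # Hypothetical placeholder: real code would submit the fix job for
    # this table and wait for it to finish.
    return f"{table}_fixed"


def fix_all(tables, max_workers=4):
    """Run fix_table for every table in parallel and collect outcomes."""
    results, failures = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fix_table, t): t for t in tables}
        for future in as_completed(futures):
            table = futures[future]
            try:
                results[table] = future.result()
            except Exception as exc:  # keep going, report failures at the end
                failures[table] = exc
    return results, failures


if __name__ == "__main__":
    tables = [f"table_{i:03d}" for i in range(150)]  # illustrative names
    done, failed = fix_all(tables)
    print(len(done), "fixed,", len(failed), "failed")
```

Collecting failures separately rather than stopping at the first error matters for a run like this, because one broken table should not block the rest.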

Fortunately, the process went faster and faster because 80% of the tables failed. WHY?!
Because of missing Avro schemas: the schemas for our _fixed tables had been removed during the last deployment!

Tip #4: Test after each deployment

Then the central ticketing system was shut down for maintenance. We managed the whole operation via Webex, so it did not affect us very much, but:

Tip #5: Check that there will be no tool outages

We created the Avro schemas, split the list into six buckets, and in an hour, we were done.
But… one table failed. After a while, we found out that it had been renamed by the business users, but nobody had informed us. It was changed in the last deployment.
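
The split into six buckets can be sketched in a few lines of Python. This is only one possible way to do it (a round-robin split), not necessarily how our list was divided; the function name and the table names are assumptions for illustration.

```python
def split_into_buckets(items, n_buckets):
    """Distribute items round-robin into n_buckets lists of similar size."""
    if n_buckets < 1:
        raise ValueError("n_buckets must be at least 1")
    return [items[i::n_buckets] for i in range(n_buckets)]


if __name__ == "__main__":
    tables = [f"table_{i:03d}" for i in range(150)]  # illustrative names
    buckets = split_into_buckets(tables, 6)
    print([len(b) for b in buckets])  # [25, 25, 25, 25, 25, 25]
```

Each bucket can then be processed by its own session, so the per-job overhead is paid six times at once rather than in sequence.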

Tip #6 (similar to #4): Be aware of all deployments just before and during the operation

Done! All the checksums were OK, and we finished by 17:30. That is half an hour before the planned deadline 🙂 Remember that we planned an 8-hour time slot instead of 3 hours?

Summary: Everything was planned and checked many times and all the people responsible were involved. As usual, we came across some difficulties, but we dealt with them all. It is important to be prepared because you never know what can happen. And, of course, the last tip:

Tip #7: Try to solve data-related issues as soon as possible. It will be more expensive and time-consuming to fix them later.
