Small Files Problem in Big Data Systems

Written by Imran Abdul Rauf

Technical Content Writer

Small files are one of the most common problems in big data systems, draining productivity and consuming significant resources. Left unhandled, they considerably slow down big data systems and leave your team with stale analytics.

The Hadoop Distributed File System (HDFS) is inefficient at storing large numbers of small files, which leads to wasted NameNode memory, excessive RPC calls, a poorly performing application layer, and degraded block-scanning throughput.

If you’re working as a big data administrator on any modern data lake, you will face the challenges posed by small files sooner or later. This blog discusses what small files are, the issues they cause, their business impact, and how to identify and counter the problems associated with them.

Small files and their effect on the business

Small files and their poor management impact the enterprise and big data teams in the following ways.

  • Slowing the processing speed: Small files tend to slow down jobs run from Spark, MapReduce, and Hive. In programming models like MapReduce, each map task processes one block at a time and each file consumes at least one map task; with a large number of small files, every map task receives very little input, so the number of tasks balloons (see the sketch after this list).
  • Slowing down reading: Retrieving data spread across many small files requires many separate read operations, which is far less efficient than scanning a few large files.
  • Wasted storage: Jobs running simultaneously can create thousands of files of 5 KB, or even 1 KB, each, and the overhead adds up quickly. Complexity increases further when there is no transparency about where exactly these files live.
  • Stale data: Slow reads, slow processing, and wasted storage result in stale data, which hinders the entire reporting and analytics process and delivers no real value in hand. If teams cannot run jobs quickly and responses are slow, the inefficiency shows up as slow decision-making and a lack of usable data. Businesses cannot trust such data to deliver the value expected for the enterprise’s IT objectives and big data teams.
  • Lack of scalability: Operational costs should grow roughly in proportion to business growth, but if your business grows 8 to 10 fold, the jump in operational expenses will not stay linear, which drives up the cost to scale. Small files are not a problem you can avoid entirely; still, following your department's big data management best practices gives you far better control over them, much as a production department keeps its operations running and brings in extra resources only when problems arise.
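
To make the task-inflation point concrete, here is a minimal PySpark sketch. The paths are hypothetical, and the comparison rests on the behavior of the classic Hadoop text input format, where each file contributes at least one input split and therefore at least one task.

```python
# Minimal sketch of the task-inflation effect of small files.
# Paths are hypothetical; the point is only to compare partition counts.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("small-file-task-count").getOrCreate()
sc = spark.sparkContext

# With classic Hadoop input formats, each file produces at least one input
# split, so a directory of 10,000 tiny files yields ~10,000 tasks.
many_small = sc.textFile("/data/landing/events/")    # thousands of small files
print("partitions (small files):", many_small.getNumPartitions())

# The same records held in a few block-sized files need only a few tasks.
few_large = sc.textFile("/data/compacted/events/")   # a handful of ~128 MB files
print("partitions (large files):", few_large.getNumPartitions())
```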

The small file problem

Consider the Hadoop Distributed File System as an example. HDFS is designed to handle large data sets, distributing data across many machines to enable parallel processing. Because metadata and data are kept in separate components, every file created occupies a minimum unit of NameNode memory regardless of its size.

A small file is typically defined as one significantly smaller than an HDFS block, which defaults to 128 MB. Yet even a file under 1 KB puts load on the NameNode and consumes a full metadata entry, just like a file that fills an entire block. In practice, large numbers of small files also cap cluster size, because there are hard limits on how many files a single NameNode can manage.
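
To see why this matters for the NameNode, here is a rough back-of-the-envelope calculation in Python. The figure of roughly 150 bytes of NameNode heap per file or block object is a commonly cited approximation rather than something stated in this article, so treat the numbers as illustrative only.

```python
# Back-of-the-envelope NameNode memory estimate. The ~150 bytes per
# file/block object is a commonly cited rule of thumb, not an exact figure.
BYTES_PER_OBJECT = 150          # assumption: rough cost of one inode or block
BLOCK_SIZE = 128 * 1024**2      # default HDFS block size

def namenode_bytes(num_files: int, avg_file_size: int) -> int:
    """Estimate NameNode heap used by num_files files of avg_file_size."""
    blocks_per_file = max(1, -(-avg_file_size // BLOCK_SIZE))  # ceiling division
    objects = num_files * (1 + blocks_per_file)                # inode + blocks
    return objects * BYTES_PER_OBJECT

one_gb = 1024**3
# 1 GB stored as eight 128 MB files vs. the same 1 GB as ~100,000 10 KB files.
print(namenode_bytes(8, BLOCK_SIZE))                      # ~2 KB of NameNode heap
print(namenode_bytes(one_gb // (10 * 1024), 10 * 1024))   # ~30 MB of NameNode heap
```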

How to identify and eliminate small files?

Out of the box, there is no straightforward way to determine how many files are on HDFS, how large each one is, or how and where users are creating them.
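
In the absence of a purpose-built tool, a quick first pass is to walk the namespace yourself. The sketch below shells out to `hdfs dfs -ls -R` and tallies files under an arbitrary size threshold per directory; the threshold and root path are assumptions, and on a very large namespace you would parse an fsimage dump instead of a live listing.

```python
# Rough first pass at locating small files on HDFS by parsing `hdfs dfs -ls -R`.
# The threshold and root path are assumptions for illustration.
import subprocess
from collections import Counter

SMALL_FILE_THRESHOLD = 16 * 1024**2   # flag anything under 16 MB (arbitrary)
ROOT = "/data"                        # hypothetical starting directory

listing = subprocess.run(
    ["hdfs", "dfs", "-ls", "-R", ROOT],
    capture_output=True, text=True, check=True,
).stdout

small_per_dir = Counter()
for line in listing.splitlines():
    fields = line.split()
    # File lines look like: perms replication owner group size date time path
    if len(fields) < 8 or fields[0].startswith("d"):
        continue                      # skip directories and header lines
    size, path = int(fields[4]), fields[7]
    if size < SMALL_FILE_THRESHOLD:
        small_per_dir[path.rsplit("/", 1)[0]] += 1

for directory, count in small_per_dir.most_common(20):
    print(f"{count:8d} small files in {directory}")
```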

Small files use case

Consider a system with multiple clusters and regions holding petabytes of data, consuming a large volume of storage and dragging down performance. Beyond the outright wasted storage, jobs running on Hive and MapReduce also slow down. Requirements have changed, and the data lake's users now want to solve the problem so they can serve a mobile app directly from the data lake platform, i.e., with sub-second response times.

The time budgets users expect from this system are shrinking significantly. Depending on the use case, your optimal block size might therefore be 64 MB rather than the default 128 MB. A critical scenario is dealing with files of around 1 KB, the sizes typically associated with IoT or sensor data.

Picture jobs where the infrastructure registers new data every 200 milliseconds and creates a new file every 60 seconds; the resulting files stay small and never exceed 10 KB. The takeaway from this use case is that although small files cannot be eliminated entirely, you can manage them and keep their share low relative to larger files. Small file management is a continuous process and requires a maintenance cycle (see the compaction sketch below).
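
As one concrete shape that maintenance cycle can take, here is a sketch of a periodic compaction job: it measures the landing directory with `hdfs dfs -du -s`, then rewrites the small files as a handful of roughly block-sized files. The paths, the Parquet format, the 128 MB target, and the scheduling cadence are all assumptions for illustration.

```python
# Minimal sketch of a periodic compaction pass over a landing directory of
# small Parquet files. Paths and target size are assumptions; run something
# like this on a regular cron/Airflow cadence.
import subprocess
from pyspark.sql import SparkSession

LANDING = "/data/landing/sensors/"        # hypothetical small-file directory
COMPACTED = "/data/compacted/sensors/"    # hypothetical destination
TARGET_FILE_BYTES = 128 * 1024**2         # aim for roughly block-sized files

# Total bytes under the landing directory, via `hdfs dfs -du -s`.
du = subprocess.run(["hdfs", "dfs", "-du", "-s", LANDING],
                    capture_output=True, text=True, check=True).stdout
total_bytes = int(du.split()[0])
num_output_files = max(1, total_bytes // TARGET_FILE_BYTES)

spark = SparkSession.builder.appName("small-file-compaction").getOrCreate()

# Rewrite the many small inputs as a handful of block-sized Parquet files.
(spark.read.parquet(LANDING)
      .coalesce(num_output_files)
      .write.mode("overwrite")
      .parquet(COMPACTED))
```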

What tools manage small files?

The best tool should be software agnostic and provide a cross-sectional view of the business's big data infrastructure. The issues enterprises and big data teams face with the usual options are three-fold.

  • Single-threaded APM tools: Old-school APM tools are single-threaded and built for web apps, and none of them offers deep file system analytics.
  • Engine-specific optimization tools: Spark, Hive, and other query engines ship their own monitoring and optimization tools and features, but each focuses solely on its own platform, and optimization happens only for data ingested through that tool.
  • In-house scripts and manual intervention: Unfortunately, analyzing small files and the problems they cause in a company's big data system is still largely manual. It relies mainly on support personnel combing through individual folders, and as companies grow exponentially and their systems become proportionally more complex, these home-grown monitoring approaches are no longer up to the task.

The small file solution

A business can acquire multiple single-threaded application management tools and still be unable to address the problems of a scaling big data system. A better solution is a software-agnostic tool that spans engines such as Spark, Hadoop, Hive, and Presto and provides a multi-dimensional, cross-sectional view in real time, along with the following traits.

  • An online, real-time view of the small files present in your data lake.
  • An actionable plan providing a bird’s eye view of your business’s entire data ecosystem to help teams identify the origination of these issues.
  • The ability to troubleshoot an identified problem by tracing where it started.
  • The ability to act on and validate the root cause of the problem.

Related content: Data Lake vs. Data Warehouse—Which One to Choose?

Takeaways

Yes, small files can seriously disrupt big data systems, but the actions above, including timely identification, compaction, compression, and deletion, help you keep the problem under control. Royal Cyber is a data analytics consultancy that helps businesses make the best use of their data and make better-informed decisions.
