Written by Imran Abdul Rauf
Technical Content Writer
Small files are one of the most common problems in big data systems: they hurt productivity and consume significant resources. Left unhandled, small files slow big data systems down considerably and leave your team with stale analytics.
The Hadoop distributed file system (HDFS) is inefficient at storing small files, which leads to inefficient NameNode memory utilization and RPC calls, a low-performing application layer, and degraded block-scanning throughput.
If you’re working as a big data administrator in any modern data lake, you will face the challenges posed by small files sooner or later. This blog discusses the concept of small files, the issues they cause, their business impact, and how to identify and counter the problems associated with them.
Small files and their poor management impact the enterprise and big data teams in the following ways.
Consider the Hadoop distributed file system as an example. HDFS is designed to handle large data sets, and the data is distributed over many machines to enable parallel processing. Because metadata and data are kept in separate components, every file occupies a minimum amount of NameNode memory regardless of its size.
Small files are typically defined as files smaller than one HDFS block, or 128 MB. However, even a file under 1 KB puts the same metadata load on the NameNode as a file that fills a whole 128 MB block. In practice, small file sizes also limit how much data a cluster can hold, as there are definite limits on the number of files a single NameNode can manage.
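To make the metadata cost concrete, here is a back-of-the-envelope sketch. It assumes the commonly cited rule of thumb of roughly 150 bytes of NameNode heap per namespace object (one object per file plus one per block); that figure and the function names are assumptions for illustration, not measurements from this article.

```python
# Rough NameNode heap estimate for storing the same data as small vs. large files.
BYTES_PER_OBJECT = 150          # assumption: rule-of-thumb heap cost per file or block object
BLOCK_SIZE = 128 * 1024 ** 2    # 128 MB block size, as in the text


def namenode_objects(file_size: int, block_size: int = BLOCK_SIZE) -> int:
    """Namespace objects for one non-empty file: one file entry plus one per block."""
    blocks = -(-file_size // block_size)  # ceiling division
    return 1 + blocks


def namenode_heap_bytes(total_bytes: int, file_size: int) -> int:
    """Approximate heap needed to track total_bytes stored as files of file_size."""
    n_files = -(-total_bytes // file_size)  # ceiling division
    return n_files * namenode_objects(file_size) * BYTES_PER_OBJECT


if __name__ == "__main__":
    one_tb = 1024 ** 4
    for label, size in [("10 KB files", 10 * 1024), ("128 MB files", BLOCK_SIZE)]:
        heap = namenode_heap_bytes(one_tb, size)
        print(f"1 TB as {label}: ~{heap / 1024 ** 2:.1f} MB of NameNode heap")
```

Under these assumptions, the same terabyte costs several orders of magnitude more NameNode memory when it is stored as 10 KB files rather than as block-sized files.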
There is no straightforward way to determine the number of files present on HDFS, the size of each file, or how and where users are creating these files.
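One rough starting point is to summarize a recursive listing yourself. The sketch below assumes you have captured the text output of `hdfs dfs -ls -R <path>` and that each entry has eight whitespace-separated fields (permissions, replication, owner, group, size, date, time, path). The function name, the sample paths, and the threshold are hypothetical, and a full recursive listing may be too slow on very large namespaces.

```python
from collections import Counter
import posixpath

SMALL_FILE_THRESHOLD = 128 * 1024 ** 2  # one HDFS block, per the definition in the text


def small_files_by_dir(listing: str, threshold: int = SMALL_FILE_THRESHOLD) -> Counter:
    """Count files smaller than `threshold` per parent directory.

    `listing` is assumed to be the captured output of `hdfs dfs -ls -R`.
    """
    counts = Counter()
    for line in listing.splitlines():
        parts = line.split(None, 7)  # maxsplit keeps paths containing spaces intact
        # Skip headers such as "Found N items", malformed lines, and directories.
        if len(parts) != 8 or parts[0].startswith("d") or not parts[4].isdigit():
            continue
        size, path = int(parts[4]), parts[7]
        if size < threshold:
            counts[posixpath.dirname(path)] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical sample listing for illustration only.
    sample = """\
drwxr-xr-x   - etl supergroup          0 2023-01-01 12:00 /data/iot
-rw-r--r--   3 etl supergroup       1024 2023-01-01 12:00 /data/iot/s1.json
-rw-r--r--   3 etl supergroup       9000 2023-01-01 12:01 /data/iot/s2.json
-rw-r--r--   3 etl supergroup  268435456 2023-01-01 12:02 /data/warehouse/part-0
"""
    for directory, n in small_files_by_dir(sample).most_common():
        print(directory, n)
```

Sorting directories by small-file count gives a first hint about where, and by which jobs, the small files are being created.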
Consider a system with multiple clusters and regions holding petabytes of data, where small files consume a large storage volume and slow down performance. Besides the wasted storage, jobs such as Hive and MapReduce also run slower. As requirements change and the data lake concept takes hold, users want to solve this problem so they can serve workloads such as a mobile app from their data lake platform, i.e., with sub-second response times.
The time windows within which users want to use this system are shrinking significantly. Hence, depending on the use case, your optimal block size may be 64 MB. A critical scenario is dealing with typical file sizes of 1 KB, which are usually associated with IoT or sensor data.
Consider jobs where the infrastructure registers data every 200 milliseconds and creates a new file every 60 seconds. Even so, these files are small and won’t cross 10 KB. We can conclude from this use case that although small files can’t be eliminated entirely, you can manage them and reduce how often they occur relative to bigger files. Small file management is a continuous process and requires a maintenance cycle.
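As an illustration of what one pass of such a maintenance cycle might plan, the sketch below groups small files into batches that each stay under a target size, so that every batch can later be rewritten as one larger file. `plan_compaction` and the file names are hypothetical, and the sketch only plans the batches; how they are merged depends on your file format and tooling.

```python
from typing import List, Tuple


def plan_compaction(files: List[Tuple[str, int]], target_bytes: int) -> List[List[str]]:
    """Group (path, size) pairs, in order, into batches of at most target_bytes.

    Input order is preserved so time-ordered sensor files stay together.
    A single file larger than target_bytes ends up in a batch of its own.
    """
    batches: List[List[str]] = []
    current: List[str] = []
    current_size = 0
    for path, size in files:
        # Close the current batch if adding this file would exceed the target.
        if current and current_size + size > target_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(path)
        current_size += size
    if current:
        batches.append(current)
    return batches


if __name__ == "__main__":
    # One 10 KB file per minute for a day, as in the sensor scenario above.
    day = [(f"/data/iot/sensor-{i:04d}.json", 10 * 1024) for i in range(24 * 60)]
    batches = plan_compaction(day, target_bytes=128 * 1024 ** 2)
    print(f"{len(day)} small files -> {len(batches)} compacted file(s)")
```

At one file per minute, a single sensor produces 1,440 files a day, which is why this planning step needs to run on a recurring schedule rather than once.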
The best tool should be software agnostic and provide a cross-sectional view of the business's big data infrastructure. The issue enterprises and big data teams face with their preferred means is three-fold.
A business can acquire multiple single-threaded application management tools, which still might not be capable of recognizing the problems associated with scaling big data systems. The best solution would be a tool that is agnostic across software such as Spark, Hadoop, Hive, and Presto, and that provides a multi-dimensional, cross-sectional view in real time alongside the following traits.
Yes, small files can seriously disrupt big data systems, but you can always manage the problem with the actions above, including timely identification, compaction, compression, and deletion. Royal Cyber is a data analytics consultancy that helps businesses make the best use of their data and make better, more informed decisions.