P-Codec: Parallel Compressed File Decompression Algorithm for Hadoop

Idris Hanafi; Amal Abdel-Raouf

doi:10.24297/ijct.v15i8.1500

Authors

Idris Hanafi Southern Connecticut State University
Amal Abdel-Raouf Computer Science Department, Southern Connecticut State University, USA and the Electronic Research Institute (ERI), Egypt

DOI:

https://doi.org/10.24297/ijct.v15i8.1500

Keywords:

Hadoop, MapReduce, HDFS, Compression, Parallelism

Abstract

The increasing amount and size of data handled by data analytic applications running on Hadoop has created a need for faster data processing. One effective method for handling large data sizes is compression. Data compression not only makes network I/O processing faster, but also provides better utilization of resources. However, this approach defeats one of Hadoop's main purposes, which is the parallelism of map and reduce tasks. The number of map tasks created is determined by the size of the file, so compressing a large file reduces the number of mappers, which in turn decreases parallelism. Consequently, standard Hadoop takes longer to process the data. In this paper, we propose the design and implementation of a Parallel Compressed File Decompressor (P-Codec) that improves the performance of Hadoop when processing compressed data. P-Codec includes two modules. The first module decompresses data upon retrieval by a data node during the phase of uploading the data to the Hadoop Distributed File System (HDFS). This process reduces the runtime of a job by removing the burden of decompression during the MapReduce phase. The second module is a decompressed map task divider that increases parallelism by dynamically changing the map task split sizes based on the size of the final decompressed block. Our experimental results using five different MapReduce benchmarks show an average improvement of approximately 80% compared to standard Hadoop.
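
The relationship between file size and the number of map tasks can be illustrated with a small sketch. This is a simplified model, not the P-Codec implementation or Hadoop's actual split computation: the function name `num_map_tasks`, the 128 MB split size, and the 4 GB and 1 GB file sizes are all assumptions chosen for illustration.

```python
import math

MB = 1024 * 1024

def num_map_tasks(file_size_bytes, split_size_bytes=128 * MB):
    """Approximate the number of map tasks as ceil(file size / split size).

    Simplified model for illustration only; the 128 MB default split
    size is an assumption, not a value taken from the paper.
    """
    if file_size_bytes <= 0:
        return 0
    return math.ceil(file_size_bytes / split_size_bytes)

# Hypothetical figures: 4 GB of raw data that compresses to 1 GB.
raw_size = 4096 * MB
compressed_size = 1024 * MB

# Splits derived from the compressed size yield fewer mappers than
# splits derived from the decompressed size.
mappers_compressed = num_map_tasks(compressed_size)
mappers_decompressed = num_map_tasks(raw_size)

print(mappers_compressed, mappers_decompressed)
```

Under these assumed figures, sizing splits from the decompressed data gives four times as many map tasks as sizing them from the compressed file, which is the loss of parallelism the abstract describes.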

Author Biographies

Idris Hanafi, Southern Connecticut State University

Idris Hanafi is a graduate researcher in the field of Big Data Systems at Southern Connecticut State University (SCSU). He is also a first-year Master's student at SCSU studying Computer Science.
