Performance comparison between Hadoop and Spark frameworks using HiBench benchmarks

Abstract: Big Data has become a major research area for cloud service providers because of the large amount of data produced every day and the inability of traditional algorithms and technologies to handle it. With its characteristics of volume, variety, and veracity (3V), Big Data requires efficient technologies to be processed in real time. Many powerful tools, such as Hadoop and Spark, are used to process and analyze this vast amount of data; they follow the principles of parallel computing. The challenge is to determine which Big Data tool is better for a given processing context. In this paper, we present and discuss a performance comparison between two popular Big Data frameworks deployed on virtual machines. Hadoop MapReduce and Apache Spark both process vast amounts of data in parallel and distributed mode on large clusters, and both are suited to Big Data processing. We also present the execution results of Apache Hadoop on Amazon EC2, a major cloud computing environment. To compare the performance of the two frameworks, we use the HiBench benchmark suite, an experimental approach for measuring the effectiveness of a computer system. The comparison is based on three criteria: execution time, throughput, and speedup. We test the Wordcount workload with different data sizes for more accurate results. Our experimental results show that the performance of these frameworks varies significantly with the use case implementation. Furthermore, we conclude from our results that Spark handles large amounts of data more efficiently than Hadoop in most cases. However, Spark requires a higher memory allocation, since it loads the data to be processed into memory and keeps it cached for a while, just like standard databases. The choice therefore depends on the required performance level and the memory constraints.
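
The three comparison criteria named in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the paper's exact definitions, so this assumes throughput is input size divided by execution time and speedup is the ratio of two execution times on the same input. The `Run` class, the `speedup` helper, and the timings are hypothetical and are not measurements from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    """One benchmark run (hypothetical record, not a HiBench data structure)."""
    framework: str
    input_bytes: int
    seconds: float  # execution time

    @property
    def throughput(self) -> float:
        """Assumed definition: bytes processed per second of execution time."""
        return self.input_bytes / self.seconds


def speedup(baseline: Run, candidate: Run) -> float:
    """Assumed definition: baseline time divided by candidate time, same input."""
    if baseline.input_bytes != candidate.input_bytes:
        raise ValueError("speedup is only meaningful for equal input sizes")
    return baseline.seconds / candidate.seconds


# Made-up timings, chosen only to keep the arithmetic easy to follow.
hadoop = Run("hadoop", input_bytes=1_000_000_000, seconds=200.0)
spark = Run("spark", input_bytes=1_000_000_000, seconds=80.0)

for run in (hadoop, spark):
    print(f"{run.framework}: {run.seconds:.0f} s, {run.throughput / 1e6:.1f} MB/s")
print(f"speedup of spark over hadoop: {speedup(hadoop, spark):.2f}x")
```

Repeating the calculation for each data size would show how the gap between the two frameworks changes as the input grows, which is why the abstract tests Wordcount at several sizes.
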
Document type:
Journal article

https://hal-mines-paristech.archives-ouvertes.fr/hal-01678981
Contributor: Claire Medrala
Submitted on: Tuesday, January 9, 2018 - 15:41:55
Last modified on: Monday, November 12, 2018 - 10:54:43

Citation

Yassir Samadi, Mostapha Zbakh, Claude Tadonki. Performance comparison between Hadoop and Spark frameworks using HiBench benchmarks. Concurrency and Computation: Practice and Experience, Wiley, 2017, ⟨10.1002/cpe.4367⟩. ⟨hal-01678981⟩
