Performance comparison between Hadoop and Spark frameworks using HiBench benchmarks

Yassir Samadi; Mostapha Zbakh; Claude Tadonki

doi:10.1002/cpe.4367

Journal article, Concurrency and Computation: Practice and Experience, Year: 2017

Yassir Samadi

Role: Author

Ecole Nationale Supérieure d'Informatique et d'Analyses des Systèmes

Centre de Recherche en Informatique

Mostapha Zbakh

Role: Author

Ecole Nationale Supérieure d'Informatique et d'Analyses des Systèmes

Claude Tadonki

Role: Author
PersonId : 4301
IdHAL : claude-tadonki
ORCID : 0000-0003-1194-6400
IdRef : 157021890

Centre de Recherche en Informatique

Mines Paris - PSL (École nationale supérieure des mines de Paris)

Abstract

Big Data has become one of the major areas of research for cloud service providers because of the large amount of data produced every day and the inability of traditional algorithms and technologies to handle it. Big Data, with its characteristics such as volume, variety, and veracity (3V), requires efficient technologies to be processed in real time. To process and analyze this vast amount of data, there are many powerful tools, such as Hadoop and Spark, which are mainly used in the context of Big Data and which follow the principles of parallel computing. The challenge is to determine which Big Data tool is better for a given processing context. In this paper, we present and discuss a performance comparison between two popular Big Data frameworks deployed on virtual machines. Hadoop MapReduce and Apache Spark are used to efficiently process vast amounts of data in parallel and distributed mode on large clusters, and both are suited to Big Data processing. We also present the execution results of Apache Hadoop on Amazon EC2, a major cloud computing environment. To compare the performance of these two frameworks, we use the HiBench benchmark suite, an experimental approach to measuring the effectiveness of a computer system. The comparison is based on three criteria: execution time, throughput, and speedup. We test the WordCount workload with different data sizes for more accurate results. Our experimental results show that the performance of these frameworks varies significantly with the use case implementation. Furthermore, we conclude from our results that Spark is more efficient than Hadoop at handling large amounts of data in most cases. However, Spark requires a higher memory allocation, since it loads the data to be processed into memory and keeps it cached for a while, just like standard databases. The choice therefore depends on the required performance level and on memory constraints.
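As an illustration of the three comparison criteria named in the abstract, the following minimal sketch derives throughput and speedup from measured execution times. The definitions used here (throughput as input bytes per second, speedup as the ratio of a baseline time to a candidate time on the same input) are common conventions and are assumptions; the abstract does not state the paper's exact formulas. The `Run` class, the function names, and the timings are hypothetical and are not results from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    """One benchmark run: input size in bytes and wall-clock time in seconds."""
    framework: str
    input_bytes: int
    seconds: float


def throughput(run: Run) -> float:
    """Bytes processed per second of execution time (assumed definition)."""
    if run.seconds <= 0:
        raise ValueError("execution time must be positive")
    return run.input_bytes / run.seconds


def speedup(baseline: Run, candidate: Run) -> float:
    """Baseline time divided by candidate time on the same input.

    A value above 1 means the candidate finished faster (assumed definition).
    """
    if baseline.input_bytes != candidate.input_bytes:
        raise ValueError("runs must use the same input size")
    if candidate.seconds <= 0:
        raise ValueError("execution time must be positive")
    return baseline.seconds / candidate.seconds


# Hypothetical timings for illustration only; NOT measurements from the paper.
hadoop_run = Run("hadoop", input_bytes=1_000_000_000, seconds=200.0)
spark_run = Run("spark", input_bytes=1_000_000_000, seconds=80.0)

print(throughput(hadoop_run))          # 5000000.0 bytes per second
print(speedup(hadoop_run, spark_run))  # 2.5
```

Requiring equal input sizes in `speedup` keeps the ratio meaningful, which matters when a workload such as WordCount is repeated across several data sizes.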

Keywords

Amazon EC2; Big Data; cloud computing; Hadoop; HiBench; parallel and distributed processing; Spark

Domains

Computer Science [cs]; Distributed, Parallel, and Cluster Computing [cs.DC]

Contributor: Claire Medrala

https://minesparis-psl.hal.science/hal-01678981

Submitted on: Tuesday, January 9, 2018, 15:41:55

Last modified on: Friday, April 19, 2024, 16:18:57

Dates and versions

hal-01678981 , version 1 (09-01-2018)

Identifiers

HAL Id : hal-01678981 , version 1
DOI : 10.1002/cpe.4367

Cite

Yassir Samadi, Mostapha Zbakh, Claude Tadonki. Performance comparison between Hadoop and Spark frameworks using HiBench benchmarks. Concurrency and Computation: Practice and Experience, 2017, Special Issue paper, ⟨10.1002/cpe.4367⟩. ⟨hal-01678981⟩

Collections

INSTITUT-TELECOM ENSMP ENSMP_CRI PARISTECH PSL ENSMP_DR

237 views

0 downloads