Is Apache Spark faster than Hadoop?

Welcome to the Apache Spark series. In this post, you will learn whether Apache Spark is faster than Hadoop.

Let us understand the basics first.

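At its core, Hadoop pairs two components: the Hadoop Distributed File System (HDFS), which stores the data, and MapReduce, which processes it. (Hadoop 2 and later also include YARN for resource management.)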

MapReduce is a framework for processing data. We can write batch jobs using Hadoop MapReduce.
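To make the map and reduce idea concrete, here is a minimal sketch of a word count written in the MapReduce style. It is plain Python running in a single process, not the Hadoop API, and the function names (map_phase, shuffle, reduce_phase, word_count) are made up for this illustration.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group the emitted values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reducer: collapse all the counts for one word into a single total."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in order over a list of text lines."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["spark is fast", "hadoop is batch"]))
```

In a real Hadoop job the mappers and reducers run on different machines, and the framework handles the shuffle step for us.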

Apache Spark is an in-memory data processing engine. We can write both batch and real-time processing jobs with it.

Now, to the question: is Apache Spark faster than Hadoop?

The answer is yes! Apache Spark is faster than Hadoop MapReduce.

We all want our large datasets to be processed quickly. To visualize or analyze the data, we need a framework that can process it fast. Hadoop MapReduce processes the data in parallel and gives us the result, but is it really fast? Not really. The reason is that MapReduce writes the intermediate results of the mappers and reducers to disk and reads them back, and these repeated reads and writes slow the processing down. The solution is Apache Spark. The Spark project has advertised speeds of up to 100x faster than Hadoop MapReduce for workloads that fit in memory, and this is one of the key features of Apache Spark.

But how? Spark processes the data in memory and performs only minimal I/O, when data spills over. Intermediate data is kept in memory, and the data is distributed across the nodes of the cluster so that it can be processed in parallel.
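The difference between the two approaches can be sketched in a few lines. This is a toy, single-machine simulation in plain Python, not Spark or Hadoop code: the three steps in STEPS and both function names are hypothetical, and JSON files in a temporary directory stand in for the intermediate files a MapReduce chain would write.

```python
import json
import os
import tempfile

# Three toy processing steps, applied in order (a hypothetical workload).
STEPS = [
    lambda xs: [x * 2 for x in xs],
    lambda xs: [x + 1 for x in xs],
    lambda xs: [x for x in xs if x % 3 == 0],
]


def run_with_disk_intermediates(data, workdir):
    """MapReduce-like: each step writes its output to disk and the next step reads it back."""
    current = data
    io_ops = 0
    for i, step in enumerate(STEPS):
        path = os.path.join(workdir, f"step_{i}.json")
        with open(path, "w") as f:
            json.dump(step(current), f)  # write the intermediate result
        io_ops += 1
        with open(path) as f:
            current = json.load(f)  # read it back for the next step
        io_ops += 1
    return current, io_ops


def run_in_memory(data):
    """Spark-like: intermediate results stay in memory between steps."""
    current = data
    for step in STEPS:
        current = step(current)
    return current, 0  # no disk I/O for intermediate results


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        print(run_with_disk_intermediates(list(range(10)), workdir))
    print(run_in_memory(list(range(10))))
```

Both functions return the same result, but the disk-based version pays for one write and one read per step. On real clusters those writes go to local disks or HDFS, so the cost grows with the data size and the number of steps.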

What if the data is larger than the available RAM? In this case, Spark writes the spillover data to disk, so only limited reads and writes are performed. Apache Spark can deliver near-real-time analytics at up to 100x the speed of Hadoop MapReduce.
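The spillover idea can be illustrated with a small buffer that keeps records in memory until it reaches a limit and only then writes a batch to disk. This is a simplified sketch, not how Spark manages memory internally: the SpillableBuffer class, its max_in_memory parameter and the pickle file format are all assumptions made for this example.

```python
import os
import pickle
import tempfile


class SpillableBuffer:
    """Hold up to max_in_memory records in RAM and spill full batches to disk."""

    def __init__(self, max_in_memory, spill_dir):
        self.max_in_memory = max_in_memory
        self.spill_dir = spill_dir
        self.in_memory = []
        self.spill_files = []

    def add(self, record):
        # Spill first if the in-memory buffer is already full.
        if len(self.in_memory) >= self.max_in_memory:
            self._spill()
        self.in_memory.append(record)

    def _spill(self):
        # Write the current batch to its own file and free the memory.
        path = os.path.join(self.spill_dir, f"spill_{len(self.spill_files)}.pkl")
        with open(path, "wb") as f:
            pickle.dump(self.in_memory, f)
        self.spill_files.append(path)
        self.in_memory = []

    def __iter__(self):
        # Read spilled batches back in order, then yield what is still in memory.
        for path in self.spill_files:
            with open(path, "rb") as f:
                yield from pickle.load(f)
        yield from self.in_memory


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as spill_dir:
        buffer = SpillableBuffer(max_in_memory=4, spill_dir=spill_dir)
        for n in range(10):
            buffer.add(n)
        print(len(buffer.spill_files), "spill files;", len(buffer.in_memory), "records in memory")
        print(list(buffer))
```

Disk is touched only when the buffer overflows, so a dataset that fits in memory never triggers a write at all.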

