Apache Spark tutorial – a complete guide



In this tutorial, we will learn Apache Spark. After reading it in full, you should have an in-depth understanding of Apache Spark.


  1. Introduction to Apache Spark
  2. Guide to install Apache Spark on Ubuntu 16.04/18.04
  3. Apache Spark features explained in detail
  4. Apache Spark architecture – Master and worker
  5. Apache Spark architecture – Driver and Executor
  6. SparkContext
  7. SparkSession
  8. RDD (Resilient distributed dataset)
  9. RDD Features


What is Apache Spark?

Apache Spark is an open-source, high-speed, near real-time, fault-tolerant, in-memory cluster computing framework.

Let us break down each feature:

  • Open source – completely free to use and modify
  • High-speed – up to 100x faster than Hadoop MapReduce for in-memory workloads
  • Near real-time processing – processes data with very low latency
  • Fault-tolerant – recovers from failures efficiently
  • In-memory cluster computing framework – data is loaded into RAM and processed there, reducing read/write I/O operations

Apache Spark History

Apache Spark started as a research project at the UC Berkeley AMPLab in 2009 and was open-sourced in 2010. In 2013, it was donated to the Apache Software Foundation. It has since become one of Apache's most successful projects and is being adopted extensively by top organizations around the globe, and its developer community has grown accordingly.

Why was Apache Spark designed?

The limitations of Hadoop MapReduce were the main reason for designing and developing the Apache Spark framework.

As per the official documentation,

Our goal was to design a programming model that supports a much wider class of applications than MapReduce, while maintaining its automatic fault tolerance. In particular, MapReduce is inefficient for multi-pass applications that require low-latency data sharing across multiple parallel operations.

So Apache Spark was designed to overcome these limitations of traditional MapReduce.

What are the important limitations of MapReduce?

  1. MapReduce is a batch-processing framework for large datasets. It operates on data that has already been stored for some time and is not meant for real-time analytics: the end user has to wait for the MapReduce program to complete before they can visualize the results. It offers batch processing only, no real-time processing.
  2. MapReduce runs in two phases, Map and Reduce. The output of each task is written to the file system and read back by the following tasks, so MapReduce performs many read and write operations. This slows the program down and increases latency.
  3. It is not suitable for iterative processing, that is, chaining/sequencing one MapReduce job into another.
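Limitation 2 above can be sketched in plain Python. This is a toy illustration (not Hadoop itself): the mapper's output is materialized to the file system and read back before the reducer runs, which is exactly the extra I/O that Spark avoids by keeping intermediate data in memory.

```python
import json
import tempfile
from collections import defaultdict

def map_phase(lines):
    # Mapper: emit a (word, 1) pair for every word
    return [(word, 1) for line in lines for word in line.split()]

def reduce_phase(pairs):
    # Reducer: sum the counts per word
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["spark is fast", "spark is simple"]

# Between the two phases, MapReduce materializes intermediate output
# to the file system; we mimic that round trip with a temp file.
with tempfile.NamedTemporaryFile("w+", suffix=".json") as f:
    json.dump(map_phase(lines), f)
    f.seek(0)
    intermediate = [tuple(p) for p in json.load(f)]

result = reduce_phase(intermediate)
print(result)  # {'spark': 2, 'is': 2, 'fast': 1, 'simple': 1}
```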

What does Apache Spark achieve compared to MapReduce?

1. Up to 100x processing speed

How? It processes data in memory, and only minimal I/O operations are performed when data spills over. Intermediate data is kept in memory, and the data is partitioned across the cluster so it can be processed in parallel. This is one of the key features of Apache Spark.

What if the data is larger than the available RAM? In this case, Spark writes the spill-over data to disk, so only a limited number of reads and writes are performed. This is how Apache Spark can provide near real-time analytics up to 100x faster than Hadoop MapReduce.

2. Multi-language support

Apache Spark applications can be written in Java, Scala, Python, and R, so developers can choose their preferred language to develop Spark applications.

One thing to note is that Apache Spark itself is written in Scala and runs on the Java Virtual Machine.

3. Shell support

We can execute our programs in the Spark interactive shell, but this is limited to Scala and Python.

4. Easy to use

Apache Spark provides developer-friendly APIs for building Spark applications. MapReduce involves writing driver, mapper, and reducer classes, whereas Spark's APIs are so simple that very few lines of code are needed.

5. Lazy Evaluation

Unlike MapReduce, Spark does not load data into memory as soon as the application starts; it loads data only when it is needed. This means we can define any number of transformations on our input data, and Spark will load and process the data only when we invoke an action on it. The chain of transformations is maintained in a DAG (directed acyclic graph).

Will Apache Spark replace Hadoop?

The answer is – it depends. If you already have a Hadoop cluster running, you can run Apache Spark applications on top of it, with YARN as the cluster manager. MapReduce programs can of course be replaced with Spark programs, but there is no need to remove the entire Hadoop cluster: Apache Spark works well with Hadoop, YARN, and HDFS.