Spark RDD – Resilient distributed dataset – Internals


Hey Buddy! Welcome to the Apache Spark tutorial series. This tutorial will help you understand Spark RDD (Resilient distributed dataset) and its internals.


But wait!! Before proceeding with this tutorial, I suggest you read my previous blogs on Apache Spark, where I have explained Apache Spark's features and architecture in detail. Please visit this link for more information.

If you have basic knowledge of the Apache Spark architecture, then you are good to go! I will try to provide as much information as possible on the Resilient distributed dataset, and I believe that after reading this tutorial you will have an in-depth understanding of Spark RDD and its internals.

So let us start!


RDD (Resilient distributed dataset)

RDD (Resilient distributed dataset) is the core of Apache Spark and the fundamental abstraction it provides.

In any application or program, we work on data of a specific format. It can be an object, JSON, a list, a map, or any other collection. But in Apache Spark, we work on RDDs. Your input is converted to an RDD internally by Apache Spark.

When I say converted, I mean that your data stays in the same format (i.e., list, map, or JSON), but Spark provides an abstraction on top of the data.
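For instance, here is a minimal sketch in Scala of turning an ordinary collection into an RDD with SparkContext.parallelize (the app name and master URL are placeholders for illustration, not from this tutorial):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Placeholder app name and local master, just for illustration.
val conf = new SparkConf().setAppName("rdd-intro").setMaster("local[*]")
val sc = new SparkContext(conf)

// An ordinary Scala list...
val lines = List("first line", "second line", "third line")

// ...wrapped in Spark's RDD abstraction. The data itself keeps its shape;
// Spark adds partitioning and lineage information on top of it.
val linesRdd = sc.parallelize(lines)
```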

An RDD (Resilient distributed dataset) is data partitioned across the nodes of your cluster.

(Just a rough example!!) Say you have a cluster of 3 nodes and an input file that has 100 lines of data. When you read this data in your Spark application, the input file is split across the 3 nodes.

That is, 33 lines on the first node, 33 lines on the second node, and 34 lines on the third node. Collectively, these partitions form an RDD. So an RDD is your collection of data partitioned across the nodes of the cluster.

The data/dataset is logically partitioned and operated on in parallel.

Why does the data have to be partitioned?

The answer: so that you can operate on it in parallel. Any operation on an RDD executes in parallel on each partition, so the time required to process the data is considerably lower.
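As a rough sketch (reusing the sc from the snippet above; actual partition sizes depend on the input source and configuration), you can reproduce the 100-lines-into-3-partitions example and inspect the split:

```scala
// 100 elements split into 3 partitions, mirroring the 3-node example above.
val numbers = sc.parallelize(1 to 100, numSlices = 3)

println(numbers.getNumPartitions)  // 3

// glom() turns each partition into an array, so we can see how the
// elements were distributed.
println(numbers.glom().map(_.length).collect().mkString(", "))  // e.g. 33, 33, 34

// A transformation such as map executes on each partition in parallel.
val doubled = numbers.map(_ * 2)
```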

According to the official Apache Spark documentation,

The main abstraction Spark provides is a resilient distributed dataset (RDD), which is a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel.

An RDD (Resilient distributed dataset) is an immutable dataset, which means that once we have created an RDD from the source data, it is impossible to change it. Any operation on an RDD computes a new RDD, so the source RDD remains unchanged. This is basically a functional programming concept: you apply functions (such as map, filter, or distinct) on an RDD to create a new RDD. A good reference for this concept is Scala's immutable collections; RDD immutability and Scala immutability are the same idea.
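A small sketch of this immutability in action (again assuming the sc created earlier):

```scala
val source = sc.parallelize(Seq(1, 2, 3, 4, 5))

// Each transformation returns a brand-new RDD; `source` is never modified.
val evens   = source.filter(_ % 2 == 0)
val squares = evens.map(n => n * n)

println(source.collect().mkString(", "))   // 1, 2, 3, 4, 5  (unchanged)
println(squares.collect().mkString(", "))  // 4, 16
```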

An RDD (Resilient distributed dataset) is fault-tolerant. Any Spark application/program contains a set of transformations (such as map, filter, flatMap, distinct, etc.), and, as we mentioned earlier, each transformation results in a new RDD because RDDs are immutable. Spark internally creates a lineage graph by recording each transformation step, as shown below. This is a conceptual example of the lineage graph generated by Spark.

[Figure: RDD lineage graph]

So if there is any failure in forming an RDD, an RDD is lost, or a transformation fails, Spark can recompute the RDD based on the lineage graph. Notice that the RDD is not stored anywhere for recovery; it is simply recomputed. This is one of the most important features of RDDs in Apache Spark, as it means less data storage and faster computation. RDDs are fault-tolerant, so we do not need to worry about losing them.

Spark is capable of recomputing any RDD on demand.
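You can actually inspect this lineage yourself via RDD.toDebugString, which prints the recorded graph of transformations (the exact output varies by Spark version):

```scala
val counts = sc.parallelize(Seq("spark rdd", "spark core"))
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

// Prints the lineage Spark recorded, roughly of the form:
// (3) ShuffledRDD[4] at reduceByKey ...
//  +-(3) MapPartitionsRDD[3] at map ...
//     |  MapPartitionsRDD[2] at flatMap ...
//     |  ParallelCollectionRDD[1] at parallelize ...
println(counts.toDebugString)
```

If a partition is lost, Spark replays just the transformations recorded in this graph to rebuild it.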

So, putting all these important pieces together, we have a Resilient distributed dataset:

  • Resilient – Fault-tolerant: errors and losses are recovered by recomputing from the lineage graph.
  • Distributed – Data is partitioned across the nodes of the cluster. The partitions are logical, and they are operated on in parallel for faster computation.
  • Dataset – The underlying data, which can be a text file, JSON, a list, or any other collection.

 

I hope you now understand the RDD (Resilient distributed dataset) concept in detail. More tutorials are coming that cover the in-depth features of Apache Spark. Please follow the complete Apache Spark guide!

In the upcoming posts, I will provide a detailed explanation of accumulators and broadcast variables.

Stay tuned!

References: RDD documentation


