Hi! Welcome to the Apache Spark tutorial series. This tutorial explains Spark RDD features in detail.
Before proceeding with this tutorial, I suggest you read my previous blogs on Apache Spark, where I have explained Spark's features and architecture in detail. Please visit this link for more information.
In the previous post, I discussed RDDs in detail. Please check that out!
RDD is Spark's representation of a dataset that is partitioned across the cluster, and Spark provides an API to act on RDDs.
The official definition is:
RDDs are fault-tolerant, parallel data structures that let users explicitly persist intermediate results in memory, control their partitioning to optimize data placement, and manipulate them using a rich set of operators.
So let us now see what features Spark RDD offers. There are several, and below is the list of them.
RDD (Resilient Distributed Dataset) is an immutable dataset, which means that once we have created an RDD from the source data, it is impossible to change it. When we apply any operation such as map(), filter(), or flatMap() to an RDD, it produces a new RDD, and the source RDD is left untouched. This is basically a functional programming concept: you apply functions (such as map, filter, or distinct) to RDDs to create new RDDs. A good reference for this concept is Scala's immutable collections; RDD immutability and Scala immutability are the same idea.
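The functional idea can be sketched in plain Python (an analogy only, not the Spark API): transformations build a new collection and leave the source untouched.

```python
# Plain-Python analogy for RDD immutability (not actual Spark API):
# map/filter-style operations build new collections; the source stays untouched.
source = [1, 2, 3, 4, 5]

doubled = [x * 2 for x in source]          # like a map() transformation
evens = [x for x in source if x % 2 == 0]  # like a filter() transformation

print(source)   # [1, 2, 3, 4, 5] -- unchanged
print(doubled)  # [2, 4, 6, 8, 10]
print(evens)    # [2, 4]
```

No matter how many transformations you chain, the original collection is never modified; each step derives a fresh one.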
Spark RDD is nothing but your dataset, partitioned across multiple machines and immutable. A partition here is a logical partition of the data sitting on a separate machine. RDD is the abstraction Spark provides to access and manipulate this partitioned dataset.
What is the use of partitioning? The major benefit is that we can run computations on each logical partition in parallel, which improves performance drastically. Any operation on an RDD executes in parallel across the partitions, so the time required to process the data is considerably lower.
The RDD partition is the fundamental unit of parallelism in Apache Spark.
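To make the benefit concrete, here is a plain-Python sketch (not Spark itself) of splitting a dataset into logical partitions and processing them in parallel; the partition count of 4 is an arbitrary choice for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

# Sketch of partition-parallel processing (an analogy, not the Spark API).
data = list(range(1, 101))
num_partitions = 4

# Split the dataset into logical partitions, as Spark does across machines.
partitions = [data[i::num_partitions] for i in range(num_partitions)]

def process_partition(part):
    # The same operation runs independently on every partition.
    return sum(x * x for x in part)

with ThreadPoolExecutor(max_workers=num_partitions) as pool:
    partial_results = list(pool.map(process_partition, partitions))

total = sum(partial_results)
print(total)  # 338350, the sum of squares of 1..100
```

Each partition is processed independently, and only the small per-partition results are combined at the end; Spark applies the same pattern across machines rather than threads.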
Any Spark application/program contains a set of transformations (such as map, filter, flatMap, distinct, etc.), and since RDDs are immutable, each transformation results in a new RDD. Spark internally creates a lineage graph by recording each transformation step; conceptually, a chain such as textFile → map → filter forms a simple lineage graph.
So if any RDD is lost or a transformation fails, Spark can recompute the RDD based on the lineage graph. Notice that the RDD is not stored anywhere for error recovery; it is recomputed from the recorded transformations. This is one of the most important features of RDDs in Apache Spark, because it means less data storage and fast recovery. RDDs are fault-tolerant, so we do not need to worry about losing one: Spark is capable of recomputing the RDD on demand.
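The idea of recomputing from lineage can be sketched with a toy class in plain Python. This mimics the concept only and is not how Spark is implemented: each derived dataset remembers its parent and the function applied, so its contents can be rebuilt from the source at any time instead of being stored.

```python
# Toy sketch of lineage-based recomputation (not Spark's implementation).
class ToyRDD:
    def __init__(self, parent=None, fn=None, source=None):
        self.parent = parent    # the RDD this one was derived from
        self.fn = fn            # the transformation that derives it
        self.source = source    # raw data, only for the root RDD

    def map(self, f):
        return ToyRDD(parent=self, fn=lambda data: [f(x) for x in data])

    def filter(self, pred):
        return ToyRDD(parent=self, fn=lambda data: [x for x in data if pred(x)])

    def compute(self):
        # Walk the lineage back to the source, then replay each transformation.
        if self.parent is None:
            return list(self.source)
        return self.fn(self.parent.compute())

base = ToyRDD(source=[1, 2, 3, 4])
result = base.map(lambda x: x * 10).filter(lambda x: x > 15)
print(result.compute())  # [20, 30, 40]
```

Nothing along the chain stores its data; `compute()` can rebuild any RDD in the chain from the source, which is exactly the recovery property the lineage graph gives Spark.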
RDDs are lazily evaluated. This means that when a transformation such as map(), filter(), or flatMap() is applied to an RDD, nothing is actually computed. The computation happens only when we invoke an action on the resulting RDD.
action operations – collect, take, foreach
transformation operations – map, filter, flatMap
The concept is similar to Lazy keyword in Scala. Check out my post on the Scala Lazy keyword.
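Python generators give a handy analogy (again, not the Spark API): `map` below builds a pipeline without doing any work, and only consuming the result, the "action", triggers the computation.

```python
# Lazy-evaluation analogy with a plain-Python generator (not the Spark API).
evaluated = []

def trace(x):
    evaluated.append(x)   # record when an element is actually processed
    return x * 2

pipeline = map(trace, [1, 2, 3])  # a "transformation": nothing runs yet
print(evaluated)                  # [] -- no work has been done so far

result = list(pipeline)           # the "action" finally triggers computation
print(evaluated)                  # [1, 2, 3]
print(result)                     # [2, 4, 6]
```

Spark behaves the same way: building the transformation chain is cheap, and the cluster does real work only when an action such as collect() or count() runs.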
Spark RDDs are stored in memory instead of on disk (provided the data fits in memory; otherwise it spills to disk). Computation happens in memory, which improves the performance of a Spark application/program. This was the major advantage over MapReduce: MapReduce uses a lot of I/O operations to store intermediate results on disk, whereas Spark keeps them in RAM, avoiding the I/O and thus increasing performance.
We can also persist an RDD in RAM or on disk to reuse it across our transformation pipelines.
“One of the most important capabilities in Spark is persisting (or caching) a dataset in memory across operations. This allows future actions to be much faster (often by more than 10x). Caching is a key tool for iterative algorithms and fast interactive use.” – Spark documentation.
Spark supports several storage levels for caching an RDD, including MEMORY_ONLY (the default), MEMORY_AND_DISK, MEMORY_ONLY_SER, MEMORY_AND_DISK_SER, DISK_ONLY, and replicated variants such as MEMORY_ONLY_2 and MEMORY_AND_DISK_2.
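The benefit of persisting can be sketched in plain Python as simple memoization (an analogy, not Spark's cache implementation): without caching, every action repeats the expensive transformation; with caching, it runs once.

```python
# Sketch of why caching helps across multiple actions (not the Spark API).
compute_count = 0

def expensive_transform(data):
    global compute_count
    compute_count += 1            # count how often the real work happens
    return [x * x for x in data]

data = [1, 2, 3]

# "Uncached": every action recomputes the transformation from scratch.
a1 = sum(expensive_transform(data))   # action 1
a2 = max(expensive_transform(data))   # action 2
print(compute_count)                  # 2 -- recomputed per action

# "Cached": persist the intermediate result once and reuse it.
cached = expensive_transform(data)    # computed once, kept in memory
b1, b2 = sum(cached), max(cached)     # two actions, no recomputation
print(compute_count)                  # 3 -- only one more computation
```

This is exactly the pattern the documentation quote describes: iterative and interactive workloads run the same intermediate dataset through many actions, so persisting it pays off quickly.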
Spark allows only coarse-grained operations, meaning an operation is applied to the entire dataset rather than to individual elements.
map(), filter(), and flatMap() are a few examples of coarse-grained operations: each is applied to the entire dataset partitioned across multiple machines. We cannot apply an operation to one particular element or partition of the data in Spark; any operation acts upon the entire dataset. If a partition fails or is lost, it can be recomputed from the lineage graph.
The inverse of coarse-grained is fine-grained, for example a database where we can update only certain rows or columns.
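The contrast can be sketched in plain Python (illustrative only): a coarse-grained operation transforms the whole dataset at once, while a fine-grained update touches one record in place, which is the kind of mutation RDDs deliberately disallow.

```python
records = [("a", 1), ("b", 2), ("c", 3)]

# Coarse-grained (RDD-style): one function applied to every element,
# producing an entirely new dataset. The source is untouched.
incremented = [(k, v + 1) for (k, v) in records]
print(incremented)  # [('a', 2), ('b', 3), ('c', 4)]

# Fine-grained (database-style): mutate a single row in place --
# something the RDD model does not permit.
table = dict(records)
table["b"] = 20
print(table)  # {'a': 1, 'b': 20, 'c': 3}
```

Restricting RDDs to coarse-grained operations is what makes the lineage graph practical: recording "apply f to the whole dataset" is cheap, whereas logging every fine-grained row update would not be.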
I hope you now understand the features of RDD (Resilient Distributed Dataset) in detail. We have more tutorials to come covering the in-depth features of Apache Spark. Please follow the complete Apache Spark guide!
In the upcoming posts, I will provide a detailed explanation of accumulators and broadcast variables.
References: RDD documentation