Hey there! Welcome to the Apache Spark tutorial series. This tutorial covers SparkSession in detail.
But wait! Before proceeding with this tutorial, I suggest you read my previous blogs on Apache Spark, where I explain Apache Spark's features and architecture in detail. Please visit this link for more information.
If you have basic knowledge of Apache Spark architecture, then you are good to go! I will try to provide as much information as possible on SparkSession, and I believe that after reading this tutorial you will have an in-depth understanding of it.
So let us start!
In the previous post, you learned about SparkContext in detail. If you haven't read it yet, please read this link to understand SparkContext in depth.
This post is all about SparkSession.
You are probably asking yourself this question:
We already have SparkContext, so why do we need SparkSession?
Well, that is a good question, I would say! Here is my answer.
SparkContext is the main entry point of your Spark application. It is hosted by the driver and is the heart of the application. Its main responsibility is to act as a gateway between your driver and the cluster: it negotiates with the cluster manager (for example YARN or Apache Mesos) to acquire the resources needed to run Spark tasks. SparkContext is also responsible for creating RDDs, accumulators, and broadcast variables.
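To make those responsibilities concrete, here is a minimal sketch that uses a SparkContext to create an RDD, a broadcast variable, and an accumulator. The object name SparkContextDemo, the method run, and the local[*] master are assumptions made for illustration; they do not come from the Spark API or from the earlier posts.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical demo object; only the SparkContext calls are real Spark API.
object SparkContextDemo {
  // Returns the inputs scaled by the broadcast factor and the number of even inputs.
  def run(sc: SparkContext): (Seq[Int], Long) = {
    val numbers = sc.parallelize(1 to 5)             // RDD built from a local range
    val multiplier = sc.broadcast(10)                // read-only value shipped to executors
    val evenCount = sc.longAccumulator("evenCount")  // counter aggregated back on the driver

    // Update the accumulator inside an action so each element is counted once.
    numbers.foreach(n => if (n % 2 == 0) evenCount.add(1L))

    val scaled = numbers.map(_ * multiplier.value).collect().toSeq
    (scaled, evenCount.value.longValue)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: run locally using all available cores.
    val conf = new SparkConf().setAppName("SparkContextDemo").setMaster("local[*]")
    val sc = SparkContext.getOrCreate(conf)
    try println(run(sc)) finally sc.stop()
  }
}
```

The accumulator is updated in foreach (an action) rather than in map, because updates made inside transformations can be applied more than once if a task is re-executed.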
But did you know that prior to Spark 2.0, Spark provided several separate context objects (such as SQLContext and HiveContext) to handle different functionality?
So you needed to use each of these context objects separately, according to your needs.
Let us now understand SparkSession in detail.
From Spark 2.0 onward, SparkSession is the new entry point for your Spark application. To create a Spark application, you build a SparkSession object with the required configuration (master URL, application name, and any other config options).
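As a small sketch of that configuration step, the builder below sets an application name, a master, and one extra config option, then reads the option back from the running session. The object name SessionConfigDemo, the local[*] master, and the choice of spark.sql.shuffle.partitions are assumptions picked only for illustration.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical demo object; the builder calls themselves are standard SparkSession API.
object SessionConfigDemo {
  def build(): SparkSession =
    SparkSession.builder()
      .appName("SessionConfigDemo")                  // name shown in the Spark UI
      .master("local[*]")                            // assumption: local run on all cores
      .config("spark.sql.shuffle.partitions", "4")   // any additional key/value option
      .getOrCreate()

  def main(args: Array[String]): Unit = {
    val spark = build()
    try {
      // Runtime configuration is readable through spark.conf.
      println(spark.conf.get("spark.sql.shuffle.partitions"))
    } finally spark.stop()
  }
}
```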
Does that mean SparkContext is gone? The answer is a big NO. SparkContext and the other context objects have not been removed; instead, they are now part of SparkSession, which initializes them for you.
SparkSession is the unified context API.
Prior to Spark 2.x, that is, in Spark 1.x, we mostly had to work with and manipulate RDDs directly (DataFrames arrived in Spark 1.3 and Datasets in Spark 1.6). Spark 2.0 unified DataFrames and Datasets into a single API on which we can do manipulations and run queries, and SparkSession was introduced as the entry point for it.
Don't get confused about RDDs, DataFrames, and Datasets. We will discuss each of them in detail in separate blogs. For now, think of them as collections of your data.
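For a first feel of what a Dataset and a DataFrame look like in code, here is a rough sketch that builds both from an in-memory list and queries the result with SQL. The Person case class, the sample rows, and the object name DataFrameDatasetDemo are all made up for this example.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

// Hypothetical record type used only for this illustration.
case class Person(name: String, age: Int)

object DataFrameDatasetDemo {
  // Returns the names of all people aged 18 or over.
  def adultNames(spark: SparkSession): Seq[String] = {
    import spark.implicits._ // brings the encoders and toDS/toDF helpers into scope

    // Dataset: a typed collection of Person objects.
    val people: Dataset[Person] = Seq(Person("Ana", 34), Person("Raj", 17)).toDS()

    // DataFrame: the same data viewed as untyped rows with named columns.
    val df: DataFrame = people.toDF()

    // Register the DataFrame as a temporary view and query it with SQL.
    df.createOrReplaceTempView("people")
    spark.sql("SELECT name FROM people WHERE age >= 18").as[String].collect().toSeq
  }

  def main(args: Array[String]): Unit = {
    // Assumption: local run on all cores.
    val spark = SparkSession.builder().appName("DataFrameDatasetDemo").master("local[*]").getOrCreate()
    try println(adultNames(spark)) finally spark.stop()
  }
}
```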
SparkSession – The entry point to programming Spark with the Dataset and DataFrame API (Spark 2.x).
SparkContext – The entry point to programming Spark with the RDD API (Spark 1.x).
SQLContext – The entry point for working with structured data (rows and columns) in Spark 1.x.
// Create a SparkSession (or reuse the one that already exists)
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("StructuredStreamingTest")
  .master("local[*]")
  .getOrCreate()
// Access SparkContext and SQLContext through the SparkSession
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("StructuredStreamingTest")
  .master("local[*]")
  .getOrCreate()

val sc = spark.sparkContext
val sqlContext = spark.sqlContext
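// Create a SparkSession with Hive support
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("StructuredStreamingTest")
  .master("local[*]")
  .enableHiveSupport() // takes the place of HiveContext; needs the Hive classes on the classpath
  .getOrCreate()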
From Spark 2.0 onward, we use SparkSession to create our Spark applications.
Below are the important operations in SparkSession.
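As a rough sketch, and not an exhaustive list, the snippet below exercises a few SparkSession operations you are likely to reach for often: range, sql, catalog, conf, and newSession. The object name SessionOperationsDemo, the Summary case class, and the view name ids are assumptions made for illustration.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical result holder for this illustration.
case class Summary(total: Long, evens: Long, sharesContext: Boolean, viewVisibleElsewhere: Boolean)

object SessionOperationsDemo {
  def run(spark: SparkSession): Summary = {
    // range: a Dataset with a single "id" column holding 1 to 5.
    val ids = spark.range(1, 6)

    // sql: query a temporary view with plain SQL.
    ids.createOrReplaceTempView("ids")
    val evens = spark.sql("SELECT id FROM ids WHERE id % 2 = 0").count()

    // conf: set a runtime option on this session.
    spark.conf.set("spark.sql.shuffle.partitions", "4")

    // newSession: shares the SparkContext but keeps its own temporary views.
    val other = spark.newSession()
    val sharesContext = other.sparkContext eq spark.sparkContext

    // catalog: check whether the view is known to the other session.
    val viewVisibleElsewhere = other.catalog.tableExists("ids")

    Summary(ids.count(), evens, sharesContext, viewVisibleElsewhere)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: local run on all cores.
    val spark = SparkSession.builder().appName("SessionOperationsDemo").master("local[*]").getOrCreate()
    try println(run(spark)) finally spark.stop() // stop: release resources when done
  }
}
```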
I hope you now understand SparkSession in detail. More tutorials are on the way to cover the in-depth features of Apache Spark. Please follow Apache Spark's complete guide!
In the upcoming posts, I will provide a detailed explanation of accumulators and broadcast variables.
References: SparkSession documentation