Understand SparkSession in detail


Hey there! Welcome to the Apache Spark tutorial series. This tutorial will help you understand SparkSession in detail.


But wait!! Before proceeding with this tutorial/blog, I suggest you read my previous blogs on Apache Spark, where I have explained Apache Spark's features and architecture in detail. Please visit this link for more information.

If you have basic knowledge of the Apache Spark architecture, then you are good to go! I will try to provide as much information as possible on SparkSession, and I believe that after reading this tutorial you will have an in-depth understanding of SparkSession.

So let us start!


In the previous post, you learned about SparkContext in detail. If you haven't, please read this link to understand the in-depth details of SparkContext.

This post is all about SparkSession, so let us get started!

I know you probably have this question:

We already have SparkContext, so why do we need SparkSession?

Well, that is a good question, I would say! So here is my answer.

SparkContext is the entry point of your Spark application. SparkContext is hosted by the driver, and it is the heart of your Spark application. SparkContext's main responsibility is to act as a gateway between your driver and the cluster. It negotiates with your cluster manager (YARN, Apache Mesos) and allocates the required resources to run the Spark tasks. SparkContext is also responsible for creating RDDs and maintaining accumulators and broadcast variables.
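To make this concrete, here is a minimal sketch of creating a SparkContext directly and using it for an RDD, an accumulator, and a broadcast variable. The app name, master value, and data are placeholders for illustration only.

import org.apache.spark.{SparkConf, SparkContext}

// Create a SparkContext with an illustrative configuration
val conf = new SparkConf()
  .setAppName("SparkContextExample")
  .setMaster("local[*]")
val sc = new SparkContext(conf)

// SparkContext creates RDDs, accumulators, and broadcast variables
val rdd = sc.parallelize(Seq(1, 2, 3, 4, 5))
val sum = sc.longAccumulator("sum")       // accumulator maintained by SparkContext
val lookup = sc.broadcast(Map("a" -> 1))  // broadcast variable shipped to executors

rdd.foreach(x => sum.add(x))
println(sum.value) // 15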

But did you know that prior to Spark 2.0, Spark provided various context objects to handle different functionality?

  • SparkContext – To handle your application and work with RDDs
  • SQLContext – To handle and execute your Spark SQL queries
  • HiveContext – To handle Hive-related functionality
  • StreamingContext – To handle streaming operations.

So you had to create and use each of these context objects separately according to your needs, as in the sketch below.
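Here is a hedged sketch of how these separate contexts were typically wired up before Spark 2.0. The class names are real, but the app name and batch interval are placeholders, and HiveContext requires the spark-hive module on the classpath.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("PreSpark2App").setMaster("local[*]")
val sc = new SparkContext(conf)

val sqlContext = new SQLContext(sc)             // Spark SQL queries
val hiveContext = new HiveContext(sc)           // Hive functionality (needs spark-hive)
val ssc = new StreamingContext(sc, Seconds(10)) // DStream-based streaming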

Let us now understand SparkSession in detail.

SparkSession

From Spark 2.x onwards, SparkSession is the new entry point for your Spark application. To create any Spark application, a SparkSession object has to be created with the required configuration (master details, app name, and other config).

Have SparkContext, SQLContext, HiveContext, and StreamingContext been removed?

The answer is a big NO. SparkContext and the other context objects are not removed; instead, they are now part of SparkSession. All context objects are initialized by default via SparkSession.

SparkSession is the unified context API.

Prior to Spark 2.x, i.e. in Spark 1.x, we had to rely on and manipulate RDDs directly. After the release of Spark 2.0, we now have DataFrames and Datasets on which we can do manipulations and run queries. SparkSession was introduced to cater to the needs of DataFrames and Datasets.

Don’t get confused about RDDs, DataFrames, and Datasets. We will discuss them in individual blogs in a detailed manner. For now, think of them as collections of your data source objects.

SparkSession – The entry point to programming Spark with the Dataset and DataFrame API (Spark 2.x).

SparkContext – The entry point to programming Spark with the RDD API (Spark 1.x).

SQLContext – The entry point for working with structured data (rows and columns) in Spark 1.x.

Create SparkSession

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
              .appName("StructuredStreamingTest")
              .master("local[*]")
              .getOrCreate()

Access SparkContext and SQLContext from SparkSession

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder() 
                        .appName("StructuredStreamingTest") 
                        .master("local[*]") 
                        .getOrCreate()

val sc = spark.sparkContext

val sqlContext = spark.sqlContext

Access HiveContext from SparkSession

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
                        .appName("StructuredStreamingTest")
                        .master("local[*]")
                        .enableHiveSupport() //HiveContext
                        .getOrCreate()
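
Once Hive support is enabled, SQL queries issued through the session run against the Hive metastore. A minimal hedged usage sketch follows, assuming a Hive table named employees already exists (the table and column names are hypothetical).

// Query a hypothetical Hive table via the Hive-enabled SparkSession
val employees = spark.sql("SELECT name, salary FROM employees")
employees.show()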

From Spark 2.x onwards, we should use SparkSession to create our Spark applications.

Operations supported by SparkSession

Below are the important operations in SparkSession; a short sketch illustrating most of them follows the list.

  1. Create a DataFrame (rows and columns – untyped data)
  2. Create a Dataset (typed)
  3. Read non-streaming data via the read() operation
  4. Read streaming data via the readStream() operation
  5. Access SparkContext
  6. Stop SparkContext
  7. Access SQLContext
  8. Execute SQL queries

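Here is a short hedged sketch touching most of these operations in one place. The app name, case class, file paths, and data are placeholders for illustration only.

import org.apache.spark.sql.SparkSession

// Placeholder case class for the typed Dataset example
case class Person(name: String, age: Int)

val spark = SparkSession.builder()
              .appName("SparkSessionOperations") // illustrative app name
              .master("local[*]")
              .getOrCreate()

import spark.implicits._ // enables toDF/toDS conversions

// 1 & 2. Create a DataFrame (untyped) and a Dataset (typed)
val df = Seq(("Alice", 30), ("Bob", 25)).toDF("name", "age")
val ds = Seq(Person("Alice", 30), Person("Bob", 25)).toDS()

// 3 & 4. Read non-streaming and streaming data (paths are hypothetical)
// val people = spark.read.json("/path/to/people.json")
// val lines  = spark.readStream.text("/path/to/streaming/dir")

// 5 & 7. Access SparkContext and SQLContext
val sc = spark.sparkContext
val sqlContext = spark.sqlContext

// 8. Execute SQL queries
df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 26").show()

// 6. Stop the underlying SparkContext
spark.stop()
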
I hope you now understand SparkSession in detail. We have more tutorials to come covering the in-depth features of Apache Spark. Please follow Apache Spark’s complete guide!

In the upcoming posts, I will provide a detailed explanation of accumulators and broadcast variables.

Stay tuned!

References: SparkSession documentation

 

