Apache Spark features explained in detail


Apache Spark features

In this tutorial, you are going to learn about Apache Spark features in detail.

If you want to learn more about Apache Spark, please visit here!

Lightning fast

We all want our big chunks of data to be processed very fast. To visualize the data, we need a framework or mechanism that processes it quickly. We have the Hadoop MapReduce framework, which processes the data in parallel and provides us the result. But is it really fast? The answer is no. The reason is that MapReduce makes multiple reads and writes to disk to store the intermediate results of the mapper and reducer, which slows down processing. The solution is Apache Spark. Yes!!! Apache Spark can be up to 100x faster than Hadoop MapReduce. This is one of the key features of Apache Spark.

But how? Spark processes the data in-memory, and only minimal I/O operations are performed when data spills over. Intermediate data is stored in memory, and the data is partitioned across the cluster so it can be processed in parallel.
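As a minimal sketch of this idea in Scala (the input file name is hypothetical), we can explicitly ask Spark to keep an intermediate dataset in memory so repeated actions reuse it instead of re-reading from disk:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder()
  .appName("InMemoryDemo")
  .master("local[*]")
  .getOrCreate()

// Hypothetical input file; any large text file will do
val words = spark.sparkContext.textFile("input.txt")
  .flatMap(_.split(" "))

// Keep the intermediate result in memory, spilling to disk
// only when it does not fit
words.persist(StorageLevel.MEMORY_AND_DISK)

// Both actions reuse the in-memory data instead of re-reading the file
println(words.count())
println(words.distinct().count())
```

The MEMORY_AND_DISK storage level mirrors the behaviour described above: data stays in memory and is written to disk only when it spills over.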

Real-time processing

Spark Streaming is an important component of Apache Spark that lets us process data in real time. The Hadoop MapReduce framework can run only on data at rest, i.e., data that is already stored in our filesystem; it cannot process real-time data. Apache Spark, on the other hand, can process data in motion, i.e., it supports real-time data processing.
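Here is a minimal Spark Streaming sketch in Scala (the host and port are hypothetical; you can feed it text with `nc -lk 9999`) that counts words over five-second batches of a live stream:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("StreamingWordCount")
  .setMaster("local[2]")

// Process the incoming stream in 5-second micro-batches
val ssc = new StreamingContext(conf, Seconds(5))

// Hypothetical source: a text stream on localhost:9999
val lines = ssc.socketTextStream("localhost", 9999)

val counts = lines
  .flatMap(_.split(" "))
  .map((_, 1))
  .reduceByKey(_ + _)

counts.print()

ssc.start()
ssc.awaitTermination()
```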

Fault Tolerant

Apache Spark can handle failures intelligently with no data loss. Spark's execution is completely fault tolerant, and this is achieved through the Resilient Distributed Dataset (RDD): each RDD remembers the lineage of operations that produced it, so lost partitions can simply be recomputed. We will discuss more on RDDs in upcoming posts. Stay tuned!
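As a small Scala sketch (hypothetical input file), you can print the lineage Spark keeps for exactly this purpose:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("LineageDemo")
  .master("local[*]")
  .getOrCreate()

// Hypothetical input file
val counts = spark.sparkContext.textFile("input.txt")
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

// Each RDD remembers how it was derived; if a partition is lost,
// Spark replays this lineage to recompute it instead of failing the job
println(counts.toDebugString)
```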

Deployments

Apache Spark can be deployed in our Hadoop cluster. YARN can be used as the cluster manager to schedule and monitor the jobs, and Apache Spark can use existing Hadoop data sources such as HDFS and HBase. This is a major advantage! So if you have a Hadoop cluster up and running, you can easily install Apache Spark and integrate it swiftly. Apache Spark also ships with its own standalone cluster manager, and Apache Mesos can be used for deployment as well.
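As a hedged sketch (the NameNode address and path are hypothetical), reading data that already lives in HDFS is just a matter of pointing Spark at the HDFS URI:

```scala
import org.apache.spark.sql.SparkSession

// In a real deployment the master is usually supplied by spark-submit,
// e.g. --master yarn; local[*] is used here only for illustration
val spark = SparkSession.builder()
  .appName("HdfsDemo")
  .master("local[*]")
  .getOrCreate()

// Hypothetical HDFS URI; adjust to your NameNode host, port, and path
val logs = spark.read.textFile("hdfs://namenode:8020/data/logs")
println(logs.count())
```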

Lazy Loading

Unlike MapReduce, Spark doesn't load data into memory as soon as the application is executed. It only loads the data when it is needed. This means we can chain any number of transformations onto our input data, and Spark will only load and process the data when we invoke an action on it. This evaluation plan is maintained as a Directed Acyclic Graph (DAG).
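A short Scala sketch of this lazy evaluation: the transformations below are only recorded in the DAG, and nothing runs until the final action.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("LazyDemo")
  .master("local[*]")
  .getOrCreate()

val numbers = spark.sparkContext.parallelize(1 to 1000000)

// Transformations: nothing executes yet; Spark only records them in the DAG
val evens = numbers.filter(_ % 2 == 0)
val squares = evens.map(n => n.toLong * n)

// Action: this is the point where Spark actually loads and processes the data
val total = squares.reduce(_ + _)
println(total)
```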

Multi-language support

Apache Spark applications can be written in Java, Scala, Python, and R, so developers can choose their preferred language to develop Spark apps.

Advanced analytics

Apache Spark is not limited to MapReduce-style batch processing. We can also run Spark SQL queries on top of our data, build machine learning algorithms with the Spark MLlib library, and run graph-based algorithms using Spark GraphX.
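A minimal Spark SQL sketch in Scala (the sample data is made up for illustration):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("SqlDemo")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Hypothetical sample data
val sales = Seq(("books", 120.0), ("games", 80.0), ("books", 45.5))
  .toDF("category", "amount")

// Register the DataFrame as a temporary view and query it with plain SQL
sales.createOrReplaceTempView("sales")
spark.sql(
  "SELECT category, SUM(amount) AS total FROM sales GROUP BY category"
).show()
```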

These are the key features of Apache Spark explained in detail.


About the Author

Rajasekar

Hey there, my name is Rajasekar and I am the author of this site. I hope you like my tutorials and references. Programming and learning new technologies are my passion. The ultimate idea of this site is to share my knowledge (I am still a learner :)) and help you out! Please spread the word about us (staticreference.com) and give a thumbs up :) Feel free to contact me for any queries!