In this tutorial, you are going to learn about the features of Apache Spark in detail.
We all want our big chunks of data to be processed fast, so we need a framework or mechanism that can process the data very quickly. We do have the Hadoop MapReduce framework, which processes data in parallel and gives us a result. But is it really fast? The answer is no. The reason is that MapReduce makes multiple disk reads and writes to store the intermediate results of the mapper and reducer, and that slows down processing. The solution is Apache Spark. Yes! For in-memory workloads, Apache Spark can run up to 100x faster than Hadoop MapReduce. Speed is one of the key features of Apache Spark.
But how? Spark processes data in memory, and disk I/O is performed only when data spills over. Intermediate results are kept in memory, and the data is distributed across the cluster so it can be processed in parallel.
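The difference can be sketched in plain Python (this is an analogy to show the idea, not Spark code): a MapReduce-style pipeline writes each stage's output to disk before the next stage reads it back, while a Spark-style pipeline keeps the intermediate result in memory.

```python
import json
import os
import tempfile

def mapreduce_style(numbers):
    # Stage 1: square each number, then write the intermediate result to disk.
    stage1 = [n * n for n in numbers]
    path = os.path.join(tempfile.mkdtemp(), "intermediate.json")
    with open(path, "w") as f:
        json.dump(stage1, f)
    # Stage 2: read the intermediate result back from disk, then sum it.
    with open(path) as f:
        stage1 = json.load(f)
    return sum(stage1)

def spark_style(numbers):
    # Both stages chain in memory; nothing touches the disk.
    return sum(n * n for n in numbers)

print(mapreduce_style([1, 2, 3]))  # 14
print(spark_style([1, 2, 3]))      # 14
```

Both functions compute the same answer; the disk round-trip between stages is what MapReduce pays for and Spark avoids.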
Spark Streaming is an important component of Apache Spark that lets us process data in real time. The Hadoop MapReduce framework can run only on data at rest, i.e., data that is already stored in our filesystem; it cannot process real-time data. Apache Spark, on the other hand, can process data in motion, i.e., perform real-time data processing.
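Spark Streaming works by grouping the incoming stream into small batches (micro-batches) and processing each batch like a small job. Here is a minimal plain-Python sketch of that idea (an analogy, not actual Spark Streaming code):

```python
def micro_batches(stream, batch_size):
    """Group an incoming stream of records into fixed-size micro-batches."""
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final, possibly smaller, batch
        yield batch

# Each micro-batch is processed as soon as it fills up,
# instead of waiting for the entire stream to land on disk.
events = range(7)
results = [sum(b) for b in micro_batches(events, batch_size=3)]
print(results)  # [3, 12, 6]
```

This is why results arrive continuously: the pipeline never waits for the whole dataset to exist before it starts computing.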
Apache Spark can handle failures intelligently, with no data loss. Spark's execution is completely fault tolerant, and this is achieved through the Resilient Distributed Dataset, i.e., the RDD: each RDD remembers the lineage of transformations that produced it, so a lost partition can simply be recomputed. We will discuss RDDs in more detail in upcoming posts. Stay tuned!
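The lineage idea can be sketched in plain Python (an analogy, not Spark's actual implementation): each dataset records its parent and the function that derived it, so a lost result can be rebuilt by replaying the lineage instead of being restored from a replica.

```python
class LineageDataset:
    """A toy dataset that remembers how it was derived (its lineage)."""

    def __init__(self, data=None, parent=None, fn=None):
        self.data = data      # materialized data; may be lost
        self.parent = parent  # the dataset this one was derived from
        self.fn = fn          # the transformation that derives it

    def map(self, fn):
        # Record the transformation; do not compute anything yet.
        return LineageDataset(parent=self, fn=lambda d: [fn(x) for x in d])

    def compute(self):
        # If the data is lost (or was never materialized), rebuild it
        # by recursively recomputing from the parent's data.
        if self.data is None:
            self.data = self.fn(self.parent.compute())
        return self.data

base = LineageDataset(data=[1, 2, 3])
squared = base.map(lambda x: x * x)

print(squared.compute())  # [1, 4, 9]
squared.data = None       # simulate losing this partition
print(squared.compute())  # recomputed from lineage: [1, 4, 9]
```

The recovery cost is recomputation rather than replication, which is exactly the trade-off that makes RDD fault tolerance cheap in the common, failure-free case.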
Apache Spark can be deployed on our existing Hadoop cluster. YARN can be used as the cluster manager to schedule and monitor jobs, and Spark can read existing Hadoop data sources such as HDFS and HBase. This is a major advantage! So if you have a Hadoop cluster up and running, you can easily install Apache Spark and integrate it swiftly. Apache Spark also ships with its own standalone cluster manager, and Apache Mesos can be used for deployment as well.
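As an illustration, submitting a Spark application to a YARN cluster might look like the following command fragment (the jar name and main class here are placeholders, and a configured Hadoop/YARN environment is assumed):

```shell
# Submit a Spark application to an existing YARN cluster.
# --master yarn tells Spark to use YARN as the cluster manager;
# --deploy-mode cluster runs the driver inside the cluster.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class com.example.MySparkApp \
  --num-executors 4 \
  --executor-memory 2g \
  my-spark-app.jar
```

YARN then allocates containers for the executors, so Spark jobs share cluster resources with the rest of your Hadoop workloads.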
Unlike MapReduce, Spark doesn't load data into memory as soon as the application starts; it loads data only when it is needed. This means we can chain any number of transformations on our input data, and Spark will actually load and process the data only when we invoke an action on it. This lazy evaluation is tracked as a DAG (directed acyclic graph) of transformations.
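Lazy evaluation can be sketched with plain Python generators (an analogy for Spark's transformation/action model, not Spark code): the transformations only describe the work, and nothing runs until an action asks for a result.

```python
def transform_map(source, fn):
    # A "transformation": returns a lazy generator; no work happens yet.
    return (fn(x) for x in source)

def transform_filter(source, pred):
    return (x for x in source if pred(x))

data = range(1, 6)
# Build up a pipeline of transformations; nothing is computed here.
pipeline = transform_filter(transform_map(data, lambda x: x * 10),
                            lambda x: x > 20)

# The "action": only now is the whole pipeline actually evaluated.
result = list(pipeline)
print(result)  # [30, 40, 50]
```

Deferring work like this is what lets Spark look at the whole chain of transformations at once and optimize the plan before touching any data.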
Apache Spark applications can be written in Java, Scala, Python, and R, so developers can choose their preferred language to develop Spark apps.
Apache Spark is not limited to MapReduce-style batch processing. With Spark we can also write Spark SQL queries on top of our data, use machine learning algorithms from the Spark MLlib library, and run graph-based algorithms with Spark GraphX.
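As a small taste of Spark SQL, a PySpark session can register a DataFrame as a temporary view and query it with plain SQL (this sketch assumes a working Spark/PySpark installation; the data here is made up for illustration):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-example").getOrCreate()

df = spark.createDataFrame([("alice", 34), ("bob", 45)], ["name", "age"])
df.createOrReplaceTempView("people")

# A regular SQL query running on top of the distributed data.
spark.sql("SELECT name FROM people WHERE age > 40").show()
```

The same SparkSession gives access to MLlib and GraphX/GraphFrames, so one application can mix SQL, machine learning, and graph processing.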
Those are the key features of Apache Spark, explained in detail.