Apache Spark is an open source cluster computing framework for running large-scale data analytics applications across clustered computers. It can handle both batch and real-time analytics and data processing workloads.
Spark uses in-memory cluster computing to increase the processing speed of an application. Spark originated in the Hadoop ecosystem and extends the MapReduce model to support more kinds of computation, including interactive querying. Spark's goal was to generalize MapReduce so that new kinds of applications could run within the same engine.
Spark provides multiple advantages. It allows an application to run on a Hadoop cluster much faster than Hadoop MapReduce, largely because it reduces the number of read and write operations to disk. It supports various programming languages: it has built-in APIs in Java, Python, and Scala, so programmers can write applications in the language they prefer. Furthermore, it provides support for streaming data, graph processing, and machine learning algorithms to perform advanced data analytics.
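To make the MapReduce model that Spark generalizes concrete, here is a toy word count in map/reduce style. This is a conceptual sketch in plain Python, not Spark's actual API (in Spark this would typically be `flatMap` followed by `reduceByKey`):

```python
from functools import reduce

# Toy word count in the map/reduce style that Spark generalizes.
# Plain Python only -- a conceptual sketch, not Spark's actual API.
lines = ["spark is fast", "spark is general", "hadoop is batch"]

# "Map" phase: each line becomes a list of (word, 1) pairs.
mapped = [(word, 1) for line in lines for word in line.split()]

# "Reduce" phase: sum the counts per word.
def combine(acc, pair):
    word, count = pair
    acc[word] = acc.get(word, 0) + count
    return acc

counts = reduce(combine, mapped, {})
print(counts["spark"])  # 2
print(counts["is"])     # 3
```

Spark's contribution is running exactly this kind of pipeline in parallel across a cluster, while keeping intermediate results in memory instead of writing them to disk between phases.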
a. Swift Processing
Apache Spark offers high data processing speed: roughly 100x faster than Hadoop MapReduce in memory and about 10x faster on disk. This is possible chiefly because Spark reduces the number of read-write operations to disk.
b. Dynamic in Nature
It is easy to develop parallel applications in Spark, since it provides around 80 high-level operators.
c. In-Memory Computation in Spark
The increase in processing speed is possible due to in-memory processing: intermediate results are kept in memory instead of being written back to disk between steps.
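The benefit of keeping a computed dataset in memory can be sketched in plain Python. This is a toy illustration, not Spark's API; the real Spark equivalent is calling `cache()` or `persist()` on an RDD or DataFrame so that multiple actions reuse the materialized result:

```python
# Toy illustration of in-memory caching vs. recomputing each time.
# Plain Python, not Spark's API (Spark's equivalent: rdd.cache()).
compute_calls = 0

def expensive_transform(data):
    global compute_calls
    compute_calls += 1          # count how often the work is redone
    return [x * x for x in data]

data = list(range(5))

# Without caching: every downstream use recomputes the transform.
total = sum(expensive_transform(data))
maximum = max(expensive_transform(data))
assert compute_calls == 2

# With in-memory caching: compute once, reuse for both actions.
cached = expensive_transform(data)      # materialize once, like cache()
total2, maximum2 = sum(cached), max(cached)
assert compute_calls == 3               # only one extra computation
```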
d. Reusability
We can easily reuse Spark code for batch processing, join streams against historical data, or run ad-hoc queries on stream state.
e. Spark Fault Tolerance
Spark offers fault tolerance through its core abstraction, the Resilient Distributed Dataset (RDD). RDDs are designed to handle the failure of any worker node in the cluster, so the loss of data is reduced to zero.
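The mechanism behind this is lineage: an RDD records how it was derived from its parent, so a lost partition can be recomputed rather than restored from a replica. The following toy class is a plain-Python sketch of that idea (the `ToyRDD` name and structure are illustrative inventions, not Spark code):

```python
# Toy sketch of RDD-style fault tolerance via lineage, in plain Python.
# A real Spark RDD records how each partition was derived, so a lost
# partition can be recomputed from its parent; this mimics that idea.
class ToyRDD:
    def __init__(self, source, parent=None, fn=None):
        self.source = source     # base data (None for derived RDDs)
        self.parent = parent     # lineage: which RDD this came from
        self.fn = fn             # lineage: how it was derived
        self.partitions = None   # materialized data; may be "lost"

    def map(self, fn):
        return ToyRDD(None, parent=self, fn=fn)

    def compute(self):
        if self.source is not None:
            return list(self.source)
        # Recompute from lineage instead of relying on stored data.
        return [self.fn(x) for x in self.parent.compute()]

base = ToyRDD([1, 2, 3])
doubled = base.map(lambda x: x * 2)
doubled.partitions = doubled.compute()  # materialize the result
doubled.partitions = None               # simulate a worker-node failure
recovered = doubled.compute()           # rebuilt from lineage; no data lost
print(recovered)  # [2, 4, 6]
```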
f. Real-Time Stream Processing
We can do real-time stream processing in Spark. Hadoop MapReduce does not support real-time processing; it can only process data that is already present. Spark Streaming solves this problem.
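Spark Streaming achieves this by micro-batching: it groups the incoming stream into small batches and runs each batch through the normal batch engine. A minimal plain-Python sketch of that loop, under the assumption of a simple fixed batch size (not Spark's time-based intervals or its API):

```python
# Toy micro-batching loop in plain Python. Spark Streaming works
# similarly: it slices the incoming stream into small batches and
# processes each batch with the batch engine. Not Spark's actual API.
def micro_batches(stream, batch_size):
    batch = []
    for event in stream:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:                # flush the final partial batch
        yield batch

events = range(7)
results = [sum(b) for b in micro_batches(events, 3)]
print(results)  # [3, 12, 6]
```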
g. Lazy Evaluation in Spark
All the transformations we apply to a Spark RDD are lazy in nature: they do not produce a result right away, but instead record a new RDD derived from the existing one. Computation happens only when an action is triggered, which increases the efficiency of the system.
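Lazy evaluation can be sketched with Python generators: like Spark transformations, each step only records what to do, and nothing runs until an "action" pulls results through the chain. Plain Python, not Spark's API:

```python
# Lazy evaluation sketched with Python generators. Building the
# pipeline records the work; nothing executes until an "action"
# (here, sum()) consumes the results. Plain Python, not Spark.
evaluated = []

def numbers():
    for n in range(5):
        evaluated.append(n)   # track which elements were computed
        yield n

# Building the pipeline is free: no element has been touched yet.
pipeline = (n * n for n in numbers())
assert evaluated == []

# The "action" triggers evaluation of the whole chain.
result = sum(pipeline)
print(result)     # 30
print(evaluated)  # [0, 1, 2, 3, 4]
```

In Spark, the same split holds between transformations such as `map` or `filter` (lazy) and actions such as `count` or `collect` (which trigger execution), letting the engine optimize the whole chain before running it.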
Why Do We Need Spark
Spark uses micro-batching for real-time streaming. Apache Spark is an open source, general-purpose distributed computing engine used for processing and analyzing large amounts of data. Like Hadoop MapReduce, it distributes data across the cluster and processes that data in parallel.