Before diving into Apache Spark, we should understand the shortcomings of Hadoop. We will take one example, understand how Hadoop actually works, and walk through the different phases of execution in Hadoop. Then we will look at the same example in Apache Spark and see how it improves execution performance.
Scenario:
There is a text file containing some lines of text. We need to process this file and count the occurrences of each word appearing in it.
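To make the scenario concrete, here is a hypothetical input file and the word counts we would expect from it (the file name and contents are purely illustrative):

    # sample.txt (hypothetical input):
    #   apache spark is fast
    #   apache hadoop is batch oriented
    #
    # Expected output as <word, frequency> pairs:
    expected = {"apache": 2, "spark": 1, "is": 2, "fast": 1,
                "hadoop": 1, "batch": 1, "oriented": 1}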
Shortcomings in Hadoop's MapReduce architecture
1. Slow Processing Speed
The data goes through the following phases:
Input Splits:
The input to a MapReduce job is divided into fixed-size pieces called input splits. An input split is a chunk of the input that is consumed by a single map task.
Mapping
This is the very first phase in the execution of a MapReduce program. In this phase, the data in each split is passed to a mapping function to produce output values. In our example, the job of the mapping phase is to count the number of occurrences of each word in its input split (more details about input splits are given above) and prepare a list in the form of <word, frequency>.
Shuffling
This phase consumes the output of the Mapping phase. Its task is to consolidate the relevant records from the Mapping phase output. In our example, the same words are clubbed together along with their respective frequencies.
Reducing
In this phase, the output values from the Shuffling phase are aggregated. This phase combines the values from the Shuffling phase and returns a single output value. In short, this phase summarizes the complete dataset.
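The following is a minimal, single-machine Python sketch that mimics the four phases above for the word-count scenario. It is only meant to illustrate the data flow, not how Hadoop actually implements it:

    from collections import defaultdict

    lines = ["apache spark is fast",
             "apache hadoop is batch oriented"]

    # Input Splits: pretend each line is one split consumed by one mapper.
    splits = [[line] for line in lines]

    # Mapping: each mapper emits <word, 1> pairs for its split.
    mapped = []
    for split in splits:
        for line in split:
            for word in line.split():
                mapped.append((word, 1))

    # Shuffling: group all pairs with the same word together.
    shuffled = defaultdict(list)
    for word, count in mapped:
        shuffled[word].append(count)

    # Reducing: aggregate the grouped values into a single count per word.
    reduced = {word: sum(counts) for word, counts in shuffled.items()}
    print(reduced)   # {'apache': 2, 'spark': 1, 'is': 2, ...}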
Hadoop divides the job into tasks. There are two types of tasks:
1. Map tasks (Splits & Mapping)
2. Reduce tasks (Shuffling, Reducing)
Execution of map tasks results in output being written to a local disk on the respective node, not to HDFS. Reading from and writing to local disk is very time consuming.
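As a rough sketch of how those two task types look in practice, here is a word count written in the Hadoop Streaming style, with the map task and reduce task as separate Python scripts reading stdin and writing stdout. The script names are hypothetical, and Streaming assumes the reducer receives its input already sorted by key; the mapper's output is exactly the intermediate data that gets materialized on local disk between the two tasks:

    # mapper.py (hypothetical) -- the map task: emit <word, 1> for every word.
    import sys

    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word}\t1")

    # reducer.py (hypothetical) -- the reduce task: input arrives sorted by word,
    # so counts for the same word are adjacent and can be summed in one pass.
    import sys

    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.strip().split("\t")
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print(f"{current_word}\t{current_count}")
            current_word, current_count = word, int(count)
    if current_word is not None:
        print(f"{current_word}\t{current_count}")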
Processing Data Using MapReduce
2. Support for Batch Processing Only
Hadoop supports batch processing only; it does not process streamed data, and hence overall performance is slower. Hadoop's MapReduce framework does not leverage the memory of the Hadoop cluster to the maximum. There are of course Big Data frameworks built to handle real-time data sources; a few of them are Apache Storm, Apache Flink, Apache Samza and Apache Spark Streaming.
Apache Hadoop is designed for batch processing: it takes a huge amount of data as input, processes it, and produces the result. Although batch processing is very efficient for processing a high volume of data, depending on the size of the data being processed and the computational power of the system, the output can be delayed significantly. Hadoop is not suitable for real-time data processing.
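For contrast, here is a minimal sketch of a streaming word count using Spark Structured Streaming, assuming Spark is installed and text lines arrive on a local socket at port 9999 (the source and port are just assumptions for illustration):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import explode, split

    spark = SparkSession.builder.appName("StreamingWordCount").getOrCreate()

    # Read lines as an unbounded stream from a local socket (assumed source).
    lines = (spark.readStream.format("socket")
             .option("host", "localhost")
             .option("port", 9999)
             .load())

    # Split each line into words and keep a running count per word.
    words = lines.select(explode(split(lines.value, " ")).alias("word"))
    counts = words.groupBy("word").count()

    # Print the updated counts to the console as new data arrives.
    query = counts.writeStream.outputMode("complete").format("console").start()
    query.awaitTermination()

Unlike a MapReduce job, this keeps processing new lines as they arrive instead of waiting for a complete input dataset.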
Finally, there are certain cases where MapReduce is not a suitable choice:
- Real-time processing.
- It's not always easy to implement each and everything as an MR program.
- When your intermediate processes need to talk to each other (jobs run in isolation).
- When your processing requires a lot of data to be shuffled over the network.
- When you need to handle streaming data. MR is best suited to batch processing huge amounts of data which you already have with you.
- When you can get the desired result with a standalone system. It's obviously less painful to configure and manage a standalone system than a distributed system.
- When you have OLTP (online transaction processing) needs. MR is not suitable for a large number of short online transactions. Hadoop is purely OLAP (online analytical processing).
In the next blog I will show how Apache Spark solves these issues. We will see the internal architecture and each component of Spark in detail.



