Friday, 1 March 2019


Before diving into Apache Spark, we should know about the shortcomings of Hadoop. We will take one example and understand how Hadoop actually works, looking at the different phases of execution in Hadoop. Then we will see the same example in Apache Spark and how it improves execution performance.



Scenario:
There is a text file containing some lines of text. We need to process this file and count the occurrences of each word appearing in it.
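For concreteness, suppose (hypothetically) the input file contains these two lines:

    Deer Bear River
    Car Car River

The expected output of the job is then the frequency of every word:

    Bear   1
    Car    2
    Deer   1
    River  2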




Shortcomings in Hadoop's MapReduce architecture





1.   Slow Processing Speed


The data goes through the following phases:

Input Splits

Input to a MapReduce job is divided into fixed-size pieces called input splits. An input split is the chunk of the input that is consumed by a single map task.
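As a rough illustration (this is not actual Hadoop API code, and the sizes below are assumed values), the number of map tasks is driven by the number of splits, and the split size typically defaults to the HDFS block size:

    public class SplitCountSketch {
        public static void main(String[] args) {
            long fileSize  = 1024L * 1024 * 1024;  // assumed 1 GB input file
            long splitSize = 128L * 1024 * 1024;   // assumed 128 MB split (HDFS block size default)
            long numSplits = (fileSize + splitSize - 1) / splitSize; // ceiling division
            System.out.println(numSplits + " splits -> " + numSplits + " map tasks"); // prints 8
        }
    }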





         
Mapping

This is the very first phase in the execution of a MapReduce program. In this phase, the data in each split is passed to a mapping function to produce output values. In our example, the job of the mapping phase is to count the number of occurrences of each word from the input splits (more details about input splits are given above) and prepare a list in the form of <word, frequency>.
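Below is a minimal word-count mapper sketch against the standard org.apache.hadoop.mapreduce API (the class and field names are my own, not from any official example). It emits a <word, 1> pair for every occurrence; the shuffle and reduce phases then aggregate these into the final <word, frequency> list:

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Split the current line of the input split into words.
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE); // emit <word, 1> for every occurrence
            }
        }
    }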

Shuffling

This phase consumes the output of the Mapping phase. Its task is to consolidate the relevant records from the Mapping phase's output. In our example, the same words are clubbed together along with their respective frequencies.
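The shuffle is performed by the framework itself, so there is no user code for it. The toy simulation below (plain Java, using the hypothetical sample data from the scenario above) only illustrates the grouping the framework performs:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;
    import java.util.TreeMap;

    public class ShuffleSketch {
        public static void main(String[] args) {
            // Hypothetical mapper output: one <word, 1> pair per occurrence.
            String[] mapped = {"Deer", "Bear", "River", "Car", "Car", "River"};
            Map<String, List<Integer>> grouped = new TreeMap<>();
            for (String word : mapped) {
                // Club identical words together, collecting their counts.
                grouped.computeIfAbsent(word, k -> new ArrayList<>()).add(1);
            }
            System.out.println(grouped); // {Bear=[1], Car=[1, 1], Deer=[1], River=[1, 1]}
        }
    }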

Reducing

In this phase, output values from the Shuffling phase are aggregated. This phase combines the values from the Shuffling phase and returns a single output value per key. In short, this phase summarizes the complete dataset.
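A matching reducer sketch (again, names are my own) sums the counts grouped under each word by the shuffle and emits the final <word, total frequency> pair:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get(); // add up all counts shuffled to this word
            }
            result.set(sum);
            context.write(key, result); // emit <word, total frequency>
        }
    }
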
Hadoop divides the job into tasks. 

There are two types of tasks:

1.   Map tasks (Splits & Mapping)
2.   Reduce tasks (Shuffling, Reducing)

Execution of map tasks results in output being written to a local disk on the respective node, not to HDFS. Reading from and writing to local disk between phases is very time-consuming.
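Putting it together, a standard WordCount-style driver wires the map and reduce tasks into one job (the class names refer to the sketches above; input and output paths come from the command line):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCountDriver.class);
            job.setMapperClass(WordCountMapper.class);     // map tasks: splits & mapping
            job.setCombinerClass(WordCountReducer.class);  // optional pre-aggregation per node
            job.setReducerClass(WordCountReducer.class);   // reduce tasks: shuffling & reducing
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Setting the reducer as a combiner is an optional optimization: it pre-aggregates counts on each mapper's node before the shuffle, which cuts down the data sent over the network.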







[Figure: Processing Data Using MapReduce]




2.   Support for Batch Processing Only

Hadoop supports batch processing only; it does not process streamed data, and hence overall performance is slower. Hadoop's MapReduce framework does not leverage the memory of the Hadoop cluster to the maximum. There are, of course, Big Data frameworks which are made to handle real-time data sources; a few of them are Apache Storm, Apache Flink, Apache Samza, and Apache Spark Streaming.

3.   No Real-time Data Processing

Apache Hadoop is designed for batch processing; that means it takes a huge amount of data as input, processes it, and produces the result. Although batch processing is very efficient for processing a high volume of data, depending on the size of the data being processed and the computational power of the system, the output can be delayed significantly. Hadoop is not suitable for real-time data processing.
      
Finally, there are certain cases where MapReduce is not a suitable choice:


  • Real-time processing.
  • It's not always very easy to implement each and everything as a MapReduce program.
  • When your intermediate processes need to talk to each other (jobs run in isolation).
  • When your processing requires a lot of data to be shuffled over the network.
  • When you need to handle streaming data. MapReduce is best suited to batch processing huge amounts of data which you already have with you.
  • When you can get the desired result with a standalone system. It's obviously less painful to configure and manage a standalone system than a distributed system.
  • When you have OLTP (online transaction processing) needs. MapReduce is not suitable for a large number of short online transactions; Hadoop is purely an OLAP (online analytical processing) system.


In the next blog, I will show how Apache Spark solves these issues. We will see the internal architecture and each component of Spark in detail.