Map Reduce programming, I am sure you heard about it when you started learning Big Data. So, today I am going to cover Map Reduce. Basically, Map Reduce is a Java-based framework used for processing and analyzing large data sets, and it was originally developed by Google. That's why we have to know about it. OK, so that was the basic overview of Map Reduce, but what actually is Map Reduce? Let me explain.
What is Map-Reduce:
In the previous article, I told you about Big Data. What we were doing there is that we divided large data into small blocks. Then we assigned those blocks to our devices, which perform some specific tasks and return their output to the sender.
Map Reduce divides a task into small parts and assigns them to many devices; the results are then collected in one place and integrated into the final data set.
As you can see, we distributed the large data into blocks and assigned them to our local computers to perform an operation on them. After the operation, each computer sends its output to a centralized system, and we can access that data from the centralized system.
Here, the centralized system is also the Name Node, and our local computers are the Data Nodes. The centralized system keeps track of which task you gave to which device, while the local computers take the data and perform the actions, so they can be called Data Nodes. If you don't know what a Data Node or a Name Node is, then please check out our previous article, What is Big Data.
How Map Reduce Works
In Map Reduce there are two main phases: 1) Map and 2) Reduce
1. Map:- The map phase takes data and converts it into another set of data, in which individual elements are broken down into key-value pairs (tuples).
2. Reduce:- The reduce phase takes the output from the map as its input and combines all of that data into a smaller set of tuples.
Here is an example of Map Reduce working. What is going on here: we have data in words, and we have to count repeated words. We have three words, "Call", "Of", and "Techies"; suppose this is our Big Data. Now let me tell you how it works:
1. The first step is to divide the huge data. As you can see, here we have a couple of words, so in the first step we split the data into small blocks.
2. After splitting the data, the blocks go to their parallel devices. Example:- We have 100 MB of data, so it will be split into 4 blocks of 25 MB each. The first machine takes 25 MB of data, the second takes 25 MB, the third takes 25 MB, and the fourth also takes 25 MB. As the image above shows, after the data is split, the map phase takes the input and converts it into key-value pairs.
3. In the shuffle phase, all the output from the map is collected, all the similar keys are grouped together, and then the result is sent to Reduce.
4. Reduce sums up that data, so Call(1,1,1) gives us the result (Call,3). This is how we get the final count.
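The shuffle and reduce steps above can be sketched in plain Java. This is a minimal simulation of the idea, not the real Hadoop API; the word list stands in for the output of the map phase on our sample data:

```java
import java.util.*;

public class ShuffleReduceDemo {
    // Shuffle + reduce: group identical words together, then sum the 1s per key
    static Map<String, Integer> countWords(List<String> mapped) {
        // Shuffle: Call -> [1, 1, 1], Of -> [1], Techies -> [1, 1]
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String word : mapped) {
            grouped.computeIfAbsent(word, k -> new ArrayList<>()).add(1);
        }
        // Reduce: sum each key's list of ones
        Map<String, Integer> reduced = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int sum = 0;
            for (int one : e.getValue()) sum += one;
            reduced.put(e.getKey(), sum);
        }
        return reduced;
    }

    public static void main(String[] args) {
        List<String> mapped = List.of("Call", "Of", "Techies", "Call", "Techies", "Call");
        System.out.println(countWords(mapped)); // prints {Call=3, Of=1, Techies=2}
    }
}
```

In real Hadoop the framework does the grouping for you between the map and reduce phases; here we do it by hand with a `TreeMap` just to make the data flow visible.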
What we need for Map Reduce Programming
- We need two custom classes: the first is a Mapper and the second is a Reducer.
- We also need to override two functions: map and reduce.
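To show the shape of those two custom classes and their overridden functions, here is a self-contained sketch. The `Mapper` and `Reducer` base classes below are tiny stand-ins I wrote to mimic the structure of the real Hadoop ones, so this compiles without Hadoop:

```java
import java.util.*;
import java.util.function.BiConsumer;

public class MapReduceSkeleton {
    // Stand-ins for the real Hadoop base classes, just to show the shape
    static abstract class Mapper<KIn, VIn, KOut, VOut> {
        abstract void map(KIn key, VIn value, BiConsumer<KOut, VOut> emit);
    }
    static abstract class Reducer<K, V, VOut> {
        abstract VOut reduce(K key, List<V> values);
    }

    // Custom class 1: the Mapper, overriding map()
    static class WordCountMapper extends Mapper<Long, String, String, Integer> {
        @Override
        void map(Long offset, String line, BiConsumer<String, Integer> emit) {
            for (String word : line.split("\\s+")) emit.accept(word, 1); // emit (word, 1)
        }
    }

    // Custom class 2: the Reducer, overriding reduce()
    static class WordCountReducer extends Reducer<String, Integer, Integer> {
        @Override
        Integer reduce(String word, List<Integer> counts) {
            int sum = 0;
            for (int c : counts) sum += c; // sum the 1s for this word
            return sum;
        }
    }

    public static void main(String[] args) {
        // Run the mapper, shuffle by hand, then run the reducer per key
        Map<String, List<Integer>> shuffled = new TreeMap<>();
        new WordCountMapper().map(0L, "Call Of Techies Call",
                (k, v) -> shuffled.computeIfAbsent(k, x -> new ArrayList<>()).add(v));
        WordCountReducer r = new WordCountReducer();
        shuffled.forEach((w, c) -> System.out.println("(" + w + "," + r.reduce(w, c) + ")"));
        // prints (Call,2) (Of,1) (Techies,1), one pair per line
    }
}
```

In a real Hadoop job you would instead extend `org.apache.hadoop.mapreduce.Mapper` and `Reducer` and let the framework call your overridden methods, but the division of work is the same.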
This is how Map Reduce works. Still can't get it? Don't worry, let's take an example.
In this example, we are going to count the words in a text file. Suppose we have a text file like this:
I am Nothing
I am Bored
I am Smart
So, here is our text file, in which some sentences are written. Now, what happens?
1. The Record Reader will split your file into key-value pairs like:
{1, I am Nothing}
{2, I am Bored}
{3, I am Smart}
And for this example, the data types used are {LongWritable, Text}. (In Hadoop the LongWritable key is actually the line's offset in the file; simple line numbers are shown here for readability.)
2. After splitting, the data goes into the mapper, and the mapper breaks that data down into individual pairs like:
(I,1)(am,1)(Nothing,1)
(I,1)(am,1)(Bored,1)
(I,1)(am,1)(Smart,1)
The data types are (Text, IntWritable).
3. After mapping finishes, there are two more processes to do: shuffle and sort. They combine all the data like:
(I,1,1,1) -> (Text, Iterable&lt;IntWritable&gt;) // key-value pairs
(am,1,1,1)
(Nothing,1)
(Bored,1)
(Smart,1)
4. Finally, it goes to the reducer, and the reducer sums everything up like:
(I,3)
(am,3)
(Nothing,1)
(Bored,1)
(Smart,1)
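The whole walkthrough above, from the record reader down to the reducer, can be reproduced end to end in plain Java. Again, this is a simulation of the pipeline rather than real Hadoop code; the input is just the three sample lines from the text file:

```java
import java.util.*;

public class WordCountPipeline {
    static Map<String, Integer> count(String[] file) {
        // 1. Record reader: key = byte offset of the line, value = the line itself
        //    (this is the pair that {LongWritable, Text} represents in Hadoop)
        long offset = 0;
        List<Map.Entry<Long, String>> records = new ArrayList<>();
        for (String line : file) {
            records.add(Map.entry(offset, line));
            offset += line.length() + 1; // +1 for the newline
        }

        // 2. Map: emit (word, 1) for every word,
        // 3. Shuffle & sort: group the 1s by word
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<Long, String> record : records) {
            for (String word : record.getValue().split("\\s+")) {
                grouped.computeIfAbsent(word, k -> new ArrayList<>()).add(1);
            }
        }

        // 4. Reduce: sum each word's list of 1s
        Map<String, Integer> reduced = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int sum = 0;
            for (int one : e.getValue()) sum += one;
            reduced.put(e.getKey(), sum);
        }
        return reduced;
    }

    public static void main(String[] args) {
        String[] file = { "I am Nothing", "I am Bored", "I am Smart" };
        for (Map.Entry<String, Integer> e : count(file).entrySet()) {
            System.out.println("(" + e.getKey() + "," + e.getValue() + ")");
        }
        // prints (Bored,1) (I,3) (Nothing,1) (Smart,1) (am,3), one pair per line
    }
}
```

The counts match the walkthrough; the only difference is the ordering, because a `TreeMap` sorts keys alphabetically (with uppercase letters before lowercase), whereas the article lists them in order of first appearance.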
This is how we can count words in big files. I will not include the full program here because it would get too complex for now; I just wanted to give you the basic idea of how Map Reduce works. I hope you understood everything behind Map Reduce. I will see you guys in the next article. Goodbye.