What are Big Data, Hadoop, and HDFS?

Artificial intelligence and Bitcoin are the trending buzzwords today, but do you know about Big Data? It is just as important as they are. Why? Don't worry, you will have a good idea after reading this article. Data is everywhere: the pictures, songs, and movies you upload, and all your activities on mobile and computer, generate lots of data daily. When we scale this up to a large audience, the combination of all these small pieces of data becomes Big Data. Now think how complicated it is to manage that huge amount of data. Let's take an example just to give you an idea. Every day, lots of creators upload their videos to YouTube. If you add up all of that data, the total will blow your mind. How does YouTube manage that huge amount of data? How does it optimize it? That is where the Big Data concept comes in. I am not going to bore you with the history of Big Data, so let's move directly to the definition.

Big Data:- Big Data is basically a huge volume of data that cannot be stored or processed by the traditional approach within a given time period.



Didn't get the definition? Don't worry, let me explain. The definition talks about a huge volume of data, so let's say we have 1000 TB of data. We all know we cannot store that much data on our PC or laptop; generally, a personal computer has a 1 TB or 2 TB hard disk. Now, what if we have to store 1000 TB of data? That is the first question. The second part is processing: if we cannot even store the data properly, how can a single processor handle that amount of data? The third point is the time limit. The time period is the most important factor in applying the Big Data concept, because time is valuable to everyone; you have to provide a fast and reliable service, and if you cannot, you will fail. So we have to consider all three factors (volume, processing, and time) to provide the best service, and that is exactly where the Big Data concept applies.

How huge does the data need to be to count as Big Data?
You already have some idea about Big Data. Now the question stuck in your head will be: how much data does it take to be considered Big Data? Many of you have the misconception that data must be in GBs, TBs, or PBs to be called Big Data, but that's not true.

To be called Big Data, it is not necessary that the data be in terabytes or petabytes. Even a small amount of data can be Big Data in the right context. Didn't get this? Let me explain.

We all use Gmail, and you may have noticed that the attachment size limit is 25 MB; you cannot send a file larger than that. So a larger file is effectively Big Data for email.

One more example: when we travel by train, we book tickets a day before, check availability, and much more. Before going to the station, we check the train status on our mobile: how far is it from our station? How long will it take to reach the station? Each train continuously sends data to the management system. Now think about how many trains are running on the tracks at the same time, all generating data simultaneously. How do we manage all that data at once? What if we use the old process for this? Can we get correct results at the right time? That's why the Big Data concept is here.

Big companies have enormous amounts of data, yet they still give us excellent service, thanks to Big Data techniques. Google used its own system, the Google File System (GFS), together with the MapReduce programming model. Later, Hadoop arrived as open source, and now Hadoop is the most widely used framework for Big Data. You are quite curious about Hadoop, right? OK, let's see what Hadoop is.

What is Hadoop?
Hadoop is a programming framework. It is Java-based and open source. It supports processing large amounts of data in a distributed computing environment.

Hadoop is actually used for two purposes:

  • Storing
  • Processing
Storing:- Storing a large amount of data is a big challenge for us; we cannot store that much data on one particular device. So we have to split the data into small parts and then store it. Hadoop distributes the data for storage across more than one device.
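The idea of splitting data into fixed-size blocks and spreading them across devices can be sketched in a few lines. This is a toy illustration, not Hadoop's actual code; the three node names and the tiny 4-byte block size are made up just for the example:

```python
# Toy illustration of block-based distribution: split data into
# fixed-size blocks and assign them round-robin to storage nodes.
def distribute(data: bytes, block_size: int, nodes: list):
    placement = {node: [] for node in nodes}
    for i in range(0, len(data), block_size):
        block = data[i:i + block_size]
        node = nodes[(i // block_size) % len(nodes)]  # round-robin choice
        placement[node].append(block)
    return placement

placement = distribute(b"abcdefghij", 4, ["node1", "node2", "node3"])
print(placement)
# node1 holds b"abcd", node2 holds b"efgh", node3 holds the last b"ij"
```

Real HDFS also replicates each block on several nodes for fault tolerance, but the basic split-and-spread idea is the same.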

Processing:- Processing is actually a very complicated task to perform on a large amount of data. So we split the large data set into small parts and then process each part separately. This is the basic concept of Hadoop; in my words, divide and rule (on data, lol).
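This divide-and-rule idea is exactly what MapReduce does: process each small part independently (map), then merge the partial results (reduce). Here is a minimal pure-Python sketch of a MapReduce-style word count; the two "parts" are simulated in one process just to show the flow:

```python
from collections import Counter

# Map phase: count words in each small part independently.
def map_count(part: str) -> Counter:
    return Counter(part.split())

# Reduce phase: merge the partial counts into one final result.
def reduce_counts(partials) -> Counter:
    total = Counter()
    for partial in partials:
        total += partial
    return total

parts = ["big data big", "data hadoop data"]   # two "blocks" of text
result = reduce_counts(map_count(p) for p in parts)
print(result["data"])  # 3
```

In real Hadoop, each map runs on the DataNode that already holds the block, so the computation moves to the data instead of the data moving to the computation.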

Hadoop can work with many interconnected machines and handle many terabytes of data, and it provides a rapid data transfer rate. It was inspired by Google's MapReduce programming framework.

Hadoop components
  • HDFS (Hadoop Distributed File System)
  • NameNode
  • DataNode
  • YARN (Yet Another Resource Negotiator)
DataNode:- All the data you want to analyze is stored on DataNodes; they hold your actual data in HDFS. A DataNode is also known as a SLAVE. The NameNode and the DataNodes are in constant communication; for example, if one DataNode fails, the NameNode assigns its work to another.

NameNode:- The NameNode is known as the MASTER. It stores the metadata of HDFS; it does not store the actual data, only metadata, or you can say just information about the data. The NameNode keeps the list of blocks for each file along with their locations, and with that information it can reconstruct the file from its blocks. Why did I call the NameNode the MASTER? Because if we don't have the address of our data, how can we process it? The NameNode needs a bit more RAM, since it keeps this metadata in memory.
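The NameNode's metadata can be pictured as two small lookup tables: file → blocks, and block → replica locations. This toy model (the file name, block IDs, and node names are invented for illustration) also shows the failover idea from the DataNode section, where a read simply falls back to another live replica:

```python
# Toy model of NameNode metadata: which blocks make up a file,
# and which DataNodes hold a replica of each block.
block_map = {"movie.mp4": ["blk_1", "blk_2"]}            # file -> blocks
locations = {"blk_1": ["dn1", "dn2"], "blk_2": ["dn2", "dn3"]}

def read_plan(filename, dead_nodes=frozenset()):
    """Pick one live replica per block, in order, to reconstruct the file."""
    plan = []
    for blk in block_map[filename]:
        live = [dn for dn in locations[blk] if dn not in dead_nodes]
        plan.append((blk, live[0]))   # first live replica serves the read
    return plan

print(read_plan("movie.mp4"))           # [('blk_1', 'dn1'), ('blk_2', 'dn2')]
print(read_plan("movie.mp4", {"dn1"}))  # dn1 failed: blk_1 is read from dn2
```

Notice that the actual bytes never pass through this metadata layer; that is exactly why the real NameNode needs RAM rather than huge disks.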

YARN:- YARN is basically the resource manager. When one system is overloaded with work, YARN distributes the load to another system.

Working of HDFS



As you can see in the picture, there is one NameNode, several DataNodes, and the blocks. A large data set gets divided into small blocks, and these blocks are stored on the DataNodes; for example, block1, block2, and block3 on DataNode1, DataNode2, and DataNode3 respectively. The DataNodes store the data, and HDFS runs the processing on the blocks, from which you get the output. Above the NameNode runs a process called the Job Tracker, which is a Java program. The Job Tracker schedules tasks across all the DataNodes and stores the results back into HDFS. By default, each block size is limited to 128 MB, but you can change it if you want.
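The default 128 MB block size means the number of blocks for a file is simply its size divided by 128 MB, rounded up. A quick check (the 1 GB file size is just an example):

```python
BLOCK_SIZE = 128 * 1024 * 1024          # 128 MB, the HDFS default

def num_blocks(file_size: int) -> int:
    # Ceiling division: a partial final block still occupies one block.
    return -(-file_size // BLOCK_SIZE)

one_gb = 1024 * 1024 * 1024
print(num_blocks(one_gb))        # 8 blocks of 128 MB each
print(num_blocks(one_gb + 1))    # 9: even one extra byte needs a new block
```

This is also why HDFS prefers a large block size: fewer blocks per file means less metadata for the NameNode to keep in RAM.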

Any doubts? If yes, then ask in the comments, and if the article was helpful, you can share it with your friends.
