Setup and Configuration of Hadoop Cluster

What will we do?

Set up a Hadoop cluster with one NameNode (Master), four DataNodes (Slaves), and one Client node.

✴️ Upload a file through Client to the NameNode.

✴️ Check which DataNode the Master chooses to store the file.

✴️ Once uploaded, read the file through the Client using the cat command. While the Master is serving the file from the DataNode where it stored it, delete or crash that DataNode and see how, with the help of the replicated storage, the Master still retrieves the file and presents it to the Client.

Hadoop is an open-source software framework for storing data and running applications on clusters of commodity hardware. It provides massive storage for any kind of data, enormous processing power and the ability to handle virtually limitless concurrent tasks or jobs.

It is part of the Apache project sponsored by the Apache Software Foundation.

What is a Hadoop Cluster?

A Hadoop cluster is a collection of computers, known as nodes, that are networked together to perform parallel computations on big data sets. Unlike other computer clusters, Hadoop clusters are designed specifically to store and analyze mass amounts of structured and unstructured data in a distributed computing environment. The nodes communicate with each other using HDFS (Hadoop Distributed File System).

Name Node:- The name node regulates clients' access to files. It maintains and manages the slave nodes and assigns tasks to them. It also maintains an index table (INode table) that holds all the metadata about the files stored in the cluster.

Data Node:- Nodes that share their storage and compute with the master node are known as data nodes or slave nodes. A data node stores the data uploaded by the client, and the client reads that data directly from the slave node.

Client Node:- Nodes that are responsible for loading the data and fetching the results are known as client nodes.

Practical:

We will run every node (Master Node, Data Nodes, and Client) on AWS Cloud, using a RedHat 8 EC2 instance for each node.

So, first of all, we need the JDK package as well as the Hadoop package on each instance. The JDK is needed because the Hadoop framework is written in Java.

Run the following:

sudo su - root (to log in as the root account)

rpm -ivh jdk-8u171-linux-x64.rpm (to install java)

rpm -ivh hadoop-1.2.1-1.x86_64.rpm --force (to install Hadoop)

echo 3 > /proc/sys/vm/drop_caches (to clear the caches)

Confirm whether they are installed by:

java -version

hadoop version

Name Node Setup and Configuration

First, make a directory for the NameNode to store its metadata in, then move to Hadoop's configuration folder:

mkdir /nn

cd /etc/hadoop/

ls

Open hdfs-site.xml inside the /etc/hadoop/ folder:

    <configuration>
        <property>
            <name>dfs.name.dir</name>
            <value>/nn</value>
        </property>
    </configuration>

Open core-site.xml

    <configuration>
        <property>
            <name>fs.default.name</name>
            <value>hdfs://0.0.0.0:9001</value>
        </property>
    </configuration>
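As a sketch, the two NameNode config files above can also be generated from the shell with heredocs. This writes them to the current directory; copy them into /etc/hadoop on the master afterwards.

```shell
# Sketch: generate the NameNode configuration files shown above.
# Written to the current directory; copy them into /etc/hadoop on the master.
cat > hdfs-site.xml <<'EOF'
<configuration>
    <property>
        <name>dfs.name.dir</name>
        <value>/nn</value>
    </property>
</configuration>
EOF

cat > core-site.xml <<'EOF'
<configuration>
    <property>
        <name>fs.default.name</name>
        <value>hdfs://0.0.0.0:9001</value>
    </property>
</configuration>
EOF
```

Scripting the configs this way makes it easy to repeat the setup on a fresh instance.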

The configuration files are now in place. Next, we have to format the NameNode.

But why does the Name Node need formatting?

Because on formatting, the Name Node creates an index table (INode table) which will contain all the metadata about the available Data Nodes and about all the files stored in the cluster.

hadoop namenode -format (to format the name node)

Now, start the namenode using:

hadoop-daemon.sh start namenode (to start the name node)

Confirm if it’s started or not by typing:

jps (if NameNode is shown, it started properly)

To check how many datanodes are available and their details:

hadoop dfsadmin -report (to check the report of hadoop cluster)

SETUP OF DATA NODE

mkdir /dn

cd /etc/hadoop/

ls

In hdfs-site.xml

    <configuration>
        <property>
            <name>dfs.data.dir</name>
            <value>/dn</value>
        </property>
    </configuration>

In core-site.xml

We have to write the following code in this file, and here we write the master's IP because the data node has to share its storage with the master node:-

    <configuration>
        <property>
            <name>fs.default.name</name>
            <value>hdfs://52.66.204.158:9001</value>
        </property>
    </configuration>
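As a sketch, both DataNode config files can be generated with shell heredocs, parameterised on the master's IP (52.66.204.158 in this walkthrough). The files are written to the current directory; copy them into /etc/hadoop on each data node.

```shell
# Sketch: generate the DataNode configuration files, parameterised on the
# master's IP (defaults to the address used in this walkthrough).
MASTER_IP="${MASTER_IP:-52.66.204.158}"

cat > hdfs-site.xml <<'EOF'
<configuration>
    <property>
        <name>dfs.data.dir</name>
        <value>/dn</value>
    </property>
</configuration>
EOF

# Unquoted EOF so ${MASTER_IP} is expanded into the file.
cat > core-site.xml <<EOF
<configuration>
    <property>
        <name>fs.default.name</name>
        <value>hdfs://${MASTER_IP}:9001</value>
    </property>
</configuration>
EOF
```

Running this on each of the data nodes avoids editing the XML by hand four times.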

Start the datanode

hadoop-daemon.sh start datanode (to start the data node)

jps (To check if datanode is started or not)

hadoop dfsadmin -report (to check the report of hadoop cluster)

(Screenshots: AWS EC2 dashboard, core-site.xml file, hdfs-site.xml file, data node started, hadoop dfsadmin -report output.)

SETUP OF CLIENT NODE

cd /etc/hadoop/

In core-site.xml

We have to write the following code in this file, and here we write the master's IP because the client has to use the storage of the master node's cluster:-

    <configuration>
        <property>
            <name>fs.default.name</name>
            <value>hdfs://52.66.204.158:9001</value>
        </property>
    </configuration>

Some commands to upload, read, and delete files:

hadoop fs -ls / (to see the list of files uploaded by client)

hadoop fs -put file.txt / (to upload the file)

hadoop fs -cat /file.txt (to read the file)

hadoop fs -rm /file.txt (to remove the file)
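A minimal end-to-end session using these commands might look like the following sketch. It assumes the hadoop CLI is on the PATH of the configured client node, so each cluster command is guarded.

```shell
# Sketch: create a local file, then upload, list, read and remove it on the
# cluster. Guarded so the script is a no-op where hadoop is not installed.
echo "hello hadoop" > file.txt

if command -v hadoop >/dev/null 2>&1; then
    hadoop fs -put file.txt /     # upload; replicated to 3 data nodes by default
    hadoop fs -ls /               # the file should now appear in the listing
    hadoop fs -cat /file.txt      # read it back through the cluster
    hadoop fs -rm /file.txt       # remove it again
else
    echo "hadoop CLI not found; run this on the configured client node"
fi
```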

Finally, all the configurations and setup is completed.

Now, we are doing some tasks to better understand the concept of the Hadoop cluster.

Uploading a file through Client

Before uploading the file, we run the "tcpdump" command on all the data nodes to see the packets each slave node receives. The command is as follows:-

tcpdump -i eth0 tcp port 50010 -n (port 50010 is used for transferring files across hadoop cluster)

But, what happens in data nodes when we upload the file 🤔

Are you curious to know?🤔

So let’s find out what happens to the data node when we upload the file.🧐

When we upload a file from the client node, all 3 data nodes start receiving packets, because the default replication factor (number of replicas) equals 3.

We can, however, change the replication factor.
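Two ways to change the replication factor, as a sketch: set the cluster-wide default with the dfs.replication property in hdfs-site.xml, or change it per file after upload with hadoop fs -setrep (the /file.txt path below is just an example name).

```shell
# Sketch: write a dfs.replication snippet to merge into hdfs-site.xml
# (this sets the cluster-wide default for new files).
cat > replication-snippet.xml <<'EOF'
<property>
    <name>dfs.replication</name>
    <value>2</value>
</property>
EOF

# Per-file alternative: -setrep changes the replica count of an already
# uploaded file; -w waits until replication actually reaches the new value.
if command -v hadoop >/dev/null 2>&1; then
    hadoop fs -setrep -w 2 /file.txt
fi
```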

Packets received through tcpdump

So the hello.txt file is uploaded to all 3 data nodes. Now we read the file from the client node using the "-cat" command.

But before reading the file, we set up aliases on the master node and on all the data nodes, so that we can stop the nodes faster.

Now, when we start reading the file from the client node, one of the data nodes starts receiving packets at great speed. We immediately stop that data node, the one the client is reading from. We also stop the master node, to check from which node the client directly reads the data.

⚫ Stopping the data node

Hence, after stopping the name node, we learn that the client reads the data directly from the data node, and the client faces no issue while reading. On the other hand, after stopping this data node, the 2nd data node immediately starts receiving packets. And again, we stop that 2nd data node.

After stopping both of those nodes, the last data node starts receiving packets. And we stop that data node too.

After stopping all the data nodes, the client is unable to read the file.

By this, we learn that to read a file we need at least one data node that holds a replica of that file.
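To check which data nodes actually hold the replicas of a file, hadoop fsck can print the block locations. A guarded sketch, using the /hello.txt file name from above:

```shell
# Sketch: list the blocks and replica locations of a file in HDFS.
# Guarded so the script still runs where hadoop is not installed.
if command -v hadoop >/dev/null 2>&1; then
    report=$(hadoop fsck /hello.txt -files -blocks -locations)
else
    report="hadoop CLI not found; run this on the master or client node"
fi
echo "$report"
```

Each block line in the report names the data nodes that store a replica, which matches what tcpdump showed us indirectly.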

We can also see information about the Hadoop cluster from the WebUI, by visiting:- master's_IP:50070/dfshealth.jsp
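The same page can be probed from the shell with curl, e.g. as a scripted health check. The IP default and the timeout below are assumptions for this sketch.

```shell
# Sketch: probe the NameNode web UI from the command line.
MASTER_IP="${MASTER_IP:-52.66.204.158}"   # the master's public IP in this setup
if command -v curl >/dev/null 2>&1; then
    page=$(curl -s --max-time 3 "http://${MASTER_IP}:50070/dfshealth.jsp")
    if [ -n "$page" ]; then
        echo "NameNode web UI reachable"
    else
        echo "web UI not reachable from here"
    fi
fi
```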

And we can also see the information about our files.

Finally, we completed all the tasks successfully.

By the above tasks, we get to know the following points:-

1. We set up a Hadoop cluster.

2. If we have more than 3 data nodes and we upload a file from the client node, it is stored on 3 data nodes by default.

3. The client node reads data directly from the data nodes. The client contacts the Name Node only once, to get the IP addresses of the Data Nodes, so that it can then communicate with them directly over the HDFS protocol.

4. If the data node from which the client is reading stops, the client starts reading the data from another data node that stores that file.

Hope you found this useful!!

Thank you!!
