Wednesday, November 13, 2013

Setup Multi Node Hadoop 2.0 cluster configuration

Installing Hadoop 2.x.x – multi-node cluster configuration
OS : Debian / BOSS / Ubuntu
Hadoop : Hadoop 2.2.0

Find here the common mistakes made while installing Hadoop 1.x.x.

Find here the Hadoop-2.2.0 single-node cluster setup. A multi-node Hadoop cluster can be set up in either of two ways: 1) do the single-node cluster setup on every machine and then change the Hadoop configuration files, or 2) follow the steps below (except sub-steps 2 and 3 of step 5, which are only for the master) on the Master and all the slave nodes.

  1. Prerequisites (for both the Master and all the slaves)
    1. Java 6 or above needs to be installed
      Ensure that a JDK is already installed on your machine; otherwise install one.
      Download the jdk1.* archive and extract it.
      root@solaiv[~]#vi /etc/profile
      Add : JAVA_HOME=/usr/local/jdk1.6.0_18
      Append : PATH="...:$JAVA_HOME/bin"
      Add : export JAVA_HOME
      Source /etc/profile to apply the changes, then check the Java version
      root@solaiv[~]#. /etc/profile (or) source /etc/profile
      root@solaiv[~]# java -version
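
The /etc/profile edit above can also be scripted. This is only a sketch: `add_java_env` is a made-up helper name, the JDK path is the one used in this post, and `$PATH` stands in for whatever your existing PATH entry is. It takes the profile file as an argument so you can try it on a scratch file before touching /etc/profile.

```shell
# add_java_env is a hypothetical helper, not a standard command.
# Usage: add_java_env <profile_file> <jdk_dir>
add_java_env() {
    local profile="$1" jdk_dir="$2"
    # Do nothing if JAVA_HOME is already defined, so re-running is harmless.
    if grep -q '^JAVA_HOME=' "$profile" 2>/dev/null; then
        return 0
    fi
    # Note: no spaces around "=", the shell would treat them as separate words.
    cat >> "$profile" <<EOF
JAVA_HOME=$jdk_dir
PATH="\$PATH:\$JAVA_HOME/bin"
export JAVA_HOME PATH
EOF
}
```

As root you would call it as `add_java_env /etc/profile /usr/local/jdk1.6.0_18` and then source the file as shown above.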

    2. Create a dedicated user/group for hadoop. (optional)
      Create the group, then create the user inside that group.
      root@solaiv[~]#addgroup hadoop
      root@solaiv[~]#adduser --ingroup hadoop hduser
      root@solaiv[~]#su hduser

    3. Password-less SSH configuration for localhost; the slave is handled later (optional: if you skip this, you have to provide a password for each process started by ./start-*.sh)
      Generate an SSH key for the hduser user, then enable password-less SSH access to your local machine with this newly created key.
      hduser@solaiv[~]#ssh-keygen -t rsa -P ""
      hduser@solaiv[~]#cat /home/hduser/.ssh/ >> /home/hduser/.ssh/authorized_keys
      hduser@solaiv[~]#ssh localhost
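
A small sketch of the "authorize the key" step, in case you want it repeatable. It assumes the public key is the default file that `ssh-keygen -t rsa` writes, `~/.ssh/id_rsa.pub`; `authorize_key` is a hypothetical helper name. It appends the key only once and tightens the permissions, since sshd tends to refuse keys in directories that are too open.

```shell
# authorize_key is a hypothetical helper, not part of OpenSSH.
# Usage: authorize_key <public_key_file> <ssh_dir>
authorize_key() {
    local pubkey="$1" ssh_dir="$2"
    local auth="$ssh_dir/authorized_keys"
    mkdir -p "$ssh_dir"
    touch "$auth"
    # -x whole line, -F fixed string: append only if this exact key is missing.
    if ! grep -qxF "$(cat "$pubkey")" "$auth"; then
        cat "$pubkey" >> "$auth"
    fi
    chmod 700 "$ssh_dir"
    chmod 600 "$auth"
}
```

For the hduser account the call would be `authorize_key "$HOME/.ssh/id_rsa.pub" "$HOME/.ssh"`, followed by `ssh localhost` to confirm that no password is asked.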

  2. Steps to install Hadoop 2.x.x (for both the Master and all the slaves)
    1. Download Hadoop 2.x.x
    2. Extract hadoop-2.2.0 and move it to /opt/hadoop-2.2.0
    3. Add the following lines to the .bashrc file
      hduser@solaiv[~]#cd ~
      hduser@solaiv[~]#vi .bashrc

Copy and paste the following lines at the end of the file
      #copy start here
      export HADOOP_HOME=/opt/hadoop-2.2.0
      export YARN_HOME=$HADOOP_HOME 
      export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
      export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop 
      #copy end here

  3. Modify the hadoop environment files (for both the Master and all the slaves)
    1. Add JAVA_HOME to libexec/ at the beginning of the file
      hduser@solaiv[~]#vi /opt/hadoop-2.2.0/libexec/
      export JAVA_HOME=/usr/local/jdk1.6.0_18
    2. Add JAVA_HOME to hadoop/ at the beginning of the file
      hduser@solaiv[~]#vi /opt/hadoop-2.2.0/etc/hadoop/
      export JAVA_HOME=/usr/local/jdk1.6.0_18
    3. Check Hadoop installation
      hduser@solaiv[~]#cd /opt/hadoop-2.2.0/bin
      hduser@solaiv[bin]#./hadoop version
      Hadoop 2.2.0
      At this point Hadoop is installed on your node.

  4. Create a folder for tmp (for both the Master and all the slaves)
      hduser@solaiv[~]#mkdir -p $HADOOP_HOME/tmp

  5. Configuration : Multi-node setup
    1. Add the IP addresses of the Master and all Slaves to /etc/hosts (on the Master and all the slave nodes)
      Add the association between the hostnames and the IP addresses of the master and the slaves to /etc/hosts on all the nodes. Make sure that all the nodes in the cluster are able to ping each other.
      hduser@boss:/opt/hadoop-2.2.0/bin#vi /etc/hosts
      <IP address of master> master
      <IP address of slave> slave
      In my case there is only one slave. If you have more slave nodes, name them slave1, slave2, etc.
    2. Password less ssh from master to slave (Optional, only at Master node)
      hduser@boss:[~]#ssh-keygen -t rsa -P ""
      hduser@boss:[~]#ssh-copy-id -i /home/hduser/.ssh/ hduser@slave
      root@boss[bin]#ssh slave
    [Note : If you skip this step, you have to provide the password for every slave when the Master starts the processes by ./start-*.sh. If you have configured more slaves in /etc/hosts, repeat the second command above for each of them: hduser@slave1, hduser@slave2, etc.]
    3. Add the Slave entries in $HADOOP_CONF_DIR/slaves ( only at Master node )
      Add all the slave entries to the slaves file on the Master node. This tells Hadoop which nodes run DataNode and NodeManager. If you don't want the master to act as a DataNode, just omit it from the file.
      hduser@boss:[~]#vi /opt/hadoop-2.2.0/etc/hadoop/slaves
      Note : In my case there is only one slave. If you have more slave nodes, add every slave hostname, one per line, as named in /etc/hosts.
  6. Hadoop Configuration (for both the Master and all the slaves)
    Add the properties to the following hadoop configuration files, which are available under $HADOOP_CONF_DIR
    1. core-site.xml
      hduser@solaiv[~]#cd /opt/hadoop-2.2.0/etc/hadoop
      hduser@solaiv[hadoop]#vi core-site.xml
    #Paste following between <configuration> tag
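
A minimal sketch of such a core-site.xml, written from the shell. The assumptions are mine: the NameNode runs on the host called master from /etc/hosts, port 9000 is just a commonly used choice, the tmp directory is the one created in step 4, and `write_core_site` is a hypothetical helper name. `fs.defaultFS` and `hadoop.tmp.dir` are the standard Hadoop 2 property names.

```shell
# write_core_site is a hypothetical helper, not part of Hadoop.
# Usage: write_core_site <conf_dir> <namenode_host> <tmp_dir>
# It overwrites core-site.xml, so back up a file that already has settings.
write_core_site() {
    local conf_dir="$1" namenode_host="$2" tmp_dir="$3"
    cat > "$conf_dir/core-site.xml" <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <!-- port 9000 is an assumption, any free port works -->
    <value>hdfs://${namenode_host}:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>${tmp_dir}</value>
  </property>
</configuration>
EOF
}
```

On each node this would be run as `write_core_site /opt/hadoop-2.2.0/etc/hadoop master /opt/hadoop-2.2.0/tmp`. Every node must point at the same NameNode host.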

    2. hdfs-site.xml
      hduser@solaiv[hadoop]#vi hdfs-site.xml
    #Paste following between <configuration> tag
Note : Here I have only one master and one slave, both running a DataNode, so I set the replication value to 2. If you have more slaves, set the replication value accordingly.
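
One way to pick the replication value is to count the DataNode hosts in the slaves file and cap the result at 3, which is the HDFS default. The sketch below does that and prints the `dfs.replication` property to paste between the `<configuration>` tags. Both function names are hypothetical, and the cap of 3 is my choice rather than a requirement.

```shell
# replication_factor is a hypothetical helper: number of DataNode hosts
# listed in the slaves file (one hostname per line), kept between 1 and 3.
replication_factor() {
    local slaves_file="$1" count
    # Count non-blank lines; "|| true" keeps a zero count from being an error.
    count=$(grep -c '[^[:space:]]' "$slaves_file" || true)
    if [ "$count" -gt 3 ]; then
        count=3
    elif [ "$count" -lt 1 ]; then
        count=1
    fi
    echo "$count"
}

# hdfs_replication_property prints the XML element for hdfs-site.xml.
hdfs_replication_property() {
    printf '<property>\n  <name>dfs.replication</name>\n  <value>%s</value>\n</property>\n' \
        "$(replication_factor "$1")"
}
```

With a slaves file that lists master and slave, `hdfs_replication_property /opt/hadoop-2.2.0/etc/hadoop/slaves` prints a property with the value 2, matching the note above.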
    3. mapred-site.xml
      hduser@solaiv[hadoop]#vi mapred-site.xml
    #Paste following between <configuration> tag
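
For mapred-site.xml the one setting that matters for this setup is telling MapReduce to run on YARN, through the standard `mapreduce.framework.name` property. A sketch follows, with `write_mapred_site` as a hypothetical helper name. It refuses to overwrite an existing file. If your tarball ships only mapred-site.xml.template, the file has to be created anyway, which is what this does.

```shell
# write_mapred_site is a hypothetical helper, not part of Hadoop.
# Usage: write_mapred_site <conf_dir>
write_mapred_site() {
    local target="$1/mapred-site.xml"
    # Never clobber a file that somebody already edited.
    if [ -e "$target" ]; then
        echo "$target already exists, edit it by hand" >&2
        return 1
    fi
    cat > "$target" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
EOF
}
```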

    4. yarn-site.xml
      hduser@solaiv[hadoop]#vi yarn-site.xml
      #Paste following between <configuration> tag
          <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
          <name>yarn.resourcemanager.resource-tracker.address</name>
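
A sketch of the yarn-site.xml properties, printed from the shell so they can be pasted between the `<configuration>` tags. Assumptions: the ResourceManager runs on the host named master, 8031 is the stock resource-tracker port, and `prop` and `yarn_site_properties` are hypothetical helper names. Note the underscore in `mapreduce_shuffle`: as far as I know, release 2.2.0 and later reject aux-service names that contain a dot, which is the spelling older 2.0.x-alpha guides used.

```shell
# prop is a hypothetical helper that prints one <property> element.
prop() {
    printf '  <property>\n    <name>%s</name>\n    <value>%s</value>\n  </property>\n' "$1" "$2"
}

# yarn_site_properties prints the elements for yarn-site.xml.
# $1 is the ResourceManager hostname ("master" in this post).
yarn_site_properties() {
    local rm_host="$1"
    prop yarn.nodemanager.aux-services mapreduce_shuffle
    prop yarn.nodemanager.aux-services.mapreduce_shuffle.class org.apache.hadoop.mapred.ShuffleHandler
    # 8031 is the default port; change it only if you changed it on the master.
    prop yarn.resourcemanager.resource-tracker.address "${rm_host}:8031"
}
```

Run `yarn_site_properties master` and paste the output into the file on every node. The slaves need the resource-tracker address to find the ResourceManager.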

  7. Format the namenode ( only at Master node )
      hduser@boss:/opt/hadoop-2.2.0/bin#cd /opt/hadoop-2.2.0/bin
      hduser@boss:/opt/hadoop-2.2.0/bin# ./hadoop namenode -format

  8. Administering Hadoop - Start & Stop (Only at Master node)
    Just start the processes on the Master; the daemons on the slave nodes start up automatically.
    1. : to start namenode and datanode
      hduser@boss:[~]# cd /opt/hadoop-2.2.0/sbin
      hduser@boss:[sbin]# ./
    check at Master
      17675 Jps
      17578 SecondaryNameNode
      17409 NameNode
    check at Slave
      9317 Jps
      9250 DataNode

    2. : to start resourcemanager and nodemanager
      hduser@boss:[sbin]# ./
    check at Master
      17578 SecondaryNameNode
      17917 ResourceManager
      17409 NameNode
      18153 Jps
    check at Slave
      9317 Jps
      9250 DataNode
      9357 NodeManager
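
Instead of reading the jps listing by eye, a small filter can report which daemons are missing. This is a sketch: `missing_daemons` is a hypothetical helper that reads jps output on stdin and prints every expected daemon name that is not in the listing.

```shell
# missing_daemons is a hypothetical helper, not part of Hadoop or the JDK.
# Usage: jps | missing_daemons <DaemonName>...
missing_daemons() {
    local listing name
    listing="$(cat)"
    for name in "$@"; do
        # jps prints "<pid> <MainClass>"; compare the second column exactly,
        # so NameNode does not match SecondaryNameNode.
        if ! printf '%s\n' "$listing" |
                awk -v n="$name" '$2 == n { found = 1 } END { exit !found }'; then
            echo "$name"
        fi
    done
}
```

On the master you would run `jps | missing_daemons NameNode SecondaryNameNode ResourceManager`, and on a slave `jps | missing_daemons DataNode NodeManager`. No output means everything expected is running.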

  9. Working on Hadoop multi-node environment
    1. Execute these commands at the master
      hduser@boss:/opt/hadoop-2.2.0/bin# ./hdfs dfs -mkdir -p /user/hadoop2
      hduser@boss:/opt/hadoop-2.2.0/bin# ./hdfs dfs -put /root/Desktop/test.html /user/hadoop2
      hduser@boss:/opt/hadoop-2.2.0/bin# ./hdfs dfs -ls
      Found 1 items
      -rw-r--r-- 2 root supergroup 225 2013-11-11 20:19 /user/hadoop2/test.html
    2. check at slave node

      hduser@boss:/opt/hadoop-2.2.0/bin# ./hdfs dfs -ls /user/hadoop2/
      Found 1 items
      -rw-r--r-- 2 root supergroup 225 2013-11-11 20:19 /user/hadoop2/test.html
      hduser@boss:/opt/hadoop-2.2.0/bin# ./hdfs dfs -cat /user/hadoop2/test.html
      test file. Welcome to Hadoop2.2.0 Installation. !!!!!!!!!!!


Karthic said...

Thanks Solai, for the step by step instructions.

Anonymous said...

Awesome post solai.. Please post all the administration activities and how to put large files in hadoop and how it is getting balanced.

Anonymous said...

awesome Karthic ! You helped me a lot and it worked for hadoop 2.3 too.

There were some minor issues that I googled and could fix. I am posting them here for the benefit of others.

#1 - It is related to yarn-site.xml

Here is my new yarn-site.xml:






#3 - Add this to and

export HADOOP_OPTS="$HADOOP_OPTS -Djava.library.path=$HADOOP_HOME/lib"

dataanalytics said...

thanks to all for your comments

Unknown said...

Thanks Solai. I meant awesome solai ! in my previous comment.

Btw, for the password less SSH setup here are the directory permissions:

chmod 750 ~
chmod 700 ~/.ssh
chmod 600 ~/.ssh/*

Where ~ is your hadoop user (hduser) home.

We had a permission to home as 777 and spent hours debugging that issue.

Unknown said...

Instructions to start the mapreduce history server

vishal said...

I have setup a cluster, but my datanodes process are not listed under jps command. But the hdfs dfsadmin -report show me all my datanodes.

Also i have tried deleting the datanode directory and reformatting hdfs, doesnt work.

Do you have any solution to this

dataanalytics said...

Hi Vishal,
In your case, the DataNode did not start gracefully. Kindly check your DataNode logs.

vishal said...

hey thanks for reply
This is my datanode log: Problem binding to [] Address already in use; For more details see:
org.apache.hadoop.hdfs.server.datanode.DataNode: SHUTDOWN_MSG:

Looks like i am not able to bind the port. I looked up the link, it says some one else might be listening to the port. However no other process is using and when i, the java process starts using the ports and when i do, the java process are still listening to the port

vishal said...

Also i tried changing dfs.datanode.address

to 50011,50012,50013 respectively. The new java process binds to the new ports but i can't see the datanode process in the jps command

dataanalytics said...

sry late,

try this, hdfs://

in core-site.xml

vishal said...

hey thank you for reply,
Well i tried changing core-site.xml as you said, no success. i can also run mapreduce but cant see datanode and nodemanager process.

vishal said...

Hey looks like the and are useless here. i wrote my own script for doing it using Works like a charm!!!

Thanks a lot for help

J Shantz said...

Thanks for the instructions. I had one issue with the setup running on AWS. After originally getting it running, I was getting a "Connection Refused" error between the slave and master and the slave was not doing anything. I ultimately determined the cause of this to be in my /etc/hosts file. On the master, make sure the "master" entry in the hosts file is the local IP address and not Otherwise the yarn service is only available locally and the slave cannot connect to it.
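
The check described in this comment can be sketched as a small script. It assumes the problem address is in the loopback range (127.x.x.x), which is what Debian and Ubuntu installers typically write for the machine's own hostname. `loopback_names` is a hypothetical helper that prints every given hostname that the hosts file maps to a loopback address.

```shell
# loopback_names is a hypothetical helper, not a standard command.
# Usage: loopback_names <hosts_file> <hostname>...
loopback_names() {
    local hosts_file="$1" name
    shift
    for name in "$@"; do
        awk -v n="$name" '
            /^[[:space:]]*#/ { next }    # skip comment lines
            $1 ~ /^127\./ {              # loopback entries only
                for (i = 2; i <= NF; i++) if ($i == n) found = 1
            }
            END { exit !found }
        ' "$hosts_file" && echo "$name"
    done
    return 0
}
```

On the master, `loopback_names /etc/hosts master` should print nothing. If it prints master, the daemons bind to loopback only and the slaves get "Connection refused".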

dataanalytics said...

Hi Shantz..,

*) Run "ping slave" from the master. If that works, check the hostname of the slave m/c by issuing the command "hostname".

*) now start only nodemanager from datanode/slave

*) still getting error, share your nodemanager log from slave m/c

Unknown said...

Hi, I have followed the tutorial and completed the setup of the cluster exactly as described. Now I am trying to run the pi example, but it is stuck at the point shown below. Can you guide me on what I am missing here?

hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar pi 2 5
Number of Maps = 2
Samples per Map = 5
14/08/03 06:35:30 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Wrote input for Map #0
Wrote input for Map #1
Starting Job
14/08/03 06:35:31 INFO client.RMProxy: Connecting to ResourceManager at VM-002/
14/08/03 06:35:32 INFO input.FileInputFormat: Total input paths to process : 2
14/08/03 06:35:32 INFO mapreduce.JobSubmitter: number of splits:2
14/08/03 06:35:32 INFO Configuration.deprecation: is deprecated. Instead, use
14/08/03 06:35:32 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
14/08/03 06:35:32 INFO Configuration.deprecation: is deprecated. Instead, use
14/08/03 06:35:32 INFO Configuration.deprecation: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
14/08/03 06:35:32 INFO Configuration.deprecation: mapred.output.value.class is deprecated. Instead, use mapreduce.job.output.value.class
14/08/03 06:35:32 INFO Configuration.deprecation: mapred.reduce.tasks.speculative.execution is deprecated. Instead, use mapreduce.reduce.speculative
14/08/03 06:35:32 INFO Configuration.deprecation: is deprecated. Instead, use
14/08/03 06:35:32 INFO Configuration.deprecation: is deprecated. Instead, use
14/08/03 06:35:32 INFO Configuration.deprecation: mapreduce.reduce.class is deprecated. Instead, use mapreduce.job.reduce.class
14/08/03 06:35:32 INFO Configuration.deprecation: mapreduce.inputformat.class is deprecated. Instead, use mapreduce.job.inputformat.class
14/08/03 06:35:32 INFO Configuration.deprecation: mapred.input.dir is deprecated. Instead, use mapreduce.input.fileinputformat.inputdir
14/08/03 06:35:32 INFO Configuration.deprecation: mapred.output.dir is deprecated. Instead, use mapreduce.output.fileoutputformat.outputdir
14/08/03 06:35:32 INFO Configuration.deprecation: mapreduce.outputformat.class is deprecated. Instead, use mapreduce.job.outputformat.class
14/08/03 06:35:32 INFO Configuration.deprecation: is deprecated. Instead, use mapreduce.job.maps
14/08/03 06:35:32 INFO Configuration.deprecation: mapred.output.key.class is deprecated. Instead, use mapreduce.job.output.key.class
14/08/03 06:35:32 INFO Configuration.deprecation: mapred.working.dir is deprecated. Instead, use mapreduce.job.working.dir
14/08/03 06:35:33 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1407043822091_0001
14/08/03 06:35:33 INFO impl.YarnClientImpl: Submitted application application_1407043822091_0001 to ResourceManager at VM-002/
14/08/03 06:35:33 INFO mapreduce.Job: The url to track the job: http://VM-002:8088/proxy/application_1407043822091_0001/
14/08/03 06:35:33 INFO mapreduce.Job: Running job: job_1407043822091_0001

Unknown said...

I want how to run wordcount program

dataanalytics said...

try this

$HADOOP_PREFIX/bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.1.jar wordcount /INPUT_PATH_LOCAL /OUTPUT_DIR_HDFS

note : hadoop-mapreduce-examples-2.5.1.jar, may vary based on the version you have installed.

Unknown said...

I'm using Hadoop2.2.0.
I just followed the steps you published

dataanalytics said...

this will work.. run the below command on terminal...

$HADOOP_HOME/bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar wordcount /INPUT_PATH_LOCAL /OUTPUT_DIR_HDFS

Unknown said...

hduser@ubuntu:/opt/hadoop-2.2.0$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar wordcount /user/hadoop2/input /user/hadoop2/output

15/03/06 21:30:54 INFO mapreduce.Job: Job job_1425635780130_0009 running in uber mode : false
15/03/06 21:30:54 INFO mapreduce.Job: map 0% reduce 0%
15/03/06 21:30:54 INFO mapreduce.Job: Job job_1425635780130_0009 failed with state FAILED due to: Application application_1425635780130_0009 failed 2 times due to Error launching appattempt_1425635780130_0009_000002. Got exception: Call From ubuntu/ to ubuntu:50507 failed on connection exception: Connection refused; For more details see:
at sun.reflect.GeneratedConstructorAccessor47.newInstance(Unknown Source)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(
at java.lang.reflect.Constructor.newInstance(
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(
at com.sun.proxy.$Proxy22.startContainers(Unknown Source)
at org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.startContainers(
at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.launch(
at java.util.concurrent.ThreadPoolExecutor.runWorker(
at java.util.concurrent.ThreadPoolExecutor$
Caused by: Connection refused
at Method)
at org.apache.hadoop.ipc.Client$Connection.setupConnection(
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(
at org.apache.hadoop.ipc.Client$Connection.access$2600(
at org.apache.hadoop.ipc.Client.getConnection(
... 9 more
. Failing the application.
15/03/06 21:30:54 INFO mapreduce.Job: Counters: 0

dataanalytics said...

1) make sure all the daemons are up and running

2) kindly post your /etc/hosts files on both NN and DN

Unknown said...

All started on master and slaves

Unknown said...

localhost
ubuntu
master
slave1
slave2
slave3
slave4

# The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters

solai said...

Remove/comment first two lines from /etc/hosts. localhost ubuntu
Once done, restart namenode daemon by executing the below command. (To know more refer..

sbin/ start namenode
sbin/ stop namenode

As well as resources manager. By

sbin/ stop resourcemanager

sbun/ start resourcemanager

Then execute the MR wordcount.

Unknown said...

sbin/ stop resourcemanager

sbun/ start resourcemanager
these are not working
And localhost ubuntu

After removing these on master and slaves resourcemanager, nodemanager, datanodes

solai said...

Sry, typo error.. Its

sbin/ stop resourcemanager

sbin/ start resourcemanager

Unknown said...

sbin/ stop resourcemanager

sbin/ start resourcemanager

after doing this resourcemanager is not starting

solai said...

Send me the log file.

Unknown said...

It is not accepting my log file

solai said...

Boss, copy and paste the last 50 lines, or the part near the bottom of the file where you find the error.

Haddad Riadh said...

Hello, thanks for this post. Please, how can I access the Hadoop multi-node cluster remotely? How can I configure HTTP access on the nodes and the master?

solai said...

If you set up multi-node, then you can access the NAME NODE at "masterIP:50070" from any node, even from another system that doesn't have the hadoop distro. Likewise you can access the RESOURCE MANAGER at "masterIP:8088".