Sunday, 25 August 2013

Hadoop Installation on Ubuntu 12.04 (Single-Node Cluster)

Apache Hadoop is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, thereby delivering a highly available service on top of a cluster of computers, each of which may be prone to failures.



In this tutorial we are going to learn how to install a single-node Hadoop cluster on Ubuntu 12.04.

Before getting started with the Hadoop installation, we need to make sure Java is installed on the system. In this tutorial I am going to install Java 7 on my machine; you can also go with Java 6.

Install Oracle Java 7 via the PPA repository. Use the following commands:

$sudo add-apt-repository ppa:webupd8team/java
$sudo apt-get update
$sudo apt-get install oracle-java7-installer
$sudo update-java-alternatives -s java-7-oracle

To check whether Java is installed correctly and which version is installed, type the following command:
$java -version
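
If you want to script this check, a minimal sketch could look like the one below. The function name java_major is made up for this example, and the parsing assumes the first line of the output has the usual form java version "1.7.0_25".

```shell
# Extract the major Java version from a version string such as 1.7.0_25.
# Versions before Java 9 are written as 1.<major>.
java_major() {
    case "$1" in
        1.*) echo "$1" | cut -d. -f2 ;;
        *)   echo "$1" | cut -d. -f1 ;;
    esac
}

if command -v java >/dev/null 2>&1; then
    # java -version prints to stderr, so redirect it before parsing
    version=$(java -version 2>&1 | head -n 1 | cut -d'"' -f2)
    echo "Found Java $version (major version $(java_major "$version"))"
else
    echo "java was not found on the PATH"
fi
```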

To install Hadoop, you can either create a new hadoop user or use your current user. I am going with the second approach, so in this article I will be using the user "shakeel", which is my default and only user in Ubuntu.

Install the SSH server if it is not already present. This is needed because Hadoop's start and stop scripts connect to localhost over SSH to launch the daemons.
$sudo apt-get install openssh-server
$ssh-keygen -t rsa -P ""
$cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys

The final step is to test the SSH setup by connecting to your local machine as the shakeel user. This step is also needed to save your local machine’s host key fingerprint to the shakeel user’s known_hosts file.
$ssh localhost
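
Note that the cat command used earlier appends the public key every time it is run, so re-running the setup leaves duplicate entries in authorized_keys. A small sketch of a repeatable version is shown below; the function name add_key_once is my own invention, and it assumes the key pair has already been created with ssh-keygen.

```shell
# Append a public key to an authorized_keys file only if it is not there yet.
# Both arguments are file paths; the function name is made up for this sketch.
add_key_once() {
    pub_key_file=$1
    auth_file=$2
    touch "$auth_file"
    chmod 600 "$auth_file"    # keep the file private to its owner
    # -x matches whole lines, -F treats the key as a fixed string
    if ! grep -qxF "$(cat "$pub_key_file")" "$auth_file"; then
        cat "$pub_key_file" >> "$auth_file"
    fi
}

# Only act when a public key already exists
if [ -f "$HOME/.ssh/id_rsa.pub" ]; then
    add_key_once "$HOME/.ssh/id_rsa.pub" "$HOME/.ssh/authorized_keys"
fi
```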

Disable IPv6

$sudo gedit /etc/sysctl.conf
Paste the lines below at the end of the file:
#disable ipv6
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1

Normally you need to reboot the system for these settings to take effect, but you can also reload them without rebooting by running the command below:
$sudo sysctl -p

To make sure that IPv6 is disabled, you can run the following command:
$cat /proc/sys/net/ipv6/conf/all/disable_ipv6
The printed value should be 1, which means that IPv6 is disabled.
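
The command above only looks at the "all" flag. If you would like to verify all three settings added to sysctl.conf in one go, something like this sketch should do; the function name ipv6_disabled and its optional directory argument are my own additions, not part of any tool.

```shell
# Return success only if all three disable_ipv6 flags under the given
# directory are set to 1. The directory defaults to the real /proc path.
ipv6_disabled() {
    conf_dir=${1:-/proc/sys/net/ipv6/conf}
    for iface in all default lo; do
        flag_file="$conf_dir/$iface/disable_ipv6"
        [ -r "$flag_file" ] || return 1
        [ "$(cat "$flag_file")" = "1" ] || return 1
    done
    return 0
}

if ipv6_disabled; then
    echo "IPv6 is disabled"
else
    echo "IPv6 is still enabled (or the flags could not be read)"
fi
```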

Hadoop Installation

Now that all the basic settings are in place, we can proceed with the Hadoop installation itself. You can download the Hadoop package from the Apache downloads page: http://www.apache.org/dyn/closer.cgi/hadoop/core.
I downloaded hadoop-1.0.4.tar.gz.

Copy the tar file to your user directory and change into it.
$cd /home/shakeel

Untar all the contents of the tar file. sudo is not needed here, because the files live in your own home directory and should stay owned by your user.
$tar xzf hadoop-1.0.4.tar.gz

To keep things simple, rename hadoop-1.0.4 to hadoop.
$mv hadoop-1.0.4 hadoop

Open .bashrc file
$sudo gedit /home/shakeel/.bashrc

Now add the HADOOP_HOME environment variable to your .bashrc. It should point to the directory where you extracted the contents of hadoop-1.0.4.tar.gz, i.e. hadoop.
export HADOOP_HOME=/home/shakeel/hadoop

Also add the JAVA_HOME environment variable at the end of the .bashrc file
export JAVA_HOME=/usr/lib/jvm/java-7-oracle

Add $HADOOP_HOME/bin to $PATH. This lets you start and stop the Hadoop cluster (run start-all.sh or stop-all.sh) from any directory, without navigating to Hadoop's bin directory first.
export PATH=$PATH:$HADOOP_HOME/bin

Open a new terminal window and check that HADOOP_HOME, JAVA_HOME and PATH are set properly and contain the changes you made:
$echo $HADOOP_HOME
$echo $JAVA_HOME
$echo $PATH
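
Reading a long PATH by eye is error-prone, so here is a small sketch that checks for the entry directly. The helper name path_contains is made up for this example.

```shell
# Succeed if the given directory is one of the colon-separated PATH entries.
path_contains() {
    case ":$PATH:" in
        *":$1:"*) return 0 ;;
        *)        return 1 ;;
    esac
}

if [ -n "${HADOOP_HOME:-}" ] && path_contains "$HADOOP_HOME/bin"; then
    echo "PATH includes $HADOOP_HOME/bin"
else
    echo "HADOOP_HOME is unset or its bin directory is not on the PATH"
fi
```

Wrapping PATH in colons on both sides means the first and last entries are matched the same way as the ones in the middle, and a longer directory that merely starts with the same text is not mistaken for a match.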

Update JAVA_HOME in hadoop-env.sh
$sudo gedit /home/shakeel/hadoop/conf/hadoop-env.sh
Replace # export JAVA_HOME=/usr/lib/j2sdk1.5-sun with
export JAVA_HOME=/usr/lib/jvm/java-7-oracle
Make sure to remove the "#" at the beginning of the line.
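
If you prefer not to edit the file by hand, the same change can probably be made with sed. The sketch below uses a made-up helper called set_java_home and demonstrates it on a scratch file rather than on the real hadoop-env.sh, so try it on a copy first.

```shell
# Replace the JAVA_HOME line (commented out or not) in a hadoop-env.sh style file.
# Usage: set_java_home <file> <jdk-directory>
set_java_home() {
    env_file=$1
    jdk_dir=$2
    tmp_file=$(mktemp)
    # "|" is used as the sed delimiter because the JDK path contains slashes
    sed "s|^#* *export JAVA_HOME=.*|export JAVA_HOME=$jdk_dir|" "$env_file" > "$tmp_file"
    cat "$tmp_file" > "$env_file"   # overwrite the contents, keeping the file's permissions
    rm -f "$tmp_file"
}

# Demonstration on a scratch file rather than the real configuration file
demo_file=$(mktemp)
echo '# export JAVA_HOME=/usr/lib/j2sdk1.5-sun' > "$demo_file"
set_java_home "$demo_file" /usr/lib/jvm/java-7-oracle
cat "$demo_file"
```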

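Create a temporary directory for Hadoop. Do not use sudo here, otherwise the directory is owned by root and Hadoop, running as shakeel, cannot write to it.
$mkdir /home/shakeel/tmp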

Open the core-site.xml and add the following between <configuration> .. </configuration> tags.
$sudo gedit /home/shakeel/hadoop/conf/core-site.xml
<property>
<name>hadoop.tmp.dir</name>
<value>/home/shakeel/tmp</value>
<description>A base for other temporary directories.</description>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:54310</value>
<description>The name of the default file system. A URI whose
scheme and authority determine the FileSystem implementation. The
uri’s scheme determines the config property (fs.SCHEME.impl) naming
the FileSystem implementation class. The uri’s authority is used to
determine the host, port, etc. for a filesystem.</description>
</property>
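
For comparison, this is roughly what the complete file should look like once the two properties are in place. The sketch writes it to a scratch file (out_file is a made-up name, not the real conf path) and leaves out the description elements, which are optional.

```shell
# Write a complete core-site.xml to a scratch location for comparison.
out_file=$(mktemp)
cat > "$out_file" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/shakeel/tmp</value>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:54310</value>
  </property>
</configuration>
EOF

# Count the <name> lines; two properties are expected
grep -c '<name>' "$out_file"
```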

Open the mapred-site.xml and add the following between <configuration> .. </configuration> tags.
$sudo gedit /home/shakeel/hadoop/conf/mapred-site.xml
<property>
  <name>mapred.job.tracker</name>
  <value>localhost:54311</value>
  <description>The host and port that the MapReduce job tracker runs
  at.  If "local", then jobs are run in-process as a single map
  and reduce task.
  </description>
</property>

Open the hdfs-site.xml and add the following between <configuration> .. </configuration> tags.
$sudo gedit /home/shakeel/hadoop/conf/hdfs-site.xml
<property>
  <name>dfs.replication</name>
  <value>1</value>
  <description>Default block replication.
  The actual number of replications can be specified when the file is created.
  The default is used if replication is not specified in create time.
  </description>
</property>

The next step is to format the HDFS filesystem via the NameNode. This simply initializes the directory specified by the dfs.name.dir variable, which corresponds to ${hadoop.tmp.dir}/dfs/name on the local filesystem.
Run this command only once, during the initial installation. Do not run it on a running system, because formatting erases all data currently in HDFS.
$/home/shakeel/hadoop/bin/hadoop namenode -format
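
To avoid formatting by accident, you could guard the command with a check like the one sketched below. It assumes that a formatted NameNode directory contains a current/VERSION file, which is what I see on Hadoop 1.x; the function name namenode_formatted is made up for this example.

```shell
# Succeed if the NameNode directory under the given hadoop.tmp.dir looks
# formatted, i.e. dfs/name/current/VERSION exists (assumption for Hadoop 1.x).
namenode_formatted() {
    [ -f "$1/dfs/name/current/VERSION" ]
}

hadoop_tmp=/home/shakeel/tmp
if namenode_formatted "$hadoop_tmp"; then
    echo "NameNode already formatted; skipping format"
elif command -v hadoop >/dev/null 2>&1; then
    hadoop namenode -format
else
    echo "hadoop command not found on the PATH"
fi
```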

To start the hadoop server navigate to bin directory of hadoop
$cd /home/shakeel/hadoop/bin/

Type in the command
$./start-all.sh
or, if you have appended HADOOP_HOME/bin to the PATH variable, you can use the command below from anywhere:
$start-all.sh
Once all the processes have started, go to the logs directory and check that the logs don't contain any exceptions.
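
Rather than opening each log file, a quick scan with grep can point you to the ones worth reading. This is only a sketch: the function name logs_with_errors is my own, and the search pattern is a guess at what usually indicates trouble.

```shell
# Print the names of log files in a directory that contain ERROR, FATAL
# or Exception lines. Prints nothing if the logs are clean.
logs_with_errors() {
    log_dir=$1
    [ -d "$log_dir" ] || return 0
    grep -lE 'ERROR|FATAL|Exception' "$log_dir"/*.log 2>/dev/null
    return 0
}

logs_with_errors "${HADOOP_HOME:-/home/shakeel/hadoop}/logs"
```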

To check which processes are running, you can type:
$jps

Output should be something like:
3435 NameNode
5645 DataNode
6766 SecondaryNameNode
6788 JobTracker
6567 TaskTracker
3445 Jps
If any of the processes listed above is missing, there was an error while starting the Hadoop cluster. Check the logs to find the cause.
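
The comparison against the expected list can also be automated. The sketch below reads jps output and prints whichever of the five daemons is missing; missing_daemons is a made-up helper name.

```shell
# Read jps output on standard input and print every expected Hadoop daemon
# that is missing from it.
missing_daemons() {
    jps_output=$(cat)
    for daemon in NameNode DataNode SecondaryNameNode JobTracker TaskTracker; do
        # Compare against the whole second column, so that NameNode is not
        # satisfied by SecondaryNameNode
        if ! echo "$jps_output" | awk '{print $2}' | grep -qx "$daemon"; then
            echo "$daemon"
        fi
    done
}

if command -v jps >/dev/null 2>&1; then
    jps | missing_daemons
fi
```

If nothing is printed, all five daemons are running.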

Hadoop comes with several web interfaces, which are by default (see hdfs-default.xml and mapred-default.xml) available at these locations:
http://localhost:50070/  --> web UI of the NameNode daemon
http://localhost:50030/  --> web UI of the JobTracker daemon
http://localhost:50060/  -->  web UI of the TaskTracker daemon
Make sure all these links work; if they do, your single-node Hadoop cluster was installed successfully on your machine.
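
If you are working without a browser, the three pages can be probed from the terminal instead. This sketch assumes curl is installed; check_ui is a made-up helper name.

```shell
# Succeed if the URL answers an HTTP request successfully within a few seconds.
check_ui() {
    command -v curl >/dev/null 2>&1 || return 1
    # -f turns HTTP error responses into a failure exit status
    curl -sf -o /dev/null --max-time 3 "$1"
}

for url in http://localhost:50070/ http://localhost:50030/ http://localhost:50060/; do
    if check_ui "$url"; then
        echo "$url is up"
    else
        echo "$url is not reachable"
    fi
done
```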

To stop the Hadoop server, use the command below:
$./stop-all.sh
or
$stop-all.sh

I hope this has helped you install a single-node Hadoop cluster on Ubuntu.
My next post will be on installing HBase over HDFS.

References:
http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/
http://mysolvedproblem.blogspot.com/2012/05/installing-hadoop-on-ubuntu-linux-on.html