If you are a Software Engineer just getting started with cloud computing platforms, you will want a local instance to learn and experiment with, without having to go down the route of virtualization.
Apache Hadoop is a software framework that supports data-intensive distributed applications under a free license. It enables applications to work with thousands of nodes and petabytes of data. Hadoop was inspired by Google’s MapReduce and Google File System (GFS) papers.
Hadoop is a top-level Apache project being built and used by a global community of contributors, using the Java programming language. Yahoo! has been the largest contributor to the project, and uses Hadoop extensively across its businesses.
In this post, I will describe how to set up a single-node Apache Hadoop cluster on Mac OS X (10.6.8).
- Ensure Java is installed. For me it came pre-installed.
To check if it’s installed, open the terminal and type java -version
java version "1.6.0_33"
Java(TM) SE Runtime Environment (build 1.6.0_33-b03-424-10M3720)
Java HotSpot(TM) 64-Bit Server VM (build 20.8-b03-424, mixed mode)
If you don't have it, you can get it directly from Apple's site here.
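If you are not sure which path to use for JAVA_HOME later in hadoop-env.sh, Mac OS X ships a small helper that prints the home of the active JDK (a quick check; assumes the stock Apple Java install):

# Print the home directory of the default JVM (helper shipped with Mac OS X)
/usr/libexec/java_home

# Confirm the version that will actually be used
java -version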
- Download the Hadoop tar file from here. Unzip it wherever you want, preferably in a non-root folder; that way permission issues can be avoided. My directory was /Users/bharath/Documents/Hadoop/hadoop
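As a rough sketch of that step (the mirror URL and the 0.20.2 tarball name are assumptions; substitute whichever release you actually downloaded):

# Download and unpack Hadoop into a non-root folder (paths and version are examples)
cd /Users/bharath/Documents/Hadoop
curl -O http://archive.apache.org/dist/hadoop/core/hadoop-0.20.2/hadoop-0.20.2.tar.gz
tar -xzf hadoop-0.20.2.tar.gz
mv hadoop-0.20.2 hadoop    # so the path matches /Users/bharath/Documents/Hadoop/hadoop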
- Now cd into the conf directory in the hadoop folder. Modify hadoop-env.sh like this:
# The java implementation to use. Required.
export JAVA_HOME=/System/Library/Java/JavaVirtualMachines/1.6.0.jdk/Contents/Home

# The maximum amount of heap to use, in MB. Default is 1000.
export HADOOP_HEAPSIZE=2000
- Modify hdfs-site.xml, core-site.xml and mapred-site.xml under conf as follows
hdfs-site.xml:

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>/Users/bharath/Documents/Hadoop/hadoop/dfs/name</value>
  </property>
</configuration>
core-site.xml:

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/Users/bharath/Documents/Hadoop/hadoop/tmp</value>
  </property>
</configuration>
mapred-site.xml:

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>
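If the NameNode format step later complains about missing or non-writable directories, creating the dfs.name.dir and hadoop.tmp.dir paths up front avoids it (a small sketch using the paths from the configs above; adjust to your own layout):

# Create the directories referenced in hdfs-site.xml and core-site.xml
mkdir -p /Users/bharath/Documents/Hadoop/hadoop/dfs/name
mkdir -p /Users/bharath/Documents/Hadoop/hadoop/tmp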
- Next, set up ssh on your Mac
Make sure Remote Login is turned on. To check this, open System Preferences, and under Internet & Wireless, open Sharing. Make sure Remote Login is checked.
We need to prepare a password-less login into localhost.
First, type ssh localhost in the terminal. If it asks for a password, follow the steps below; otherwise you are good to go.
In the terminal, type
ssh-keygen -t rsa -P ""
This will generate a key pair. In Mac OS, when run as root, the key is stored in /var/root/.ssh, under root's home directory.
Log in as root (type sudo su in the terminal)
Then type, cd /var/root/.ssh
Next type ls
The generated public key will be id_rsa.pub. We need to append this key to authorized_keys so ssh will accept it for password-less login.
To copy the key file, use the command:
cat /var/root/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
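If ssh localhost still prompts for a password for your own account, the more common per-user setup (run as your normal user, not root; a sketch rather than part of the original steps) looks like this:

# Generate a key pair for your own user (skip if ~/.ssh/id_rsa already exists)
ssh-keygen -t rsa -P ""

# Authorize that key for logins to this machine and lock down permissions
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys

# Verify: this should now log in without asking for a password
ssh localhost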
- Setting up HDFS for the first time
cd into HADOOP_HOME
bin/hadoop namenode -format
The output should be something like below. You should see a statement “…. successfully formatted”
12/09/19 15:44:53 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = Bharath-Kumar-Reddys-MacBook-Air.local/172.20.10.2
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 0.20.2+737
STARTUP_MSG:   build = git://ubuntu64-build01.sf.cloudera.com/ on branch  -r 98c55c28258aa6f42250569bd7fa431ac657bdbd; compiled by 'root' on Tue Dec 14 11:50:19 PST 2010
************************************************************/
12/09/19 15:44:54 INFO namenode.FSNamesystem: fsOwner=bharath
12/09/19 15:44:54 INFO namenode.FSNamesystem: supergroup=supergroup
12/09/19 15:44:54 INFO namenode.FSNamesystem: isPermissionEnabled=true
12/09/19 15:44:54 INFO namenode.FSNamesystem: isAccessTokenEnabled=false accessKeyUpdateInterval=0 min(s), accessTokenLifetime=0 min(s)
12/09/19 15:44:54 INFO common.Storage: Image file of size 113 saved in 0 seconds.
12/09/19 15:44:54 INFO common.Storage: Storage directory /Users/bharath/Documents/Hadoop/hadoop/dfs/name has been successfully formatted.
12/09/19 15:44:54 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at Bharath-Kumar-Reddys-MacBook-Air.local/172.20.10.2
- Do ssh into localhost (ssh localhost)
- Start the Hadoop daemons with bin/start-all.sh (run from HADOOP_HOME)
The output should be like this:
starting namenode, logging to /Users/bharath/Documents/Hadoop/hadoop/bin/../logs/hadoop-bharath-namenode-Bharath-Kumar-Reddys-MacBook-Air.local.out
localhost: starting datanode, logging to /Users/bharath/Documents/Hadoop/hadoop/bin/../logs/hadoop-bharath-datanode-Bharath-Kumar-Reddys-MacBook-Air.local.out
localhost: starting secondarynamenode, logging to /Users/bharath/Documents/Hadoop/hadoop/bin/../logs/hadoop-bharath-secondarynamenode-Bharath-Kumar-Reddys-MacBook-Air.local.out
starting jobtracker, logging to /Users/bharath/Documents/Hadoop/hadoop/bin/../logs/hadoop-bharath-jobtracker-Bharath-Kumar-Reddys-MacBook-Air.local.out
localhost: starting tasktracker, logging to /Users/bharath/Documents/Hadoop/hadoop/bin/../logs/hadoop-bharath-tasktracker-Bharath-Kumar-Reddys-MacBook-Air.local.out
- Test to see if all the nodes are running by typing jps in the terminal
The output of the above command:
2490 Jps
2206 TaskTracker
2071 SecondaryNameNode
1919 NameNode
2130 JobTracker
1995 DataNode
So, all the nodes are up and running. 🙂
To see a list of ports opened, use the command
lsof -i | grep LISTEN
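You can also sanity-check the daemons from a browser: in Hadoop 0.20.x the NameNode and JobTracker serve status pages on their default ports (assuming you have not overridden them in the configs):

# NameNode web UI (HDFS health, live datanodes)
open http://localhost:50070

# JobTracker web UI (running and completed MapReduce jobs)
open http://localhost:50030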
- Test the examples
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-examples-*.jar
The output:
An example program must be given as the first argument. Valid program names are:
  aggregatewordcount: An Aggregate based map/reduce program that counts the words in the input files.
  aggregatewordhist: An Aggregate based map/reduce program that computes the histogram of the words in the input files.
  dbcount: An example job that count the pageview counts from a database.
  grep: A map/reduce program that counts the matches of a regex in the input.
  join: A job that effects a join over sorted, equally partitioned datasets
  multifilewc: A job that counts words from several files.
  pentomino: A map/reduce tile laying program to find solutions to pentomino problems.
  pi: A map/reduce program that estimates Pi using monte-carlo method.
  randomtextwriter: A map/reduce program that writes 10GB of random textual data per node.
  randomwriter: A map/reduce program that writes 10GB of random data per node.
  secondarysort: An example defining a secondary sort to the reduce.
  sleep: A job that sleeps at each map and reduce task.
  sort: A map/reduce program that sorts the data written by the random writer.
  sudoku: A sudoku solver.
  teragen: Generate data for the terasort
  terasort: Run the terasort
  teravalidate: Checking results of terasort
  wordcount: A map/reduce program that counts the words in the input files.
- Run pi example
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-examples-*.jar pi 10 100
Last few lines of output:
.....................................................
12/09/19 16:14:20 INFO mapred.JobClient:     Reduce output records=0
12/09/19 16:14:20 INFO mapred.JobClient:     Spilled Records=40
12/09/19 16:14:20 INFO mapred.JobClient:     Map output bytes=180
12/09/19 16:14:20 INFO mapred.JobClient:     Map input bytes=240
12/09/19 16:14:20 INFO mapred.JobClient:     Combine input records=0
12/09/19 16:14:20 INFO mapred.JobClient:     Map output records=20
12/09/19 16:14:20 INFO mapred.JobClient:     SPLIT_RAW_BYTES=1240
12/09/19 16:14:20 INFO mapred.JobClient:     Reduce input records=20
Job Finished in 76.438 seconds
Estimated value of Pi is 3.14800000000000000000
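If you want to try a job that actually reads from HDFS, a wordcount run looks roughly like this (the input and output paths are illustrative, not from the original post):

# Copy some local text into HDFS (paths here are just examples)
$HADOOP_HOME/bin/hadoop fs -mkdir /user/bharath/input
$HADOOP_HOME/bin/hadoop fs -put $HADOOP_HOME/README.txt /user/bharath/input

# Run the wordcount example and inspect the result
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-examples-*.jar wordcount /user/bharath/input /user/bharath/output
$HADOOP_HOME/bin/hadoop fs -cat /user/bharath/output/part*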
- To stop the nodes, run bin/stop-all.sh from HADOOP_HOME
Done and done!
If you get an error like: java.io.IOException: Tmp directory hdfs://localhost:9000/user/bharath/PiEstimator_TMP_3_141592654 already exists. Please remove it first.
Then in the terminal type,
$HADOOP_HOME/bin/hadoop fs -rmr hdfs://localhost:9000/user/bharath/PiEstimator_TMP_3_141592654
- http://www.thegeekstuff.com/2012/02/hadoop-pseudo-distributed-installation/ (Good reference for troubleshooting)