Setting Up Single-Node Hadoop Cluster on a Mac

If you are a software engineer just getting started with a cloud computing platform, you will want a local instance to learn and experiment on without having to go down the route of virtualization.

Apache Hadoop is a software framework that supports data-intensive distributed applications under a free license. It enables applications to work with thousands of nodes and petabytes of data. Hadoop was inspired by Google’s MapReduce and Google File System (GFS) papers.

Hadoop is a top-level Apache project, written in Java and built and used by a global community of contributors. Yahoo! has been the largest contributor to the project, and uses Hadoop extensively across its businesses.

In this post, I will describe how to set up a single-node Apache Hadoop cluster on Mac OS X (10.6.8).

  1. Ensure Java is installed. For me it was pre-installed.
    To check if it’s installed, open the terminal and type java -version
    Terminal output:

    java version "1.6.0_33"
    Java(TM) SE Runtime Environment (build 1.6.0_33-b03-424-10M3720)
    Java HotSpot(TM) 64-Bit Server VM (build 20.8-b03-424, mixed mode)

    If you don’t have it, you can get it directly from Apple’s site here

  2. Download the Hadoop tar file from here and unpack it wherever you want, preferably in a non-root folder so that permission issues are avoided. My directory was /Users/bharath/Documents/Hadoop/hadoop:

    export HADOOP_HOME=/Users/bharath/Documents/Hadoop/hadoop
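
    For reference, the unpack step looks roughly like this (the 0.20.x file name is illustrative; use whichever tarball you downloaded):

    cd /Users/bharath/Documents/Hadoop
    tar -xzf hadoop-0.20.2.tar.gz      # extract the downloaded tarball
    mv hadoop-0.20.2 hadoop            # rename to a version-free path
    echo 'export HADOOP_HOME=/Users/bharath/Documents/Hadoop/hadoop' >> ~/.profile   # keep the variable across new terminals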
  3. Now cd into the conf directory in the Hadoop folder and modify hadoop-env.sh like this:
    # The java implementation to use. Required.
    export JAVA_HOME=/System/Library/Java/JavaVirtualMachines/1.6.0.jdk/Contents/Home
    # The maximum amount of heap to use, in MB. Default is 1000.
    export HADOOP_HEAPSIZE=2000
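
    If your JDK lives somewhere other than the path above, the java_home helper that ships with OS X prints the value to use for JAVA_HOME:

    /usr/libexec/java_home   # prints the default JDK path, e.g. /System/Library/Java/JavaVirtualMachines/1.6.0.jdk/Contents/Home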
  4. Modify hdfs-site.xml, core-site.xml, and mapred-site.xml under conf:
    hdfs-site.xml

     <configuration>
       <property>
         <name>dfs.replication</name>
         <value>1</value>
       </property>
       <property>
         <name>dfs.name.dir</name>
         <value>/Users/bharath/Documents/Hadoop/hadoop/dfs/name</value>
       </property>
     </configuration>

    core-site.xml

     <configuration>
       <property>
         <name>fs.default.name</name>
         <value>hdfs://localhost:9000</value>
       </property>
       <property>
         <name>hadoop.tmp.dir</name>
         <value>/Users/bharath/Documents/Hadoop/hadoop/tmp</value>
       </property>
     </configuration>

    mapred-site.xml

     <configuration>
       <property>
         <name>mapred.job.tracker</name>
         <value>localhost:9001</value>
       </property>
     </configuration>
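
    Hadoop will generally create these directories on its own, but creating them up front as the user who will run the daemons avoids permission surprises:

     mkdir -p /Users/bharath/Documents/Hadoop/hadoop/dfs/name   # dfs.name.dir from hdfs-site.xml
     mkdir -p /Users/bharath/Documents/Hadoop/hadoop/tmp        # hadoop.tmp.dir from core-site.xml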
  5. Next, set up SSH on your Mac.
    Make sure Remote Login is turned on. To check this, open System Preferences and, under Internet & Wireless, open Sharing. Make sure Remote Login is checked.
    We need to set up password-less login to localhost.
    First type ssh localhost in the terminal. If it asks for a password, follow the steps below; otherwise you are good to go.
    In the terminal, type

    ssh-keygen -t rsa -P ""

    This will generate an RSA key pair with an empty passphrase. The keys are written to the .ssh directory under your home directory (~/.ssh for a normal user; /var/root/.ssh if you run the command as root).
    Type cd ~/.ssh and then ls to see them.
    The public key is id_rsa.pub. For password-less login we need to append it to authorized_keys (not known_hosts, which only records host fingerprints).
    To copy the key file, use the command:

    cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
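
    If ssh localhost still prompts for a password, the usual culprit is over-permissive modes on the .ssh directory; tightening them is a safe extra step:

    chmod 700 $HOME/.ssh
    chmod 600 $HOME/.ssh/authorized_keys
    ssh localhost   # should now log in without asking for a password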
  6. Setting up HDFS for the first time
    cd into HADOOP_HOME
    Type

    bin/hadoop namenode -format

    The output should be something like below. You should see a statement “…. successfully formatted”

    12/09/19 15:44:53 INFO namenode.NameNode: STARTUP_MSG:
      /************************************************************
      STARTUP_MSG: Starting NameNode
      STARTUP_MSG: host = Bharath-Kumar-Reddys-MacBook-Air.local/172.20.10.2
      STARTUP_MSG: args = [-format]
      STARTUP_MSG: version = 0.20.2+737
      STARTUP_MSG: build = git://ubuntu64-build01.sf.cloudera.com/ on branch -r 98c55c28258aa6f42250569bd7fa431ac657bdbd; compiled by 'root' on Tue Dec 14 11:50:19 PST 2010
      ************************************************************/
      12/09/19 15:44:54 INFO namenode.FSNamesystem: fsOwner=bharath
      12/09/19 15:44:54 INFO namenode.FSNamesystem: supergroup=supergroup
      12/09/19 15:44:54 INFO namenode.FSNamesystem: isPermissionEnabled=true
      12/09/19 15:44:54 INFO namenode.FSNamesystem: isAccessTokenEnabled=false accessKeyUpdateInterval=0 min(s), accessTokenLifetime=0 min(s)
      12/09/19 15:44:54 INFO common.Storage: Image file of size 113 saved in 0 seconds.
      12/09/19 15:44:54 INFO common.Storage: Storage directory /Users/bharath/Documents/Hadoop/hadoop/dfs/name has been successfully formatted.
      12/09/19 15:44:54 INFO namenode.NameNode: SHUTDOWN_MSG:
      /************************************************************
      SHUTDOWN_MSG: Shutting down NameNode at Bharath-Kumar-Reddys-MacBook-Air.local/172.20.10.2
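
    To sanity-check the format, you can peek inside the directory configured as dfs.name.dir; a freshly formatted namenode keeps its metadata under current/ (the file names below are what this 0.20.x release writes):

    ls /Users/bharath/Documents/Hadoop/hadoop/dfs/name/current   # should list fsimage, edits, fstime and VERSION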
    
  7. SSH into localhost
    ssh localhost
  8. Start the Hadoop daemons
    $HADOOP_HOME/bin/start-all.sh

     The output should be like this:

    starting namenode, logging to /Users/bharath/Documents/Hadoop/hadoop/bin/../logs/hadoop-bharath-namenode-Bharath-Kumar-Reddys-MacBook-Air.local.out
    localhost: starting datanode, logging to /Users/bharath/Documents/Hadoop/hadoop/bin/../logs/hadoop-bharath-datanode-Bharath-Kumar-Reddys-MacBook-Air.local.out
    localhost: starting secondarynamenode, logging to /Users/bharath/Documents/Hadoop/hadoop/bin/../logs/hadoop-bharath-secondarynamenode-Bharath-Kumar-Reddys-MacBook-Air.local.out
    starting jobtracker, logging to /Users/bharath/Documents/Hadoop/hadoop/bin/../logs/hadoop-bharath-jobtracker-Bharath-Kumar-Reddys-MacBook-Air.local.out
    localhost: starting tasktracker, logging to /Users/bharath/Documents/Hadoop/hadoop/bin/../logs/hadoop-bharath-tasktracker-Bharath-Kumar-Reddys-MacBook-Air.local.out
  9. Test to see if all the nodes are running
    $JAVA_HOME/bin/jps

    The output of the above command:

    2490 Jps
    2206 TaskTracker
    2071 SecondaryNameNode
    1919 NameNode
    2130 JobTracker
    1995 DataNode

    So, all the nodes are up and running. 🙂
    To see a list of open (listening) ports, use the command

    lsof -i | grep LISTEN 
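
    The daemons also expose web UIs on their default ports, which is another quick way to check on them (these are the stock ports for this Hadoop version):

    open http://localhost:50070   # NameNode / HDFS status page
    open http://localhost:50030   # JobTracker / MapReduce status page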
  10. Test the examples
    $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-examples-*.jar

    The output :

    An example program must be given as the first argument.
    Valid program names are:
    aggregatewordcount: An Aggregate based map/reduce program that counts the words in the input files.
    aggregatewordhist: An Aggregate based map/reduce program that computes the histogram of the words in the input files.
    dbcount: An example job that count the pageview counts from a database.
    grep: A map/reduce program that counts the matches of a regex in the input.
    join: A job that effects a join over sorted, equally partitioned datasets
    multifilewc: A job that counts words from several files.
    pentomino: A map/reduce tile laying program to find solutions to pentomino problems.
    pi: A map/reduce program that estimates Pi using monte-carlo method.
    randomtextwriter: A map/reduce program that writes 10GB of random textual data per node.
    randomwriter: A map/reduce program that writes 10GB of random data per node.
    secondarysort: An example defining a secondary sort to the reduce.
    sleep: A job that sleeps at each map and reduce task.
    sort: A map/reduce program that sorts the data written by the random writer.
    sudoku: A sudoku solver.
    teragen: Generate data for the terasort
    terasort: Run the terasort
    teravalidate: Checking results of terasort
    wordcount: A map/reduce program that counts the words in the input files.
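
    Running one of these end to end is a quick smoke test of HDFS and MapReduce together. A minimal wordcount run might look like this (the local file and HDFS paths are purely illustrative):

    echo "hello hadoop hello world" > /tmp/sample.txt
    $HADOOP_HOME/bin/hadoop fs -mkdir wc-input                    # relative paths land under /user/bharath in HDFS
    $HADOOP_HOME/bin/hadoop fs -put /tmp/sample.txt wc-input/
    $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-examples-*.jar wordcount wc-input wc-output
    $HADOOP_HOME/bin/hadoop fs -cat 'wc-output/part-*'            # prints each word with its count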
  11. Run the pi example
    $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-examples-*.jar pi 10 100

    Last few lines of output:

    .....................................................
     12/09/19 16:14:20 INFO mapred.JobClient: Reduce output records=0
     12/09/19 16:14:20 INFO mapred.JobClient: Spilled Records=40
     12/09/19 16:14:20 INFO mapred.JobClient: Map output bytes=180
     12/09/19 16:14:20 INFO mapred.JobClient: Map input bytes=240
     12/09/19 16:14:20 INFO mapred.JobClient: Combine input records=0
     12/09/19 16:14:20 INFO mapred.JobClient: Map output records=20
     12/09/19 16:14:20 INFO mapred.JobClient: SPLIT_RAW_BYTES=1240
     12/09/19 16:14:20 INFO mapred.JobClient: Reduce input records=20
     Job Finished in 76.438 seconds
     Estimated value of Pi is 3.14800000000000000000
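
     The estimate gets closer to pi as you raise the number of samples per map (the second argument); for example:

     $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-examples-*.jar pi 10 1000000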
  12. To stop the nodes, type
    $HADOOP_HOME/bin/stop-all.sh
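
    After the script finishes, jps should no longer list any Hadoop daemons; the HDFS data itself stays on disk and will still be there the next time you run start-all.sh.

    $JAVA_HOME/bin/jps   # should now show only the Jps process itself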

Done and done!

If you get an error like: java.io.IOException: Tmp directory hdfs://localhost:9000/user/bharath/PiEstimator_TMP_3_141592654 already exists. Please remove it first.

Then in the terminal type,

$HADOOP_HOME/bin/hadoop fs -rmr hdfs://localhost:9000/user/bharath/PiEstimator_TMP_3_141592654 

References:

  1. http://www.chaceliang.com/blog/study/03-how-to-setup-hadoop-at-ur-macbook-pro/
  2. http://wiki.apache.org/hadoop/Running_Hadoop_On_OS_X_10.5_64-bit_(Single-Node_Cluster)
  3. http://www.thegeekstuff.com/2012/02/hadoop-pseudo-distributed-installation/ (Good reference for troubleshooting)

2 responses to “Setting Up Single-Node Hadoop Cluster on a Mac”

  1. Thanks for the tutorial. I had to change the following to get it to work:

    1) I didn’t do sudo su, instead I did the following: cat /Users/my_user/.ssh/id_rsa.pub >> /Users/my_user/.ssh/authorized_keys and then chmod -R go-rwx /Users/my_user/.ssh

    2) Add the following to $HADOOP_HOME/conf/hadoop-env.sh: export HADOOP_OPTS="-Djava.security.krb5.realm=OX.AC.UK -Djava.security.krb5.kdc=kdc0.ox.ac.uk:kdc1.ox.ac.uk"
