Running Hadoop Java and C++ Word Count example on Raspberry Pi

I hear from various blog posts that Hadoop is incredibly slow on the Pi, and yes, please lower your expectations: the speed really is appalling. But it is very interesting to see just how slow it can be.

This post assumes you already have Hadoop installed and configured on your Pi. Before we start, we need to increase the swap file size if your Pi is the 256 MB version; otherwise it will run out of memory.

1. Increase the swap file size (I stole this from David’s post)

hduser@raspberrypi ~ $ sudo pico /etc/dphys-swapfile
change CONF_SWAPSIZE to 500 (MB)
hduser@raspberrypi ~ $ sudo dphys-swapfile setup
hduser@raspberrypi ~ $ sudo reboot
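
After the reboot it is worth confirming that the bigger swap actually took effect; a quick sanity check (the exact numbers will vary):

hduser@raspberrypi ~ $ grep CONF_SWAPSIZE /etc/dphys-swapfile
CONF_SWAPSIZE=500
hduser@raspberrypi ~ $ free -m | grep Swap
Swap:          499          0        499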

2. Download the example file

Go to http://www.gutenberg.org/ebooks/20417 and download the plain-text e-book. Assuming you have downloaded the file to your home directory, we then copy it to HDFS.
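
If you'd rather fetch the book from the command line, something along these lines should work (the cache URL layout is an assumption on my part; adjust it if Project Gutenberg has moved things):

hduser@raspberrypi ~ $ wget http://www.gutenberg.org/cache/epub/20417/pg20417.txt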

hduser@raspberrypi ~ $ start-all.sh
hduser@raspberrypi ~ $ hadoop dfs -copyFromLocal pg20417.txt /user/hduser/wordcount/pg20417.txt 

You can then check that the file is there, much like the ls command:

hduser@raspberrypi ~ $ hadoop dfs -ls /user/hduser/wordcount
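
You should see something along these lines (size and timestamp will differ on your system):

Found 1 items
-rw-r--r--   1 hduser supergroup     674566 2013-05-10 17:12 /user/hduser/wordcount/pg20417.txt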

3. Run the example Java wordcount

hduser@raspberrypi ~ $ hadoop jar /usr/local/hadoop/hadoop-examples-1.1.2.jar wordcount /user/hduser/wordcount /user/hduser/wordcount-output

Now, be patient! It will take approximately 8 minutes to complete…

4. Check execution result

hduser@raspberrypi ~ $ hadoop dfs -cat /user/hduser/wordcount-output/part-r-00000
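
If you want the result on the local filesystem (or the job used more than one reducer), -getmerge collects all the part files into a single local file:

hduser@raspberrypi ~ $ hadoop dfs -getmerge /user/hduser/wordcount-output ~/wordcount-output.txt
hduser@raspberrypi ~ $ head ~/wordcount-output.txt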

5. C++ wordcount example

Getting Hadoop Pipes to run on the Pi needs a little more effort (hacking?), as we will need to build some Pi-compatible libraries. In particular, we'll want libhdfs, libhadooppipes, and libhadooputils.

Let’s get the build environment ready first.

hduser@raspberrypi ~ $ sudo apt-get install libssl-dev

Go to /usr/local/hadoop/src/c++/libhdfs/ and edit the configure script so that it runs without errors.

In the configure script, find and comment out the following two lines:

as_fn_error $? "Unsupported CPU architecture \"$host_cpu\"" "$LINENO" 5;;

and

define size_t unsigned int

That is all the hacking we need to do. Next,

hduser@raspberrypi /usr/local/hadoop/src/c++/libhdfs $ ./configure --prefix=/usr/local/hadoop/c++/Linux-i386-32
hduser@raspberrypi /usr/local/hadoop/src/c++/libhdfs $ make
hduser@raspberrypi /usr/local/hadoop/src/c++/libhdfs $ make install

We're almost done; just do the same for pipes and utils, as sketched below. Once that's finished you'll have Pi-compatible libraries, and you can build wordcount.cpp with the Makefile given further down.
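
For completeness, here is a sketch of what "the same" looks like for pipes and utils, assuming the stock Hadoop 1.1.2 source layout and that you have applied the same configure edits where needed:

hduser@raspberrypi /usr/local/hadoop/src/c++/pipes $ ./configure --prefix=/usr/local/hadoop/c++/Linux-i386-32
hduser@raspberrypi /usr/local/hadoop/src/c++/pipes $ make && make install
hduser@raspberrypi /usr/local/hadoop/src/c++/utils $ ./configure --prefix=/usr/local/hadoop/c++/Linux-i386-32
hduser@raspberrypi /usr/local/hadoop/src/c++/utils $ make && make install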

wordcount.cpp

#include <algorithm>
#include <limits>
#include <string>
#include <vector>

#include "stdint.h"  // <--- to prevent uint64_t errors!

#include "hadoop/Pipes.hh"
#include "hadoop/TemplateFactory.hh"
#include "hadoop/StringUtils.hh"

using namespace std;

class WordCountMapper : public HadoopPipes::Mapper {
public:
  // constructor: does nothing
  WordCountMapper( HadoopPipes::TaskContext& context ) {
  }

  // map function: receives a line, outputs (word,"1")
  // to reducer.
  void map( HadoopPipes::MapContext& context ) {
    //--- get line of text ---
    string line = context.getInputValue();

    //--- split it into words ---
    vector< string > words =
      HadoopUtils::splitString( line, " " );

    //--- emit each word tuple (word, "1" ) ---
    for ( unsigned int i=0; i < words.size(); i++ ) {
      context.emit( words[i], HadoopUtils::toString( 1 ) );
    }
  }
};

class WordCountReducer : public HadoopPipes::Reducer {
public:
  // constructor: does nothing
  WordCountReducer(HadoopPipes::TaskContext& context) {
  }

  // reduce function
  void reduce( HadoopPipes::ReduceContext& context ) {
    int count = 0;

    //--- get all tuples with the same key, and count their numbers ---
    while ( context.nextValue() ) {
      count += HadoopUtils::toInt( context.getInputValue() );
    }

    //--- emit (word, count) ---
    context.emit(context.getInputKey(), HadoopUtils::toString( count ));
  }
};

int main(int argc, char *argv[]) {
  return HadoopPipes::runTask(
      HadoopPipes::TemplateFactory<WordCountMapper, WordCountReducer>());
}

Makefile

CC = g++
HADOOP_INSTALL = /usr/local/hadoop
PLATFORM = Linux-i386-32
CPPFLAGS =  -I$(HADOOP_INSTALL)/c++/$(PLATFORM)/include

wordcount: wordcount.cpp
	$(CC) $(CPPFLAGS) $< -Wall -L$(HADOOP_INSTALL)/c++/$(PLATFORM)/lib -lhadooppipes \
	-lhadooputils -lpthread -lcrypto -lssl -g -O2 -o $@
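
Once the binary builds, you can ship it to HDFS and run it through Hadoop Pipes. This is a sketch; the HDFS paths are my own choices, so adapt them to your layout:

hduser@raspberrypi ~ $ make wordcount
hduser@raspberrypi ~ $ hadoop dfs -copyFromLocal wordcount /user/hduser/bin/wordcount
hduser@raspberrypi ~ $ hadoop pipes \
    -D hadoop.pipes.java.recordreader=true \
    -D hadoop.pipes.java.recordwriter=true \
    -program /user/hduser/bin/wordcount \
    -input /user/hduser/wordcount \
    -output /user/hduser/wordcount-cpp-output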

Remark: on my 256 MB ver. B Pi, the C++ wordcount takes about 10 minutes to finish.

References:

[1] http://cs.smith.edu/dftwiki/index.php/Hadoop_Tutorial_2.2_--_Running_C%2B%2B_Programs_on_Hadoop

[2] http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/#Copy_local_example_data_to_HDFS

Getting hadoop to run on the Raspberry Pi

Hadoop is implemented in Java, so getting it to run on the Pi is almost as easy as doing so on x86 servers. First of all, we need a JVM for the Pi. You can get either OpenJDK or Oracle's JDK 8 for ARM Early Access. I would personally recommend JDK 8, as it is just slightly faster than OpenJDK, although OpenJDK is easier to install.

1. Install Java

Installing OpenJDK is easy; just run the following and wait:

pi@raspberrypi ~ $ sudo apt-get install openjdk-7-jdk
pi@raspberrypi ~ $ java -version
java version "1.7.0_07"
OpenJDK Runtime Environment (IcedTea7 2.3.2) (7u7-2.3.2a-1+rpi1)
OpenJDK Zero VM (build 22.0-b10, mixed mode)

Alternatively, you can install Oracle's JDK 8 for ARM Early Access (some say it is optimized for the Pi).
First get it from here: https://jdk8.java.net/fxarmpreview/index.html

pi@raspberrypi ~ $ sudo tar zxvf jdk-8-ea-b36e-linux-arm-hflt-*.tar.gz -C /opt
pi@raspberrypi ~ $ sudo update-alternatives --install "/usr/bin/java" \
    "java" "/opt/jdk1.8.0/bin/java" 1
pi@raspberrypi ~ $ java -version
java version "1.8.0-ea"
Java(TM) SE Runtime Environment (build 1.8.0-ea-b36e)
Java HotSpot(TM) Client VM (build 25.0-b04, mixed mode)

If you have both versions installed, you can switch between them with

sudo update-alternatives --config java

2. Create a hadoop system user

pi@raspberrypi ~ $ sudo addgroup hadoop
pi@raspberrypi ~ $ sudo adduser --ingroup hadoop hduser
pi@raspberrypi ~ $ sudo adduser hduser sudo

3. Setup SSH

pi@raspberrypi ~ $ su - hduser
hduser@raspberrypi ~ $ ssh-keygen -t rsa -P ""

This creates an RSA key pair with an empty password, so that Hadoop will not prompt for a passphrase when it talks to its nodes.

hduser@raspberrypi ~ $ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

Now SSH access to your local machine is enabled with this newly created key:

hduser@raspberrypi ~ $ ssh localhost

You should be able to log in without a password.

4. Download (install?) Hadoop
Download hadoop from http://www.apache.org/dyn/closer.cgi/hadoop/core

hduser@raspberrypi ~ $ wget http://mirror.catn.com/pub/apache/hadoop/core/hadoop-1.1.2/hadoop-1.1.2.tar.gz
hduser@raspberrypi ~ $ sudo tar vxzf hadoop-1.1.2.tar.gz -C /usr/local
hduser@raspberrypi ~ $ cd /usr/local
hduser@raspberrypi /usr/local $ sudo mv hadoop-1.1.2 hadoop
hduser@raspberrypi /usr/local $ sudo chown -R hduser:hadoop hadoop

Now Hadoop has been installed and is ready to roll (well, not quite yet). Edit .bashrc in your home directory and append the following lines:

export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-armhf
export HADOOP_INSTALL=/usr/local/hadoop
export PATH=$PATH:$HADOOP_INSTALL/bin

Modify JAVA_HOME accordingly if you use Oracle's version (e.g. /opt/jdk1.8.0).
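
If you're not sure where your JVM actually lives, this one-liner usually reveals it (the path below is from my OpenJDK install; yours may differ):

hduser@raspberrypi ~ $ readlink -f $(which java)
/usr/lib/jvm/java-7-openjdk-armhf/jre/bin/java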

Reboot Pi and verify the installation:

hduser@raspberrypi ~ $ hadoop version
Hadoop 1.1.2
Subversion https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.1 -r 1440782
Compiled by hortonfo on Thu Jan 31 02:03:24 UTC 2013
From source with checksum c720ddcf4b926991de7467d253a79b8b

5. Configure Hadoop
NOTE: this how-to is only a minimal configuration for single-node Hadoop.

The configuration files live in /usr/local/hadoop/conf/, and we will need to edit core-site.xml, hdfs-site.xml, and mapred-site.xml.

core-site.xml

<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/fs/hadoop/tmp</value>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:54310</value>
  </property>
</configuration>

mapred-site.xml

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:54311</value>
  </property>
</configuration>

hdfs-site.xml

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

OK, we’re almost done, one last step.

hduser@raspberrypi ~ $ sudo mkdir -p /fs/hadoop/tmp
hduser@raspberrypi ~ $ sudo chown hduser:hadoop /fs/hadoop/tmp
hduser@raspberrypi ~ $ sudo chmod 750 /fs/hadoop/tmp
hduser@raspberrypi ~ $ hadoop namenode -format

ATTENTION:

If you use JDK 8 for Hadoop, you need to force the DataNode to run in JVM client mode, as the JDK 8 ARM preview does not ship a server VM yet. Go to /usr/local/hadoop/bin and edit the hadoop script (please create a backup first). Assuming you're using nano, the procedure is: nano hadoop, ctrl-w to search for the "-server" argument, delete "-server", then save and exit.
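
If you prefer not to edit the script by hand, the same change can be done with sed; a sketch (keep the backup, as the exact occurrences of "-server" can vary between Hadoop versions):

hduser@raspberrypi /usr/local/hadoop/bin $ cp hadoop hadoop.bak
hduser@raspberrypi /usr/local/hadoop/bin $ sed -i 's/ -server//g' hadoop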

Now hadoop single-node system is ready. Below are some useful commands.

1. jps           // will report the local VM identifier
2. start-all.sh  // will start all hadoop processes
3. stop-all.sh   // will stop all hadoop processes
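
As a quick smoke test, jps after start-all.sh should list the five Hadoop daemons, something like this (process ids will differ):

hduser@raspberrypi ~ $ start-all.sh
hduser@raspberrypi ~ $ jps
2843 NameNode
2926 DataNode
3013 SecondaryNameNode
3096 JobTracker
3179 TaskTracker
3210 Jps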


References:

[1] http://raspberrypi.stackexchange.com/questions/4683/how-to-install-java-jdk-on-raspberry-pi
[2] http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/

Pi Cloud Networking

We went public for the first time on 6 November 2012 at SICSA's Demofest, held at Edinburgh University. Pi Cloud also showed his (or her?) excitement on Twitter (@glasgowpicloud):

Here I am in the #SICSA #Demofest. Come along and have a look at me at Stand 24.

While there are plenty of network topologies for data centers, the Glasgow Pi Cloud adopts only the most prevalent one, the canonical (multi-rooted) tree, as its underpinning network architecture. Other alternatives, such as fat-tree and mesh, will perhaps be considered in the near future. So networking in the Pi Cloud is straightforward, isn't it?