Running Hadoop Java and C++ Word Count example on Raspberry Pi

Various blog posts say that Hadoop is incredibly slow on the Pi, and yes, please lower your expectations: the speed really is appalling. But it is very interesting to see just how slow it can be.

This post assumes you already have Hadoop installed and configured on your Pi. Before we start, we need to increase the swap file size if your Pi is the 256MB version; otherwise it will run out of memory.

1. Increase the swap file size (I stole this from David’s post)

hduser@raspberrypi ~ $ pico /etc/dphys-swapfile
change the value to 500 (MB)
hduser@raspberrypi ~ $ sudo dphys-swapfile setup
hduser@raspberrypi ~ $ sudo reboot
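For reference, the value to change in /etc/dphys-swapfile is the CONF_SWAPSIZE line (this assumes the stock Raspbian dphys-swapfile config):

CONF_SWAPSIZE=500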

2. Download the example file

Go to http://www.gutenberg.org/ebooks/20417 and download the plain-text e-book. Assuming you have downloaded the file to your home directory, we then copy it to HDFS.
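If you would rather download the book directly on the Pi, something like the following should work; the exact plain-text URL is a guess on my part, so check the book’s page if it does not resolve.

hduser@raspberrypi ~ $ wget http://www.gutenberg.org/cache/epub/20417/pg20417.txt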

hduser@raspberrypi ~ $ start-all.sh
hduser@raspberrypi ~ $ hadoop dfs -copyFromLocal pg20417.txt /user/hduser/wordcount/pg20417.txt 

You can then check that the file is there, similar to the ls command:

hduser@raspberrypi ~ $ hadoop dfs -ls /user/hduser/wordcount

3. Run the Java wordcount example

hduser@raspberrypi ~ $ hadoop jar /usr/local/hadoop/hadoop-examples-1.1.2.jar wordcount /user/hduser/wordcount /user/hduser/wordcount-output

Now, be patient! It will take approximately 8 minutes to complete…

4. Check execution result

hduser@raspberrypi ~ $ hadoop dfs -cat /user/hduser/wordcount-output/part-r-00000
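The output is one word and its count per line. If you would rather see the most frequent words, you can pipe the output through sort; a quick sketch:

hduser@raspberrypi ~ $ hadoop dfs -cat /user/hduser/wordcount-output/part-r-00000 | sort -k2 -nr | head -20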

5. C++ wordcount example

Getting Hadoop Pipes to run on the Pi needs a little more effort (hacking?), as we will need to build some Pi-compatible libraries. In particular, we’ll want libhdfs, libhadooppipes and libhadooputils.

Let’s get the build environment ready first.

hduser@raspberrypi ~ $ sudo apt-get install libssl-dev

Go to /usr/local/hadoop/src/c++/libhdfs/ and edit the configure file so that it runs without errors.

In the configure file, find and comment out the following two lines:

as_fn_error $? "Unsupported CPU architecture \"$host_cpu\"" "$LINENO" 5;;

and

define size_t unsigned int

That’s all the hacking we need to do. Next:

hduser@raspberrypi ~ $ ./configure --prefix=/usr/local/hadoop/c++/Linux-i386-32
hduser@raspberrypi ~ $ make
hduser@raspberrypi ~ $ make install

We’re almost done; just do the same for the pipes and utils sources (a sketch follows below). Once finished, you’ll have Pi-compatible libraries, and you can build wordcount.cpp with the Makefile given below.
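Concretely, that means repeating the configure edits and the configure/make/make install steps in the pipes and utils source directories. A sketch, assuming the standard Hadoop 1.x source layout and the same prefix as before:

hduser@raspberrypi ~ $ cd /usr/local/hadoop/src/c++/pipes
hduser@raspberrypi ~ $ ./configure --prefix=/usr/local/hadoop/c++/Linux-i386-32
hduser@raspberrypi ~ $ make
hduser@raspberrypi ~ $ make install
hduser@raspberrypi ~ $ cd /usr/local/hadoop/src/c++/utils
hduser@raspberrypi ~ $ ./configure --prefix=/usr/local/hadoop/c++/Linux-i386-32
hduser@raspberrypi ~ $ make
hduser@raspberrypi ~ $ make install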

wordcount.cpp

#include <algorithm>
#include <limits>
#include <string>
#include <vector>

#include "stdint.h"  // <--- to prevent uint64_t errors!

#include "hadoop/Pipes.hh"
#include "hadoop/TemplateFactory.hh"
#include "hadoop/StringUtils.hh"

using namespace std;

class WordCountMapper : public HadoopPipes::Mapper {
public:
  // constructor: does nothing
  WordCountMapper( HadoopPipes::TaskContext& context ) {
  }

  // map function: receives a line, outputs (word,"1")
  // to reducer.
  void map( HadoopPipes::MapContext& context ) {
    //--- get line of text ---
    string line = context.getInputValue();

    //--- split it into words ---
    vector< string > words =
      HadoopUtils::splitString( line, " " );

    //--- emit each word tuple (word, "1" ) ---
    for ( unsigned int i=0; i < words.size(); i++ ) {
      context.emit( words[i], HadoopUtils::toString( 1 ) );
    }
  }
};

class WordCountReducer : public HadoopPipes::Reducer {
public:
  // constructor: does nothing
  WordCountReducer(HadoopPipes::TaskContext& context) {
  }

  // reduce function
  void reduce( HadoopPipes::ReduceContext& context ) {
    int count = 0;

    //--- get all tuples with the same key, and count their numbers ---
    while ( context.nextValue() ) {
      count += HadoopUtils::toInt( context.getInputValue() );
    }

    //--- emit (word, count) ---
    context.emit(context.getInputKey(), HadoopUtils::toString( count ));
  }
};

int main(int argc, char *argv[]) {
  return HadoopPipes::runTask(
      HadoopPipes::TemplateFactory< WordCountMapper, WordCountReducer >() );
}

Makefile

CC = g++
HADOOP_INSTALL = /usr/local/hadoop
PLATFORM = Linux-i386-32
CPPFLAGS =  -I$(HADOOP_INSTALL)/c++/$(PLATFORM)/include

wordcount: wordcount.cpp
	$(CC) $(CPPFLAGS) $< -Wall -L$(HADOOP_INSTALL)/c++/$(PLATFORM)/lib -lhadooppipes \
	-lhadooputils -lpthread -lcrypto -lssl -g -O2 -o $@
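To actually run the C++ wordcount, the compiled binary needs to be on HDFS, and the job is submitted with hadoop pipes. A rough sketch; the HDFS paths here are just examples:

hduser@raspberrypi ~ $ hadoop dfs -copyFromLocal wordcount /user/hduser/bin/wordcount
hduser@raspberrypi ~ $ hadoop pipes -D hadoop.pipes.java.recordreader=true -D hadoop.pipes.java.recordwriter=true -program /user/hduser/bin/wordcount -input /user/hduser/wordcount -output /user/hduser/wordcount-cpp-output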

Remark: on my 256MB ver. B Pi, the C++ wordcount takes about 10 minutes to finish.

References:

[1] http://cs.smith.edu/dftwiki/index.php/Hadoop_Tutorial_2.2_–_Running_C%2B%2B_Programs_on_Hadoop

[2] http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/#Copy_local_example_data_to_HDFS

3 thoughts on “Running Hadoop Java and C++ Word Count example on Raspberry Pi”

  1. I remember talking to Eben at MakerFaire 2012 and he made a comment about the makeup of the RPI processor.

    “95% of that chip is effectively GPU”

    Implying that the PI had very little CPU power relative to its GPU power. I’m sure there are several OS tweaks in Raspbian to leverage the GPU for floating point.

    And frankly, that is probably the major bottleneck here. Unless the Java framework has very specific hooks for accessing the “right part” of the chip, I can only imagine that perf is going to suck.

    I think the key challenge might be unlocking the performance for exactly these types of workloads.

  2. “Before we start, we need to increase the swap file size if your Pi is the 256MB version; otherwise it will run out of memory.” – I would guess this is the bottleneck. If your machine runs out of memory and uses swap, you cannot expect any performance at all.
