Saturday, August 12, 2017

Java Parser Benchmark


I ran into slow parsing when loading this Hackerrank problem input. When parsing 1M integers, you can no longer rely on a high-level utility such as java.util.Scanner, which is not optimized for speed. You can easily replace this handy yet slow utility with a lower-level function that wraps simple String.split / Integer.parseInt operations.

We also compare loading a file from disk into a JVM heap buffer vs. a native, off-heap buffer, and look at the overhead of decoding the bytes as UTF-8 characters, which is required before manipulating strings. This shows the benefit of converting bytes directly to int data to achieve the best performance.

Maven Project

The code is bundled in this SameOccurrence maven project. The input is a 16 MB plain text file containing more than 1M integers to scan. This was the input of Test Case #18 in the Hackerrank challenge.


The ANTLR parser is the slowest, loading 1M integers in 3 seconds. This is due to the extra overhead of generating the parse tree. Parser generators are very helpful for validating input syntax against a grammar, but they are not tailored for performance.

At the other end, the raw byte parser finishes in 10 ms as it does no extra conversion or validation. It assumes the input is valid and simply converts sequences of bytes separated by whitespace into base-10 integers.

Here we compare the time to load a file into a heap buffer (a byte array allocated within the JVM) with the time to load it using the mmap system call, implemented via a native function that allocates the buffer off-heap by leveraging virtual memory. This reduces our 20 ms loading time to 4 ms.

The 3rd bar shows the UTF-8 conversion overhead you need to pay before you can start working with Strings. On top of loading the raw data in 4 ms, you need to spend an additional 80 ms just to get UTF-8 data ...
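The three steps compared above (heap load, memory-mapped load, UTF-8 decoding) can be sketched with the standard java.nio API. This is a minimal illustration, not the benchmark code from the project: the class and method names (FileLoaders, loadOnHeap, loadMapped, decodeUtf8) are made up for this example, and it does no timing.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper class, for illustration only.
class FileLoaders {

    /** Reads the whole file into a byte[] allocated on the JVM heap. */
    static ByteBuffer loadOnHeap(Path path) throws IOException {
        return ByteBuffer.wrap(Files.readAllBytes(path));
    }

    /** Maps the file into virtual memory; the resulting buffer lives off-heap. */
    static MappedByteBuffer loadMapped(Path path) throws IOException {
        try (FileChannel channel = FileChannel.open(path, StandardOpenOption.READ)) {
            // The mapping remains valid after the channel is closed.
            return channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
        }
    }

    /** Decodes the buffer content as UTF-8: the extra step needed before using Strings. */
    static String decodeUtf8(ByteBuffer buffer) {
        // duplicate() keeps the caller's position and limit untouched.
        return StandardCharsets.UTF_8.decode(buffer.duplicate()).toString();
    }

    public static void main(String[] args) throws IOException {
        Path path = Files.createTempFile("numbers", ".txt");
        path.toFile().deleteOnExit();
        Files.write(path, "3 1\n10 20 30\n".getBytes(StandardCharsets.UTF_8));

        ByteBuffer heap = loadOnHeap(path);
        ByteBuffer mapped = loadMapped(path);
        System.out.println("heap buffer direct?   " + heap.isDirect());
        System.out.println("mapped buffer direct? " + mapped.isDirect());
        System.out.println("same content: " + decodeUtf8(heap).equals(decodeUtf8(mapped)));
    }
}
```

A parser that works on the bytes directly can skip decodeUtf8 entirely, which is where the raw byte parser gets its advantage.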


ANTLR is a parser generator written in Java that converts a grammar into a parser. It provides a Maven plugin that wraps the Tool utility to parse the grammar and generate the associated parser.
You implement a parse listener to load the data as rules get executed.

public void antlr() {
    CharStream charStream = CharStreams.fromString(input);

    Lexer lexer = new SameOccurrenceLexer(charStream);
    CommonTokenStream stream = new CommonTokenStream(lexer);
    SameOccurrenceParser parser = new SameOccurrenceParser(stream);
    // The listener collects the data as the parser matches each rule.
    parser.addParseListener(new SameOccurrenceParseListener());
    // Invoke the grammar's start rule on the parser to run the parse.
}


With the JDK Scanner it is very easy to load raw data types from an input stream into memory.

public void scan() {
    Scanner scanner = new Scanner(input);
    int n = scanner.nextInt();
    int q = scanner.nextInt();
    int[] a = new int[n];
    for (int i = 0; i < n; i++) {
        a[i] = scanner.nextInt();
    }
    for (int t = 0; t < q; t++) {
        int x = scanner.nextInt();
        int y = scanner.nextInt();
    }
}


With the JDK String.split method, you can break your String input into tokens using a regex separator, then convert each token to an int using Integer.parseInt. Returning a primitive int is faster than returning an Integer object via Integer.valueOf.

int tokenOffset = 0;
public void split() {
    String[] tokens = input.split("\\s+");
    int n = parseToken(tokens);
    int q = parseToken(tokens);
    int[] a = new int[n];
    for (int i = 0; i < n; i++) {
        a[i] = parseToken(tokens);
    }
    for (int t = 0; t < q; t++) {
        int x = parseToken(tokens);
        int y = parseToken(tokens);
    }
    assert tokenOffset == tokens.length;
}

private int parseToken(String[] tokens) {
    return Integer.parseInt(tokens[tokenOffset++]);
}
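
To make the Integer.parseInt vs. Integer.valueOf remark concrete, here is a small sketch that sums the same tokens both ways. The class and method names are hypothetical; the only point is that parseInt returns a primitive int while valueOf returns an Integer object that has to be unboxed again.

```java
// Hypothetical example class, for illustration only.
class ParseVsValueOf {

    /** Sums the tokens using primitives only: no Integer object is involved. */
    static long sumPrimitive(String[] tokens) {
        long sum = 0;
        for (String token : tokens) {
            sum += Integer.parseInt(token); // returns int
        }
        return sum;
    }

    /** Same result, but every token goes through an Integer object first. */
    static long sumBoxed(String[] tokens) {
        long sum = 0;
        for (String token : tokens) {
            Integer boxed = Integer.valueOf(token); // returns Integer
            sum += boxed;                           // auto-unboxing back to int
        }
        return sum;
    }

    public static void main(String[] args) {
        String[] tokens = "3 1 10 20 30".split("\\s+");
        System.out.println(sumPrimitive(tokens));
        System.out.println(sumBoxed(tokens));
    }
}
```

Both methods return the same sum; the boxed variant simply does extra work per token.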


You just scan for the next whitespace delimiter, then convert the byte sequence to an integer. This is the same logic as the Integer.parseInt implementation.

int byteOffset = 0;
public void raw() {
    int n = parseInt();
    int q = parseInt();
    int[] a = new int[n];
    for (int i = 0; i < n; i++) {
        a[i] = parseInt();
    }
    for (int t = 0; t < q; t++) {
        int x = parseInt();
        int y = parseInt();
    }
}

private static final int BASE = 10;
private static final char ZERO = '0';
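private int parseInt() {
    // Skip the delimiters left before the token, so that start points at its first digit.
    while (isWhitespace(byteBuffer.get(byteOffset))) {
        byteOffset++;
    }
    int start = byteOffset;
    int end = getTokenEnd();

    // Accumulate digits from the least significant one (end) to the most significant one (start).
    int i = 0;
    int pow = 1;
    for (int pos = end; pos >= start; pos--) {
        int digit = byteBuffer.get(pos) - ZERO;
        i += pow * digit;
        pow *= BASE;
    }
    return i;
}

// Returns the index of the last byte of the current token and moves byteOffset just past it.
// Assumes the input ends with a whitespace character.
private int getTokenEnd() {
    int pos = byteOffset;
    while (!isWhitespace(byteBuffer.get(pos))) {
        pos++;
    }
    byteOffset = pos;
    return pos - 1;
}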

private static final char LF = '\n';
private static final char CR = '\r';
private static final char SPACE = ' ';
private boolean isWhitespace(byte b) {
    return b == LF || b == CR || b == SPACE;
}
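
For readers who want something to run on its own, here is a self-contained variant of the raw byte parser. It is a sketch rather than the project's code: the RawIntReader class is hypothetical, it accumulates digits left to right (value * 10 + digit) instead of right to left with a running power, and it additionally stops cleanly at the end of the buffer and accepts a leading minus sign. Like the original, it assumes valid input and does not check for overflow.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.NoSuchElementException;

// Hypothetical stand-alone reader, for illustration only.
class RawIntReader {
    private final ByteBuffer buffer;

    RawIntReader(ByteBuffer buffer) {
        this.buffer = buffer;
    }

    private static boolean isWhitespace(byte b) {
        return b == '\n' || b == '\r' || b == ' ';
    }

    /** Skips leading whitespace and reports whether another token is available. */
    boolean hasNext() {
        while (buffer.hasRemaining() && isWhitespace(buffer.get(buffer.position()))) {
            buffer.get(); // consume the delimiter
        }
        return buffer.hasRemaining();
    }

    /** Parses the next base-10 integer straight from the bytes, without creating a String. */
    int nextInt() {
        if (!hasNext()) {
            throw new NoSuchElementException("no more tokens");
        }
        boolean negative = false;
        if (buffer.get(buffer.position()) == '-') {
            negative = true;
            buffer.get();
        }
        int value = 0;
        while (buffer.hasRemaining()) {
            byte b = buffer.get(buffer.position());
            if (isWhitespace(b)) {
                break;
            }
            value = value * 10 + (b - '0'); // no validation, no overflow check
            buffer.get();
        }
        return negative ? -value : value;
    }

    public static void main(String[] args) {
        byte[] input = "3 1\n10 20 30\n".getBytes(StandardCharsets.US_ASCII);
        RawIntReader reader = new RawIntReader(ByteBuffer.wrap(input));
        while (reader.hasNext()) {
            System.out.println(reader.nextInt());
        }
    }
}
```

The same reader works on a heap buffer or on a memory-mapped one, since both are ByteBuffer instances.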

Sunday, May 4, 2014

DNG 1.4 Parser

This tutorial describes how to build the Adobe DNG SDK on Linux.
The build produces the dng_validate C++ program, which can parse any DNG image, a bit like a "Hello world" for DNG image processing.

  1. Adobe DNG SDK 1.4

Adobe DNG SDK 1.4

The goal of this blog entry is to execute the dng_validate command built from the source.
We show how to build the XMP libraries then suggest a Makefile configuration to build a DNG SDK sample app.
All the patches were generated by the diff command.


In the current directory, download the XMP SDK:
mv XMP-Toolkit-SDK-CC201306 xmp_sdk

Make sure you have at least the following packages installed:

sudo apt-get install cmake libjpeg8-dev uuid-dev

The next two snippets download the zlib and expat sources. Instructions are given in the ReadMe.txt files in the third-party directories.


cd xmp_sdk/third-party/zlib
tar xzf zlib-1.2.8.tar.gz
cp zlib-1.2.8/*.h zlib-1.2.8/*.c .


cd xmp_sdk/third-party/expat
Download here the source at
tar xzf expat-2.1.0.tar.gz
cp -R expat-2.1.0/lib .

For cross-platform compatibility, to support both Mac and Windows, the SDK authors selected cmake to build the source. We go through all the build errors in turn.

cd xmp_sdk/build

Build error 1

CMake Error at shared/SharedConfig_Common.cmake:38 (if):
  if given arguments:

    "LESS" "413"

  Unknown arguments specified

I use gcc 4.8.2 for the build. There seems to be an XMP_VERSIONING_GCC_VERSION variable that is not properly set. We just delete the version check by removing lines 37 to 42 in xmp_sdk/build/shared/SharedConfig_Common.cmake:

>               # workaround for visibility problem and gcc 4.1.x
>               if(${${COMPONENT}_VERSIONING_GCC_VERSION} LESS 413)
>                       # only remove inline hidden...
>                       string(REGEX REPLACE "-fvisibility-inlines-hidden" "" CMAKE_CXX_FLAGS ${CMAKE_CXX_FLAGS})
>               endif()

Build error 2

Linking CXX shared library /home/alexis/linux/dng2/xmp_sdk/public/libraries/i80386linux_x64/release/
g++: error: /usr/lib64/gcc/x86_64-redhat-linux/4.4.4//libssp.a: No such file or directory


... He may also want to change the parameter XMP_ENABLE_SECURE_SETTINGS as per the configured gcc.
 a) If the gcc is configured with --enable-libssp (can be checked by executing gcc -v), he has to set the variable XMP_GCC_LIBPATH inside of file /build/XMP_Linux.cmake to the path containing the static lib( libssp.a).In this case he can set the variable the XMP_ENABLE_SECURE_SETTINGS on.
b) If the gcc is configured with --disable-libssp, he has to set the variable XMP_ENABLE_SECURE_SETTINGS off.

I don't have any --enable-libssp in my gcc version:

$ gcc -v
Using built-in specs.
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Debian 4.8.2-21' --with-bugurl=file:///usr/share/doc/gcc-4.8/README.Bugs --enable-languages=c,c++,java,go,d,fortran,objc,obj-c++ --prefix=/usr --program-suffix=-4.8 --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --with-gxx-include-dir=/usr/include/c++/4.8 --libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --enable-gnu-unique-object --disable-libmudflap --enable-plugin --with-system-zlib --disable-browser-plugin --enable-java-awt=gtk --enable-gtk-cairo --with-java-home=/usr/lib/jvm/java-1.5.0-gcj-4.8-amd64/jre --enable-java-home --with-jvm-root-dir=/usr/lib/jvm/java-1.5.0-gcj-4.8-amd64 --with-jvm-jar-dir=/usr/lib/jvm-exports/java-1.5.0-gcj-4.8-amd64 --with-arch-directory=amd64 --with-ecj-jar=/usr/share/java/eclipse-ecj.jar --enable-objc-gc --enable-multiarch --with-arch-32=i586 --with-abi=m64 --with-multilib-list=m32,m64,mx32 --with-tune=generic --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
Thread model: posix
gcc version 4.8.2 (Debian 4.8.2-21)

so I disable it in xmp_sdk/build/shared/ToolchainGCC.cmake line 37:


Build error 3

/home/alexis/linux/dng2/xmp_sdk/XMPFiles/build/../../XMPFiles/source/NativeMetadataSupport/ValueObject.h:111:60: error: 'memcmp' was not declared in this scope, and no declarations were found by argument-dependent lookup at the point of instantiation [-fpermissive]
    doSet = ( memcmp( mArray, buffer, numElements*sizeof(T) ) != 0 );
In file included from /home/alexis/linux/dng2/xmp_sdk/XMPFiles/build/../../XMPFiles/source/FormatSupport/TIFF_Support.hpp:17:0,
                 from /home/alexis/linux/dng2/xmp_sdk/XMPFiles/build/../../XMPFiles/source/FormatSupport/ReconcileLegacy.hpp:15,
                 from /home/alexis/linux/dng2/xmp_sdk/XMPFiles/build/../../XMPFiles/source/FormatSupport/Reconcile_Impl.hpp:15,
                 from /home/alexis/linux/dng2/xmp_sdk/XMPFiles/source/FormatSupport/WAVE/WAVEReconcile.cpp:24:
/usr/include/string.h:65:12: note: 'int memcmp(const void*, const void*, size_t)' declared here, later in the translation unit
 extern int memcmp (const void *__s1, const void *__s2, size_t __n)

This error is related to my newer version of gcc, as described in the Name lookup changes section.
We simply add #include "string.h" in xmp_sdk/XMPFiles/source/NativeMetadataSupport/ValueObject.h, at line 16:

< #include "string.h"

Validate the build

[100%] Built target XMPFilesStatic

Check that the shared and static libraries got properly generated:

$ ls xmp_sdk/public/libraries/i80386linux_x64/release**


In the current directory, where the XMP SDK got downloaded, download the DNG SDK:


There are multiple ways to come up with the binary. In this case,
  • We separate the compilation and linking steps into 2 separate targets.
  • We link with the shared libraries of the XMP Toolkit.
  • We did not install the required XMP files generated during the XMP build system-wide in /usr/local/lib and /usr/local/include.

cd dng_sdk/source


Create the "Makefile" build configuration file:

# Binary name

# A DNG image

# The XMP SDK build directory if we don't want to install it system-wide.

INCL=-I $(XMP_PUB_DIR)/include
LIB=-ljpeg -lz -lpthread -ldl -L $(XMP_RELEASE) -lXMPCore -lXMPFiles
SOURCES:=$(shell ls *.cpp)

# Execute the binary

# Linking
        g++ $^ $(LIB) -o $@

# Compilation
        g++ -c -Wall -g $(INCL) $^

        rm $(EXECUTABLE) *.o

Build error 1

To avoid errors like

dng_flags.h:36:28: fatal error: RawEnvironment.h: No such file or directory
 #include "RawEnvironment.h"

Create the RawEnvironment.h file containing build settings:

#define qLinux 1
#define qDNGThreadSafe 1
#define UNIX_ENV 1

Build error 2

To avoid errors like

dng_flags.h:40:2: error: #error Unable to figure out platform

A qLinux setting was created in RawEnvironment.h. The "!defined(qMacOS) || !defined(qWinOS)" test at line 39 of dng_flags.h does not seem to work on Linux, even when qMacOS or qWinOS are defined: the condition is true, and the error is raised, unless both macros are defined. We replace it with something more standard, for example "#ifndef qLinux":

39 #ifndef qLinux
40 #error Unable to figure out platform
41 #endif

chmod +w dng_flags.h
< #ifndef qLinux
> #if !defined(qMacOS) || !defined(qWinOS)

Build error 3

dng_string.cpp:1792:24: error: 'isdigit' was not declared in this scope
    if (isdigit ((int) c) || c == '.' || c == '-' || c == '+' || c == 'e' || c == 'E')

We simply need to include "ctype.h" in dng_string.cpp for the Linux platform at line 33:

32 #if qiPhone || qAndroid || qLinux
33 #include <ctype.h> // for isdigit
34 #endif

At line 32 we add qLinux to the list of platforms for which ctype.h is included.

chmod +w dng_string.cpp
< #if qiPhone || qAndroid || qLinux
> #if qiPhone || qAndroid

Build error 4

/usr/lib/gcc/x86_64-linux-gnu/4.8/../../../x86_64-linux-gnu/crt1.o: In function `_start':
(.text+0x20): undefined reference to `main'

The main function declared in dng_validate.cpp was not found. Enable the qDNGValidateTarget build setting at line 200 of dng_flags.h:

< #define qDNGValidateTarget 1
> #define qDNGValidateTarget 0

Build error 5

make clean

We now have a linking error

dng_xmp_sdk.cpp:(.text._ZN9TXMPFilesISsE8OpenFileEP6XMP_IOjj[_ZN9TXMPFilesISsE8OpenFileEP6XMP_IOjj]+0x40): undefined reference to `WXMPFiles_OpenFile_2'
collect2: error: ld returned 1 exit status

We want to link with the shared libraries. Delete the static build setting in dng_xmp_sdk.cpp line 48:

chmod +w dng_xmp_sdk.cpp
> #define XMP_StaticBuild 1

Validate the build

The execution of the command looks like:

LD_LIBRARY_PATH=/home/alexis/linux/dng/xmp_sdk/public/libraries/i80386linux_x64/release ./dng_validate ~/job/image_samples/9436b5a2336f0a575a5b0ef3adf0b25171125081.dng
Validating "/home/alexis/job/image_samples/9436b5a2336f0a575a5b0ef3adf0b25171125081.dng"...
Raw image read time: 0.074 sec
Linearization time: 0.051 sec
Interpolate time: 0.000 sec
Validation complete

Sunday, February 20, 2011

Increase your Swap partition

Build error with Mahout

Interestingly enough, with as little as 1 GB of RAM and 1 GB of swap, you should run into the exact same issue when building the Mahout project from source as the one described in this previous post, HBase memory issue.

$ svn co mahout
$ cd mahout
$ mvn

From core/target/surefire-reports/, I had the same error:

Test set:
Tests run: 21, Failures: 0, Errors: 2, Skipped: 0, Time elapsed: 33.882 sec <<< FAILURE!
testCompleteJobBoolean(  Time elapsed: 15.717 sec  <<< ERROR! Cannot run program "chmod": error=12, Cannot allocate memory
        at java.lang.ProcessBuilder.start(
        at org.apache.hadoop.util.Shell.runCommand(
        at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(
        at org.apache.hadoop.util.Shell.execCommand(
        at org.apache.hadoop.util.Shell.execCommand(
        at org.apache.hadoop.fs.RawLocalFileSystem.execCommand(
        at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(
        at org.apache.hadoop.fs.FilterFileSystem.setPermission(
        at org.apache.hadoop.fs.FileSystem.mkdirs(
        at org.apache.hadoop.mapred.JobClient.configureCommandLineOptions(
        at org.apache.hadoop.mapred.JobClient.submitJobInternal(
        at org.apache.hadoop.mapreduce.Job.submit(
        at org.apache.hadoop.mapreduce.Job.waitForCompletion(

As mentioned here, you should increase the size of your swap partition. The swap partition was not big enough for the system to fork the java process while simultaneously keeping the memory pages required by the other running applications.

Increase your Swap partition

To get more swap, you can create a swap file. A cleaner way is to change your partition table: shrink one partition to free some space, then grow the swap partition. You need to know what you are doing here. Read the Partition guide carefully and back up your personal files before changing anything. In my case, I got lucky as I was able to shrink an unimportant primary partition (/dev/sda2) to give additional space to the swap (the logical partition /dev/sda8) located in a different primary partition (/dev/sda1). The former partition table had the same amount of swap as RAM, 1 GB. The new one has twice as much swap as RAM, 2 GB. This was the old partition table:

# fdisk /dev/sda

WARNING: DOS-compatible mode is deprecated. It's strongly recommended to
         switch off the mode (command 'c') and change display units to
         sectors (command 'u').

Command (m for help): p

Disk /dev/sda: 58.5 GB, 58506416640 bytes
255 heads, 63 sectors/track, 7113 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0xcfce28df

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1               1        3649    29310561    5  Extended
/dev/sda2   *        3650        7112    27816547+   7  HPFS/NTFS
/dev/sda5               1         973     7815559+  83  Linux
/dev/sda6             974        1095      979933+  83  Linux
/dev/sda7            1096        3527    19535008+  83  Linux
/dev/sda8            3528        3649      979933+  82  Linux swap / Solaris

/dev/sda8 was assigned cylinders 3528 to 3649, i.e. 122 cylinders counting both ends. Its size is about 122 * 8225280 / 1024 = 979 965 kB (~ 1 GB), in line with the 979933 blocks reported by fdisk.

To perform the modification, I followed these steps:

  1. Count the number of additional cylinders required for the swap partition (121)
  2. Run fdisk /dev/sda
  3. Shrink primary partition /dev/sda2 : delete it (d), add it again (n) with fewer cylinders (starting cylinder is now 3771 instead of 3650) and write the partition table to disk (w).
  4. Run kpartx /dev/sda
  5. Expand extended partition /dev/sda1 : (d) then (n), with more cylinders (ending cylinder is 3770 instead of 3649). Recreate logical partitions /dev/sda5, /dev/sda6, /dev/sda7 ((n) 3 times with the same start/end cylinders as before) and finally /dev/sda8 with more cylinders (ending cylinder is now 3770 instead of 3649). Assign 82 as system id to /dev/sda8 to tag it as "Linux swap / Solaris" (t). Then (w).
  6. Run kpartx /dev/sda again
  7. Reformat the modified partitions:

Format the /dev/sda2 partition with an ext3 filesystem. You will lose everything on it.

# mke2fs -j /dev/sda2

Format /dev/sda8 as swap:

# swapoff /dev/sda8
# mkswap -f /dev/sda8
# swapon /dev/sda8

This is the new partition table:

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1               1        3770    30282493+   5  Extended
/dev/sda2            3771        7113    26852647+  83  Linux
/dev/sda5               1         973     7815559+  83  Linux
/dev/sda6             974        1095      979933+  83  Linux
/dev/sda7            1096        3527    19535008+  83  Linux
/dev/sda8            3528        3770     1951866   82  Linux swap / Solaris
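
As a sanity check on the table above, the block count of a partition can be recomputed from its start and end cylinders. The sketch below is only an illustration (the SwapSize class is hypothetical). It uses the geometry printed by fdisk (255 heads, 63 sectors/track, 512-byte sectors) and assumes, as is usual in DOS-compatible mode, that a logical partition gives up one 63-sector track at its start; with that assumption the result matches the 1951866 blocks listed for /dev/sda8.

```java
// Hypothetical helper class, for illustration only.
class SwapSize {
    // Geometry reported by fdisk: 255 heads * 63 sectors/track = 16065 sectors per cylinder.
    static final long SECTORS_PER_CYLINDER = 255L * 63L;

    /**
     * Size in 1 kB blocks of a partition spanning cylinders start..end (both included),
     * minus the sectors reserved at its beginning.
     */
    static long blocks(int startCylinder, int endCylinder, int reservedSectors) {
        long cylinders = endCylinder - startCylinder + 1L;
        long sectors = cylinders * SECTORS_PER_CYLINDER - reservedSectors;
        return sectors / 2; // two 512-byte sectors per 1 kB block
    }

    public static void main(String[] args) {
        // New /dev/sda8: cylinders 3528 to 3770, one 63-sector track reserved (assumption).
        System.out.println(blocks(3528, 3770, 63) + " blocks");
    }
}
```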

With the top command, you should see the new total amount of swap available.

top - 09:45:16 up  1:48,  6 users,  load average: 0.01, 0.04, 0.01
Tasks: 139 total,   1 running, 138 sleeping,   0 stopped,   0 zombie
Cpu(s):  1.0%us,  1.0%sy,  0.0%ni, 96.8%id,  0.0%wa,  0.7%hi,  0.5%si,  0.0%st
Mem:   1026468k total,   668960k used,   357508k free,    22036k buffers
Swap:  1951856k total,   168876k used,  1782980k free,   438632k cached


Run the build in the Mahout directory with the mvn command. It should now succeed.

Sunday, February 6, 2011

Gora, an ORM framework for Hadoop jobs

Table of Contents
Gora development in Eclipse
I/O Frequency
Cassandra in Gora
  gora-cassandra module
  Avro schema


Lately I have been focusing on rewriting the whole Cassandra stack in Gora, for GORA-22. It was hinted that it needed to be revamped because of some concurrency issues. I tried to port the old backend by updating the API calls to make it compatible with Cassandra 0.8. I could not make it work, so I just rewrote the module from scratch. Instead of rewriting the entire Thrift layer, we now delegate Cassandra read/write operations to Hector, the first Cassandra client listed on the official wiki.


Here is some Gora background, from what I understand.

Nutch performs a web crawl by running iterations of generate/fetch/parse/updatedb steps implemented as Hadoop jobs. To access the data, it now relies on Gora, which is actually an ORM framework (Object-Relational Mapping, a bit like ActiveRecord in Rails), instead of manipulating segments as it did previously.

The gora-mapreduce module intends to abstract away the data access within Map/Reduce. It replaces the data storage that is usually done through HDFS files (hadoop-hdfs). Instead, you are given the ability to query your data from a database, either row-oriented (RDBMS such as MySQL, HSQL) or column-oriented (NoSQL databases such as HBase or Cassandra).

This of course has an impact on performance, as it adds network overhead when the mappers need to connect to a centralized remote server instead of reading distributed files from the cluster. It also removes a few intrinsic HDFS features, such as data recovery through replication (connection failures) or speedup through network topology analysis. Here I'm not quite aware of the implications of using Gora, so please don't hesitate to share your own impressions.

Gora development in Eclipse

Set up a Gora project the same way as Nutch.
To resolve dependencies, we can use the Maven m2eclipse plugin or the IvyDE plugin.

If you want to use IvyDE, you should replace every occurrence of ${project.dir} with ${basedir}/trunk (assuming trunk is the local directory containing the Gora checkout) and comment out the gora-core dependency in gora-cassandra/ivy/ivy.xml to make it work in Eclipse. I have no idea how to load such a project.dir property in the IvyDE Eclipse plugin.

Index: gora-core/ivy/ivy.xml
--- gora-core/ivy/ivy.xml       (revision 1149427)
+++ gora-core/ivy/ivy.xml       (working copy)
@@ -23,7 +23,7 @@
-    <include file="${project.dir}/ivy/ivy-configurations.xml"/>
+    <include file="${basedir}/trunk/ivy/ivy-configurations.xml"/>
   <publications defaultconf="compile">
@@ -44,10 +44,11 @@
       <exclude org="org.eclipse.jdt" name="core"/>
       <exclude org="org.mortbay.jetty" name="jsp-*"/>
     <dependency org="org.apache.hadoop" name="avro" rev="1.3.2" conf="*->default">
       <exclude org="ant" name="ant"/>
     <!-- test dependencies -->
     <dependency org="org.apache.hadoop" name="hadoop-test" rev="0.20.2" conf="test->master"/>
     <dependency org="org.slf4j" name="slf4j-simple" rev="1.5.8" conf="test -> *,!sources,!javadoc"/>
Index: gora-sql/ivy/ivy.xml
--- gora-sql/ivy/ivy.xml        (revision 1149427)
+++ gora-sql/ivy/ivy.xml        (working copy)
@@ -23,7 +23,7 @@
-    <include file="${project.dir}/ivy/ivy-configurations.xml"/>
+    <include file="${basedir}/trunk/ivy/ivy-configurations.xml"/>
@@ -33,12 +33,13 @@
     <!-- conf="*->@" means every conf is mapped to the conf of the same name of the artifact-->
-    <dependency org="org.apache.gora" name="gora-core" rev="latest.integration" changing="true" conf="*->@"/> 
+    <dependency org="org.apache.gora" name="gora-core" rev="latest.integration" changing="true" conf="*->@"/>
     <dependency org="org.jdom" name="jdom" rev="1.1" conf="*->master"/>
     <dependency org="com.healthmarketscience.sqlbuilder" name="sqlbuilder" rev="2.0.6" conf="*->default"/>
     <!-- test dependencies -->
     <dependency org="org.hsqldb" name="hsqldb" rev="2.0.0" conf="test->default"/>
+    <dependency org="mysql" name="mysql-connector-java" rev="5.1.13" conf="*->default"/>
Index: gora-cassandra/ivy/ivy.xml
--- gora-cassandra/ivy/ivy.xml  (revision 1149427)
+++ gora-cassandra/ivy/ivy.xml  (working copy)
@@ -24,7 +24,7 @@
-    <include file="${project.dir}/ivy/ivy-configurations.xml"/>
+    <include file="${basedir}/trunk/ivy/ivy-configurations.xml"/>
@@ -35,9 +35,9 @@
     <!-- conf="*->@" means every conf is mapped to the conf of the same name of the artifact-->
+    <!--
     <dependency org="org.apache.gora" name="gora-core" rev="latest.integration" changing="true" conf="*->@"/>
+    -->
     <dependency org="org.jdom" name="jdom" rev="1.1">
        <exclude org="xerces" name="xercesImpl"/>

This is what the Gora project looks like

In the Nutch project, we want to comment out the gora-core, gora-hbase, gora-cassandra or gora-mysql dependencies in Ivy, since they are already loaded with the Gora project being included in the Nutch Java Build Path. We add Gora as a project dependency in the Java Build Path.

That's how we can update Gora code and test it on the fly by running Nutch classes.

I/O Frequency

By default the records shuffled in the Hadoop job are buffered in memory. At some point, you want to flush the buffer and write the records to the actual database. Similarly, you cannot load the entire content of the database before starting the mappers, so you need to limit the number of records you read at a time.

People might be familiar with the mapred-site.xml configuration file from writing Hadoop jobs, without necessarily using Gora together with Nutch. In it, you can override the default number of rows fetched per select query for read operations, and the default number of rows buffered in memory before they get flushed for write operations. The default is 10000.

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
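
The write-side buffering described in this section can be pictured with a few lines of Java. This is a generic sketch under stated assumptions, not Gora's API: BufferedRecordWriter and its sink callback are hypothetical names, and the sink simply stands in for whatever performs the actual database write.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical class, for illustration only; it does not mirror Gora's classes.
class BufferedRecordWriter<T> {
    private final int limit;                 // e.g. 10000, the default mentioned above
    private final Consumer<List<T>> sink;    // stands in for the datastore write
    private final List<T> buffer = new ArrayList<>();

    BufferedRecordWriter(int limit, Consumer<List<T>> sink) {
        this.limit = limit;
        this.sink = sink;
    }

    /** Buffers the record in memory and flushes once the limit is reached. */
    void write(T record) {
        buffer.add(record);
        if (buffer.size() >= limit) {
            flush();
        }
    }

    /** Hands the buffered records to the sink in one batch, then empties the buffer. */
    void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        sink.accept(new ArrayList<>(buffer)); // pass a copy so the sink can keep it
        buffer.clear();
    }

    public static void main(String[] args) {
        BufferedRecordWriter<String> writer =
                new BufferedRecordWriter<>(3, batch -> System.out.println("flushing " + batch));
        for (int i = 0; i < 7; i++) {
            writer.write("row" + i);
        }
        writer.flush(); // flush the remainder when the task ends
    }
}
```

A smaller limit means more round trips to the database; a larger one means more memory held per task.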

Cassandra in Gora

A Gora Hadoop job just fetches or emits key/value pairs to process the data. The task performed by the gora-cassandra module is to take the values, which come as specific instances of Avro records, and store them in Cassandra. So this subproject is just a thin layer between the Gora Hadoop job code and the Cassandra client, which itself interacts with the Cassandra server. That's it.

gora-cassandra module

Here are the main changes/improvements:

  • Compatibility with Cassandra 0.8
  • Use Hector as Cassandra client
  • Concurrency now relies on Hector.
  • Not all features have been implemented yet: delete queries, for example.

Avro schema

The object serialization is dictated by an Avro schema. An example can be found in $NUTCH_HOME/src/gora/webpage.avsc:

{"name": "WebPage",
 "type": "record",
 "namespace": "",
 "fields": [
        {"name": "baseUrl", "type": "string"},
        {"name": "status", "type": "int"},
        {"name": "fetchTime", "type": "long"},
        {"name": "prevFetchTime", "type": "long"},
        {"name": "fetchInterval", "type": "int"},
        {"name": "retriesSinceFetch", "type": "int"},
        {"name": "modifiedTime", "type": "long"},
        {"name": "protocolStatus", "type": {
            "name": "ProtocolStatus",
            "type": "record",
            "namespace": "",
            "fields": [
                {"name": "code", "type": "int"},
                {"name": "args", "type": {"type": "array", "items": "string"}},
                {"name": "lastModified", "type": "long"}
            ]
        }},
        {"name": "content", "type": "bytes"},
        {"name": "contentType", "type": "string"},
        {"name": "prevSignature", "type": "bytes"},
        {"name": "signature", "type": "bytes"},
        {"name": "title", "type": "string"},
        {"name": "text", "type": "string"},
        {"name": "parseStatus", "type": {
            "name": "ParseStatus",
            "type": "record",
            "namespace": "",
            "fields": [
                {"name": "majorCode", "type": "int"},
                {"name": "minorCode", "type": "int"},
                {"name": "args", "type": {"type": "array", "items": "string"}}
            ]
        }},
        {"name": "score", "type": "float"},
        {"name": "reprUrl", "type": "string"},
        {"name": "headers", "type": {"type": "map", "values": "string"}},
        {"name": "outlinks", "type": {"type": "map", "values": "string"}},
        {"name": "inlinks", "type": {"type": "map", "values": "string"}},
        {"name": "markers", "type": {"type": "map", "values": "string"}},
        {"name": "metadata", "type": {"type": "map", "values": "bytes"}}
 ]
}

The schema is hardcoded in the Nutch class, which is a POJO (Plain Old Java Object) that contains the logic used for crawling a web page. It extends org.apache.gora.persistency.impl.PersistentBase. gora-cassandra considers the PersistentBase class as a base class for the RECORD fields. It considers org.apache.gora.persistency.StatefulHashMap as a base class for the MAP fields.

The 2 complex types, MAP and RECORD, are represented in Cassandra by "super columns", which are maps, i.e. sets of key/value pairs. The ARRAY type is represented by a simple column, via a comma-separated list bounded by square brackets, like "[one, two, three]". Not the best.
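
To see why the bracketed, comma-separated representation is "not the best", here is a small sketch of such an encoding. It is not the gora-cassandra code: ArrayColumnCodec and its methods are hypothetical, written only to show that an element containing the separator cannot be recovered after a round trip.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical codec, for illustration only.
class ArrayColumnCodec {

    /** Encodes the items as a single column value such as "[one, two, three]". */
    static String encode(List<String> items) {
        return "[" + String.join(", ", items) + "]";
    }

    /** Decodes a value produced by encode(); elements containing ", " get split apart. */
    static List<String> decode(String value) {
        String inner = value.substring(1, value.length() - 1);
        if (inner.isEmpty()) {
            return new ArrayList<>();
        }
        return new ArrayList<>(Arrays.asList(inner.split(", ")));
    }

    public static void main(String[] args) {
        System.out.println(encode(Arrays.asList("one", "two", "three")));
        // Two elements go in, three come out: the separator is ambiguous.
        System.out.println(decode(encode(Arrays.asList("a, b", "c"))).size());
    }
}
```

An encoding that escapes the separator, or one column per element, would avoid this ambiguity.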


Tuesday, January 18, 2011

Trying Nutch 2.0 HBase storage


You are likely to run into issues when using HBase as the datastore for Nutch, especially on commodity hardware with very limited memory. This article follows up on two posts:

which explain how to set up Nutch 2.0 with HBase.

Memory Issue

In my case, HBase is running on a laptop that only has 1 GB of memory. This is too little. When Java forks a process, the parent process's pages are duplicated to set up the memory of the child, which requires twice as much memory as the process was using before the fork.

You can see that the JVM doubles the Java heap space size at around 5 PM, which crashes HBase. Keep reading to see the errors in the corresponding logs.

HBase error

First let's take a look at the data Nutch created.

$ bin/hbase shell
HBase Shell; enter 'help' for list of supported commands.
Version: 0.20.6, r965666, Mon Jul 19 16:54:48 PDT 2010
hbase(main):001:0> list        
1 row(s) in 0.1480 seconds
hbase(main):002:0> describe "webpage"
 NAME => 'webpage',
  {NAME => 'f', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
  {NAME => 'h', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
  {NAME => 'il', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
  {NAME => 'mk', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
  {NAME => 'mtdt', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
  {NAME => 'ol', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
  {NAME => 'p', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
  {NAME => 's', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}
hbase(main):003:0> scan "webpage", { LIMIT => 1 }
ROW                          COLUMN+CELL                                                                      
 com.richkidzradio:http/     column=f:bas, timestamp=1295012635817, value=

I had issues with the updatedb command on 200k rows after parsing around 20k rows. When the fork happened, the logs in $HBASE_HOME/logs/hbase-alex-master-maison.log show:

2011-01-18 16:59:16,685 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Flush requested on webpage,com.richkidzradio:http/,1295020425887
2011-01-18 16:59:16,685 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Started memstore flush for region webpage,com.richkidzradio:http/,1295020425887. Current region memstore size 64.0m
2011-01-18 16:59:16,686 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Finished snapshotting, commencing flushing stores
2011-01-18 16:59:16,728 FATAL org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Replay of hlog required. Forcing server shutdown
org.apache.hadoop.hbase.DroppedSnapshotException: region: webpage,com.richkidzradio:http/,1295020425887
        at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(
        at org.apache.hadoop.hbase.regionserver.HRegion.flushcache(
        at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(
        at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(
Caused by: Cannot run program "chmod": error=12, Cannot allocate memory
        at java.lang.ProcessBuilder.start(
        at org.apache.hadoop.util.Shell.runCommand(
        at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(
        at org.apache.hadoop.util.Shell.execCommand(
        at org.apache.hadoop.util.Shell.execCommand(
        at org.apache.hadoop.fs.RawLocalFileSystem.execCommand(
        at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(
        at org.apache.hadoop.fs.FilterFileSystem.setPermission(
        at org.apache.hadoop.fs.ChecksumFileSystem.create(
        at org.apache.hadoop.fs.FileSystem.create(
        at org.apache.hadoop.fs.FileSystem.create(
        at org.apache.hadoop.fs.FileSystem.create(
        at org.apache.hadoop.fs.FileSystem.create(
        at org.apache.hadoop.hbase.regionserver.StoreFile.getWriter(
        at org.apache.hadoop.hbase.regionserver.Store.getWriter(
        at org.apache.hadoop.hbase.regionserver.Store.getWriter(
        at org.apache.hadoop.hbase.regionserver.Store.internalFlushCache(
        at org.apache.hadoop.hbase.regionserver.Store.flushCache(
        at org.apache.hadoop.hbase.regionserver.Store.access$100(
        at org.apache.hadoop.hbase.regionserver.Store$StoreFlusherImpl.flushCache(
        at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(
        ... 4 more
Caused by: error=12, Cannot allocate memory
        at java.lang.UNIXProcess.<init>(
        at java.lang.ProcessImpl.start(
        at java.lang.ProcessBuilder.start(
        ... 26 more
2011-01-18 16:59:16,730 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Flush requested on webpage,com.richkidzradio:http/,1295020425887
2011-01-18 16:59:16,758 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 11 on 34511, call put([B@24297, [Lorg.apache.hadoop.hbase.client.Put;@61f9c6) from 0:0:0:0:0:0:0:1:38369: error: Server not running, aborting Server not running, aborting
        at org.apache.hadoop.hbase.regionserver.HRegionServer.checkOpen(
        at org.apache.hadoop.hbase.regionserver.HRegionServer.put(
        at sun.reflect.GeneratedMethodAccessor46.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(
        at java.lang.reflect.Method.invoke(
        at org.apache.hadoop.hbase.ipc.HBaseRPC$
        at org.apache.hadoop.hbase.ipc.HBaseServer$
2011-01-18 16:59:16,759 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Dump of metrics: request=0.0, regions=7, stores=43, storefiles=42, storefileIndexSize=0, memstoreSize=129, compactionQueueSize=0, usedHeap=317, maxHeap=996, blockCacheSize=175238616, blockCacheFree=33808120, blockCacheCount=2196, blockCacheHitRatio=88, fsReadLatency=0, fsWriteLatency=0, fsSyncLatency=0
2011-01-18 16:59:16,759 INFO org.apache.hadoop.hbase.regionserver.MemStoreFlusher: RegionServer:0.cacheFlusher exiting
2011-01-18 16:59:16,944 INFO org.apache.hadoop.ipc.HBaseServer: Stopping server on 34511

The exception occurred while HBase was trying to update the first row of the batch. The class that forks the process is org.apache.hadoop.fs.RawLocalFileSystem, so this is actually a Hadoop issue, reported in HADOOP-5059.
Here are the versions being used:

  • HBase 0.20.6
  • lib/hadoop-0.20.2-core.jar
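
The stack trace above bottoms out in ProcessBuilder.start: to set the permissions of the new store file, the local file system shells out to chmod, and it is the fork of that child process that fails with error=12 (ENOMEM). As a minimal sketch of the difference, and not what Hadoop 0.20.2 actually does, the snippet below contrasts the forking approach with an in-process call from java.nio.file (Java 7 and later, POSIX file systems only). The class and method names are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.PosixFilePermissions;

// Hypothetical helper for illustration; not part of Hadoop or HBase.
final class InProcessChmod {

    // Forking variant: spawns a child "chmod" process. Spawning the child
    // is the step that fails with error=12 in the stack trace.
    static int chmodByFork(Path file, String octalMode)
            throws IOException, InterruptedException {
        Process p = new ProcessBuilder("chmod", octalMode, file.toString()).start();
        return p.waitFor();
    }

    // In-process variant: asks the file system directly, no child process.
    // symbolicMode uses the nine-character "rwxr-x---" notation.
    static void chmodInProcess(Path file, String symbolicMode) throws IOException {
        Files.setPosixFilePermissions(file, PosixFilePermissions.fromString(symbolicMode));
    }

    // Creates a temp file, applies the mode in-process and reads it back.
    static String roundTrip(String symbolicMode) {
        try {
            Path tmp = Files.createTempFile("flush", ".tmp");
            try {
                chmodInProcess(tmp, symbolicMode);
                return PosixFilePermissions.toString(Files.getPosixFilePermissions(tmp));
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("rw-r--r--"));
    }
}
```

The in-process variant never forks, so it does not depend on how much memory the operating system is willing to commit for a copy of the JVM process.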

Recover HBase

When you run the Nutch updatedb command again, you might see this error in the Nutch logs:

org.apache.hadoop.hbase.client.RetriesExhaustedException: Trying to contact region server Some server, retryOnlyOne=true, index=0, islastrow=true, tries=9, numtries=10, i=0, listsize=1, region=webpage,com.richkidzradio:http/,1295020425887 for region webpage,com.richkidzradio:http/,1295020425887, row 'hf:http/', but failed after 10 attempts.
        at org.apache.hadoop.hbase.client.HConnectionManager$TableServers$Batch.process(
        at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.processBatchOfRows(
        at org.apache.hadoop.hbase.client.HTable.flushCommits(
        at org.apache.hadoop.hbase.client.HTable.put(
        at org.apache.gora.mapreduce.GoraRecordWriter.write(
        at org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.write(
        at org.apache.hadoop.mapreduce.TaskInputOutputContext.write(
        at org.apache.nutch.crawl.DbUpdateReducer.reduce(
        at org.apache.nutch.crawl.DbUpdateReducer.reduce(
        at org.apache.hadoop.mapred.ReduceTask.runNewReducer(
        at org.apache.hadoop.mapred.LocalJobRunner$

and this one in the HBase logs:

2011-01-18 15:56:11,195 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer: Batch puts interrupted at index=0 because:Requested row out of range for HRegion webpage,com.richkidzradio:http/,1295020425887, startKey='com.richkidzradio:http/', getEndKey()='', row='hf:http/'

As mentioned here, you might have "holes". To fix the problem, pass the webpage table directory to the add_table.rb script:

$ cd $HBASE_HOME/bin
$ ./hbase org.jruby.Main add_table.rb /home/alex/hbase/hbase-alex/hbase/webpage


To fix this issue, check out this new post, Increase your Swap size. With the current versions of Hadoop and HBase, very limited RAM and insufficient swap, you will not get very far because of silly fork operations. Maybe these issues will be fixed in the upcoming releases, Hadoop 0.22 and HBase 0.90. I guess it's now time to give Cassandra a try...

Thursday, January 6, 2011

SSH Setup For HBase

SSH passwordless login
 Login with password
 Login with passphrase
 Login without password
 Configure HBase
 Start HBase

HBase can be used as datastore for Nutch 2.0.

This is a tutorial to get started quickly with a standalone instance of HBase. I did not find the original guide so straightforward, hence this blog entry.

SSH passwordless login

We are going to set up the SSH environment for HBase. Personally I just use a standalone instance of HBase on a single machine, so the server and client are the same. My Linux distribution is Debian.

$ whoami

First, if I try to log in while no SSH daemon is running, I get this error:

$ ssh alex@localhost
ssh: connect to host localhost port 22: Connection refused

You want to set up an SSH server. I installed the Debian package:
# apt-get install openssh-server

This will start it automatically. In case you need that later, this is how to start it:

$ sudo /etc/init.d/ssh start
Starting OpenBSD Secure Shell server: sshd.

Login with password

Now you can log in to the server with your regular Linux user password:

$ ssh alex@localhost
The authenticity of host 'localhost (' can't be established.
RSA key fingerprint is b6:96:06:e1:fb:f1:9f:23:40:32:ac:cb:ac:c9:bc:12.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'localhost' (RSA) to the list of known hosts.
alex@localhost's password:  
Linux maison #1 SMP Tue Oct 5 08:36:28 CEST 2010 i686

The programs included with the Debian GNU/Linux system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.

Debian GNU/Linux comes with ABSOLUTELY NO WARRANTY, to the extent
permitted by applicable law.
You have mail.
Last login: Sun Jan  2 13:32:46 2011 from localhost
$ exit

On the client, you want to generate an SSH key:

$ ssh-keygen -t rsa
Generating public/private rsa key pair.
Enter file in which to save the key (/home/alex/.ssh/id_rsa):    
Enter passphrase (empty for no passphrase): 
Enter same passphrase again: 
Your identification has been saved in /home/alex/.ssh/id_rsa.
Your public key has been saved in /home/alex/.ssh/
The key fingerprint is:
47:79:b7:e8:e4:25:1b:8d:f6:0a:86:36:67:fa:1d:94 alex@maison
The key's randomart image is:
+--[ RSA 2048]----+
|                 |
|           .     |
|          o . .  |
|         . . * . |
|        S . E +  |
|         o * *   |
|        + = = .  |
|       . * o o   |
|        ... o    |
$ chmod 755 ~/.ssh

Copy the public key to the SSH server.

$ scp ~/.ssh/ alex@localhost:.ssh/authorized_keys
alex@localhost's password:                                                                                                                                       100%  393     0.4KB/s   00:00

Now on the server:

$ su alex
$ chmod 600 ~/.ssh/authorized_keys

Login with passphrase

From the client, login to the server with your passphrase:

$ ssh alex@localhost
Enter passphrase for key '/home/alex/.ssh/id_rsa': 
Linux maison #1 SMP Tue Oct 5 08:36:28 CEST 2010 i686

The programs included with the Debian GNU/Linux system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.

Debian GNU/Linux comes with ABSOLUTELY NO WARRANTY, to the extent
permitted by applicable law.
You have mail.
Last login: Wed Dec 29 15:57:24 2010 from localhost
$ exit

On the client, start an SSH agent to avoid typing your passphrase in the future:

$ exec /usr/bin/ssh-agent $SHELL
$ ssh-add
Enter passphrase for /home/alex/.ssh/id_rsa: 
Identity added: /home/alex/.ssh/id_rsa (/home/alex/.ssh/id_rsa)
$ nohup ssh-agent -s > ~/.ssh-agent
nohup: ignoring input and redirecting stderr to stdout

Login without password

Now log in to the server. It should not ask for a password or passphrase.

$ ssh alex@localhost
Linux maison #1 SMP Tue Oct 5 08:36:28 CEST 2010 i686

The programs included with the Debian GNU/Linux system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.

Debian GNU/Linux comes with ABSOLUTELY NO WARRANTY, to the extent
permitted by applicable law.
You have mail.
Last login: Sun Jan  2 13:41:59 2011 from localhost
$ exit


Now that we have set up SSH, we can configure HBase.

Configure HBase

Change HBase properties in conf/hbase-site.xml according to your needs. For example you can change the directory where the data is stored.


Start HBase

$ bin/ 
localhost: starting zookeeper, logging to /home/alex/java/ext/hbase-0.20.6/bin/../logs/hbase-alex-zookeeper-maison.out
starting master, logging to /home/alex/java/ext/hbase-0.20.6/logs/hbase-alex-master-maison.out
localhost: starting regionserver, logging to /home/alex/java/ext/hbase-0.20.6/bin/../logs/hbase-alex-regionserver-maison.out

These 2 Java processes are now running on the machine:

  • org.apache.hadoop.hbase.zookeeper.HQuorumPeer
  • org.apache.hadoop.hbase.master.HMaster


Friday, December 17, 2010

Build Nutch 2.0

Testing Nutch 2.0 under Eclipse

Table of Contents
Setup the projects in Eclipse
  Install plugins
  Check out SVN directories
  Build the projects
    JUnit Tests
    Nutch Commands
      Running Nutch classes from Eclipse
Crawl script


This is a guide on setting up Nutch 2 in an Eclipse project. You will then be able to hack the code and run tests, especially JUnit ones, pretty easily.

If you already have the Nutch branch or the Gora trunk checked out, please run
$ svn up

in order to have the most up-to-date version. Several fixes appear in this document as diff output. To apply a fix, save the diff content to a file in the root directory and run the patch command:

$ patch -p0 < myDiff

Setup the projects in Eclipse

The idea is to be able to improve Nutch and Gora code comfortably, with the help of the Eclipse IDE. Add the source directories (and why not the test directories) of the modules you want to work on to the Java Build Path. Manage the dependencies with the Ivy or Maven plugins to resolve the external libraries. Then update the code, optionally run a few JUnit tests, run the Ant task that builds the project, and submit a patch. This is the easiest and fastest way to stay productive.

Install plugins

Install the Subclipse, IvyDE and m2e plugins if you don't have them yet.
Help > Install New Software ...

Add the following urls:

Check out SVN directories

Check-out Nutch branch and Gora trunk versions using the SVN wizard, with the following urls

File > New > Project ...

Note that you can just create a Java project and check out Nutch source with svn command, if you don't like SVN Eclipse plugin:

$ cd ~/java/workspace/NutchGora
$ svn co branch

Build the projects

Window > Show View > Ant
Drag and drop the build.xml files in the Ant Eclipse tab.
Just double click on the Gora and Nutch items in the Ant view. That will run the default task. For Gora, it will publish the modules to the Ivy local repository. For Nutch, it will build a "release" in the runtime/local directory.


Within the Nutch project, we want to manage the dependencies with the Ivy plugin, not the Maven one.

  • The call to "nutch.root" property set in build.xml for ant should be replaced in src/plugin/protocol-sftp/ivy.xml by the built-in "basedir" ivy property. I am not sure how to load a property in Eclipse Ivy plugin. This will break the build, so be sure to replace it back when running Ant tasks.

These are the Ivy configuration tweaks:

Index: src/plugin/protocol-sftp/ivy.xml
--- src/plugin/protocol-sftp/ivy.xml    (revision 1177967)
+++ src/plugin/protocol-sftp/ivy.xml    (working copy)
@@ -27,7 +27,7 @@
-    <include file="${nutch.root}/ivy/ivy-configurations.xml"/>
+    <include file="${basedir}/branch/ivy/ivy-configurations.xml"/>
Index: ivy/ivy.xml
--- ivy/ivy.xml (revision 1177967)
+++ ivy/ivy.xml (working copy)
@@ -21,7 +21,7 @@
-               <include file="${basedir}/ivy/ivy-configurations.xml" />
+               <include file="${basedir}/branch/ivy/ivy-configurations.xml" />
@@ -58,8 +58,9 @@
                  <dependency org="org.apache.tika" name="tika-parsers" rev="0.9" />
+               <!--
                <dependency org="org.apache.gora" name="gora-core" rev="0.1.1-incubating" conf="*->compile"/>
                <dependency org="log4j" name="log4j" rev="1.2.15" conf="*->master" />
                <dependency org="xerces" name="xercesImpl" rev="2.9.1" />

Now, right click on ivy/ivy.xml, "Add Ivy Library ...". Do the same for src/plugin/protocol-sftp/ivy.xml.

Remove the default src directory as a Source entry in the Java Build Path if it exists. Add at least the "java", "test" and "resources" source files (which are src/java, src/test and conf) so that they get included in the classpath. That will allow us to run the classes or tests later from Eclipse.

This is what the project tree looks like:


The datastore holds the information Nutch crawls from the web. You can opt for a relational database system or a column-oriented NoSQL store. Thanks to the Gora interface, you can use any backend you might be familiar with: HSQL, MySQL, HBase, Cassandra ...


HSQL is the default Gora backend. Make sure your Ivy configuration contains the dependency:
<dependency org="org.hsqldb" name="hsqldb" rev="2.0.0" conf="*->default"/>

This is the content of conf/


Set up HSQL. I downloaded this version: HSQLDB 2.0.0. Finally, start the HSQL server with the same database alias, called "nutchtest":

~/java/ext/hsqldb-2.0.0/hsqldb/data$ java -cp ../lib/hsqldb.jar org.hsqldb.server.Server --database.0 file:crawldb --dbname.0 nutchtest
[Server@12ac982]: [Thread[main,5,main]]: checkRunning(false) entered
[Server@12ac982]: [Thread[main,5,main]]: checkRunning(false) exited
[Server@12ac982]: Startup sequence initiated from main() method
[Server@12ac982]: Loaded properties from [/home/alex/java/ext/hsqldb-2.0.0/hsqldb/data/]
[Server@12ac982]: Initiating startup sequence...
[Server@12ac982]: Server socket opened successfully in 5 ms.
[Server@12ac982]: Database [index=0, id=0, db=file:crawldb, alias=nutchtest] opened sucessfully in 420 ms.
[Server@12ac982]: Startup sequence completed in 426 ms.
[Server@12ac982]: 2011-01-12 10:41:56.181 HSQLDB server 2.0.0 is online on port 9001
[Server@12ac982]: To close normally, connect and execute SHUTDOWN SQL
[Server@12ac982]: From command line, use [Ctrl]+[C] to abort abruptly


To use MySQL as datastore, add the dependency to ivy/ivy.xml:

<dependency org="mysql" name="mysql-connector-java" rev="5.1.13" conf="*->default"></dependency>

Change conf/ to set up the MySQL connection:


You will notice MySQL is a lot faster than HSQL. It was at least 12x faster in my case with the default setups. For example, injecting 50k urls took 6 min with HSQL versus 30 sec with MySQL. You could make a similar comparison in Rails with SQLite and MySQL ...


Add the gora-hbase and zookeeper dependencies in ivy/ivy.xml:

<dependency org="org.apache.gora" name="gora-hbase" rev="0.1" conf="*->compile">
  <exclude org="com.sun.jdmk"/>
  <exclude org="com.sun.jmx"/>
  <exclude org="javax.jms"/>
</dependency>
<dependency org="org.apache.zookeeper" name="zookeeper" rev="3.3.2" conf="*->default">
  <exclude org="com.sun.jdmk"/>
  <exclude org="com.sun.jmx"/>
  <exclude org="javax.jms"/>
</dependency>

After rebuilding Nutch, you will need to manually add the hbase jar to the runtime/local/lib directory, since a recent version (0.20.6 ?) is not available in the Maven repository:

$ cp $HBASE_HOME/hbase-*.jar $NUTCH_HOME/runtime/local/lib

Create the $NUTCH_HOME/runtime/local/conf/gora-hbase-mapping.xml file as described in GORA_HBase wiki page. Overwrite property in runtime/local/conf/nutch-site.xml:


Finally, set up and run HBase. See this blog entry as a quick guide.


Start Cassandra:

$ bin/cassandra -f

Run the injector job with the Gora store set to Cassandra in nutch-site.xml.

  <description>Default class for storing data</description>

Please place gora-cassandra-mapping.xml in your Nutch conf directory, which is included in the classpath. This configuration defines how Avro fields are stored in Cassandra. This is the content of the gora-cassandra-mapping.xml file:
 <keyspace name="webpage" cluster="Test Cluster" host="localhost">
  <family name="p"/>
  <family name="f"/>
  <family name="sc" type="super"/>
 </keyspace>
 <class keyClass="java.lang.String" name="">
  <!-- fetch fields -->
  <field name="baseUrl" family="f" qualifier="bas"/>
  <field name="status" family="f" qualifier="st"/>
  <field name="prevFetchTime" family="f" qualifier="pts"/>
  <field name="fetchTime" family="f" qualifier="ts"/>
  <field name="fetchInterval" family="f" qualifier="fi"/>
  <field name="retriesSinceFetch" family="f" qualifier="rsf"/>
  <field name="reprUrl" family="f" qualifier="rpr"/>
  <field name="content" family="f" qualifier="cnt"/>
  <field name="contentType" family="f" qualifier="typ"/>
  <field name="modifiedTime" family="f" qualifier="mod"/>
  <!-- parse fields -->
  <field name="title" family="p" qualifier="t"/>
  <field name="text" family="p" qualifier="c"/>
  <field name="signature" family="p" qualifier="sig"/>
  <field name="prevSignature" family="p" qualifier="psig"/>
  <!-- score fields -->
  <field name="score" family="f" qualifier="s"/>
  <!-- super columns -->
  <field name="markers" family="sc" qualifier="mk"/>
  <field name="inlinks" family="sc" qualifier="il"/>
  <field name="outlinks" family="sc" qualifier="ol"/>
  <field name="metadata" family="sc" qualifier="mtdt"/>
  <field name="headers" family="sc" qualifier="h"/>
  <field name="parseStatus" family="sc" qualifier="pas"/>
  <field name="protocolStatus" family="sc" qualifier="prs"/>
 </class>
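
To see how this mapping relates to the column names that show up in cassandra-cli, here is a minimal sketch that reads the field elements with the JDK's DOM parser and pairs each Avro field with its family:qualifier column. It is not how Gora itself loads the file; the MappingSketch class and the <mapping> root element used in main are assumptions made for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

// Hypothetical helper for illustration; not part of Gora or Nutch.
final class MappingSketch {

    // Returns field name -> "family:qualifier" for every <field> element,
    // in document order.
    static Map<String, String> fieldColumns(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList fields = doc.getElementsByTagName("field");
            Map<String, String> columns = new LinkedHashMap<>();
            for (int i = 0; i < fields.getLength(); i++) {
                Element field = (Element) fields.item(i);
                columns.put(field.getAttribute("name"),
                        field.getAttribute("family") + ":" + field.getAttribute("qualifier"));
            }
            return columns;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("cannot parse mapping", e);
        }
    }

    public static void main(String[] args) {
        // A three-field excerpt of the mapping, wrapped in an assumed root element.
        String xml = "<mapping>"
                + "<field name=\"fetchTime\" family=\"f\" qualifier=\"ts\"/>"
                + "<field name=\"title\" family=\"p\" qualifier=\"t\"/>"
                + "<field name=\"outlinks\" family=\"sc\" qualifier=\"ol\"/>"
                + "</mapping>";
        fieldColumns(xml).forEach((field, column) -> System.out.println(field + " -> " + column));
    }
}
```

With this reading, the fetchTime field lands in column ts of family f, which is the column name to look for when listing that family.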

Check the data that has been initialized and gets populated.

$ bin/cassandra-cli --host localhost
[default@unknown] use webpage;
Authenticated to keyspace: webpage
[default@webpage] update column family p with key_validation_class=UTF8Type;                                                                                                       
Waiting for schema agreement...
... schemas agree across the cluster
[default@webpage] update column family f with key_validation_class=UTF8Type;
Waiting for schema agreement...
... schemas agree across the cluster
[default@webpage] update column family sc with key_validation_class=UTF8Type;
Waiting for schema agreement...
... schemas agree across the cluster
[default@webpage] list f;
Using default limit of 100
RowKey: com.truveo.www:http/
=> (column=fi, value=2592000, timestamp=1311532210076000)
=> (column=s, value=1.0, timestamp=1311532210080000)
=> (column=ts, value=1311532203790, timestamp=1311532209796000)
RowKey: com.blogspot.techvineyard:http/
=> (column=fi, value=2592000, timestamp=1311532210134000)
=> (column=s, value=1.0, timestamp=1311532210137000)
=> (column=ts, value=1311532203790, timestamp=1311532210131000)
=> (column=fi, value=2592000, timestamp=1311532210146000)
=> (column=s, value=1.0, timestamp=1311532210149000)
=> (column=ts, value=1311532203790, timestamp=1311532210144000)

3 Rows Returned.

JUnit Tests

Let's run a few unit tests to verify the setup.


We want to make the GoraStorage test pass. First apply this patch, which tests your datastore setting with a much smaller load (fewer workers and records, and no multi-process test) so that it does not crash an old laptop with limited capacity.

Index: src/test/org/apache/nutch/storage/
--- src/test/org/apache/nutch/storage/      (revision 1053817)
+++ src/test/org/apache/nutch/storage/      (working copy)
@@ -1,22 +1,17 @@
-import java.util.ArrayList;
 import java.util.BitSet;
-import java.util.Iterator;
-import java.util.List;
 import java.util.Random;
-import java.util.Vector;
 import java.util.concurrent.atomic.AtomicInteger;
+import junit.framework.TestCase;
 import org.apache.avro.util.Utf8;
-import org.apache.hadoop.conf.Configuration;
-import org.apache.nutch.util.NutchConfiguration;
 import org.apache.gora.query.Result;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.nutch.util.NutchConfiguration;
-import junit.framework.TestCase;
 public class TestGoraStorage extends TestCase {
   Configuration conf;
@@ -80,8 +75,8 @@
   private AtomicInteger threadCount = new AtomicInteger(0);
   public void testMultithread() throws Exception {
-    int COUNT = 1000;
-    int NUM = 100;
+    int COUNT = 50;
+    int NUM = 5;
     DataStore<String,WebPage> store;
     for (int i = 0; i < NUM; i++) {
@@ -113,115 +108,4 @@
     assertEquals(size, keys.cardinality());
-  public void testMultiProcess() throws Exception {
-    int COUNT = 1000;
-    int NUM = 100;
-    DataStore<String,WebPage> store;
-    List<Process> procs = new ArrayList<Process>();
-    for (int i = 0; i < NUM; i++) {
-      Process p = launch(i, i * COUNT, COUNT);
-      procs.add(p);
-    }
-    while (procs.size() > 0) {
-      try {
-        Thread.sleep(5000);
-      } catch (Exception e) {};
-      Iterator<Process> it = procs.iterator();
-      while (it.hasNext()) {
-        Process p = it.next();
-        int code = 1;
-        try {
-          code = p.exitValue();
-          assertEquals(0, code);
-          it.remove();
-          p.destroy();
-        } catch (IllegalThreadStateException e) {
-          // not ready yet
-        }
-      }
-      System.out.println("* running " + procs.size() + "/" + NUM);
-    }
-    System.out.println("Verifying...");
-    store = StorageUtils.createDataStore(conf, String.class, WebPage.class);
-    Result<String,WebPage> res = store.execute(store.newQuery());
-    int size = COUNT * NUM;
-    BitSet keys = new BitSet(size);
-    while (res.next()) {
-      String key = res.getKey();
-      WebPage p = res.get();
-      assertEquals(key, p.getTitle().toString());
-      int pos = Integer.parseInt(key);
-      assertTrue(pos < size && pos >= 0);
-      if (keys.get(pos)) {
-        fail("key " + key + " already set!");
-      }
-      keys.set(pos);
-    }
-    if (size != keys.cardinality()) {
-      System.out.println("ERROR Missing keys:");
-      for (int i = 0; i < size; i++) {
-        if (keys.get(i)) continue;
-        System.out.println(" " + i);
-      }
-      fail("key count should be " + size + " but is " + keys.cardinality());
-    }
-  }
-  private Process launch(int id, int start, int count) throws Exception {
-    //  Build exec child jmv args.
-    Vector<String> vargs = new Vector<String>(8);
-    File jvm =                                  // use same jvm as parent
-      new File(new File(System.getProperty("java.home"), "bin"), "java");
-    vargs.add(jvm.toString());
-    // Add child (task) java-vm options.
-    // tmp dir
-    String prop = System.getProperty("");
-    vargs.add("" + prop);
-    // library path
-    prop = System.getProperty("java.library.path");
-    if (prop != null) {
-      vargs.add("-Djava.library.path=" + prop);      
-    }
-    // working dir
-    prop = System.getProperty("user.dir");
-    vargs.add("-Duser.dir=" + prop);    
-    // combat the stupid Xerces issue
-    vargs.add("");
-    // prepare classpath
-    String sep = System.getProperty("path.separator");
-    StringBuffer classPath = new StringBuffer();
-    // start with same classpath as parent process
-    classPath.append(System.getProperty("java.class.path"));
-    //classPath.append(sep);
-    // Add classpath.
-    vargs.add("-classpath");
-    vargs.add(classPath.toString());
-    // append class name and args
-    vargs.add(TestGoraStorage.class.getName());
-    vargs.add(String.valueOf(id));
-    vargs.add(String.valueOf(start));
-    vargs.add(String.valueOf(count));
-    ProcessBuilder builder = new ProcessBuilder(vargs);
-    return builder.start();
-  }
-  public static void main(String[] args) throws Exception {
-    if (args.length < 3) {
-      System.err.println("Usage: TestGoraStore <id> <startKey> <numRecords>");
-      System.exit(-1);
-    }
-    TestGoraStorage test = new TestGoraStorage();
-    test.init();
-    int id = Integer.parseInt(args[0]);
-    int start = Integer.parseInt(args[1]);
-    int count = Integer.parseInt(args[2]);
-    Worker w = Worker(id, start, count, true);
-    System.exit(0);
-  }
Index: src/test/org/apache/nutch/util/
--- src/test/org/apache/nutch/util/       (revision 1053817)
+++ src/test/org/apache/nutch/util/       (working copy)
@@ -16,28 +16,14 @@
 package org.apache.nutch.util;
-import java.nio.ByteBuffer;
-import java.util.ArrayList;
-import java.util.List;
 import junit.framework.TestCase;
-import org.apache.avro.util.Utf8;
 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.fs.FileSystem;
 import org.apache.hadoop.fs.Path;
-import org.apache.nutch.crawl.URLWebPage;
-import org.apache.nutch.util.TableUtil;
-import org.apache.gora.query.Query;
-import org.apache.gora.query.Result;
-import org.apache.gora.util.ByteUtils;
  * This class provides common routines for setup/teardown of an in-memory data
@@ -55,16 +41,12 @@
   public void setUp() throws Exception {
     conf = CrawlTestUtil.createConfiguration();
-    conf.set("", "");
     fs = FileSystem.get(conf);
-    // using hsqldb in memory
-    // use separate in-memory db-s for tests
-"gora.sqlstore.jdbc.url","jdbc:hsqldb:mem:" + getClass().getName());
     webPageStore = StorageUtils.createWebStore(conf, String.class,
+    // empty the datastore
+    webPageStore.deleteByQuery(webPageStore.newQuery());
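
The patched testMultithread boils down to this pattern: NUM workers each write COUNT consecutive keys, then a single pass over the store marks each key in a BitSet and compares its cardinality with NUM * COUNT. Here is a minimal standalone sketch of that pattern, assuming a ConcurrentHashMap as a stand-in for the Gora DataStore; the class and method names are hypothetical and nothing here touches Nutch or Gora.

```java
import java.util.BitSet;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for the datastore test; illustration only.
final class MultithreadStoreCheck {

    // Starts num workers that each write count consecutive keys, then
    // verifies the keys with a BitSet and returns how many were seen.
    static int writeAndVerify(int num, int count) {
        Map<String, String> store = new ConcurrentHashMap<>();
        Thread[] workers = new Thread[num];
        for (int i = 0; i < num; i++) {
            final int start = i * count;
            workers[i] = new Thread(() -> {
                for (int k = start; k < start + count; k++) {
                    String key = String.valueOf(k);
                    store.put(key, key); // the value mirrors the key, like the page title in the test
                }
            });
            workers[i].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for workers", e);
            }
        }

        int size = num * count;
        BitSet keys = new BitSet(size);
        for (Map.Entry<String, String> entry : store.entrySet()) {
            int pos = Integer.parseInt(entry.getKey());
            if (pos < 0 || pos >= size) {
                throw new IllegalStateException("key out of range: " + pos);
            }
            if (!entry.getKey().equals(entry.getValue())) {
                throw new IllegalStateException("value does not match key " + pos);
            }
            keys.set(pos);
        }
        return keys.cardinality();
    }

    public static void main(String[] args) {
        int num = 5;    // same values as the patched test
        int count = 50;
        System.out.println(writeAndVerify(num, count) + " of " + (num * count) + " keys found");
    }
}
```

A missing key shows up as a cardinality lower than NUM * COUNT, which is the condition the real test asserts on.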


We can try to run the Fetcher test as well.

  • Change the location of the static files that will be returned to the Nutch crawler by the Jetty server, from "build/test/data/fetch-test-site" to "src/testresources/fetch-test-site"
  • Also override the plugin directory setting for testing purposes.
  • Set and http.robots.agents properties.
  • Limit the content length to the maximum for a blob column type. This is only required for MySQL.

Index: src/test/nutch-site.xml               
--- src/test/nutch-site.xml     (revision 1053817)
+++ src/test/nutch-site.xml     (working copy)
@@ -22,4 +22,20 @@
<description>Default class for storing data</description>
+       <property>
+         <name>plugin.folders</name>
+         <value>build/plugins</value>
+       </property>
+       <property>
+         <name></name>
+         <value>NutchRobot</value>
+       </property>
+       <property>
+         <name>http.robots.agents</name>
+         <value>NutchRobot,*</value>
+       </property>
+       <property>
+         <name>http.content.limit</name>
+         <value>65535</value>
+       </property>
Index: src/test/org/apache/nutch/fetcher/
--- src/test/org/apache/nutch/fetcher/  (revision 1050697)
+++ src/test/org/apache/nutch/fetcher/  (working copy)
@@ -50,7 +50,7 @@
   public void setUp() throws Exception{
     urlPath = new Path(testdir, "urls");
-    server = CrawlTestUtil.getServer(conf.getInt("content.server.port",50000), "build/test/data/fetch-test-site");
+    server = CrawlTestUtil.getServer(conf.getInt("content.server.port",50000), "src/testresources/fetch-test-site");

Now right click on the org.apache.nutch.fetcher.TestFetcher class located in the src/test source directory, then "Run As" > "JUnit Test".

Nutch Commands

Several commands are available to maintain and index your crawl. Here are the possible options from the Bash script:

~/java/workspace/Nutch2.0/runtime/local$ bin/nutch
Usage: nutch [-core] COMMAND
where COMMAND is one of:
 inject     inject new urls into the database
 generate   generate new segments to fetch from crawl db
 fetch      fetch URLs marked during generate
 parse      parse URLs marked during fetch
 updatedb   update web table after parsing
 readdb     read/dump records from page database
 solrindex  run the solr indexer on parsed segments and linkdb
 solrdedup  remove duplicates from solr
 plugin     load a plugin and run one of its classes main()
 CLASSNAME  run the class named CLASSNAME
Most commands print help when invoked w/o parameters.

Expert: -core option is for developers only. It avoids building the job jar, 
 instead it simply includes classes compiled with ant compile-core. 
 NOTE: this works only for jobs executed in 'local' mode

Running Nutch classes from Eclipse

You can either run a command with the Bash script or execute a Nutch class directly from Eclipse. The latter is easier for development since you do not need to build the whole project each time you change something. When a Nutch class is executed, it first loads the configuration by looking in the classpath for a nutch-site.xml file that overrides nutch-default.xml. Depending on the order of the "src/test" and "conf" source directories in your Eclipse build path, only one nutch-site.xml file ends up on the classpath. In my case, it was the one located in "src/test". If I edit the one in "conf", I see the warning

The resource is a duplicate of src/test/nutch-default.xml and was not copied to the output folder.

which indicates it will be ignored. So you want to edit the one that is activated.

  • Apply the modifications to src/test/nutch-site.xml (or conf/nutch-site.xml, depending on your classpath order setting) that are given in the Fetch Test section from above.
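
To check which copy is active, you can ask the class loader directly. This is a small sketch, assuming the configuration files are resolved as classpath resources (which is how Hadoop's Configuration looks them up); the WhichConfig class is a hypothetical helper, not part of Nutch.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.util.Collections;
import java.util.List;

// Hypothetical helper for illustration; not part of Nutch.
final class WhichConfig {

    // Lists every classpath location that provides the named resource,
    // in class loader search order.
    static List<URL> locate(String resource) {
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        if (loader == null) {
            loader = WhichConfig.class.getClassLoader();
        }
        try {
            return Collections.list(loader.getResources(resource));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        for (String name : new String[] {"nutch-default.xml", "nutch-site.xml"}) {
            List<URL> hits = locate(name);
            System.out.println(name + ": " + (hits.isEmpty() ? "not on the classpath" : hits.toString()));
        }
    }
}
```

Run it with the same classpath as your Eclipse launch configuration. If a name is listed with more than one location, the first one is normally the one a plain getResource call returns.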

crawl / org.apache.nutch.crawl.Crawler

~/java/workspace/Nutch2.0/runtime/local$ bin/nutch crawl
Usage: Crawl (<seedDir> | -continue) [-solr <solrURL>] [-threads n] [-depth i] [-topN N]

Right click on org.apache.nutch.crawl.Crawler in src/java source directory. Then "Run As" > "Java Application"

  1. The first argument called "seedDir" is the path to a directory containing lists of seed urls. They will be injected to the database. They define a forest of pages that will be visited by the crawler during the first iteration of the graph exploration. Then the crawler will expand the graph by adding neighbours to these pages when extracting new urls out of the page content. These new pages should then be visited in the second iteration.
  2. The -continue parameter instead resumes the crawl without injecting any seeds.
  3. -solr defines the solr server used to index the documents
  4. -threads defines the number of threads spawned to fetch several pages simultaneously.
  5. -depth defines the number of iterations in the graph exploration, before the traversal gets pruned.
  6. -topN limits the number of urls that get downloaded in one iteration.
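
The interplay of seedDir, -depth and -topN can be sketched on a toy in-memory link graph. This is only an illustration of the iteration scheme described in the list, under simplifying assumptions: the CrawlSketch class and its graph are hypothetical, urls are picked in insertion order whereas Nutch ranks candidates by score, and fetching is replaced by a map lookup.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical toy crawler for illustration; not Nutch code.
final class CrawlSketch {

    // Returns the urls fetched after at most `depth` iterations,
    // fetching at most `topN` urls per iteration.
    static Set<String> crawl(Map<String, List<String>> links,
                             List<String> seeds, int depth, int topN) {
        Set<String> known = new LinkedHashSet<>(seeds); // inject: seeds enter the database
        Set<String> fetched = new LinkedHashSet<>();
        for (int iteration = 0; iteration < depth; iteration++) {
            // generate: pick up to topN known urls that were not fetched yet
            List<String> batch = new ArrayList<>();
            for (String url : known) {
                if (batch.size() >= topN) {
                    break;
                }
                if (!fetched.contains(url)) {
                    batch.add(url);
                }
            }
            if (batch.isEmpty()) {
                break; // nothing left to visit
            }
            // fetch, parse and update: record the page and its outlinks
            for (String url : batch) {
                fetched.add(url);
                known.addAll(links.getOrDefault(url, Collections.<String>emptyList()));
            }
        }
        return fetched;
    }

    // A small diamond-shaped graph: a links to b and c, which both link to d.
    static Map<String, List<String>> demoGraph() {
        Map<String, List<String>> links = new LinkedHashMap<>();
        links.put("a", Arrays.asList("b", "c"));
        links.put("b", Arrays.asList("d"));
        links.put("c", Arrays.asList("d"));
        links.put("d", Collections.<String>emptyList());
        return links;
    }

    public static void main(String[] args) {
        System.out.println(crawl(demoGraph(), Arrays.asList("a"), 2, 10));
    }
}
```

With the seed a and a depth of 2, page d is discovered but not fetched: it would only be visited in a third iteration. Lowering topN to 1 further restricts each iteration to a single page.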

Let's create some input to the crawl command. This is the content of a seeds/urls file that we can use for the demo:

I used MySQL as a datastore. Let's clear it if the webpage table exists before running the crawl command.

$ mysql -hlocalhost -ualex -psome_pass nutch
mysql> delete from webpage;

From the Eclipse menu:

Run > Run Configurations ...

Click Run. You can compare your output with my logs here. Then check the content of the MySQL table:

mysql> describe webpage;
| Field             | Type           | Null | Key | Default | Extra |
| id                | varchar(512)   | NO   | PRI | NULL    |       |
| headers           | blob           | YES  |     | NULL    |       |
| text              | varchar(32000) | YES  |     | NULL    |       |
| status            | int(11)        | YES  |     | NULL    |       |
| markers           | blob           | YES  |     | NULL    |       |
| parseStatus       | blob           | YES  |     | NULL    |       |
| modifiedTime      | bigint(20)     | YES  |     | NULL    |       |
| score             | float          | YES  |     | NULL    |       |
| typ               | varchar(32)    | YES  |     | NULL    |       |
| baseUrl           | varchar(512)   | YES  |     | NULL    |       |
| content           | blob           | YES  |     | NULL    |       |
| title             | varchar(512)   | YES  |     | NULL    |       |
| reprUrl           | varchar(512)   | YES  |     | NULL    |       |
| fetchInterval     | int(11)        | YES  |     | NULL    |       |
| prevFetchTime     | bigint(20)     | YES  |     | NULL    |       |
| inlinks           | blob           | YES  |     | NULL    |       |
| prevSignature     | blob           | YES  |     | NULL    |       |
| outlinks          | blob           | YES  |     | NULL    |       |
| fetchTime         | bigint(20)     | YES  |     | NULL    |       |
| retriesSinceFetch | int(11)        | YES  |     | NULL    |       |
| protocolStatus    | blob           | YES  |     | NULL    |       |
| signature         | blob           | YES  |     | NULL    |       |
| metadata          | blob           | YES  |     | NULL    |       |
23 rows in set (0.14 sec)

mysql> select count(*) from webpage;
| count(*) |
|      151 |
1 row in set (0.00 sec)

mysql> select id, markers from webpage where content is not null;
| id                              | markers                                  |
|     | _injmrk_y_updmrk_*1294943864-1806760603  |
| com.blogspot.techvineyard:http/ | _injmrk_y_updmrk_*1294943864-1806760603  |
| com.truveo.www:http/            | _injmrk_y_updmrk_*1294943864-1806760603  |
3 rows in set (0.00 sec)

readdb / org.apache.nutch.crawl.WebTableReader

~/java/workspace/Nutch2.0/runtime/local$ bin/nutch readdb
Usage: WebTableReader (-stats | -url [url] | -dump <out_dir> [-regex regex]) [-crawlId <id>] [-content] [-headers] [-links] [-text]
        -crawlId <id>    the id to prefix the schemas to operate on, (default:
        -stats [-sort]  print overall statistics to System.out
                [-sort] list status sorted by host
        -url <url>      print information on <url> to System.out
        -dump <out_dir> [-regex regex]  dump the webtable to a text file in <out_dir>
                -content        dump also raw content
                -headers        dump protocol headers
                -links  dump links
                -text   dump extracted text
                [-regex]        filter on the URL of the webtable entry

The WebTableReader class scans the entire database via a Hadoop job that outputs all the fields.

inject / org.apache.nutch.crawl.InjectorJob

~/java/workspace/Nutch2.0/runtime/local$ bin/nutch inject
Usage: InjectorJob <url_dir> [-crawlId <id>]

First, we need to initialize the crawl db. The "url_dir" argument to the inject command is a directory containing flat files of lists of urls, used as "seeds".

generate / org.apache.nutch.crawl.GeneratorJob

~/java/workspace/Nutch2.0/runtime/local$ bin/nutch generate
GeneratorJob: Selecting best-scoring urls due for fetch.
GeneratorJob: starting
GeneratorJob: filtering: true
GeneratorJob: done
GeneratorJob: generated batch id: 1294943864-1806760603

This step generates a batch id that identifies the urls selected to be fetched.
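If you drive the jobs from Java rather than from the shell, the batch id can be pulled out of the generator output with a regular expression. The sketch below assumes the log line has the "generated batch id: ..." form shown above; the class name BatchId is made up for the example.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper that reads the batch id from a GeneratorJob log line.
class BatchId {

    // Captures the token that follows "batch id: "
    private static final Pattern BATCH_ID = Pattern.compile("batch id: (\\S+)");

    static Optional<String> fromLog(String logLine) {
        Matcher m = BATCH_ID.matcher(logLine);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        String line = "GeneratorJob: generated batch id: 1294943864-1806760603";
        System.out.println(fromLog(line).orElse("no batch id found"));
    }
}
```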

fetch / org.apache.nutch.fetcher.FetcherJob

~/java/workspace/Nutch2.0/runtime/local$ bin/nutch fetch
Usage: FetcherJob (<batchId> | -all) [-crawlId <id>] [-threads N] [-parse] [-resume] [-numTasks N]
        batchId crawl identifier returned by Generator, or -all for all generated batchId-s
        -crawlId <id>    the id to prefix the schemas to operate on, (default:
        -threads N      number of fetching threads per task
        -parse  if specified then fetcher will immediately parse fetched content
        -resume resume interrupted job
        -numTasks N     if N > 0 then use this many reduce tasks for fetching (default:

parse / org.apache.nutch.parse.ParserJob

~/java/workspace/Nutch2.0/runtime/local$ bin/nutch parse
Usage: ParserJob (<batchId> | -all) [-crawlId <id>] [-resume] [-force]
        batchId symbolic batch ID created by Generator
        -crawlId <id>    the id to prefix the schemas to operate on, (default:
        -all    consider pages from all crawl jobs
        -resume resume a previous incomplete job
        -force  force re-parsing even if a page is already parsed

Once we have a local copy of the web pages, we need to parse them to extract the keywords and the links each web page points to. This parsing task is delegated to Tika.
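To give an idea of what "extracting links" means, here is a deliberately naive sketch that pulls href values out of an HTML string with a regular expression. It is not how Tika or Nutch parse pages (a real HTML parser handles unquoted attributes, relative urls, broken markup and so on); the class name LinkSketch is invented for the illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Naive outlink extraction, for illustration only.
class LinkSketch {

    // Matches href="..." with double quotes, ignoring case
    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    static List<String> outlinks(String html) {
        List<String> result = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<a href=\"http://a.example/\">A</a> <A HREF=\"/b\">B</A>";
        System.out.println(outlinks(html)); // [http://a.example/, /b]
    }
}
```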

updatedb / org.apache.nutch.crawl.DbUpdaterJob

~/java/workspace/Nutch2.0/runtime/local$ bin/nutch updatedb

solrindex / org.apache.nutch.indexer.solr.SolrIndexerJob

The indexing task is now delegated to Solr, which is a server using Lucene indexes that will make the crawled documents searchable by indexing the data posted via HTTP. I ran into a few caveats before making it work. This is the suggested patch.

  • Avoid multiple values for id field.
  • Allow multiple values for tag field. Add tld (Top Level Domain) field.
  • Get the content-type from WebPage object's member. Otherwise, you will see NullPointerExceptions.
  • Compare strings with equals instead of !=. That's a pretty random one, but it avoids some surprises.

Index: conf/solrindex-mapping.xml
--- conf/solrindex-mapping.xml  (revision 1053817)
+++ conf/solrindex-mapping.xml  (working copy)
@@ -39,8 +39,7 @@
                <field dest="boost" source="boost"/>
                <field dest="digest" source="digest"/>
                <field dest="tstamp" source="tstamp"/>
-               <field dest="id" source="url"/>
-               <copyField source="url" dest="url"/>
+               <field dest="url" source="url"/>
Index: conf/schema.xml
--- conf/schema.xml     (revision 1053817)
+++ conf/schema.xml     (working copy)
@@ -95,12 +95,15 @@
         <!-- fields for feed plugin -->
         <field name="author" type="string" stored="true" indexed="true"/>
-        <field name="tag" type="string" stored="true" indexed="true"/>
+        <field name="tag" type="string" stored="true" indexed="true" multiValued="true"/>
         <field name="feed" type="string" stored="true" indexed="true"/>
         <field name="publishedDate" type="string" stored="true"
         <field name="updatedDate" type="string" stored="true"
+        <field name="tld" type="string" stored="false" indexed="false"/>
Index: src/plugin/index-more/src/java/org/apache/nutch/indexer/more/
--- src/plugin/index-more/src/java/org/apache/nutch/indexer/more/        (revision 1053817)
+++ src/plugin/index-more/src/java/org/apache/nutch/indexer/more/        (working copy)
@@ -172,7 +172,7 @@
   private NutchDocument addType(NutchDocument doc, WebPage page, String url) {
     MimeType mimeType = null;
-    Utf8 contentType = page.getFromHeaders(new Utf8(HttpHeaders.CONTENT_TYPE));
+    Utf8 contentType = page.getContentType();
     if (contentType == null) {
       // Note by Jerome Charron on 20050415:
       // Content Type not solved by a previous plugin
Index: src/java/org/apache/nutch/indexer/solr/
--- src/java/org/apache/nutch/indexer/solr/      (revision 1053817)
+++ src/java/org/apache/nutch/indexer/solr/      (working copy)
@@ -56,7 +56,7 @@
       for (final String val : e.getValue()) {
         inputDoc.addField(solrMapping.mapKey(e.getKey()), val);
         String sCopy = solrMapping.mapCopyKey(e.getKey());
-        if (sCopy != e.getKey()) {
+        if (! sCopy.equals(e.getKey())) {
                inputDoc.addField(sCopy, val);
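The last hunk of the patch is worth a short illustration. In Java, != on two String variables compares object references, not characters, so two strings with the same content can still be "different". The class below is a standalone example, unrelated to the Nutch code base.

```java
// Standalone demonstration of reference vs. content comparison.
class StringCompare {

    static boolean sameReference(String a, String b) {
        return a == b;        // true only if both variables point to the same object
    }

    static boolean sameContent(String a, String b) {
        return a.equals(b);   // true if the characters are the same
    }

    public static void main(String[] args) {
        String key = "tag";
        String copy = new String("tag"); // same characters, distinct object

        System.out.println(key != copy);        // true: the references differ
        System.out.println(!key.equals(copy));  // false: the contents are equal
    }
}
```

With the reference comparison, the "copy" branch can run even when the mapped key is equal to the original one, which adds the same value twice to the document.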

Download Solr. To set up the Solr server, copy the example directory from the Solr distribution, then copy the patched schema.xml configuration file to solr/conf of the Solr app.

cp -r $SOLR_HOME/example solrapp
cp $NUTCH_HOME/conf/schema.xml solrapp/solr/conf/
cd solrapp
java -jar start.jar

This starts the Solr server. Now let's index a few documents by passing to the SolrIndexerJob class the batch id that shows up in the markers column.

Here are some excerpts of the logs from the Jetty server to make sure the documents were properly sent:

13-Jan-2011 19:50:47 org.apache.solr.update.processor.LogUpdateProcessor finish
INFO: {add=[com.truveo.www:http/,, com.blogspot.techvineyard:http/]} 0 206
13-Jan-2011 19:50:47 org.apache.solr.core.SolrCore execute
INFO: [] webapp=/solr path=/update params={wt=javabin&version=1} status=0 QTime=206 
13-Jan-2011 19:50:47 org.apache.solr.update.DirectUpdateHandler2 commit
INFO: start commit(optimize=false,waitFlush=true,waitSearcher=true,expungeDeletes=false)
13-Jan-2011 19:50:47 org.apache.solr.core.SolrDeletionPolicy onCommit
INFO: SolrDeletionPolicy.onCommit: commits:num=2
        commit{dir=/home/alex/java/perso/nutch/solrapp/solr/data/index,segFN=segments_2,version=1294944630024,generation=2,filenames=[_0.nrm, _0.tis, _0.fnm, _0.tii, _0.frq, segments_2, _0.fdx, _0.prx, _0.fdt]
13-Jan-2011 19:50:47 org.apache.solr.core.SolrDeletionPolicy updateCommits
INFO: newest commit = 1294944630024

You can now do a search via the api:

$ curl "http://localhost:8983/solr/select/?q=video&indent=on"
<?xml version="1.0" encoding="UTF-8"?>

<lst name="responseHeader">
 <int name="status">0</int>
 <int name="QTime">0</int>
 <lst name="params">
  <str name="indent">on</str>
  <str name="q">video</str>
<result name="response" numFound="2" start="0">
  <arr name="anchor"><str>Logout</str></arr>
  <float name="boost">1.03571</float>
  <str name="date">20110212</str>
  <str name="digest">5d62587216b50ed7e52987b09dcb9925</str>
  <str name="id">com.truveo.www:http/</str>
  <str name="lang">unknown</str>
  <arr name="tag"><str/></arr>
  <str name="title">Truveo Video Search</str>
  <long name="tstamp">2011-02-12T18:37:53.031Z</long>
  <arr name="type"><str>text/html</str><str>text</str><str>html</str></arr>
  <str name="url"></str>
  <arr name="anchor"><str>Comments</str></arr>
  <float name="boost">1.00971</float>
  <str name="date">20110212</str>
  <str name="digest">59edefd6f4711895c2127d45b569d8c9</str>
  <str name="id"></str>
  <str name="lang">en</str>
  <arr name="subcollection"><str>nutch</str></arr>
  <arr name="tag"><str/></arr>
  <str name="title">FrontPage - Nutch Wiki</str>
  <long name="tstamp">2011-02-12T18:37:53.863Z</long>
  <arr name="type"><str>text/html</str><str>text</str><str>html</str></arr>
  <str name="url"></str>
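The same query can be sent from Java. This is a sketch that uses the JDK HTTP client (Java 11 or later) against the select URL used in the curl command above; the class name SolrQuery is made up, and a real application would more likely go through a Solr client library.

```java
import java.io.IOException;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

// Hypothetical helper that mirrors the curl call above.
class SolrQuery {

    /** Builds the select URL, url-encoding the query string. */
    static String selectUrl(String baseUrl, String query) {
        return baseUrl + "/select/?q="
                + URLEncoder.encode(query, StandardCharsets.UTF_8)
                + "&indent=on";
    }

    /** Sends the query and returns the raw XML response body. */
    static String search(String baseUrl, String query)
            throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest
                .newBuilder(URI.create(selectUrl(baseUrl, query)))
                .GET()
                .build();
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }

    public static void main(String[] args) throws Exception {
        // Requires the Solr server started above to be running.
        System.out.println(search("http://localhost:8983/solr", "video"));
    }
}
```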

Crawl Script

To automate the crawl process, we might want to use a Bash script that runs the suite of Nutch commands, then add it as a cron job. Don't forget to first initialize the crawl db with the inject command. We run several iterations of the generate/fetch/parse/update cycle with the for loop. We limit the number of urls that get fetched in one iteration by specifying a -topN argument in the generate command.


#!/bin/bash
# Nutch crawl

export NUTCH_HOME=~/java/workspace/Nutch2.0/runtime/local

# The loop below expects these variables to be set (their values are not shown in this listing):
#   n         depth in the web exploration
#   maxUrls   number of selected urls for fetching
#   solrUrl   solr server
#   log, log2 log file, and its name with the batch id appended

for (( i = 1 ; i <= $n ; i++ ))
do

# Generate
$NUTCH_HOME/bin/nutch generate -topN $maxUrls > $log

# Extract the batch id printed by the generator
batchId=`sed -n 's|.*batch id: \(.*\)|\1|p' < $log`

# rename log file by appending the batch id
mv $log $log2

# Fetch
$NUTCH_HOME/bin/nutch fetch $batchId >> $log

# Parse
$NUTCH_HOME/bin/nutch parse $batchId >> $log

# Update
$NUTCH_HOME/bin/nutch updatedb >> $log

# Index
$NUTCH_HOME/bin/nutch solrindex $solrUrl $batchId >> $log

done


I managed to fetch 50k urls in one run with these minor changes. With the default values in conf/nutch-default.xml and MySQL as datastore, these are the log timestamps when running the initialization and one iteration of the generate/fetch/update cycle:

2010-12-13 07:19:26,089 INFO  crawl.InjectorJob - InjectorJob: starting
2010-12-13 07:20:00,077 INFO  crawl.InjectorJob - InjectorJob: finished
2010-12-13 07:20:00,715 INFO  crawl.GeneratorJob - GeneratorJob: starting
2010-12-13 07:20:34,304 INFO  crawl.GeneratorJob - GeneratorJob: done
2010-12-13 07:20:35,041 INFO  fetcher.FetcherJob - FetcherJob: starting
2010-12-13 11:04:00,933 INFO  fetcher.FetcherJob - FetcherJob: done
2010-12-15 01:38:44,262 INFO  crawl.DbUpdaterJob - DbUpdaterJob: starting
2010-12-15 02:15:15,503 INFO  crawl.DbUpdaterJob - DbUpdaterJob: done
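
To turn these timestamps into durations, you can subtract the start time of each job from its end time. The sketch below does this with java.time, assuming the "yyyy-MM-dd HH:mm:ss,SSS" layout of the log lines above; the class name JobTimes is made up for the example.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper that computes job durations from the log timestamps.
class JobTimes {

    private static final DateTimeFormatter LOG_TIME =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss,SSS");

    static Duration elapsed(String start, String end) {
        return Duration.between(
                LocalDateTime.parse(start, LOG_TIME),
                LocalDateTime.parse(end, LOG_TIME));
    }

    public static void main(String[] args) {
        System.out.println("InjectorJob:  " + elapsed("2010-12-13 07:19:26,089", "2010-12-13 07:20:00,077"));
        System.out.println("GeneratorJob: " + elapsed("2010-12-13 07:20:00,715", "2010-12-13 07:20:34,304"));
        System.out.println("FetcherJob:   " + elapsed("2010-12-13 07:20:35,041", "2010-12-13 11:04:00,933"));
        System.out.println("DbUpdaterJob: " + elapsed("2010-12-15 01:38:44,262", "2010-12-15 02:15:15,503"));
    }
}
```

This gives roughly 34 seconds each for the injector and the generator, about 3 hours 43 minutes for the fetcher and about 36 and a half minutes for the db update.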

The next step is to compare this with a setup backed by an HBase datastore. I tried once but got a memory error which left my HBase server instance unresponsive. See the description of the problem.

Please don't hesitate to comment and share your own feedback, difficulties and results.