Friday, December 17, 2010

Build Nutch 2.0

Testing Nutch 2.0 under Eclipse



Table of Contents
Introduction
Setup the projects in Eclipse
  Install plugins
  Check out SVN directories
  Build the projects
Nutch
  Datastores
    HSQL
    MySQL
    HBase
    Cassandra
  JUnit Tests
    Datastore
    Fetch
  Nutch Commands
    Running Nutch classes from Eclipse
    crawl
    readdb
    inject
    generate
    fetch
    parse
    updatedb
    solrindex
Crawl script
Conclusion


Introduction


     This is a guide to setting up Nutch 2 in an Eclipse project. You will then be able to hack on the code and run tests, especially JUnit ones, quite easily.

If you have the Nutch branch or Gora trunk already checked out, run
$ svn up

in order to have the most up-to-date versions. Several fixes appear in this document as diff outputs. To apply a fix, save the diff content to a file (here called myDiff) in the root directory and run the patch command

$ patch -p0 < myDiff


Setup the projects in Eclipse


     The idea is to be able to improve the Nutch and Gora code comfortably, with the help of the Eclipse IDE. You want to add the source directories (and, why not, the test directories) of the modules you are interested in to the Java Build Path. Then manage the dependencies with the Ivy or Maven plugins to resolve the external libraries, update the code, optionally run a few JUnit tests, run the Ant task that builds the project, and finally submit a patch. This is the easiest and fastest way to stay productive.


Install plugins

Install the Subclipse, IvyDE and m2e plugins if you don't have them yet.
Help > Install New Software ...

Add the following urls:








Check out SVN directories

Check out the Nutch branch and Gora trunk using the SVN wizard, with the following URLs:

File > New > Project ...












Note that if you don't like the SVN Eclipse plugin, you can simply create a Java project and check out the Nutch source with the svn command:

$ cd ~/java/workspace/NutchGora
$ svn co http://svn.apache.org/repos/asf/nutch/branches/nutchgora branch

Build the projects

Window > Show View > Ant
Drag and drop the build.xml files in the Ant Eclipse tab.
Just double click on the Gora and Nutch items in the Ant view. That will run the default task. For Gora, it will publish the modules to the local Ivy repository. For Nutch, it will build a "release" in the runtime/local directory.





Nutch


Within the Nutch project, we want to manage the dependencies with the Ivy plugin, not the Maven one.

  • The "nutch.root" property, set in build.xml for Ant, should be replaced in src/plugin/protocol-sftp/ivy.xml by the built-in "basedir" Ivy property, as I am not sure how to load a custom property in the Eclipse Ivy plugin. This change will break the Ant build, so be sure to revert it before running Ant tasks.

These are the Ivy configuration tweaks:

Index: src/plugin/protocol-sftp/ivy.xml
===================================================================
--- src/plugin/protocol-sftp/ivy.xml    (revision 1177967)
+++ src/plugin/protocol-sftp/ivy.xml    (working copy)
@@ -27,7 +27,7 @@
   </info>
 
   <configurations>
-    <include file="${nutch.root}/ivy/ivy-configurations.xml"/>
+    <include file="${basedir}/branch/ivy/ivy-configurations.xml"/>
   </configurations>
 
   <publications>
Index: ivy/ivy.xml
===================================================================
--- ivy/ivy.xml (revision 1177967)
+++ ivy/ivy.xml (working copy)
@@ -21,7 +21,7 @@
        </info>
 
        <configurations>
-               <include file="${basedir}/ivy/ivy-configurations.xml" />
+               <include file="${basedir}/branch/ivy/ivy-configurations.xml" />
        </configurations>
 
        <publications>
@@ -58,8 +58,9 @@
                  <dependency org="org.apache.tika" name="tika-parsers" rev="0.9" />
                -->
 
+               <!--
                <dependency org="org.apache.gora" name="gora-core" rev="0.1.1-incubating" conf="*->compile"/>
-
+-->
                <dependency org="log4j" name="log4j" rev="1.2.15" conf="*->master" />
 
                <dependency org="xerces" name="xercesImpl" rev="2.9.1" />

Now, right click on ivy/ivy.xml, "Add Ivy Library ...". Do the same for src/plugin/protocol-sftp/ivy.xml.














Remove the default src directory as a Source entry in the Java Build Path if it exists. Add at least the "java", "test" and "resources" source folders (which are src/java, src/test and conf) so that they get included in the classpath. That will allow us to run the classes and tests from Eclipse later.



This is what the project tree looks like:





Datastores

The datastore holds the information Nutch crawls from the web. You can opt for a relational database system or a column-oriented NoSQL store. Thanks to the Gora interface, you can use whichever backend you are familiar with: HSQL, MySQL, HBase, Cassandra ...
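To make the backend swap concrete, here is a toy, self-contained sketch of the idea behind Gora's store abstraction. This is not the real Gora API: the interface, class and method names below are invented for illustration only.

```java
import java.util.HashMap;
import java.util.Map;

public class DataStoreSketch {
    // Toy stand-in for Gora's DataStore<K,T>: crawl code only sees get/put,
    // while the actual backend (HSQL, MySQL, HBase, Cassandra) is chosen
    // by configuration, not by the calling code.
    interface DataStore<K, T> {
        void put(K key, T value);
        T get(K key);
    }

    static class InMemoryStore<K, T> implements DataStore<K, T> {
        private final Map<K, T> rows = new HashMap<>();
        public void put(K key, T value) { rows.put(key, value); }
        public T get(K key) { return rows.get(key); }
    }

    public static DataStore<String, String> createStore(String backend) {
        // A real factory would load the class named by the
        // storage.data.store.class property; here only the toy store exists.
        return new InMemoryStore<>();
    }

    public static void main(String[] args) {
        DataStore<String, String> store = createStore("memory");
        store.put("com.truveo.www:http/", "WebPage{status=fetched}");
        System.out.println(store.get("com.truveo.www:http/"));
    }
}
```

The point of the indirection is that swapping HSQL for HBase is a configuration change, not a code change.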


HSQL


This is the default Gora backend. Make sure your Ivy configuration contains the dependency:
<dependency org="org.hsqldb" name="hsqldb" rev="2.0.0" conf="*->default"/>

This is the content of conf/gora.properties:

gora.sqlstore.jdbc.driver=org.hsqldb.jdbcDriver
gora.sqlstore.jdbc.url=jdbc:hsqldb:hsql://localhost/nutchtest
gora.sqlstore.jdbc.user=sa
gora.sqlstore.jdbc.password=


Set up HSQL. I downloaded this version: HSQLDB 2.0.0. Finally, start the HSQL server with the same database alias, "nutchtest":

~/java/ext/hsqldb-2.0.0/hsqldb/data$ java -cp ../lib/hsqldb.jar org.hsqldb.server.Server --database.0 file:crawldb --dbname.0 nutchtest
[Server@12ac982]: [Thread[main,5,main]]: checkRunning(false) entered
[Server@12ac982]: [Thread[main,5,main]]: checkRunning(false) exited
[Server@12ac982]: Startup sequence initiated from main() method
[Server@12ac982]: Loaded properties from [/home/alex/java/ext/hsqldb-2.0.0/hsqldb/data/server.properties]
[Server@12ac982]: Initiating startup sequence...
[Server@12ac982]: Server socket opened successfully in 5 ms.
[Server@12ac982]: Database [index=0, id=0, db=file:crawldb, alias=nutchtest] opened sucessfully in 420 ms.
[Server@12ac982]: Startup sequence completed in 426 ms.
[Server@12ac982]: 2011-01-12 10:41:56.181 HSQLDB server 2.0.0 is online on port 9001
[Server@12ac982]: To close normally, connect and execute SHUTDOWN SQL
[Server@12ac982]: From command line, use [Ctrl]+[C] to abort abruptly


MySQL

To use MySQL as datastore, add the dependency to ivy/ivy.xml:

<dependency org="mysql" name="mysql-connector-java" rev="5.1.13" conf="*->default"></dependency>


Change conf/gora.properties to set up the MySQL connection:

gora.sqlstore.jdbc.driver=com.mysql.jdbc.Driver
gora.sqlstore.jdbc.url=jdbc:mysql://localhost:3306/nutch?createDatabaseIfNotExist=true
gora.sqlstore.jdbc.user=alex
gora.sqlstore.jdbc.password=some_pass

You will notice MySQL is a lot faster than HSQL; it was at least 12x faster in my case with the default setups. For example, injecting 50k URLs took 6 minutes with HSQL versus 30 seconds with MySQL. You could make a similar comparison in Rails with SQLite and MySQL ...


HBase


Add the gora-hbase and zookeeper dependencies to ivy/ivy.xml:

<dependency org="org.apache.gora" name="gora-hbase" rev="0.1" conf="*->compile">
  <exclude org="com.sun.jdmk"/>
  <exclude org="com.sun.jmx"/>
  <exclude org="javax.jms"/>
 </dependency>
 <dependency org="org.apache.zookeeper" name="zookeeper" rev="3.3.2" conf="*->default">
  <exclude org="com.sun.jdmk"/>
  <exclude org="com.sun.jmx"/>
  <exclude org="javax.jms"/>        
 </dependency>  

After rebuilding Nutch, since a recent HBase version (0.20.6?) is not available in the Maven repository, you will need to manually add the HBase jar to the runtime/local/lib directory:

$ cp $HBASE_HOME/hbase-*.jar $NUTCH_HOME/runtime/local/lib

Create the $NUTCH_HOME/runtime/local/conf/gora-hbase-mapping.xml file as described in the GORA_HBase wiki page. Override the storage.data.store.class property in runtime/local/conf/nutch-site.xml:

<property>
  <name>storage.data.store.class</name>
  <value>org.apache.gora.hbase.store.HBaseStore</value>
 </property>

Finally, set up and run HBase. See this blog entry for a quick guide.


Cassandra


Start Cassandra:

$ bin/cassandra -f


Run the injector job with the Gora storage class set to the Cassandra store in nutch-site.xml:

<property>
  <name>storage.data.store.class</name>
  <value>org.apache.gora.cassandra.store.CassandraStore</value>
  <description>Default class for storing data</description>
</property>

Place gora-cassandra-mapping.xml in your Nutch conf directory, which is included in the classpath. This configuration defines how the Avro fields are stored in Cassandra. This is the content of the gora-cassandra-mapping.xml file:
<gora-orm>
 <keyspace name="webpage" cluster="Test Cluster" host="localhost">
  <family name="p"/>
  <family name="f"/>
  <family name="sc" type="super"/>
 </keyspace>
 <class keyClass="java.lang.String" name="org.apache.nutch.storage.WebPage">
  <!-- fetch fields -->
  <field name="baseUrl" family="f" qualifier="bas"/>
  <field name="status" family="f" qualifier="st"/>
  <field name="prevFetchTime" family="f" qualifier="pts"/>
  <field name="fetchTime" family="f" qualifier="ts"/>
  <field name="fetchInterval" family="f" qualifier="fi"/>
  <field name="retriesSinceFetch" family="f" qualifier="rsf"/>
  <field name="reprUrl" family="f" qualifier="rpr"/>
  <field name="content" family="f" qualifier="cnt"/>
  <field name="contentType" family="f" qualifier="typ"/>
  <field name="modifiedTime" family="f" qualifier="mod"/>
  <!-- parse fields -->
  <field name="title" family="p" qualifier="t"/>
  <field name="text" family="p" qualifier="c"/>
  <field name="signature" family="p" qualifier="sig"/>
  <field name="prevSignature" family="p" qualifier="psig"/>
  <!-- score fields -->
  <field name="score" family="f" qualifier="s"/>
  <!-- super columns -->
  <field name="markers" family="sc" qualifier="mk"/>
  <field name="inlinks" family="sc" qualifier="il"/>
  <field name="outlinks" family="sc" qualifier="ol"/>
  <field name="metadata" family="sc" qualifier="mtdt"/>
  <field name="headers" family="sc" qualifier="h"/>
  <field name="parseStatus" family="sc" qualifier="pas"/>
  <field name="protocolStatus" family="sc" qualifier="prs"/>
 </class>
</gora-orm>




Check that the data has been initialized and gets populated:


$ bin/cassandra-cli --host localhost
[default@unknown] use webpage;
Authenticated to keyspace: webpage
[default@webpage] update column family p with key_validation_class=UTF8Type;                                                                                                       
138c3060-b623-11e0-0000-242d50cf1fb7
Waiting for schema agreement...
... schemas agree across the cluster
[default@webpage] update column family f with key_validation_class=UTF8Type;
139de3a0-b623-11e0-0000-242d50cf1fb7
Waiting for schema agreement...
... schemas agree across the cluster
[default@webpage] update column family sc with key_validation_class=UTF8Type;
13b2f240-b623-11e0-0000-242d50cf1fb7
Waiting for schema agreement...
... schemas agree across the cluster
[default@webpage] list f;
Using default limit of 100
-------------------
RowKey: com.truveo.www:http/
=> (column=fi, value=2592000, timestamp=1311532210076000)
=> (column=s, value=1.0, timestamp=1311532210080000)
=> (column=ts, value=1311532203790, timestamp=1311532209796000)
-------------------
RowKey: com.blogspot.techvineyard:http/
=> (column=fi, value=2592000, timestamp=1311532210134000)
=> (column=s, value=1.0, timestamp=1311532210137000)
=> (column=ts, value=1311532203790, timestamp=1311532210131000)
-------------------
RowKey: org.apache.wiki:http/nutch/
=> (column=fi, value=2592000, timestamp=1311532210146000)
=> (column=s, value=1.0, timestamp=1311532210149000)
=> (column=ts, value=1311532203790, timestamp=1311532210144000)

3 Rows Returned.
[default@webpage] 



JUnit Tests


Let's run a few unit tests to verify the setup.


Datastore


We want to make the GoraStorage test pass. First apply this patch, which lets you test your datastore settings and scales the test down so it does not overwhelm an old laptop with limited capacity.




Index: src/test/org/apache/nutch/storage/TestGoraStorage.java
===================================================================
--- src/test/org/apache/nutch/storage/TestGoraStorage.java      (revision 1053817)
+++ src/test/org/apache/nutch/storage/TestGoraStorage.java      (working copy)
@@ -1,22 +1,17 @@
 package org.apache.nutch.storage;
 
-import java.io.File;
-import java.util.ArrayList;
 import java.util.BitSet;
-import java.util.Iterator;
-import java.util.List;
 import java.util.Random;
-import java.util.Vector;
 import java.util.concurrent.atomic.AtomicInteger;
 
+import junit.framework.TestCase;
+
 import org.apache.avro.util.Utf8;
-import org.apache.hadoop.conf.Configuration;
-import org.apache.nutch.util.NutchConfiguration;
 import org.apache.gora.query.Result;
 import org.apache.gora.store.DataStore;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.nutch.util.NutchConfiguration;
 
-import junit.framework.TestCase;
-
 public class TestGoraStorage extends TestCase {
   Configuration conf;
   
@@ -80,8 +75,8 @@
   private AtomicInteger threadCount = new AtomicInteger(0);
   
   public void testMultithread() throws Exception {
-    int COUNT = 1000;
-    int NUM = 100;
+    int COUNT = 50;
+    int NUM = 5;
     DataStore<String,WebPage> store;
     
     for (int i = 0; i < NUM; i++) {
@@ -113,115 +108,4 @@
     assertEquals(size, keys.cardinality());
   }
   
-  public void testMultiProcess() throws Exception {
-    int COUNT = 1000;
-    int NUM = 100;
-    DataStore<String,WebPage> store;
-    List<Process> procs = new ArrayList<Process>();
-    
-    for (int i = 0; i < NUM; i++) {
-      Process p = launch(i, i * COUNT, COUNT);
-      procs.add(p);
-    }
-    
-    while (procs.size() > 0) {
-      try {
-        Thread.sleep(5000);
-      } catch (Exception e) {};
-      Iterator<Process> it = procs.iterator();
-      while (it.hasNext()) {
-        Process p = it.next();
-        int code = 1;
-        try {
-          code = p.exitValue();
-          assertEquals(0, code);
-          it.remove();
-          p.destroy();
-        } catch (IllegalThreadStateException e) {
-          // not ready yet
-        }
-      }
-      System.out.println("* running " + procs.size() + "/" + NUM);
-    }
-    System.out.println("Verifying...");
-    store = StorageUtils.createDataStore(conf, String.class, WebPage.class);
-    Result<String,WebPage> res = store.execute(store.newQuery());
-    int size = COUNT * NUM;
-    BitSet keys = new BitSet(size);
-    while (res.next()) {
-      String key = res.getKey();
-      WebPage p = res.get();
-      assertEquals(key, p.getTitle().toString());
-      int pos = Integer.parseInt(key);
-      assertTrue(pos < size && pos >= 0);
-      if (keys.get(pos)) {
-        fail("key " + key + " already set!");
-      }
-      keys.set(pos);
-    }
-    if (size != keys.cardinality()) {
-      System.out.println("ERROR Missing keys:");
-      for (int i = 0; i < size; i++) {
-        if (keys.get(i)) continue;
-        System.out.println(" " + i);
-      }
-      fail("key count should be " + size + " but is " + keys.cardinality());
-    }
-  }
-  
-  private Process launch(int id, int start, int count) throws Exception {
-    //  Build exec child jmv args.
-    Vector<String> vargs = new Vector<String>(8);
-    File jvm =                                  // use same jvm as parent
-      new File(new File(System.getProperty("java.home"), "bin"), "java");
-
-    vargs.add(jvm.toString());
-
-    // Add child (task) java-vm options.
-    // tmp dir
-    String prop = System.getProperty("java.io.tmpdir");
-    vargs.add("-Djava.io.tmpdir=" + prop);
-    // library path
-    prop = System.getProperty("java.library.path");
-    if (prop != null) {
-      vargs.add("-Djava.library.path=" + prop);      
-    }
-    // working dir
-    prop = System.getProperty("user.dir");
-    vargs.add("-Duser.dir=" + prop);    
-    // combat the stupid Xerces issue
-    vargs.add("-Djavax.xml.parsers.DocumentBuilderFactory=com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderFactoryImpl");
-    // prepare classpath
-    String sep = System.getProperty("path.separator");
-    StringBuffer classPath = new StringBuffer();
-    // start with same classpath as parent process
-    classPath.append(System.getProperty("java.class.path"));
-    //classPath.append(sep);
-    // Add classpath.
-    vargs.add("-classpath");
-    vargs.add(classPath.toString());
-    
-    // append class name and args
-    vargs.add(TestGoraStorage.class.getName());
-    vargs.add(String.valueOf(id));
-    vargs.add(String.valueOf(start));
-    vargs.add(String.valueOf(count));
-    ProcessBuilder builder = new ProcessBuilder(vargs);
-    return builder.start();
-  }
-  
-  public static void main(String[] args) throws Exception {
-    if (args.length < 3) {
-      System.err.println("Usage: TestGoraStore <id> <startKey> <numRecords>");
-      System.exit(-1);
-    }
-    TestGoraStorage test = new TestGoraStorage();
-    test.init();
-    int id = Integer.parseInt(args[0]);
-    int start = Integer.parseInt(args[1]);
-    int count = Integer.parseInt(args[2]);
-    Worker w = test.new Worker(id, start, count, true);
-    w.run();
-    System.exit(0);
-  }
 }
Index: src/test/org/apache/nutch/util/AbstractNutchTest.java
===================================================================
--- src/test/org/apache/nutch/util/AbstractNutchTest.java       (revision 1053817)
+++ src/test/org/apache/nutch/util/AbstractNutchTest.java       (working copy)
@@ -16,28 +16,14 @@
  */
 package org.apache.nutch.util;
 
-import java.io.IOException;
-import java.nio.ByteBuffer;
-import java.util.ArrayList;
-import java.util.List;
-
 import junit.framework.TestCase;
 
-import org.apache.avro.util.Utf8;
+import org.apache.gora.store.DataStore;
 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.fs.FileSystem;
 import org.apache.hadoop.fs.Path;
-import org.apache.nutch.crawl.URLWebPage;
-import org.apache.nutch.storage.Mark;
 import org.apache.nutch.storage.StorageUtils;
 import org.apache.nutch.storage.WebPage;
-import org.apache.nutch.util.TableUtil;
-import org.apache.gora.query.Query;
-import org.apache.gora.query.Result;
-import org.apache.gora.sql.store.SqlStore;
-import org.apache.gora.store.DataStore;
-import org.apache.gora.store.DataStoreFactory;
-import org.apache.gora.util.ByteUtils;
 
 /**
  * This class provides common routines for setup/teardown of an in-memory data
@@ -55,16 +41,12 @@
   public void setUp() throws Exception {
     super.setUp();
     conf = CrawlTestUtil.createConfiguration();
-    conf.set("storage.data.store.class", "org.gora.sql.store.SqlStore");
     fs = FileSystem.get(conf);
-    // using hsqldb in memory
-    DataStoreFactory.properties.setProperty("gora.sqlstore.jdbc.driver","org.hsqldb.jdbcDriver");
-    // use separate in-memory db-s for tests
-    DataStoreFactory.properties.setProperty("gora.sqlstore.jdbc.url","jdbc:hsqldb:mem:" + getClass().getName());
-    DataStoreFactory.properties.setProperty("gora.sqlstore.jdbc.user","sa");
-    DataStoreFactory.properties.setProperty("gora.sqlstore.jdbc.password","");
     webPageStore = StorageUtils.createWebStore(conf, String.class,
         WebPage.class);
+    
+    // empty the datastore
+    webPageStore.deleteByQuery(webPageStore.newQuery());
   }
 
   @Override












Fetch



We can try to run the Fetcher test as well.

  • Change the location of the static files that will be returned to the Nutch crawler by the Jetty server, from "build/test/data/fetch-test-site" to "src/testresources/fetch-test-site"
  • Also override the plugin directory setting for testing purposes.
  • Set http.agent.name and http.robots.agents properties.
  • Limit the content length to the maximum for a blob column type. This is only required for MySQL.


Index: src/test/nutch-site.xml               
===================================================================
--- src/test/nutch-site.xml     (revision 1053817)
+++ src/test/nutch-site.xml     (working copy)
@@ -22,4 +22,20 @@
<description>Default class for storing data</description>
 </property>
 
+       <property>
+         <name>plugin.folders</name>
+         <value>build/plugins</value>
+       </property>
+       <property>
+         <name>http.agent.name</name>
+         <value>NutchRobot</value>
+       </property>
+       <property>
+         <name>http.robots.agents</name>
+         <value>NutchRobot,*</value>
+       </property>
+       <property>
+         <name>http.content.limit</name>
+         <value>65535</value>
+       </property>
 </configuration>
Index: src/test/org/apache/nutch/fetcher/TestFetcher.java
===================================================================
--- src/test/org/apache/nutch/fetcher/TestFetcher.java  (revision 1050697)
+++ src/test/org/apache/nutch/fetcher/TestFetcher.java  (working copy)
@@ -50,7 +50,7 @@
   public void setUp() throws Exception{
     super.setUp();
     urlPath = new Path(testdir, "urls");
-    server = CrawlTestUtil.getServer(conf.getInt("content.server.port",50000), "build/test/data/fetch-test-site");
+    server = CrawlTestUtil.getServer(conf.getInt("content.server.port",50000), "src/testresources/fetch-test-site");
     server.start();
   }

Now right click on the org.apache.nutch.fetcher.TestFetcher class located in the src/test source directory, then "Run As" > "JUnit Test".





Nutch Commands


Several commands are available to maintain and index your crawl. Here are the possible options from the Bash script:

~/java/workspace/Nutch2.0/runtime/local$ bin/nutch
Usage: nutch [-core] COMMAND
where COMMAND is one of:
 inject inject new urls into the database
 generate generate new segments to fetch from crawl db
 fetch fetch URLs marked during generate
 parse parse URLs marked during fetch
 updatedb update web table after parsing
 readdb read/dump records from page database
 solrindex run the solr indexer on parsed segments and linkdb
 solrdedup remove duplicates from solr
 plugin load a plugin and run one of its classes main()
 or
 CLASSNAME run the class named CLASSNAME
Most commands print help when invoked w/o parameters.

Expert: -core option is for developers only. It avoids building the job jar, 
 instead it simply includes classes compiled with ant compile-core. 
 NOTE: this works only for jobs executed in 'local' mod

Running Nutch classes from Eclipse


You can either run a command with the Bash script or execute a Nutch class directly from Eclipse. The latter is easier for development since you do not need to rebuild the whole project each time you change something. When a Nutch class is executed, it first loads the configuration by looking in the classpath for a nutch-site.xml file that overrides nutch-default.xml. Depending on the order of the "src/test" and "conf" source directories in your Eclipse build path, only one nutch-site.xml file will be loaded to the classpath. In my case, it was the one located in "src/test". If I edit the one in "conf", I see the warning

The resource is a duplicate of src/test/nutch-default.xml and was not copied to the output folder.

which indicates it will be ignored. So you want to edit the one that is activated.

  • Apply the modifications to src/test/nutch-site.xml (or conf/nutch-site.xml, depending on your classpath order setting) that are given in the Fetch Test section from above.
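The override behaviour described above can be sketched with plain java.util.Properties, as a deliberate simplification of how nutch-site.xml overrides nutch-default.xml (the real loading is done by Hadoop's Configuration class; the property names are taken from the Fetch test section):

```java
import java.util.Properties;

public class ConfigOverrideSketch {
    // Simplified model of how nutch-site.xml overrides nutch-default.xml:
    // a Properties table built on a defaults table falls back to the
    // defaults only for keys the site table does not define itself.
    public static Properties load() {
        Properties defaults = new Properties();       // stands in for nutch-default.xml
        defaults.setProperty("http.agent.name", "");
        defaults.setProperty("http.robots.agents", "*");

        Properties site = new Properties(defaults);   // stands in for nutch-site.xml
        site.setProperty("http.agent.name", "NutchRobot");  // overrides the default
        site.setProperty("http.content.limit", "65535");    // key only in the site file
        return site;
    }

    public static void main(String[] args) {
        Properties conf = load();
        System.out.println(conf.getProperty("http.agent.name"));    // NutchRobot (site wins)
        System.out.println(conf.getProperty("http.robots.agents")); // * (falls back to defaults)
    }
}
```

This is why editing the nutch-site.xml that is not on the classpath has no visible effect: only one site table ever sits in front of the defaults.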




crawl / org.apache.nutch.crawl.Crawler


~/java/workspace/Nutch2.0/runtime/local$ bin/nutch crawl
Usage: Crawl (<seedDir> | -continue) [-solr <solrURL>] [-threads n] [-depth i] [-topN N]


Right click on org.apache.nutch.crawl.Crawler in the src/java source directory. Then "Run As" > "Java Application".

  1. The first argument, "seedDir", is the path to a directory containing lists of seed URLs. These are injected into the database and define the initial set of pages the crawler visits during the first iteration of the graph exploration. The crawler then expands the graph by extracting new URLs from the page content and adding these neighbours, which are visited in the next iteration.
  2. The -continue parameter instead resumes the crawl without injecting any seeds.
  3. -solr defines the Solr server used to index the documents.
  4. -threads defines the number of threads spawned to fetch several pages simultaneously.
  5. -depth defines the number of iterations in the graph exploration, before the traversal gets pruned.
  6. -topN limits the number of URLs that get downloaded in one iteration.


Let's create some input for the crawl command. This is the content of a seeds/urls file that we can use for the demo:

http://techvineyard.blogspot.com/
http://www.truveo.com/
http://wiki.apache.org/nutch/

I used MySQL as the datastore. If the webpage table already exists, let's clear it before running the crawl command.

$ mysql -hlocalhost -ualex -psome_pass nutch
mysql> delete from webpage;

From the Eclipse menu:

Run > Run Configurations ...





Click Run. You can compare your output with my logs here. Then check the content of the MySQL table:

mysql> describe webpage;
+-------------------+----------------+------+-----+---------+-------+
| Field             | Type           | Null | Key | Default | Extra |
+-------------------+----------------+------+-----+---------+-------+
| id                | varchar(512)   | NO   | PRI | NULL    |       |
| headers           | blob           | YES  |     | NULL    |       |
| text              | varchar(32000) | YES  |     | NULL    |       |
| status            | int(11)        | YES  |     | NULL    |       |
| markers           | blob           | YES  |     | NULL    |       |
| parseStatus       | blob           | YES  |     | NULL    |       |
| modifiedTime      | bigint(20)     | YES  |     | NULL    |       |
| score             | float          | YES  |     | NULL    |       |
| typ               | varchar(32)    | YES  |     | NULL    |       |
| baseUrl           | varchar(512)   | YES  |     | NULL    |       |
| content           | blob           | YES  |     | NULL    |       |
| title             | varchar(512)   | YES  |     | NULL    |       |
| reprUrl           | varchar(512)   | YES  |     | NULL    |       |
| fetchInterval     | int(11)        | YES  |     | NULL    |       |
| prevFetchTime     | bigint(20)     | YES  |     | NULL    |       |
| inlinks           | blob           | YES  |     | NULL    |       |
| prevSignature     | blob           | YES  |     | NULL    |       |
| outlinks          | blob           | YES  |     | NULL    |       |
| fetchTime         | bigint(20)     | YES  |     | NULL    |       |
| retriesSinceFetch | int(11)        | YES  |     | NULL    |       |
| protocolStatus    | blob           | YES  |     | NULL    |       |
| signature         | blob           | YES  |     | NULL    |       |
| metadata          | blob           | YES  |     | NULL    |       |
+-------------------+----------------+------+-----+---------+-------+
23 rows in set (0.14 sec)

mysql> select count(*) from webpage;
+----------+
| count(*) |
+----------+
|      151 |
+----------+
1 row in set (0.00 sec)

mysql> select id, markers from webpage where content is not null;
+---------------------------------+------------------------------------------+
| id                              | markers                                  |
+---------------------------------+------------------------------------------+
| org.apache.wiki:http/nutch/     | _injmrk_y_updmrk_*1294943864-1806760603  |
| com.blogspot.techvineyard:http/ | _injmrk_y_updmrk_*1294943864-1806760603  |
| com.truveo.www:http/            | _injmrk_y_updmrk_*1294943864-1806760603  |
+---------------------------------+------------------------------------------+
3 rows in set (0.00 sec)
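The id values above ("com.truveo.www:http/") are URLs stored with the host reversed, so that pages from the same domain sort next to each other in the key space. Here is a minimal sketch of the transformation, a simplified take on what Nutch's TableUtil.reverseUrl does (no port handling or validation):

```java
public class ReverseUrlSketch {
    // Turn "http://www.truveo.com/" into "com.truveo.www:http/" so that rows
    // for the same domain end up adjacent when keys are sorted.
    public static String reverseUrl(String url) {
        int schemeEnd = url.indexOf("://");
        String protocol = url.substring(0, schemeEnd);
        String rest = url.substring(schemeEnd + 3);
        int slash = rest.indexOf('/');
        String host = slash < 0 ? rest : rest.substring(0, slash);
        String path = slash < 0 ? "/" : rest.substring(slash);
        String[] labels = host.split("\\.");
        StringBuilder key = new StringBuilder();
        for (int i = labels.length - 1; i >= 0; i--) {  // reverse the host labels
            key.append(labels[i]);
            if (i > 0) key.append('.');
        }
        return key.append(':').append(protocol).append(path).toString();
    }

    public static void main(String[] args) {
        System.out.println(reverseUrl("http://www.truveo.com/"));        // com.truveo.www:http/
        System.out.println(reverseUrl("http://wiki.apache.org/nutch/")); // org.apache.wiki:http/nutch/
    }
}
```

The same reversed form shows up as the Cassandra RowKey values in the earlier section.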


readdb / org.apache.nutch.crawl.WebTableReader


~/java/workspace/Nutch2.0/runtime/local$ bin/nutch readdb
Usage: WebTableReader (-stats | -url [url] | -dump <out_dir> [-regex regex]) [-crawlId <id>] [-content] [-headers] [-links] [-text]
        -crawlId <id>    the id to prefix the schemas to operate on, (default: storage.crawl.id)
        -stats [-sort]  print overall statistics to System.out
                [-sort] list status sorted by host
        -url <url>      print information on <url> to System.out
        -dump <out_dir> [-regex regex]  dump the webtable to a text file in <out_dir>
                -content        dump also raw content
                -headers        dump protocol headers
                -links  dump links
                -text   dump extracted text
                [-regex]        filter on the URL of the webtable entry

The WebTableReader class scans the entire database via a Hadoop job and outputs all the fields.





inject / org.apache.nutch.crawl.InjectorJob


~/java/workspace/Nutch2.0/runtime/local$ bin/nutch inject
Usage: InjectorJob <url_dir> [-crawlId <id>]

First, we need to initialize the crawl db. The "url_dir" argument to the inject command is a directory containing flat files listing URLs, used as "seeds".

generate / org.apache.nutch.crawl.GeneratorJob


~/java/workspace/Nutch2.0/runtime/local$ bin/nutch generate
GeneratorJob: Selecting best-scoring urls due for fetch.
GeneratorJob: starting
GeneratorJob: filtering: true
GeneratorJob: done
GeneratorJob: generated batch id: 1294943864-1806760603

This step generates a batch id identifying the selected URLs to be fetched.
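The batch id printed above ("1294943864-1806760603") looks like an epoch-seconds timestamp joined to a random integer. Here is a hedged sketch of how such an id could be built; the exact scheme GeneratorJob uses may differ:

```java
import java.util.Random;

public class BatchIdSketch {
    // A batch id tags all URLs selected in one generate step, so that the
    // fetch and parse steps can later be restricted to exactly that batch.
    public static String newBatchId(long timeMillis, Random random) {
        return (timeMillis / 1000) + "-" + Math.abs(random.nextInt());
    }

    public static void main(String[] args) {
        System.out.println(newBatchId(System.currentTimeMillis(), new Random()));
    }
}
```

Passing the id (or -all) to the fetch command is what ties the pipeline steps together.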

fetch / org.apache.nutch.fetcher.FetcherJob


~/java/workspace/Nutch2.0/runtime/local$ bin/nutch fetch
Usage: FetcherJob (<batchId> | -all) [-crawlId <id>] [-threads N] [-parse] [-resume] [-numTasks N]
        batchId crawl identifier returned by Generator, or -all for all generated batchId-s
        -crawlId <id>    the id to prefix the schemas to operate on, (default: storage.crawl.id)
        -threads N      number of fetching threads per task
        -parse  if specified then fetcher will immediately parse fetched content
        -resume resume interrupted job
        -numTasks N     if N > 0 then use this many reduce tasks for fetching (default: mapred.map.tasks)


parse / org.apache.nutch.parse.ParserJob


~/java/workspace/Nutch2.0/runtime/local$ bin/nutch parse
Usage: ParserJob (<batchId> | -all) [-crawlId <id>] [-resume] [-force]
        batchId symbolic batch ID created by Generator
        -crawlId <id>    the id to prefix the schemas to operate on, (default: storage.crawl.id)
        -all    consider pages from all crawl jobs
-resume resume a previous incomplete job
-force  force re-parsing even if a page is already parsed

Once we have a local copy of the web pages, we need to parse them to extract keywords and the links each page points to. The parsing task is delegated to Tika.
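As a toy illustration of the outlink-extraction part (the real work is done by Tika and the Nutch parse plugins, not by a regex like this one):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class OutlinkSketch {
    private static final Pattern HREF = Pattern.compile("href=\"([^\"]+)\"");

    // Pull outlink targets out of raw HTML; in Nutch these become the new
    // rows that the updatedb step adds to the webpage table.
    public static List<String> outlinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<a href=\"http://wiki.apache.org/nutch/\">wiki</a>";
        System.out.println(outlinks(html));
    }
}
```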

updatedb / org.apache.nutch.crawl.DbUpdaterJob


~/java/workspace/Nutch2.0/runtime/local$ bin/nutch updatedb


solrindex / org.apache.nutch.indexer.solr.SolrIndexerJob


The indexing task is now delegated to Solr, a server built on Lucene indexes that makes the crawled documents searchable by indexing the data posted to it via HTTP. I ran into a few caveats before making it work. This is the suggested patch.

  • Avoid multiple values for the id field.
  • Allow multiple values for the tag field. Add a tld (Top Level Domain) field.
  • Get the content type from the WebPage object's member field. Otherwise, you will see NullPointerExceptions.
  • Compare strings with equals. That's a pretty random fix, but it avoids some surprises.

Index: conf/solrindex-mapping.xml
===================================================================
--- conf/solrindex-mapping.xml  (revision 1053817)
+++ conf/solrindex-mapping.xml  (working copy)
@@ -39,8 +39,7 @@
                <field dest="boost" source="boost"/>
                <field dest="digest" source="digest"/>
                <field dest="tstamp" source="tstamp"/>
-               <field dest="id" source="url"/>
-               <copyField source="url" dest="url"/>
+               <field dest="url" source="url"/>
        </fields>
        <uniqueKey>id</uniqueKey>
 </mapping>
Index: conf/schema.xml
===================================================================
--- conf/schema.xml     (revision 1053817)
+++ conf/schema.xml     (working copy)
@@ -95,12 +95,15 @@
 
         <!-- fields for feed plugin -->
         <field name="author" type="string" stored="true" indexed="true"/>
-        <field name="tag" type="string" stored="true" indexed="true"/>
+        <field name="tag" type="string" stored="true" indexed="true" multiValued="true"/>
         <field name="feed" type="string" stored="true" indexed="true"/>
         <field name="publishedDate" type="string" stored="true"
             indexed="true"/>
         <field name="updatedDate" type="string" stored="true"
             indexed="true"/>
+            
+        <field name="tld" type="string" stored="false" indexed="false"/>
+            
     </fields>
     <uniqueKey>id</uniqueKey>
     <defaultSearchField>content</defaultSearchField>
Index: src/plugin/index-more/src/java/org/apache/nutch/indexer/more/MoreIndexingFilter.java
===================================================================
--- src/plugin/index-more/src/java/org/apache/nutch/indexer/more/MoreIndexingFilter.java        (revision 1053817)
+++ src/plugin/index-more/src/java/org/apache/nutch/indexer/more/MoreIndexingFilter.java        (working copy)
@@ -172,7 +172,7 @@
    */
   private NutchDocument addType(NutchDocument doc, WebPage page, String url) {
     MimeType mimeType = null;
-    Utf8 contentType = page.getFromHeaders(new Utf8(HttpHeaders.CONTENT_TYPE));
+    Utf8 contentType = page.getContentType();
     if (contentType == null) {
       // Note by Jerome Charron on 20050415:
       // Content Type not solved by a previous plugin
Index: src/java/org/apache/nutch/indexer/solr/SolrWriter.java
===================================================================
--- src/java/org/apache/nutch/indexer/solr/SolrWriter.java      (revision 1053817)
+++ src/java/org/apache/nutch/indexer/solr/SolrWriter.java      (working copy)
@@ -56,7 +56,7 @@
       for (final String val : e.getValue()) {
         inputDoc.addField(solrMapping.mapKey(e.getKey()), val);
         String sCopy = solrMapping.mapCopyKey(e.getKey());
-        if (sCopy != e.getKey()) {
+        if (! sCopy.equals(e.getKey())) {
                inputDoc.addField(sCopy, val);
         }
       }
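
The last hunk replaces a reference comparison with equals(): in Java, == on two String objects compares object identity, not contents, so logically equal keys coming from different places can fail an == check. A minimal, self-contained illustration:

```java
public class StringCompare {
    public static void main(String[] args) {
        // Two distinct String objects holding the same characters.
        String key = new String("url");
        String copy = new String("url");
        System.out.println(key == copy);      // false: compares references
        System.out.println(key.equals(copy)); // true: compares contents
    }
}
```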

Download Solr. To set up the Solr server, copy the example directory from the Solr distribution, then copy the patched schema.xml configuration file to solr/conf of the Solr app.

cp -r $SOLR_HOME/example solrapp
cp $NUTCH_HOME/conf/schema.xml solrapp/solr/conf/
cd solrapp
java -jar start.jar

This starts the Solr server. Now let's index a few documents by passing the SolrIndexerJob class the batch id that shows up in the markers column.
Here are some excerpts of the logs from the Jetty server to make sure the documents were properly sent:

13-Jan-2011 19:50:47 org.apache.solr.update.processor.LogUpdateProcessor finish
INFO: {add=[com.truveo.www:http/, org.apache.wiki:http/nutch/, com.blogspot.techvineyard:http/]} 0 206
13-Jan-2011 19:50:47 org.apache.solr.core.SolrCore execute
INFO: [] webapp=/solr path=/update params={wt=javabin&version=1} status=0 QTime=206 
13-Jan-2011 19:50:47 org.apache.solr.update.DirectUpdateHandler2 commit
INFO: start commit(optimize=false,waitFlush=true,waitSearcher=true,expungeDeletes=false)
13-Jan-2011 19:50:47 org.apache.solr.core.SolrDeletionPolicy onCommit
INFO: SolrDeletionPolicy.onCommit: commits:num=2
        commit{dir=/home/alex/java/perso/nutch/solrapp/solr/data/index,segFN=segments_1,version=1294944630023,generation=1,filenames=[segments_1]
        commit{dir=/home/alex/java/perso/nutch/solrapp/solr/data/index,segFN=segments_2,version=1294944630024,generation=2,filenames=[_0.nrm, _0.tis, _0.fnm, _0.tii, _0.frq, segments_2, _0.fdx, _0.prx, _0.fdt]
13-Jan-2011 19:50:47 org.apache.solr.core.SolrDeletionPolicy updateCommits
INFO: newest commit = 1294944630024

You can now do a search via the API:

$ curl "http://localhost:8983/solr/select/?q=video&indent=on"
<?xml version="1.0" encoding="UTF-8"?>
<response>

<lst name="responseHeader">
 <int name="status">0</int>
 <int name="QTime">0</int>
 <lst name="params">
  <str name="indent">on</str>
  <str name="q">video</str>
 </lst>
</lst>
<result name="response" numFound="2" start="0">
 <doc>
  <arr name="anchor"><str>Logout</str></arr>
  <float name="boost">1.03571</float>
  <str name="date">20110212</str>
  <str name="digest">5d62587216b50ed7e52987b09dcb9925</str>
  <str name="id">com.truveo.www:http/</str>
  <str name="lang">unknown</str>
  <arr name="tag"><str/></arr>
  <str name="title">Truveo Video Search</str>
  <long name="tstamp">2011-02-12T18:37:53.031Z</long>
  <arr name="type"><str>text/html</str><str>text</str><str>html</str></arr>
  <str name="url">http://www.truveo.com/</str>
 </doc>
 <doc>
  <arr name="anchor"><str>Comments</str></arr>
  <float name="boost">1.00971</float>
  <str name="date">20110212</str>
  <str name="digest">59edefd6f4711895c2127d45b569d8c9</str>
  <str name="id">org.apache.wiki:http/nutch/</str>
  <str name="lang">en</str>
  <arr name="subcollection"><str>nutch</str></arr>
  <arr name="tag"><str/></arr>
  <str name="title">FrontPage - Nutch Wiki</str>
  <long name="tstamp">2011-02-12T18:37:53.863Z</long>
  <arr name="type"><str>text/html</str><str>text</str><str>html</str></arr>
  <str name="url">http://wiki.apache.org/nutch/</str>
 </doc>
</result>
</response>




Crawl Script


To automate the crawl process, we might want to use a Bash script that runs the suite of Nutch commands, then add it as a cron job. Don't forget to initialize the crawl db first with the inject command. We run several iterations of the generate/fetch/parse/update cycle with the for loop, and limit the number of urls fetched in one iteration with the -topN argument of the generate command.

#!/bin/bash

# Nutch crawl

export NUTCH_HOME=~/java/workspace/Nutch2.0/runtime/local

# depth in the web exploration
n=1
# number of selected urls for fetching
maxUrls=50000
# solr server
solrUrl=http://localhost:8983

for (( i = 1 ; i <= $n ; i++ ))
do

log=$NUTCH_HOME/logs/log

# Generate
$NUTCH_HOME/bin/nutch generate -topN $maxUrls > $log

batchId=`sed -n 's|.*batch id: \(.*\)|\1|p' < $log`

# rename log file by appending the batch id
log2=$log$batchId
mv $log $log2
log=$log2

# Fetch
$NUTCH_HOME/bin/nutch fetch $batchId >> $log

# Parse
$NUTCH_HOME/bin/nutch parse $batchId >> $log

# Update
$NUTCH_HOME/bin/nutch updatedb >> $log

# Index
$NUTCH_HOME/bin/nutch solrindex $solrUrl $batchId >> $log

done
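
The script relies on sed to pull the batch id out of the generate log. The same extraction, sketched in Java with a hypothetical log line (the exact wording comes from GeneratorJob's output, which the sed pattern in the script matches):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BatchIdExtractor {
    public static void main(String[] args) {
        // Hypothetical generate log line, for illustration only.
        String line = "GeneratorJob: generated batch id: 1295000000-1234567890";
        // Same pattern as the script: everything after "batch id: ".
        Matcher m = Pattern.compile(".*batch id: (.*)").matcher(line);
        if (m.matches()) {
            System.out.println(m.group(1));
        }
    }
}
```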




Conclusion


I managed to fetch 50k urls in one run with these minor changes. With the default values in conf/nutch-default.xml and MySQL as the datastore, these are the log timestamps when running the initialization and one iteration of the generate/fetch/update cycle:
2010-12-13 07:19:26,089 INFO  crawl.InjectorJob - InjectorJob: starting
2010-12-13 07:20:00,077 INFO  crawl.InjectorJob - InjectorJob: finished
2010-12-13 07:20:00,715 INFO  crawl.GeneratorJob - GeneratorJob: starting
2010-12-13 07:20:34,304 INFO  crawl.GeneratorJob - GeneratorJob: done
2010-12-13 07:20:35,041 INFO  fetcher.FetcherJob - FetcherJob: starting
2010-12-13 11:04:00,933 INFO  fetcher.FetcherJob - FetcherJob: done
2010-12-15 01:38:44,262 INFO  crawl.DbUpdaterJob - DbUpdaterJob: starting
2010-12-15 02:15:15,503 INFO  crawl.DbUpdaterJob - DbUpdaterJob: done
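
As a quick back-of-the-envelope check on throughput, the fetch phase above ran from 07:20:35 to 11:04:00 on December 13, about 13,400 seconds for 50k urls:

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

public class FetchThroughput {
    public static void main(String[] args) {
        DateTimeFormatter f = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");
        // Fetch start/end taken from the FetcherJob log lines above.
        LocalDateTime start = LocalDateTime.parse("2010-12-13 07:20:35", f);
        LocalDateTime end   = LocalDateTime.parse("2010-12-13 11:04:00", f);
        long seconds = Duration.between(start, end).getSeconds();
        System.out.println(seconds);
        System.out.printf("%.1f urls/s%n", 50000.0 / seconds);
    }
}
```

That works out to roughly 3.7 urls per second with the default fetcher settings.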
 
The next step is comparing with a setup backed by a HBase datastore. I tried once but got a memory error which left my HBase server instance unresponsive. See the description of the problem.
Please don't hesitate to comment and share your own feedback, difficulties and results.

37 comments:

  1. I built Nutch 2.0 using HBase as data storage. After building successfully, I ran the inject command:

    bin/nutch inject urls/seeds.txt

    i have following problem.

    InjectorJob: org.apache.gora.util.GoraException: java.lang.RuntimeException: java.net.MalformedURLException
    at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:110)
    at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:93)
    at org.apache.nutch.storage.StorageUtils.createWebStore(StorageUtils.java:43)
    at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:227)
    at org.apache.nutch.crawl.InjectorJob.inject(InjectorJob.java:252)
    at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:266)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.crawl.InjectorJob.main(InjectorJob.java:276)
    Caused by: java.lang.RuntimeException: java.net.MalformedURLException
    at org.apache.gora.hbase.store.HBaseStore.initialize(HBaseStore.java:128)
    at org.apache.gora.store.DataStoreFactory.initializeDataStore(DataStoreFactory.java:81)
    at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:104)
    ... 7 more
    Caused by: java.net.MalformedURLException
    at java.net.URL.(URL.java:617)
    at java.net.URL.(URL.java:480)
    at java.net.URL.(URL.java:429)
    at org.apache.xerces.impl.XMLEntityManager.setupCurrentEntity(Unknown Source)
    at org.apache.xerces.impl.XMLVersionDetector.determineDocVersion(Unknown Source)
    at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
    at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
    at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
    at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
    at org.jdom.input.SAXBuilder.build(SAXBuilder.java:489)
    at org.jdom.input.SAXBuilder.build(SAXBuilder.java:807)
    at org.apache.gora.hbase.store.HBaseStore.readMapping(HBaseStore.java:528)
    at org.apache.gora.hbase.store.HBaseStore.initialize(HBaseStore.java:114)
    ... 9 more
    Caused by: java.lang.NullPointerException
    at java.net.URL.(URL.java:522)
    ... 21 more

    Pls help me.

    ReplyDelete
  2. Nutch wiki page explains how to use HBase as Gora backend: http://wiki.apache.org/nutch/GORA_HBase

    There is a rather imperfect JUnit test for storage, org.apache.nutch.storage.TestGoraStorage.

    Honestly, when I ran it as is, it made my laptop hang because it takes up too many resources.


    You might want to apply this patch before running it:

    Index: src/test/org/apache/nutch/util/AbstractNutchTest.java
    ===================================================================
    --- src/test/org/apache/nutch/util/AbstractNutchTest.java (revision 1050697)
    +++ src/test/org/apache/nutch/util/AbstractNutchTest.java (working copy)
    @@ -16,34 +16,23 @@
    */
    package org.apache.nutch.util;

    -import java.io.IOException;
    -import java.nio.ByteBuffer;
    -import java.util.ArrayList;
    -import java.util.List;
    -
    import junit.framework.TestCase;

    -import org.apache.avro.util.Utf8;
    +import org.apache.gora.store.DataStore;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    -import org.apache.nutch.crawl.URLWebPage;
    -import org.apache.nutch.storage.Mark;
    import org.apache.nutch.storage.StorageUtils;
    import org.apache.nutch.storage.WebPage;
    -import org.apache.nutch.util.TableUtil;
    -import org.apache.gora.query.Query;
    -import org.apache.gora.query.Result;
    -import org.apache.gora.sql.store.SqlStore;
    -import org.apache.gora.store.DataStore;
    -import org.apache.gora.store.DataStoreFactory;
    -import org.apache.gora.util.ByteUtils;
    +import org.slf4j.Logger;
    +import org.slf4j.LoggerFactory;

    /**
    * This class provides common routines for setup/teardown of an in-memory data
    * store.
    */
    public class AbstractNutchTest extends TestCase {
    + protected static final Logger LOG = LoggerFactory.getLogger(AbstractNutchTest.class);

    protected Configuration conf;
    protected FileSystem fs;
    @@ -55,16 +44,12 @@
    public void setUp() throws Exception {
    super.setUp();
    conf = CrawlTestUtil.createConfiguration();
    - conf.set("storage.data.store.class", "org.gora.sql.store.SqlStore");
    fs = FileSystem.get(conf);
    - // using hsqldb in memory
    - DataStoreFactory.properties.setProperty("gora.sqlstore.jdbc.driver","org.hsqldb.jdbcDriver");
    - // use separate in-memory db-s for tests
    - DataStoreFactory.properties.setProperty("gora.sqlstore.jdbc.url","jdbc:hsqldb:mem:" + getClass().getName());
    - DataStoreFactory.properties.setProperty("gora.sqlstore.jdbc.user","sa");
    - DataStoreFactory.properties.setProperty("gora.sqlstore.jdbc.password","");
    webPageStore = StorageUtils.createWebStore(conf, String.class,
    WebPage.class);
    +
    + // empty the datastore
    + webPageStore.deleteByQuery(webPageStore.newQuery());
    }


    Good luck.

    ReplyDelete
  3. Hi Alexis,
    How can i execute query string to search data with nutch 2.0 using hbase?. In nutch 1.0,1.1, i can show the result throught webpage ui

    ReplyDelete
  4. I had a mistake in my configuration that caused that exception. Thanks to Alexis for the support.

    ReplyDelete
    Replies
    1. Trang, I have same exception, Can you tell me what you did to solve it? Thanks!

      Delete
  5. Dear Pham,

    You might want to use HBase shell to query the data.

    To do full text search, you need to index the crawled data, via the solrindex Nutch command, after parsing it (the crawl command automatically does it). I just added a new section on how to do the indexing part:
    http://techvineyard.blogspot.com/2010/12/build-nutch-20.html#solrindex

    It shows how to query Solr API and get XML results back.

    I have not tested the webapp yet, on either the Nutch or Solr side.

    ReplyDelete
  6. Dear Alexis,

    I had downloaded Nutch 2.0, but i found that missing some function such as Scoring, so i'm wondering when will nutch 2.0 release?

    ReplyDelete
  7. Dear Trang,

    See http://search.lucidimagination.com/search/document/a2522f02270e1125/release_planning
    that mentions a forthcoming Nutch 2.0 release.

    There is still some work to be done though.

    ReplyDelete
  8. Hi all,

    I have been trying to integrate Nutch crawler with Mysql database..
    For this I'm following this tutorial mentioned in the thread.
    http://techvineyard.blogspot.com/2010/12/build-nutch-20.html

    When I tried to do this step right click on ivy/ivy.xml, "Add Ivy Library ..." after modifying the mentioned files, I am facing problems.

    'Ivy resolve job of ivy/ivy.xml in 'nutch'' has encountered a problem.
    Impossible to reslove dependencies of org.apache.nutch#${ant.project.name};

    Impossible to resolve dependencies of org.apache.nutch#${ant.project.name};working@arjun-ninjas
    unresolved dependency: org.apache.gora#gora-core;0.1: not found
    unresolved dependency: org.apache.gora#gora-sql;0.1: not found
    unresolved dependency: org.restlet.jse#org.restlet;2.0.0: not found
    unresolved dependency: org.restlet.jse#org.restlet.ext.jackson;2.0.0: not found
    unresolved dependency: org.apache.gora#gora-core;0.1: not found
    unresolved dependency: org.apache.gora#gora-sql;0.1: not found
    unresolved dependency: org.restlet.jse#org.restlet;2.0.0: not found
    unresolved dependency: org.restlet.jse#org.restlet.ext.jackson;2.0.0: not found
    unresolved dependency: org.apache.gora#gora-core;0.1: not found
    unresolved dependency: org.apache.gora#gora-sql;0.1: not found
    unresolved dependency: org.restlet.jse#org.restlet;2.0.0: not found
    unresolved dependency: org.restlet.jse#org.restlet.ext.jackson;2.0.0: not found


    I am attaching the screen shot


    Could you please help me in this regard

    Thanks and regards,

    Ch. Arjun Kumar Reddy,
    International Institute of Information Technology – Bangalore (IIITB),

    ReplyDelete
  9. Dear Arjun Kumar,

    I think you may have missed a step in the tutorial. Maybe you didn't add "src/plugin/protocol-sftp/ivy.xml", so the dependencies could not be downloaded.

    I had the same problem, but after adding "src/plugin/protocol-sftp/ivy.xml" I resolved it.

    Good luck.

    ReplyDelete
  10. Hello,

    When I try to install gora I get

    ::::::::::::::::::::::::::::::::::::::::::::::
    [ivy:resolve] :: UNRESOLVED DEPENDENCIES ::
    [ivy:resolve] ::::::::::::::::::::::::::::::::::::::::::::::
    [ivy:resolve] :: com.sun.jersey#jersey-core;1.4: not found
    [ivy:resolve] :: com.sun.jersey#jersey-json;1.4: not found
    [ivy:resolve] :: com.sun.jersey#jersey-server;1.4: not found



    Thanks.
    Alex.

    ReplyDelete
  11. Dear Alex,

    I moved the Gora description part out of this page to a new blog post. See http://techvineyard.blogspot.com/2011/02/gora-orm-framework-for-hadoop-jobs.html#Gora_development_in_Eclipse

    I hope this will help.

    ReplyDelete
  12. Alexis, Thank you very much for your article.

    Whether you have a problem with the PrimaryKey in mysql, namely the length allowed (255)?
    My baseUrl length is very long > 255 b.
    I don't know how i can change unique id=> md5(baseUrl) and thеn for new Сrawl cicle get unReversedUrl() by custom column.

    Sorry for my bad language.

    ReplyDelete
  13. I tried to complete this tutorial but failed. A more detailed description can be found here (http://lucene.472066.n3.nabble.com/TestFetcher-hangs-td3091057.html).

    If you have any ideas, how to solve this, I would like to hear of them.

    ReplyDelete
  14. Hi Alexis,

    Just getting back into this again...

    I found that I could not compile gora (SVN downloaded 2011-12-31) with ant (don't know ivy, tried deleting the cache, no luck). Ultimately I was able to compile it with mvn (mvn compile; mvn jar:jar). I needed to do this to put the gora-cassandra.jar into the nutch runtime/local/lib directory.

    I am also using cassandra-1.0.6. I had to download the following additional jar files:
    - apache-cassandra-thrift-1.0.1.jar and libthrift-0.6.1.jar (both from cassandra's lib).
    - hector-core-1.0-1.jar and guava-r09.jar (from hector download).
    to put into the runtime/local/lib directory.

    After that I was able to do nutch inject successfully.

    ReplyDelete
  15. Hi Alexis,
    I am trying to test Nutch 2.0 with eclipse, but it is giving compilation error,
    "The method createDataStore(Class, Class, Class, Properties, String) in the type
    DataStoreFactory is not applicable for the arguments (Class>, Class, Class, Configuration, String)"
    in org.apache.nutch.storage.StorageUtils.createWebStore(StorageUtils.java:69).

    I tried to find whether there are other versions where the signatures match, but couldn't find.
    Could please help in this regard.

    Thanks.

    ReplyDelete
  16. This is deprecated. The updated steps are here: https://wiki.apache.org/nutch/RunNutchInEclipse

    ReplyDelete

  18. I have this issue with Hadoop-2.5.2 and Nutch-2.3.1 and hbase x.x.x.98-hadoop2

    InjectorJob: starting at 2016-09-16 17:21:21
    InjectorJob: Injecting urlDir: urls
    InjectorJob: Using class org.apache.gora.hbase.store.HBaseStore as the Gora storage class.
    Error: org.apache.gora.util.GoraException: java.lang.RuntimeException: java.net.MalformedURLException
    at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:167)
    at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:118)
    at org.apache.gora.mapreduce.GoraOutputFormat.getRecordWriter(GoraOutputFormat.java:88)
    at org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.(MapTask.java:624)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:744)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
    at java.security.AccessController.doPrivileged(AccessController.java:686)
    at javax.security.auth.Subject.doAs(Subject.java:569)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
    Caused by: java.lang.RuntimeException: java.net.MalformedURLException
    at org.apache.gora.hbase.store.HBaseStore.initialize(HBaseStore.java:132)
    at org.apache.gora.store.DataStoreFactory.initializeDataStore(DataStoreFactory.java:102)
    at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:161)
    ... 10 more
    Caused by: java.net.MalformedURLException
    at java.net.URL.(URL.java:639)
    at java.net.URL.(URL.java:502)
    at java.net.URL.(URL.java:451)
    at org.apache.xerces.impl.XMLEntityManager.setupCurrentEntity(Unknown Source)
    at org.apache.xerces.impl.XMLVersionDetector.determineDocVersion(Unknown Source)
    at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
    at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
    at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
    at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
    at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
    at org.jdom.input.SAXBuilder.build(SAXBuilder.java:518)
    at org.jdom.input.SAXBuilder.build(SAXBuilder.java:865)
    at org.apache.gora.hbase.store.HBaseStore.readMapping(HBaseStore.java:719)
    at org.apache.gora.hbase.store.HBaseStore.initialize(HBaseStore.java:116)
    ... 12 more
    Caused by: java.lang.NullPointerException
    at java.net.URL.(URL.java:544)
    ... 25 more


    Does anyone knows what is that?

    Thx in advance

    Wilson

    ReplyDelete