Testing Nutch 2.0 under Eclipse
Introduction
This is a guide on setting up Nutch 2 in an Eclipse project. You will then be able to hack the code and run tests, especially JUnit ones, pretty easily.
If you already have the Nutch branch or Gora trunk checked out, please run
$ svn up
to bring it up to date. Several fixes appear in this document as diff outputs. To apply a fix, save the diff content to a file in the root directory and run the patch command
$ patch -p0 < myDiff
Setup the projects in Eclipse
The idea is to be able to improve the Nutch and Gora code comfortably, with the help of the Eclipse IDE. You add the source directories (and ideally the test directories) of the modules you want to work on to the Java Build Path, and let the Ivy or Maven plugins resolve the external libraries. From there, the workflow is: update the code, optionally run a few JUnit tests, run the Ant task that builds the project, then submit a patch. This is the easiest and fastest way to stay productive.
Install plugins
Help > Install New Software ...
Add the following urls:
Check out SVN directories
Check-out Nutch branch and Gora trunk versions using the SVN wizard, with the following urls
File > New > Project ...
Note that you can just create a Java project and check out the Nutch source with the svn command, if you don't like the SVN Eclipse plugin:
$ cd ~/java/workspace/NutchGora
$ svn co http://svn.apache.org/repos/asf/nutch/branches/nutchgora branch
Build the projects
Window > Show View > Ant
Drag and drop the build.xml files into the Ant Eclipse tab.
Just double click on the Gora and Nutch items in the Ant view. That will run the default task. For Gora, it will publish the modules to the local Ivy repository. For Nutch, it will build a "release" in the runtime/local directory.
Nutch
Within the Nutch project, we want to manage the dependencies with the Ivy plugin, not the Maven one.
- The reference to the "nutch.root" property, which is set in build.xml for Ant, should be replaced in src/plugin/protocol-sftp/ivy.xml by the built-in "basedir" Ivy property; I am not sure how to load a custom property in the Eclipse Ivy plugin. This will break the Ant build, so be sure to change it back when running Ant tasks.
These are the Ivy configuration tweaks:
Index: src/plugin/protocol-sftp/ivy.xml
===================================================================
--- src/plugin/protocol-sftp/ivy.xml	(revision 1177967)
+++ src/plugin/protocol-sftp/ivy.xml	(working copy)
@@ -27,7 +27,7 @@
   </info>

   <configurations>
-    <include file="${nutch.root}/ivy/ivy-configurations.xml"/>
+    <include file="${basedir}/branch/ivy/ivy-configurations.xml"/>
   </configurations>

   <publications>
Index: ivy/ivy.xml
===================================================================
--- ivy/ivy.xml	(revision 1177967)
+++ ivy/ivy.xml	(working copy)
@@ -21,7 +21,7 @@
   </info>

   <configurations>
-    <include file="${basedir}/ivy/ivy-configurations.xml" />
+    <include file="${basedir}/branch/ivy/ivy-configurations.xml" />
   </configurations>

   <publications>
@@ -58,8 +58,9 @@
     <dependency org="org.apache.tika" name="tika-parsers" rev="0.9" />
     -->
+<!--
     <dependency org="org.apache.gora" name="gora-core" rev="0.1.1-incubating" conf="*->compile"/>
-
+-->
     <dependency org="log4j" name="log4j" rev="1.2.15" conf="*->master" />
     <dependency org="xerces" name="xercesImpl" rev="2.9.1" />
Now, right click on ivy/ivy.xml, "Add Ivy Library ...". Do the same for src/plugin/protocol-sftp/ivy.xml.
Remove the default src directory from the Source entries in the Java Build Path if it exists. Add at least the "java", "test" and "resources" source folders (src/java, src/test and conf) so that they get included in the classpath. That will allow us to run the classes and tests later from Eclipse.
This is what the project tree looks like:
Datastores
The datastore holds the information Nutch crawls from the web. You can opt for a relational database system or a column-oriented NoSQL store. Thanks to the Gora interface, you can use any backend you might be familiar with: HSQL, MySQL, HBase, Cassandra ...
HSQL
This is the default Gora backend. Make sure your Ivy configuration contains the dependency:
<dependency org="org.hsqldb" name="hsqldb" rev="2.0.0" conf="*->default"/>
This is the content of conf/gora.properties:
gora.sqlstore.jdbc.driver=org.hsqldb.jdbcDriver
gora.sqlstore.jdbc.url=jdbc:hsqldb:hsql://localhost/nutchtest
gora.sqlstore.jdbc.user=sa
gora.sqlstore.jdbc.password=
Set up HSQL. I downloaded this version: HSQLDB 2.0.0. Finally, start the HSQL server with the matching database alias, "nutchtest":
~/java/ext/hsqldb-2.0.0/hsqldb/data$ java -cp ../lib/hsqldb.jar org.hsqldb.server.Server --database.0 file:crawldb --dbname.0 nutchtest
[Server@12ac982]: [Thread[main,5,main]]: checkRunning(false) entered
[Server@12ac982]: [Thread[main,5,main]]: checkRunning(false) exited
[Server@12ac982]: Startup sequence initiated from main() method
[Server@12ac982]: Loaded properties from [/home/alex/java/ext/hsqldb-2.0.0/hsqldb/data/server.properties]
[Server@12ac982]: Initiating startup sequence...
[Server@12ac982]: Server socket opened successfully in 5 ms.
[Server@12ac982]: Database [index=0, id=0, db=file:crawldb, alias=nutchtest] opened sucessfully in 420 ms.
[Server@12ac982]: Startup sequence completed in 426 ms.
[Server@12ac982]: 2011-01-12 10:41:56.181 HSQLDB server 2.0.0 is online on port 9001
[Server@12ac982]: To close normally, connect and execute SHUTDOWN SQL
[Server@12ac982]: From command line, use [Ctrl]+[C] to abort abruptly
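Before pointing Nutch at the server, you can verify it is reachable with a minimal JDBC sketch that reuses the exact driver, url and credentials from conf/gora.properties above (the class name HsqlPing is ours; hsqldb.jar must be on the classpath):

import java.sql.Connection;
import java.sql.DriverManager;

public class HsqlPing {
  public static void main(String[] args) throws Exception {
    // Same settings as conf/gora.properties above.
    Class.forName("org.hsqldb.jdbcDriver");
    Connection c = DriverManager.getConnection(
        "jdbc:hsqldb:hsql://localhost/nutchtest", "sa", "");
    System.out.println("connected: " + !c.isClosed());
    c.close();
  }
}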
MySQL
To use MySQL as the datastore, add the dependency to ivy/ivy.xml:
<dependency org="mysql" name="mysql-connector-java" rev="5.1.13" conf="*->default"></dependency>
Change conf/gora.properties to set up the MySQL connection:
gora.sqlstore.jdbc.driver=com.mysql.jdbc.Driver
gora.sqlstore.jdbc.url=jdbc:mysql://localhost:3306/nutch?createDatabaseIfNotExist=true
gora.sqlstore.jdbc.user=alex
gora.sqlstore.jdbc.password=some_pass
You will notice MySQL is a lot faster than HSQL: at least 12x in my case with the default setups. For example, injecting 50k urls took 6 min with HSQL versus 30 sec with MySQL. You could make a similar comparison in Rails between SQLite and MySQL ...
HBase
Add the gora-hbase and zookeeper dependencies in ivy/ivy.xml:
<dependency org="org.apache.gora" name="gora-hbase" rev="0.1" conf="*->compile"> <exclude org="com.sun.jdmk"/> <exclude org="com.sun.jmx"/> <exclude org="javax.jms"/> </dependency> <dependency org="org.apache.zookeeper" name="zookeeper" rev="3.3.2" conf="*->default"> <exclude org="com.sun.jdmk"/> <exclude org="com.sun.jmx"/> <exclude org="javax.jms"/> </dependency>
After rebuilding Nutch, you will need to manually add the HBase jar to the runtime/local/lib directory, since a recent version (0.20.6?) is not available in the Maven repository:
$ cp $HBASE_HOME/hbase-*.jar $NUTCH_HOME/runtime/local/lib
Create the $NUTCH_HOME/runtime/local/conf/gora-hbase-mapping.xml file as described on the GORA_HBase wiki page. Override the storage.data.store.class property in runtime/local/conf/nutch-site.xml:
<property>
  <name>storage.data.store.class</name>
  <value>org.apache.gora.hbase.store.HBaseStore</value>
</property>
Finally, set up and run HBase. See this blog entry as a quick guide.
Cassandra
Start Cassandra:
$ bin/cassandra -f
Run the injector job with the Gora setting pointing to the Cassandra store in nutch-site.xml:
<property>
  <name>storage.data.store.class</name>
  <value>org.apache.gora.cassandra.store.CassandraStore</value>
  <description>Default class for storing data</description>
</property>
Place gora-cassandra-mapping.xml in your Nutch conf directory, which is included in the classpath. This configuration defines how Avro fields are stored in Cassandra. This is the content of the gora-cassandra-mapping.xml file:
<gora-orm>
  <keyspace name="webpage" cluster="Test Cluster" host="localhost">
    <family name="p"/>
    <family name="f"/>
    <family name="sc" type="super"/>
  </keyspace>
  <class keyClass="java.lang.String" name="org.apache.nutch.storage.WebPage">
    <!-- fetch fields -->
    <field name="baseUrl" family="f" qualifier="bas"/>
    <field name="status" family="f" qualifier="st"/>
    <field name="prevFetchTime" family="f" qualifier="pts"/>
    <field name="fetchTime" family="f" qualifier="ts"/>
    <field name="fetchInterval" family="f" qualifier="fi"/>
    <field name="retriesSinceFetch" family="f" qualifier="rsf"/>
    <field name="reprUrl" family="f" qualifier="rpr"/>
    <field name="content" family="f" qualifier="cnt"/>
    <field name="contentType" family="f" qualifier="typ"/>
    <field name="modifiedTime" family="f" qualifier="mod"/>
    <!-- parse fields -->
    <field name="title" family="p" qualifier="t"/>
    <field name="text" family="p" qualifier="c"/>
    <field name="signature" family="p" qualifier="sig"/>
    <field name="prevSignature" family="p" qualifier="psig"/>
    <!-- score fields -->
    <field name="score" family="f" qualifier="s"/>
    <!-- super columns -->
    <field name="markers" family="sc" qualifier="mk"/>
    <field name="inlinks" family="sc" qualifier="il"/>
    <field name="outlinks" family="sc" qualifier="ol"/>
    <field name="metadata" family="sc" qualifier="mtdt"/>
    <field name="headers" family="sc" qualifier="h"/>
    <field name="parseStatus" family="sc" qualifier="pas"/>
    <field name="protocolStatus" family="sc" qualifier="prs"/>
  </class>
</gora-orm>
Check that the data has been initialized and is getting populated:
$ bin/cassandra-cli --host localhost
[default@unknown] use webpage;
Authenticated to keyspace: webpage
[default@webpage] update column family p with key_validation_class=UTF8Type;
138c3060-b623-11e0-0000-242d50cf1fb7
Waiting for schema agreement...
... schemas agree across the cluster
[default@webpage] update column family f with key_validation_class=UTF8Type;
139de3a0-b623-11e0-0000-242d50cf1fb7
Waiting for schema agreement...
... schemas agree across the cluster
[default@webpage] update column family sc with key_validation_class=UTF8Type;
13b2f240-b623-11e0-0000-242d50cf1fb7
Waiting for schema agreement...
... schemas agree across the cluster
[default@webpage] list f;
Using default limit of 100
-------------------
RowKey: com.truveo.www:http/
=> (column=fi, value=2592000, timestamp=1311532210076000)
=> (column=s, value=1.0, timestamp=1311532210080000)
=> (column=ts, value=1311532203790, timestamp=1311532209796000)
-------------------
RowKey: com.blogspot.techvineyard:http/
=> (column=fi, value=2592000, timestamp=1311532210134000)
=> (column=s, value=1.0, timestamp=1311532210137000)
=> (column=ts, value=1311532203790, timestamp=1311532210131000)
-------------------
RowKey: org.apache.wiki:http/nutch/
=> (column=fi, value=2592000, timestamp=1311532210146000)
=> (column=s, value=1.0, timestamp=1311532210149000)
=> (column=ts, value=1311532203790, timestamp=1311532210144000)
3 Rows Returned.
[default@webpage]
JUnit Tests
Let's run a few unit tests to verify the setup.
Datastore
We want to make the GoraStorage test pass. First apply this patch, which tests your datastore settings while sparing an older laptop with limited resources:
Index: src/test/org/apache/nutch/storage/TestGoraStorage.java
===================================================================
--- src/test/org/apache/nutch/storage/TestGoraStorage.java	(revision 1053817)
+++ src/test/org/apache/nutch/storage/TestGoraStorage.java	(working copy)
@@ -1,22 +1,17 @@
 package org.apache.nutch.storage;
 
-import java.io.File;
-import java.util.ArrayList;
 import java.util.BitSet;
-import java.util.Iterator;
-import java.util.List;
 import java.util.Random;
-import java.util.Vector;
 import java.util.concurrent.atomic.AtomicInteger;
 
+import junit.framework.TestCase;
+
 import org.apache.avro.util.Utf8;
-import org.apache.hadoop.conf.Configuration;
-import org.apache.nutch.util.NutchConfiguration;
 import org.apache.gora.query.Result;
 import org.apache.gora.store.DataStore;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.nutch.util.NutchConfiguration;
 
-import junit.framework.TestCase;
-
 public class TestGoraStorage extends TestCase {
 
   Configuration conf;
@@ -80,8 +75,8 @@
   private AtomicInteger threadCount = new AtomicInteger(0);
 
   public void testMultithread() throws Exception {
-    int COUNT = 1000;
-    int NUM = 100;
+    int COUNT = 50;
+    int NUM = 5;
     DataStore<String,WebPage> store;
 
     for (int i = 0; i < NUM; i++) {
@@ -113,115 +108,4 @@
     assertEquals(size, keys.cardinality());
   }
 
-  public void testMultiProcess() throws Exception {
-    int COUNT = 1000;
-    int NUM = 100;
-    DataStore<String,WebPage> store;
-    List<Process> procs = new ArrayList<Process>();
-
-    for (int i = 0; i < NUM; i++) {
-      Process p = launch(i, i * COUNT, COUNT);
-      procs.add(p);
-    }
-
-    while (procs.size() > 0) {
-      try {
-        Thread.sleep(5000);
-      } catch (Exception e) {};
-      Iterator<Process> it = procs.iterator();
-      while (it.hasNext()) {
-        Process p = it.next();
-        int code = 1;
-        try {
-          code = p.exitValue();
-          assertEquals(0, code);
-          it.remove();
-          p.destroy();
-        } catch (IllegalThreadStateException e) {
-          // not ready yet
-        }
-      }
-      System.out.println("* running " + procs.size() + "/" + NUM);
-    }
-    System.out.println("Verifying...");
-    store = StorageUtils.createDataStore(conf, String.class, WebPage.class);
-    Result<String,WebPage> res = store.execute(store.newQuery());
-    int size = COUNT * NUM;
-    BitSet keys = new BitSet(size);
-    while (res.next()) {
-      String key = res.getKey();
-      WebPage p = res.get();
-      assertEquals(key, p.getTitle().toString());
-      int pos = Integer.parseInt(key);
-      assertTrue(pos < size && pos >= 0);
-      if (keys.get(pos)) {
-        fail("key " + key + " already set!");
-      }
-      keys.set(pos);
-    }
-    if (size != keys.cardinality()) {
-      System.out.println("ERROR Missing keys:");
-      for (int i = 0; i < size; i++) {
-        if (keys.get(i)) continue;
-        System.out.println(" " + i);
-      }
-      fail("key count should be " + size + " but is " + keys.cardinality());
-    }
-  }
-
-  private Process launch(int id, int start, int count) throws Exception {
-    // Build exec child jmv args.
-    Vector<String> vargs = new Vector<String>(8);
-    File jvm = // use same jvm as parent
-      new File(new File(System.getProperty("java.home"), "bin"), "java");
-
-    vargs.add(jvm.toString());
-
-    // Add child (task) java-vm options.
-    // tmp dir
-    String prop = System.getProperty("java.io.tmpdir");
-    vargs.add("-Djava.io.tmpdir=" + prop);
-    // library path
-    prop = System.getProperty("java.library.path");
-    if (prop != null) {
-      vargs.add("-Djava.library.path=" + prop);
-    }
-    // working dir
-    prop = System.getProperty("user.dir");
-    vargs.add("-Duser.dir=" + prop);
-    // combat the stupid Xerces issue
-    vargs.add("-Djavax.xml.parsers.DocumentBuilderFactory=com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderFactoryImpl");
-    // prepare classpath
-    String sep = System.getProperty("path.separator");
-    StringBuffer classPath = new StringBuffer();
-    // start with same classpath as parent process
-    classPath.append(System.getProperty("java.class.path"));
-    //classPath.append(sep);
-    // Add classpath.
-    vargs.add("-classpath");
-    vargs.add(classPath.toString());
-
-    // append class name and args
-    vargs.add(TestGoraStorage.class.getName());
-    vargs.add(String.valueOf(id));
-    vargs.add(String.valueOf(start));
-    vargs.add(String.valueOf(count));
-    ProcessBuilder builder = new ProcessBuilder(vargs);
-    return builder.start();
-  }
-
-  public static void main(String[] args) throws Exception {
-    if (args.length < 3) {
-      System.err.println("Usage: TestGoraStore <id> <startKey> <numRecords>");
-      System.exit(-1);
-    }
-    TestGoraStorage test = new TestGoraStorage();
-    test.init();
-    int id = Integer.parseInt(args[0]);
-    int start = Integer.parseInt(args[1]);
-    int count = Integer.parseInt(args[2]);
-    Worker w = test.new Worker(id, start, count, true);
-    w.run();
-    System.exit(0);
-  }
 }
Index: src/test/org/apache/nutch/util/AbstractNutchTest.java
===================================================================
--- src/test/org/apache/nutch/util/AbstractNutchTest.java	(revision 1053817)
+++ src/test/org/apache/nutch/util/AbstractNutchTest.java	(working copy)
@@ -16,28 +16,14 @@
  */
 package org.apache.nutch.util;
 
-import java.io.IOException;
-import java.nio.ByteBuffer;
-import java.util.ArrayList;
-import java.util.List;
-
 import junit.framework.TestCase;
 
-import org.apache.avro.util.Utf8;
+import org.apache.gora.store.DataStore;
 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.fs.FileSystem;
 import org.apache.hadoop.fs.Path;
-import org.apache.nutch.crawl.URLWebPage;
-import org.apache.nutch.storage.Mark;
 import org.apache.nutch.storage.StorageUtils;
 import org.apache.nutch.storage.WebPage;
-import org.apache.nutch.util.TableUtil;
-import org.apache.gora.query.Query;
-import org.apache.gora.query.Result;
-import org.apache.gora.sql.store.SqlStore;
-import org.apache.gora.store.DataStore;
-import org.apache.gora.store.DataStoreFactory;
-import org.apache.gora.util.ByteUtils;
 
 /**
  * This class provides common routines for setup/teardown of an in-memory data
@@ -55,16 +41,12 @@
   public void setUp() throws Exception {
     super.setUp();
     conf = CrawlTestUtil.createConfiguration();
-    conf.set("storage.data.store.class", "org.gora.sql.store.SqlStore");
     fs = FileSystem.get(conf);
-    // using hsqldb in memory
-    DataStoreFactory.properties.setProperty("gora.sqlstore.jdbc.driver","org.hsqldb.jdbcDriver");
-    // use separate in-memory db-s for tests
-    DataStoreFactory.properties.setProperty("gora.sqlstore.jdbc.url","jdbc:hsqldb:mem:" + getClass().getName());
-    DataStoreFactory.properties.setProperty("gora.sqlstore.jdbc.user","sa");
-    DataStoreFactory.properties.setProperty("gora.sqlstore.jdbc.password","");
     webPageStore = StorageUtils.createWebStore(conf, String.class,
         WebPage.class);
+
+    // empty the datastore
+    webPageStore.deleteByQuery(webPageStore.newQuery());
   }
 
   @Override
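If you just want a quick sanity check of the datastore wiring before running the full test, a minimal round trip through the web table can be sketched with the same Gora calls the test uses. This is only a sketch: the class name and the key are made up, and it assumes the backend configured in gora.properties is up and running:

import org.apache.avro.util.Utf8;
import org.apache.gora.store.DataStore;
import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.storage.StorageUtils;
import org.apache.nutch.storage.WebPage;
import org.apache.nutch.util.NutchConfiguration;

public class DataStoreSmokeTest {
  public static void main(String[] args) throws Exception {
    Configuration conf = NutchConfiguration.create();
    // Same factory call as AbstractNutchTest.setUp()
    DataStore<String, WebPage> store =
        StorageUtils.createWebStore(conf, String.class, WebPage.class);
    WebPage page = new WebPage();
    page.setTitle(new Utf8("smoke test"));
    store.put("com.example.www:http/", page); // keys are reversed urls
    store.flush();
    WebPage read = store.get("com.example.www:http/");
    System.out.println("read back title: " + read.getTitle());
    store.close();
  }
}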
Fetch
We can try to run the Fetcher test as well.
- Change the location of the static files that the Jetty server returns to the Nutch crawler, from "build/test/data/fetch-test-site" to "src/testresources/fetch-test-site".
- Also override the plugin directory setting for testing purposes.
- Set http.agent.name and http.robots.agents properties.
- Limit the content length to the maximum for a blob column type. This is only required for MySQL.
Index: src/test/nutch-site.xml
===================================================================
--- src/test/nutch-site.xml	(revision 1053817)
+++ src/test/nutch-site.xml	(working copy)
@@ -22,4 +22,20 @@
     <description>Default class for storing data</description>
 </property>
 
+<property>
+  <name>plugin.folders</name>
+  <value>build/plugins</value>
+</property>
+<property>
+  <name>http.agent.name</name>
+  <value>NutchRobot</value>
+</property>
+<property>
+  <name>http.robots.agents</name>
+  <value>NutchRobot,*</value>
+</property>
+<property>
+  <name>http.content.limit</name>
+  <value>65535</value>
+</property>
 </configuration>
Index: src/test/org/apache/nutch/fetcher/TestFetcher.java
===================================================================
--- src/test/org/apache/nutch/fetcher/TestFetcher.java	(revision 1050697)
+++ src/test/org/apache/nutch/fetcher/TestFetcher.java	(working copy)
@@ -50,7 +50,7 @@
   public void setUp() throws Exception{
     super.setUp();
     urlPath = new Path(testdir, "urls");
-    server = CrawlTestUtil.getServer(conf.getInt("content.server.port",50000), "build/test/data/fetch-test-site");
+    server = CrawlTestUtil.getServer(conf.getInt("content.server.port",50000), "src/testresources/fetch-test-site");
     server.start();
   }
Now right click on the org.apache.nutch.fetcher.TestFetcher class located in the src/test source directory, then "Run As" > "JUnit Test".
Nutch Commands
Several commands are available to maintain and index your crawl. Here are the options printed by the Bash script:
~/java/workspace/Nutch2.0/runtime/local$ bin/nutch
Usage: nutch [-core] COMMAND
where COMMAND is one of:
  inject     inject new urls into the database
  generate   generate new segments to fetch from crawl db
  fetch      fetch URLs marked during generate
  parse      parse URLs marked during fetch
  updatedb   update web table after parsing
  readdb     read/dump records from page database
  solrindex  run the solr indexer on parsed segments and linkdb
  solrdedup  remove duplicates from solr
  plugin     load a plugin and run one of its classes main()
 or
  CLASSNAME  run the class named CLASSNAME
Most commands print help when invoked w/o parameters.
Expert: -core option is for developers only. It avoids building the job jar,
        instead it simply includes classes compiled with ant compile-core.
        NOTE: this works only for jobs executed in 'local' mode
Running Nutch classes from Eclipse
You can either run a command with the Bash script or execute a Nutch class directly from Eclipse. The latter is easier for development since you do not need to build the whole project each time you change something. When a Nutch class is executed, it first loads the configuration by looking in the classpath for a nutch-site.xml file that overrides nutch-default.xml. Depending on the order of the "src/test" and "conf" source directories in your Eclipse build path, only one nutch-site.xml file will end up on the classpath. In my case, it was the one in "src/test". If I edit the one in "conf", I see the warning
The resource is a duplicate of src/test/nutch-default.xml and was not copied to the output folder.
which indicates it will be ignored. So you want to edit the one that is activated.
- Apply the modifications to src/test/nutch-site.xml (or conf/nutch-site.xml, depending on your classpath order) that are given in the Fetch test section above.
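A quick way to check which nutch-site.xml wins is to print a property you know differs between the two files. NutchConfiguration.create() loads nutch-default.xml then nutch-site.xml from the classpath, so this minimal sketch (the class name ConfCheck is ours) reflects the same resolution order Eclipse uses:

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.util.NutchConfiguration;

public class ConfCheck {
  public static void main(String[] args) {
    // Loads nutch-default.xml, then nutch-site.xml, from the classpath.
    Configuration conf = NutchConfiguration.create();
    System.out.println("http.agent.name = " + conf.get("http.agent.name"));
    System.out.println("storage.data.store.class = " + conf.get("storage.data.store.class"));
  }
}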
crawl / org.apache.nutch.crawl.Crawler
~/java/workspace/Nutch2.0/runtime/local$ bin/nutch crawl
Usage: Crawl (<seedDir> | -continue) [-solr <solrURL>] [-threads n] [-depth i] [-topN N]
Right click on org.apache.nutch.crawl.Crawler in the src/java source directory, then "Run As" > "Java Application". The options are described below; a sketch for launching the crawler programmatically follows the list.
- The first argument, "seedDir", is the path to a directory containing lists of seed urls. These urls are injected into the database. They define the initial set of pages visited by the crawler during the first iteration of the graph exploration. The crawler then expands the graph by adding the neighbours of these pages as it extracts new urls from the page content; these new pages are visited in the second iteration.
- The -continue parameter instead resumes the crawl without injecting any seeds.
- -solr defines the Solr server used to index the documents.
- -threads defines the number of threads spawned to fetch several pages simultaneously.
- -depth defines the number of iterations in the graph exploration, before the traversal gets pruned.
- -topN limits the number of urls that get downloaded in one iteration.
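If you prefer launching it from code rather than a Run Configuration, the sketch below should be equivalent to "bin/nutch crawl seeds -depth 2 -topN 1000". It assumes Crawler implements Hadoop's Tool interface like the other job classes (the stack traces later on this page show InjectorJob does); the seeds path and numbers are only examples:

import org.apache.hadoop.util.ToolRunner;
import org.apache.nutch.crawl.Crawler;
import org.apache.nutch.util.NutchConfiguration;

public class RunCrawl {
  public static void main(String[] args) throws Exception {
    // Program arguments as you would type them in the Run Configuration dialog.
    String[] crawlArgs = { "seeds", "-depth", "2", "-topN", "1000" };
    int res = ToolRunner.run(NutchConfiguration.create(), new Crawler(), crawlArgs);
    System.exit(res);
  }
}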
Let's create some input for the crawl command. This is the content of a seeds/urls file that we can use for the demo:
http://techvineyard.blogspot.com/
http://www.truveo.com/
http://wiki.apache.org/nutch/
I used MySQL as the datastore. Let's clear the webpage table, if it exists, before running the crawl command:
$ mysql -hlocalhost -ualex -psome_pass nutch
mysql> delete from webpage;
From the Eclipse menu:
Run > Run Configurations ...
Click Run. You can compare your output with my logs here. Then check the content of the MySQL table:
mysql> describe webpage;
+-------------------+----------------+------+-----+---------+-------+
| Field             | Type           | Null | Key | Default | Extra |
+-------------------+----------------+------+-----+---------+-------+
| id                | varchar(512)   | NO   | PRI | NULL    |       |
| headers           | blob           | YES  |     | NULL    |       |
| text              | varchar(32000) | YES  |     | NULL    |       |
| status            | int(11)        | YES  |     | NULL    |       |
| markers           | blob           | YES  |     | NULL    |       |
| parseStatus       | blob           | YES  |     | NULL    |       |
| modifiedTime      | bigint(20)     | YES  |     | NULL    |       |
| score             | float          | YES  |     | NULL    |       |
| typ               | varchar(32)    | YES  |     | NULL    |       |
| baseUrl           | varchar(512)   | YES  |     | NULL    |       |
| content           | blob           | YES  |     | NULL    |       |
| title             | varchar(512)   | YES  |     | NULL    |       |
| reprUrl           | varchar(512)   | YES  |     | NULL    |       |
| fetchInterval     | int(11)        | YES  |     | NULL    |       |
| prevFetchTime     | bigint(20)     | YES  |     | NULL    |       |
| inlinks           | blob           | YES  |     | NULL    |       |
| prevSignature     | blob           | YES  |     | NULL    |       |
| outlinks          | blob           | YES  |     | NULL    |       |
| fetchTime         | bigint(20)     | YES  |     | NULL    |       |
| retriesSinceFetch | int(11)        | YES  |     | NULL    |       |
| protocolStatus    | blob           | YES  |     | NULL    |       |
| signature         | blob           | YES  |     | NULL    |       |
| metadata          | blob           | YES  |     | NULL    |       |
+-------------------+----------------+------+-----+---------+-------+
23 rows in set (0.14 sec)

mysql> select count(*) from webpage;
+----------+
| count(*) |
+----------+
|      151 |
+----------+
1 row in set (0.00 sec)

mysql> select id, markers from webpage where content is not null;
+---------------------------------+-----------------------------------------+
| id                              | markers                                 |
+---------------------------------+-----------------------------------------+
| org.apache.wiki:http/nutch/     | _injmrk_y_updmrk_*1294943864-1806760603 |
| com.blogspot.techvineyard:http/ | _injmrk_y_updmrk_*1294943864-1806760603 |
| com.truveo.www:http/            | _injmrk_y_updmrk_*1294943864-1806760603 |
+---------------------------------+-----------------------------------------+
3 rows in set (0.00 sec)
readdb / org.apache.nutch.crawl.WebTableReader
~/java/workspace/Nutch2.0/runtime/local$ bin/nutch readdb
Usage: WebTableReader (-stats | -url [url] | -dump <out_dir> [-regex regex]) [-crawlId <id>] [-content] [-headers] [-links] [-text]
    -crawlId <id>   the id to prefix the schemas to operate on, (default: storage.crawl.id)
    -stats [-sort]  print overall statistics to System.out
      [-sort]       list status sorted by host
    -url <url>      print information on <url> to System.out
    -dump <out_dir> [-regex regex]  dump the webtable to a text file in <out_dir>
      -content      dump also raw content
      -headers      dump protocol headers
      -links        dump links
      -text         dump extracted text
      [-regex]      filter on the URL of the webtable entry
The WebTableReader class scans the entire database via a Hadoop job and outputs all the fields.
inject / org.apache.nutch.crawl.InjectorJob
~/java/workspace/Nutch2.0/runtime/local$ bin/nutch inject
Usage: InjectorJob <url_dir> [-crawlId <id>]
First, we need to initialize the crawl db. The "url_dir" argument to the inject command is a directory containing flat files of lists of urls, used as "seeds".
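The injector can also be driven from Eclipse without the Bash wrapper. This sketch goes through Hadoop's ToolRunner, the same entry point InjectorJob's own main method uses (the "seeds" directory name is just an example):

import org.apache.hadoop.util.ToolRunner;
import org.apache.nutch.crawl.InjectorJob;
import org.apache.nutch.util.NutchConfiguration;

public class RunInject {
  public static void main(String[] args) throws Exception {
    // Equivalent to: bin/nutch inject seeds
    int res = ToolRunner.run(NutchConfiguration.create(),
        new InjectorJob(), new String[] { "seeds" });
    System.exit(res);
  }
}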
generate / org.apache.nutch.crawl.GeneratorJob
~/java/workspace/Nutch2.0/runtime/local$ bin/nutch generate
GeneratorJob: Selecting best-scoring urls due for fetch.
GeneratorJob: starting
GeneratorJob: filtering: true
GeneratorJob: done
GeneratorJob: generated batch id: 1294943864-1806760603
This step generates a batch id identifying the urls selected to be fetched.
fetch / org.apache.nutch.fetcher.FetcherJob
~/java/workspace/Nutch2.0/runtime/local$ bin/nutch fetch
Usage: FetcherJob (<batchId> | -all) [-crawlId <id>] [-threads N] [-parse] [-resume] [-numTasks N]
    batchId        crawl identifier returned by Generator, or -all for all generated batchId-s
    -crawlId <id>  the id to prefix the schemas to operate on, (default: storage.crawl.id)
    -threads N     number of fetching threads per task
    -parse         if specified then fetcher will immediately parse fetched content
    -resume        resume interrupted job
    -numTasks N    if N > 0 then use this many reduce tasks for fetching (default: mapred.map.tasks)
parse / org.apache.nutch.parse.ParserJob
~/java/workspace/Nutch2.0/runtime/local$ bin/nutch parse
Usage: ParserJob (<batchId> | -all) [-crawlId <id>] [-resume] [-force]
    batchId        symbolic batch ID created by Generator
    -crawlId <id>  the id to prefix the schemas to operate on, (default: storage.crawl.id)
    -all           consider pages from all crawl jobs
    -resume        resume a previous incomplete job
    -force         force re-parsing even if a page is already parsed
Once we have a local copy of the web pages, we need to parse them to extract keywords and the links each page points to. This parsing task is delegated to Tika.
updatedb / org.apache.nutch.crawl.DbUpdaterJob
~/java/workspace/Nutch2.0/runtime/local$ bin/nutch updatedb
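Since every step is a plain job class, one generate/fetch/parse/update iteration can also be chained from a single Java driver, which is convenient in Eclipse. A sketch, assuming each class implements Hadoop's Tool interface (InjectorJob demonstrably does) and using -all rather than an explicit batch id:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.apache.nutch.crawl.DbUpdaterJob;
import org.apache.nutch.crawl.GeneratorJob;
import org.apache.nutch.fetcher.FetcherJob;
import org.apache.nutch.parse.ParserJob;
import org.apache.nutch.util.NutchConfiguration;

public class CrawlCycle {
  public static void main(String[] args) throws Exception {
    Configuration conf = NutchConfiguration.create();
    // One generate/fetch/parse/update iteration; -all processes every
    // generated batch instead of a specific batch id.
    run(conf, new GeneratorJob());
    run(conf, new FetcherJob(), "-all");
    run(conf, new ParserJob(), "-all");
    run(conf, new DbUpdaterJob());
  }

  static void run(Configuration conf, Tool job, String... jobArgs) throws Exception {
    if (ToolRunner.run(conf, job, jobArgs) != 0) {
      throw new IllegalStateException(job.getClass().getName() + " failed");
    }
  }
}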
solrindex / org.apache.nutch.indexer.solr.SolrIndexerJob
The indexing task is now delegated to Solr, a server built on Lucene indexes that makes the crawled documents searchable by indexing the data posted to it via HTTP. I ran into a few caveats before making it work. This is the suggested patch.
- Avoid multiple values for id field.
- Allow multiple values for tag field. Add tld (Top Level Domain) field.
- Get the content type from the WebPage object's own member. Otherwise, you will see NullPointerExceptions.
- Compare strings with equals() instead of ==. Reference comparison only works by accident, so this avoids some surprises.
Index: conf/solrindex-mapping.xml
===================================================================
--- conf/solrindex-mapping.xml	(revision 1053817)
+++ conf/solrindex-mapping.xml	(working copy)
@@ -39,8 +39,7 @@
 	<field dest="boost" source="boost"/>
 	<field dest="digest" source="digest"/>
 	<field dest="tstamp" source="tstamp"/>
-	<field dest="id" source="url"/>
-	<copyField source="url" dest="url"/>
+	<field dest="url" source="url"/>
 </fields>
 <uniqueKey>id</uniqueKey>
 </mapping>
Index: conf/schema.xml
===================================================================
--- conf/schema.xml	(revision 1053817)
+++ conf/schema.xml	(working copy)
@@ -95,12 +95,15 @@
 
         <!-- fields for feed plugin -->
         <field name="author" type="string" stored="true" indexed="true"/>
-        <field name="tag" type="string" stored="true" indexed="true"/>
+        <field name="tag" type="string" stored="true" indexed="true" multiValued="true"/>
         <field name="feed" type="string" stored="true" indexed="true"/>
         <field name="publishedDate" type="string" stored="true" indexed="true"/>
         <field name="updatedDate" type="string" stored="true" indexed="true"/>
+
+        <field name="tld" type="string" stored="false" indexed="false"/>
+
     </fields>
     <uniqueKey>id</uniqueKey>
     <defaultSearchField>content</defaultSearchField>
Index: src/plugin/index-more/src/java/org/apache/nutch/indexer/more/MoreIndexingFilter.java
===================================================================
--- src/plugin/index-more/src/java/org/apache/nutch/indexer/more/MoreIndexingFilter.java	(revision 1053817)
+++ src/plugin/index-more/src/java/org/apache/nutch/indexer/more/MoreIndexingFilter.java	(working copy)
@@ -172,7 +172,7 @@
    */
   private NutchDocument addType(NutchDocument doc, WebPage page, String url) {
     MimeType mimeType = null;
-    Utf8 contentType = page.getFromHeaders(new Utf8(HttpHeaders.CONTENT_TYPE));
+    Utf8 contentType = page.getContentType();
     if (contentType == null) {
       // Note by Jerome Charron on 20050415:
       // Content Type not solved by a previous plugin
Index: src/java/org/apache/nutch/indexer/solr/SolrWriter.java
===================================================================
--- src/java/org/apache/nutch/indexer/solr/SolrWriter.java	(revision 1053817)
+++ src/java/org/apache/nutch/indexer/solr/SolrWriter.java	(working copy)
@@ -56,7 +56,7 @@
       for (final String val : e.getValue()) {
         inputDoc.addField(solrMapping.mapKey(e.getKey()), val);
         String sCopy = solrMapping.mapCopyKey(e.getKey());
-        if (sCopy != e.getKey()) {
+        if (! sCopy.equals(e.getKey())) {
           inputDoc.addField(sCopy, val);
         }
       }
Download Solr. To set up the Solr server, copy the example directory from the Solr distribution, and the patched schema.xml configuration file into the solr/conf directory of the Solr app:
cp -r $SOLR_HOME/example solrapp
cp $NUTCH_HOME/conf/schema.xml solrapp/solr/conf/
cd solrapp
java -jar start.jar
This starts the Solr server. Now let's index a few documents, by passing the batch id showing up in the markers column as a parameter to the SolrIndexerJob class.
Here are some excerpts of the logs from the Jetty server to make sure the documents were properly sent:
13-Jan-2011 19:50:47 org.apache.solr.update.processor.LogUpdateProcessor finish
INFO: {add=[com.truveo.www:http/, org.apache.wiki:http/nutch/, com.blogspot.techvineyard:http/]} 0 206
13-Jan-2011 19:50:47 org.apache.solr.core.SolrCore execute
INFO: [] webapp=/solr path=/update params={wt=javabin&version=1} status=0 QTime=206
13-Jan-2011 19:50:47 org.apache.solr.update.DirectUpdateHandler2 commit
INFO: start commit(optimize=false,waitFlush=true,waitSearcher=true,expungeDeletes=false)
13-Jan-2011 19:50:47 org.apache.solr.core.SolrDeletionPolicy onCommit
INFO: SolrDeletionPolicy.onCommit: commits:num=2
	commit{dir=/home/alex/java/perso/nutch/solrapp/solr/data/index,segFN=segments_1,version=1294944630023,generation=1,filenames=[segments_1]
	commit{dir=/home/alex/java/perso/nutch/solrapp/solr/data/index,segFN=segments_2,version=1294944630024,generation=2,filenames=[_0.nrm, _0.tis, _0.fnm, _0.tii, _0.frq, segments_2, _0.fdx, _0.prx, _0.fdt]
13-Jan-2011 19:50:47 org.apache.solr.core.SolrDeletionPolicy updateCommits
INFO: newest commit = 1294944630024
You can now do a search via the API:
$ curl "http://localhost:8983/solr/select/?q=video&indent=on"
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader">
 <int name="status">0</int>
 <int name="QTime">0</int>
 <lst name="params">
  <str name="indent">on</str>
  <str name="q">video</str>
 </lst>
</lst>
<result name="response" numFound="2" start="0">
 <doc>
  <arr name="anchor"><str>Logout</str></arr>
  <float name="boost">1.03571</float>
  <str name="date">20110212</str>
  <str name="digest">5d62587216b50ed7e52987b09dcb9925</str>
  <str name="id">com.truveo.www:http/</str>
  <str name="lang">unknown</str>
  <arr name="tag"><str/></arr>
  <str name="title">Truveo Video Search</str>
  <long name="tstamp">2011-02-12T18:37:53.031Z</long>
  <arr name="type"><str>text/html</str><str>text</str><str>html</str></arr>
  <str name="url">http://www.truveo.com/</str>
 </doc>
 <doc>
  <arr name="anchor"><str>Comments</str></arr>
  <float name="boost">1.00971</float>
  <str name="date">20110212</str>
  <str name="digest">59edefd6f4711895c2127d45b569d8c9</str>
  <str name="id">org.apache.wiki:http/nutch/</str>
  <str name="lang">en</str>
  <arr name="subcollection"><str>nutch</str></arr>
  <arr name="tag"><str/></arr>
  <str name="title">FrontPage - Nutch Wiki</str>
  <long name="tstamp">2011-02-12T18:37:53.863Z</long>
  <arr name="type"><str>text/html</str><str>text</str><str>html</str></arr>
  <str name="url">http://wiki.apache.org/nutch/</str>
 </doc>
</result>
</response>
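The same query can be issued from Java with the SolrJ client that ships with Solr. A minimal sketch (the class name SolrSearch is ours, and it assumes the CommonsHttpSolrServer class from the SolrJ version matching this era of Solr):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class SolrSearch {
  public static void main(String[] args) throws Exception {
    // Same search as the curl call above.
    CommonsHttpSolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
    QueryResponse rsp = server.query(new SolrQuery("video"));
    for (SolrDocument doc : rsp.getResults()) {
      System.out.println(doc.getFieldValue("url") + " : " + doc.getFieldValue("title"));
    }
  }
}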
Crawl Script
To automate the crawl process, we might want to use a Bash script that runs the suite of Nutch commands, and add it as a cron job. Don't forget to initialize the crawl db first with the inject command. We run several iterations of the generate/fetch/parse/update cycle with the for loop, and we limit the number of urls fetched in one iteration with the -topN argument of the generate command.
#!/bin/bash

# Nutch crawl

export NUTCH_HOME=~/java/workspace/Nutch2.0/runtime/local

# depth in the web exploration
n=1
# number of selected urls for fetching
maxUrls=50000
# solr server
solrUrl=http://localhost:8983

for (( i = 1 ; i <= $n ; i++ ))
do

  log=$NUTCH_HOME/logs/log

  # Generate
  $NUTCH_HOME/bin/nutch generate -topN $maxUrls > $log

  batchId=`sed -n 's|.*batch id: \(.*\)|\1|p' < $log`

  # rename log file by appending the batch id
  log2=$log$batchId
  mv $log $log2
  log=$log2

  # Fetch
  $NUTCH_HOME/bin/nutch fetch $batchId >> $log

  # Parse
  $NUTCH_HOME/bin/nutch parse $batchId >> $log

  # Update
  $NUTCH_HOME/bin/nutch updatedb >> $log

  # Index
  $NUTCH_HOME/bin/nutch solrindex $solrUrl $batchId >> $log

done
Conclusion
I managed to fetch 50k urls in one run with these minor changes. With the default values in conf/nutch-default.xml and MySQL as the datastore, these are the log timestamps from running the initialization and one iteration of the generate/fetch/update cycle:
2010-12-13 07:19:26,089 INFO  crawl.InjectorJob - InjectorJob: starting
2010-12-13 07:20:00,077 INFO  crawl.InjectorJob - InjectorJob: finished
2010-12-13 07:20:00,715 INFO  crawl.GeneratorJob - GeneratorJob: starting
2010-12-13 07:20:34,304 INFO  crawl.GeneratorJob - GeneratorJob: done
2010-12-13 07:20:35,041 INFO  fetcher.FetcherJob - FetcherJob: starting
2010-12-13 11:04:00,933 INFO  fetcher.FetcherJob - FetcherJob: done
2010-12-15 01:38:44,262 INFO  crawl.DbUpdaterJob - DbUpdaterJob: starting
2010-12-15 02:15:15,503 INFO  crawl.DbUpdaterJob - DbUpdaterJob: done
The next step is comparing this with a setup backed by an HBase datastore. I tried once but got a memory error that left my HBase server instance unresponsive. See the description of the problem.
Please don't hesitate to comment and share your own feedback, difficulties and results.
Comments
I built Nutch 2.0 using HBase as data storage. After building successfully, I ran the inject command:
bin/nutch inject urls/seeds.txt
I have the following problem:
InjectorJob: org.apache.gora.util.GoraException: java.lang.RuntimeException: java.net.MalformedURLException
at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:110)
at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:93)
at org.apache.nutch.storage.StorageUtils.createWebStore(StorageUtils.java:43)
at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:227)
at org.apache.nutch.crawl.InjectorJob.inject(InjectorJob.java:252)
at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:266)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.crawl.InjectorJob.main(InjectorJob.java:276)
Caused by: java.lang.RuntimeException: java.net.MalformedURLException
at org.apache.gora.hbase.store.HBaseStore.initialize(HBaseStore.java:128)
at org.apache.gora.store.DataStoreFactory.initializeDataStore(DataStoreFactory.java:81)
at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:104)
... 7 more
Caused by: java.net.MalformedURLException
at java.net.URL.<init>(URL.java:617)
at java.net.URL.<init>(URL.java:480)
at java.net.URL.<init>(URL.java:429)
at org.apache.xerces.impl.XMLEntityManager.setupCurrentEntity(Unknown Source)
at org.apache.xerces.impl.XMLVersionDetector.determineDocVersion(Unknown Source)
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
at org.jdom.input.SAXBuilder.build(SAXBuilder.java:489)
at org.jdom.input.SAXBuilder.build(SAXBuilder.java:807)
at org.apache.gora.hbase.store.HBaseStore.readMapping(HBaseStore.java:528)
at org.apache.gora.hbase.store.HBaseStore.initialize(HBaseStore.java:114)
... 9 more
Caused by: java.lang.NullPointerException
at java.net.URL.<init>(URL.java:522)
... 21 more
Please help me.
The Nutch wiki page explains how to use HBase as the Gora backend: http://wiki.apache.org/nutch/GORA_HBase
There is a rather imperfect JUnit test for storage, org.apache.nutch.storage.TestGoraStorage.
Honestly, when I ran it as is, it made my laptop hang because it takes up too many resources.
You might want to apply this patch before running it:
Index: src/test/org/apache/nutch/util/AbstractNutchTest.java
===================================================================
--- src/test/org/apache/nutch/util/AbstractNutchTest.java (revision 1050697)
+++ src/test/org/apache/nutch/util/AbstractNutchTest.java (working copy)
@@ -16,34 +16,23 @@
*/
package org.apache.nutch.util;
-import java.io.IOException;
-import java.nio.ByteBuffer;
-import java.util.ArrayList;
-import java.util.List;
-
import junit.framework.TestCase;
-import org.apache.avro.util.Utf8;
+import org.apache.gora.store.DataStore;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
-import org.apache.nutch.crawl.URLWebPage;
-import org.apache.nutch.storage.Mark;
import org.apache.nutch.storage.StorageUtils;
import org.apache.nutch.storage.WebPage;
-import org.apache.nutch.util.TableUtil;
-import org.apache.gora.query.Query;
-import org.apache.gora.query.Result;
-import org.apache.gora.sql.store.SqlStore;
-import org.apache.gora.store.DataStore;
-import org.apache.gora.store.DataStoreFactory;
-import org.apache.gora.util.ByteUtils;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
/**
* This class provides common routines for setup/teardown of an in-memory data
* store.
*/
public class AbstractNutchTest extends TestCase {
+ protected static final Logger LOG = LoggerFactory.getLogger(AbstractNutchTest.class);
protected Configuration conf;
protected FileSystem fs;
@@ -55,16 +44,12 @@
public void setUp() throws Exception {
super.setUp();
conf = CrawlTestUtil.createConfiguration();
- conf.set("storage.data.store.class", "org.gora.sql.store.SqlStore");
fs = FileSystem.get(conf);
- // using hsqldb in memory
- DataStoreFactory.properties.setProperty("gora.sqlstore.jdbc.driver","org.hsqldb.jdbcDriver");
- // use separate in-memory db-s for tests
- DataStoreFactory.properties.setProperty("gora.sqlstore.jdbc.url","jdbc:hsqldb:mem:" + getClass().getName());
- DataStoreFactory.properties.setProperty("gora.sqlstore.jdbc.user","sa");
- DataStoreFactory.properties.setProperty("gora.sqlstore.jdbc.password","");
webPageStore = StorageUtils.createWebStore(conf, String.class,
WebPage.class);
+
+ // empty the datastore
+ webPageStore.deleteByQuery(webPageStore.newQuery());
}
Good luck.
Hi Alexis,
How can I execute a query string to search data with Nutch 2.0 using HBase? In Nutch 1.0/1.1, I could show the results through the web UI.
I had a mistake in my configuration that caused that exception. Thanks to Alexis for the support.
Trang, I have the same exception. Can you tell me what you did to solve it? Thanks!
Really helpful post, thanks.
ReplyDeleteDear Pham,
You might want to use the HBase shell to query the data.
To do full-text search, you need to index the crawled data via the solrindex Nutch command, after parsing it (the crawl command does this automatically). I just added a new section on how to do the indexing part:
http://techvineyard.blogspot.com/2010/12/build-nutch-20.html#solrindex
It shows how to query the Solr API and get XML results back.
I have not tested the webapp yet, on either the Nutch or Solr side.
Dear Alexis,
I downloaded Nutch 2.0, but I found that it is missing some functionality, such as scoring, so I'm wondering: when will Nutch 2.0 be released?
Dear Trang,
See http://search.lucidimagination.com/search/document/a2522f02270e1125/release_planning
that mentions a forthcoming Nutch 2.0 release.
There is still some work to be done though.
Hi all,
I have been trying to integrate the Nutch crawler with a MySQL database.
For this, I'm following the tutorial mentioned in this thread:
http://techvineyard.blogspot.com/2010/12/build-nutch-20.html
When I tried to do the step "right click on ivy/ivy.xml, 'Add Ivy Library ...'" after modifying the mentioned files, I faced problems:
'Ivy resolve job of ivy/ivy.xml in 'nutch'' has encountered a problem.
Impossible to resolve dependencies of org.apache.nutch#${ant.project.name};
Impossible to resolve dependencies of org.apache.nutch#${ant.project.name};working@arjun-ninjas
unresolved dependency: org.apache.gora#gora-core;0.1: not found
unresolved dependency: org.apache.gora#gora-sql;0.1: not found
unresolved dependency: org.restlet.jse#org.restlet;2.0.0: not found
unresolved dependency: org.restlet.jse#org.restlet.ext.jackson;2.0.0: not found
unresolved dependency: org.apache.gora#gora-core;0.1: not found
unresolved dependency: org.apache.gora#gora-sql;0.1: not found
unresolved dependency: org.restlet.jse#org.restlet;2.0.0: not found
unresolved dependency: org.restlet.jse#org.restlet.ext.jackson;2.0.0: not found
unresolved dependency: org.apache.gora#gora-core;0.1: not found
unresolved dependency: org.apache.gora#gora-sql;0.1: not found
unresolved dependency: org.restlet.jse#org.restlet;2.0.0: not found
unresolved dependency: org.restlet.jse#org.restlet.ext.jackson;2.0.0: not found
I am attaching the screenshot.
Could you please help me in this regard?
Thanks and regards,
Ch. Arjun Kumar Reddy,
International Institute of Information Technology – Bangalore (IIITB),
Dear Arjun Kumar,
I think you missed some step in the tutorial thread; maybe you didn't add "src/plugin/protocol-sftp/ivy.xml" as an Ivy library, so it could not download the dependencies.
I had the same problem as you, but after adding "src/plugin/protocol-sftp/ivy.xml" I resolved it.
Good luck.
Hello,
When I try to install Gora I get:
::::::::::::::::::::::::::::::::::::::::::::::
[ivy:resolve] :: UNRESOLVED DEPENDENCIES ::
[ivy:resolve] ::::::::::::::::::::::::::::::::::::::::::::::
[ivy:resolve] :: com.sun.jersey#jersey-core;1.4: not found
[ivy:resolve] :: com.sun.jersey#jersey-json;1.4: not found
[ivy:resolve] :: com.sun.jersey#jersey-server;1.4: not found
Thanks.
Alex.
Dear Alex,
I moved the Gora description part out of this page to a new blog post. See http://techvineyard.blogspot.com/2011/02/gora-orm-framework-for-hadoop-jobs.html#Gora_development_in_Eclipse
I hope this will help.
Alexis, Thank you very much for your article.
Did you have a problem with the primary key in MySQL, namely the allowed length (255)?
My baseUrl is very long, > 255 bytes.
I don't know how I can change the unique id to md5(baseUrl) and then, for a new crawl cycle, get unReversedUrl() from a custom column.
Sorry for my bad language.
Nice post, thanks.
ReplyDeleteI tried to complete this tutorial but failed. A more detailed describtion can be found here (http://lucene.472066.n3.nabble.com/TestFetcher-hangs-td3091057.html).
ReplyDeleteIf you have any ideas, how to solve this, I would like to hear of them.
Hi Alexis,
Just getting back into this again...
I found that I could not compile Gora (SVN, downloaded 2011-12-31) with ant (I don't know Ivy; I tried deleting the cache, no luck). Ultimately I was able to compile it with mvn (mvn compile; mvn jar:jar). I needed to do this to put gora-cassandra.jar into the Nutch runtime/local/lib directory.
I am also using cassandra-1.0.6. I had to download the following additional jar files:
- apache-cassandra-thrift-1.0.1.jar and libthrift-0.6.1.jar (both from cassandra's lib).
- hector-core-1.0-1.jar and guava-r09.jar (from hector download).
and put them into the runtime/local/lib directory.
After that I was able to run nutch inject successfully.
Hi Alexis,
I am trying to test Nutch 2.0 with Eclipse, but it is giving a compilation error:
"The method createDataStore(Class, Class, Class, Properties, String) in the type
DataStoreFactory is not applicable for the arguments (Class>, Class, Class, Configuration, String)"
in org.apache.nutch.storage.StorageUtils.createWebStore(StorageUtils.java:69).
I tried to find whether there are other versions where the signatures match, but couldn't find.
Could you please help in this regard?
Thanks.
This is deprecated. The updated steps are here: https://wiki.apache.org/nutch/RunNutchInEclipse
I have this issue with Hadoop-2.5.2 and Nutch-2.3.1 and hbase x.x.x.98-hadoop2:
ReplyDeleteInjectorJob: starting at 2016-09-16 17:21:21
InjectorJob: Injecting urlDir: urls
InjectorJob: Using class org.apache.gora.hbase.store.HBaseStore as the Gora storage class.
Error: org.apache.gora.util.GoraException: java.lang.RuntimeException: java.net.MalformedURLException
at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:167)
at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:118)
at org.apache.gora.mapreduce.GoraOutputFormat.getRecordWriter(GoraOutputFormat.java:88)
at org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.<init>(MapTask.java:624)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:744)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
at java.security.AccessController.doPrivileged(AccessController.java:686)
at javax.security.auth.Subject.doAs(Subject.java:569)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
Caused by: java.lang.RuntimeException: java.net.MalformedURLException
at org.apache.gora.hbase.store.HBaseStore.initialize(HBaseStore.java:132)
at org.apache.gora.store.DataStoreFactory.initializeDataStore(DataStoreFactory.java:102)
at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:161)
... 10 more
Caused by: java.net.MalformedURLException
at java.net.URL.<init>(URL.java:639)
at java.net.URL.<init>(URL.java:502)
at java.net.URL.<init>(URL.java:451)
at org.apache.xerces.impl.XMLEntityManager.setupCurrentEntity(Unknown Source)
at org.apache.xerces.impl.XMLVersionDetector.determineDocVersion(Unknown Source)
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
at org.jdom.input.SAXBuilder.build(SAXBuilder.java:518)
at org.jdom.input.SAXBuilder.build(SAXBuilder.java:865)
at org.apache.gora.hbase.store.HBaseStore.readMapping(HBaseStore.java:719)
at org.apache.gora.hbase.store.HBaseStore.initialize(HBaseStore.java:116)
... 12 more
Caused by: java.lang.NullPointerException
at java.net.URL.<init>(URL.java:544)
... 25 more
Does anyone know what that is?
Thx in advance
Wilson