Sunday, February 20, 2011

Increase your Swap partition

Build error with Mahout


Interestingly enough, with RAM and swap sizes as low as 1 GB, you are likely to run into exactly the same issue while building the Mahout project from source as the one described in this previous post, HBase memory issue.

$ svn co http://svn.apache.org/repos/asf/mahout/trunk mahout
$ cd mahout
$ mvn

In core/target/surefire-reports/org.apache.mahout.cf.taste.hadoop.item.RecommenderJobTest.txt, I found the same error:


-------------------------------------------------------------------------------
Test set: org.apache.mahout.cf.taste.hadoop.item.RecommenderJobTest
-------------------------------------------------------------------------------
Tests run: 21, Failures: 0, Errors: 2, Skipped: 0, Time elapsed: 33.882 sec <<< FAILURE!
testCompleteJobBoolean(org.apache.mahout.cf.taste.hadoop.item.RecommenderJobTest)  Time elapsed: 15.717 sec  <<< ERROR!
java.io.IOException: Cannot run program "chmod": java.io.IOException: error=12, Cannot allocate memory
        at java.lang.ProcessBuilder.start(ProcessBuilder.java:460)
        at org.apache.hadoop.util.Shell.runCommand(Shell.java:149)
        at org.apache.hadoop.util.Shell.run(Shell.java:134)
        at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:286)
        at org.apache.hadoop.util.Shell.execCommand(Shell.java:354)
        at org.apache.hadoop.util.Shell.execCommand(Shell.java:337)
        at org.apache.hadoop.fs.RawLocalFileSystem.execCommand(RawLocalFileSystem.java:481)
        at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:473)
        at org.apache.hadoop.fs.FilterFileSystem.setPermission(FilterFileSystem.java:280)
        at org.apache.hadoop.fs.FileSystem.mkdirs(FileSystem.java:266)
        at org.apache.hadoop.mapred.JobClient.configureCommandLineOptions(JobClient.java:573)
        at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:761)
        at org.apache.hadoop.mapreduce.Job.submit(Job.java:432)
        at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:447)
        at org.apache.mahout.cf.taste.hadoop.item.RecommenderJob.run(RecommenderJob.java:234)



As mentioned here, the fix is to increase the size of your swap partition. The swap partition was not big enough for the system to fork the Java process while keeping the memory pages required by the other running applications in place.
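You can first check how much RAM and swap the machine currently has; these are standard Linux commands, nothing specific to this setup:

$ free -m       # RAM and swap totals, in MB
$ swapon -s     # swap devices/files currently in use and their sizes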

Increase your Swap partition
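The quickest way to get more swap is a swap file. Here is a minimal sketch, assuming about 1 GB of free disk space and root access; the file name and size are arbitrary:

# dd if=/dev/zero of=/swapfile bs=1M count=1024    # create a 1 GB file filled with zeros
# chmod 600 /swapfile                              # keep the swap file readable by root only
# mkswap /swapfile                                 # format the file as swap space
# swapon /swapfile                                 # enable it immediately
# echo '/swapfile none swap sw 0 0' >> /etc/fstab  # optional: enable it at every boot

The rest of this section takes the cleaner route of resizing the swap partition itself.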



To get more swap, you can create a swap file (a quick sketch is shown above). A cleaner way is to change your partition table: shrink one partition to free some space, then grow the swap one. Make sure you know what you are doing here: read the Partition guide carefully and back up your personal files before changing anything. In my case I got lucky, as I was able to shrink an unimportant primary partition (/dev/sda2) and give the extra space to the swap partition (the logical partition /dev/sda8, which lives inside the extended partition /dev/sda1). The old table had as much swap as RAM, 1 GB; the new one has twice as much as RAM, 2 GB. This was the old partition table:


# fdisk /dev/sda

WARNING: DOS-compatible mode is deprecated. It's strongly recommended to
         switch off the mode (command 'c') and change display units to
         sectors (command 'u').

Command (m for help): p

Disk /dev/sda: 58.5 GB, 58506416640 bytes
255 heads, 63 sectors/track, 7113 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0xcfce28df

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1               1        3649    29310561    5  Extended
/dev/sda2   *        3650        7112    27816547+   7  HPFS/NTFS
/dev/sda5               1         973     7815559+  83  Linux
/dev/sda6             974        1095      979933+  83  Linux
/dev/sda7            1096        3527    19535008+  83  Linux
/dev/sda8            3528        3649      979933+  82  Linux swap / Solaris


/dev/sda8 was assigned cylinders 3528 to 3649, roughly 121 cylinders, so its size is about 121 * 8225280 / 1024 = 971,932 kB (~1 GB).


To perform the modification, I followed these steps:

  1. Count the number of additional cylinders required for the swap partition (121).
  2. Run fdisk /dev/sda.
  3. Shrink the primary partition /dev/sda2: delete it (d), add it again (n) with fewer cylinders (the starting cylinder is now 3771 instead of 3650), and write the partition table to disk (w).
  4. Run kpartx /dev/sda.
  5. Expand the extended partition /dev/sda1: (d) then (n), with more cylinders (the ending cylinder is 3770 instead of 3649). Recreate the logical partitions /dev/sda5, /dev/sda6 and /dev/sda7 ((n) three times with the same start/end cylinders as before), and finally /dev/sda8 with more cylinders (the ending cylinder is now 3770 instead of 3649). Assign system id 82 to /dev/sda8 to tag it as "Linux swap / Solaris" (t). Then (w).
  6. Run kpartx /dev/sda again.
  7. Reformat the modified partitions:

Format the /dev/sda2 partition with an ext3 filesystem. You will lose everything on it.

# mke2fs -j /dev/sda2

Format /dev/sda8 as swap:

# swapoff /dev/sda8
# mkswap -f /dev/sda8
# swapon /dev/sda8
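One caveat: mkswap assigns a new UUID to the partition, so if /etc/fstab references the swap space by UUID rather than by device name (an assumption about your setup, not something covered in the steps above), the entry needs to be updated:

# blkid /dev/sda8     # print the new UUID of the swap partition
# vi /etc/fstab       # update the swap entry if it uses UUID=... instead of /dev/sda8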


This is the new partition table:


   Device Boot      Start         End      Blocks   Id  System
/dev/sda1               1        3770    30282493+   5  Extended
/dev/sda2            3771        7113    26852647+  83  Linux
/dev/sda5               1         973     7815559+  83  Linux
/dev/sda6             974        1095      979933+  83  Linux
/dev/sda7            1096        3527    19535008+  83  Linux
/dev/sda8            3528        3770     1951866   82  Linux swap / Solaris




With the top command, you should see the new total amount of swap available.


top - 09:45:16 up  1:48,  6 users,  load average: 0.01, 0.04, 0.01
Tasks: 139 total,   1 running, 138 sleeping,   0 stopped,   0 zombie
Cpu(s):  1.0%us,  1.0%sy,  0.0%ni, 96.8%id,  0.0%wa,  0.7%hi,  0.5%si,  0.0%st
Mem:   1026468k total,   668960k used,   357508k free,    22036k buffers
Swap:  1951856k total,   168876k used,  1782980k free,   438632k cached


Conclusion


Run the build again in the Mahout directory with the mvn command. It should now succeed.
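A minimal sketch of that final run; the -Dtest and -pl flags for re-running only the module that failed earlier are standard Maven/Surefire options, not something from the original build instructions:

$ cd mahout
$ mvn clean install                            # full build, running all the tests
$ mvn test -Dtest=RecommenderJobTest -pl core  # optionally, re-run just the test that failed before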

Sunday, February 6, 2011

Gora, an ORM framework for Hadoop jobs

Table of Contents
Introduction
Background
Gora development in Eclipse
I/O Frequency
Cassandra in Gora
  gora-cassandra module
  Avro schema
References



Introduction


Lately I have been focusing on rewriting the whole Cassandra stack in Gora, for GORA-22. It was hinted that it needed to be revamped due to some concurrency issues. I tried to port the old backend by updating the API calls to make it compatible with Cassandra 0.8, but I could not make it work, so I just rewrote the module from scratch. Instead of rewriting the entire Thrift layer, we now delegate Cassandra read/write operations to Hector, the first Cassandra client listed on the official wiki.

Background


Here is some Gora background, from what I understand.

Nutch performs a web crawl by running iterations of the generate/fetch/parse/updatedb steps, implemented as Hadoop jobs. To access the data, it relies on Gora, which is actually an ORM framework (Object-Relational Mapping, a bit like ActiveRecord in Rails), instead of manipulating segments directly as it did previously.

The gora-mapreduce module intends to abstract away data access within Map/Reduce. It replaces the data storage that is usually done through HDFS files (hadoop-hdfs). Instead, you get the ability to query your data from a database, either row-oriented (an RDBMS such as MySQL or HSQL) or column-oriented (NoSQL stores such as HBase or Cassandra).

This of course has an impact on performance, adding network overhead when the mappers need to connect to a centralized remote server instead of reading distributed files from the cluster. It also loses a few intrinsic HDFS features, such as data recovery through replication (on connection failures) or speedups from network topology analysis. I'm not fully aware of the implications of using Gora here, so please don't hesitate to share your own impressions.


Gora development in Eclipse


Set up a Gora project the same way as Nutch.
To resolve dependencies, we can use either the Maven m2eclipse plugin or the IvyDE plugin.


If you want to use IvyDE, you should replace every occurrence of ${project.dir} with ${basedir}/trunk (assuming trunk is the local directory containing the Gora checkout), as well as comment out the gora-core dependency in gora-cassandra/ivy/ivy.xml, to make it work in Eclipse (a scripted version of the substitution is sketched just below). I have no idea how to load the project.dir property in the IvyDE Eclipse plugin.
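If you would rather script that substitution than edit each ivy.xml by hand, here is a rough sketch, assuming GNU sed and that the command is run from the directory that contains trunk:

$ grep -rl --include=ivy.xml 'project.dir' trunk | xargs sed -i 's|${project.dir}|${basedir}/trunk|g'   # rewrite the ivy includes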

Index: gora-core/ivy/ivy.xml
===================================================================
--- gora-core/ivy/ivy.xml       (revision 1149427)
+++ gora-core/ivy/ivy.xml       (working copy)
@@ -23,7 +23,7 @@
       status="integration"/>
 
   <configurations>
-    <include file="${project.dir}/ivy/ivy-configurations.xml"/>
+    <include file="${basedir}/trunk/ivy/ivy-configurations.xml"/>
   </configurations>
 
   <publications defaultconf="compile">
@@ -44,10 +44,11 @@
       <exclude org="org.eclipse.jdt" name="core"/>
       <exclude org="org.mortbay.jetty" name="jsp-*"/>
     </dependency>
+    
     <dependency org="org.apache.hadoop" name="avro" rev="1.3.2" conf="*->default">
       <exclude org="ant" name="ant"/>
     </dependency>
-
+    
     <!-- test dependencies -->
     <dependency org="org.apache.hadoop" name="hadoop-test" rev="0.20.2" conf="test->master"/>
     <dependency org="org.slf4j" name="slf4j-simple" rev="1.5.8" conf="test -> *,!sources,!javadoc"/>
Index: gora-sql/ivy/ivy.xml
===================================================================
--- gora-sql/ivy/ivy.xml        (revision 1149427)
+++ gora-sql/ivy/ivy.xml        (working copy)
@@ -23,7 +23,7 @@
       status="integration"/>
 
   <configurations>
-    <include file="${project.dir}/ivy/ivy-configurations.xml"/>
+    <include file="${basedir}/trunk/ivy/ivy-configurations.xml"/>
   </configurations>
   
   <publications>
@@ -33,12 +33,13 @@
 
   <dependencies>
     <!-- conf="*->@" means every conf is mapped to the conf of the same name of the artifact-->
-    <dependency org="org.apache.gora" name="gora-core" rev="latest.integration" changing="true" conf="*->@"/> 
+    <dependency org="org.apache.gora" name="gora-core" rev="latest.integration" changing="true" conf="*->@"/>
     <dependency org="org.jdom" name="jdom" rev="1.1" conf="*->master"/>
     <dependency org="com.healthmarketscience.sqlbuilder" name="sqlbuilder" rev="2.0.6" conf="*->default"/>
 
     <!-- test dependencies -->
     <dependency org="org.hsqldb" name="hsqldb" rev="2.0.0" conf="test->default"/>
+    <dependency org="mysql" name="mysql-connector-java" rev="5.1.13" conf="*->default"/>
 
   </dependencies>
     
Index: gora-cassandra/ivy/ivy.xml
===================================================================
--- gora-cassandra/ivy/ivy.xml  (revision 1149427)
+++ gora-cassandra/ivy/ivy.xml  (working copy)
@@ -24,7 +24,7 @@
       status="integration"/>
       
   <configurations>
-    <include file="${project.dir}/ivy/ivy-configurations.xml"/>
+    <include file="${basedir}/trunk/ivy/ivy-configurations.xml"/>
   </configurations>
   
   <publications>
@@ -35,9 +35,9 @@
   
   <dependencies>
     <!-- conf="*->@" means every conf is mapped to the conf of the same name of the artifact-->
-    
+    <!--
     <dependency org="org.apache.gora" name="gora-core" rev="latest.integration" changing="true" conf="*->@"/>
-    
+    -->
     <dependency org="org.jdom" name="jdom" rev="1.1">
        <exclude org="xerces" name="xercesImpl"/>
     </dependency>


This is what the Gora project looks like:




In the Nutch project, we want to comment out the gora-core, gora-hbase, gora-cassandra, or gora-sql dependencies in Ivy, since they are already provided by the Gora project included in the Nutch Java Build Path. We then add Gora as a project dependency in the Java Build Path:




That is how we can update the Gora code and test it on the fly by running the Nutch classes.

I/O Frequency


By default, the records shuffled in the Hadoop job are buffered in memory. At some point, you want to flush the buffer and write the records to the actual database. Similarly, you cannot load the entire content of the database before starting the mappers, so you need to limit the number of records you read at a time.


People who write Hadoop jobs might be familiar with the mapred-site.xml configuration file, even if they do not necessarily use Gora together with Nutch. There you can override the default number of rows fetched per select query for read operations, and the default number of rows buffered in memory before they are flushed for write operations. The default is 10000.

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
 <property>
  <name>gora.buffer.read.limit</name>
  <value>500</value>
 </property> 
 <property>
  <name>gora.buffer.write.limit</name>
  <value>100</value>
 </property>
</configuration>



Cassandra in Gora


A Gora Hadoop job just fetches or emits key/value pairs to process the data. The task performed by the gora-cassandra module is to take the values, which arrive as specific instances of Avro records, and store them in Cassandra. So this subproject is just a thin layer between the Gora Hadoop job code and the Cassandra client, which itself interacts with the Cassandra server. That's it.

gora-cassandra module


Here are the main changes/improvements:

  • Compatibility with Cassandra 0.8
  • Hector is used as the Cassandra client
  • Concurrency now relies on Hector
  • Not all features have been implemented yet (delete queries, for example)

Avro schema


The object serialization is dictated by an Avro schema. An example can be found in $NUTCH_HOME/src/gora/webpage.avsc:


{"name": "WebPage",
 "type": "record",
 "namespace": "org.apache.nutch.storage",
 "fields": [
        {"name": "baseUrl", "type": "string"}, 
        {"name": "status", "type": "int"},
        {"name": "fetchTime", "type": "long"},
        {"name": "prevFetchTime", "type": "long"},
        {"name": "fetchInterval", "type": "int"},
        {"name": "retriesSinceFetch", "type": "int"},
        {"name": "modifiedTime", "type": "long"},
        {"name": "protocolStatus", "type": {
            "name": "ProtocolStatus",
            "type": "record",
            "namespace": "org.apache.nutch.storage",
            "fields": [
                {"name": "code", "type": "int"},
                {"name": "args", "type": {"type": "array", "items": "string"}},
                {"name": "lastModified", "type": "long"}
            ]
            }},
        {"name": "content", "type": "bytes"},
        {"name": "contentType", "type": "string"},
        {"name": "prevSignature", "type": "bytes"},
        {"name": "signature", "type": "bytes"},
        {"name": "title", "type": "string"},
        {"name": "text", "type": "string"},
        {"name": "parseStatus", "type": {
            "name": "ParseStatus",
            "type": "record",
            "namespace": "org.apache.nutch.storage",
            "fields": [
                {"name": "majorCode", "type": "int"},
                {"name": "minorCode", "type": "int"},
                {"name": "args", "type": {"type": "array", "items": "string"}}
            ]
            }},
        {"name": "score", "type": "float"},
        {"name": "reprUrl", "type": "string"},
        {"name": "headers", "type": {"type": "map", "values": "string"}},
        {"name": "outlinks", "type": {"type": "map", "values": "string"}},
        {"name": "inlinks", "type": {"type": "map", "values": "string"}},
        {"name": "markers", "type": {"type": "map", "values": "string"}},
        {"name": "metadata", "type": {"type": "map", "values": "bytes"}}
   ]
}


The schema is hardcoded in the Nutch class org.apache.nutch.storage.WebPage, a POJO (Plain Old Java Object) that contains the logic used for crawling a web page. It extends org.apache.gora.persistency.impl.PersistentBase. gora-cassandra treats the PersistentBase class as the base class for RECORD fields, and org.apache.gora.persistency.StatefulHashMap as the base class for MAP fields.

The two complex types, MAP and RECORD, are represented in Cassandra by "super columns", which are maps, i.e. sets of key/value pairs. The ARRAY type is represented by a simple column, as a comma-separated list bounded by square brackets, like "[one, two, three]". Not the best.



References


Tuesday, January 18, 2011

Trying Nutch 2.0 HBase storage

Introduction



It is likely you will run into issues when using HBase as the datastore for Nutch, especially on commodity hardware with very limited memory. This article follows up on two posts:




which explain how to set up Nutch 2.0 with HBase.


Memory Issue


In my case, HBase is running on a laptop that only has 1 GB of memory. This is too little. When Java forks a process, the parent process's memory pages have to be duplicated for the child, hence requiring twice as much memory as the JVM was using before the fork.
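A workaround often suggested for this fork-time error=12, besides adding swap as the conclusion below recommends, is to relax the kernel's memory overcommit policy; this is a general Linux knob, not something I tried here:

$ cat /proc/sys/vm/overcommit_memory   # 0 = heuristic (default), 1 = always allow, 2 = strict accounting
# sysctl -w vm.overcommit_memory=1     # let fork() succeed without reserving a full copy of the heap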


You can see the JVM double its Java heap space usage at around 5 PM, which crashes HBase. Keep reading to see the errors in the corresponding logs.


HBase error


First let's take a look at the data Nutch created.

$ bin/hbase shell
HBase Shell; enter 'help' for list of supported commands.
Version: 0.20.6, r965666, Mon Jul 19 16:54:48 PDT 2010
hbase(main):001:0> list        
webpage                                                                                                       
1 row(s) in 0.1480 seconds
hbase(main):002:0> describe "webpage"
{
 NAME => 'webpage',
 FAMILIES => [
  {NAME => 'f', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
  {NAME => 'h', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
  {NAME => 'il', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
  {NAME => 'mk', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
  {NAME => 'mtdt', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
  {NAME => 'ol', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
  {NAME => 'p', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
  {NAME => 's', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}
 ]
}
hbase(main):003:0> scan "webpage", { LIMIT => 1 }
ROW                          COLUMN+CELL                                                                      
 com.richkidzradio:http/     column=f:bas, timestamp=1295012635817, value=http://richkidzradio.com/
...


I had issues with the updatedb command on 200k rows after parsing around 20k rows. When the fork happened, the logs in $HBASE_HOME/logs/hbase-alex-master-maison.log show:


2011-01-18 16:59:16,685 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Flush requested on webpage,com.richkidzradio:http/,1295020425887
2011-01-18 16:59:16,685 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Started memstore flush for region webpage,com.richkidzradio:http/,1295020425887. Current region memstore size 64.0m
2011-01-18 16:59:16,686 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Finished snapshotting, commencing flushing stores
2011-01-18 16:59:16,728 FATAL org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Replay of hlog required. Forcing server shutdown
org.apache.hadoop.hbase.DroppedSnapshotException: region: webpage,com.richkidzradio:http/,1295020425887
        at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:1041)
        at org.apache.hadoop.hbase.regionserver.HRegion.flushcache(HRegion.java:896)
        at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:258)
        at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:231)
        at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.run(MemStoreFlusher.java:154)
Caused by: java.io.IOException: Cannot run program "chmod": java.io.IOException: error=12, Cannot allocate memory
        at java.lang.ProcessBuilder.start(ProcessBuilder.java:460)
        at org.apache.hadoop.util.Shell.runCommand(Shell.java:149)
        at org.apache.hadoop.util.Shell.run(Shell.java:134)
        at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:286)
        at org.apache.hadoop.util.Shell.execCommand(Shell.java:354)
        at org.apache.hadoop.util.Shell.execCommand(Shell.java:337)
        at org.apache.hadoop.fs.RawLocalFileSystem.execCommand(RawLocalFileSystem.java:481)
        at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:473)
        at org.apache.hadoop.fs.FilterFileSystem.setPermission(FilterFileSystem.java:280)
        at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:372)
        at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:484)
        at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:465)
        at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:372)
        at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:364)
        at org.apache.hadoop.hbase.io.hfile.HFile$Writer.<init>(HFile.java:296)
        at org.apache.hadoop.hbase.regionserver.StoreFile.getWriter(StoreFile.java:393)
        at org.apache.hadoop.hbase.regionserver.Store.getWriter(Store.java:585)
        at org.apache.hadoop.hbase.regionserver.Store.getWriter(Store.java:576)
        at org.apache.hadoop.hbase.regionserver.Store.internalFlushCache(Store.java:540)
        at org.apache.hadoop.hbase.regionserver.Store.flushCache(Store.java:516)
        at org.apache.hadoop.hbase.regionserver.Store.access$100(Store.java:88)
        at org.apache.hadoop.hbase.regionserver.Store$StoreFlusherImpl.flushCache(Store.java:1597)
        at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:1000)
        ... 4 more
Caused by: java.io.IOException: java.io.IOException: error=12, Cannot allocate memory
        at java.lang.UNIXProcess.<init>(UNIXProcess.java:148)
        at java.lang.ProcessImpl.start(ProcessImpl.java:65)
        at java.lang.ProcessBuilder.start(ProcessBuilder.java:453)
        ... 26 more
2011-01-18 16:59:16,730 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Flush requested on webpage,com.richkidzradio:http/,1295020425887
2011-01-18 16:59:16,758 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 11 on 34511, call put([B@24297, [Lorg.apache.hadoop.hbase.client.Put;@61f9c6) from 0:0:0:0:0:0:0:1:38369: error: java.io.IOException: Server not running, aborting
java.io.IOException: Server not running, aborting
        at org.apache.hadoop.hbase.regionserver.HRegionServer.checkOpen(HRegionServer.java:2307)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.put(HRegionServer.java:1773)
        at sun.reflect.GeneratedMethodAccessor46.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:657)
        at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)
2011-01-18 16:59:16,759 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Dump of metrics: request=0.0, regions=7, stores=43, storefiles=42, storefileIndexSize=0, memstoreSize=129, compactionQueueSize=0, usedHeap=317, maxHeap=996, blockCacheSize=175238616, blockCacheFree=33808120, blockCacheCount=2196, blockCacheHitRatio=88, fsReadLatency=0, fsWriteLatency=0, fsSyncLatency=0
2011-01-18 16:59:16,759 INFO org.apache.hadoop.hbase.regionserver.MemStoreFlusher: RegionServer:0.cacheFlusher exiting
2011-01-18 16:59:16,944 INFO org.apache.hadoop.ipc.HBaseServer: Stopping server on 34511


The exception occurred while HBase was trying to update the first row of the batch. The class that forks the process is org.apache.hadoop.fs.RawLocalFileSystem. It is actually a Hadoop-related issue, reported in HADOOP-5059.
Here are the versions being used:

  • HBase 0.20.6
  • lib/hadoop-0.20.2-core.jar


Recover HBase


Running the Nutch updatedb command again, you might see an error in the Nutch logs:

org.apache.hadoop.hbase.client.RetriesExhaustedException: Trying to contact region server Some server, retryOnlyOne=true, index=0, islastrow=true, tries=9, numtries=10, i=0, listsize=1, region=webpage,com.richkidzradio:http/,1295020425887 for region webpage,com.richkidzradio:http/,1295020425887, row 'hf:http/', but failed after 10 attempts.
Exceptions:
        at org.apache.hadoop.hbase.client.HConnectionManager$TableServers$Batch.process(HConnectionManager.java:1157)
        at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.processBatchOfRows(HConnectionManager.java:1238)
        at org.apache.hadoop.hbase.client.HTable.flushCommits(HTable.java:666)
        at org.apache.hadoop.hbase.client.HTable.put(HTable.java:510)
        at org.apache.gora.hbase.store.HBaseStore.put(HBaseStore.java:245)
        at org.apache.gora.mapreduce.GoraRecordWriter.write(GoraRecordWriter.java:70)
        at org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.write(ReduceTask.java:508)
        at org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
        at org.apache.nutch.crawl.DbUpdateReducer.reduce(DbUpdateReducer.java:164)
        at org.apache.nutch.crawl.DbUpdateReducer.reduce(DbUpdateReducer.java:23)
        at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
        at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:566)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:408)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:216)

and in the HBase logs

2011-01-18 15:56:11,195 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer: Batch puts interrupted at index=0 because:Requested row out of range for HRegion webpage,com.richkidzradio:http/,1295020425887, startKey='com.richkidzradio:http/', getEndKey()='es.plus.www:http/', row='hf:http/'


As mentioned here, you might have "holes" in the table. To fix the problem, pass the webpage table directory to the add_table.rb script:

$ cd $HBASE_HOME/bin
$ stop-hbase.sh
$ start-hbase.sh
$ ./hbase org.jruby.Main add_table.rb /home/alex/hbase/hbase-alex/hbase/webpage


Conclusion


To fix this issue, check out this newer post, Increase your Swap partition. With the current versions of Hadoop and HBase, very limited RAM, and insufficient swap, you will not get very far, due to these wasteful fork operations. Maybe with the upcoming Hadoop 0.22 and HBase 0.90 releases these issues will be fixed. I guess it's now time to give Cassandra a try...

Thursday, January 6, 2011

SSH Setup For HBase





SSH passwordless login
 Login with password
 Login with passphrase
 Login without password
HBase
 Configure HBase
 Start HBase
References




HBase can be used as a datastore for Nutch 2.0.

This is a tutorial to get started quickly with a standalone instance of HBase. I did not find the original guide so straightforward, hence this blog entry.



SSH passwordless login


We are going to set up the SSH environment for HBase. Personally, I just use a standalone instance of HBase on a single machine, so the server and the client are the same. I use Debian as the Linux OS.

$ whoami
alex

First, when trying to log in with no SSH daemon running, I get this error:

$ ssh alex@localhost
ssh: connect to host localhost port 22: Connection refused

You want to set up an SSH server. I installed the Debian package:
# apt-get install openssh-server

This starts it automatically. In case you need it later, this is how to start it manually:

$ sudo /etc/init.d/ssh start
Starting OpenBSD Secure Shell server: sshd.

Login with password


Now you can log in to the server using your regular Linux user password:

$ ssh alex@localhost
The authenticity of host 'localhost (127.0.0.1)' can't be established.
RSA key fingerprint is b6:96:06:e1:fb:f1:9f:23:40:32:ac:cb:ac:c9:bc:12.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'localhost' (RSA) to the list of known hosts.
alex@localhost's password:  
Linux maison 2.6.32.24 #1 SMP Tue Oct 5 08:36:28 CEST 2010 i686

The programs included with the Debian GNU/Linux system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.

Debian GNU/Linux comes with ABSOLUTELY NO WARRANTY, to the extent
permitted by applicable law.
You have mail.
Last login: Sun Jan  2 13:32:46 2011 from localhost
$ exit


On the client, you want to generate an SSH key:

$ ssh-keygen -t rsa
Generating public/private rsa key pair.
Enter file in which to save the key (/home/alex/.ssh/id_rsa):    
Enter passphrase (empty for no passphrase): 
Enter same passphrase again: 
Your identification has been saved in /home/alex/.ssh/id_rsa.
Your public key has been saved in /home/alex/.ssh/id_rsa.pub.
The key fingerprint is:
47:79:b7:e8:e4:25:1b:8d:f6:0a:86:36:67:fa:1d:94 alex@maison
The key's randomart image is:
+--[ RSA 2048]----+
|                 |
|           .     |
|          o . .  |
|         . . * . |
|        S . E +  |
|         o * *   |
|        + = = .  |
|       . * o o   |
|        ... o    |
+-----------------+
$ chmod 755 ~/.ssh


Copy the public key to the SSH server.


$ scp ~/.ssh/id_rsa.pub alex@localhost:.ssh/authorized_keys
alex@localhost's password: 
id_rsa.pub                                                                                                                                       100%  393     0.4KB/s   00:00


Now on the server:

$ su alex
$ chmod 600 ~/.ssh/authorized_keys
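As an alternative to the scp and chmod steps above, most OpenSSH installations ship an ssh-copy-id helper that appends the key and fixes permissions in one go (assuming it is available on the client):

$ ssh-copy-id -i ~/.ssh/id_rsa.pub alex@localhost   # append the public key to ~/.ssh/authorized_keys on the server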



Login with passphrase


From the client, log in to the server with your passphrase:

$ ssh alex@localhost
Enter passphrase for key '/home/alex/.ssh/id_rsa': 
Linux maison 2.6.32.24 #1 SMP Tue Oct 5 08:36:28 CEST 2010 i686

The programs included with the Debian GNU/Linux system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.

Debian GNU/Linux comes with ABSOLUTELY NO WARRANTY, to the extent
permitted by applicable law.
You have mail.
Last login: Wed Dec 29 15:57:24 2010 from localhost
$ exit

On the client, start an SSH agent to avoid typing your passphrase in the future:

$ exec /usr/bin/ssh-agent $SHELL
$ ssh-add
Enter passphrase for /home/alex/.ssh/id_rsa: 
Identity added: /home/alex/.ssh/id_rsa (/home/alex/.ssh/id_rsa)
$ nohup ssh-agent -s > ~/.ssh-agent
nohup: ignoring input and redirecting stderr to stdout
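To reuse that saved agent from a new shell, source the file it wrote; ssh-agent -s emits Bourne-shell syntax, so this is a sketch for bash or sh:

$ . ~/.ssh-agent     # exports SSH_AUTH_SOCK and SSH_AGENT_PID for this shell
$ ssh-add -l         # list loaded identities; run ssh-add again here if none are listed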

Login without password


Now log in to the server. It should be passwordless.

$ ssh alex@localhost
Linux maison 2.6.32.24 #1 SMP Tue Oct 5 08:36:28 CEST 2010 i686

The programs included with the Debian GNU/Linux system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.

Debian GNU/Linux comes with ABSOLUTELY NO WARRANTY, to the extent
permitted by applicable law.
You have mail.
Last login: Sun Jan  2 13:41:59 2011 from localhost
$ exit


HBase


Now that we have set up SSH, we can configure HBase.

Configure HBase


Change the HBase properties in conf/hbase-site.xml according to your needs. For example, you can change the directory where the data is stored.

<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>file:///home/alex/hbase/hbase-${user.name}/hbase</value>
  </property>
  <property>
    <name>hbase.tmp.dir</name>
    <value>/home/alex/hbase/hbase-${user.name}</value>
  </property>
</configuration>


Start HBase


$ bin/start-hbase.sh 
localhost: starting zookeeper, logging to /home/alex/java/ext/hbase-0.20.6/bin/../logs/hbase-alex-zookeeper-maison.out
starting master, logging to /home/alex/java/ext/hbase-0.20.6/logs/hbase-alex-master-maison.out
localhost: starting regionserver, logging to /home/alex/java/ext/hbase-0.20.6/bin/../logs/hbase-alex-regionserver-maison.out

These 2 Java processes are now running on the machine:

  • org.apache.hadoop.hbase.zookeeper.HQuorumPeer
  • org.apache.hadoop.hbase.master.HMaster
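You can confirm this with jps, the JVM process lister that ships with the JDK:

$ jps -l    # prints the PID and full main class of each running JVM; the HBase daemons should show up here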



References