Table of Contents
Introduction
Background
Gora development in Eclipse
I/O Frequency
Cassandra in Gora
gora-cassandra module
Avro schema
References
Introduction
Lately I have been focusing on rewriting the whole Cassandra stack in Gora, for GORA-22. It was hinted that it needed to be revamped due to some concurrency issues. I first tried to port the old backend by updating the API calls to make it compatible with Cassandra 0.8, but I could not make it work, so I rewrote the module from scratch. Instead of rewriting the entire Thrift layer, we now delegate Cassandra read/write operations to Hector, the first Cassandra client listed on the official wiki.
Background
Here is some Gora background, from what I understand.
Nutch performs a web crawl by running iterations of generate/fetch/parse/updatedb steps, each implemented as a Hadoop job. To access the data, instead of manipulating segments as it did previously, it relies on Gora, which is essentially an ORM framework (Object-Relational Mapping, a bit like ActiveRecord in Rails).
The gora-mapreduce module abstracts away data access within Map/Reduce. It replaces the data storage that is usually done through HDFS files (hadoop-hdfs): instead, you are given the ability to query your data from a database, either row-oriented (RDBMSs such as MySQL or HSQL) or column-oriented (NoSQL stores such as HBase or Cassandra).
This of course has an impact on performance, adding network overhead when the mappers need to connect to a centralized remote server instead of reading distributed files from the cluster. It also gives up a few intrinsic HDFS features, such as data recovery through replication (in case of connection failures) or speedups from network topology awareness. I'm not fully aware of the implications of using Gora here, so please don't hesitate to share your own impressions.
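To make the ORM analogy concrete, here is a rough sketch of what reading and writing a record through a Gora DataStore looks like outside of any Map/Reduce job. I am assuming the 0.1-era DataStoreFactory/DataStore API and Nutch's generated WebPage class (described further down); the factory signature and the key format are the parts most likely to differ in your setup.

import org.apache.avro.util.Utf8;
import org.apache.gora.cassandra.store.CassandraStore;
import org.apache.gora.store.DataStore;
import org.apache.gora.store.DataStoreFactory;
import org.apache.nutch.storage.WebPage;

public class GoraDataStoreSketch {
  public static void main(String[] args) throws Exception {
    // Ask the factory for a Cassandra-backed store; later Gora releases
    // take an extra Hadoop Configuration argument here.
    DataStore<String, WebPage> store =
        DataStoreFactory.createDataStore(CassandraStore.class, String.class, WebPage.class);

    WebPage page = new WebPage();
    page.setBaseUrl(new Utf8("http://www.example.org/"));

    // Nutch keys pages by reversed URL; the key below is just an example.
    store.put("org.example.www:http/", page);
    store.flush();   // pushes the buffered records to the backend

    WebPage fetched = store.get("org.example.www:http/");
    System.out.println(fetched == null ? "miss" : fetched.getBaseUrl());
    store.close();
  }
}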
Gora development in Eclipse
Set up a Gora project the same way as the Nutch one.
To resolve dependencies, we can use either the Maven m2eclipse plugin or the IvyDE plugin.
If you want to use IvyDE, you should replace every occurrence of ${project.dir} with ${basedir}/trunk (assuming trunk is the local directory containing the Gora checkout), as well as comment out the gora-core dependency in gora-cassandra/ivy/ivy.xml, to make it work in Eclipse. I have no idea how to load such a project.dir property in the IvyDE Eclipse plugin.
Index: gora-core/ivy/ivy.xml
===================================================================
--- gora-core/ivy/ivy.xml	(revision 1149427)
+++ gora-core/ivy/ivy.xml	(working copy)
@@ -23,7 +23,7 @@
     status="integration"/>

   <configurations>
-    <include file="${project.dir}/ivy/ivy-configurations.xml"/>
+    <include file="${basedir}/trunk/ivy/ivy-configurations.xml"/>
   </configurations>

   <publications defaultconf="compile">
@@ -44,10 +44,11 @@
       <exclude org="org.eclipse.jdt" name="core"/>
       <exclude org="org.mortbay.jetty" name="jsp-*"/>
     </dependency>
+
     <dependency org="org.apache.hadoop" name="avro" rev="1.3.2" conf="*->default">
       <exclude org="ant" name="ant"/>
     </dependency>
-
+
     <!-- test dependencies -->
     <dependency org="org.apache.hadoop" name="hadoop-test" rev="0.20.2" conf="test->master"/>
     <dependency org="org.slf4j" name="slf4j-simple" rev="1.5.8" conf="test -> *,!sources,!javadoc"/>
Index: gora-sql/ivy/ivy.xml
===================================================================
--- gora-sql/ivy/ivy.xml	(revision 1149427)
+++ gora-sql/ivy/ivy.xml	(working copy)
@@ -23,7 +23,7 @@
     status="integration"/>

   <configurations>
-    <include file="${project.dir}/ivy/ivy-configurations.xml"/>
+    <include file="${basedir}/trunk/ivy/ivy-configurations.xml"/>
   </configurations>

   <publications>
@@ -33,12 +33,13 @@
   <dependencies>
     <!-- conf="*->@" means every conf is mapped to the conf of the same name of the artifact-->
-    <dependency org="org.apache.gora" name="gora-core" rev="latest.integration" changing="true" conf="*->@"/>
+    <dependency org="org.apache.gora" name="gora-core" rev="latest.integration" changing="true" conf="*->@"/>
     <dependency org="org.jdom" name="jdom" rev="1.1" conf="*->master"/>
     <dependency org="com.healthmarketscience.sqlbuilder" name="sqlbuilder" rev="2.0.6" conf="*->default"/>

     <!-- test dependencies -->
     <dependency org="org.hsqldb" name="hsqldb" rev="2.0.0" conf="test->default"/>
+    <dependency org="mysql" name="mysql-connector-java" rev="5.1.13" conf="*->default"/>
   </dependencies>
Index: gora-cassandra/ivy/ivy.xml
===================================================================
--- gora-cassandra/ivy/ivy.xml	(revision 1149427)
+++ gora-cassandra/ivy/ivy.xml	(working copy)
@@ -24,7 +24,7 @@
     status="integration"/>

   <configurations>
-    <include file="${project.dir}/ivy/ivy-configurations.xml"/>
+    <include file="${basedir}/trunk/ivy/ivy-configurations.xml"/>
   </configurations>

   <publications>
@@ -35,9 +35,9 @@
   <dependencies>
     <!-- conf="*->@" means every conf is mapped to the conf of the same name of the artifact-->
-
+    <!--
     <dependency org="org.apache.gora" name="gora-core" rev="latest.integration" changing="true" conf="*->@"/>
-
+    -->
     <dependency org="org.jdom" name="jdom" rev="1.1">
       <exclude org="xerces" name="xercesImpl"/>
     </dependency>
This is what the Gora project looks like.
In the Nutch project, we want to comment out the gora-core, gora-hbase, gora-cassandra, and gora-mysql dependencies in Ivy, since they are already loaded by the Gora project included in the Nutch Java Build Path. We then add Gora as a project dependency in the Java Build Path.
That's how we can update Gora code and test it on the fly by running Nutch classes.
I/O Frequency
By default, the records shuffled in the Hadoop job are buffered in memory. At some point, you want to flush the buffer and write the records to the actual database. Similarly, you cannot load the entire content of the database before starting the mappers, so you need to limit the number of records you read at a time.
People who write their own Hadoop jobs, and who do not necessarily use Gora together with Nutch, might be familiar with the mapred-site.xml configuration file. There you can override the default number of rows fetched per select query for read operations, and the default number of rows buffered in memory before they get flushed for write operations. The default for both is 10000.
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>gora.buffer.read.limit</name>
    <value>500</value>
  </property>
  <property>
    <name>gora.buffer.write.limit</name>
    <value>100</value>
  </property>
</configuration>
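If you would rather keep these limits with the rest of a job's setup instead of a separate XML file, setting the same keys directly on the Hadoop Configuration should work as well, although I have not verified every code path, so treat this as an assumption. A minimal sketch, with a placeholder job name:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class BufferLimitsSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Same keys as in the XML snippet above; values are illustrative.
    conf.set("gora.buffer.read.limit", "500");   // rows fetched per select query
    conf.set("gora.buffer.write.limit", "100");  // rows buffered before a flush

    Job job = new Job(conf, "gora-job-sketch");  // placeholder job name
    // ... mapper/reducer and Gora store setup would follow here
    System.out.println(job.getConfiguration().get("gora.buffer.write.limit"));
  }
}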
Cassandra in Gora
A Gora Hadoop job just fetches or emits key/value pairs to process the data. The task performed by the gora-cassandra module is to take the values, which come in as specific instances of Avro records, and store them in Cassandra. So this subproject is just a thin layer between the Gora Hadoop job code and the Cassandra client, which itself interacts with the Cassandra server. That's it.
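For readers who have not used Hector before, the kind of call the new backend ends up delegating to looks roughly like this. This is a standalone sketch against the Hector 0.8-era API; the cluster, keyspace and column family names are made up for the example and are not the ones gora-cassandra actually generates.

import me.prettyprint.cassandra.serializers.StringSerializer;
import me.prettyprint.hector.api.Cluster;
import me.prettyprint.hector.api.Keyspace;
import me.prettyprint.hector.api.factory.HFactory;
import me.prettyprint.hector.api.mutation.Mutator;

public class HectorWriteSketch {
  public static void main(String[] args) {
    // Connect to a local Cassandra 0.8 node through Hector.
    Cluster cluster = HFactory.getOrCreateCluster("Test Cluster", "localhost:9160");
    Keyspace keyspace = HFactory.createKeyspace("WebPage", cluster);

    // Insert a single column; gora-cassandra batches mutations like this
    // one on behalf of the Hadoop job before flushing them.
    Mutator<String> mutator = HFactory.createMutator(keyspace, StringSerializer.get());
    mutator.insert("org.example.www:http/", "f",
        HFactory.createStringColumn("baseUrl", "http://www.example.org/"));
  }
}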
gora-cassandra module
Here are the main changes/improvements:
- Compatibility with Cassandra 0.8
- Use Hector as Cassandra client
- Concurrency now relies on Hector.
- Not all features have been implemented yet: delete query...
Avro schema
The object serialization is dictated by an Avro schema. An example can be found in $NUTCH_HOME/src/gora/webpage.avsc:
{"name": "WebPage", "type": "record", "namespace": "org.apache.nutch.storage", "fields": [ {"name": "baseUrl", "type": "string"}, {"name": "status", "type": "int"}, {"name": "fetchTime", "type": "long"}, {"name": "prevFetchTime", "type": "long"}, {"name": "fetchInterval", "type": "int"}, {"name": "retriesSinceFetch", "type": "int"}, {"name": "modifiedTime", "type": "long"}, {"name": "protocolStatus", "type": { "name": "ProtocolStatus", "type": "record", "namespace": "org.apache.nutch.storage", "fields": [ {"name": "code", "type": "int"}, {"name": "args", "type": {"type": "array", "items": "string"}}, {"name": "lastModified", "type": "long"} ] }}, {"name": "content", "type": "bytes"}, {"name": "contentType", "type": "string"}, {"name": "prevSignature", "type": "bytes"}, {"name": "signature", "type": "bytes"}, {"name": "title", "type": "string"}, {"name": "text", "type": "string"}, {"name": "parseStatus", "type": { "name": "ParseStatus", "type": "record", "namespace": "org.apache.nutch.storage", "fields": [ {"name": "majorCode", "type": "int"}, {"name": "minorCode", "type": "int"}, {"name": "args", "type": {"type": "array", "items": "string"}} ] }}, {"name": "score", "type": "float"}, {"name": "reprUrl", "type": "string"}, {"name": "headers", "type": {"type": "map", "values": "string"}}, {"name": "outlinks", "type": {"type": "map", "values": "string"}}, {"name": "inlinks", "type": {"type": "map", "values": "string"}}, {"name": "markers", "type": {"type": "map", "values": "string"}}, {"name": "metadata", "type": {"type": "map", "values": "bytes"}} ] }
The schema is hardcoded in the Nutch class org.apache.nutch.storage.WebPage, a POJO (Plain Old Java Object) that contains the logic used for crawling a web page. It extends org.apache.gora.persistency.impl.PersistentBase. gora-cassandra uses the PersistentBase class as the base class for RECORD fields, and org.apache.gora.persistency.StatefulHashMap as the base class for MAP fields.
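As an illustration of how those generated fields are used from crawling code, here is a short sketch. The accessor names (setBaseUrl, putToOutlinks, putToMetadata) are the ones I expect the Gora Avro compiler to produce for the schema above; if your generated class differs, adjust accordingly.

import java.nio.ByteBuffer;
import org.apache.avro.util.Utf8;
import org.apache.nutch.storage.WebPage;

public class WebPageFieldsSketch {
  public static void main(String[] args) {
    WebPage page = new WebPage();

    // Scalar fields map directly onto the Avro primitives of the schema.
    page.setBaseUrl(new Utf8("http://www.example.org/"));
    page.setStatus(1);
    page.setFetchTime(System.currentTimeMillis());

    // MAP fields are backed by StatefulHashMap, which lets gora-cassandra
    // track which entries changed and persist only those.
    page.putToOutlinks(new Utf8("http://www.example.org/about"), new Utf8("About"));
    page.putToMetadata(new Utf8("example-key"), ByteBuffer.wrap(new byte[] { 0 }));
  }
}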
The two complex types, MAP and RECORD, are represented in Cassandra by "super columns", which are maps, i.e. sets of key/value pairs. The ARRAY type is represented by a simple column, as a comma-separated list bounded by square brackets, like "[one, two, three]". Not the best.
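To make the ARRAY remark concrete, the column value is essentially the list's bracketed string form, something along these lines (an illustration of the format only, not the actual gora-cassandra code):

import java.util.Arrays;
import java.util.List;

public class ArrayColumnFormatSketch {
  public static void main(String[] args) {
    List<String> values = Arrays.asList("one", "two", "three");

    // AbstractCollection#toString already yields "[one, two, three]",
    // i.e. the bracketed, comma-separated form described above.
    String columnValue = values.toString();
    System.out.println(columnValue);

    // Reading it back means stripping the brackets and splitting on ", ",
    // which breaks as soon as an element contains a comma - hence "not the best".
    String inner = columnValue.substring(1, columnValue.length() - 1);
    String[] parts = inner.split(", ");
    System.out.println(Arrays.toString(parts));
  }
}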
Hi Alexis,
I've been reading your posts and getting my head around as much Cassandra literature as I can. However, I'm wondering if you can provide any experiences with what type of properties to set in the gora.properties file. I'm aware that there are obvious properties to add, but I just don't know what the syntax should be, and/or how extensive the list is. Have you any pointers you can provide from your own experiences?
Kind Regards
Lewis