Sunday, November 28, 2010

Java HttpComponents

HttpComponents & Non-blocking IO


Table of Contents
Introduction
Non-blocking HTTP client
HttpComponents architecture
   1. The IO reactors
   2. The HTTP client and request execution handlers
   3. The HTTP connection
Java HTTP client application
  Multi-Threading
  Non blocking I/O
     java.nio simple application
     httpcore-nio application
     Encoding detection
Conclusion


Introduction


At my company, I started a project that checks millions of URLs for their HTTP status code (200, 404, ...) in order to flag those that would lead the user to a dead end. So I was looking for an efficient way to download URLs from multiple hosts simultaneously.

In this post, we take a deep look at the HttpComponents framework, formerly Apache Commons HttpClient. Alternatives to HttpComponents exist for both the client and the server side in Java: for clients, you can take a look at another Apache project, Mina, or JBoss' Netty; for servers, Jetty or Sun's Glassfish.



Non-blocking HTTP client



Two approaches are possible when implementing an HTTP client/server. You can create a multi-threaded application that runs one thread per host/client, or a single-threaded one that leverages event-driven non-blocking IO. The second approach avoids the CPU cost of the context switches that the multi-threaded model incurs whenever it moves from handling one host to another.

The event-driven approach creates a reactor: an infinite loop that blocks on a Linux kernel epoll call in order to perform readiness selection among all the sockets we are trying to talk to. When a socket is ready to be written to or read from, epoll returns it, and the appropriate action is triggered to handle the request according to the I/O event. The design of such a client follows the pattern of a finite state machine, as we need to define the next step to be performed given the current state of the request: for example, wait for the socket to become readable after it was written to, in order to get the response to the request that was just sent.
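To make the state machine idea concrete, here is a minimal sketch of the transitions for one keep-alive connection. The names ConnState, IoEvent and ConnStateMachine are made up for this illustration and are not part of HttpComponents; the EOF event models a read that returns -1.

```java
/** States one keep-alive connection moves through (hypothetical names). */
enum ConnState { CONNECTING, READY_TO_WRITE, AWAITING_RESPONSE, CLOSED }

/** Readiness events as a selector would report them; EOF models a read returning -1. */
enum IoEvent { CONNECTABLE, WRITABLE, READABLE, EOF }

final class ConnStateMachine {
    /** Returns the next state, or throws if the event is not expected in the current state. */
    static ConnState next(ConnState state, IoEvent event, boolean moreRequests) {
        switch (state) {
            case CONNECTING:
                // The socket finished connecting: we may now write a request.
                if (event == IoEvent.CONNECTABLE) return ConnState.READY_TO_WRITE;
                break;
            case READY_TO_WRITE:
                // The request was written: wait for the socket to become readable.
                if (event == IoEvent.WRITABLE) return ConnState.AWAITING_RESPONSE;
                break;
            case AWAITING_RESPONSE:
                // A response arrived: send the next request or stop.
                if (event == IoEvent.READABLE) {
                    return moreRequests ? ConnState.READY_TO_WRITE : ConnState.CLOSED;
                }
                // The peer closed the connection: reconnect before retrying.
                if (event == IoEvent.EOF) return ConnState.CONNECTING;
                break;
            default:
                break;
        }
        throw new IllegalStateException(event + " is not expected in state " + state);
    }
}
```

A reactor loop would call next(...) once per selected key and perform the matching action (write the request, parse the response, reconnect) before going back to the selection call.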

The java.nio package made this approach possible in Java, by introducing the notions of channels and selectors. We now look into the httpcore-nio library from HttpComponents, since it provides an API around java.nio that lets you build an asynchronous HTTP client and hence download URLs from many hosts concurrently on a single thread.

HttpComponents architecture


A good starting point is chapter 2 of the tutorial, called "NIO extension". We will not try to explain the API by commenting on a few snippets, but rather by describing the workflow while looking at the relationships between all the classes involved via diagrams. I hope you will find complementary information in both resources. One interface plays a key role in the application and is detailed in this section: org.apache.http.nio.protocol.NHttpRequestExecutionHandler, the HTTP request execution handler. Its org.apache.http.nio.protocol.RequestExecutionHandler implementation is described throughout the following 3 points.

The following UML analysis focuses on 3 subgroups of the entire HttpComponents ecosystem:

  1. The IO reactors
  2. The HTTP client and request execution handlers
  3. The HTTP connection

1. The IO reactors



The entry point of the diagram is the DefaultConnectingIOReactor class. By calling its constructor you will be able to create a main reactor that establishes connections in a non-blocking way.


Before doing any HTTP request, you need to connect to the remote host via the connect method of the ConnectingIOReactor interface, whose prototype is:

(org.apache.http.nio.reactor.ConnectingIOReactor)
 SessionRequest connect(SocketAddress remoteAddress,  SocketAddress localAddress, Object attachment,  SessionRequestCallback callback);

This registers a newly created SocketChannel object with the main Selector object, in order to wait for this socket's connectability. As soon as the main reactor selects the associated SelectionKey, we can finalize the connection on its associated SocketChannel.
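Stripped of the reactor classes, the connect step boils down to a few java.nio calls. The sketch below is not the library's code: ConnectReadinessDemo and connectViaSelector are names invented for this post, and a real reactor would keep one long-lived selector instead of opening one per connection.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.SocketChannel;
import java.util.Iterator;

final class ConnectReadinessDemo {
    /** Connects without blocking on connect(); returns false if the timeout elapses first. */
    static boolean connectViaSelector(InetSocketAddress remote, long timeoutMillis) throws IOException {
        try (Selector selector = Selector.open(); SocketChannel channel = SocketChannel.open()) {
            channel.configureBlocking(false);
            if (channel.connect(remote)) {
                return true; // may complete immediately, e.g. on loopback
            }
            // Ask the selector to tell us when the connection can be finalized.
            channel.register(selector, SelectionKey.OP_CONNECT);
            long deadline = System.currentTimeMillis() + timeoutMillis;
            while (System.currentTimeMillis() < deadline) {
                selector.select(timeoutMillis);
                Iterator<SelectionKey> keys = selector.selectedKeys().iterator();
                while (keys.hasNext()) {
                    SelectionKey key = keys.next();
                    keys.remove(); // selected keys must be removed by hand
                    if (key.isConnectable() && channel.finishConnect()) {
                        return true;
                    }
                }
            }
            return false;
        }
    }
}
```

Note that finishConnect throws an IOException when the connection is refused, so a caller checking dead links would want to record that as a failure rather than let it escape the loop.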





The designers decided to create worker reactors aside from the main reactor, which will take care of the HTTP request per se. So far we just managed to establish the connection!


Once the worker reactor detects a new ChannelEntry, it registers the associated channel with its worker selector to wait for readability. The socket will not become readable until we submit the HTTP request.

The third argument of the connect method, called "attachment", will be set as a user-defined attribute on the new IOSession object.

2. The HTTP client and request execution handlers


A session just got created after the channel was added to the list of new channels. The IO event dispatch first adds a new NHttpClientConnection attribute to the session object. It then dispatches the event to the HTTP client handler. Once notified by the connected method,

(org.apache.http.nio.NHttpClientHandler)
 void connected(NHttpClientConnection conn, Object attachment);

the NHttpClientHandler calls initalizeContext (spelled this way in the API) on its NHttpRequestExecutionHandler member,

(org.apache.http.nio.protocol.RequestExecutionHandler)
 public void initalizeContext(final HttpContext context, final Object attachment) {
  context.setAttribute("queue", attachment);
 }

in order to let the request execution handler know that we are now connected and to set up the application-specific data.







Do you remember that we added an "attachment" to the session when requesting a connection? We just propagated this attachment as an attribute to the HttpContext object owned by the HTTP connection in our implementation of NHttpRequestExecutionHandler. Let's call the attribute "queue". Indeed, it can be, for example, a queue of jobs that represents the list of urls we want to download from a single host.

The connected method inside the client handler then calls the requestReady method from the same class:

(org.apache.http.nio.NHttpClientHandler)
 void requestReady(NHttpClientConnection conn);

It will initialize the HTTP request by calling the submitRequest method on the request execution handler. Our implementation loads a new attribute in the HttpContext object by adding a new job that is polled from the queue. This will allow us to retrieve the job attribute later once we receive a valid response, so that we update the job with its HTTP status code, for example.

(org.apache.http.nio.protocol.RequestExecutionHandler)
 public HttpRequest submitRequest(final HttpContext context) {
  // The queue was stored in the context by initalizeContext.
  @SuppressWarnings("unchecked")
  Queue<Job> queue = (Queue<Job>) context.getAttribute("queue");
  if (queue == null) {
   throw new IllegalStateException("Queue is null");
  }

  // Remember the in-flight job so that handleResponse can update it later.
  Job testjob = queue.poll();
  context.setAttribute("job", testjob);

  if (testjob != null) {
   return generateRequest(testjob);
  } else {
   // No request is submitted when the queue is empty.
   return null;
  }
 }


"connected" then calls, through requestReady, the submitRequest method on the NHttpClientConnection object.

3. The HTTP connection



The HTTP connection object performs the request submission. It writes the request into a buffer and turns on the writable mask in the selection key associated with the channel.






To summarize, the sequence of actions performed in order to send the first request is:

BaseIOReactor.sessionCreated / DefaultClientIOEventDispatch.connected / AsyncNHttpClientHandler.connected, requestReady / RequestExecutionHandler.submitRequest /  DefaultNHttpClientConnection.submitRequest / DefaultHttpRequestWriter.write / IOSessionImpl.setEvent(EventMask.WRITE)

and in order to send subsequent requests: 

BaseIOReactor.writable / DefaultClientIOEventDispatch.outputReady / DefaultNHttpClientConnection.produceOutput / AsyncNHttpClientHandler.requestReady / RequestExecutionHandler.submitRequest / DefaultNHttpClientConnection.submitRequest / DefaultHttpRequestWriter.write / IOSessionImpl.setEvent(EventMask.WRITE)

Once the socket becomes readable, here is the cascade of listeners that are triggered by the IO Reactor:

BaseIOReactor.readable / DefaultClientIOEventDispatch.inputReady / DefaultNHttpClientConnection.consumeInput / AsyncNHttpClientHandler.inputReady, processResponse / RequestExecutionHandler.handleResponse


Once the worker reactor detects the socket's readability, the system reads the response, parses it and notifies our request execution handler. At this point we need to retrieve the current job from the HttpContext and we can finally update its result according to the HttpResponse:

(org.apache.http.nio.protocol.RequestExecutionHandler)
 public void handleResponse(final HttpResponse response, final HttpContext context) {
  Job testjob = (Job) context.removeAttribute("job");
  if (testjob == null) {
   throw new IllegalStateException("TestJob is null");
  }

  int statusCode = response.getStatusLine().getStatusCode();
  String content = null;

  HttpEntity entity = response.getEntity();
  if (entity != null) {
   try {
    content = EntityUtils.toString(entity);
   } catch (IOException ex) {
    content = "I/O exception: " + ex.getMessage();
   }
  }
  testjob.setResult(statusCode, content);
 }
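The submit/handle cycle can be tried out without the library on the classpath. In the sketch below a plain Map stands in for HttpContext, and UrlJob and ContextCycleSketch are hypothetical names; only the "queue" and "job" attribute keys come from the snippets above.

```java
import java.util.Map;
import java.util.Queue;

/** Hypothetical job type: one URL plus the status code recorded for it. */
final class UrlJob {
    final String url;
    int statusCode = -1; // -1 means "not checked yet"

    UrlJob(String url) {
        this.url = url;
    }
}

final class ContextCycleSketch {
    static final String QUEUE = "queue";
    static final String JOB = "job";

    /** Mirrors submitRequest: poll the next job and remember it in the context. */
    @SuppressWarnings("unchecked")
    static UrlJob submit(Map<String, Object> context) {
        Queue<UrlJob> queue = (Queue<UrlJob>) context.get(QUEUE);
        if (queue == null) {
            throw new IllegalStateException("Queue is null");
        }
        UrlJob job = queue.poll();
        if (job == null) {
            context.remove(JOB); // nothing in flight any more
        } else {
            context.put(JOB, job);
        }
        return job;
    }

    /** Mirrors handleResponse: take the in-flight job back out and record its result. */
    static UrlJob handle(Map<String, Object> context, int statusCode) {
        UrlJob job = (UrlJob) context.remove(JOB);
        if (job == null) {
            throw new IllegalStateException("No job in flight");
        }
        job.statusCode = statusCode;
        return job;
    }
}
```

Because each connection owns its own context, at most one job per connection is in flight at any time, which is what lets a response be matched to its request without extra bookkeeping.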

That's it. I guess we pretty much went over how the httpcore-nio library handles the lifecycle of the HTTP request. The next section describes an HttpComponents-based client that executes HEAD requests simultaneously.


Java HTTP client application

Before starting anything, check out the HttpComponents httpclient and httpcore trunk versions with SVN. I created an Eclipse project for each directory.


All the application classes are checked in on GitHub, within this directory. The purpose of the application is to issue HEAD requests to multiple hosts simultaneously. The input is a list of tab-separated host/url pairs:

6.cn http://6.cn/w/2j5J9gtTAfDpFPUMbZZz2g
6.cn http://6.cn/w/4QNOFBPKza/zbkQDI7ncRg
academicearth.org http://academicearth.org/lectures/biot-savart-law-gauss-law-for-magnetic-fields
academicearth.org http://academicearth.org/lectures/captial-structure-healthcare
affiliate.kickapps.com http://affiliate.kickapps.com/_Stack-and-Tilt/VIDEO/445590/71460.html
agourahills.patch.com http://agourahills.patch.com//articles/elementary-schools-out-for-sumac-fifth-graders#video-500899
alkislarlayasiyorum.com http://alkislarlayasiyorum.com/icerik/40154/
alkislarlayasiyorum.com http://alkislarlayasiyorum.com/icerik/40314/
alkislarlayasiyorum.com http://alkislarlayasiyorum.com/icerik/40321/
alkislarlayasiyorum.com http://alkislarlayasiyorum.com/icerik/40326/
alkislarlayasiyorum.com http://alkislarlayasiyorum.com/icerik/40367/
alkislarlayasiyorum.com http://alkislarlayasiyorum.com/icerik/40443/



A JobQueue object represents the list of urls to be checked within the same host. A Job2 object is mapped to every url. Here is what the Eclipse setup looks like:









Multi-Threading


The parallelization relies on a thread pool that can hold as many running threads as connections. See the SyncDeadlinkChecker class. For every JobQueue returned by the iterator, we spawn a new thread to fetch its list of urls.
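The thread-per-host idea can be sketched with java.util.concurrent. This is not the SyncDeadlinkChecker source: PerHostPoolSketch is a name made up here, the checker argument is a placeholder for the blocking HTTP call, and a fixed-size pool is assumed for simplicity.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.ToIntFunction;

final class PerHostPoolSketch {
    /** Checks every URL, one task per host, and returns the status code found for each URL. */
    static Map<String, Integer> checkAll(Map<String, List<String>> urlsByHost,
                                         int maxThreads,
                                         ToIntFunction<String> checker) throws InterruptedException {
        Map<String, Integer> results = new ConcurrentHashMap<>();
        ExecutorService pool = Executors.newFixedThreadPool(maxThreads);
        for (List<String> hostUrls : urlsByHost.values()) {
            // URLs of one host are fetched sequentially by the same task.
            pool.execute(() -> {
                for (String url : hostUrls) {
                    results.put(url, checker.applyAsInt(url));
                }
            });
        }
        pool.shutdown(); // accept no new tasks, let the queued ones finish
        if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
            pool.shutdownNow(); // give up on whatever is still running
        }
        return results;
    }
}
```

Keeping the URLs of one host inside one task is what allows a keep-alive connection to be reused, at the price of one blocked thread per host being checked.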



Non blocking I/O


java.nio simple application

Let's take a look at a java.nio based HTTP client. The source is located in SimpleNHttpClient class. It is "simple" because it only manages one connection and it sends the same type of request, without reacting to the information in the response headers, such as the connection state or the cookies. For example this is how we create the request:

private void loadRequest(String path) {
  writeLine("HEAD " + path + " HTTP/1.1");
  writeLine("Connection: Keep-Alive");
  writeLine("Host: " + this.host);
  writeLine("User-Agent: TEST-CLIENT/1.1");
  writeLine("");
 }


I guess this class is a good starting point to debug urls that may or may not work on httpcore-nio. An interesting point is that you need 2 arrays, one that contains the raw format of the data, the other the actual decoded characters. This means you need to pick an appropriate decoder to parse the response.
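The raw/decoded pair maps naturally onto a ByteBuffer and a CharBuffer joined by a CharsetDecoder. The following is a minimal sketch of that step, assuming the channel has already filled the byte buffer; TwoBufferDecode and drain are hypothetical names, not taken from SimpleNHttpClient.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;

final class TwoBufferDecode {
    /** Decodes whatever is currently in the raw buffer and returns it as text. */
    static String drain(ByteBuffer raw, CharBuffer chars, CharsetDecoder decoder) {
        raw.flip(); // switch the byte buffer from filling to reading
        StringBuilder out = new StringBuilder();
        CoderResult result;
        do {
            // false: more input may follow in a later read
            result = decoder.decode(raw, chars, false);
            chars.flip();
            out.append(chars);
            chars.clear();
        } while (result.isOverflow()); // the char buffer was full: go around again
        raw.compact(); // keep undecoded trailing bytes for the next read
        return out.toString();
    }
}
```

With an ISO-8859-1 decoder every byte maps to a character, so nothing is ever left behind. With a stricter decoder the loop stops at the first malformed byte and leaves it in the raw buffer, which is the situation discussed in the encoding detection section.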

Let's run an example on these 2 urls:

http://video.tvguide.com/Date+Night+2010/Date+Night/4866938?autoplay=true%20partnerid=OVG
http://video.tvguide.com/Brooks++Dunn/Put+a+Girl+in+It/5445966?autoplay=true%20partnerid=OVG

Execution log:

0 DEBUG [main] SimpleNHttpClient - Connected non blocking: false
16 DEBUG [main] SimpleNHttpClient - Key is connectable
17 DEBUG [main] SimpleNHttpClient - Connected: true
17 DEBUG [main] SimpleNHttpClient - Key is writable
21 DEBUG [main] SimpleNHttpClient - HEAD /Date+Night+2010/Date+Night/4866938?autoplay=true%20partnerid=OVG HTTP/1.1
Connection: Keep-Alive
Host: video.tvguide.com
User-Agent: TEST-CLIENT/1.1

22 DEBUG [main] SimpleNHttpClient - Number of bytes written: 161
218 DEBUG [main] SimpleNHttpClient - Key is readable
219 DEBUG [main] SimpleNHttpClient - Number of bytes read: 291
219 DEBUG [main] SimpleNHttpClient - HTTP/1.1 200 OK
Server: Microsoft-IIS/6.0
P3P: policyref=" /w3c/p3p.xml", CP="CAO PSA OUR BUS"
X-Powered-By: ASP.NET
X-AspNet-Version: 2.0.50727
Content-Type: text/html; charset=utf-8
Cache-Control: private, max-age=2700
Date: Sat, 20 Nov 2010 05:35:56 GMT
Connection: keep-alive


219 DEBUG [main] SimpleNHttpClient - Key is writable
220 DEBUG [main] SimpleNHttpClient - HEAD /Brooks++Dunn/Put+a+Girl+in+It/5445966?autoplay=true%20partnerid=OVG HTTP/1.1
Connection: Keep-Alive
Host: video.tvguide.com
User-Agent: TEST-CLIENT/1.1

220 DEBUG [main] SimpleNHttpClient - Number of bytes written: 164
221 DEBUG [main] SimpleNHttpClient - Key is readable
221 DEBUG [main] SimpleNHttpClient - Number of bytes read: -1
221 DEBUG [main] SimpleNHttpClient - EOF was reached
222 DEBUG [main] SimpleNHttpClient - Adding again /Brooks++Dunn/Put+a+Girl+in+It/5445966?autoplay=true%20partnerid=OVG
233 DEBUG [main] SimpleNHttpClient - Connected non blocking: false
242 DEBUG [main] SimpleNHttpClient - Key is connectable
242 DEBUG [main] SimpleNHttpClient - Connected: true
243 DEBUG [main] SimpleNHttpClient - Key is writable
243 DEBUG [main] SimpleNHttpClient - HEAD /Brooks++Dunn/Put+a+Girl+in+It/5445966?autoplay=true%20partnerid=OVG HTTP/1.1
Connection: Keep-Alive
Host: video.tvguide.com
User-Agent: TEST-CLIENT/1.1

243 DEBUG [main] SimpleNHttpClient - Number of bytes written: 164
366 DEBUG [main] SimpleNHttpClient - Key is readable
367 DEBUG [main] SimpleNHttpClient - Number of bytes read: 291
367 DEBUG [main] SimpleNHttpClient - HTTP/1.1 200 OK
Server: Microsoft-IIS/6.0
P3P: policyref=" /w3c/p3p.xml", CP="CAO PSA OUR BUS"
X-Powered-By: ASP.NET
X-AspNet-Version: 2.0.50727
Content-Type: text/html; charset=utf-8
Cache-Control: private, max-age=2700
Date: Sat, 20 Nov 2010 05:35:56 GMT
Connection: keep-alive

367 DEBUG [main] SimpleNHttpClient - All responses were received

As you can see, we run into a small caveat at time = 221 ms. In this particular case, we immediately reach EOF when reading from the socket. This means we need to disconnect and reconnect to be able to receive a valid response to the request that was sent.
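The "Adding again" step from the log can be expressed as a small decision function. This is only a sketch of the idea with invented names (EofRetrySketch, afterRead, Action); the bytesRead argument is the value returned by SocketChannel.read.

```java
import java.util.Deque;

final class EofRetrySketch {
    enum Action { WAIT, PARSE_RESPONSE, RECONNECT }

    /** Decides what to do after a non-blocking read for the path that is in flight. */
    static Action afterRead(int bytesRead, String inFlight, Deque<String> pending) {
        if (bytesRead < 0) {
            // EOF: the server closed the connection, so retry this path first.
            pending.addFirst(inFlight);
            return Action.RECONNECT;
        }
        if (bytesRead == 0) {
            return Action.WAIT; // nothing available yet, stay registered for reads
        }
        return Action.PARSE_RESPONSE;
    }
}
```

Putting the path back at the head of the queue preserves the original order of the urls, at the cost of retrying forever if the server always hangs up; a retry counter per path would be a sensible addition.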


httpcore-nio application



It's easy to run into a few pitfalls while writing the application. One needs to respect the event-driven (asynchronous) nature of programming with non-blocking IO. It may turn writing a unit test into a challenge.


Testing an event-driven application


The test case I took inspiration from is located here:

(org.apache.http.nio.protocol.TestAsyncNHttpHandlers)
 public void testHttpHeads() throws Exception;


It reverses the inversion of control by using a wait/notify handshake between threads. A unit test would wait for the job completion by blocking on the wait method:

synchronized(job) {
  try {
   job.wait();
  } catch (InterruptedException ie) {
   LOG.warn(ie);
  }
 }

The request handler would notify the test thread as soon as the response was received:

synchronized(job) {
  job.notify();
 }

This would resume the sleeping test thread which could move on to the next job in a synchronous fashion.
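One caveat with a bare wait/notify pair is that a notification sent before the test thread starts waiting is lost, and that wait may return spuriously. A common remedy, sketched below with a hypothetical JobSignal class (not part of the test case mentioned above), is to guard the wait with a flag and a timeout.

```java
final class JobSignal {
    private boolean done; // guarded by the monitor of this object

    /** Called by the response handler once the job's result is known. */
    synchronized void markDone() {
        done = true;
        notifyAll();
    }

    /** Blocks until markDone has run or the timeout elapses; returns whether the job is done. */
    synchronized boolean awaitDone(long timeoutMillis) throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (!done) { // loop: wait may wake up without a notification
            long remaining = deadline - System.currentTimeMillis();
            if (remaining <= 0) {
                return false;
            }
            wait(remaining);
        }
        return true;
    }
}
```

The test thread would call awaitDone and the request handler markDone. Since the flag is checked before waiting, the order in which the two threads arrive no longer matters.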



Thread starvation



When I used the very convenient yet treacherous wait/notify exchange per job from above, a thread starvation issue would pop up regularly when processing several urls on a single host. Basically the greedy I/O dispatcher thread would notify the main thread but hold on to the lock, preventing the idle thread from resuming. So the IO reactor would keep waiting for additional urls without seeing any coming in. A second mistake was to reconnect to the host in the main thread. For a similar reason, the connection request should be issued within the thread that runs the reactor.


Let's try to move away from threads as much as possible, since the application is expected to be single-threaded yet with the highest performance. This wait/notify exchange per job would block us from processing several hosts in parallel, since it is synchronous by design. Unless we spawn a thread per host, which is what we want to avoid...



Features


  • We implement redirect following pretty easily: we just add a new job to the queue after parsing the Location header.
  • We close the connection when the latest response's Connection header is "close" or when the job queue is empty.
  • We create a SessionRequestCallback implementation whose "completed" method gets called when the connection is established, and an NHttpRequestExecutionHandler one whose "handleResponse" method gets called once the response is received.
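The first two features reduce to a couple of small decisions per response. Here is a library-free sketch of them; ResponsePolicy and its methods are invented for this illustration, the queue holds plain URL strings instead of job objects, and the set of redirect status codes is my assumption.

```java
import java.net.URI;
import java.util.Queue;

final class ResponsePolicy {
    /** Status codes assumed here to carry a Location header worth following. */
    static boolean isRedirect(int status) {
        return status == 301 || status == 302 || status == 303 || status == 307;
    }

    /** Queues the redirect target if there is one; returns whether a job was added. */
    static boolean followRedirect(int status, String requestUrl, String location, Queue<String> queue) {
        if (!isRedirect(status) || location == null) {
            return false;
        }
        // Location may be relative, so resolve it against the URL that was requested.
        queue.add(URI.create(requestUrl).resolve(location.trim()).toString());
        return true;
    }

    /** Close when the server asked for it or when nothing is left to send. */
    static boolean shouldClose(String connectionHeader, Queue<String> queue) {
        boolean serverAsked = connectionHeader != null
                && "close".equalsIgnoreCase(connectionHeader.trim());
        return serverAsked || queue.isEmpty();
    }
}
```

Two things are left out on purpose: a redirect that points to another host belongs in that host's queue rather than the current one, and a hop counter is needed to avoid looping on circular redirects.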

Encoding detection



The URL below is not supported by the library, because a non-standard character (§) is sent in ISO-8859-1 encoding.


We need to first parse the charset value in the Content-Type header to be able to select the appropriate decoder that converts the raw bytes to characters. A quick hack consists of replacing the characters that would break the decoding, as suggested in http://old.nabble.com/Please-make-CharsetDecoder-less-strict-in-SessionInputBufferImpl-td24296440.html. Here is a diff of the hack:

===================================================================
--- httpcore-nio/src/main/java/org/apache/http/impl/nio/reactor/SessionInputBufferImpl.java (revision 1037110)
+++ httpcore-nio/src/main/java/org/apache/http/impl/nio/reactor/SessionInputBufferImpl.java (working copy)
@@ -36,6 +36,7 @@
 import java.nio.charset.Charset;
 import java.nio.charset.CharsetDecoder;
 import java.nio.charset.CoderResult;
+import java.nio.charset.CodingErrorAction;
 
 import org.apache.http.nio.reactor.SessionInputBuffer;
 import org.apache.http.nio.util.ByteBufferAllocator;
@@ -73,6 +74,7 @@
         this.charbuffer = CharBuffer.allocate(linebuffersize);
         this.charset = Charset.forName(HttpProtocolParams.getHttpElementCharset(params));
         this.chardecoder = this.charset.newDecoder();
+        this.chardecoder.onMalformedInput(CodingErrorAction.REPLACE);
     }
 
     public SessionInputBufferImpl(
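Both halves of the discussion, picking the charset from the Content-Type header and making the decoder lenient, can be tried outside the library. The sketch below uses only JDK classes; CharsetSketch, charsetOf and decode are hypothetical names, and US-ASCII is assumed as the strict charset in the usage note.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.util.Locale;

final class CharsetSketch {
    /** Extracts the charset parameter from a Content-Type value, or returns the fallback. */
    static Charset charsetOf(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String param : contentType.split(";")) {
            String p = param.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = p.substring("charset=".length()).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    return fallback; // unknown or illegal charset name
                }
            }
        }
        return fallback;
    }

    /** Decodes bytes; when lenient, malformed input becomes U+FFFD instead of an exception. */
    static String decode(byte[] bytes, Charset charset, boolean lenient) throws CharacterCodingException {
        CharsetDecoder decoder = charset.newDecoder();
        if (lenient) {
            decoder.onMalformedInput(CodingErrorAction.REPLACE); // same call as in the diff
        }
        return decoder.decode(ByteBuffer.wrap(bytes)).toString();
    }
}
```

Decoding the byte 0xA7 (§ in ISO-8859-1) with a strict US-ASCII decoder throws a CharacterCodingException, while the lenient variant yields the replacement character and decoding with ISO-8859-1 itself recovers the § sign.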


Conclusion


Let's execute the AsyncDeadlinkChecker class and compare its performance to the SyncDeadlinkChecker one.


The benchmark is a 40k url input spanning 1k hosts. The machine runs a "Genuine Intel(R) CPU T2050 @ 1.60GHz" dual core with 1GB of RAM, under an advertised 10 Mbps Cable connection, running in reality at 300 kB/s. We obtain the following chart:




First, we have to acknowledge that the download rate showing up here is pretty low. This post is not (yet) about writing the highest performance HTTP client. The chart displays a rate's peak at around 200 URLs per second. This is actually normal given the size of my input, which contained only 1000 distinct hosts. I need more URLs... At least we have some data to compare both models.

Second, the startup is pretty slow. Indeed it takes around 10k URLs to reach a "reasonable" rate, higher than 100. This is due to Java internals I am not quite aware of: excessive CPU required to read the input, heap size increase or memory management to fit more data?

Third, HttpClient does not handle concurrency well, since the failure rate is way too high at the beginning (and at the end). I might be using a wrong multi-threaded implementation, or misunderstanding the thread pool class.

Finally, the whole point of this blog entry was to show that the non-blocking I/O implementation performs better than the multi-threaded one. This is reflected in the graph once we remove the noise showing up in the synchronous httpclient curve, i.e. once we remove the points of the blue curve that see too many IOExceptions, as indicated by the yellow curve.