Saturday, August 12, 2017

Java Parser Benchmark

Intro

I ran into slow parsing issues when loading this Hackerrank problem input. When parsing 1M integers, you can no longer rely on high level library such as java.util.Scanner as they are not optimized to be fast. You can easily replace this handy yet slow utility by a lower level function that wraps simple String.split / Integer.parseInt operations.

We also compare loading file from disk using JVM Heap buffer vs native, off-Heap buffer. We also look at the encoding overhead to convert binary data to UTF-8 characters required before manipulating strings. This will show the benefits of just converting bytes to int data to achieve the best performance.

Maven Project

The code is bundled in this SameOccurrence maven project. The input is a 16 MB plain text file containing more than 1M integers to scan. This was the input of Test Case #18 in the Hackerrank challenge.

Benchmark


Antlr parser is the slowest, loading 1M integers in 3 seconds. This is due to extra overhead to generate the parse tree. Parser generators are very helpful to validate input syntax using a grammar, but are not tailored for performance.

On the other end the raw byte parser finishes in 10 ms as it does not do extra conversions / validation. It assumes the input is valid and simply converts sequence of bytes separated by whitespaces into base 10 integers.

Here we compare the time to load a file into a heap buffer, a byte array allocated within the JVM and the time to load the file using mmap system call, implemented via a native function that allocates the buffer off-heap leveraging Virtual Memory. We reduce our 20 ms loading time down to 4 ms.

The 3rd bar shows the UTF-8 conversion overhead you will need to include to start working with Strings. On top of loading raw data in 4 ms, you need to spend an additional 80 ms to just get UTF-8 data ...

Antlr

ANTLR is a parser generator written in Java that converts a grammar into a parser. It provides a maven plugin that wraps the Tool utility to parse the grammar and generate associated parser.
You implement a parse listener to load the data as rules get executed.


public void antlr() {
    CharStream charStream = CharStreams.fromString(input);

    Lexer lexer = new SameOccurrenceLexer(charStream);
    CommonTokenStream stream = new CommonTokenStream(lexer);
    SameOccurrenceParser parser = new SameOccurrenceParser(stream);
    parser.addParseListener(new SameOccurrenceParseListener());
    parser.getInterpreter().setPredictionMode(PredictionMode.LL);
    parser.r();
}

Scanner

With the JDK Scanner it is very easy to load raw data types from an input stream into memory.

public void scan() {
    Scanner scanner = new Scanner(input);
    int n = scanner.nextInt();
    int q = scanner.nextInt();
    int[] a = new int[n];
    for (int i = 0; i < n; i++) {
        a[i] = scanner.nextInt();
    }
    for (int t = 0; t < q; t++) {
        int x = scanner.nextInt();
        int y = scanner.nextInt();
    }
    scanner.close();
}

Split

With the JDK String split method, you can break down your String input into tokens using a regex separator. Then convert string to int using Integer.parseInt. Returning raw data type is faster than Integer object via Integer.valueOf.

int tokenOffset = 0;
public void split() {
    String[] tokens = input.split("\\s+");
    int n = parseToken(tokens);
    int q = parseToken(tokens);
    int[] a = new int[n];
    for (int i = 0; i < n; i++) {
        a[i] = parseToken(tokens);
    }
    for (int t = 0; t < q; t++) {
        int x = parseToken(tokens);
        int y = parseToken(tokens);
    }
    assert tokenOffset == tokens.length;
}

private int parseToken(String[] tokens) {
    return Integer.parseInt(tokens[tokenOffset++]);
}

Raw

You just scan for the next whitespace delimiter, then converts the byte sequence to an integer. This is the same logics as Integer.parseInt implementation.

int byteOffset = 0;
public void raw() {
    int n = parseInt();
    int q = parseInt();
    int[] a = new int[n];
    for (int i = 0; i < n; i++) {
        a[i] = parseInt();
    }
    for (int t = 0; t < q; t++) {
        int x = parseInt();
        int y = parseInt();
    }
}

private static final int BASE = 10;
private static final char ZERO = '0';
private int parseInt() {
    int start = byteOffset;
    int end = getTokenEnd();

    int i = 0;
    int pow = 1;
    for (int pos = end; pos >= start; pos--) {
        int digit = byteBuffer.get(pos) - ZERO;
        i += pow * digit;
        pow *= BASE;
    }
    return i;
}

private int getTokenEnd() {
    while (isWhitespace(byteBuffer.get(byteOffset))) {
        byteOffset++;
    }
    int pos = byteOffset;
    while(! isWhitespace(byteBuffer.get(pos))) {
        pos++;
    }
    byteOffset = pos;
    return pos-1;
}

private static final char LF = '\n';
private static final char CR = '\r';
private static final char SPACE = ' ';
private boolean isWhitespace(byte b) {
    return b == LF || b == CR || b == SPACE;
}