Cassandra performance in Python: Avoid namedtuple

Companion code for this post available on Github

TLDR: Use dict_factory instead of named_tuple_factory for the python cassandra driver.

Late last year, namedtuple was fingered as a culprit causing slow startup time in larger python applications. After some back and forth (covered by LWN), it appears that some improvements will come in later versions of Python 3, but namedtuple creation remains expensive.

For applications with a fixed number of named tuples, this is a startup penalty but not a drag on operations while running - once the tuples are made, there is no need to remake them. Unfortunately, there are some libraries that do repeatedly create named tuples at runtime. Including, as I discovered while looking for something else entirely, the DataStax python driver for Cassandra.

If you have used Cassandra from python before, you’re likely familiar with the fact that for each select query, you are returned a Row object with a field for each column. These Row objects are in fact named tuples, generated by the library after it receives the raw results from the Cassandra nodes, according to the following factory implementation (somewhat edited for brevity):

def named_tuple_factory(colnames, rows):
    """
    Returns each row as a namedtuple
    https://docs.python.org/2/library/collections.html#collections.namedtuple
    This is the default row factory.
    [...]
    """
    clean_column_names = map(_clean_column_name, colnames)
    try:
        Row = namedtuple('Row', clean_column_names)
    except Exception:
        # Create list because py3 map object will be consumed by first attempt
        clean_column_names = list(map(_clean_column_name, colnames))
        # [...]
        Row = namedtuple('Row', _sanitize_identifiers(clean_column_names))

    return [Row(*row) for row in rows]

As we can see, a new named tuple class is created every time you make a query returning a result set. If your application makes a large number of small queries, the named tuple processing time can quickly add up. Even if you are reading the same table repeatedly, there is no cache mechanism for re-using the previously generated tuple type. For the program I was profiling, a staggering 34% of worker run time was spend in named tuple creation. Luckily, the row factory for the Cassandra client can be easily changed, and others are available - raw tuples, dictionaries, and ordered dictionaries. I consider ordinary tuples to be non-ideal since the return data will be position-dependent instead of keyed, so that leaves the dictionary factory as the best alternative on paper. If we take a look at it, it’s rather simple:

def dict_factory(colnames, rows):
    """
    Returns each row as a dict.
    [...]
    """
    return [dict(zip(colnames, row)) for row in rows]

Based on our assumptions about the relative performance of dictionaries and named tuple creation, we can assume that dict_factory will be more performant than named_tuple_factory. But by how much? In order to get a ballpark, I’ve constructed a small benchmark that repeatedly queries a random subset of data from Cassandra, using different row factories. The source code can be found on GitHub if you wish to test for yourself. Run against all of the built-in row factories, here are my local results (some output removed for clarity):

(venv) ross@mjolnir:/h/r/P/cass_speedtest$ python --version
Python 2.7.13
(venv) ross@mjolnir:/h/r/P/cass_speedtest$ python main.py
Loaded 10000 rows of test data.
Warming cassandra up a bit first...done
--------------------------------------------------------------------------------
Beginning test for row factory <cyfunction tuple_factory at 0x7ff9b6e44a10>
Benchmark complete.
Runtime avg: 0.884321 seconds (stddev: 0.000252)
QPS avg: 1131.533536 seconds (stddev: 405.529567)
--------------------------------------------------------------------------------
Beginning test for row factory <cyfunction named_tuple_factory at 0x7ff9b6e44ad0>
Benchmark complete.
Runtime avg: 1.480597 seconds (stddev: 0.000065)
QPS avg: 675.442898 seconds (stddev: 13.412463)
--------------------------------------------------------------------------------
Beginning test for row factory <cyfunction dict_factory at 0x7ff9b6e44b90>
Benchmark complete.
Runtime avg: 0.876114 seconds (stddev: 0.000070)
QPS avg: 1141.611256 seconds (stddev: 118.118469)
--------------------------------------------------------------------------------
Beginning test for row factory <cyfunction ordered_dict_factory at 0x7ff9b6e44c50>
Benchmark complete.
Runtime avg: 0.945361 seconds (stddev: 0.000033)
QPS avg: 1057.873886 seconds (stddev: 40.724691)

Even before we table that up, we can see the named tuple clearly lags behind the others by a significant margin:

Row Factory Run Time (Seconds) Queries Per Second
Tuple 0.884321 1131.533536
Named Tuple 1.480597 675.442898
Dict 0.876114 1141.611256
Ordered Dict 0.945361 1057.873886

A real-world application will likely see less significant (as there is presumably other business logic vying for CPU time), but likely appreciable gains from switching from namedtuple to dict rows. To change it in your application is simple - just set the row_factory property on your Cassandra session instance, as detailed in the Cassandra docs here.

Clocking In

Companion code for this post available on Github

Some time ago now, I built my first moderately complex electronics project, an ethernet connected Nixie clock that pulled the time over NTP (all the code and schematics for which are on Github for those interested). Sadly, the one that had been running continuously since then croaked in a non-obvious enough way that I couldn’t repair it. Rather than re-fabricate the old design, I took this as an opportunity to revise the design and make some modifications.

The original two-part design, witn IN-16 tubes

While the previous version did have NTP, it was over ethernet (the ENC28J60 is considerably cheaper than any wifi chipset, especially at the time) which limited the locations that the clock could be plugged in. Version one was also a four digit display, which was nice and compact but doesn’t have the engagement of a noticeably ticking seconds place. Since adding a seconds place would make the design quite wide, I opted for more of a desk-placard form factor than the relatively cubelike design of v1. For style, the front would be just the display. Since the IN-16 tubes used for the first version are uprights, for this design I switched over to IN-12 tubes, which have their display in the same plane as the PCB they are mounted to.

So the first step was to make the smallest possible board that can hold the display tubes themselves, and use this as a reference for all the other boards:

This gives us a roughly 6.25” x 2” area to be getting on with, so next is to make sure that the brains of the operation can all fit into that space. The main things we need to cram in are:

The MCU: This will need to wrangle the WiFi chipset to fetch NTP offsets from the net, as well as marshal that data in such a way as to drive the display logic. Since there’s not too much processing going on here, we don’t need a hugely beefy chip. Here I’ll be using the Atmel ATMEGA238P, since it’s more than capable for the task and allows me to take advantage of the Arduino software ecosystem. This MCU doesn’t require much by way of supporting components - just a crystal.

Wifi: I know that the ESP8266 is the darling these days, but I could never get one to program correctly and have a (perhaps irrational) preference for SPI over UART. To that end, my go-to wifi chip is the Atmel ATWINC1500, which is somewhat more expensive but has proved easier to work with. It does run at 3.3V though, unlike the ATMEGA which will be running off 5V.

Boost Converter: In order to drive the Nixie tubes, we’ll need around a 170V potential on the anode. For step-up designs like this I usually use the MAX1771 controller, which tends to work well for high voltage boost circuits. I ran intp an issue with the previous design where there was a barely audible ringing, which I think was due to small traces on the boost circuit. This time I resolved to use much larger fills for the high current path on the boost circuit to try and eliminate that issue.

At the end of the day, the control board is split into three broad nets - the 12V input from the barrel jack on the back feeds the boost circuit on the right of the PCB, as well as feeding the 5V logic through an ordinary linear regulator. Since the current draw on the MCU isn’t significant, the regulator doesn’t seem to get too hot. The 5V logic is then linked to the WiFi module through some level shifters, and the power is fed through a second regulator.

Final layout of the control board, with room to spare

The middle board is fairly boring - in order to drive the nixies, the active number needs the cathode pulled low to allow current to flow. In order to handle any possible high potential on the cathode lines, each one is connected to ground through a high voltage MPSA42 transistor, with a collector-emitter voltage of 300V (more than enough for the 170V that it could be required to handle in this circuit). Since the MCU can’t drive all 45 cathode pins itself, I’ll be using ordinary LS595 shift registers to translate serial data from the MCU into parallel output to all the cathode driving transistors. If we wanted a design that could possibly be used for other numeric display functions we’d wire up all of the digits on the nixie tubes, but since that would add significantly more connections (and so more 595 chips (and so more complexity)) I’ve opted not to here.

The shift register mux board

Now that all of the hardware is together, the only thing left is to program the thing. The full code is available here, if you want to have a read through the entire thing, but the gist is thus:

  • On boot, and every hour thereafter, ask the NTP server for the current time
  • Transform the time to local time, and store the difference between the current local time and the current internal monotonic clock.
  • Every 50 milliseconds, calculate the current local time based on the millisecond offset from the NTP server and convert it into a series of bytes to be written to the shift register

The most interesting part to write was the NTP code - I’ve never gone in depth on the NTP spec before, and like many RFCs it’s an interesting read. Version 4, which we’ll use here, is defined in RFC #5905. Under section 7.3, in figure 8, we can see a nice diagram of the packet header format. This is all we need to be able to get the time from the net, as demonstrated here:

/**
 * Issue an NTP sync request to our timeserver.
 * This involves building up a packet by hand - reference to RFC5905 is key
 * https://tools.ietf.org/html/rfc5905#section-7.3
 */
unsigned long sendNTPpacket(IPAddress& address) {
    // Zero out our packet buffer
    memset(packetBuffer, 0, NTP_PACKET_SIZE);

    // Our first byte contains the Leap Indicator, Version Number and Mode.
    packetBuffer[0] = (
        // Leap indicator is set to 3 (clock not sync'd)
        0x03 << 6 |
        // Version number is 4, the current version of NTP
        0x04 << 3 |
        // Our mode is 3, client
        0x03 << 0
    );

    // Next is stratum, in this case unspecified
    packetBuffer[1] = 0x00;

    // Polling interval - the maximum interval between successive messages.
    // This is a funky variable; it's in log2 seconds.
    // 6 is the recommended lower bound, so we'll use that
    packetBuffer[2] = 0x06;

    // The precision of our own clock.
    // This is also in log2 seconds (signed). So to specify microseconds, that's
    // 1 / 1,000,000 seconds. log_2(1000000) is 20 (well, 19,9), so for one
    // onemillionth of a second we want -20.
    packetBuffer[3] = 0xEC;

    // 8 bytes intentionally left blank for the root delay and root dispersion.
    // These count time lags for the reference clock, and we don't really care.

    // Next 32 bits are the Reference ID. I can't quite determine what exactly
    // the spec wants here, but 'INIT' is a valid kiss code for not having
    // synchronized yet so we'll go with that
    packetBuffer[12]  = 'I';
    packetBuffer[13]  = 'N';
    packetBuffer[14]  = 'I';
    packetBuffer[15]  = 'T';

    // There are more fields in an NTP packet, but we don't care about them here
    // Now that the packet is all set, we can send it off to the timeserver
    // NTP runs on port 123
    Udp.beginPacket(address, 123);
    Udp.write(packetBuffer, NTP_PACKET_SIZE);
    Udp.endPacket();
}

Once we’ve sent off that request, we hope that we’ll get a response back, and can deal with it pretty easily:

/**
 * Parse the response datagram from the NTP timeserver
 */
void parseNtpResponse() {
    // Read the response packet into the buffer
    Udp.read(packetBuffer, NTP_PACKET_SIZE);

    // The info we care about, the NTP timestamp, begins at byte 40 in the
    // packet and is 4 bytes long. Grab those bytes and convert them to an
    // unsigned long.
    unsigned long highWord = word(packetBuffer[40], packetBuffer[41]);
    unsigned long lowWord = word(packetBuffer[42], packetBuffer[43]);

    // NTP counts seconds since 1900, not seconds since 1970.
    unsigned long secsSince1900 = highWord << 16 | lowWord;

    // So to convert to UNIX epoch, just need to subtract 70 years in seconds
    const unsigned long seventyYears = 2208988800UL;
    unsigned long epoch = secsSince1900 - seventyYears;

    // Now, convert that epoch to our local timezone
    unsigned long localTime =  usEastern.toLocal(epoch);

    // And update our millisecond offset
    updateOffset(localTime);
}

Finally, we have our update code. It’s pretty basic - switching on each of the digits allows us to easily control each bit output, but seems verbose. It may be possible to get a better density using some refined bit twiddling, but the explicit form is easy to modify, and handles the fact that our outputs are somewhat willy-nilly.

/**
 * Takes the hours, minutes and seconds and maps them into bits on the shift
 * registers. Elegant? Perhaps not. Functional? Yes.
 */
void write595Time(uint8_t hours, uint8_t minutes, uint8_t seconds) {

    // Create an output buffer
    uint8_t out[] = { 0, 0, 0, 0, 0, 0 };

    // Low 3 bites on the first shift register control the hours tens
    switch (hours / 10) {
        case 0: out[0] |= (1 << 0); break;
        case 1: out[0] |= (1 << 1); break;
        case 2: out[0] |= (1 << 2); break;
    }

    // The upper bits of the first, and lower of the second, registers control
    // hours ones
    switch (hours % 10) {
        [...]
    }

    [...]

    // Now shift out the data to the registers
    digitalWrite(N_RCK, LOW);
    for (int i = 5; i >= 0; i--) {
        shiftOut(N_SER, N_SCK, MSBFIRST, out[i]);
    }
    digitalWrite(N_RCK, HIGH);
}

The easily editable switch tables allowed for some last minute tweaking of values, after I realized that some of my between-board tracings had been pin-shifted slightly

Testing the output mapping

With the code loaded, I once again have a functioning timepiece on my desk. At some point I’ll fabricate some sort of housing, but until then the bare board aesthetic will have to do.

The assembled version two

Bonus GIF

All of the EAGLE design files for the clock, as well as the code for the atmega, are available in this Github repo.

Parsing OFX with Erlang and leex

Companion code for this post available on Github

As part of a push to make keeping track of my finances easier without surrendering banking credentials to popular money-management tools, I have been working on a project that allows me to track my incomes and expenditures in a database, with a simple companion app for adding transactions and running visualizations of the data. But one of the major hurdles to keeping track of my spending is the fact that I had to manually enter each and every transaction, not just categorize it. To solve this problem, we can tap into the data used by apps like Quicken to manage your banking information - OFX. Here, we’ll go over what OFX is, how to get data from your bank in an OFX format and how to lex and parse that data to make it useful.

If you want to skip ahead, the full library is on Github.

History of OFX

OFX is a product of collaboration between Microsoft and Intuit (of Quicken) in the late 90s. The initial versions were built on top of SGML, which is a precursor of XML. As used in OFX, there are no closing tags for leaf values in SGML - a fact that makes modern XML parsers unsuitable for translating it into a document that we can then work with in code. Banks that offer an ‘Online banking with Quicken’ feature will usually do so through an endpoint that speaks OFX, and access to this API is usually accessible for $10 per month or so, depending on the bank.

Fetching OFX from an institution

As a prerequisite for parsing OFX data, we need to acquire some. OFX data transfer takes place over a single API endpoint, by sending an OFX document with one or more stanzas in it and receiving another document with a response For an example OFX request, here’s a request to fetch account information from my Chase account (certain information redacted, of course):

<OFX>
  <SIGNONMSGSRQV1>
    <SONRQ>
      <DTCLIENT>20170326192550
      <USERID>my_bank_username
      <USERPASS>my_bank_password
      <LANGUAGE>ENG
      <FI>
        <ORG>B1
        <FID>10898
      </FI>
      <APPID>QWIN
      <APPVER>2200
      <CLIENTUID>9a7b311e-b6d0-4836-ac07-22d144cfc836
    </SONRQ>
  </SIGNONMSGSRQV1>
  <SIGNUPMSGSRQV1>
    <ACCTINFOTRNRQ>
      <TRNUID>3a23ee72-dcea-4c42-a509-05f6700545a9
      <CLTCOOKIE>1
      <ACCTINFORQ>
        <DTACCTUP>19691231
      </ACCTINFORQ>
    </ACCTINFOTRNRQ>
  </SIGNUPMSGSRQV1>
</OFX>

As you can see, there are two stanzas in this request - the first, the SIGNONMSGSRQV1, is common to all requests you will make to the server. It identifies who you are (USERID, USERPASS), which bank you want to talk to; Chase in this case, represented by it’s ORG and FID identifiers. Values for your bank can be found online, GNUCash has a good list here. This section also identifies the application we are using to talk to the bank. In this case I am ‘Quicken’ version 2200. Some banks will refuse to talk to you unless you tell them that yes, you are definitely Quicken.

The second stanza is our actual request for information - we are making an account information transaction request (ACCTINFOTRNRQ), with a unique transaction UUID, with an account information section stating that we last checked for account info the day before the UNIX epoch, and so should be assumed to know nothing.

To send this request to our bank, we will attach our generic OFX header and then send our request off, being sure to state the content type and Connection: close, which seems to be necessary for some banks.

ofx_request(Url, Body) ->
    Headers = [
        {"user-agent", "InetClntApp/3.0"},
        {"connection", "close"},
        {"accept", "*/*, application/x-ofx"}
    ],
    ContentType = "application/x-ofx",
    {ok, {_, _, Resp}} = httpc:request(
        post,
        {Url, Headers, ContentType, lists:flatten(Body)},
        [],
        [{body_format, binary}]
    ),
    {ok, Resp}.

Lexing OFX

Hopefully, the bank will respond to our information request with a nice blob of SGML (indented for readability):

OFXHEADER:100
DATA:OFXSGML
VERSION:103
SECURITY:NONE
ENCODING:USASCII
CHARSET:1252
COMPRESSION:NONE
OLDFILEUID:NONE
NEWFILEUID:6841bcd0-bc42-11e6-9ef8-f30ad20f18fe

<OFX>
  <SIGNONMSGSRSV1>
    <SONRS>
      <STATUS>
        <CODE>0
        <SEVERITY>INFO
        <MESSAGE>SUCCESS
      </STATUS>
      <DTSERVER>20170319173139.086[-4:EDT]
      <LANGUAGE>ENG
      <FI>
        <ORG>B1
        <FID>10898
      </FI>
    </SONRS>
  </SIGNONMSGSRSV1>
  <SIGNUPMSGSRSV1>
    <ACCTINFOTRNRS>
      <TRNUID>20161207002420.000
      <STATUS>
        <CODE>0
        <SEVERITY>INFO
      </STATUS>
      <CLTCOOKIE>1
      <ACCTINFORS>
        <DTACCTUP>20170319173139.379[-4:EDT]
        <ACCTINFO>
          <DESC>CREDIT CARD
          <CCACCTINFO>
            <CCACCTFROM>
              <ACCTID>my_credit_card_number
            </CCACCTFROM>
            <SUPTXDL>Y
            <XFERSRC>N
            <XFERDEST>N
            <SVCSTATUS>ACTIVE
          </CCACCTINFO>
        </ACCTINFO>
      </ACCTINFORS>
    </ACCTINFOTRNRS>
  </SIGNUPMSGSRSV1>
</OFX>

Excellent. Now let’s define a representation for this data that we can work with more easily in Erlang. There are two distinct node types - they may either have a value, and no close tag, or some children and a close tag. Let’s represent them as two different records, as such:

-record(ofx_node, {
          name :: nonempty_string(),
          children :: [#ofx_leaf{}]
         }).
-record(ofx_leaf, {
          name :: nonempty_string(),
          value :: nonempty_string()
         }).

Enter Leex

Leex is a lexer, a tool for taking our blob of OFX text and turning it into a list of meaningful tokens. In order to do so, we need to specify a couple of rules first. Leex input files have three sections: Definitions, Rules and Erlang code.

The definitions section is a context-free grammar for defining patterns that can then be used for building up rules. For example, U can be defined as [A-Z], or a shorthand for all uppercase letters. L can then be all lowercase ([a-z]) and the two can then be combined to refer to all letters as ALPHA = ({U}|{L}).

Once we have a set of definitions for character groups, we can then write the rules section. This is where the requisites for tokenisation are defined - for example, we want to emit a token every time we see an opening tag, and want to include in that token the name of the tag. On the left hand side of the tag, we write the match expression - in this case, <({TAGCHAR})+>, for one or more characters in the set of allowable tag names bounded by angle brackets. On the right hand side, we then specify what the lexer should do when it encounters something that matches this pattern. In this case we want to emit a token, so we’ll write {token, {opentag, lists:sublist(TokenChars, 2, TokenLen-2)}}. This means it will emit a token that is a 2-tuple of the atom opentag and a substring of the matched string that removes the enclosing ‘<>‘. So, for example, if the lexer encountered the tag <OFX> it would then emit the token {opentag, "OFX"}.

The final section allows for the definition of generic Erlang methods that can then be used in the right hand side of rules. For example, we could take our substringing code from the match rule we just defined and place it in a convenience method in the code section.

Once we are finished writing our rules, we end up with a leex file that looks like this:

Definitions.

U = [A-Z]
L = [a-z]
D = [0-9]
SAFESYM = [_\-.:+]
SYM = [_\-.:/*+\[\]']
WHITESPACE = [\s\t\n\r]
ALPHA = ({U}|{L})
ALNUM = ({ALPHA}|{D})
ALSYM = ({ALNUM}|{SYM}|{WHITESPACE})
TAGCHAR = ({ALNUM}|{SAFESYM})

Rules.

<({TAGCHAR})+>    : {token, {opentag, lists:sublist(TokenChars, 2, TokenLen-2)}}.
</({TAGCHAR})+>   : {token, {closetag, lists:sublist(TokenChars, 3, TokenLen-3)}}.
{WHITESPACE}+ : skip_token.
{ALSYM}+        : {token, {string, string:strip(TokenChars)}}.

Erlang code.

We emit three kinds of tokens - opentag, when a tag is opened, closetag, when a tag is closed and string when we encounter a string literal (tag value). With just these three types, we can then build a parser that can turn this list of tokens into a document tree.

Parsing the tokens

As stated when we built our records, we only have two cases we need to deal with here - leaf nodes, which will always be [{opentag, Tag}, [{string, Value}] and parent nodes, which will be [{opentag, Tag}, ...tag_children..., [{closetag, Tag}]. This means that leaf nodes can be parsed easily by matching on the head of the tag list, and our more complex case of a parent node can be handled by a secondary method that accumulates all nodes until it encounters a specified terminal node. The implementation is as follows:

% Parses a list of tags into an OFX data tree.
% Will error out in there are tokens that cannot be parsed as part of the tree.
parse(Tags) ->
    {Tree, Unparsed} = parse_node(Tags),
    [] = Unparsed,
    Tree.

% Parse a single OFX node from tokens.
% Returns the node, and any unused tokens.
parse_node([{opentag, Tag}|[{string, Value}|Tags]]) ->
    {#ofx_leaf{name=Tag, value=Value}, Tags};
parse_node([{opentag, Tag}|Tags]) ->
    {Children, Tags2} = parse_node_list(Tag, Tags),
    {#ofx_node{name=Tag,children=Children}, Tags2}.

% Convenience method for parse_node_list/3.
parse_node_list(EndTag, Tags) ->
    parse_node_list(EndTag, Tags, []).

% Parses a list of child nodes. Stops parsing when a {closetag, } tuple is found
% with a name matching the EndTag.
parse_node_list(_EndTag, [], Nodes) ->
    Nodes;
parse_node_list(EndTag, [Tag|Tags], Nodes) ->
    {Node, Tags2} = parse_node([Tag|Tags]),
    case hd(Tags2) of
        {closetag, EndTag} ->
            {[Node|Nodes], tl(Tags2)};
        _ ->
            parse_node_list(EndTag, Tags2, [Node|Nodes])
    end.

This isn’t robust against malformed SGML, and will need as many stack frames as the tree is deep, but it gets the job done. If we now take the our list of lexed tags from the previous step, we can run them through the parser and we should get a workable tree of records like so:

1> {ok, Client} = ofx:new_client(
        "username", "password", "B1", "10898", "https://ofx.chase.com").
{ok,<0.415.0>}
2> ofx_client:list_accounts(Client).
{ofx_node,"OFX",
    [{ofx_node,"SIGNUPMSGSRSV1",
        [{ofx_node,"ACCTINFOTRNRS",
            [{ofx_node,"ACCTINFORS",
                [{ofx_node,"ACCTINFO",
                    [{ofx_node,"CCACCTINFO",
                        [{ofx_leaf,"SVCSTATUS","ACTIVE"},
                         {ofx_leaf,"XFERDEST","N"},
                         {ofx_leaf,"XFERSRC",[...]},
                         {ofx_leaf,[...],...},
                         {ofx_node,...}]},
                         {ofx_leaf,"DESC","CREDIT CARD"}]},
                 {ofx_leaf,"DTACCTUP","20170328125651.415[-4:EDT]"}]},
             {ofx_leaf,"CLTCOOKIE","1"},
             {ofx_node,"STATUS",
                 [{ofx_leaf,"SEVERITY","INFO"},{ofx_leaf,"CODE","0"}]},
             {ofx_leaf,"TRNUID",
                 "12a14b40-2154-43ab-b986-67200cab0ec3"}]}]},
            [...]

Using the methods exposed in the ofx_tree module, we can then easily parse out the information we might care about:

OfxRoot = ofx_client:list_accounts(Client),
SIGNUPMSGSRSV1 = ofx_tree:get_child("SIGNUPMSGSRSV1", OfxRoot),
ACCTINFOTRNRS = ofx_tree:get_child("ACCTINFOTRNRS", SIGNUPMSGSRSV1),
ACCTINFORS = ofx_tree:get_child("ACCTINFORS", ACCTINFOTRNRS),
Accounts = ofx_tree:get_children("ACCTINFO", ACCTINFORS)],
lists:map(
    fun(Acct) -> io:format("Got account: ~p~n", [Acct]) end,
    Accounts
).

Custom FindBugs detectors in Android

Companion code for this post available on Github

Modern compilers can detect all sorts of things, from the humble type error to mismatched format strings, but in some cases it’s just not feasible or the the use case is not widespread enough for an error pattern to be detected at compile time. Luckily, in the Java/Android ecosystem, there are two tools that exist to take compile(ish)-time checking to the next level - Android Lint, a tool supplied as part of the Android SDK for catching Android-specific errors (resource size inconsistencies, missing translatins, etc) and the FindBugs Project, a well established project from the university of Maryland, and what we will be digging into here. We’ll take a quick look at what it is, go over a small refresher on Java’s try-with-resources pattern and then dive into writing our own detector that will ensure all Cursor operations are wrapped in one of these try-with-resources blocks.

Some background

Findbugs is a static analysis tool that operates on compiled java bytecode to detect code that is deemed acceptable by the compiler, but not necessarily what the programmer mind. Examples would be detecting a null return from a method that should return Boolean, inconsistent synchronization of variables and unnecessary math operations.

At it’s core, FindBugs is powered by the Apache BCEL, a library for the inspection and manipulation of compiled Java bytecode. On top of this, FindBugs adds some extra parsing for easy access of operands, a visitor pattern for iterating over bytecode, and a mechanism for accumulating and displaying bugs. If you don’t already use FindBugs as a part of your android testing and deployment pipeline, even just the core detectors are well worth adding to a project. An example of how to add FindBugs as a task to an existing Android project (with gradle) can be seen here.

Customization

But what happens if you have a code case that is too specific to your application (e.g. invoves Android classes) or to your internal practises (style guides, design patterns, etc)? Luckily, findbugs makes it fairly simple to add additional detectors, and it’s even possible to include a set of custom detectors as part of an existing android project to be run alongside the built in detectors whenever the project is tested by hand or CI server.

A complete, buildable Android project with the detectors built in this post is online here if you wish to use it as an implementation reference or just follow along.

The use case

In Java 7, the try-with-resources pattern was added. This allowed for the declaration of resources as part of the try() header that would be automatically closed when the block exited, either normally or with an exception. This is very convenient when dealing with files, sockets, database cursors or other objects that must be closed when you are finished with them. Naturally, you’ll want to implement this everywhere you use Cursors in Android, because it’s a nice simple way to avoid leaking them:

try (Cursor c = db.query(...)) {
    c.moveToFirst();
    while (!c.isAfterLast()) {
        return new Foo(
                c.getString(c.getColumnIndex(...))
                ...
        );
    }
}

Alas, if you really want to target as many Android users as possible, you inevitably have to make sacrifices for compatibility. One of them is that try-with-resources requires API 19, which cuts off the small but not wholly insignificant ~10% (at time of writing) of users on API 18 and older. There is a workaround, however, which is to use an explicit finally block to close your cursors:

Cursor c = db.query(...);
try {
    c.moveToFirst();
    while (!c.isAfterLast()) {
        return new Foo(
                c.getString(c.getColumnIndex(...))
                ...
        );
    }
} finally {
    c.close();
}

This achieves the same result, albeit with slightly more lines. But good enough. Now the issue we want to address is that every now and then, someone is going to forget to wrap their cursor operations, and leak one. javac won’t catch it, android lint will only catch it sometimes, and neither really care about using try-with-resources. So let’s implement our own detector for findbugs that does! As acceptance criteria, let’s say that our detector needs to be able to

  • Detect cursor operations (other than Cursor#close) that are called outside of a try {} block
  • Detect try/catch blocks in methods that open cursors that do not close the cursor as part of their cleanup routine.

Detective work

First things first, we need to create a new detector. In order to make it easier to bundle my detectors with my app, I’ve added them to a module in the same project tree as can be seen here. The key ingredients are three files:

  • findbugs.xml: Your main plugin definition. This declares your plugin’s package, which classes within it are detectors, and which bugs they can be expected to surface.
  • messages.xml: A collection of strings that describe your plugin, detectors and each bug instance that you can raise.
  • A detector! This should be a class that extends either the OpcodeStackDetector or BytecodeScanningDetector.

For this detector, we’ll extend the more basic BytecodeScanningDetector since we don’t really need to fuss with the stack at all. Since at this point we don’t really know that much about how we’ll actually write this detector, the best first thing to do is take a look at how the bytecode for the cases we care about is structured. So let’s code up a quick ‘detector’ that just prints out the code for methods that involve cursors:

public class TestDetector BytecodeScanningDetector {

    private static final String ANDROID_CURSOR = "Landroid/database/Cursor;";

    @Override
    public void visitMethod(Method method) {
        super.visitMethod(method);

        // Fetch the table of local variables for this new method
        LocalVariableTable localVariableTable = method.getLocalVariableTable();

        // If any of the local variables in this method are of the type Cursor,
        // then let's print a dump of the method's bytecode.
        if (variableTableContainsType(localVariableTable, ANDROID_CURSOR)) {
            System.out.println(method.getCode());
        }
    }

    /**
     * Simple method that iterates over a variable table and returns whether or
     * not any of the entries have the specified type signature.
     * @param table Local variable table
     * @param type Java class type we're searching for
     * @return True if any of the local variables are of class type
     */
    private static boolean variableTableContainsType(LocalVariableTable table, String type) {
        for (LocalVariable variable : table.getLocalVariableTable()) {
            if (type.equals(variable.getSignature())) {
                return true;
            }
        }
        return false;
    }

}

And in order to get a more readable output, let’s create a dummy test method that assumes it has and then closes a cursor, with some string literals to help us pinpoint operations:

public void tryFinallyExample() {
    Cursor c = null;
    System.out.println("Before");
    try {
        System.out.println("Try");
    } finally {
        System.out.println("Finally");
        c.close();
    }
    System.out.println("After");
}

Now let’s assemble both our project and our ‘detector’ and then run the detector (outside of gradle, since gradle will swallow our System.out debugging lines)

ross@mjolnir:/h/r/P/A/ExampleFindbugs$ ./gradlew :app:assembleDebug :findbugs:assemble
[...]
ross@mjolnir:/h/r/P/A/ExampleFindbugs$ findbugs \
    -pluginList ./findbugs/build/libs/findbugs.jar \  # Our compiled 'detector'
    -effort:max \
    ./app/build/intermediates/classes/debug/com/schlaikjer/findbugs/database/LeakyDatabase.class

Then as output we find the following:

Code(max_stack = 2, max_locals = 3, code_length = 61)
0:    aconst_null
1:    astore_1
2:    getstatic         java.lang.System.out:Ljava/io/PrintStream; (36)
5:    ldc               "Before" (37)
7:    invokevirtual     java.io.PrintStream.println:(Ljava/lang/String;)V (38)
10:   getstatic         java.lang.System.out:Ljava/io/PrintStream; (36)
13:   ldc               "Try" (39)
15:   invokevirtual     java.io.PrintStream.println:(Ljava/lang/String;)V (38)
18:   getstatic         java.lang.System.out:Ljava/io/PrintStream; (36)
21:   ldc               "Finally" (40)
23:   invokevirtual     java.io.PrintStream.println:(Ljava/lang/String;)V (38)
26:   aload_1
27:   invokeinterface   android.database.Cursor.close:()V (35)  1   0
32:   goto              #52
35:   astore_2
36:   getstatic         java.lang.System.out:Ljava/io/PrintStream; (36)
39:   ldc               "Finally" (40)
41:   invokevirtual     java.io.PrintStream.println:(Ljava/lang/String;)V (38)
44:   aload_1
45:   invokeinterface   android.database.Cursor.close:()V (35)  1   0
50:   aload_2
51:   athrow
52:   getstatic         java.lang.System.out:Ljava/io/PrintStream; (36)
55:   ldc               "After" (41)
57:   invokevirtual     java.io.PrintStream.println:(Ljava/lang/String;)V (38)
60:   return

Exception handler(s) =
From    To  Handler Type
10  18  35  <Any exception>(0)

Attribute(s) =
LineNumber(0, 74), LineNumber(2, 75), LineNumber(10, 77), LineNumber(18, 79),
LineNumber(26, 80), LineNumber(32, 81), LineNumber(35, 79), LineNumber(44, 80),
LineNumber(52, 82), LineNumber(60, 83)
LocalVariable(start_pc = 0, length = 61, index = 0:com.schlaikjer.findbugs.database.LeakyDatabase this)
LocalVariable(start_pc = 2, length = 59, index = 1:android.database.Cursor c)
StackMap(
    (
        FULL, offset delta=35,
         locals={
            (type=Object, class=com.schlaikjer.findbugs.database.LeakyDatabase),
            (type=Object, class=android.database.Cursor)
        },
        stack items={
            (type=Object, class=java.lang.Throwable)
        }
    ),
    (
        SAME,
        offset delta=16
    )
)

Looking at the code dump, we learn something interesting about how the try ... finally block has been implemented at the bytecode level. The body of our finally appears in two places - once at the end of the contents of the try block at opcode 18, and once at opcode 36. So rather than having a single finally label and jumping to it both when an exception is thrown and when one isn’t, the two paths exist separately. If the try block exists normally, then control flows through the finally copy in codes 18-32, after which it jumps to the ‘After’ code we added at code 52 and exits.

If, however, an exception is thrown, then the source is checked againts the exception table for the method. We have one entry, for any exception type, that covers codes 10-18 and has a handler located at code 58. Codes 10-18 (not inclusive) are our try block, so this adds up. Code 35 is just after the jump to ‘After’ that would have ended the method in the no-exception case, and is the start of our exception handling routine. In this routine, we can see that the first thing we do is astore_2: take the topmost value from the operand stack and store in local variable 2. If we look at the StackMap dump at the end, we can see that there’s an entry for this section - one stack item, of type Throwable. So far, so good. We then call the same finally block code that was called in the other branch, but afterwards we then perform the re-throwing of the exception by loading it back onto the stack (aload_2, code 50) and throwing it (athrow, 51)

Bytecode wrangling

OK, now that we have an idea of what our try structure looks like as bytecode and the data we have available to us at detector runtime, let’s take a look at how we can meet the criteria we set out earlier. First, let’s tackle the easier case where a cursor method is called while we’re outside of a try block.

So firstly, we want to be able to know if a given instruction is a method call. Conveniently, our BytecodeScanningDetector extends the DismantleBytecode class, which at each opcode decodes the instruction as well as any arguments and makes them readily accessible. This means that in order to check if we’re at a method call on a cursor, we need only do the following for each opcode we see:

private static final String ANDROID_CURSOR_CLASS_CONST_OPERAND = "android/database/Cursor";

private void checkIfCursorMethodsCalledOutsideTry(int seen) {
    // Not a method call, return
    if (!isMethodCall()) {
        return;
    }

    // If the method is not being called on a cursor, return
    if (!ANDROID_CURSOR_CLASS_CONST_OPERAND.equals(getClassConstantOperand())) {
        return;
    }

    // Figure out try block later
}

Now that we can know if we’re at a call to an instance method of a cursor, we need to be able to check if the call is happening inside of a try. Luckily, we can use the info encoded in the ‘Exception handlers’ section of the code dump above to help us out. We can see that we have one handler registered, which covers codes 10-18 and has a handler method at code 35. Since it looks like bytecode indexes 10-18 are the body of the try block, we can easily use the offsets from the exception table to find out if a given program counter index is inside a try block or not! So let’s translate that logic to code:

private static boolean isInTryBlock(Method method, int pc) {
    CodeException[] exceptionTable = method.getCode().getExceptionTable();
    for (CodeException exception : exceptionTable) {
        if (exception.getStartPC() <= pc && pc < exception.getEndPC()) {
            return true;
        }
    }

    return false;
}

and update our detector method from before:

private static final String ANDROID_CURSOR_CLASS_CONST_OPERAND = "android/database/Cursor";
private static final String CLOSE = "close";

private void checkIfCursorMethodsCalledOutsideTry(int seen) {
    // Not a method call, return
    if (!isMethodCall()) {
        return;
    }

    // If the method is not being called on a cursor, return
    if (!ANDROID_CURSOR_CLASS_CONST_OPERAND.equals(getClassConstantOperand())) {
        return;
    }

    // If a method is called on a cursor outside a try block, and that method is not
    // close, that's an error!
    if (!CLOSE.equals(getNameConstantOperand()) && !isInTryBlock(getMethod(), getPC())) {
        System.out.println("Cursor." + getNameConstantOperand() + " called outside of try block!");
    }
}

Excellent! Now all we need to do is pass the bug info up into findbugs so it can be processed and displayed with other detector output. This can be done with a quick snippet:

bugAccumulator.accumulateBug(
        new BugInstance(
                this,
                "DB_CURSOR_METHODS_CALLED_OUTSIDE_TRY",
                HIGH_PRIORITY
        ).addClassAndMethod(this),
        this
);

Word of warning: the detector is highly stateful! When sawOpcode is called, all of the isMethodCall() / etc. checks, and the line numbers recorded by the above accumulateBug call refer to the current opcode. If you, e.g., have a bug case where you mark the start of a possible bug at one opcode and confirm it at a later point, accumulating the bug at the second location will report it as though that is where it occurred!

That detector wasn’t too bad at all. Let’s move on to our second goal - detecting try {} blocks in methods with cursors that don’t close those cursors. In order to check whether we are operating inside a finally block, we’re going to use a similar trick for checking if we’re in a try block. This time we can’t just use the provided numbers though, since the table only has the entry point for the handler. That’s OK though, since we know that the handler will have to end in either a goto, areturn or athrow.

/**
 * Finally blocks are defined as the PCs between the handler PC and the next call to athrow,
 * goto or return.
 *
 * @param method
 * @param pc
 * @return
 */
private int getFinallyBlockIndex(Method method, int pc) {
    CodeException[] exceptionTable = method.getCode().getExceptionTable();
    int blockIndex = 0;
    for (CodeException exception : exceptionTable) {
        if (exception.getHandlerPC() <= pc) {
            int pc2 = pc;
            int codeByte;
            while ((codeByte = getCodeByte(pc2)) != ATHROW && codeByte != ARETURN && codeByte != GOTO && pc2 < getMaxPC()) {
                pc2++;
            }
            if ((codeByte == ATHROW || codeByte == ARETURN || codeByte == GOTO) && pc < pc2) {
                return blockIndex;
            }
        }
        blockIndex++;
    }

    return -1;
}

Not the best runtime complexity to be calling this all the time, but good enough for now. This detection method is going to be a bit more stateful - for each opcode, we’re going to check if we’re in a finally block, and if so:

  • Add an entry to our local log of the start of the block
  • Check if the current opcode in the block is a method call
    • If it is, check if it’s on a cursor
      • If so, check if it’s a close
        • If it is, mark this finally block as good
        • If it’s not, then this block might be leaky!

So let’s port that logic over:

private final Map<String, FinallyInfo> suspectFinallys = new HashMap<>();

private static class FinallyInfo {
    boolean callsCursorClose;
    BugInstance bugInstance;

    public FinallyInfo(BugInstance instance) {
        this.callsCursorClose = false;
        this.bugInstance = instance;
    }
}

private void checkExceptionHandlersCloseCursors(int seen) {
    // Check if we're in a finally block
    int blockIndex = getFinallyBlockIndex(getMethod(), getPC());
    if (blockIndex < 0) {
        return;
    }

    // We create a bug instance immediately on entering *all* finally blocks;
    // this is just so that we get the line numbers in the right place.
    // If the finally block does close the cursor, we just toss the buginstance
    String finallyReference = getMethodName() + blockIndex;
    if (!suspectFinallys.containsKey(finallyReference)) {
        suspectFinallys.put(finallyReference, new FinallyInfo(new BugInstance(
                this,
                "DB_CURSOR_NOT_FINALLY_CLOSED",
                HIGH_PRIORITY
        ).addClassAndMethod(this)));
    }

    // Not a method call, return
    if (!isMethodCall()) {
        return;
    }

    // If the method is not being called on a cursor, return
    if (!ANDROID_CURSOR_CLASS_CONST_OPERAND.equals(getClassConstantOperand())) {
        return;
    }

    // If the method isn't close, return
    if (!CLOSE.equals(getNameConstantOperand())) {
        return;
    }

    // Mark this finally block as OK
    suspectFinallys.get(finallyReference).callsCursorClose = true;
}

And that’s pretty much it! There’s a little bit of extra accounting that is necessary to actually finish up and report all the bugs accumulated in that fashion, which can be found here for those curious. Now that we have these detectors set up, the next time we run this project through our CI system (in this case Jenkins), we should see it error out with our expected bug instances:

Perfect! No more unclosed cursors. From the basic ideas here, it should be possible to add checks on more or less any code pattern that you want to make sure to implement or avoid in your production code.

Profiling Android apps with Flamegraphs

Companion code for this post available on Github

TL;DR:

I built a tool to turn Android trace output in to flame graphs. You can check out the source code here, or get started immediately by uploading a trace file here

If you’ve ever tried to debug a performance issue in an Android app, you’ve probably become familiar with Traceview, which reads the .trace files generated by the Debug.startMethodTracing API call and displays them in a more or less readable manner. However, for me at least, Traceview is less than ideal. The interface is rather clunky, the scrolling behaviour is questionable (zooming and scrolling down at the same time? Just what I wanted?) and it’s very difficult to interpret the call chains that are consuming the most of your time, especially if multiple threads are involved.

Traceview. Isn't obvious from the coloured bars what's going on?

One of the most useful performance visualization and analysis tools I am aware of is Brendan Gregg’s Flame Graphs, which make it easy to identify long-running sections of your code. However, I couldn’t find any existing tooling for converting the Android trace format to flat stack format expected by the flamegraph generator, so it was time to get familiar with the internals of the format.

First things first, we need to gather a trace. In my case, I have an app that takes an appreciable amount of time to load the first Activity, so I’m going to add trace sections to onCreate, onStart and onResume, following this pattern:

private static final int MEGABYTE = 1024 * 1024;

@Override
protected void onCreate(Bundle savedInstanceState) {
    super.onCreate(savedInstanceState);
    // Call the resulting trace file 'onCreate.trace', and allow a 128MiB
    // buffer for collecting trace data.
    Debug.startMethodTracing("onCreate", 128 * MEGABYTE);
    // Existing onCreate code
    [...]
    // Stop method tracing
    Debug.stopMethodTracing();
}

Once that’s added, we can start up the app normally and wait for it to finish booting. Note that like all profiliers, this trace mechanism adds overhead! Don’t make decisions based on the absolute timings as correct when dealing with these traces, but rather the differences between traces. Your app will also take noticeably longer to start while profiling is active - this is normal.

Once it has loaded, open up adb and take a look to check that your traces have been created:

ross@mjolnir:/h/ross$ adb shell ls -l /sdcard/Android/data/com.schlaikjer.cookbook/files/
total 112400
-rw-rw---- 1 u0_a122 sdcard_rw 57473027 2017-02-26 14:26 onCreate.trace
-rw-rw---- 1 u0_a122 sdcard_rw     6255 2017-02-26 14:26 onResume.trace
-rw-rw---- 1 u0_a122 sdcard_rw    60809 2017-02-26 14:26 onStart.trace

And then pull all of them up so that we can take a look at them:

ross@mjolnir:/h/ross$ for F in {Create,Start,Resume}; do adb pull /sdcard/Android/data/com.schlaikjer.cookbook/files/on$F.trace; done
4693 KB/s (57473027 bytes in 11.959s)
710 KB/s (60809 bytes in 0.083s)
79 KB/s (6255 bytes in 0.077s)

If we crack one open with our editor of choice, we can see that the files begin with three plain text sections, followed by what looks like a lot of binary data:

*version
3
data-file-overflow=false
clock=dual
elapsed-time-usec=18482984
num-method-calls=4086234
clock-call-overhead-nsec=767
vm=art
pid=17816
*threads
17816   main
17821   Jit thread pool worker thread 0
[...]
*methods
0x7b0   java.lang.BootClassLoader   getInstance ()Ljava/lang/BootClassLoader; ClassLoader.java
0x7ac   java.lang.ClassLoader   findLoadedClass (Ljava/lang/String;)Ljava/lang/Class;   ClassLoader.java
[...]
*end
SLOW^C^@ ^@&<95>
[...]

Some of these (elapsed time, number of calls, vm name &c) are pretty intuitive. For the rest of them, the best way to figure out what they are is to look at the code that generates them!

os << StringPrintf("%cversion\n", kTraceTokenChar);
os << StringPrintf("%d\n", GetTraceVersion(clock_source_));
os << StringPrintf("data-file-overflow=%s\n", overflow_ ? "true" : "false");
if (UseThreadCpuClock()) {
  if (UseWallClock()) {
    os << StringPrintf("clock=dual\n");
  } else {
    os << StringPrintf("clock=thread-cpu\n");
  }
} else {
  os << StringPrintf("clock=wall\n");
}
os << StringPrintf("elapsed-time-usec=%" PRIu64 "\n", elapsed);
if (trace_output_mode_ != TraceOutputMode::kStreaming) {
  size_t num_records = (final_offset - kTraceHeaderLength) / GetRecordSize(clock_source_);
  os << StringPrintf("num-method-calls=%zd\n", num_records);
}
os << StringPrintf("clock-call-overhead-nsec=%d\n", clock_overhead_ns_);
os << StringPrintf("vm=art\n");
os << StringPrintf("pid=%d\n", getpid());
if ((flags_ & kTraceCountAllocs) != 0) {
  os << StringPrintf("alloc-count=%d\n", Runtime::Current()->GetStat(KIND_ALLOCATED_OBJECTS));
  os << StringPrintf("alloc-size=%d\n", Runtime::Current()->GetStat(KIND_ALLOCATED_BYTES));
  os << StringPrintf("gc-count=%d\n", Runtime::Current()->GetStat(KIND_GC_INVOCATIONS));
}

Based on this, it looks like our version header consists of:

  • Version: The trace format version (3 for all devices I tested)
  • Data file overflow: The overflow_ flag seems to be set when if the amount of calls in the binary trace section overflows the buffer specified in the start trace call. If you see this set to true, you should re-run your trace with a larger buffer to ensure you aren’t missing any information.
  • Clock: Whether the trace data uses wallclock time, per-thread CPU time, or both. All tested devices reported both clock times.
  • Elapsed clock: The total trace time, in microseconds.
  • Clock call overhead: The amount of time it takes to check the time. Any measurements close to or below this number should be assumed to be below the noise floor for this trace.
  • VM: Art or Dalvik
  • Pid: The process ID of the process under trace

There are also three fields that were not present in the trace we took a look at:

  • Alloc count: Number of allocated objects
  • Alloc size: Size of all allocated objects
  • GC count: The number of collections that have occurred

With that under our belt, let’s move on to the Threads section. This one is pretty simple:

void Trace::DumpThreadList(std::ostream& os) {
  Thread* self = Thread::Current();
  for (auto it : exited_threads_) {
    os << it.first << "\t" << it.second << "\n";
  }
  Locks::thread_list_lock_->AssertNotHeld(self);
  MutexLock mu(self, *Locks::thread_list_lock_);
  Runtime::Current()->GetThreadList()->ForEach(DumpThread, &os);
}

So each entry under the thread section is just a tuple of Thread ID and a human readable thread name. The method section is similar, but has a few more fields:

std::string Trace::GetMethodLine(ArtMethod* method) {
  method = method->GetInterfaceMethodIfProxy(kRuntimePointerSize);
  return StringPrintf("%#x\t%s\t%s\t%s\t%s\n", (EncodeTraceMethod(method) << TraceActionBits),
      PrettyDescriptor(method->GetDeclaringClassDescriptor()).c_str(), method->GetName(),
      method->GetSignature().ToString().c_str(), method->GetDeclaringClassSourceFile());
}

So the method section is a list of tuples of:

  • Method ID
  • Declaring class
  • Method name
  • Method type signature
  • Declaring class’s source file.
  • Method declaration line number (only present for some runtimes)

That’s all the text sections dealt with. Now we can take a look at the binary data at the end of the file. Luckily, this section is actually described at the top of trace.h:

// File format:
//     header
//     record 0
//     record 1
//     ...
//
// Header format:
//     u4  magic ('SLOW')
//     u2  version
//     u2  offset to data
//     u8  start date/time in usec
//     u2  record size in bytes (version >= 2 only)
//     ... padding to 32 bytes
//
[...]
//
// Record format v3:
//     u2  thread ID
//     u4  method ID | method action
//     u4  time delta since start, in usec
//     u4  wall time since start, in usec (when clock == "dual" only)
//
// 32 bits of microseconds is 70 minutes.
//
// All values are stored in little-endian order.

The interesting thing to note here (and something that tripped me up for a while) is the method ID | method action section of the record format. If we take a look over in trace.cc, we can see how that’s implemented:

uint32_t Trace::EncodeTraceMethodAndAction(ArtMethod* method, TraceAction action) {
  uint32_t tmid = (EncodeTraceMethod(method) << TraceActionBits) | action;
  DCHECK_EQ(method, DecodeTraceMethod(tmid));
  return tmid;
}

Where TraceAction is defined as:

enum TraceAction {
    kTraceMethodEnter = 0x00,       // method entry
    kTraceMethodExit = 0x01,        // method exit
    kTraceUnroll = 0x02,            // method exited by exception unrolling
    // 0x03 currently unused
    kTraceMethodActionMask = 0x03,  // two bits
};

So with this data, we know that the method ID encoded in one of the trace records with the lower two bits masked off will match one of the method IDs in the *methods section of the plain text header. We can then use the lower two bits to work out whether each entry is a method entry or exit (via either return or stack unwind).

Armed with this, lets start writing a parser for these files. I chose Erlang for a learning exercise, and also because I intended to make use of it’s excellent binary matching syntax in conjunction with binary comprehensions. Since we know the magic for the binary section of the trace file (SLOW), let’s take a look at how we can easily parse out the header and the records using binary matching.

-define(TRACE_HEADER_MAGIC, "SLOW").

% Find the location of the trace header
{HeaderPos, _} = binary:match(Data, <<?TRACE_HEADER_MAGIC>>),

% Match out the entire header specification into variables
<<?TRACE_HEADER_MAGIC, VersionBin:2/binary, DataOffsetBin:2/binary,
  StartTimeBin:8/binary, RecordSizeBin:2/binary>> = binary:part(Data, {HeaderPos, 18}),

% Remember all numbers are little endian
DataOffset = binary:decode_unsigned(DataOffsetBin, little),
RecordSize = binary:decode_unsigned(RecordSizeBin, little),

% Now that we have the header start and header size, we can start parsing out
% the call records themselves. First, excerpt the section of the trace that
% contains the binary data
SectionStart = HeaderPos + DataOffset,
SectionEnd = byte_size(Data),
RecordSection = binary_part(Data, {SectionStart, SectionEnd - SectionStart}),

% Now that we have the records, we can break them up based on the RecordSize
% that the header speficied and parse them
Records = [Record || <<Record:RecordSize/binary>> <= RecordSection],
ParsedRecords = [parse_trace_record(Record) || Record <- Records].

As you can see, extracting the records section and parsing the header was pretty simple using the binary syntax. We use the same approach to parse out the records themselves:

parse_trace_record(Record) ->
    <<ThreadId:2/binary, MethodIdActionBin:4/binary,
      TimeDelta:4/binary, WallTimeDelta:4/binary>> = Record,

    % Decode the method ID and action from a binary to an integer
    MethodIdAction = binary:decode_unsigned(MethodIdActionBin, little),

    % Now remember that this is a 4-byte integer, and that the top bits
    % are the actual method ID
    MethodId = MethodIdAction band 16#FFFFFFFC,

    % While the action is the lower two bits.
    % Convert to an atom for readability
    MethodAction = case MethodIdAction band 16#00000003 of
                       16#00 -> enter;
                       16#01 -> exit;
                       16#02 -> unwind
                   end,

    #call_record{
       thread_id=binary:decode_unsigned(ThreadId, little),
       method_id=MethodId,
       method_action=MethodAction,
       time_delta=binary:decode_unsigned(TimeDelta, little),
       wall_time_delta=binary:decode_unsigned(WallTimeDelta, little),
       child_time=0
      }.

That’s most of the hard work! The full parser code, including the sections for the thread and method tables, can be seen in the final parser implementation here. Now that we have our call records, method IDs and thread IDs, we need to actually convert that data into the format that the flame graph generator can handle. It expects to receive data as ; delimited stack frames, followed by a space and a number representing the time / samples / cycles spent in the final call in that stack. To calculate this, we iterate over all these records per-thread and perform the following:

For a method entry: Push the method name onto a stack, so that we can keep track of what methods have been called. Also push the full method record onto a second stack, so that we can reference its timings later.

For a method exit/unwind: Here’s where the real logic happens. When a method exits, the stack should contain the matching method entry record. We can use the time on the two records to calculate how much time was spent in this method call altogether. If there is another parent on the stack, we update it to reflect how much time was spent in the current call - this allows for separate tracking of self and child call times. We then take the self time (subtracting any children from the current call), and update a map of method name list -> time with the self time. This deduplicates identical call chains.

Once we have iterated over all the calls, we should have a map of lists of method names to timings. From there, we can just join the names with semicolons, append a space and the timing, and it’s all set to be processed by flamegraph.pl.

The code implementing this logic can be found here.

Once we have the trace data, we can process the graph:

flamegraph.pl \
        --title "onCreate" \
        --hash \
        --countname "microseconds" \
        onCreate.flat > onCreate.svg

Et voilá!

In order to make this process a lot simpler and less manual, the trace parsing and graph generation have all been rolled into a simple server than you can run. Instructions are available on Github, and there is also a public copy of the server running at https://aflame.rhye.org/ if you want to try out uploading your own traces.

The full per-thread breakdown for the example onCreate trace we generated can be found online here.