Erlang Liveness Checks in Kubernetes

I have an ever-increasing number of small projects and deployments that I use either internally or with some availability to the public, and have been relying on Kubernetes to make managing them easy. Not too long ago, I started adding a liveness probe to each pod definition as a contingency against a hung runtime. My pod definition at the time looked like this example from my Aflame project:

containers:
- name: aflame
  image: docker-registry:5000/erlang-aflame
  livenessProbe:
    exec:
      command:
      - /deploy/bin/aflame
      - ping
    initialDelaySeconds: 5

This worked fine, and I was able to verify that nodes that failed the ping test would be taken down. However, some weeks later I was poking around and realized that CPU utilization on my server was significantly higher than I would expect.

Note the recent spike in CPU utilization

Looking at htop, it was difficult to see a precise culprit, except for a large number of processes called erl_child_setup coming into existence, pegging a CPU core and disappearing again. After googling around, I landed on the source code for this task and found this section of main:

/* We close all fds except the uds from beam.
   All other fds from now on will have the
   CLOEXEC flags set on them. This means that we
   only have to close a very limited number of fds
   after we fork before the exec. */
#if defined(HAVE_CLOSEFROM)
    closefrom(4);
#else
    for (i = 4; i < max_files; i++)
#if defined(__ANDROID__)
        if (i != system_properties_fd())
#endif
        (void) close(i);
#endif

According to the ps output on my system, max_files was getting set to 1,048,576 - so every time this program ran, it was hot-looping over a million possible file descriptors and calling close on each! No wonder it was resulting in so much system time. But what was actually causing all of these erl_child_setup calls? I had initially suspected some misbehaving deployment code, but the spike in load lined up with when I added the liveness checks. It turns out that the relx ping command is surprisingly heavyweight - or at least ends up that way when run in an environment with a very high max file count, which was the case under Docker.
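If you want to double-check the descriptor limit that the emulator (and therefore erl_child_setup) sees inside a container, you can ask a running node directly; erlang:system_info(check_io) includes a max_fds entry. A quick sketch - the exact contents of the proplist vary between ERTS versions:

% From a remote console on the affected node
1> proplists:get_value(max_fds, erlang:system_info(check_io)).
1048576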

The fix

In order to fix this, I altered my liveness probe to run the ping check with a low ulimit on the number of open files. We still need to loop, but we are at least looping over a considerably more restrained number of descriptors. I also took the opportunity to increase the interval between checks, since the default check period is rather quick.

 containers:
 - name: aflame
   image: docker-registry:5000/erlang-aflame
   livenessProbe:
     exec:
       command:
+      - softlimit
+      - -o
+      - "128"
       - /deploy/bin/aflame
       - ping
     initialDelaySeconds: 5
+    periodSeconds: 60

Note that in order to get softlimit in your container, you may need to rebuild with the daemontools package installed, or with another package that provides a similar limit utility.

Noticeably reduced CPU usage

With these changes, system load rapidly dropped from nearly 20 to a more reasonable 3. The default performance of this wrapper would likely be helped by the adoption of a closefrom syscall into the Linux kernel, but unfortunately the only references I can find to this are a pessimistic ticket from 2009 and an unmerged patch from Zheng Liu in 2014.

Respecting the scheduler in Erlang NIFs

Companion code for this post available on Github

Recently, I did a small writeup on creating natively implemented functions (NIFs) in Erlang. But as was brought up by several people on reddit and lobste.rs, that example did not account for any sort of coöperative scheduling with the VM.

The BEAM has one scheduler per CPU core, and these (and, in later versions of Erlang, the dirty schedulers) are the executors on which all code is run. In order for the BEAM to try and guarantee a soft-real-time execution environment, it needs to be able to track how much work each process has done, so that it can keep CPU hogs in check and not starve other processes. To do this, it uses the concept of reductions. Each time a statement is evaluated (reduced), the number of reductions performed by that process is incremented, until it reaches a set limit and the process is context switched out.
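This bookkeeping is visible from ordinary Erlang code: process_info/2 reports the running reduction count for a process. A small helper along these lines (my own sketch, not part of any library) gives a rough feel for how many reductions a piece of work is charged:

% Run a fun in the current process and report how many reductions the
% scheduler charged us for it.
count_reductions(Fun) ->
    {reductions, Before} = process_info(self(), reductions),
    Fun(),
    {reductions, After} = process_info(self(), reductions),
    After - Before.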

If we take a look at the erl_nif man page, we can see that there is one method that deals with accounting for NIF time: enif_consume_timeslice. This method expects to be called relatively regularly from your NIF, with an argument that is the percentage of a time slice that you believe you have used since the last time you called it. The definition of a time slice itself is somewhat lacking in specificity:

The time is specified as a percent of the timeslice that a process is allowed to execute Erlang code until it can be suspended to give time for other runnable processes. The scheduling timeslice is not an exact entity, but can usually be approximated to about 1 millisecond.

If we take a look at the actual implementation we can see what’s actually happening here:

int enif_consume_timeslice(ErlNifEnv* env, int percent) {
    Process *proc;
    Sint reds;

    execution_state(env, &proc, NULL);

    ASSERT(is_proc_bound(env) && percent >= 1 && percent <= 100);
    if (percent < 1) percent = 1;
    else if (percent > 100) percent = 100;

    reds = ((CONTEXT_REDS+99) / 100) * percent;
    ASSERT(reds > 0 && reds <= CONTEXT_REDS);
    BUMP_REDS(proc, reds);
    return ERTS_BIF_REDS_LEFT(proc) == 0;
}

When we call this method, we bump the number of reductions this process has executed by the given percentage of CONTEXT_REDS (the number of reductions a process may perform before it should be context-switched out).
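To put a number on that: CONTEXT_REDS is 4000 in recent ERTS releases, so a call like enif_consume_timeslice(env, 25) charges the calling process roughly a quarter of its budget:

1> ((4000 + 99) div 100) * 25.
1000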

So far so good. But now, we need to see how we can actually restructure our code to allow it to be context switched out. The best way to do this is to use the primitive enif_schedule_nif, which allows you to specify a function pointer and arguments that should be called to continue your calculations. Note that the NIF scheduled in this way does not need to be exported - meaning that it is fairly safe to create helper NIFs under the assumption that end users will not try and call them. So let’s take a look at how we might re-write our Levenshtein example to allow it to perform work in chunks:

// The first thing you'll want to do is create a struct to maintain any
// internal state you want to pass between calls to your NIF.
// In this case, I need to keep track of my matrix, the input strings,
// and which x, y position I had iterated up to.
struct LevenshteinState {
    // The matrix being used to calculate the distance
    unsigned int *matrix;

    // The input strings + their sizes
    unsigned char *s1;
    unsigned s1len;
    unsigned char *s2;
    unsigned s2len;

    // The index of the last processed row of the matrix,
    // so that the next iteration can pick up where we left off
    unsigned int lastX;
    unsigned int lastY;
};

// Now, let's rewrite our entry point so that all it does is read in the
// command arguments, validate them and then yield a call to our internal NIF
// that will do all the actual work.
static ERL_NIF_TERM erl_levenshtein(ErlNifEnv* env, int argc, const
                                    ERL_NIF_TERM argv[]) {
    // Not pictured: verifying argc, the type of the arguments,
    // and casting the binaries to ErlNifBinary structs.

    // Retrieve the state resource descriptor from our priv data,
    // and allocate a new structure
    // PrivData here is a custom struct, initialized during our module load
    // callback. See the code on github for the full implementation with
    // regards to this.
    struct PrivData *priv_data = enif_priv_data(env);
    struct LevenshteinState* state = enif_alloc_resource(
        priv_data->levenshtein_state_resource,
        sizeof(struct LevenshteinState)
    );

    //// Initialize the calculation state
    // Allocate our matrix
    size_t matrix_size = (
        sizeof(unsigned int) * (binary1.size + 1) * (binary2.size + 1)
    );
    state->matrix = malloc(matrix_size);
    // Copy the binary term info
    state->s1 = binary1.data;
    state->s1len = binary1.size;
    state->s2 = binary2.data;
    state->s2len = binary2.size;
    // Set our initial X and Y values
    state->lastX = 1;
    state->lastY = 1;

    // In the full version, here is where I also initialize the first row and
    // column of the matrix, in order to simplify the code in the helper NIF.

    // Now that the setup is complete, we can call erl_schedule_nif to
    // tell the beam our continuation function.
    // First, we need to wrap our data resource so that it can be passed
    // through the BEAM. enif_make_resource takes our state pointer and returns
    // an ERL_NIF_TERM.
    ERL_NIF_TERM state_term = enif_make_resource(env, state);
    // enif_schedule_nif expects an argv array for the continuation call, so
    // wrap our single state term in one.
    ERL_NIF_TERM args[] = {state_term};
    // The NIF name here does not seem to be used for determining what code to
    // call, and is likely only used when debugging what code is running.
    return enif_schedule_nif(
        env,
        "levenshtein_yielding", // NIF to call
        0, // Flags
        erl_levenshtein_yielding,
        1,
        args
    );
}

Now that the glue code is out of the way, we can create our erl_levenshtein_yielding method which, for any workload that takes longer than a millisecond, we can expect to be called multiple times for a given input. It takes a single argument, our wrapped state from before, unwraps it, and continues wherever the previous call left off.

static ERL_NIF_TERM erl_levenshtein_yielding(ErlNifEnv* env, int argc,
                                             const ERL_NIF_TERM argv[]) {
    // Not pictured: argc check

    // Extract the state term. In the same way we wrapped it before, we need
    // to now unwrap the resource we used to pass our struct through.
    struct PrivData *priv_data = enif_priv_data(env);
    struct LevenshteinState* state;
    if (!enif_get_resource(env, argv[0],
                           priv_data->levenshtein_state_resource,
                           ((void*) (&state)))) {
        return mk_error(env, "bad_internal_state");
    }

    // Start processing wherever the previous slice left off
    const unsigned int xsize = state->s1len + 1;
    unsigned int x = state->lastX;
    unsigned int y = state->lastY;

    // Specs for tracking function runtime
    struct timespec start_time;
    struct timespec current_time;

    // Grab the function start time
    clock_gettime(CLOCK_MONOTONIC, &start_time);

    // Create a tracker for the number of loop iterations we've done.
    // This operation count will act as a punctuator for us to check
    // whether it's time for us to yield again.
    unsigned long operations = 0;

    // This is a bit slimy, but is the simplest way to preload
    // the x and y loop vars for the first inner loop iteration
    goto loop_inner;

    // Loop over the matrix
    for (x = state->lastX; x <= state->s2len; x++) {
        for (y = 1; y <= state->s1len; y++) {
loop_inner:
            // Ordinary Levenshtein implementation
            MATRIX_ELEMENT(state->matrix, xsize, x, y) = MIN3(
                MATRIX_ELEMENT(state->matrix, xsize, x-1, y) + 1,
                MATRIX_ELEMENT(state->matrix, xsize, x, y-1) + 1,
                MATRIX_ELEMENT(state->matrix, xsize, x-1, y-1) +
                    (state->s1[y-1] == state->s2[x-1] ? 0 : 1)
            );

            // For each cell, increment the op count until we hit our
            // check threshold.
            if (unlikely(operations++ > OPERATIONS_BETWEEN_TIMECHECKS)) {
                // When we do, get the current time
                clock_gettime(CLOCK_MONOTONIC, &current_time);

                // Figure out how many nanoseconds have passed since we started
                // calculating
                unsigned long nanoseconds_diff = (
                    (current_time.tv_nsec - start_time.tv_nsec) +
                    (current_time.tv_sec - start_time.tv_sec) * 1000000000
                );

                // Convert that to a percentage of a timeslice, assuming that
                // a time slice is 1 millisecond.
                int slice_percent = (nanoseconds_diff * 100) / TIMESLICE_NANOSECONDS;

                // enif_consume_timeslice requires a percentage in the range
                // 1 <= timeslice <= 100
                if (slice_percent < 1) {
                    slice_percent = 1;
                } else if (slice_percent > 100) {
                    slice_percent = 100;
                }

                // Consume that amount of a timeslice.
                // If the result is 1, then we have consumed the entire slice and
                // should now yield.
                if (enif_consume_timeslice(env, slice_percent)) {
                    // Break out of both loops
                    goto loop_exit;
                }

                // If we're not done, shift the times over and keep looping
                start_time.tv_sec = current_time.tv_sec;
                start_time.tv_nsec = current_time.tv_nsec;
                operations = 0;
            }
        }
    }

loop_exit:
    // If we exited the loop via jump, we must have run out of time
    // in this slice. Update our state and yield the next cycle.
    if (likely(x <= state->s2len || y <= state->s1len)) {
        // Update the state with the next row value to process
        state->lastX = x;
        state->lastY = y;

        // Yield another call to ourselves.
        // We can re-use our argv, since we're reusing the same state struct.
        return enif_schedule_nif(
            env,
            "levenshtein_yielding", // NIF to call
            0, // Flags
            erl_levenshtein_yielding,
            1,
            argv
        );
    }

    // If we are done, grab the result
    unsigned int result = MATRIX_ELEMENT(
        state->matrix, xsize, state->s2len, state->s1len);

    // We've finished, so it's time to free the work state.
    free(state->matrix);
    enif_release_resource(state);

    // Return the calculated value
    return enif_make_int(env, result);
}

Now that we’ve added all that complexity, let’s see whether it was worth it. The hypothesis is that without the timeslice accounting, we would hog the scheduler and not allow other processes to run in time. So to test that, let’s create a process that tries to sleep for exactly one second, and prints how much over/under one second it actually slept for. While it’s running, we’ll also saturate all of our cores with processes that do nothing but run Levenshtein on large inputs. Here’s how we’ll do it:

realtime_test() ->
    % Allocate two large binaries
    A = << <<0>> || _ <- lists:seq(1, 10000) >>,
    B = << <<1>> || _ <- lists:seq(1, 10000) >>,

    % Create a printer process that tries to print regularly
    _Printer = spawn_link(fun() -> realtime_printer(os:system_time()) end),

    % Create enough adversarial worker processes to saturate all cores
    _Workers = [
        spawn_link(fun() -> realtime_worker(A, B) end)
        || _ <- lists:seq(1, erlang:system_info(logical_processors_available))
    ],
    ok.

% Spins forever, running our NIF on the input strings
realtime_worker(A, B) ->
    levenshtein:levenshtein(A, B),
    realtime_worker(A, B).

% Attempt to run exactly every second, and print how much we were off by.
realtime_printer(LastRan) ->
    timer:sleep(1000),
    Delta = os:system_time() - LastRan,
    DeltaMs = Delta / 1000000,
    Jitter = 1000000000 - Delta,
    JitterMs = Jitter / 1000000,
    io:format("Time since last schedule: ~p ms, Jitter: ~p ms~n", [
        DeltaMs, JitterMs
    ]),
    realtime_printer(os:system_time()).

First, as a baseline, let’s actually run this with an entirely Erlang version of Levenshtein to see what amount of jitter we should expect:

2> perftest:realtime_test().
Time since last schedule: 1004.905694 ms, Jitter: -4.905694 ms
Time since last schedule: 1003.176042 ms, Jitter: -3.176042 ms
Time since last schedule: 1003.292757 ms, Jitter: -3.292757 ms
Time since last schedule: 1003.264791 ms, Jitter: -3.264791 ms

As expected, we have a fairly low amount of deviation from our expected one second print loop. Now let’s see how it looks with our scheduler-friendly NIF implementation:

1> % With the yielding NIF implementation
1> perftest:realtime_test().
ok
Time since last schedule: 1002.378093 ms, Jitter: -2.378093 ms
Time since last schedule: 1002.232311 ms, Jitter: -2.232311 ms
Time since last schedule: 1003.469838 ms, Jitter: -3.469838 ms
Time since last schedule: 1002.724563 ms, Jitter: -2.724563 ms
Time since last schedule: 1002.120373 ms, Jitter: -2.120373 ms
Time since last schedule: 1003.110727 ms, Jitter: -3.110727 ms
Time since last schedule: 1002.924888 ms, Jitter: -2.924888 ms
Time since last schedule: 1002.408802 ms, Jitter: -2.408802 ms
Time since last schedule: 1002.575524 ms, Jitter: -2.575524 ms

Hardly any difference! Looks like our time slice management is enough to keep call latencies in check. Let’s see what happens when we run this test case on our previous non-yielding NIF implementation:

1> perftest:realtime_test().
ok
[shell becomes unresponsive]

In the end I wasn’t able to compare the jitter between this scheduler-friendly implementation and the older implementation, because the older version completely hogs the scheduler, rendering the shell entirely inoperable. So I think we can consider this a good reason to make NIFs that perform heavy lifting report their status!

While much better at respecting the soft-realtime nature of the BEAM, the naïve implementation here adds a lot of calls to clock_gettime and the extra overhead of having to yield. If we compare the performance against the old version, we do see a performance decrease:

1> perftest:perftest(100000, fun levenshtein:unfair_levenshtein/2).
209974.53409952734
2> perftest:perftest(100000, fun levenshtein:yielding_levenshtein/2).
149317.74831923426

Our unfair code is approximately 1.4x the speed of the fair code. A noticeable amount, but worth it if you are running this calculation in an environment that also has time-sensitive processes you do not want to disrupt.

As before, a full working copy of the source code can be seen on Github

Natively Implemented Functions in Erlang

Companion code for this post available on Github

Erlang is a surprisingly performant language for something that provides so many high level primitives, but occasionally there comes a time when you need to perform a complex calculation fast, and the immutability of Erlang data or the slight overhead of the BEAM becomes an annoyance. As an example, I’ve recently been working on a project that requires calculating the edit distance between streaming pairs of data in real time. Erlang is an excellent fit for streaming data from the network to many parallel workers, but the Levenshtein distance calculation is somewhat taxing. As a first pass, I adapted a memoized distance calculation using plain Erlang:

erlang_leven(<<Bin/binary>>, <<Bin2/binary>>) ->
    {Ed, _Cache} = erlang_leven(Bin, Bin2, dict:new()),
    Ed.

erlang_leven(<<>>, <<Bin/binary>>, Cache) ->
    {byte_size(Bin), dict:store({<<>>, Bin}, byte_size(Bin), Cache)};
erlang_leven(<<Bin/binary>>, <<"">>, Cache) ->
    {byte_size(Bin), dict:store({Bin, <<>>}, byte_size(Bin), Cache)};
erlang_leven(<<B:8, B1/binary>>, <<B:8, B2/binary>>, Cache) ->
    erlang_leven(B1, B2, Cache);
erlang_leven(Bin1 = <<_:8, B1/binary>>, Bin2 = <<_:8, B2/binary>>, Cache) ->
    case dict:is_key({Bin1, Bin2}, Cache) of
        true -> {dict:fetch({Bin1, Bin2}, Cache), Cache};
        false ->
            {L1, C1} = erlang_leven(Bin1, B2, Cache),
            {L2, C2} = erlang_leven(B1, Bin2, C1),
            {L3, C3} = erlang_leven(B1, B2, C2),
            L = 1 + lists:min([L1, L2, L3]),
            {L, dict:store({Bin1, Bin2}, L, C3)}
    end.

If we create a simple test function, we can see how this performs with a midsize input:

perftest(Iterations, Method) ->
    Start = os:system_time(),
    method_loop(Iterations, Method),
    Diff = os:system_time() - Start,
    (Iterations / Diff) * 1000000000.

method_loop(0, _) -> ok;
method_loop(I, Method) ->
    Method(
        <<"7ab02d24-2d67-11e8-835d-0b1d27744c6d">>,
        <<"80393ca4-2d67-11e8-8f7f-c7a8488e4904">>
    ),
    method_loop(I - 1, Method).

8> c('src/perftest').
{ok,perftest}
9> perftest:perftest(1000, fun perftest:erlang_leven/2).
211.50049535298368

Slightly over 200 iterations per second - not stellar. Even with parallelism, this is going to struggle with even a moderate amount of real-time data. So it’s time to turn to NIFs.

Natively Implemented Functions

Erlang, like many languages, supports a foreign function interface (FFI) allowing simple integration of code written in C/C++ into the runtime. In order to add some native code to a project, let’s first create a simple library using a rebar template.

ross@mjolnir:/h/r/P/Erlang$ rebar3 new lib levenshtein
===> Writing levenshtein/src/levenshtein.erl
===> Writing levenshtein/src/levenshtein.app.src
===> Writing levenshtein/rebar.config
===> Writing levenshtein/.gitignore
===> Writing levenshtein/LICENSE
===> Writing levenshtein/README.md

Now, we need to update the project to reflect the fact that we’ll be including native code. Rebar has an invocation for this as well:

ross@mjolnir:/h/r/P/Erlang$ cd levenshtein
ross@mjolnir:/h/r/P/E/levenshtein$ rebar3 new cmake
===> Writing c_src/Makefile

We won’t touch this Makefile, since the defaults work well enough, but if you wanted to, this is where you’d specify any extra compiler or linker flags. What we do need to do is start writing our C code, so let’s create our files (c_src/levenshtein.{c,h}) and add some of the basic glue code.

// The first thing we'll want to do in our header (except for include guards)
// is include the methods and structures we'll need to interact with the Erlang
// runtime.
#include <erl_nif.h>

// The second thing will be to define the module callbacks - when a native
// module is registered, it may specify three optional callbacks that the
// system can invoke at different parts of the code lifecycle. These methods
// are not required, but if they are not specified then the ERTS will not allow
// the module to be upgraded in-place.
// Here, we do not keep any state in the module, so these methods will be
// no-ops.
int load(ErlNifEnv* env, void** priv_data, ERL_NIF_TERM load_info);
int upgrade(ErlNifEnv* env, void** priv_data, void** old_priv_data,
            ERL_NIF_TERM load_info);
void unload(ErlNifEnv* env, void* priv_data);

// Now, what we came here for: our native method.
// All native methods have the same signature:
// - A pointer to the Erlang environment from which they have been called
// - The number of arguments they have been passed
// - The arguments themselves, all of which are of type ERL_NIF_TERM
// All methods must return a value, also of ERL_NIF_TERM.
// The actual argument verification must occur in the method itself.
static ERL_NIF_TERM erl_levenshtein(
    ErlNifEnv* env,
    int argc,
    const ERL_NIF_TERM argv[]
);

// Now, we define the exported methods for this native module, just as we
// would for a normal Erlang source file.
// In this case, the functions are a list of ErlNifFunc structs, each of which
// contains the name of the function, the number of arguments (arity),
// the function pointer and a flags field.
// This flags field can be left as zero for most cases, but can be used to
// indicate whether a NIF is dirty. For more info, see the Erlang docs on this:
// http://erlang.org/doc/man/erl_nif.html#dirty_nifs
static ErlNifFunc nif_funcs[] = {
    {"levenshtein", 2, erl_levenshtein, 0}
};

// Finally, we invoke the macro that seals the pact with the ERTS.
// This registers the module under the given name, with the export list and
// given callback methods.
ERL_NIF_INIT(levenshtein, nif_funcs, load, NULL, upgrade, unload);

So far so good - we have just about all the glue we need on the C side in only a handful of lines. Now all we have to do is implement the function itself, so let’s switch over to c_src/levenshtein.c and get started. For a full rundown of the Erlang C API, you can take a look at the man page for erl_nif.

static ERL_NIF_TERM erl_levenshtein(ErlNifEnv* env, int argc, const
                                    ERL_NIF_TERM argv[]) {
    // The very first thing we'll want to do in our function is check we've
    // been called with the correct arity.
    if (argc != 2) {
        // If we didn't get the two arguments we were expecting, we need to
        // raise a badarg error.
        // There is a provided function for precisely this:
        return enif_make_badarg(env);
    }

    // We now know that we have two arguments, but all arguments provided are
    // of type ERL_NIF_TERM, so we need to make sure that they are what we
    // expect - in this case, two binaries.
    if(!enif_is_binary(env, argv[0])
        || !enif_is_binary(env, argv[1])) {
        // If they aren't both binaries, we return an error tuple
        // {error, not_a_binary}. However, we need to possibly create those
        // atoms ourselves, as well as manually wrap them in a tuple.
        // The implementation of mk_atom can be seen below.
        return enif_make_tuple2(
            env, mk_atom(env, "error"), mk_atom(env, "not_a_binary"));
    }

    // At this point, we know we have both arguments and that both are
    // binaries. We can now safely query them as such, and load their data.
    // The ErlNifBinary struct contains only the length of the binary, and the
    // pointer to the bytes.
    ErlNifBinary binary1, binary2;
    enif_inspect_binary(env, argv[0], &binary1);
    enif_inspect_binary(env, argv[1], &binary2);

    // Now, we can call through to our actual C implementation of the
    // Levenshtein distance (omitted here).
    unsigned int edit_distance = levenshtein(
        binary1.data, binary1.size,
        binary2.data, binary2.size
    );

    // We can't return the number directly though - we need to convert it to an
    // Erlang term first.
    return enif_make_int(env, edit_distance);
}

// Convert a C string to an atom.
ERL_NIF_TERM mk_atom(ErlNifEnv* env, const char* atom) {
    ERL_NIF_TERM ret;
    // enif_make_existing_atom checks the atom table to see if we've already
    // created an atom with this string representation. If this atom already
    // exists, then it will return true and ret will contain the valid atom
    // data. Always try and fetch the atom first, and not make a new one each
    // time, or you risk exhausting the finite space allocated for the atom
    // table.
    if(!enif_make_existing_atom(env, atom, &ret, ERL_NIF_LATIN1)) {
        // If the atom was not already part of the atom table, then we can
        // safely create a new atom.
        return enif_make_atom(env, atom);
    }
    return ret;
}

With this, we’re now done with the C portion of our integration, and need to wire it up from the Erlang side. To do this, we’ll be taking advantage of the on_load directive, which specifies a method to be invoked when a module’s code is loaded by the ERTS. In this case, we want to load the .so file with our native code as soon as the module is added.

-module(levenshtein).
-export([levenshtein/2]).
-on_load(init/0).

% We need to include a dummy method, so that the compiler has something to work
% with. The implementation of it will be replaced when the native code is
% loaded, so it should just raise an error in the exceptional case that it
% actually gets invoked.
levenshtein(_, _) -> exit({not_loaded, [{module, ?MODULE}, {line, ?LINE}]}).

% Now, we define our code loading init method
init() ->
    % Find our priv dir, and the .so file with our module name
    SoName = case code:priv_dir(?MODULE) of
        {error, bad_name} -> exit({error, missing_priv_dir});
        Dir -> filename:join(Dir, ?MODULE)
    end,
    % Attempt to load the code.
    % The second parameter is passed to the module's load callback as
    % `load_info`, but we have no use for it here.
    ok = erlang:load_nif(SoName, 0).

With this in place, all that’s left is to open a shell and test it. Running rebar3 shell should successfully compile and link the C library and place it in the priv dir. Now, let’s see how much of a speed benefit we picked up from rewriting our algorithm natively:

1> c('src/levenshtein').
2> c('src/perftest').
3> perftest:perftest(1000000, fun levenshtein:levenshtein/2).
263647.94814025454

Three orders of magnitude faster! Much more like it. Hopefully this has shown that implementing something as a NIF is not only an excellent performance choice in some cases, it’s also not as daunting a task as you might think. If you want to take a full look at a natively implemented library, you can view the complete code for this project on Github.

Update: As a followup to this post, I have also written a little about how to make a NIF behave nicely with regards to the Erlang scheduler here.

Advanced Android Heap Analysis

Companion code for this post available on Github

OutOfMemoryError is, in my opinion, one of the more insidious errors you can encounter as an app developer. It can be tricky to tie to a specific piece of code or allocation path, as the final straw may be some other small allocation, and even if you know that the crash occurs when loading a large bitmap it can be tricky in a large application to figure out which large bitmap might be the culprit.

Luckily, the Android JVM implements a dumper to the HProf heap profile format, which is relatively standard between JVMs (but in this case with some ART extensions). Android Studio supplies some basic introspection functionality for heap dumps, but it’s not great for following reference chains or seeing at a glance what’s allocating all your memory. Some time ago I wrote an android method trace parser, so I figured it was time to implement a sister tool for introspecting heap dumps. To that end, I have created the erlang-hprof library on Github. The repo itself has a basic overview and usage examples, but here I will go deeper into how it can be applied to identify a complex memory leak issue.

Reconnaissance

Before you can analyze a heap dump, you must first acquire one. If you’re seeing out of memory errors cropping up in your reporting solution of choice, you may or may not actually be able to replicate them yourself. Maybe there’s some codepath you just don’t hit, maybe your phone has 6GiB of RAM and it would take you forever to OOM it. So the best way to get exactly the sort of situations your users encounter is to simply try and dump that user’s heap when they encounter an OOM exception. Android supplies a Debug#dumpHprofData method you can use to dump the heap to a file, so that takes care of how to get the heap in a form we can send back. In order to actually trigger this code, we’ll rely on Thread#setDefaultUncaughtExceptionHandler to invoke our dumping code whenever an uncaught OutOfMemoryError bubbles all the way up. The basics of this will look a little like so:

public class HProfExceptionHandler implements Thread.UncaughtExceptionHandler {

    // Log tag for this class
    private static final String TAG = "HProfExceptionHandler";

    private final Context appContext;
    private final Thread.UncaughtExceptionHandler parentExceptionHandler;

    public HProfExceptionHandler(Context context, Thread.UncaughtExceptionHandler parentHandler) {
        // If this exception handler is installed into a VM with an existing exception handler,
        // we want to be sure that we call through to the already-registered handler so that any
        // other crash reporters still work.
        this.appContext = context.getApplicationContext();
        this.parentExceptionHandler = parentHandler;
    }

    @Override
    public void uncaughtException(Thread t, Throwable e) {
        // Check if this error is an out of memory error
        if (OutOfMemoryError.class.isAssignableFrom(e.getClass())) {
            // If it is, then try and dump it to the filesystem
            dumpHeapToFile();
        }

        // Propagate the error to the parent exception handler, if present
        if (this.parentExceptionHandler != null) {
            parentExceptionHandler.uncaughtException(t, e);
        }
    }

    private void dumpHeapToFile() {
        // First, ensure our dump location exists
        File dumpDir = getDumpDirectory();
        if (!dumpDir.exists()) {
            if (!dumpDir.mkdirs()) {
                Log.w(TAG, "Failed to create dump directory " + dumpDir.getAbsolutePath());
                return;
            }
        }

        // Create a new filename for this dump
        File dumpPath = new File(dumpDir, String.format(Locale.US, "%s.hprof", UUID.randomUUID().toString()));

        // Attempt to write out the heap
        try {
            Debug.dumpHprofData(dumpPath.getAbsolutePath());
            Log.d(TAG, "Successfully wrote heap dump " + dumpPath.getName());
        } catch (IOException e1) {
            Log.e(TAG, e1.getMessage(), e1);
        }
    }
}

Once this dump is complete, the application will continue to execute the next uncaught handler (if present), and then die. On the next start (or, for full marks, in a periodic job that requires unmetered network + charging) you can then upload these dump files to your server. There are two things to be wary of here:

  • These heap dumps can be large. On the order of several hundred megabytes. However, they tend to be highly receptive to compression. Based on production data for an app, most dumps tended to be between 15-25MiB when gzipped, as opposed to 100-200MiB uncompressed. If you are going to upload these dumps, I would very much recommend gzipping them on the fly as you do, as a courtesy to your user.
  • These heap dumps contain everything your application has in memory by definition. This may mean app secrets, sensitive user data / PII and other dangerous information. Please undergo the same data protection precautions with this crash data as you would any other sensitive information.

Initial Inspection

Once you have the heap dumps in hand, it’s time to figure out if there’s a common thread (or threads) to the allocation patterns that resulted in an OOM. The way that the HPROF format works at a high level is that it contains an encoded structure of every class present in the heap, including their static variables, instance variables and their types, followed by vast numbers of instance dumps, each with a class object ID that describes it and the raw data for each of its fields. So let’s assume that we have an HProf file we’ve recovered from a device and want to see what’s going on. First things first, we need to parse it. So fetch the parser as a dependency, and open up an Erlang shell.
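The shell session below leans on a handful of records from the library's include/records.hrl. Abridged to just the fields used in this post (see the real header for the full definitions), they look roughly like this:

-record(hprof_heap_instance,   {object_id, instance_values}).
-record(hprof_instance_field,  {type, value}).
-record(hprof_primitive_array, {object_id, elements}).
-record(hprof_class_dump,      {class_id, static_fields}).
-record(hprof_static_field,    {type, data}).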

% Now, let's open up that heap dump
{ok, Parser} = hprof_parser:parse_file('my_heap.dump').

% Since we're going in blind here, the first thing I'd check is the largest
% objects we might expect to allocate - bitmaps.
% So first, let's get all the instances in this dump that are of the type
% android.graphics.Bitmap
{ok, BmpRef} = hprof_parser:get_instances_for_class(Parser, <<"android.graphics.Bitmap">>).
{ok, Bitmaps} = hprof:await_acc(BmpRef, [], fun(A, I) -> [I|A] end).

% Now, each bitmap stores its actual image data in a byte array, called
% 'mBuffer', so let's read the mBuffer value for each of these bitmaps
% To make this easy, first read in the record definitions
rr("include/records.hrl").
BufferVars = [maps:get(<<"mBuffer">>, Vars) || #hprof_heap_instance{instance_values=Vars} <- Bitmaps].

% Now we have a list of #hprof_instance_field{}, each of which has the ID of
% the relevant native array, so let's fetch those
ByteArrayIds = [V || #hprof_instance_field{value=V} <- BufferVars].

% Now that we know which byte arrays are bitmaps, let's fetch all the
% primitive arrays and filter them down.
ByteArrayIdSet = sets:from_list(ByteArrayIds).
{ok, ByteArrayRef} = hprof_parser:get_primitive_arrays_of_type(Parser, byte).
{ok, BitmapByteArrays} = hprof:await_acc(
    ByteArrayRef,
    [],
    fun(Acc, Array) ->
        case sets:is_element(Array#hprof_primitive_array.object_id, ByteArrayIdSet) of
            true -> [Array|Acc];
            false -> Acc
        end
    end
).

% At this point, we can do some quick maths to see how much of the heap is
% taken up by bitmap data:
ImageSizesBytes = [byte_size(E) || #hprof_primitive_array{elements=E} <- BitmapByteArrays].
TotalImageSize = lists:foldl(fun(A, B) -> A + B end, 0, ImageSizesBytes).

In a lot of cases, you might find that the total bytes allocated by bitmaps accounts for most of your heap. If it does, then now you can dig deeper into which bitmaps these are. One nice way to do this is to reconstruct them and take a look - we have the image data after all! So back to the shell:

% Odds are that if we see the images, as the developer we'll know what they're
% from. So let's convert all those bitmaps we fetched earlier to PNGs and save
% them somewhere
lists:foreach(
    fun(Bitmap) ->
        case hprof_bitmap:make_png(Parser, Bitmap) of
            {ok, PngData} ->
                file:write_file(
                    integer_to_list(Bitmap#hprof_heap_instance.object_id) ++ ".png",
                    PngData
                );
            _ -> ok
        end
    end,
    Bitmaps
).

If you take a look in your working directory, you should now have each of the images that were in-memory written out for your perusal. Note that since Android doesn’t seem to store the bitmap configuration with the java class, the library has to guess at the bit settings for images with two bytes per pixel, since they could be either ARGB_4444 or RGB_565. It attempts to guess based on whether the image looks like it has a lot of pixels that would be transparent, but of course isn’t 100% reliable.

Digging deeper

This is all well and good, and may have worked for finding an image asset that wasn’t getting resized properly for the device’s pixel density. But what do you do if you find an image, but can’t figure out how it and 30 copies of it are all sticking around in the heap? Well, since we have available to us all of the heap objects and all their pointers, we have a graph that tells us precisely which objects point to these images! But just knowing which objects point to which other objects isn’t perfect - if you follow every reference, you will very quickly get bogged down by the sheer number of references, back references and weak references that the Android view classes come with. But we know one thing - if these images (or other problem instances) aren’t getting garbage collected when they should, then the only way this is possible is if somewhere, somehow, there is a reference chain that ends in a static pointer.

So let’s go over how we can identify these reference chains (A complete implementation is visible on Github).

The first step is going to be generating a mapping of which objects are pointed to by which other objects. For this, I’m going to use the bag ETS type to store tuples of {referenced, reference_holder} object ID pairs. We can generate this table fairly simply by streaming all of the instances from the parser:

ReferenceMap = ets:new(objects, [bag]).
{ok, InstancesRef} = hprof_parser:get_all_instances(Parser).
ok = hprof:await(
    InstancesRef,
    fun(I=#hprof_heap_instance{object_id=Oid, instance_values=Values}) ->
        % For each instance, we want to check if it has any fields that point
        % to other objects. If it does, for each of these fields we store a
        % tuple of {referent object id, this object id}
        % Note that you may also want to exclude certain instance types, such
        % as WeakReference and FinalizerReference, from analysis
        maps:fold(
            fun(_VariableName, #hprof_instance_field{type=T, value=V}, _Acc) ->
                % If this variable is of type object, and isn't null, then
                % track this reference in the ETS table
                case (T =:= object) and (V =/= 0) of
                    true -> ets:insert(ReferenceMap, {V, Oid});
                    false -> ok
                end
            end,
            ok,
            Values
        )
    end
).

This data allows us to figure out who holds onto whom (though note that in the final implementation you would also want to include object arrays), but it doesn’t have enough info to tie these instances back to a static culprit. For that, we need to take a look at the class dump instances, which contain all the static fields. If we find any reference chains that are held by any of these, we know that that’s a memory leak. So let’s build up a table in a similar way:

StaticallyReferenced = ets:new(statics, [bag]).
{ok, ClassDumpRef} = hprof_parser:get_class_dumps(Parser).
ok = hprof:await(
    ClassDumpRef,
    fun(#hprof_class_dump{class_id=ClsId, static_fields=Statics}) ->
        lists:foreach(
            fun(#hprof_static_field{type=T, data=D}) ->
                case (T =:= ?HPROF_BASIC_OBJECT) and (D =/= 0) of
                    true ->
                        % So if this class has a static reference to this
                        % object D, then we both update the reference map from
                        % before and tag that object as statically referenced
                        % by inserting it into the second table.
                        ets:insert(ReferenceMap, {D, ClsId}),
                        ets:insert(StaticallyReferenced, {D});
                    false -> ok
                end
            end,
            Statics
        )
    end
).

With these two data sets, we can now do a depth first search from a target object to try and find statically held reference chains.

detect_reference_chain(ReferenceMappings, StaticallyHeld, TargetId, PreviousRoots, RefChain) ->
    % At each step of our search, we want to first find all of the objects that
    % reference our 'target' object.
    % We can look this up using the first table we created:
    Referers = [Referrer || {_, Referrer} <- ets:lookup(ReferenceMappings, TargetId)],

    % Now, we want to iterate over all of the referers:
    % - If the referer is already in the PreviousRoots set, then we've already
    % traversed it while trying to detect a reference chain - this means there's
    % a circular reference, so don't recurse in this case or we'd be here forever
    % - If the referer is not statically referenced, then recurse with the referer
    % as the new target, and the current target added to the previous root set
    % and RefChain, which is the reference stack so far.
    % - If the referer *is* statically referenced, then we've found a potential
    % memory leak! In the full implementation, this would be passed to the
    % pretty-printer worker to resolve the names and numbers of the objects involved.
    lists:foreach(
        fun(Referer) ->
            % This is where we check if we've already seen it (reference loop)
            case sets:is_element(Referer, PreviousRoots) of
                true -> ok; % Return back up the stack
                false ->
                    % Check if this object has a static root
                    case ets_contains(StaticallyHeld, Referer) of
                        false ->
                            % Recurse
                            detect_reference_chain(
                                ReferenceMappings, StaticallyHeld,
                                Referer, % Takes the place of target
                                % Add the target here to prevent any future loops
                                sets:add_element(TargetId, PreviousRoots),
                                [TargetId|RefChain]
                            );
                        true ->
                            % The RefChain stack contains the list of object
                            % pointers that got us here, so add the referer to
                            % it and fob it off on the pretty printer
                            spawn(?MODULE, pretty_print, [[Referer|RefChain]])
                    end
            end
        end,
        Referers
    ).
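The ets_contains/2 call above is a small helper that isn't shown in this excerpt; a minimal version might look like this (ets:member/2 would do the job equally well):

% True if any tuple with this key exists in the table.
ets_contains(Table, Key) ->
    ets:lookup(Table, Key) =/= [].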

Depending on the density of your heap it may take a few minutes to generate your ETS mappings, but afterwards it should be fairly quick to start detecting static reference chains, if they do exist. The full script will resolve and print reference chains in a readable form, like so:

$ _build/default/bin/main dump.hprof 896555536
Populating reference mapping table...
 - Instances...
 - Object arrays...
 - Class dumps...
Loaded 1805702 object mappings in 201 seconds
Waiting for reference chains...
Found statically referenced chain:
com.example.utility.CardCache -static cachedCards-> 866149120
java.lang.Object[] array (857898624) -> 896566080
com.example.cards.ContentCard (896566080) -mParent-> 896566032
com.example.cards.ContentCard$ViewHolder(895754848) -mParent-> 896571264
com.example.activity.MyNonGcActivity$1 (896571264) -this$0-> 896555536
com.example.activity.MyNonGcActivity (896555536)

This is of course an example application, but this sort of bug is something you can easily run into in a large, real world application if you’re unlucky. But as with so many bugs, the work is in the finding, and once you know where a reference is being held it’s a lot easier to find out why and remedy it.

I hope that this little library comes in handy if you run into memory problems with your own applications. If you find any optimizations or useful heap examination scripts, please do file a PR or open an issue on Github.

Optimizing Erlang Binary Matching

Too much heap

Recently I’ve been implementing a binary file format parser in Erlang, taking advantage of the powerful binary matching primitives to cleanly consume and parse values out of the bytestream. However, once the parser was working, I found that processing a 377MiB input file resulted in over 6GiB of heap usage for my program! Using an order of magnitude more memory than the size of my input strikes me as less than ideal, so I started looking into how to profile and optimize binary processing.

The Erlang documentation website has a page relating to binary handling that was useful for getting started, but it took me some time playing with toy examples to really get a handle on which circumstances would result in binary copying or subpar performance, and how to interpret binary optimization warning messages. I’ll try and explain each one as best I can, but if you just want the conclusion you can skip to the bottom where I summarize the three key points I came away with.

Enabling Compiler Warnings

The first step in optimizing binary matching is to turn on the relevant compiler warnings. To get binary optimization info for a module, compile it with erlc using the bin_opt_info flag (erlc +bin_opt_info module.erl) or, to always have it enabled, export the flag as a compiler option using export ERL_COMPILER_OPTIONS=bin_opt_info. If you are in the REPL, you can also get binary optimization output from the built-in c command by passing the flag as an argument: c(module, [bin_opt_info]).
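The flag can also be baked into a module itself with a compile attribute, which is what the sample module later in this post does:

% In the module under inspection, alongside the other attributes:
-compile(bin_opt_info).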

Rule 1: Always use << >>

One of the first quirks about binary match optimization, which took me some time to realize, is that simply passing a binary to a function head as a variable results in deoptimization. For example, here is a very simple function that just whittles down a binary:

1 test_head_variable(<<>>) ->
2     ok;
3 test_head_variable(Binary) when is_binary(Binary) ->
4     <<_:1/binary, Rest/binary>> = Binary,
5     test_head_variable(Rest).

Intuitively, this should be possible for the compiler to optimize - the first head matches the empty binary, and the second head requires a binary value, and the first thing it does is a match on that binary value. However!

> c(foo, [bin_opt_info]).
foo.erl:4: Warning: NOT OPTIMIZED: called function test_head_variable/1 does
not begin with a suitable binary matching instruction

What gives! Both function clauses start with a binary match. However, the second function head, despite always receiving a binary, is still an open variable capture, which breaks the optimizer. Luckily, there exists a simple (if somewhat tiresome to repeatedly apply) fix:

1 test_head_binmatch(<<>>) ->
2     ok;
3 test_head_binmatch(<<Binary/binary>>) when is_binary(Binary) ->
4     <<_:1/binary, Rest/binary>> = Binary,
5     test_head_binmatch(Rest).

The only thing we’ve done is that instead of matching a raw variable called Binary in the second head, we’ve used a binary match of a single term, <<Binary/binary>>. However, this is enough to make the optimizer happy:

> c(foo, [bin_opt_info]).
foo.erl:3: Warning: OPTIMIZED: creation of sub binary delayed
foo.erl:4: Warning: OPTIMIZED: creation of sub binary delayed

Note that in this case, you could simply place the entire binary decomposition in the head of the function. However, if you have a lengthy match head (e.g., if you are writing a parser for a binary format), it can be considerably more readable to match the entire binary in the function head and then break it apart in the function body.
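For the toy function above, the fully head-matched form would look something like this:

test_head_full(<<>>) ->
    ok;
test_head_full(<<_Skipped:1/binary, Rest/binary>>) ->
    test_head_full(Rest).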

Even though this seems like a trivial change, it can have a surprisingly large effect on the performance of your program. Take for example this simple module that just consumes a binary:

make_big_binary(Size) ->
    make_big_binary(Size, <<>>).

make_big_binary(0, Binary) ->
    Binary;
make_big_binary(Size, Binary) ->
    make_big_binary(Size - 1, <<Binary/binary, 0>>).

timer(M, F, A) ->
    Start = os:system_time(),
    apply(M, F, A),
    _Diff = os:system_time() - Start.

consume_binary_unopt(<<>>) ->
    ok;
consume_binary_unopt(Binary) ->
    <<A:2/binary, B:2/binary, C/binary>> = Binary,
    consume_binary_unopt(C).

consume_binary_opt(<<>>) ->
    ok;
consume_binary_opt(<<Binary/binary>>) ->
    <<A:2/binary, B:2/binary, C/binary>> = Binary,
    consume_binary_opt(C).

As we can see, the two consume methods are identical except for the <<>> match in the function head. Running both of the forms with the timer wrapper, we get the following result:

1> BigBin = foo:make_big_binary(256 * 1024 * 1024).
2> foo:timer(foo, consume_binary_unopt, [BigBin]).
3577990533
3> foo:timer(foo, consume_binary_opt, [BigBin]).
901171248
4> 3577990533 / 901171248.
3.8094635416064673

Shockingly, the variant with the binary match syntax in the function head is almost 4 times faster! If we take a look at the Abstract Form of the compiled test code, we can see that the binary head is handled using a different pattern on the code level:

34> UnOptAbsForm.
{function,17,consume_binary_unopt,1,
          [{clause,17,[{bin,17,[]}],[],[{atom,18,ok}]},
           {clause,19,
                   [{var,19,'Binary'}],
                   [],
                   [{match,20,
                           {bin,20,
                                [{bin_element,20,{var,20,'A'},{integer,20,2},[binary]},
                                 {bin_element,20,{var,20,'B'},{integer,20,2},[binary]},
                                 {bin_element,20,{var,20,'C'},default,[binary]}]},
                           {var,20,'Binary'}},
                    {call,21,{atom,21,consume_binary_unopt},[{var,21,'C'}]}]}]}
35> OptAbsForm.
{function,23,consume_binary_opt,1,
          [{clause,23,[{bin,23,[]}],[],[{atom,24,ok}]},
           {clause,25,
                   [{bin,25,
                         [{bin_element,25,{var,25,'Binary'},default,[binary]}]}],
                   [],
                   [{match,26,
                           {bin,26,
                                [{bin_element,26,{var,26,'A'},{integer,26,2},[binary]},
                                 {bin_element,26,{var,26,'B'},{integer,26,2},[binary]},
                                 {bin_element,26,{var,26,'C'},default,[binary]}]},
                           {var,26,'Binary'}},
                    {call,27,{atom,27,consume_binary_opt},[{var,27,'C'}]}]}]}

Even though it appears that the second match clause in the optimized form is doing more work (checking that the head contains a binary), the compiled form is evidently able to take advantage of the fact that all inputs are of type bin and can safely be treated as binary data at runtime, whereas the unoptimised function has a var that could be unsafe (non-binary) data. If, instead of matching the entire binary in the function head of the optimized method and then re-matching A, B, and C in the body, we move that match into the function head, we gain another speed increase - 5.5x the speed of the non-optimized version!
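For reference, that fully head-matched variant of the benchmark function is just the following (the two fixed-size fields are unused here, hence the underscores):

consume_binary_headmatch(<<>>) ->
    ok;
consume_binary_headmatch(<<_A:2/binary, _B:2/binary, C/binary>>) ->
    consume_binary_headmatch(C).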

Rule 2: Binaries before Constraints

Now that we aren’t getting deoptimized out of the gate by our function head, let’s start working on parsing a binary for real. In this case, let our binary data be a series of datagrams, each prefixed with an integer value that determines the type of the following datagram. Since we want to handle each datagram differently, it makes sense to parse off the type, and then call through to a method that, given a type and a data stream, handles parsing that datagram. Something like so:

 1 parse_datagrams_unoptimized(<<>>) ->
 2     ok;
 3 parse_datagrams_unoptimized(<<Bin/binary>>) ->
 4     <<DgramType:4/unsigned-integer-unit:8,
 5       DgramSize:4/unsigned-integer-unit:8,
 6       Rest/binary>> = Bin,
 7     parse_datagram_unoptimized(DgramType, DgramSize, Rest).
 8 
 9 parse_datagram_unoptimized(?DGRAM_TYPE_A, Size, <<Binary/binary>>) ->
10     % Logic specific to datagram A...
11     <<_:Size, Rest/binary>> = Binary,
12     parse_datagrams_unoptimized(Rest).

In the call from parse_datagrams_unoptimized/1 to parse_datagram_unoptimized/3, the compiler will be unable to optimize the binary match, because the function head for /3 is constrained by a non-variable (in this case the constant ?DGRAM_TYPE_A) before the binary match. If we compile with bin_opt_info on, the compiler will actually suggest the remediation in this case (though in more complex cases, might not be able to):

> c(foo, [bin_opt_info]).
foo.erl:3: Warning: OPTIMIZED: creation of sub binary delayed
foo.erl:6: Warning: NOT OPTIMIZED: called function
parse_datagram_unoptimized/3 does not begin with a suitable binary matching
instruction
foo.erl:9: Warning: INFO: matching anything else but a plain variable to the
left of binary pattern will prevent delayed sub binary optimization; SUGGEST
changing argument order
foo.erl:9: Warning: OPTIMIZED: creation of sub binary delayed
foo.erl:11: Warning: OPTIMIZED: creation of sub binary delayed

By simply swapping the constant to after the binary, we make the compiler able to delay the creation of the sub-binary here:

 1 parse_datagrams_optimized(<<>>) ->
 2     ok;
 3 parse_datagrams_optimized(<<Bin/binary>>) ->
 4     <<DgramType:4/unsigned-integer-unit:8,
 5       DgramSize:4/unsigned-integer-unit:8,
 6       Rest/binary>> = Bin,
 7     parse_datagram_optimized(Rest, DgramType, DgramSize).
 8 
 9 parse_datagram_optimized(<<Binary/binary>>, ?DGRAM_TYPE_A, Size) ->
10     % Logic specific to datagram A...
11     <<_:Size, Rest/binary>> = Binary,
12     parse_datagrams_optimized(Rest).

3> c(foo, [bin_opt_info]).
foo.erl:3: Warning: OPTIMIZED: creation of sub binary delayed
foo.erl:6: Warning: OPTIMIZED: creation of sub binary delayed
foo.erl:9: Warning: OPTIMIZED: creation of sub binary delayed
foo.erl:11: Warning: OPTIMIZED: creation of sub binary delayed

Since this post was originally written, the bit syntax matching handler in the Erlang compiler has been updated by @bjorng to handle simple rearrangement optimizations at compile time. This performance benefit will be part of OTP version 21.

For the sample from the Erlang efficiency guide:

-module(test).
-compile(bin_opt_info).
-export([non_opt_eq/2]).

non_opt_eq([H|T1], <<H,T2/binary>>) ->
    non_opt_eq(T1, T2);
non_opt_eq([_|_], <<_,_/binary>>) ->
    false;
non_opt_eq([], <<>>) ->
    true.

We can see with a development build that cases like this can now be automatically rearranged:

$ erlc test.erl  # Erlang/OTP 19
test.erl:5: Warning: INFO: matching anything else but a plain variable to the left of binary pattern will prevent delayed sub binary optimization; SUGGEST changing argument order
test.erl:5: Warning: NOT OPTIMIZED: called function non_opt_eq/2 does not begin with a suitable binary matching instruction
$ ./otp/bin/erlc test.erl  # Erlang/OTP 21 [DEVELOPMENT]
test.erl:5: Warning: OPTIMIZED: creation of sub binary delayed

Rule 3: No Unbounded Sub-binaries

Another gotcha comes when handling individual chunks of data. Let’s say you’re parsing some protocol, where each packet is prefixed with a size, contains two fixed fields and one field that is the remainder of the packet. The simple way to parse these packets from a byte stream would be along these lines:

 1 parse_packets_unopt(<<>>, Acc) ->
 2     Acc;
 3 parse_packets_unopt(<<Binary/binary>>, Acc) ->
 4     <<PacketSize:4/unsigned-integer-unit:8,
 5       Rest/binary>> = Binary,
 6     <<PacketBinary:PacketSize/binary, Rest1/binary>> = Rest,
 7     Packet = parse_packet_unopt(PacketBinary),
 8     parse_packets_unopt(Rest1, [Packet|Acc]).
 9 
10 parse_packet_unopt(<<Packet/binary>>) ->
11     <<A:1/binary, B:2/binary, C/binary>> = Packet,
12     {A, B, C}.

For each packet, you match off the packet size, take that many bytes, then pass the packet data to a second method that handles the packet-internal data. (For simplicity, parse_packet_unopt/1 has only one case, but you can imagine more complex handling logic that performs different operations based on the values of A or B).

If we run this through the optimizer though, we have a problem:

> c(foo, [bin_opt_info]).
foo.erl:3: Warning: OPTIMIZED: creation of sub binary delayed
foo.erl:5: Warning: OPTIMIZED: creation of sub binary delayed
foo.erl:6: Warning: OPTIMIZED: creation of sub binary delayed
foo.erl:10: Warning: OPTIMIZED: creation of sub binary delayed
foo.erl:11: Warning: NOT OPTIMIZED: sub binary is used or returned

Since C isn’t bounded in size, and escapes the function as part of the return, the compiler isn’t able to optimize it away. We can fix this by doing some manual size accounting, which introduces more room for programmer error (miscalculation of total field sizes), but makes the optimizer happy.

 1 parse_packets_opt(<<Binary/binary>>, Acc) ->
 2     <<PacketSize:4/unsigned-integer-unit:8,
 3       Rest/binary>> = Binary,
 4     <<PacketBinary:PacketSize/binary, Rest1/binary>> = Rest,
 5     Packet = parse_packet_opt(PacketSize, PacketBinary),
 6     parse_packets_opt(Rest1, [Packet|Acc]).
 7 
 8 parse_packet_opt(PacketSize, <<Packet/binary>>) ->
 9     FieldCSize = PacketSize - 3,
10     <<A:1/binary, B:2/binary, C:FieldCSize/binary>> = Packet,
11     {A, B, C}.

With this version of the code, we manually calculate the correct size for C. Running this through erlc, we see an improvement:

> c(foo, [bin_opt_info]).
foo.erl:3: Warning: OPTIMIZED: creation of sub binary delayed
foo.erl:5: Warning: OPTIMIZED: creation of sub binary delayed
foo.erl:6: Warning: OPTIMIZED: creation of sub binary delayed
foo.erl:10: Warning: OPTIMIZED: creation of sub binary delayed

Unbounded matches on binaries seem to only be optimizable if control flow then recurses on that tail. In the non-optimized case above, the unbounded match on C is stashed, and recursion continues on Rest1. Since the optimization results from bumping a single pointer along as the binary is matched, having to keep a reference to C breaks the optimization chain.

Hard and fast takeaways:

  • Always use << >> in your function head, even if you are matching on the identity
    • Bad: function(Binary) -> ...
    • Good: function(<<Binary/binary>>) -> ...
    • Best: function(<<Value/integer, V2:1/binary, .../binary>>) -> ...
  • Never place a constraint to the left of your binary in the function head
    • Bad: function(?DGRAM_TYPE_A, <<Bin/binary>>) -> ...
    • Good: function(<<Bin/binary>>, ?DGRAM_TYPE_A) -> ...
  • Never use unknown-length matches on sub-binaries
    • Bad: <<A:1/binary, B:2/binary, C/binary>> = Packet
    • Good: <<A:1/binary, B:2/binary, C:CalculatedFieldSize/binary>> = Packet

Reworking my code with these rules in mind resulted in a net speed increase and a significant (though not yet satisfactory) reduction in heap usage. Hopefully there are more binary handling tricks out there for avoiding unnecessary allocations when processing large sets of binary data.

Cassandra performance in Python: Avoid namedtuple

Companion code for this post available on Github

TLDR: Use dict_factory instead of named_tuple_factory for the python cassandra driver.

Late last year, namedtuple was fingered as a culprit causing slow startup time in larger python applications. After some back and forth (covered by LWN), it appears that some improvements will come in later versions of Python 3, but namedtuple creation remains expensive.
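
To see the asymmetry for yourself, a quick micro-benchmark with the standard library’s timeit is enough (this snippet is illustrative, not from the original post, and the exact numbers will of course vary by machine):

import timeit

# Creating the namedtuple *class* is the expensive part: each call builds a
# brand new class with generated methods.
create_class = timeit.timeit(
    "namedtuple('Row', ['user_id', 'name', 'email'])",
    setup="from collections import namedtuple",
    number=10000,
)

# Instantiating an already-created class is cheap by comparison.
create_instance = timeit.timeit(
    "Row(1, 'ross', 'ross@example.com')",
    setup="from collections import namedtuple; "
          "Row = namedtuple('Row', ['user_id', 'name', 'email'])",
    number=10000,
)

print("10k class creations:  %.3fs" % create_class)
print("10k instantiations:   %.3fs" % create_instance)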

For applications with a fixed number of named tuples, this is a startup penalty but not a drag on operations while running - once the tuples are made, there is no need to remake them. Unfortunately, there are some libraries that do repeatedly create named tuples at runtime - including, as I discovered while looking for something else entirely, the DataStax python driver for Cassandra.

If you have used Cassandra from python before, you’re likely familiar with the fact that for each select query, you are returned a Row object with a field for each column. These Row objects are in fact named tuples, generated by the library after it receives the raw results from the Cassandra nodes, according to the following factory implementation (somewhat edited for brevity):

def named_tuple_factory(colnames, rows):
    """
    Returns each row as a namedtuple
    https://docs.python.org/2/library/collections.html#collections.namedtuple
    This is the default row factory.
    [...]
    """
    clean_column_names = map(_clean_column_name, colnames)
    try:
        Row = namedtuple('Row', clean_column_names)
    except Exception:
        # Create list because py3 map object will be consumed by first attempt
        clean_column_names = list(map(_clean_column_name, colnames))
        # [...]
        Row = namedtuple('Row', _sanitize_identifiers(clean_column_names))

    return [Row(*row) for row in rows]

As we can see, a new named tuple class is created every time you make a query returning a result set. If your application makes a large number of small queries, the named tuple processing time can quickly add up. Even if you are reading the same table repeatedly, there is no cache mechanism for re-using the previously generated tuple type. For the program I was profiling, a staggering 34% of worker run time was spent in named tuple creation. Luckily, the row factory for the Cassandra client can be easily changed, and others are available - raw tuples, dictionaries, and ordered dictionaries. I consider ordinary tuples to be non-ideal since the return data will be position-dependent instead of keyed, so that leaves the dictionary factory as the best alternative on paper. If we take a look at it, it’s rather simple:

def dict_factory(colnames, rows):
    """
    Returns each row as a dict.
    [...]
    """
    return [dict(zip(colnames, row)) for row in rows]

Based on our assumptions about the relative performance of dictionaries and named tuple creation, we can assume that dict_factory will be more performant than named_tuple_factory. But by how much? In order to get a ballpark, I’ve constructed a small benchmark that repeatedly queries a random subset of data from Cassandra, using different row factories. The source code can be found on GitHub if you wish to test for yourself. Run against all of the built-in row factories, here are my local results (some output removed for clarity):

(venv) ross@mjolnir:/h/r/P/cass_speedtest$ python --version
Python 2.7.13
(venv) ross@mjolnir:/h/r/P/cass_speedtest$ python main.py
Loaded 10000 rows of test data.
Warming cassandra up a bit first...done
--------------------------------------------------------------------------------
Beginning test for row factory <cyfunction tuple_factory at 0x7ff9b6e44a10>
Benchmark complete.
Runtime avg: 0.884321 seconds (stddev: 0.000252)
QPS avg: 1131.533536 seconds (stddev: 405.529567)
--------------------------------------------------------------------------------
Beginning test for row factory <cyfunction named_tuple_factory at 0x7ff9b6e44ad0>
Benchmark complete.
Runtime avg: 1.480597 seconds (stddev: 0.000065)
QPS avg: 675.442898 seconds (stddev: 13.412463)
--------------------------------------------------------------------------------
Beginning test for row factory <cyfunction dict_factory at 0x7ff9b6e44b90>
Benchmark complete.
Runtime avg: 0.876114 seconds (stddev: 0.000070)
QPS avg: 1141.611256 seconds (stddev: 118.118469)
--------------------------------------------------------------------------------
Beginning test for row factory <cyfunction ordered_dict_factory at 0x7ff9b6e44c50>
Benchmark complete.
Runtime avg: 0.945361 seconds (stddev: 0.000033)
QPS avg: 1057.873886 seconds (stddev: 40.724691)

Even before we table that up, we can see the named tuple clearly lags behind the others by a significant margin:

Row Factory     Run Time (Seconds)    Queries Per Second
Tuple           0.884321              1131.533536
Named Tuple     1.480597              675.442898
Dict            0.876114              1141.611256
Ordered Dict    0.945361              1057.873886

A real-world application will likely see less dramatic gains (as there is presumably other business logic vying for CPU time), but the improvement from switching from namedtuple to dict rows should still be appreciable. Making the change in your application is simple - just set the row_factory property on your Cassandra session instance, as detailed in the Cassandra docs here.
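
As a concrete sketch of that change (the contact point, keyspace and table below are placeholders for whatever your application actually uses), it really is a one-liner on the session:

from cassandra.cluster import Cluster
from cassandra.query import dict_factory

cluster = Cluster(['127.0.0.1'])
session = cluster.connect('my_keyspace')

# Swap the default named_tuple_factory for dict_factory; every query on this
# session now returns plain dicts keyed by column name.
session.row_factory = dict_factory

for row in session.execute("SELECT user_id, email FROM users LIMIT 10"):
    print(row['user_id'], row['email'])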

Clocking In

Companion code for this post available on Github

Some time ago now, I built my first moderately complex electronics project, an ethernet connected Nixie clock that pulled the time over NTP (all the code and schematics for which are on Github for those interested). Sadly, the one that had been running continuously since then croaked in a non-obvious enough way that I couldn’t repair it. Rather than re-fabricate the old design, I took this as an opportunity to revise the design and make some modifications.

The original two-part design, with IN-16 tubes

While the previous version did have NTP, it was over ethernet (the ENC28J60 is considerably cheaper than any wifi chipset, especially at the time) which limited the locations that the clock could be plugged in. Version one was also a four digit display, which was nice and compact but doesn’t have the engagement of a noticeably ticking seconds place. Since adding a seconds place would make the design quite wide, I opted for more of a desk-placard form factor than the relatively cubelike design of v1. For style, the front would be just the display. Since the IN-16 tubes used for the first version are uprights, for this design I switched over to IN-12 tubes, which have their display in the same plane as the PCB they are mounted to.

So the first step was to make the smallest possible board that can hold the display tubes themselves, and use this as a reference for all the other boards:

This gives us a roughly 6.25” x 2” area to be getting on with, so next is to make sure that the brains of the operation can all fit into that space. The main things we need to cram in are:

The MCU: This will need to wrangle the WiFi chipset to fetch NTP offsets from the net, as well as marshal that data in such a way as to drive the display logic. Since there’s not too much processing going on here, we don’t need a hugely beefy chip. Here I’ll be using the Atmel ATMEGA328P, since it’s more than capable for the task and allows me to take advantage of the Arduino software ecosystem. This MCU doesn’t require much by way of supporting components - just a crystal.

Wifi: I know that the ESP8266 is the darling these days, but I could never get one to program correctly and have a (perhaps irrational) preference for SPI over UART. To that end, my go-to wifi chip is the Atmel ATWINC1500, which is somewhat more expensive but has proved easier to work with. It does run at 3.3V though, unlike the ATMEGA which will be running off 5V.

Boost Converter: In order to drive the Nixie tubes, we’ll need around a 170V potential on the anode. For step-up designs like this I usually use the MAX1771 controller, which tends to work well for high voltage boost circuits. I ran into an issue with the previous design where there was a barely audible ringing, which I think was due to small traces on the boost circuit. This time I resolved to use much larger fills for the high current path on the boost circuit to try and eliminate that issue.

At the end of the day, the control board is split into three broad nets - the 12V input from the barrel jack on the back feeds the boost circuit on the right of the PCB, as well as feeding the 5V logic through an ordinary linear regulator. Since the current draw on the MCU isn’t significant, the regulator doesn’t seem to get too hot. The 5V logic is then linked to the WiFi module through some level shifters, and the power is fed through a second regulator.

Final layout of the control board, with room to spare

The middle board is fairly boring - in order to drive the nixies, the active number needs the cathode pulled low to allow current to flow. In order to handle any possible high potential on the cathode lines, each one is connected to ground through a high voltage MPSA42 transistor, with a collector-emitter voltage of 300V (more than enough for the 170V that it could be required to handle in this circuit). Since the MCU can’t drive all 45 cathode pins itself, I’ll be using ordinary LS595 shift registers to translate serial data from the MCU into parallel output to all the cathode driving transistors. If we wanted a design that could possibly be used for other numeric display functions we’d wire up all of the digits on the nixie tubes, but since that would add significantly more connections (and so more 595 chips (and so more complexity)) I’ve opted not to here.

The shift register mux board

Now that all of the hardware is together, the only thing left is to program the thing. The full code is available here, if you want to have a read through the entire thing, but the gist is thus:

  • On boot, and every hour thereafter, ask the NTP server for the current time
  • Transform the time to local time, and store the difference between the current local time and the current internal monotonic clock.
  • Every 50 milliseconds, calculate the current local time from the stored offset and the internal millisecond clock, and convert it into a series of bytes to be written to the shift registers (a rough sketch of this bookkeeping follows below)
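
The firmware itself is Arduino C++ (excerpted below), but the offset bookkeeping is simple enough to sketch in a few lines of Python; here millis() stands in for the Arduino function of the same name, and update_offset corresponds to the updateOffset call you’ll see in parseNtpResponse:

import time

def millis():
    # Stand-in for Arduino millis(): milliseconds since program start.
    return int(time.monotonic() * 1000)

ntp_offset_ms = 0

def update_offset(ntp_local_time_secs):
    # Called whenever an NTP response arrives: remember the difference between
    # wall-clock local time (from NTP) and our free-running millisecond counter.
    global ntp_offset_ms
    ntp_offset_ms = ntp_local_time_secs * 1000 - millis()

def current_local_time_secs():
    # Called on every display tick: reconstruct local time from the monotonic
    # counter plus the stored offset.
    return (millis() + ntp_offset_ms) // 1000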

The most interesting part to write was the NTP code - I’ve never gone in depth on the NTP spec before, and like many RFCs it’s an interesting read. Version 4, which we’ll use here, is defined in RFC #5905. Under section 7.3, in figure 8, we can see a nice diagram of the packet header format. This is all we need to be able to get the time from the net, as demonstrated here:

/**
 * Issue an NTP sync request to our timeserver.
 * This involves building up a packet by hand - reference to RFC5905 is key
 * https://tools.ietf.org/html/rfc5905#section-7.3
 */
unsigned long sendNTPpacket(IPAddress& address) {
    // Zero out our packet buffer
    memset(packetBuffer, 0, NTP_PACKET_SIZE);

    // Our first byte contains the Leap Indicator, Version Number and Mode.
    packetBuffer[0] = (
        // Leap indicator is set to 3 (clock not sync'd)
        0x03 << 6 |
        // Version number is 4, the current version of NTP
        0x04 << 3 |
        // Our mode is 3, client
        0x03 << 0
    );

    // Next is stratum, in this case unspecified
    packetBuffer[1] = 0x00;

    // Polling interval - the maximum interval between successive messages.
    // This is a funky variable; it's in log2 seconds.
    // 6 is the recommended lower bound, so we'll use that
    packetBuffer[2] = 0x06;

    // The precision of our own clock.
    // This is also in log2 seconds (signed). So to specify microseconds, that's
    // 1 / 1,000,000 seconds. log_2(1000000) is 20 (well, 19.9), so for one
    // millionth of a second we want -20.
    packetBuffer[3] = 0xEC;

    // 8 bytes intentionally left blank for the root delay and root dispersion.
    // These count time lags for the reference clock, and we don't really care.

    // Next 32 bits are the Reference ID. I can't quite determine what exactly
    // the spec wants here, but 'INIT' is a valid kiss code for not having
    // synchronized yet so we'll go with that
    packetBuffer[12]  = 'I';
    packetBuffer[13]  = 'N';
    packetBuffer[14]  = 'I';
    packetBuffer[15]  = 'T';

    // There are more fields in an NTP packet, but we don't care about them here
    // Now that the packet is all set, we can send it off to the timeserver
    // NTP runs on port 123
    Udp.beginPacket(address, 123);
    Udp.write(packetBuffer, NTP_PACKET_SIZE);
    Udp.endPacket();
}

Once we’ve sent off that request, we hope that we’ll get a response back, and can deal with it pretty easily:

/**
 * Parse the response datagram from the NTP timeserver
 */
void parseNtpResponse() {
    // Read the response packet into the buffer
    Udp.read(packetBuffer, NTP_PACKET_SIZE);

    // The info we care about, the NTP timestamp, begins at byte 40 in the
    // packet and is 4 bytes long. Grab those bytes and convert them to an
    // unsigned long.
    unsigned long highWord = word(packetBuffer[40], packetBuffer[41]);
    unsigned long lowWord = word(packetBuffer[42], packetBuffer[43]);

    // NTP counts seconds since 1900, not seconds since 1970.
    unsigned long secsSince1900 = highWord << 16 | lowWord;

    // So to convert to UNIX epoch, just need to subtract 70 years in seconds
    const unsigned long seventyYears = 2208988800UL;
    unsigned long epoch = secsSince1900 - seventyYears;

    // Now, convert that epoch to our local timezone
    unsigned long localTime =  usEastern.toLocal(epoch);

    // And update our millisecond offset
    updateOffset(localTime);
}

Finally, we have our update code. It’s pretty basic - switching on each of the digits allows us to easily control each bit output, but seems verbose. It may be possible to get a better density using some refined bit twiddling, but the explicit form is easy to modify, and handles the fact that our outputs are somewhat willy-nilly.

/**
 * Takes the hours, minutes and seconds and maps them into bits on the shift
 * registers. Elegant? Perhaps not. Functional? Yes.
 */
void write595Time(uint8_t hours, uint8_t minutes, uint8_t seconds) {

    // Create an output buffer
    uint8_t out[] = { 0, 0, 0, 0, 0, 0 };

    // Low 3 bits on the first shift register control the hours tens
    switch (hours / 10) {
        case 0: out[0] |= (1 << 0); break;
        case 1: out[0] |= (1 << 1); break;
        case 2: out[0] |= (1 << 2); break;
    }

    // The upper bits of the first, and lower of the second, registers control
    // hours ones
    switch (hours % 10) {
        [...]
    }

    [...]

    // Now shift out the data to the registers
    digitalWrite(N_RCK, LOW);
    for (int i = 5; i >= 0; i--) {
        shiftOut(N_SER, N_SCK, MSBFIRST, out[i]);
    }
    digitalWrite(N_RCK, HIGH);
}

The easily editable switch tables allowed for some last minute tweaking of values, after I realized that some of my between-board traces had been pin-shifted slightly.

Testing the output mapping

With the code loaded, I once again have a functioning timepiece on my desk. At some point I’ll fabricate some sort of housing, but until then the bare board aesthetic will have to do.

The assembled version two

Bonus GIF

All of the EAGLE design files for the clock, as well as the code for the atmega, are available in this Github repo.

Parsing OFX with Erlang and leex

Companion code for this post available on Github

As part of a push to make keeping track of my finances easier without surrendering banking credentials to popular money-management tools, I have been working on a project that allows me to track my incomes and expenditures in a database, with a simple companion app for adding transactions and running visualizations of the data. But one of the major hurdles to keeping track of my spending was that I had to manually enter each and every transaction, not just categorize it. To solve this problem, we can tap into the data used by apps like Quicken to manage your banking information - OFX. Here, we’ll go over what OFX is, how to get data from your bank in OFX format and how to lex and parse that data to make it useful.

If you want to skip ahead, the full library is on Github.

History of OFX

OFX is a product of collaboration between Microsoft and Intuit (of Quicken) in the late 90s. The initial versions were built on top of SGML, which is a precursor of XML. As used in OFX, there are no closing tags for leaf values in SGML - a fact that makes modern XML parsers unsuitable for translating it into a document that we can then work with in code. Banks that offer an ‘Online banking with Quicken’ feature will usually do so through an endpoint that speaks OFX, and access to this API usually costs around $10 per month, depending on the bank.

Fetching OFX from an institution

As a prerequisite for parsing OFX data, we need to acquire some. OFX data transfer takes place over a single API endpoint, by sending an OFX document with one or more stanzas in it and receiving another document with a response. As an example, here’s a request to fetch account information from my Chase account (certain information redacted, of course):

<OFX>
  <SIGNONMSGSRQV1>
    <SONRQ>
      <DTCLIENT>20170326192550
      <USERID>my_bank_username
      <USERPASS>my_bank_password
      <LANGUAGE>ENG
      <FI>
        <ORG>B1
        <FID>10898
      </FI>
      <APPID>QWIN
      <APPVER>2200
      <CLIENTUID>9a7b311e-b6d0-4836-ac07-22d144cfc836
    </SONRQ>
  </SIGNONMSGSRQV1>
  <SIGNUPMSGSRQV1>
    <ACCTINFOTRNRQ>
      <TRNUID>3a23ee72-dcea-4c42-a509-05f6700545a9
      <CLTCOOKIE>1
      <ACCTINFORQ>
        <DTACCTUP>19691231
      </ACCTINFORQ>
    </ACCTINFOTRNRQ>
  </SIGNUPMSGSRQV1>
</OFX>

As you can see, there are two stanzas in this request - the first, the SIGNONMSGSRQV1, is common to all requests you will make to the server. It identifies who you are (USERID, USERPASS) and which bank you want to talk to - Chase in this case, represented by its ORG and FID identifiers. Values for your bank can be found online; GNUCash has a good list here. This section also identifies the application we are using to talk to the bank. In this case I am ‘Quicken’ version 2200. Some banks will refuse to talk to you unless you tell them that yes, you are definitely Quicken.

The second stanza is our actual request for information - we are making an account information transaction request (ACCTINFOTRNRQ), with a unique transaction UUID, with an account information section stating that we last checked for account info the day before the UNIX epoch, and so should be assumed to know nothing.

To send this request to our bank, we will attach our generic OFX header and then send our request off, being sure to state the content type and Connection: close, which seems to be necessary for some banks.

ofx_request(Url, Body) ->
    Headers = [
        {"user-agent", "InetClntApp/3.0"},
        {"connection", "close"},
        {"accept", "*/*, application/x-ofx"}
    ],
    ContentType = "application/x-ofx",
    {ok, {_, _, Resp}} = httpc:request(
        post,
        {Url, Headers, ContentType, lists:flatten(Body)},
        [],
        [{body_format, binary}]
    ),
    {ok, Resp}.

Lexing OFX

Hopefully, the bank will respond to our information request with a nice blob of SGML (indented for readability):

OFXHEADER:100
DATA:OFXSGML
VERSION:103
SECURITY:NONE
ENCODING:USASCII
CHARSET:1252
COMPRESSION:NONE
OLDFILEUID:NONE
NEWFILEUID:6841bcd0-bc42-11e6-9ef8-f30ad20f18fe

<OFX>
  <SIGNONMSGSRSV1>
    <SONRS>
      <STATUS>
        <CODE>0
        <SEVERITY>INFO
        <MESSAGE>SUCCESS
      </STATUS>
      <DTSERVER>20170319173139.086[-4:EDT]
      <LANGUAGE>ENG
      <FI>
        <ORG>B1
        <FID>10898
      </FI>
    </SONRS>
  </SIGNONMSGSRSV1>
  <SIGNUPMSGSRSV1>
    <ACCTINFOTRNRS>
      <TRNUID>20161207002420.000
      <STATUS>
        <CODE>0
        <SEVERITY>INFO
      </STATUS>
      <CLTCOOKIE>1
      <ACCTINFORS>
        <DTACCTUP>20170319173139.379[-4:EDT]
        <ACCTINFO>
          <DESC>CREDIT CARD
          <CCACCTINFO>
            <CCACCTFROM>
              <ACCTID>my_credit_card_number
            </CCACCTFROM>
            <SUPTXDL>Y
            <XFERSRC>N
            <XFERDEST>N
            <SVCSTATUS>ACTIVE
          </CCACCTINFO>
        </ACCTINFO>
      </ACCTINFORS>
    </ACCTINFOTRNRS>
  </SIGNUPMSGSRSV1>
</OFX>

Excellent. Now let’s define a representation for this data that we can work with more easily in Erlang. There are two distinct node types: a node either has a value and no close tag, or some children and a close tag. Let’s represent them as two different records, as such:

-record(ofx_node, {
          name :: nonempty_string(),
          children :: [#ofx_leaf{}]
         }).
-record(ofx_leaf, {
          name :: nonempty_string(),
          value :: nonempty_string()
         }).

Enter Leex

Leex is a lexer generator: given a set of rules, it produces a lexer that takes our blob of OFX text and turns it into a list of meaningful tokens. In order to do so, we need to specify those rules first. Leex input files have three sections: Definitions, Rules and Erlang code.

The definitions section is a set of named patterns (effectively regular expression macros) that can then be used for building up rules. For example, U can be defined as [A-Z], a shorthand for all uppercase letters. L can then be all lowercase ([a-z]), and the two can be combined to refer to all letters as ALPHA = ({U}|{L}).

Once we have a set of definitions for character groups, we can then write the rules section. This is where the requisites for tokenisation are defined - for example, we want to emit a token every time we see an opening tag, and want to include in that token the name of the tag. On the left hand side of the rule, we write the match expression - in this case, <({TAGCHAR})+>, for one or more characters in the set of allowable tag names bounded by angle brackets. On the right hand side, we then specify what the lexer should do when it encounters something that matches this pattern. In this case we want to emit a token, so we’ll write {token, {opentag, lists:sublist(TokenChars, 2, TokenLen-2)}}. This means it will emit a token that is a 2-tuple of the atom opentag and a substring of the matched string that removes the enclosing ‘<>‘. So, for example, if the lexer encountered the tag <OFX> it would then emit the token {opentag, "OFX"}.

The final section allows for the definition of generic Erlang methods that can then be used in the right hand side of rules. For example, we could take our substringing code from the match rule we just defined and place it in a convenience method in the code section.

Once we are finished writing our rules, we end up with a leex file that looks like this:

Definitions.

U = [A-Z]
L = [a-z]
D = [0-9]
SAFESYM = [_\-.:+]
SYM = [_\-.:/*+\[\]']
WHITESPACE = [\s\t\n\r]
ALPHA = ({U}|{L})
ALNUM = ({ALPHA}|{D})
ALSYM = ({ALNUM}|{SYM}|{WHITESPACE})
TAGCHAR = ({ALNUM}|{SAFESYM})

Rules.

<({TAGCHAR})+>    : {token, {opentag, lists:sublist(TokenChars, 2, TokenLen-2)}}.
</({TAGCHAR})+>   : {token, {closetag, lists:sublist(TokenChars, 3, TokenLen-3)}}.
{WHITESPACE}+ : skip_token.
{ALSYM}+        : {token, {string, string:strip(TokenChars)}}.

Erlang code.

We emit three kinds of tokens - opentag, when a tag is opened, closetag, when a tag is closed and string when we encounter a string literal (tag value). With just these three types, we can then build a parser that can turn this list of tokens into a document tree.

Parsing the tokens

As stated when we built our records, we only have two cases we need to deal with here - leaf nodes, which will always be [{opentag, Tag}, {string, Value}] and parent nodes, which will be [{opentag, Tag}, ...tag_children..., {closetag, Tag}]. This means that leaf nodes can be parsed easily by matching on the head of the tag list, and our more complex case of a parent node can be handled by a secondary method that accumulates all nodes until it encounters a specified terminal node. The implementation is as follows:

% Parses a list of tags into an OFX data tree.
% Will error out if there are tokens that cannot be parsed as part of the tree.
parse(Tags) ->
    {Tree, Unparsed} = parse_node(Tags),
    [] = Unparsed,
    Tree.

% Parse a single OFX node from tokens.
% Returns the node, and any unused tokens.
parse_node([{opentag, Tag}|[{string, Value}|Tags]]) ->
    {#ofx_leaf{name=Tag, value=Value}, Tags};
parse_node([{opentag, Tag}|Tags]) ->
    {Children, Tags2} = parse_node_list(Tag, Tags),
    {#ofx_node{name=Tag,children=Children}, Tags2}.

% Convenience method for parse_node_list/3.
parse_node_list(EndTag, Tags) ->
    parse_node_list(EndTag, Tags, []).

% Parses a list of child nodes. Stops parsing when a {closetag, } tuple is found
% with a name matching the EndTag.
parse_node_list(_EndTag, [], Nodes) ->
    Nodes;
parse_node_list(EndTag, [Tag|Tags], Nodes) ->
    {Node, Tags2} = parse_node([Tag|Tags]),
    case hd(Tags2) of
        {closetag, EndTag} ->
            {[Node|Nodes], tl(Tags2)};
        _ ->
            parse_node_list(EndTag, Tags2, [Node|Nodes])
    end.

This isn’t robust against malformed SGML, and will need as many stack frames as the tree is deep, but it gets the job done. If we now take our list of lexed tags from the previous step, we can run them through the parser and we should get a workable tree of records like so:

1> {ok, Client} = ofx:new_client(
        "username", "password", "B1", "10898", "https://ofx.chase.com").
{ok,<0.415.0>}
2> ofx_client:list_accounts(Client).
{ofx_node,"OFX",
    [{ofx_node,"SIGNUPMSGSRSV1",
        [{ofx_node,"ACCTINFOTRNRS",
            [{ofx_node,"ACCTINFORS",
                [{ofx_node,"ACCTINFO",
                    [{ofx_node,"CCACCTINFO",
                        [{ofx_leaf,"SVCSTATUS","ACTIVE"},
                         {ofx_leaf,"XFERDEST","N"},
                         {ofx_leaf,"XFERSRC",[...]},
                         {ofx_leaf,[...],...},
                         {ofx_node,...}]},
                         {ofx_leaf,"DESC","CREDIT CARD"}]},
                 {ofx_leaf,"DTACCTUP","20170328125651.415[-4:EDT]"}]},
             {ofx_leaf,"CLTCOOKIE","1"},
             {ofx_node,"STATUS",
                 [{ofx_leaf,"SEVERITY","INFO"},{ofx_leaf,"CODE","0"}]},
             {ofx_leaf,"TRNUID",
                 "12a14b40-2154-43ab-b986-67200cab0ec3"}]}]},
            [...]

Using the methods exposed in the ofx_tree module, we can then easily parse out the information we might care about:

OfxRoot = ofx_client:list_accounts(Client),
SIGNUPMSGSRSV1 = ofx_tree:get_child("SIGNUPMSGSRSV1", OfxRoot),
ACCTINFOTRNRS = ofx_tree:get_child("ACCTINFOTRNRS", SIGNUPMSGSRSV1),
ACCTINFORS = ofx_tree:get_child("ACCTINFORS", ACCTINFOTRNRS),
Accounts = ofx_tree:get_children("ACCTINFO", ACCTINFORS),
lists:map(
    fun(Acct) -> io:format("Got account: ~p~n", [Acct]) end,
    Accounts
).

Custom FindBugs detectors in Android

Companion code for this post available on Github

Modern compilers can detect all sorts of things, from the humble type error to mismatched format strings, but in some cases it’s just not feasible or the use case is not widespread enough for an error pattern to be detected at compile time. Luckily, in the Java/Android ecosystem, there are two tools that exist to take compile(ish)-time checking to the next level - Android Lint, a tool supplied as part of the Android SDK for catching Android-specific errors (resource size inconsistencies, missing translations, etc) and the FindBugs Project, a well established project from the University of Maryland, and the one we will be digging into here. We’ll take a quick look at what it is, go over a small refresher on Java’s try-with-resources pattern and then dive into writing our own detector that will ensure all Cursor operations are wrapped in one of these try-with-resources blocks.

Some background

Findbugs is a static analysis tool that operates on compiled Java bytecode to detect code that is deemed acceptable by the compiler, but is not necessarily what the programmer had in mind. Examples would be detecting a null return from a method that should return Boolean, inconsistent synchronization of variables and unnecessary math operations.

At its core, FindBugs is powered by the Apache BCEL, a library for the inspection and manipulation of compiled Java bytecode. On top of this, FindBugs adds some extra parsing for easy access to operands, a visitor pattern for iterating over bytecode, and a mechanism for accumulating and displaying bugs. If you don’t already use FindBugs as a part of your Android testing and deployment pipeline, even just the core detectors are well worth adding to a project. An example of how to add FindBugs as a task to an existing Android project (with gradle) can be seen here.

Customization

But what happens if you have a code case that is too specific to your application (e.g. involves Android classes) or to your internal practices (style guides, design patterns, etc)? Luckily, FindBugs makes it fairly simple to add additional detectors, and it’s even possible to include a set of custom detectors as part of an existing Android project, to be run alongside the built in detectors whenever the project is tested by hand or by a CI server.

A complete, buildable Android project with the detectors built in this post is online here if you wish to use it as an implementation reference or just follow along.

The use case

In Java 7, the try-with-resources pattern was added. This allowed for the declaration of resources as part of the try() header that would be automatically closed when the block exited, either normally or with an exception. This is very convenient when dealing with files, sockets, database cursors or other objects that must be closed when you are finished with them. Naturally, you’ll want to implement this everywhere you use Cursors in Android, because it’s a nice simple way to avoid leaking them:

try (Cursor c = db.query(...)) {
    c.moveToFirst();
    while (!c.isAfterLast()) {
        return new Foo(
                c.getString(c.getColumnIndex(...))
                ...
        );
    }
}

Alas, if you really want to target as many Android users as possible, you inevitably have to make sacrifices for compatibility. One of them is that try-with-resources requires API 19, which cuts off the small but not wholly insignificant ~10% (at time of writing) of users on API 18 and older. There is a workaround, however, which is to use an explicit finally block to close your cursors:

Cursor c = db.query(...);
try {
    c.moveToFirst();
    while (!c.isAfterLast()) {
        return new Foo(
                c.getString(c.getColumnIndex(...))
                ...
        );
    }
} finally {
    c.close();
}

This achieves the same result, albeit with slightly more lines. But good enough. Now the issue we want to address is that every now and then, someone is going to forget to wrap their cursor operations, and leak one. javac won’t catch it, Android Lint will only catch it sometimes, and neither really cares about using try-with-resources. So let’s implement our own detector for FindBugs that does! As acceptance criteria, let’s say that our detector needs to be able to:

  • Detect cursor operations (other than Cursor#close) that are called outside of a try {} block
  • Detect try/catch blocks in methods that open cursors that do not close the cursor as part of their cleanup routine.

Detective work

First things first, we need to create a new detector. In order to make it easier to bundle my detectors with my app, I’ve added them to a module in the same project tree as can be seen here. The key ingredients are three files:

  • findbugs.xml: Your main plugin definition. This declares your plugin’s package, which classes within it are detectors, and which bugs they can be expected to surface.
  • messages.xml: A collection of strings that describe your plugin, detectors and each bug instance that you can raise.
  • A detector! This should be a class that extends either the OpcodeStackDetector or BytecodeScanningDetector.

For this detector, we’ll extend the more basic BytecodeScanningDetector since we don’t really need to fuss with the stack at all. Since at this point we don’t really know that much about how we’ll actually write this detector, the best first thing to do is take a look at how the bytecode for the cases we care about is structured. So let’s code up a quick ‘detector’ that just prints out the code for methods that involve cursors:

public class TestDetector extends BytecodeScanningDetector {

    private static final String ANDROID_CURSOR = "Landroid/database/Cursor;";

    @Override
    public void visitMethod(Method method) {
        super.visitMethod(method);

        // Fetch the table of local variables for this new method
        LocalVariableTable localVariableTable = method.getLocalVariableTable();

        // If any of the local variables in this method are of the type Cursor,
        // then let's print a dump of the method's bytecode.
        if (variableTableContainsType(localVariableTable, ANDROID_CURSOR)) {
            System.out.println(method.getCode());
        }
    }

    /**
     * Simple method that iterates over a variable table and returns whether or
     * not any of the entries have the specified type signature.
     * @param table Local variable table
     * @param type Java class type we're searching for
     * @return True if any of the local variables are of class type
     */
    private static boolean variableTableContainsType(LocalVariableTable table, String type) {
        for (LocalVariable variable : table.getLocalVariableTable()) {
            if (type.equals(variable.getSignature())) {
                return true;
            }
        }
        return false;
    }

}

And in order to get a more readable output, let’s create a dummy test method that assumes it has and then closes a cursor, with some string literals to help us pinpoint operations:

public void tryFinallyExample() {
    Cursor c = null;
    System.out.println("Before");
    try {
        System.out.println("Try");
    } finally {
        System.out.println("Finally");
        c.close();
    }
    System.out.println("After");
}

Now let’s assemble both our project and our ‘detector’ and then run the detector (outside of gradle, since gradle will swallow our System.out debugging lines)

ross@mjolnir:/h/r/P/A/ExampleFindbugs$ ./gradlew :app:assembleDebug :findbugs:assemble
[...]
ross@mjolnir:/h/r/P/A/ExampleFindbugs$ findbugs \
    -pluginList ./findbugs/build/libs/findbugs.jar \  # Our compiled 'detector'
    -effort:max \
    ./app/build/intermediates/classes/debug/com/schlaikjer/findbugs/database/LeakyDatabase.class

Then as output we find the following:

Code(max_stack = 2, max_locals = 3, code_length = 61)
0:    aconst_null
1:    astore_1
2:    getstatic         java.lang.System.out:Ljava/io/PrintStream; (36)
5:    ldc               "Before" (37)
7:    invokevirtual     java.io.PrintStream.println:(Ljava/lang/String;)V (38)
10:   getstatic         java.lang.System.out:Ljava/io/PrintStream; (36)
13:   ldc               "Try" (39)
15:   invokevirtual     java.io.PrintStream.println:(Ljava/lang/String;)V (38)
18:   getstatic         java.lang.System.out:Ljava/io/PrintStream; (36)
21:   ldc               "Finally" (40)
23:   invokevirtual     java.io.PrintStream.println:(Ljava/lang/String;)V (38)
26:   aload_1
27:   invokeinterface   android.database.Cursor.close:()V (35)  1   0
32:   goto              #52
35:   astore_2
36:   getstatic         java.lang.System.out:Ljava/io/PrintStream; (36)
39:   ldc               "Finally" (40)
41:   invokevirtual     java.io.PrintStream.println:(Ljava/lang/String;)V (38)
44:   aload_1
45:   invokeinterface   android.database.Cursor.close:()V (35)  1   0
50:   aload_2
51:   athrow
52:   getstatic         java.lang.System.out:Ljava/io/PrintStream; (36)
55:   ldc               "After" (41)
57:   invokevirtual     java.io.PrintStream.println:(Ljava/lang/String;)V (38)
60:   return

Exception handler(s) =
From    To  Handler Type
10  18  35  <Any exception>(0)

Attribute(s) =
LineNumber(0, 74), LineNumber(2, 75), LineNumber(10, 77), LineNumber(18, 79),
LineNumber(26, 80), LineNumber(32, 81), LineNumber(35, 79), LineNumber(44, 80),
LineNumber(52, 82), LineNumber(60, 83)
LocalVariable(start_pc = 0, length = 61, index = 0:com.schlaikjer.findbugs.database.LeakyDatabase this)
LocalVariable(start_pc = 2, length = 59, index = 1:android.database.Cursor c)
StackMap(
    (
        FULL, offset delta=35,
         locals={
            (type=Object, class=com.schlaikjer.findbugs.database.LeakyDatabase),
            (type=Object, class=android.database.Cursor)
        },
        stack items={
            (type=Object, class=java.lang.Throwable)
        }
    ),
    (
        SAME,
        offset delta=16
    )
)

Looking at the code dump, we learn something interesting about how the try ... finally block has been implemented at the bytecode level. The body of our finally appears in two places - once at the end of the contents of the try block at opcode 18, and once at opcode 36. So rather than having a single finally label and jumping to it both when an exception is thrown and when one isn’t, the two paths exist separately. If the try block exits normally, then control flows through the finally copy in codes 18-32, after which it jumps to the ‘After’ code we added at code 52 and exits.

If, however, an exception is thrown, then the source is checked against the exception table for the method. We have one entry, for any exception type, that covers codes 10-18 and has a handler located at code 35. Codes 10-18 (not inclusive) are our try block, so this adds up. Code 35 is just after the jump to ‘After’ that would have ended the method in the no-exception case, and is the start of our exception handling routine. In this routine, we can see that the first thing we do is astore_2: take the topmost value from the operand stack and store it in local variable 2. If we look at the StackMap dump at the end, we can see that there’s an entry for this section - one stack item, of type Throwable. So far, so good. We then call the same finally block code that was called in the other branch, but afterwards we re-throw the exception by loading it back onto the stack (aload_2, code 50) and throwing it (athrow, 51).

Bytecode wrangling

OK, now that we have an idea of what our try structure looks like as bytecode and the data we have available to us at detector runtime, let’s take a look at how we can meet the criteria we set out earlier. First, let’s tackle the easier case where a cursor method is called while we’re outside of a try block.

So firstly, we want to be able to know if a given instruction is a method call. Conveniently, our BytecodeScanningDetector extends the DismantleBytecode class, which at each opcode decodes the instruction as well as any arguments and makes them readily accessible. This means that in order to check if we’re at a method call on a cursor, we need only do the following for each opcode we see:

private static final String ANDROID_CURSOR_CLASS_CONST_OPERAND = "android/database/Cursor";

private void checkIfCursorMethodsCalledOutsideTry(int seen) {
    // Not a method call, return
    if (!isMethodCall()) {
        return;
    }

    // If the method is not being called on a cursor, return
    if (!ANDROID_CURSOR_CLASS_CONST_OPERAND.equals(getClassConstantOperand())) {
        return;
    }

    // Figure out try block later
}

Now that we can know if we’re at a call to an instance method of a cursor, we need to be able to check if the call is happening inside of a try. Luckily, we can use the info encoded in the ‘Exception handlers’ section of the code dump above to help us out. We can see that we have one handler registered, which covers codes 10-18 and has a handler method at code 35. Since it looks like bytecode indexes 10-18 are the body of the try block, we can easily use the offsets from the exception table to find out if a given program counter index is inside a try block or not! So let’s translate that logic to code:

private static boolean isInTryBlock(Method method, int pc) {
    CodeException[] exceptionTable = method.getCode().getExceptionTable();
    for (CodeException exception : exceptionTable) {
        if (exception.getStartPC() <= pc && pc < exception.getEndPC()) {
            return true;
        }
    }

    return false;
}

and update our detector method from before:

private static final String ANDROID_CURSOR_CLASS_CONST_OPERAND = "android/database/Cursor";
private static final String CLOSE = "close";

private void checkIfCursorMethodsCalledOutsideTry(int seen) {
    // Not a method call, return
    if (!isMethodCall()) {
        return;
    }

    // If the method is not being called on a cursor, return
    if (!ANDROID_CURSOR_CLASS_CONST_OPERAND.equals(getClassConstantOperand())) {
        return;
    }

    // If a method is called on a cursor outside a try block, and that method is not
    // close, that's an error!
    if (!CLOSE.equals(getNameConstantOperand()) && !isInTryBlock(getMethod(), getPC())) {
        System.out.println("Cursor." + getNameConstantOperand() + " called outside of try block!");
    }
}

Excellent! Now all we need to do is pass the bug info up into findbugs so it can be processed and displayed with other detector output. This can be done with a quick snippet:

bugAccumulator.accumulateBug(
        new BugInstance(
                this,
                "DB_CURSOR_METHODS_CALLED_OUTSIDE_TRY",
                HIGH_PRIORITY
        ).addClassAndMethod(this),
        this
);

Word of warning: the detector is highly stateful! When sawOpcode is called, all of the isMethodCall() / etc. checks, and the line numbers recorded by the above accumulateBug call refer to the current opcode. If you, e.g., have a bug case where you mark the start of a possible bug at one opcode and confirm it at a later point, accumulating the bug at the second location will report it as though that is where it occurred!

That detector wasn’t too bad at all. Let’s move on to our second goal - detecting try {} blocks in methods with cursors that don’t close those cursors. In order to check whether we are operating inside a finally block, we’re going to use a trick similar to the one we used for the try block check. This time we can’t just use the numbers provided, though, since the exception table only gives us the entry point of the handler. That’s OK, because we know that the handler will have to end in either a goto, areturn or athrow.

/**
 * Finally blocks are defined as the PCs between the handler PC and the next call to athrow,
 * goto or return.
 *
 * @param method
 * @param pc
 * @return
 */
private int getFinallyBlockIndex(Method method, int pc) {
    CodeException[] exceptionTable = method.getCode().getExceptionTable();
    int blockIndex = 0;
    for (CodeException exception : exceptionTable) {
        if (exception.getHandlerPC() <= pc) {
            int pc2 = pc;
            int codeByte;
            while ((codeByte = getCodeByte(pc2)) != ATHROW && codeByte != ARETURN && codeByte != GOTO && pc2 < getMaxPC()) {
                pc2++;
            }
            if ((codeByte == ATHROW || codeByte == ARETURN || codeByte == GOTO) && pc < pc2) {
                return blockIndex;
            }
        }
        blockIndex++;
    }

    return -1;
}

Not the best runtime complexity to be calling this all the time, but good enough for now. This detection method is going to be a bit more stateful - for each opcode, we’re going to check if we’re in a finally block, and if so:

  • Add an entry to our local log of the start of the block
  • Check if the current opcode in the block is a method call
    • If it is, check if it’s on a cursor
      • If so, check if it’s a close
        • If it is, mark this finally block as good
        • If it’s not, then this block might be leaky!

So let’s port that logic over:

private final Map<String, FinallyInfo> suspectFinallys = new HashMap<>();

private static class FinallyInfo {
    boolean callsCursorClose;
    BugInstance bugInstance;

    public FinallyInfo(BugInstance instance) {
        this.callsCursorClose = false;
        this.bugInstance = instance;
    }
}

private void checkExceptionHandlersCloseCursors(int seen) {
    // Check if we're in a finally block
    int blockIndex = getFinallyBlockIndex(getMethod(), getPC());
    if (blockIndex < 0) {
        return;
    }

    // We create a bug instance immediately on entering *all* finally blocks;
    // this is just so that we get the line numbers in the right place.
    // If the finally block does close the cursor, we just toss the buginstance
    String finallyReference = getMethodName() + blockIndex;
    if (!suspectFinallys.containsKey(finallyReference)) {
        suspectFinallys.put(finallyReference, new FinallyInfo(new BugInstance(
                this,
                "DB_CURSOR_NOT_FINALLY_CLOSED",
                HIGH_PRIORITY
        ).addClassAndMethod(this)));
    }

    // Not a method call, return
    if (!isMethodCall()) {
        return;
    }

    // If the method is not being called on a cursor, return
    if (!ANDROID_CURSOR_CLASS_CONST_OPERAND.equals(getClassConstantOperand())) {
        return;
    }

    // If the method isn't close, return
    if (!CLOSE.equals(getNameConstantOperand())) {
        return;
    }

    // Mark this finally block as OK
    suspectFinallys.get(finallyReference).callsCursorClose = true;
}

And that’s pretty much it! There’s a little bit of extra accounting that is necessary to actually finish up and report all the bugs accumulated in that fashion, which can be found here for those curious. Now that we have these detectors set up, the next time we run this project through our CI system (in this case Jenkins), we should see it error out with our expected bug instances:

Perfect! No more unclosed cursors. From the basic ideas here, it should be possible to add checks on more or less any code pattern that you want to make sure to implement or avoid in your production code.

Profiling Android apps with Flamegraphs

Companion code for this post available on Github

TL;DR:

I built a tool to turn Android trace output into flame graphs. You can check out the source code here, or get started immediately by uploading a trace file here.

If you’ve ever tried to debug a performance issue in an Android app, you’ve probably become familiar with Traceview, which reads the .trace files generated by the Debug.startMethodTracing API call and displays them in a more or less readable manner. However, for me at least, Traceview is less than ideal. The interface is rather clunky, the scrolling behaviour is questionable (zooming and scrolling down at the same time? Just what I wanted?) and it’s very difficult to interpret the call chains that are consuming the most time, especially if multiple threads are involved.

Traceview. Isn't it obvious from the coloured bars what's going on?

One of the most useful performance visualization and analysis tools I am aware of is Brendan Gregg’s Flame Graphs, which make it easy to identify long-running sections of your code. However, I couldn’t find any existing tooling for converting the Android trace format to the flat stack format expected by the flamegraph generator, so it was time to get familiar with the internals of the format.
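
For reference, the folded stack format that flamegraph.pl consumes is about as simple as it gets: one line per unique call stack, with the frames joined by semicolons and followed by a count (in this case, time attributed to that stack). The class and method names below are made up, purely to show the shape:

main;MainActivity.onCreate;RecipeStore.load;JsonReader.parse 4123
main;MainActivity.onCreate;DatabaseHelper.open 861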

First things first, we need to gather a trace. In my case, I have an app that takes an appreciable amount of time to load the first Activity, so I’m going to add trace sections to onCreate, onStart and onResume, following this pattern:

private static final int MEGABYTE = 1024 * 1024;

@Override
protected void onCreate(Bundle savedInstanceState) {
    super.onCreate(savedInstanceState);
    // Call the resulting trace file 'onCreate.trace', and allow a 128MiB
    // buffer for collecting trace data.
    Debug.startMethodTracing("onCreate", 128 * MEGABYTE);
    // Existing onCreate code
    [...]
    // Stop method tracing
    Debug.stopMethodTracing();
}

Once that’s added, we can start up the app normally and wait for it to finish booting. Note that like all profilers, this trace mechanism adds overhead! Don’t treat the absolute timings in these traces as exact - base decisions on the differences between traces instead. Your app will also take noticeably longer to start while profiling is active - this is normal.

Once it has loaded, open up adb and take a look to check that your traces have been created:

ross@mjolnir:/h/ross$ adb shell ls -l /sdcard/Android/data/com.schlaikjer.cookbook/files/
total 112400
-rw-rw---- 1 u0_a122 sdcard_rw 57473027 2017-02-26 14:26 onCreate.trace
-rw-rw---- 1 u0_a122 sdcard_rw     6255 2017-02-26 14:26 onResume.trace
-rw-rw---- 1 u0_a122 sdcard_rw    60809 2017-02-26 14:26 onStart.trace

And then pull all of them up so that we can take a look at them:

ross@mjolnir:/h/ross$ for F in {Create,Start,Resume}; do adb pull /sdcard/Android/data/com.schlaikjer.cookbook/files/on$F.trace; done
4693 KB/s (57473027 bytes in 11.959s)
710 KB/s (60809 bytes in 0.083s)
79 KB/s (6255 bytes in 0.077s)

If we crack one open with our editor of choice, we can see that the files begin with three plain text sections, followed by what looks like a lot of binary data:

*version
3
data-file-overflow=false
clock=dual
elapsed-time-usec=18482984
num-method-calls=4086234
clock-call-overhead-nsec=767
vm=art
pid=17816
*threads
17816   main
17821   Jit thread pool worker thread 0
[...]
*methods
0x7b0   java.lang.BootClassLoader   getInstance ()Ljava/lang/BootClassLoader; ClassLoader.java
0x7ac   java.lang.ClassLoader   findLoadedClass (Ljava/lang/String;)Ljava/lang/Class;   ClassLoader.java
[...]
*end
SLOW^C^@ ^@&<95>
[...]

Some of these (elapsed time, number of calls, vm name &c) are pretty intuitive. For the rest of them, the best way to figure out what they are is to look at the code that generates them!

os << StringPrintf("%cversion\n", kTraceTokenChar);
os << StringPrintf("%d\n", GetTraceVersion(clock_source_));
os << StringPrintf("data-file-overflow=%s\n", overflow_ ? "true" : "false");
if (UseThreadCpuClock()) {
  if (UseWallClock()) {
    os << StringPrintf("clock=dual\n");
  } else {
    os << StringPrintf("clock=thread-cpu\n");
  }
} else {
  os << StringPrintf("clock=wall\n");
}
os << StringPrintf("elapsed-time-usec=%" PRIu64 "\n", elapsed);
if (trace_output_mode_ != TraceOutputMode::kStreaming) {
  size_t num_records = (final_offset - kTraceHeaderLength) / GetRecordSize(clock_source_);
  os << StringPrintf("num-method-calls=%zd\n", num_records);
}
os << StringPrintf("clock-call-overhead-nsec=%d\n", clock_overhead_ns_);
os << StringPrintf("vm=art\n");
os << StringPrintf("pid=%d\n", getpid());
if ((flags_ & kTraceCountAllocs) != 0) {
  os << StringPrintf("alloc-count=%d\n", Runtime::Current()->GetStat(KIND_ALLOCATED_OBJECTS));
  os << StringPrintf("alloc-size=%d\n", Runtime::Current()->GetStat(KIND_ALLOCATED_BYTES));
  os << StringPrintf("gc-count=%d\n", Runtime::Current()->GetStat(KIND_GC_INVOCATIONS));
}

Based on this, it looks like our version header consists of:

  • Version: The trace format version (3 for all devices I tested)
  • Data file overflow: The overflow_ flag seems to be set if the number of calls in the binary trace section overflows the buffer specified in the start trace call. If you see this set to true, you should re-run your trace with a larger buffer to ensure you aren’t missing any information.
  • Clock: Whether the trace data uses wallclock time, per-thread CPU time, or both. All tested devices reported both clock times.
  • Elapsed time: The total trace time, in microseconds.
  • Clock call overhead: The amount of time it takes to check the time. Any measurements close to or below this number should be assumed to be below the noise floor for this trace.
  • VM: Art or Dalvik
  • Pid: The process ID of the process under trace

There are also three fields that were not present in the trace we took a look at:

  • Alloc count: Number of allocated objects
  • Alloc size: Size of all allocated objects
  • GC count: The number of collections that have occurred

With that under our belt, let’s move on to the Threads section. This one is pretty simple:

void Trace::DumpThreadList(std::ostream& os) {
  Thread* self = Thread::Current();
  for (auto it : exited_threads_) {
    os << it.first << "\t" << it.second << "\n";
  }
  Locks::thread_list_lock_->AssertNotHeld(self);
  MutexLock mu(self, *Locks::thread_list_lock_);
  Runtime::Current()->GetThreadList()->ForEach(DumpThread, &os);
}

So each entry under the thread section is just a tuple of Thread ID and a human readable thread name. The method section is similar, but has a few more fields:

std::string Trace::GetMethodLine(ArtMethod* method) {
  method = method->GetInterfaceMethodIfProxy(kRuntimePointerSize);
  return StringPrintf("%#x\t%s\t%s\t%s\t%s\n", (EncodeTraceMethod(method) << TraceActionBits),
      PrettyDescriptor(method->GetDeclaringClassDescriptor()).c_str(), method->GetName(),
      method->GetSignature().ToString().c_str(), method->GetDeclaringClassSourceFile());
}

So the method section is a list of tuples of the following fields (a short parsing sketch follows this list):

  • Method ID
  • Declaring class
  • Method name
  • Method type signature
  • Declaring class’s source file
  • Method declaration line number (only present for some runtimes)
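
For illustration, each of these tab-separated lines can be pulled apart by splitting on tabs and decoding the hexadecimal ID; the thread lines work the same way, just with fewer fields. This is a minimal sketch, and parse_method_line/1 is a made-up name rather than what the linked parser necessarily uses:

% Sketch: split a tab-separated *methods line and decode the 0x-prefixed ID.
parse_method_line(Line) ->
    [IdField, Class, Name, Signature | SourceInfo] =
        binary:split(Line, <<"\t">>, [global, trim_all]),
    <<"0x", Hex/binary>> = IdField,
    MethodId = binary_to_integer(Hex, 16),
    {MethodId, Class, Name, Signature, SourceInfo}.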

That’s all the text sections dealt with. Now we can take a look at the binary data at the end of the file. Luckily, this section is actually described at the top of trace.h:

// File format:
//     header
//     record 0
//     record 1
//     ...
//
// Header format:
//     u4  magic ('SLOW')
//     u2  version
//     u2  offset to data
//     u8  start date/time in usec
//     u2  record size in bytes (version >= 2 only)
//     ... padding to 32 bytes
//
[...]
//
// Record format v3:
//     u2  thread ID
//     u4  method ID | method action
//     u4  time delta since start, in usec
//     u4  wall time since start, in usec (when clock == "dual" only)
//
// 32 bits of microseconds is 70 minutes.
//
// All values are stored in little-endian order.

The interesting thing to note here (and something that tripped me up for a while) is the method ID | method action section of the record format. If we take a look over in trace.cc, we can see how that’s implemented:

uint32_t Trace::EncodeTraceMethodAndAction(ArtMethod* method, TraceAction action) {
  uint32_t tmid = (EncodeTraceMethod(method) << TraceActionBits) | action;
  DCHECK_EQ(method, DecodeTraceMethod(tmid));
  return tmid;
}

Where TraceAction is defined as:

enum TraceAction {
    kTraceMethodEnter = 0x00,       // method entry
    kTraceMethodExit = 0x01,        // method exit
    kTraceUnroll = 0x02,            // method exited by exception unrolling
    // 0x03 currently unused
    kTraceMethodActionMask = 0x03,  // two bits
};

So with this data, we know that the method ID encoded in one of the trace records with the lower two bits masked off will match one of the method IDs in the *methods section of the plain text header. We can then use the lower two bits to work out whether each entry is a method entry or exit (via either return or stack unwind).
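
To make that concrete, here is a quick check in the Erlang shell using a made-up record value of 16#7ad. Masking off the low two bits recovers 16#7ac, which is findLoadedClass in the *methods dump above, and the low two bits themselves say it was a normal method exit:

1> MethodIdAction = 16#7ad.
1965
2> MethodIdAction band 16#FFFFFFFC.    % the method ID: 1964 = 16#7ac
1964
3> MethodIdAction band 16#00000003.    % the action: 1 = kTraceMethodExit
1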

Armed with this, let’s start writing a parser for these files. I chose Erlang as a learning exercise, and also because I wanted to make use of its excellent binary matching syntax in conjunction with binary comprehensions. Since we know the magic for the binary section of the trace file (SLOW), let’s take a look at how we can easily parse out the header and the records using binary matching.

-define(TRACE_HEADER_MAGIC, "SLOW").

% Find the location of the trace header
{HeaderPos, _} = binary:match(Data, <<?TRACE_HEADER_MAGIC>>),

% Match out the entire header specification into variables
<<?TRACE_HEADER_MAGIC, VersionBin:2/binary, DataOffsetBin:2/binary,
  StartTimeBin:8/binary, RecordSizeBin:2/binary>> = binary:part(Data, {HeaderPos, 18}),

% Remember all numbers are little endian
DataOffset = binary:decode_unsigned(DataOffsetBin, little),
RecordSize = binary:decode_unsigned(RecordSizeBin, little),

% Now that we have the header start and header size, we can start parsing out
% the call records themselves. First, excerpt the section of the trace that
% contains the binary data
SectionStart = HeaderPos + DataOffset,
SectionEnd = byte_size(Data),
RecordSection = binary_part(Data, {SectionStart, SectionEnd - SectionStart}),

% Now that we have the records, we can break them up based on the RecordSize
% that the header specified and parse them
Records = [Record || <<Record:RecordSize/binary>> <= RecordSection],
ParsedRecords = [parse_trace_record(Record) || Record <- Records].

As you can see, extracting the records section and parsing the header was pretty simple using the binary syntax. We use the same approach to parse out the records themselves:

parse_trace_record(Record) ->
    <<ThreadId:2/binary, MethodIdActionBin:4/binary,
      TimeDelta:4/binary, WallTimeDelta:4/binary>> = Record,

    % Decode the method ID and action from a binary to an integer
    MethodIdAction = binary:decode_unsigned(MethodIdActionBin, little),

    % Now remember that this is a 4-byte integer, and that the top bits
    % are the actual method ID
    MethodId = MethodIdAction band 16#FFFFFFFC,

    % While the action is the lower two bits.
    % Convert to an atom for readability
    MethodAction = case MethodIdAction band 16#00000003 of
                       16#00 -> enter;
                       16#01 -> exit;
                       16#02 -> unwind
                   end,

    #call_record{
       thread_id=binary:decode_unsigned(ThreadId, little),
       method_id=MethodId,
       method_action=MethodAction,
       time_delta=binary:decode_unsigned(TimeDelta, little),
       wall_time_delta=binary:decode_unsigned(WallTimeDelta, little),
       child_time=0
      }.

That’s most of the hard work! The full parser code, including the sections for the thread and method tables, can be seen in the final parser implementation here. Now that we have our call records, method IDs and thread IDs, we need to convert that data into the format the flame graph generator can handle. It expects semicolon-delimited stack frames, followed by a space and a number representing the time / samples / cycles spent in the final call of that stack. To calculate this, we iterate over each thread’s records and perform the following:

For a method entry: Push the method name onto a stack, so that we can keep track of what methods have been called. Also push the full method record onto a second stack, so that we can reference its timings later.

For a method exit/unwind: Here’s where the real logic happens. When a method exits, the top of the stack should hold the matching method entry record, and the difference between the two timestamps tells us how much time was spent in this call altogether. If there is a parent frame on the stack, we add that total to its child time - this is what lets us separate self time from time spent in callees. We then take the self time (the total minus any time spent in children) and add it to a map keyed on the full chain of method names currently on the stack, which deduplicates identical call chains.

Once we have iterated over all the calls, we should have a map of lists of method names to timings. From there, we can just join the names with semicolons, append a space and the timing, and it’s all set to be processed by flamegraph.pl.
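
To make the bookkeeping concrete, here is a minimal sketch of that per-thread pass, assuming the #call_record{} record from earlier and a Names map from method ID to a printable method name. The frame tuples and function names are mine for illustration and differ from the final implementation linked below; in particular, the sketch assumes every exit record matches the entry currently on top of the stack:

% Fold one thread's records into a map of call chain (list of names) -> self time.
collapse_thread(Records, Names) ->
    collapse(Records, Names, [], #{}).

collapse([], _Names, _Stack, Acc) ->
    Acc;
% Method entry: push a frame holding the entry time and a running child-time total
collapse([#call_record{method_action=enter, method_id=Id, time_delta=T} | Rest],
         Names, Stack, Acc) ->
    collapse(Rest, Names, [{Id, T, 0} | Stack], Acc);
% Method exit or unwind: pop the matching frame, compute the self time, credit
% our total time to the parent frame, and add the self time to this chain's bucket
collapse([#call_record{method_action=Action, method_id=Id, time_delta=T} | Rest],
         Names, [{Id, T0, ChildTime} | Parents], Acc)
  when Action =:= exit; Action =:= unwind ->
    Total = T - T0,
    Self = Total - ChildTime,
    % Key on the whole call chain, root first, this method last
    Chain = lists:reverse([maps:get(FId, Names)
                           || {FId, _, _} <- [{Id, T0, ChildTime} | Parents]]),
    Acc2 = maps:update_with(Chain, fun(V) -> V + Self end, Self, Acc),
    Parents2 = case Parents of
                   [{PId, PT0, PChild} | Up] -> [{PId, PT0, PChild + Total} | Up];
                   [] -> []
               end,
    collapse(Rest, Names, Parents2, Acc2).

A map entry of, say, [main, onCreate] with a self time of 1234 (made-up names and timing) would then be written out as the line main;onCreate 1234.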

The code implementing this logic can be found here.

Once we have the trace data, we can process the graph:

flamegraph.pl \
        --title "onCreate" \
        --hash \
        --countname "microseconds" \
        onCreate.flat > onCreate.svg

Et voilà!

In order to make this process simpler and less manual, the trace parsing and graph generation have been rolled into a simple server that you can run yourself. Instructions are available on Github, and there is also a public copy of the server running at https://aflame.rhye.org/ if you want to try uploading your own traces.

The full per-thread breakdown for the example onCreate trace we generated can be found online here.