Optimizing Erlang Binary Matching

Too much heap

Recently I’ve been implementing a binary file format parser in Erlang, taking advantage of the powerful binary matching primitives to cleanly consume and parse values out of the bytestream. However, once the parser was working, I found that processing a 377MiB input file resulted in over 6GiB of heap usage for my program! Using an order of mangitude more memory than the size of my input strikes me as less than ideal, so I started looking into how to profile and optimize binary processing.

The Erlang documentation website has a page relating to binary handling that was useful for getting started, but it took me some time playing with toy examples to really get a handle on which circumstances would result in binary copying or subpar performance, and how to interpret binary optimization warning messages. I’ll try and explain each one as best I can, but if you just want the conclusion you can skip to the bottom where I summarize the three key points I came away with.

Enabling Compiler Warnings

The first step in optimizing binary matching is to turn on relevant compiler warnings. To get binary optimization info for a module, compile it with erlc using the bin_opt_info flag (erlc +bin_opt_info module.erl) or, to always have it enabled, export the flag as a compiler option using export ERL_COMPILER_OPTIONS=bin_opt_info. If you are in the REPL, you can also get binary optimization output for the built in c command by passing it as an argument. c(module, [bin_opt_info]).

Rule 1: Always use << >>

One of the first quirks about binary match optimization, which took me some time to realize, is that simply passing a binary to a function head as a variable results in deoptimization. For example, here is a very simple function that just whittles down a binary:

1 test_head_variable(<<>>) ->
2     ok;
3 test_head_variable(Binary) when is_binary(Binary) ->
4     <<_:1/binary, Rest/binary>> = Binary,
5     test_head_variable(Rest).

Intuitively, this should be possible for the compiler to optimize - the first head matches the empty binary, and the second head requires a binary value, and the first thing it does is a match on that binary value. However!

> c(foo, [bin_opt_info]).
foo.erl:4: Warning: NOT OPTIMIZED: called function test_head_variable/1 does
not begin with a suitable binary matching instruction

What gives! Both functions start with a binary match. However, the second function head, despite always being a binary, is still an open variable capture which breaks the optimizer. Luckily, there exists a simple (if somewhat tiresome to repeatedly apply) fix:

1 test_head_binmatch(<<>>) ->
2     ok;
3 test_head_binmatch(<<Binary/binary>>) when is_binary(Binary) ->
4     <<_:1/binary, Rest/binary>> = Binary,
5     test_head_binmatch(Rest).

The only thing we’ve done is that instead of matching a raw variable called Binary in the second head, we’ve used a binary match of a single term, <<Binary/binary>>. However, this is enough to make the optimizer happy:

> c(foo, [bin_opt_info]).
foo.erl:3: Warning: OPTIMIZED: creation of sub binary delayed
foo.erl:4: Warning: OPTIMIZED: creation of sub binary delayed

Note that in this case, you could simply place the entire binary decomposition in the head of the function. However, if you have a lengthy match head (e.g, are writing a parser for a binary format), it can be considerably more readable to match the entire binary in the function head and then break it apart in the function body.

Even though this seems like a trivial change, it can have a surprisingly large effect on the performance of your program. Take for example this simple module that just consumes a binary:

make_big_binary(Size) ->
    make_big_binary(Size, <<>>).

make_big_binary(0, Binary) ->
    Binary;
make_big_binary(Size, Binary) ->
    make_big_binary(Size - 1, <<Binary/binary, 0>>).

timer(M, F, A) ->
    Start = os:system_time(),
    apply(M, F, A),
    _Diff = os:system_time() - Start.

consume_binary_unopt(<<>>) ->
    ok;
consume_binary_unopt(Binary) ->
    <<A:2/binary, B:2/binary, C/binary>> = Binary,
    consume_binary_unopt(C).

consume_binary_opt(<<>>) ->
    ok;
consume_binary_opt(<<Binary/binary>>) ->
    <<A:2/binary, B:2/binary, C/binary>> = Binary,
    consume_binary_opt(C).

As we can see, the two consume methods are identical except for the <<>> match in the function head. Running both of the forms with the timer wrapper, we get the following result:

1> BigBin = foo:make_big_binary(256 * 1024 * 1024).
2> foo:timer(foo, consume_binary_unopt, [BigBin]).
3577990533
3> foo:timer(foo, consume_binary_opt, [BigBin]).
901171248
4> 3577990533 / 901171248.
3.8094635416064673

Shockingly, the variant with the binary match syntax in the function head is almost 4 times faster! If we take a look at the Abstract Form of the compiled test code, we can see that the binary head is handled using a different pattern on the code level:

34> UnOptAbsForm.
{function,17,consume_binary_unopt,1,
          [{clause,17,[{bin,17,[]}],[],[{atom,18,ok}]},
           {clause,19,
                   [{var,19,'Binary'}],
                   [],
                   [{match,20,
                           {bin,20,
                                [{bin_element,20,{var,20,'A'},{integer,20,2},[binary]},
                                 {bin_element,20,{var,20,'B'},{integer,20,2},[binary]},
                                 {bin_element,20,{var,20,'C'},default,[binary]}]},
                           {var,20,'Binary'}},
                    {call,21,{atom,21,consume_binary_unopt},[{var,21,'C'}]}]}]}
35> OptAbsForm.
{function,23,consume_binary_opt,1,
          [{clause,23,[{bin,23,[]}],[],[{atom,24,ok}]},
           {clause,25,
                   [{bin,25,
                         [{bin_element,25,{var,25,'Binary'},default,[binary]}]}],
                   [],
                   [{match,26,
                           {bin,26,
                                [{bin_element,26,{var,26,'A'},{integer,26,2},[binary]},
                                 {bin_element,26,{var,26,'B'},{integer,26,2},[binary]},
                                 {bin_element,26,{var,26,'C'},default,[binary]}]},
                           {var,26,'Binary'}},
                    {call,27,{atom,27,consume_binary_opt},[{var,27,'C'}]}]}]}

Even though it appears that the second match clause in the optimized form is doing more work (checking that the head contains a binary), the compiled form is evidently able to take advantage of the fact that all inputs are of type bin and can be safely known to be binary data at runtime, whereas the unoptimised function has a var that could be unsafe (non-binary) data. If instead of matching the entire binary in the function head of the optimized method and then re-matching A, B, and C in the body we move that match into the function head, we gain another speed increase - 5.5x the speed of the non-optimized version!

Rule 2: Binaries before Constraints

Now that we aren’t getting deoptimized out of the gate by our function head, let’s start working on parsing a binary for real. In this case, let our binary data be a series of datagrams, each prefixed with an integer value that determines the type of the following datagram. Since we want to handle each datagram differently, it makes sense to parse off the type, and then call through to a method that given a type and a data stream, handles parsing that datagram. Something like so:

 1 parse_datagrams_unoptimized(<<>>) ->
 2     ok;
 3 parse_datagrams_unoptimized(<<Bin/binary>>) ->
 4     <<DgramType:4/unsigned-integer-unit:8,
 5       DgramSize:4/unsigned-integer-unit:8,
 6       Rest/binary>> = Bin,
 7     parse_datagram_unoptimized(DgramType, DgramSize, Bin).
 8 
 9 parse_datagram_unoptimized(?DGRAM_TYPE_A, Size, <<Binary/binary>>) ->
10     % Logic specific to datagram A...
11     <<_:Size, Rest/binary>> = Binary,
12     parse_datagrams_unoptimized(Rest).

In the call from parse_datagrams_unoptimized/1 to parse_datagram_unoptimized/3, the compiler will be unable to optimize the binary match, because the function head for /3 is constrained by a non-variable (in this case the constant ?DGRAM_TYPE_A) before the binary match. If we compile with bin_opt_info on, the compiler will actually suggest the remediation in this case (though in more complex cases, might not be able to):

> c(foo, [bin_opt_info]).
foo.erl:3: Warning: OPTIMIZED: creation of sub binary delayed
foo.erl:6: Warning: NOT OPTIMIZED: called function
parse_datagram_unoptimized/3 does not begin with a suitable binary matching
instruction
foo.erl:9: Warning: INFO: matching anything else but a plain variable to the
left of binary pattern will prevent delayed sub binary optimization; SUGGEST
changing argument order
foo.erl:9: Warning: OPTIMIZED: creation of sub binary delayed
foo.erl:11: Warning: OPTIMIZED: creation of sub binary delayed

By simply swapping the constant to after the binary, we make the compiler able to delay the creation of the sub-binary here:

 1 parse_datagrams_optimized(<<>>) ->
 2     ok;
 3 parse_datagrams_optimized(<<Bin/binary>>) ->
 4     <<DgramType:4/unsigned-integer-unit:8,
 5       DgramSize:4/unsigned-integer-unit:8,
 6       Rest/binary>> = Bin,
 7     parse_datagram_optimized(Rest, DgramType, DgramSize).
 8 
 9 parse_datagram_optimized(<<Binary/binary>>, ?DGRAM_TYPE_A, Size) ->
10     % Logic specific to datagram A...
11     <<_:Size, Rest/binary>> = Binary,
12     parse_datagrams_optimized(Rest).
3> c(foo, [bin_opt_info]).
foo.erl:3: Warning: OPTIMIZED: creation of sub binary delayed
foo.erl:6: Warning: OPTIMIZED: creation of sub binary delayed
foo.erl:9: Warning: OPTIMIZED: creation of sub binary delayed
foo.erl:11: Warning: OPTIMIZED: creation of sub binary delayed

Rule 3: No Unbounded Sub-binaries

Another gotcha comes when handling individual chunks of data. Lets’s say you’re parsing some protocol, where each packet is prefixed with a size, contains two fixed fields and one field that is the remainder of the packet. The simple way to parse these packets from a byte stream would be along these lines:

 1 parse_packets_unopt(<<>>, Acc) ->
 2     Acc;
 3 parse_packets_unopt(<<Binary/binary>>, Acc) ->
 4     <<PacketSize:4/unsigned-integer-unit:8,
 5       Rest/binary>> = Binary,
 6     <<PacketBinary:PacketSize/binary, Rest1/binary>> = Rest,
 7     Packet = parse_packet_unopt(PacketBinary),
 8     parse_packets_unopt(Rest1, [Packet|Acc]).
 9 
10 parse_packet_unopt(<<Packet/binary>>) ->
11     <<A:1/binary, B:2/binary, C/binary>> = Packet,
12     {A, B, C}.

For each packet, you match off the packet size, take that many bytes, then pass the packet data to a second method that handles the packet-internal data. (For simplicity, parse_packet_unopt/1 has only one case, but you can imagine a more complex handling logic that performs different operations based on the values of A or B).

If we run this through the optimizer though, we have a problem:

> c(foo, [bin_opt_info]).
foo.erl:3: Warning: OPTIMIZED: creation of sub binary delayed
foo.erl:5: Warning: OPTIMIZED: creation of sub binary delayed
foo.erl:6: Warning: OPTIMIZED: creation of sub binary delayed
foo.erl:10: Warning: OPTIMIZED: creation of sub binary delayed
foo.erl:11: Warning: NOT OPTIMIZED: sub binary is used or returned

Since C isn’t bounded in size, and escapes the function as part of the return, the compiler isn’t able to optimize it away. We can fix this by doing some manual size accounting, which introduces more room for programmer error (miscalculation of total field sizes), but makes the optimizer happy.

 1 parse_packets_opt(<<Binary/binary>>, Acc) ->
 2     <<PacketSize:4/unsigned-integer-unit:8,
 3       Rest/binary>> = Binary,
 4     <<PacketBinary:PacketSize/binary, Rest1/binary>> = Rest,
 5     Packet = parse_packet_opt(PacketSize, PacketBinary),
 6     parse_packets_opt(Rest1, [Packet|Acc]).
 7 
 8 parse_packet_opt(PacketSize, <<Packet/binary>>) ->
 9     FieldCSize = PacketSize - 3h
10     <<A:1/binary, B:2/binary, C:FieldCSize/binary>> = Packet,
11     {A, B, C}.

With this version of the code, we manually calculate the correct size for C. Running this through erlc, we see an improvement:

> c(foo, [bin_opt_info]).
foo.erl:3: Warning: OPTIMIZED: creation of sub binary delayed
foo.erl:5: Warning: OPTIMIZED: creation of sub binary delayed
foo.erl:6: Warning: OPTIMIZED: creation of sub binary delayed
foo.erl:10: Warning: OPTIMIZED: creation of sub binary delayed

Unbounded matches on binaries seem to only be optimizable if control flow then recurses on that tail. In the non-optimized case above, the unbounded match on C is stashed, and recursion continues on Rest1. Since the optimization results from bumping a single pointer along as the binary is matched, having to keep a reference to C breaks the optimization chain.

Hard and fast takeaways:

Reworking my code using these learnings resulted in a net speed increase and significant (though not yet satisfactory) reduction in heap usage. Hopefully there exist more binary handling tricks to avoid unnecessary allocations when processing large sets of binary data.

Comments