Advanced Android Heap Analysis
Sat, Mar 10, 2018 Companion code for this post available on GithubOutOfMemoryError
is, in my opinion, one of the more insidious errors you can
encounter as an app developer. It can be tricky to tie to a specific piece of
code or allocation path, as the final straw may be some other small allocation,
and even if you know that the crash occurs when loading a large bitmap it can be
tricky in a large application to figure out which large bitmap might be the
culprit.
Luckily, the Android JVM implements a dumper to the HProf heap profile format, which is relatively standard between JVMs (but in this case with some ART extensions). Android studio supplies some basic introspection functionality for heap dumps, but it’s not great for following reference chains or seeing at a glance what’s allocating all your memory. Some time ago I wrote an android method trace parser , so I figured it was time to implement a sister tool for introspecting heap dumps. To that end, I have created the erlang-hprof library on Github. The repo itself has a basic overview and usage examples, but here I will go deeper into how it can be applied to identify a complex memory leak issue.
Reconnaissance
Before you can analyze a heap dump, you must first acquire one.
If you’re seeing out of memory errors cropping up in your reporting solution of
choice, you may or may not actually be able to replicate them yourself. Maybe
there’s some codepath you just don’t hit, maybe your phone has 6GiB of ram and
it would take you forever to OOM it. So the best way to get exactly the sort of
situations your users encounter is to simply try and dump that user’s heap when
they encounter an OOM exception. Android supplies a
Debug#dumpHprofData
method you can use to dump the heap to a file, so that takes care of how to get
the heap in a form we can send back. In order to actually catch trigger this
code though, we’ll rely on
Thread#setDefaultUncaughtExceptionHandler
to trigger our dumping code whenever an uncaught OutOfMemoryException
bubbles
all the way up. The basics of this will look a little like so:
public class HProfExceptionHandler implements Thread.UncaughtExceptionHandler {
private final Context appContext;
private final Thread.UncaughtExceptionHandler parentExceptionHandler;
public HProfExceptionHandler(Context context, Thread.UncaughtExceptionHandler parentHandler) {
// If this exception handler is installed into a VM with an existing exception handler,
// we want to be sure that we call through to the already-registered handler so that any
// other crash reporters still work.
this.appContext = context.getApplicationContext();
this.parentExceptionHandler = parentHandler;
}
@Override
public void uncaughtException(Thread t, Throwable e) {
// Check if this error is an out of memory error
if (OutOfMemoryError.class.isAssignableFrom(e.getClass())) {
// If it is, then try and dump it to the filesystem
dumpHeapToFile();
}
// Propagate the error to the parent exception handler, if present
if (this.parentExceptionHandler != null) {
parentExceptionHandler.uncaughtException(t, e);
}
}
private void dumpHeapToFile() {
// First, ensure our dump location exists
File dumpDir = getDumpDirectory();
if (!dumpDir.exists()) {
if (!dumpDir.mkdirs()) {
Log.w(TAG, "Failed to create dump directory " + dumpDir.getAbsolutePath());
return;
}
}
// Create a new filename for this dump
File dumpPath = new File(dumpDir, String.format(Locale.US, "%s.hprof", UUID.randomUUID().toString()));
// Attempt to write out the heap
try {
Debug.dumpHprofData(dumpPath.getAbsolutePath());
Log.d(TAG, "Successfully wrote heap dump " + dumpPath.getName());
} catch (IOException e1) {
Log.e(TAG, e1.getMessage(), e1);
}
}
}
Once this dump is complete, the application will continue to execute the next uncaught handler (if present), and then die. On the next start (or, for full marks, in a periodic job that requires unmetered network + charging) you can then upload these dump files to your server. There are two things to be wary of here:
- These heap dumps can be large. On the order of several hundred megabytes. However, they tend to be highly receptive to compression. Based on production data for an app, most dumps tended to be between 15-25MiB when gzipped, as opposed to 100-200MiB uncompressed. If you are going to upload these dumps, I would very much recommend gzipping them on the fly as you do, as a courtesy to your user.
- These heap dumps contain everything your application has in memory by definition. This may mean app secrets, sensitive user data / PII and other dangerous information. Please undergo the same data protection precautions with this crash data as you would any other sensitive information.
Initial Inspection
Once you have the heap dumps in hand, it’s time to figure out if there’s a common thread (or threads) to the allocation patterns that resulted in an OOM. Now the way that the HPROF format works at a high level is that it contains an encoded structure of every class present in the heap, including their static variables, instance variables and their types, and then follows vast numbers of instance dumps, each with a class object ID that describes it and the raw data for each of its fields. So let’s assume that we have an HProf file we’ve recovered from a device, and want to see what’s going on. First things first, we need to parse it. So fetch the parser as a dependency, and open up an Erlang shell.
% Now, let's open up that heap dump
{ok, Parser} = hprof_parser:parse_file('my_heap.dump').
% Since we're going in blind here, the first thing I'd check is the largest
% objects we might expect to allocate - bitmaps.
% So first, let's get all the instances in this dump that are of the type
% android.graphics.Bitmap
{ok, BmpRef} = hprof_parser:get_instances_for_class(Parser, <<"android.graphics.Bitmap">>).
{ok, Bitmaps} = hprof:await_acc(BmpRef, [], fun(A, I) -> [I|A] end).
% Now, each bitmap stores its actual image data in a byte array, called
% 'mBuffer', so let's read the mBuffer value for each of these bitmaps
% To make this easy, first read in the record definitions
rr("include/records.hrl").
BufferVars = [maps:get(<<"mBuffer">>, Vars) || #hprof_heap_instance{instance_values=Vars} <- Bitmaps].
% Now we have a list of #hprof_instance_field{}, each of which has the ID of
% the relevant native array, so let's fetch those
ByteArrayIds = [V || #hprof_instance_field{value=V} <- BufferVars].
% Now that we know which byte arrays are bitmaps, let's fetch all the
% primitive arrays and filter them down.
ByteArrayIdSet = sets:from_list(ByteArrayIds).
{ok, ByteArrayRef} = hprof_parser:get_primitive_arrays_of_type(Parser, byte).
{ok, BitmapByteArrays} = hprof:await_acc(
ByteArrayRef,
[],
fun(Acc, Array) ->
case sets:is_element(Array#hprof_primitive_array.object_id, ByteArrayIdSet) of
true -> [Array|Acc];
false -> Acc
end
end
).
% At this point, we can do some quick maths to see how much of the heap is
% taken up by bitmap data:
ImageSizesBytes = [byte_size(E) || #hprof_primitive_array{elements=E} <- BitmapByteArrays].
TotalImageSize = lists:foldl(fun(A, B) -> A + B end, 0, ImageSizesBytes).
In a lot of cases, you might find that the total bytes allocated by bitmaps accounts for most of your heap. If it does, then now you can dig deeper into which bitmaps these are. One nice way to do this is to reconstruct them and take a look - we have the image data after all! So back to the shell:
% Odds are that if we see the images, as the developer we'll know what they're
% from. So let's convert all those bitmaps we fetched earlier to PNGs and save
% them somewhere
lists:foreach(
fun(Bitmap) ->
case hprof_bitmap:make_png(Parser, Bitmap) of
{ok, PngData} ->
file:write_file(
integer_to_list(Bitmap#hprof_heap_instance.object_id) ++ ".png",
PngData
);
_ -> ok
end
end,
Bitmaps
).
If you take a look in your working directory, you should now have each of the
images that were in-memory written out for your perusal. Note that since
Android doesn’t seem to store the bitmap configuration with the java class, the
library has to guess at the bit settings for images with two bytes per pixel,
since they could be either ARGB_4444
or RGB_565
. It attempts to guess based
on whether the image looks like it has a lot of pixels that would be
transparent, but of course isn’t 100% reliable.
Digging deeper
This is all well and good, and maybe worked for finding an image asset that wasn’t getting resized properly for the device’s pixel density. But what do you do if you find an image, but can’t figure out how it and 30 copies of it are all sticking around in the heap? Well, since we have available to us all of the heap objects and all their pointers, we have a graph that tells us precisely which objects point to these images! But just knowing which objects point to which other objects isn’t perfect - if you follow every reference, you will very quickly get bogged down by the sheer number of references, back references and weak references that the android view classes come with. But we know one thing - if these images (or other problem instances) aren’t getting garbage collected when they should, then the only way this is possible is if somewhere, somehow, there is a reference chain that ends in a static pointer.
So let’s go over how we can identify these reference chains (A complete implementation is visible on Github).
The first step is going to be generating a mapping of which objects are pointed
to by which other objects. For this, I’m going to use the bag
ETS type to
store tuples of {referenced, reference_holder}
object ID pairs. We can
generate this table fairly simply by streaming all of the instances from the
parser:
ReferenceMap = ets:new(objects, [bag]).
{ok, InstancesRef} = hprof_parser:get_all_instances(Parser).
ok = hprof_await(
InstancesRef,
fun(I=#hprof_heap_instance{object_id=Oid, instance_values=Values}) ->
% For each instance, we want to check if it has any fields that point
% to other objects. If it does, for each of these fields we store a
% tuple of {referent object id, this object id}
% Note that you may also want to exclude certain instance types, such
% as WeakReference and FinalizerReference, from analysis
maps:fold(
fun(_VariableName, #hprof_instance_field{type=T, value=V}, _Acc) ->
% If this variable is of type object, and isn't null, then
% track this reference in the ETS table
case (T =:= object) and (V =/= 0) of
true -> ets:insert(ReferenceMap, {V, Oid});
false -> ok
end
end,
ok,
Values
)
end
).
This data allows us to figure out who holds onto who (though note that in the final implementation you would also include object arrays), but doesn’t have enough info to tie these instances back to a static culprit. For that, we need to take a look at the class dump instances, which contain all the static fields. If we find any reference chains that are held by any of these, we know that that’s a memory leak. So let’s build up a table in a similar way:
StaticallyReferenced = ets:new(statics, [bag]).
{ok, ClassDumpRef} = hprof_parser:get_class_dumps(Parser).
ok = hprof:await(
ClassDumpRef,
fun(#hprof_class_dump{class_id=ClsId, static_fields=Statics}) ->
lists:foreach(
fun(#hprof_static_field{type=T, data=D}) ->
case (T =:= ?HPROF_BASIC_OBJECT) and (D =/= 0) of
true ->
% So if this class has a static reference to this
% object D, then we both update the reference map from
% before and tag that object as statically referenced
% by inserting it into the second table.
ets:insert(ReferenceMap, {D, ClsId}),
ets:insert(StaticallyReferenced, {D});
false -> ok
end
end,
Statics
)
end
).
With these two data sets, we can now do a depth first search from a target object to try and find statically held reference chains.
detect_reference_chain(ReferenceMappings, StaticallyHeld, TargetId, PreviousRoots, RefChain) ->
% At each step of our search, we want to first find all of the objects that
% reference our 'target' object.
% We can look this up using the first table we created:
Referers = [Referrer || {_, Referrer} <- ets:lookup(ReferenceMappings, TargetId)],
% Now, we want to iterate over all of the referers:
% - If the referer is already in the PreviousRoots set, then we've already
% traversed it while trying to detect a reference chain - this means there's
% a circular reference, so don't recurse in this case or we'd be here forever
% - If the referer is not statically referenced, then recurse with the referer
% as the new target, and the current target added to the previous root set
% and RefChain, which is the reference stack so far.
% - If the referer *is* statically referenced, then we've found a potential
% memory leak! In the full implementation, this would be passed to the
% pretty-printer worker to resolve the names and numbers of the objects involved.
lists:foreach(
fun(Referer) ->
% This is where we check if we've already seen it (reference loop)
case sets:is_element(Referer, PreviousRoots) of
true -> ok; % Return back up the stack
false ->
% Check if this object has a static root
case ets_contains(StaticallyHeld, Referer) of
false ->
% Recurse
detect_reference_chain(
ReferenceMappings, StaticallyHeld,
Referer, % Takes the place of target
% Add the target here to prevent any future loops
sets:add_element(TargetId, PreviousRoots),
[TargetId|RefChain]
);
true ->
% The RefChain stack contains the list of object
% pointers that got us here, so add the referer to
% it and fob it off on the pretty printer
spawn(?MODULE, pretty_print, [[Referer|RefChain]])
end
end
end,
Referers
).
Depending on the density of your heap it may take a few minutes to generate your ETS mappings, but afterwards it should be fairly quick to start detecting static reference chains, if they do exist. The full script will resolve and print reference chains in a readable form, like so:
$ _build/default/bin/main dump.hprof 896555536
Populating reference mapping table...
- Instances...
- Object arrays...
- Class dumps...
Loaded 1805702 object mappings in 201 seconds
Waiting for reference chains...
Found statically referenced chain:
com.example.utility.CardCache -static cachedCards-> 866149120
java.lang.Object[] array (857898624) -> 896566080
com.example.cards.ContentCard (896566080) -mParent-> 896566032
com.example.cards.ContentCard$ViewHolder(895754848) -mParent-> 896571264
com.example.activity.MyNonGcActivity$1 (896571264) -this$0-> 896555536
com.example.activity.MyNonGcActivity (896555536)
This is of course an example application, but this sort of bug is something you can easily run into in a large, real world application if you’re unlucky. But as with so many bugs, the work is in the finding, and once you know where a reference is being held it’s a lot easier to find out why and remedy it.
I hope that this little library comes in handy if you run into memory problems with your own applications. If you find any optimizations or useful heap examination scripts, please do file a PR or open an issue on Github.