Erlang Liveness Checks in Kubernetes
Tue, Jun 12, 2018

I have an ever-increasing number of small projects and deployments that I use either internally or with some availability to the public, and have been relying on Kubernetes to make managing them easy. Not too long ago, I started adding a liveness probe to each pod definition as a contingency against a hung runtime. My pod definition at the time looked like this example from my Aflame project:
containers:
- name: aflame
  image: docker-registry:5000/erlang-aflame
  livenessProbe:
    exec:
      command:
      - /deploy/bin/aflame
      - ping
    initialDelaySeconds: 5
This worked fine, and I was able to verify that nodes that failed the ping test would be taken down. However, some weeks later I was poking around and realized that CPU utilization on my server was significantly higher than I would expect.
Looking at htop, it was difficult to see a precise culprit, except for a large number of processes called erl_child_setup coming into existence, pegging a CPU core, and disappearing again. After googling around, I landed on the source code for this task and found this section of main:
/* We close all fds except the uds from beam.
   All other fds from now on will have the
   CLOEXEC flags set on them. This means that we
   only have to close a very limited number of fds
   after we fork before the exec. */
#if defined(HAVE_CLOSEFROM)
    closefrom(4);
#else
    for (i = 4; i < max_files; i++)
#if defined(__ANDROID__)
        if (i != system_properties_fd())
#endif
            (void) close(i);
#endif
According to the ps output on my system, max_files was getting set to 1,048,576, so every time this program ran, it hot-looped over a million possible file descriptors and called close on each! No wonder it was resulting in so much system time. But what was actually causing all of these erl_child_setup invocations? I had initially suspected misbehaving deployment code, but the spike in load lined up with when I added the liveness checks. It turns out that the relx ping command is surprisingly heavyweight; or at least it ends up that way when run in an environment with a very high max file count, which was the case under docker.
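To get a sense of how expensive that fallback path is, here is a minimal standalone sketch (my own illustration, not part of the OTP source) that times a close loop over the same 1,048,576 descriptors:

#include <stdio.h>
#include <time.h>
#include <unistd.h>

int main(void) {
    /* 1,048,576 matches the max_files value I saw under docker */
    long max_files = 1048576;
    struct timespec start, end;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (long i = 4; i < max_files; i++)
        (void) close((int) i);   /* nearly every call fails with EBADF */
    clock_gettime(CLOCK_MONOTONIC, &end);

    double elapsed = (double)(end.tv_sec - start.tv_sec)
                   + (double)(end.tv_nsec - start.tv_nsec) / 1e9;
    printf("attempted to close %ld fds in %.3f seconds\n",
           max_files - 4, elapsed);
    return 0;
}

Even though almost every close fails immediately with EBADF, a million syscalls per probe adds up quickly when the probe fires every few seconds.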
The fix
In order to fix this, I altered my liveness probe to run the ping check with a low ulimit on the number of open files. We still need to loop, but we are at least looping over a considerably more restrained number of descriptors. I also took the opportunity to increase the interval between checks, since the default check period is rather quick.
containers:
- name: aflame
  image: docker-registry:5000/erlang-aflame
  livenessProbe:
    exec:
      command:
+     - softlimit
+     - -o
+     - "128"
      - /deploy/bin/aflame
      - ping
    initialDelaySeconds: 5
+   periodSeconds: 60
Note that in order to get softlimit in your container, you may need to rebuild the image with the daemontools package installed, or with another source that provides a limit utility.
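If pulling in daemontools is inconvenient, the relevant behaviour is easy to approximate. The sketch below (my own, loosely modelled on softlimit -o 128, with minimal error handling) lowers the soft limit on open files and then execs the real command, which inherits the lowered limit:

#include <stdio.h>
#include <sys/resource.h>
#include <unistd.h>

int main(int argc, char *argv[]) {
    if (argc < 2) {
        fprintf(stderr, "usage: %s command [args...]\n", argv[0]);
        return 1;
    }

    struct rlimit rl;
    if (getrlimit(RLIMIT_NOFILE, &rl) != 0) {
        perror("getrlimit");
        return 1;
    }
    rl.rlim_cur = 128;  /* lower only the soft limit; hard limit untouched */
    if (setrlimit(RLIMIT_NOFILE, &rl) != 0) {
        perror("setrlimit");
        return 1;
    }

    execvp(argv[1], &argv[1]);  /* the exec'd command inherits the limit */
    perror("execvp");
    return 1;
}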
With these changes, system load rapidly dropped from nearly 20 to a more reasonable 3. The default performance of this wrapper would likely be helped by the adoption of a closefrom syscall into the Linux kernel, but unfortunately the only references I can find to this are a pessimistic ticket from 2009 and an unmerged patch from Zheng Liu in 2014.