Erlang Liveness Checks in Kubernetes

I have an ever-increasing number of small projects and deployments that I use either internally or with some availability to the public, and have been relying on Kubernetes to make managing them easy. Not too long ago, I started adding a liveness probe to each pod definition as a contingency against a hung runtime. My pod definition at the time looked like this example from my Aflame project:

containers:
- name: aflame
image: docker-registry:5000/erlang-aflame
livenessProbe:
  exec:
    command:
    - /deploy/bin/aflame
    - ping
  initialDelaySeconds: 5

This worked fine, and I was able to verify that nodes that failed the ping test would be taken down. However, some weeks later I was poking around and realized that CPU utilization on my server was significantly higher than I would expect.

Note the recent spike in CPU utilization

Looking at htop, it was difficult to see a precise culprit, except for a large number of processes called erl_child_setup coming into existence, pegging a CPU core and disappearing again. After googling around, I landed on the source code for this task and found this section of main:

/* We close all fds except the uds from beam.
   All other fds from now on will have the
   CLOEXEC flags set on them. This means that we
   only have to close a very limited number of fds
   after we fork before the exec. */
#if defined(HAVE_CLOSEFROM)
    closefrom(4);
#else
    for (i = 4; i < max_files; i++)
#if defined(__ANDROID__)
        if (i != system_properties_fd())
#endif
        (void) close(i);
#endif

According to the ps output on my system, max_files was getting set to 1,048,576 - so every time this program run, it was hot looping over a million possible file descriptors and calling close on each! No wonder it was resulting in so much system time. But what was actually causing all of these erl_child_setup calls? I had initially suspected a misbehaving deployment code, but the spike in load lined up with when I added the liveness checks. It turns out that the relx ping command is surprisingly heavyweight; or at least ends up that way when run in an environment with a very high max file count, which was the case under docker.

The fix

In order to fix this, I altered my liveness probe to run the ping check with a low ulimit on the number of open files. We still need to loop, but we are at least looping over a considerably more restrained number of descriptors. I also took the opportunity to increase the interval between checks, since the default check period is rather quick.

 containers:
 - name: aflame
 image: docker-registry:5000/erlang-aflame
 livenessProbe:
   exec:
     command:
+    - softlimit
+    - -o
+    - "128"
     - /deploy/bin/aflame
     - ping
   initialDelaySeconds: 5
+  periodSeconds: 60

Note that in order to get softlimit in your container, you may need to rebuild with daemontools package installed, or another source that contains a limit utility.

Noticably reduced CPU usage

With these changes, system load rapidly dropped from nearly 20 to a more reasonable 3. The default performance of this wrapper would likely be helped by an adoption of the closefrom syscall into the Linux kernel, but unfortunately the only references I can find to this are a pessimistic ticket from 2009 and an unmerged patch from Zheng Liu in 2014.

Comments