Debugging Memory Leaks on Heroku

heroku, memory, performance, rails, ruby

Recently at PetroFeed we upgraded from Ruby 1.9 to 2.2. It was a great upgrade for us. Some of our expensive requests dropped from 4 seconds to 1 second, and we saw a drastic reduction in the time spent doing Garbage Collection which has reduced our request times on average as well.

However, our situation wasn’t all roses! We started to see memory impacts on Heroku. You can see in the image below a HUGE spike in our memory usage up to 1GB.

A heroku graph demonstrating a large spike in memory

And because we are running 2x dynos on Heroku, our max memory is 1GB. It got to the point where our server was locking up and requiring a restart to resolve the issue. We were slightly nervous, to put things mildly.

Beaker from the Muppets quivering in fear

Our suspicion fell on the area of our app that had dropped from 4s/req to 1s/req, since it is a very memory-intensive part of our application. My teammate Fabio was able to craft a request that we could run repeatedly to drive up the memory and hit our limits. It looked like we had a memory leak!

Now I’m a huge fan of Sam Saffron and all of the amazing work he’s done in the Ruby community around performance, specifically his work on debugging memory leaks in Ruby. For me, the opportunity to try out some new debugging techniques on a real problem was really exciting.

My first crack at the problem had me trying to replicate Sam’s heap dumping technique locally against our suspected leak. I used a tool that Fabio created called boom-curl to hammer my local web server with requests. I would then do a heap dump and run some analysis against the dump. What was really confusing was that there was no obvious memory leak. Everything appeared to be GC’d correctly.

A frown emoticon
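For reference, here is roughly what that local setup looked like. This is a sketch rather than our exact code, and the dump path is my own choice:

require 'objspace'

# Track allocation call sites before exercising the app, so every object
# in the dump carries the file and line where it was allocated.
ObjectSpace.trace_object_allocations_start

# ... hammer the suspect endpoint repeatedly (boom-curl in my case) ...

# Force a GC, then dump every live object as newline-delimited JSON.
GC.start
File.open('tmp/local-heap.dump', 'w') do |io|
  ObjectSpace.dump_all(output: io)
end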

Not to give away the ending, but this was a signal I should have paid more attention to. Instead, the conclusion I came to was: well, maybe this problem only happens on Heroku!

Getting a Memory Dump from Heroku

One difference between PetroFeed and Discourse (the product Sam works on) is that PetroFeed runs on Heroku and Discourse does not. This presents some unique challenges when trying to dump memory.

One of those challenges is that Heroku will timeout a request that runs longer than 30 seconds. When doing a full heap dump, I was regularly hitting this timeout! To try and work around that timeout I ran through the following iterations:

  1. Do a direct text dump out of rails with render text: dump
  2. Dump into postgres
  3. Dump into redis and download in a separate request
  4. Dump to a file in /tmp and download it from /tmp
  5. Dump to a file in /tmp and upload to S3 from a delayed job

Each of those steps failed for a variety of reasons:

  • Took too much time to respond with 100MB+ of payload
  • Took too long to insert into postgres
  • Accessing /tmp from a dyno that didn’t write the file (Heroku’s ephemeral storage)
  • Truncation occurring when writing into Redis (this might not have been the problem, but appeared to be at the time)

Eventually I was able to get a successful dump out of Heroku without hitting the timeout. To do that I wrote the dump to a file in /tmp and then did a direct upload to S3 from within the same request. My hypothesis is that because Heroku runs on AWS, uploads to S3 are fast enough to fit inside the 30 second window, which is why this approach worked. And since there’s no code in the wild on how to do that, here’s how I did it:

def dump_memory
  # Force a GC first so the dump contains only live objects.
  GC.start

  time = DateTime.now.strftime('%-l%M')
  path = Rails.root.join("tmp/ruby-heap.#{time}.dump")

  # Write every live object, as newline-delimited JSON, to the dyno's
  # ephemeral filesystem.
  io = File.open(path, "w")
  ObjectSpace.dump_all(output: io)
  io.close

  # Upload the dump straight to S3; dyno-to-S3 uploads are fast enough
  # to stay under Heroku's 30 second request timeout.
  bucket = $s3.buckets['mah-bucket']
  obj = bucket.objects[time]
  obj.write(IO.binread(path))

  render text: "success - #{time}"
end

A couple points of note:

  • I left out ObjectSpace.trace_object_allocations_start; I set that up in an initializer (see the sketch after this list).
  • I tried doing an IO.pipe instead of writing to a file but couldn’t get that to work.
  • I did use IO.binread, which worked better than File.read.
  • It seems weird to me that I needed to dump a file, then read a file, then write again. I’m sure there’s a better way to just write straight to S3 but I couldn’t find it! Toss me a message if you figure out a better way and I’ll amend the post!
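For completeness, the initializer itself is tiny. This is a sketch; the file name and the environment-variable guard are my own choices, not necessarily what we run:

# config/initializers/heap_dump.rb
require 'objspace'

# Record the file/line of every allocation so ObjectSpace.dump_all can
# attribute each live object to its allocation site. Tracing adds memory
# and CPU overhead, so the guard lets it be switched on only when needed.
ObjectSpace.trace_object_allocations_start if ENV['HEAP_DUMP']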

Dump Analysis

Now that I was able to get a dump out of Heroku it was time to start analyzing the dump. I’ll skip that long boring process, but eventually I got to the point where I started to question whether we actually had a memory leak. Basically every time I did the analysis there was nothing surfacing that looked like a memory leak in our application.
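To give a flavour of what that analysis involved, here is a minimal sketch (the dump file name is hypothetical). It counts live objects by allocation site; a site whose count keeps climbing between dumps is a leak candidate, and nothing here ever did:

require 'json'

counts = Hash.new(0)

# Each line of the dump is a JSON document describing one live object.
File.foreach('ruby-heap.1015.dump') do |line|
  obj = JSON.parse(line)
  counts["#{obj['file']}:#{obj['line']}"] += 1 if obj['file']
end

# Print the 20 allocation sites with the most live objects.
counts.sort_by { |_site, count| -count }.first(20).each do |site, count|
  puts format('%6d  %s', count, site)
end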

At this point I circled back on my earlier finding that there were no memory leaks occurring on my local machine. I grabbed one of my teammates and explained the problem to him (which is great for triggering questions against your assumptions!). We then went back to our dev environment and started reducing the number of variables to try to isolate anything else that might be affecting a dyno’s memory.

We turned off New Relic, increased to a PX dyno (which has 6GB of memory), and reduced the number of Unicorn workers from 3 to 1. And then something strange started to happen … our memory would spike (like we were observing on production) and then plateau between 500MB and 600MB. Another dead obvious indicator that a leak wasn’t occurring! :P

After some reflection and thinking, I came to the following conclusion:

Each production dyno had 3 unicorn workers running, and each worker was consuming ~600MB of memory. Combining that with a 2x dyno that maxes out at 1,000MB, it’s easy to see that 3 * 600MB == 1,800MB would quickly exceed the 2x dyno limit and we’d see memory issues that looked like memory leaks.

Previously we were able to safely run 3 workers without exceeding the 2x dyno memory ceiling, but because Ruby 2.2 trades an increase in memory usage for a decrease in request times, we hit the tipping point!
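An obvious way out of that squeeze is to run fewer workers per dyno. As a sketch, and not necessarily our exact config, driving the worker count from Heroku’s WEB_CONCURRENCY convention makes that a config change rather than a deploy:

# config/unicorn.rb (sketch)
# `heroku config:set WEB_CONCURRENCY=2` steps a dyno down from 3 workers
# to 2 without touching the code.
worker_processes Integer(ENV['WEB_CONCURRENCY'] || 3)
timeout 30
preload_app true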

This page was delicately crafted by Gavin Miller.