Recently at PetroFeed we upgraded from Ruby 1.9 to 2.2. It was a great upgrade for us. Some of our expensive requests dropped from 4 seconds to 1 second, and we saw a drastic reduction in the time spent doing Garbage Collection which has reduced our request times on average as well.
However, our situation wasn’t all roses! We started to see memory problems on Heroku. You can see in the image below a HUGE spike in our memory usage, up to 1GB.
And because we are running 2x dynos on Heroku, our max memory is 1GB. It got to the point where our server would lock up and require a restart to resolve the issue. We were slightly nervous, to put it mildly.
Our suspicion fell on the area of our app that dropped from 4s/req to 1s/req, since it is a very memory-intensive part of the application. My teammate Fabio was able to craft a request that we could repeatedly run to drive the memory up and hit our limits. It looked like we had a memory leak!
Now I’m a huge fan of Sam Saffron and all of the amazing work he’s done in the Ruby community around performance, specifically his work on debugging memory leaks in Ruby. For me, the opportunity to try out some new debugging techniques on a real problem was really exciting.
My first crack at the problem had me trying to replicate Sam’s heap dumping technique locally against our suspected leak. I used a tool that Fabio created called boom-curl to hammer my local web server with requests. I would then do a heap dump and run some analysis against the dump. What was really confusing was that there was no obvious memory leak. Everything appeared to be GC’d correctly.
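For reference, the heap dumping technique boils down to the stdlib `objspace` extension: turn allocation tracing on, exercise the suspect code, then dump everything on the heap. A minimal local sketch (the file path is my choice):

```ruby
require "objspace"

# Turn tracing on as early as possible so every object allocated
# afterwards records the file and line that created it.
ObjectSpace.trace_object_allocations_start

# ... exercise the suspect endpoint (e.g. hammer it with boom-curl), then:
GC.start # collect first, so the dump contains only live objects
File.open("/tmp/ruby-heap.dump", "w") do |f|
  ObjectSpace.dump_all(output: f)
end
```

Each line of the resulting file is one JSON object describing a live Ruby object, which is what the later analysis digs through.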
Not to give away the ending, but this should have been a signal I took notice of. Instead, the conclusion I drew was: well, maybe this problem only happens on Heroku!
Getting a Memory Dump from Heroku
One difference between PetroFeed and Discourse (the product Sam works on) is that PetroFeed runs on Heroku and Discourse does not. This presents some unique challenges when trying to dump memory.
One of those challenges is that Heroku will timeout a request that runs longer than 30 seconds. When doing a full heap dump, I was regularly hitting this timeout! To try and work around that timeout I ran through the following iterations:
- Do a direct text dump out of Rails with `render text: dump`
- Dump into postgres
- Dump into redis and download in a separate request
- Dump to a file in /tmp and download it from /tmp
- Dump to a file in /tmp and upload to S3 from a delayed job
Each of those steps failed for a variety of reasons:
- Took too much time to respond with 100MB+ of payload
- Took too long to insert into postgres
- The download request could land on a different dyno than the one that wrote the file to /tmp (each dyno’s filesystem is ephemeral and isolated)
- Truncation occurring when writing into Redis (this might not have been the problem, but appeared to be at the time)
Eventually I was able to get a successful dump out of Heroku without hitting the timeout. To do that, I wrote to a file in /tmp, then did a direct upload to S3. My hypothesis is that because Heroku runs on AWS, uploads to S3 are fast, which is why this approach succeeded. And since there’s no code in the wild on how to do that, here’s how I did it:
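A minimal sketch of the approach (this is my reconstruction, not the exact original code; `dump_heap` and the injected `uploader` are my own names, and in production the uploader would wrap the `aws-sdk` S3 client):

```ruby
require "objspace"

# Dump the full heap to a file under /tmp, then hand the bytes to an
# uploader. The uploader is injected here to keep the sketch
# self-contained; in production it would be a lambda wrapping the
# aws-sdk S3 client, e.g. ->(key, data) { bucket.object(key).put(body: data) }.
def dump_heap(dir: "/tmp", uploader: nil)
  path = File.join(dir, "heap-#{Time.now.to_i}.dump")

  GC.start # collect first so the dump reflects only live objects
  File.open(path, "w") { |f| ObjectSpace.dump_all(output: f) }

  # Read the file back in binary mode and push it to S3; because the
  # dyno and S3 both live on AWS, the upload finishes well inside
  # Heroku's 30-second window.
  uploader.call(File.basename(path), IO.binread(path)) if uploader
  path
end
```

Triggering `dump_heap` from a controller action (or a console session) and then pulling the file down from S3 is what finally worked for me.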
A couple points of note:
- I left out `ObjectSpace.trace_object_allocations_start`; I set that up in an initializer.
- I tried doing an `IO.pipe` instead of writing to a file, but couldn’t get that to work.
- I did use `IO.binread`, which worked better.
- It seems weird to me that I needed to dump a file, then read a file, then write again. I’m sure there’s a better way to just write straight to S3 but I couldn’t find it! Toss me a message if you figure out a better way and I’ll amend the post!
Now that I was able to get a dump out of Heroku it was time to start analyzing the dump. I’ll skip that long boring process, but eventually I got to the point where I started to question whether we actually had a memory leak. Basically every time I did the analysis there was nothing surfacing that looked like a memory leak in our application.
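For the curious, the core of that analysis can be as simple as grouping the dump’s JSON lines by allocation site (a rough sketch of the idea; dedicated tools do this far more thoroughly):

```ruby
require "json"

# Each line of a heap dump is one JSON object describing a live Ruby
# object. Group objects by allocation file:line and return the sites
# holding the most objects -- a leak shows up as a site whose count
# keeps growing between successive dumps.
def top_allocation_sites(path, limit: 10)
  counts = Hash.new(0)
  File.foreach(path) do |line|
    obj = JSON.parse(line)
    next unless obj["file"] # objects allocated before tracing started have no site
    counts["#{obj['file']}:#{obj['line']}"] += 1
  end
  counts.sort_by { |_, n| -n }.first(limit)
end
```

Running this against successive dumps and diffing the counts is what kept coming up empty for us.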
At this point I circled back to my previous finding that there were no memory leaks occurring on my local machine. I grabbed one of my teammates and explained the problem to him (which is great for triggering questions about your assumptions!). We then went back to our dev environment and started to reduce the number of variables, to try and isolate anything else that might be affecting a dyno’s memory.
We turned off New Relic, increased to a PX dyno (which has 6GB of memory), and reduced the number of Unicorn workers from 3 to 1. And then something strange started to happen … our memory would spike (just as we were observing on production) and then plateau between 500MB and 600MB. Another dead-obvious indicator that a leak wasn’t occurring! :P
After some reflection and thinking, I came to the following conclusion:
Each production dyno had 3 unicorn workers running, and each worker was consuming ~600MB of memory. Combining that with a 2x dyno that maxes out at 1,000MB, it’s easy to see that 3 * 600MB == 1,800MB would quickly exceed the 2x dyno limit and we’d see memory issues that looked like memory leaks.
Previously we were able to safely run 3 workers without exceeding the 2x dyno memory ceiling, but because Ruby 2.2 trades an increase in memory for a decrease in request times (a performance win), we hit the tipping point!
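One practical follow-up (not from the post, just a common Heroku pattern) is to make the worker count tunable through an environment variable, so it can be dialed down the moment per-worker memory grows, without a code change:

```ruby
# config/unicorn.rb -- hypothetical snippet; WEB_CONCURRENCY is the
# conventional Heroku variable for tuning workers per dyno.
worker_processes Integer(ENV["WEB_CONCURRENCY"] || 3)
preload_app true
```

With that in place, `heroku config:set WEB_CONCURRENCY=2` would have bought us headroom the day we upgraded Ruby.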