Migrating from Heroku to AWS

Matt Basta
18 min read · Dec 31, 2020


Since the very beginning, Pinecast’s application servers have been hosted on Heroku. The main application database, the Django application that serves web requests, cron jobs, the CI pipeline, and the staging environment have all been built exclusively on Heroku infrastructure.

Heroku has never been perfect. Every year or so, I encounter a rough edge or a hiccup. I’m sure I’ve complained enough about Heroku’s shortcomings on Twitter. But in recent months, this has come to a head.

The impetus for migrating

In October, Pinecast started experiencing infrequent periods of degradation lasting 4–20 minutes at a time, roughly once every week or two. During these periods, around 50% of requests would time out or fail. The affected requests included episode requests, uncached feed requests (though this did not affect a feed’s availability), and dashboard requests.

During the degradation, dyno load was essentially flat—there was nothing to indicate that the dynos were under heavy load. Memory usage remained normal. Of the 8 dynos running the site (each with four worker processes), every dyno and worker was affected.

I spoke with a friend who’d run a very large site on Heroku about this. He noted that I was using Standard dynos (versus Performance dynos). Standard dynos run on multitenant infrastructure, meaning the underlying servers are shared with other Heroku customers. A “noisy neighbor” could be causing the issues—saturating the underlying servers’ CPUs, using up network throughput, or otherwise causing mayhem. Another friend who had previously worked at Heroku confirmed that this was a likely cause.

I should note: I’d investigated this issue pretty thoroughly. The site would recover on its own with no intervention from me, especially during very short (<5min) periods of degradation. I remain convinced that this was not a result of my application code.

In December, these issues had become frequent—3–5 times per week—and were becoming unacceptable. I bit the bullet and upgraded to Performance-M dynos.

Standard-1x dynos are $25/mo. I was running 8 of them for production use, which comes out to $200/mo. Each dyno was running 4 instances of Django (using gunicorn). Four was chosen because it was the largest number of instances that could be run on a single dyno without exceeding the memory limit.
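
For context, that per-dyno worker count is just a gunicorn setting. Here’s a minimal sketch of the kind of config this implies—the module layout and the port binding are my assumptions, not Pinecast’s actual files:

```python
# gunicorn.conf.py — a sketch, not Pinecast's actual configuration.
import os

# Heroku tells the process which port to listen on via $PORT.
bind = f"0.0.0.0:{os.environ.get('PORT', '8000')}"

# Four worker processes: the most that fit within a Standard-1x dyno's
# memory limit for this application.
workers = 4
```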

Performance-M dynos are $250/mo—the cost of ten Standard-1x dynos. That was disheartening. Moreover, one Performance-M dyno only has enough memory to run 21 instances of Django without exceeding its memory limit. That’s roughly a 35% reduction in capacity overall (21 workers, versus the 32 I had across the Standard dynos). That said, the lost capacity wasn’t strictly necessary: a few of the Standard-1x dynos were only there to handle peak traffic (Monday mornings, for instance).

Performance dynos could compensate for this extra traffic with their autoscaling feature. Autoscaling isn’t available with Standard dynos. This has a downside, though: it can only scale up your dynos when response times to your application increase. If dyno load, memory, or the 5XX response count increases, you’re out of luck. At the very least, I was convinced that this would help scale up the site in the event that it didn’t have enough compute resources available.

I was wrong. The day after making the switch to Performance dynos, the site went hard-down just after midnight. A simple “restart all dynos” brought it back up. The next night, just after midnight, the same thing happened. This became a pattern. The hard downtime presented itself as a spike in dyno load, followed by timeouts of most requests (the latter being the same symptom as the degradation on Standard dynos).

I assumed that the site was experiencing a burst in traffic, causing a thundering herd that backlogged all of the workers, explaining the increase in dyno load. However, access logs from Cloudflare (which fronts Heroku) do not corroborate this: there was no burst in traffic.

Moreover, Heroku did not scale up the number of dynos when the requests began timing out. I reached out to their customer support, and as they’d done in the past, they suggested that the problem was in my application and pushed me to pay for NewRelic to debug it. Pushing NewRelic seems to be their go-to answer.

They also said that the autoscaling feature will not kick in if there’s a large number of application failures. That’s wild to me, because the failures were caused by timeouts, which would have been avoided if the application had been scaled up. If the damn feature weren’t restricted to just response times, this would be less ridiculous, but it shows that Heroku’s autoscaling is effectively garbage: a spike in traffic large enough to cause timeouts before autoscaling kicks in will take your site down. I can’t begin to imagine the conversation that led to this decision.

If I had to take a guess as to what’s happening here, I’d say that there’s some process on Heroku’s side that affects their request router (the software that directs an incoming HTTP request to a dyno). That thing stutters for a moment, causing a backlog of requests to build up. All of the workers on the dyno become instantly occupied, taking a second or two to respond. There’s already a backlog at this point, but new incoming requests pile on. This would explain the dyno load increase. As the server is now completely saturated, it becomes bottlenecked on CPU, leading to timeouts, leading to more backlog, leading to what we see.

The answer here is to add more capacity: a second $250/mo Performance-M dyno (with another 21 workers). 42 workers is 10 more than I needed previously, doing even less work and costing more money.

The second dyno fixed the issue, but frankly this is outrageous. The promise of the cloud is that you pay for what you use! I’m not getting $500/mo of value from this service, I don’t appreciate being gaslit by their customer support, and the existing pile of issues that I already resented wasn’t helping:

  • Buggy charts
  • Frequent issues around deployments and observability
  • Autoscaling is next to useless
  • No Postgres replication out of Heroku (more on this later)
  • Heroku CI is incredibly slow. Github Actions is almost four times faster.
  • Lots of outdated and missing documentation

I’d had enough and decided to make the switch to something else.

Alternatives

My search started with direct Heroku alternatives. Without going into too much detail, I found few that met my needs. A few newer hosting services provided compelling Heroku-like functionality, but lacked information or specific features I needed.

My main criteria for a hosting service:

  • Separate staging and production environments that could be managed together
  • Must be managed PaaS (handle deployments, environment setup, etc.)—I don’t want to wrangle raw VMs
  • Must have tools for monitoring
  • Must be able to roll back a deployment
  • Must be able to autoscale
  • Must have support for cron jobs
  • Must have the ability to run one-off commands (a la heroku run)

After evaluating a lot of services, I decided to go with AWS Elastic Beanstalk. It has a few things going for it:

  • I’m already using AWS for Pinecast analytics, audio processing (detecting durations, etc.), and more.
  • Aurora is a decent managed Postgres option.
  • I’m fairly comfortable with AWS and (mostly) know how not to get burned.

Beanstalk isn’t perfect, though. Some obvious problems:

  • Deployment and configuration changes are notably slower than Heroku.
  • Unless you deploy from Docker containers, you’re limited to some fixed platforms. For Python, the most recent supported version is 3.7 (for comparison, I was on the cusp of upgrading Pinecast from 3.8 to 3.9—but I was using ~none of 3.8’s features). Amazon is at least a year behind the curve.
  • You don’t get fancy charts and graphs out-of-the-box, but you can build them with CloudWatch pretty easily. CloudWatch is well-known for having a lag in ingestion, so metrics may be worse overall than Heroku (in terms of latency, not breadth).
  • Nobody likes the AWS console, and the AWS CLI is hardly what I’d call user-friendly. The command to deploy Pinecast is now 171 bytes—far too long to reasonably type from memory.
  • The built-in health checks are cool, but come with an undefined Host header by default. That means you need to do some weird fiddling to make Django play nicely with it.

I have other complaints, but they’re hardly worth mentioning.

The migration process

CI

The first step was moving from Heroku CI to Github Actions. This was surprisingly easy.

The build log for a successful build

The whole process takes about a minute and a half to set up the environment, run the tests, deploy to my Beanstalk staging environment, and post to Slack. For comparison, Heroku CI (on a Performance dyno) takes two minutes and forty seconds just to set up the environment and run tests.

This was amazingly easy. Github Actions is something I’d wanted to play with for a while now, and I can confirm that it is good stuff.

Beanstalk staging environment

The second step was getting the Beanstalk environment set up. (You’ll notice the deployment step in the Github Actions build log above—I did these two steps in ~parallel.)

This amounted to the following:

  • Creating a new Beanstalk application. The application is in a way the equivalent of your Heroku pipeline.
  • Creating a new environment. The environment is your application running on an actual EC2 instance.
  • Copying all of the Heroku environment variables into the Beanstalk environment configuration. This was probably the most tedious step—Pinecast has a lot of configuration options and secrets.
  • Configuring the environment’s load balancer to route requests to the WSGI application.

Note that if you are already using AWS for other things, you probably have a VPC set up. Even so, Beanstalk does not create your environment in a VPC by default, and this will preclude you from using some instance types. It’s wise to add your environment to a VPC from the start so that you don’t need to rebuild things from scratch later: you cannot convert a non-VPC environment to an environment in a VPC.

Because I use Cloudflare, the next step was to create an origin certificate for my new staging environment. You can upload these to the AWS Certificate Manager, and simply select the certificate in your environment configuration. For Cloudflare origin certs, you’ll need to copy their origin CA root cert into the “certificate chain” field; they provide the PEM file (just open and copy its contents) here.

To get HTTPS set up, open Configuration > Load Balancer. Disable the port 80 listener and add a new listener with:

  • Port: 443 (TLS)
  • Listener protocol: HTTPS (don’t choose SSL here)
  • Instance port: 80
  • Instance protocol: HTTP
  • SSL certificate: (select the cert you added to the certificate manager)

What this does is accept traffic from Cloudflare on port 443 (the HTTPS default) and route it from the load balancer to your instance on port 80 as HTTP traffic.

At this point, you can point your CNAME record at the environment’s URL and it should pretty much just work.

The database

This was the most challenging step: moving the database out of Heroku and onto AWS.

If you copied your DATABASE_URL in the environment variables, your Beanstalk instance should be reading data directly from Heroku’s Postgres instance. But we want that database to live in AWS, too.
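
For what that looks like in practice, here’s a Django settings sketch. It assumes the common dj-database-url package; the post doesn’t say exactly how Pinecast parses the URL:

```python
# settings.py (sketch): use whatever Postgres DATABASE_URL points at,
# whether that's Heroku Postgres or the new Aurora cluster.
import dj_database_url  # assumption: a common way to parse DATABASE_URL

DATABASES = {
    "default": dj_database_url.config(
        conn_max_age=600,   # reuse connections between requests
        ssl_require=True,   # both Heroku Postgres and Aurora should require TLS
    )
}
```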

I used Aurora, so it was as simple as setting up a new Aurora cluster. Copy the credentials and create the new database URL value. Once the cluster is set up, check that you can connect to the instance successfully.

I’d first attempted to use AWS Database Migration Service, but it relies on the ability to set up replication on the source database, and Heroku doesn’t allow replication out of its systems. That means you’re stuck hauling your database out manually.

Pinecast’s application database is fairly small (certainly by comparison to many other applications), so it was pretty easy to copy over. The migration looked like this (a scripted sketch follows the list):

  1. Put the application into maintenance mode
  2. Perform a database backup with the CLI
  3. Download the database backup with the CLI
  4. Use pg_restore to push the contents of the backup into Aurora
  5. Update the Heroku application’s DATABASE_URL to point at Aurora
  6. Turn off maintenance mode
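
Scripted, those six steps look roughly like this. This is a sketch, not the exact commands I ran: the app name and the Aurora URL are placeholders, and the pg_restore flags may need adjusting for your schema.

```python
#!/usr/bin/env python3
"""Sketch of the six migration steps above. App name, env var name, and
pg_restore flags are illustrative, not the exact commands that were run."""
import os
import subprocess

APP = "my-heroku-app"                           # placeholder Heroku app name
AURORA_URL = os.environ["AURORA_DATABASE_URL"]  # postgres://user:pass@host:5432/db

def run(*cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# 1. Stop traffic.
run("heroku", "maintenance:on", "-a", APP)

# 2 & 3. Snapshot the Heroku database and download it (written as latest.dump).
run("heroku", "pg:backups:capture", "-a", APP)
run("heroku", "pg:backups:download", "-a", APP)

# 4. Load the dump into Aurora. --clean/--if-exists drop existing objects so
#    repeated test runs start fresh; --no-owner/--no-acl skip Heroku-specific roles.
run("pg_restore", "--no-owner", "--no-acl", "--clean", "--if-exists",
    "--dbname", AURORA_URL, "latest.dump")

# 5. Point the app at Aurora. This only works after "detaching" the credentials
#    from DATABASE_URL (see below); otherwise Heroku manages that config var.
run("heroku", "config:set", f"DATABASE_URL={AURORA_URL}", "-a", APP)

# 6. Resume traffic.
run("heroku", "maintenance:off", "-a", APP)
```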

I ran this procedure a few times as a test, with a few changes:

  • Skipping maintenance mode
  • Updating the staging instance’s environment variable instead of the Heroku application’s
  • Dropping the database manually after each test. I wasn’t able to get pg_restore to truly drop the database (perhaps there was some weirdness with Aurora?) before restoring a second time.

I made multiple test runs: once against the Beanstalk environment so I wasn’t disrupting anything on Heroku, and once against the Heroku staging application to avoid disrupting the production application. I performed the final test run twice to be sure that it was reproducible.

The final consideration: on your Heroku Postgres resource’s “Credentials” page, click the button to “detach” the credentials from your application’s DATABASE_URL environment variable. If you do not do this, the DATABASE_URL config var is effectively read-only and cannot be updated.

The whole migration process took about four minutes. I’d run it at 3AM to minimize disruption to the site. If you have the ability to make your application read-only, that’s a great option instead of using Heroku’s maintenance mode (which essentially stops all traffic).

After the migration was complete, the site was noticeably snappier. I assume this is because of the performance characteristics of Aurora Postgres. I updated all of the remaining environment variables, and the old database was safe to turn down.

Be sure to configure backups on your Aurora cluster!

Cutting over staging

At this point, it was safe to turn off Heroku CI and delete the staging environment from Heroku—it was running on Beanstalk just fine.

Getting the health checks for Beanstalk working was a bit of a challenge. The load balancer will ping your server to make sure it’s alive and well. However, it’ll hit a (configurable) URL without a Host header, which, if you’re using Django, will cause the request to fail because of the ALLOWED_HOSTS settings option.

To get around this, I essentially followed this guide from Andrés Álvarez. Instead of using requests I used urllib.request.urlopen because I don’t want anything especially spooky happening during the startup of the app. The fix essentially queries an internal EC2 instance service that returns the hostname to expect for the healthcheck and adds it to ALLOWED_HOSTS.
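
The end result looks roughly like this—a sketch in the spirit of that guide, not Pinecast’s exact code, and it assumes the standard EC2 instance metadata endpoint:

```python
# settings.py (sketch): let the load balancer's health check through ALLOWED_HOSTS.
# The health check addresses the instance directly (by its private IP), which
# Django would otherwise reject.
import urllib.request

ALLOWED_HOSTS = ["pinecast.com", ".pinecast.com"]  # illustrative values

try:
    # The EC2 instance metadata service; unreachable (and harmless) off EC2.
    private_ip = urllib.request.urlopen(
        "http://169.254.169.254/latest/meta-data/local-ipv4", timeout=0.2
    ).read().decode()
    ALLOWED_HOSTS.append(private_ip)
except Exception:
    # Local development, CI, etc.: no metadata service available.
    pass
```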

This is a great point to set up a CloudWatch dashboard for important metrics for your environment. I set one up for CPU use, requests broken out by response code, and abnormal responses (gateway timeouts, connection timeouts, etc.). You’ll want to also monitor memory use, for reasons that will become apparent.

Monitoring memory use is not as simple as CPU, where you can just select the CPUUtilization metric for the instances by ASG. Instead, you need to tell your instances to start collecting it. AWS has a guide for this.

Heed the note to replace the perl-Bundle-LWP yum package—if you’re setting up an environment now, it’s almost a certainty that you’re on Amazon Linux 2.

Also be aware that it’ll take a few minutes for the data to start appearing. If you see the metric appear in CloudWatch but don’t see the data, give it ten or fifteen minutes.

Cutting over production

Cutting over the production servers was a similar process. Beanstalk has an option to clone an environment, so this was essentially as simple as:

  1. Cloning the environment
  2. Changing the instance type (I’m using r5.large)
  3. Increasing the number of processes and threads per instance
  4. Getting production Cloudflare certificates into Certificate Manager and switching the environment over
  5. Changing environment properties to reflect the keys and settings for production

One thing that Beanstalk does not make easy is planning the initial capacity for an environment, though I suppose this is a problem with any service. Without having run your application on Beanstalk before, how can you know the right number of instances, processes, and threads to use in your environment? If any of these numbers are off, your service can fail in a number of ways:

  • If you have too few instances and your service is compute-heavy, you can become CPU-bound and requests will begin backing up and timing out.
  • If you have too few processes or threads, your service will bottleneck and requests will time out.
  • If you have too many processes, your instance may run out of memory and start swapping madly.
  • If you have too many threads (and you’re using Python), your requests will time out because only one thread can run at a time.

I don’t have the expertise to offer authoritative rules of thumb for what these values should be. That said, you can do a few things.

First, change your staging environment to use your production environment’s instance type. This lets you make an apples-to-apples comparison of how your application will run in production. Once it’s up and running, start generating artificial traffic to your most popular endpoints, if that’s possible. I use Siege.

Siege can be configured in a huge number of ways. We don’t really care about throughput at this point; we just want the HTTP requests. If you have another tool that you prefer or pay for, that’ll work fine too.

Next, set the process count for your environment to 1 and the thread count to a reasonable number. I’m using 20 threads per process because my application uses relatively little CPU per request and often waits on the database. The goal here is to warm up your instance with a single process. After a few minutes, you’ll see what your memory use looks like for a single process with 20 threads. Unless you’re using a runtime that supports strong parallelism (Java, Go, etc.) you almost certainly want more processes. This exercise will tell you how your system performs with one process, and you can scale up from there.

It’s almost certainly the case (if you’re using an r5-class instance) that you’ll only be using a small fraction of your instance’s memory. Pinecast is memory-heavy, so I aimed to use about 60% of the instance’s total memory. That leaves a nice buffer for memory use to grow in the future and gives you time to react if memory use jumps up (you can set an alert!). If your application is fairly memory-heavy, take your memory target (60% of your RAM) and divide it by the per-process memory use after warming up the instance. If you have 8GB of RAM and your single warmed-up process uses 300MB, 8GB × 60% / 300MB ≈ 16 processes.
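
That arithmetic, spelled out:

```python
# Back-of-the-envelope process count (the same arithmetic as above).
ram_mb = 8 * 1024          # instance memory
target_fraction = 0.6      # leave ~40% headroom
per_process_mb = 300       # measured after warming up a single process

processes = int(ram_mb * target_fraction // per_process_mb)
print(processes)  # -> 16
```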

You’ll also want to be mindful of CPU use. If you’re using Siege, keep in mind that the traffic is artificial and probably pretty homogeneous. Some requests will be more intensive than others. You’ll want to avoid using too much CPU: if you take on too many requests at once, you could choke your server.

To avoid problems, increase the amount of synthetic traffic you’re generating after you increase the number of processes. You should be able to handle a majority of the volume of production traffic (that is, a similar number of requests) without exceeding about 50% CPU use at any time.

As for tuning the thread count, your mileage may vary. In a runtime like Python, where only one thread executes at a time, threads help you use your memory and CPU more efficiently. If you only have a single thread per process, each in-flight request ties up an entire process’s worth of memory, and when that request is waiting on something (the database, an outbound HTTP request, etc.) the process sits completely idle. Threads allow a single process to handle multiple requests at a time. If you set the number of threads too high, however, you risk threads waiting too long for the process to switch back to them (that is, too many threads waiting their turn for the CPU). You’ll want to play with this number a bit.

Last, you’ll want to set up rules for autoscaling. You want your environment to automatically add more instances when load increases, and turn off instances when load decreases. AWS gives you a ton of options for this, unlike Heroku. To determine which metric is the best one to scale on, think about how your application will behave under load and set conservative targets.

For Pinecast, I based this on CPU use. When the service comes under heavy load, there are enough processes and threads to easily begin to saturate the CPU (as you’ll recall, this is what I believe triggered the degradation on Heroku). When CPU utilization exceeds 70%, I scale up. When CPU use drops below 20%, I scale down. These numbers are probably not perfect, but I have alerts set up so that when scaling does kick in I can monitor it and make adjustments.
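
These thresholds can also be set outside the console. Here’s a sketch with boto3—the environment name is a placeholder, and you should double-check the aws:autoscaling:trigger option names against the current Beanstalk docs:

```python
# Sketch: point Beanstalk's auto scaling trigger at CPU utilization.
# Assumes AWS credentials/region are already configured for boto3.
import boto3

eb = boto3.client("elasticbeanstalk")

def opt(name, value):
    return {"Namespace": "aws:autoscaling:trigger", "OptionName": name, "Value": value}

eb.update_environment(
    EnvironmentName="pinecast-production",  # placeholder
    OptionSettings=[
        opt("MeasureName", "CPUUtilization"),
        opt("Statistic", "Average"),
        opt("Unit", "Percent"),
        opt("UpperThreshold", "70"),  # add an instance above 70% CPU
        opt("LowerThreshold", "20"),  # remove an instance below 20% CPU
    ],
)
```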

You’ll also want to set minimum and maximum numbers of instances for your service. I’d strongly recommend setting the minimum to at least 2, even if you plan to keep it at one in the long term: if you chose bad numbers for your process and thread counts, this will give you some relief.

The cutover was pretty uneventful: I switched the DNS entry in Cloudflare and everything happened exactly as intended. I did adjust the process count afterward (upward: I was using well below half of the memory on the server). CPU utilization stayed low (below 20%), so I was confident in decreasing the minimum instance count.

Monitoring

You’ll likely want to set up CloudWatch alerts for some of the following metrics as well:

  • If you’re using RDS for your database: too many and too few database connections, which can indicate a problem (or that it may be time to start using an RDS proxy).
  • An increase in ELB 5XX errors (503s and 504s), which indicates that there are not enough back-end instances of your service to handle the incoming load. Do not set the threshold for this too low: if your Beanstalk environment has a low minimum instance count, you’ll see small blips of ELB 5XX errors during deployments and configuration changes. (A sketch of this alarm follows the list.)
  • An unusual increase or substantial decrease in total requests to the load balancer for your environment.
  • A change in the unhealthy host count for your load balancer
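
As an example of the ELB 5XX alarm mentioned above, here’s a boto3 sketch. The metric names assume an Application Load Balancer (a classic ELB reports under AWS/ELB instead), and the load balancer dimension and SNS topic are placeholders:

```python
# Sketch: alarm on a sustained spike in load balancer 5XXs.
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="prod-elb-5xx",                       # placeholder
    Namespace="AWS/ApplicationELB",
    MetricName="HTTPCode_ELB_5XX_Count",
    Dimensions=[{"Name": "LoadBalancer",
                 "Value": "app/my-alb/1234567890abcdef"}],  # placeholder
    Statistic="Sum",
    Period=300,
    EvaluationPeriods=2,          # tolerate brief blips during deployments
    Threshold=25,                 # tune to your traffic
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:alerts"],  # placeholder
)
```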

Cleaning up the fiddly bits

Rollbar

I use Rollbar for error tracking. I can’t say it’s my favorite service, and they could certainly do with hiring a designer (please!), but it does the job and has all of the features I care about. I’d signed up through the Heroku marketplace, which essentially means that billing goes through Heroku, and Heroku puts the magic environment variables with my API key(s) onto the dynos automatically.

This is hugely convenient for getting started, but sucks when you’re leaving. I messaged Rollbar about disconnecting my account from Heroku but they didn’t respond for over a week. I decided there wasn’t really anything in there that I couldn’t set up again (and the remaining unresolved errors were either known or noise), so I simply canceled the subscription through Heroku, created a new account directly through Rollbar, and replaced the API tokens everywhere.

I regret doing this, to be honest, because it was a pain (and now I need to re-mute a ton of annoying exceptions that aren’t my problem, like Chrome extension errors). If you run into this, nag support until they help you.

I should also note that Rollbar has a live chat (Intercom?) widget on their dashboard, but it doesn’t offer actual support chat. I honestly cannot understand why they’d do this. It simply redirects you to email, which seems to be a dead end.

Crons

The Heroku scheduler is honestly really convenient, for one single reason: it runs a command just like your Procfile runs your application. With environment variables. Beanstalk has cron jobs, and they’re more powerful. But annoyingly, they don’t come with your environment variables by default; you need to run a (long-ish) command to load them yourself, which is just exhausting.
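
On Amazon Linux 2, one way to do it (a sketch, and an assumption about your platform rather than a documented recipe) is a small wrapper that pulls the environment properties from Beanstalk’s get-config tool before running the management command:

```python
#!/usr/bin/env python3
"""Cron wrapper sketch: load Beanstalk environment properties, then run a
Django management command. Assumes an Amazon Linux 2 platform, where
/opt/elasticbeanstalk/bin/get-config is available, and that the app lives
in /var/app/current."""
import json
import os
import subprocess
import sys

# Beanstalk environment properties as a JSON object of {name: value}.
env_json = subprocess.run(
    ["/opt/elasticbeanstalk/bin/get-config", "environment"],
    check=True, capture_output=True, text=True,
).stdout
os.environ.update(json.loads(env_json))

# e.g. crontab entry:  0 * * * *  /path/to/cron_wrapper.py send_digests
subprocess.run(
    [sys.executable, "/var/app/current/manage.py", *sys.argv[1:]],
    check=True,
)
```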

This could be better, but I can’t complain too much.

I didn’t actually finish this part of the migration initially. I simply downgraded the Heroku instance to Hobby and kept the Heroku scheduler running. The dyno serves no requests; it just runs scheduled tasks periodically.

heroku run

Being able to run ad-hoc commands is invaluable. You can sort of do this with eb ssh, which lets you SSH into your Beanstalk instances. You can then load environment variables and run a command. Conceivably, you could write a little script that wraps some commands together to give you the same experience as heroku run. That said, I’m taking the same approach as crons (i.e., I’m just using Heroku for it) until I settle on something more substantive. I’d ideally like to do this without using the eb CLI tool, since I already have the aws CLI. At this point it’s just a matter of figuring out the right incantations.
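
One possible approach, without the eb CLI at all, would be to lean on Systems Manager’s Run Command. This is a sketch of the idea, not something I’m running today: it assumes the instances have the SSM agent, an instance profile that allows SSM, and the standard tags Beanstalk applies to its instances.

```python
# Sketch: a "heroku run"-ish helper built on SSM Run Command instead of eb ssh.
# Assumptions: SSM agent on the instances, IAM permissions for SSM, and the
# elasticbeanstalk:environment-name tag that Beanstalk applies to its instances.
import sys
import time

import boto3

ENV_NAME = "pinecast-production"  # placeholder

ec2 = boto3.client("ec2")
ssm = boto3.client("ssm")

# Find one running instance belonging to the Beanstalk environment.
reservations = ec2.describe_instances(
    Filters=[
        {"Name": "tag:elasticbeanstalk:environment-name", "Values": [ENV_NAME]},
        {"Name": "instance-state-name", "Values": ["running"]},
    ]
)["Reservations"]
instance_id = reservations[0]["Instances"][0]["InstanceId"]

# Run the command from the app directory. (For true parity with heroku run,
# you'd also load the environment properties first, as in the cron wrapper.)
command = sys.argv[1]  # e.g. "python manage.py showmigrations"
sent = ssm.send_command(
    InstanceIds=[instance_id],
    DocumentName="AWS-RunShellScript",
    Parameters={"commands": [f"cd /var/app/current && {command}"]},
)
command_id = sent["Command"]["CommandId"]

# Poll for completion and print the output.
time.sleep(2)  # the invocation can take a moment to become queryable
while True:
    inv = ssm.get_command_invocation(CommandId=command_id, InstanceId=instance_id)
    if inv["Status"] not in ("Pending", "InProgress", "Delayed"):
        break
    time.sleep(1)
print(inv["StandardOutputContent"])
```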

A lot of advice says to just forgo ad-hoc commands and to run your commands as deploy scripts or as crons. Some of the Django+Beanstalk tutorials I found online suggest running migrations this way. Frankly, this is incredibly dangerous advice: you can easily paint yourself into a corner, cause downtime (especially if you have multiple environments, or your migrations require interaction!), or even lose data. Don’t do that.

What’s left

Well, as I said, there are only two things that haven’t been fully sunset at this point:

  1. Heroku scheduler
  2. Use of heroku run for ad-hoc commands

I’ll probably switch to Beanstalk crons soon, but it’s not a super high priority. And I’ll probably get stuck in on an SSH tool for ad-hoc commands soon as well, but that’s an exercise in reverse-engineering eb.

Would I recommend Beanstalk for a new project? I’m not sure I would. Heroku definitely got me through my first couple of years. But frankly, Heroku is just not a good solution for anyone pumping a serious amount of traffic through their service (or at least anyone who doesn’t want to waste money).

Speaking of money, I’ve gone from $500/mo on Heroku dynos to about $80/mo on EC2 instances (the other non-RDS costs are pretty trivial). That’s a big win! On the other hand, I’ve gone from paying $50/mo for Heroku’s Postgres database to ~$400/mo for Aurora Postgres. That’s less ideal, but it’s also the case that I could probably save a bunch of money simply by downgrading the instance types: I’m using db.r5.large which is (probably) much more than is needed. I suspect that db.t3.large will work just fine.

Additionally, I’m not using reserved instances (yet!). I plan to purchase RIs for my staging and production environments, which will be ~large up-front costs, but save a big chunk of money over the course of the year. I also plan to purchase RIs for the Postgres instances, but only once I’ve gotten that optimized to my liking.

As for performance, it’s difficult to come up with a benchmark since there are a lot of moving parts. Things feel snappier, but if anything that’s probably the result of using Aurora Postgres rather than Beanstalk. The site hasn’t become degraded, at least. That’s a welcome change. In fact, I haven’t had any issues since migrating.

Overall, I’d call this a win. If you have questions or comments, please don’t hesitate to reach out to me on Twitter.
