Django Channels, ASGI, and Heroku

Matt Basta
12 min read · May 5, 2020

This past week, to satisfy my urge to build something new, I’ve started putting together a project with a real-time chat component, which means websockets. In a server-side JavaScript project, this is not something I’d think twice about: setting up a project with ws or another similar library is easy. But I love Django’s ORM, I’ve got a pile of useful Python from Pinecast, and Django 3 was recently released with ASGI support, so I decided to play with Django Channels for the first time.

Channels is a project from Django that allows you to build applications that interact with clients over protocols other than HTTP. Simply put, it’s a first-party way to continue to use Django for your project, but expose services over protocols like websockets, server-sent events, or other APIs. Folks outside of the Python world do this regularly, and folks within the Python world are busy writing crazy tools with Tornado or Twisted or even raw sockets. Until Channels came out, building asynchronous web apps in Python meant giving up many of the niceties that popular frameworks like Django offer.

Heroku is — for better or worse — my choice of host for new projects. I’ve recently bemoaned Heroku’s lack of recent public-facing product features (thanks Salesforce). And frankly, the lack of any meaningful developer-facing changes is disheartening, but Heroku still remains the easiest way — in my opinion — to get a new project off the ground and scale it without breaking the bank (or going insane with AWS boilerplate, or betting on a new startup).

In 2016, Heroku blogged about Django Channels. They even released some sample code (now archived). However, that was built for a very old version of Django, and an old version of Channels, and quite honestly any sort of example code that old and unmaintained is almost certainly a bad idea to play with in production.

The following is a walkthrough of what I did to get things up and running.

Getting it up and running

I started my project on Django 3.0.3 and Channels 2.4.0. The first and most important step was getting things running locally, which is straightforward.

Locally, I’m using a sqlite3 database, so aside from the obvious steps to get Channels installed, simply following the latest installation guides got python manage.py runserver doing what it was supposed to be doing without any fuss.
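For reference, the Channels 2.x setup boils down to a couple of settings. Here’s a minimal sketch, assuming a project named yourapp (the official installation guide is the authoritative source):

INSTALLED_APPS = [
    'channels',  # lets Channels take over runserver, among other things
    'django.contrib.admin',
    'django.contrib.auth',
    # ...the rest of your apps
]

# Point Channels at your root routing configuration
ASGI_APPLICATION = 'yourapp.routing.application'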

The next step was getting the project running on Heroku. This is more challenging.

Deploying to Heroku

The article from 2016 recommends using Daphne. Daphne is an HTTP/websocket protocol server for ASGI, maintained by the Django Project. Though it’s seemingly well-maintained, numerous posts online indicate that it’s not very fast (taking seconds to load a route that would otherwise load almost instantly under gunicorn/WSGI).

Some basic research indicates that there are three choices: Daphne, Uvicorn, and Hypercorn (beta). Uvicorn is the only non-Daphne “stable” choice, and it’s built on uvloop, a wrapper around the same libuv event loop that powers Node, so I’m reassured that it’s probably fast.

The biggest change in switching from WSGI to ASGI (gunicorn to uvicorn) is adding the latter to your Pipfile and updating your Procfile to use the new server.

Two things stuck out:

  1. uvicorn doesn’t pull from $PORT. You must specify --port $PORT or it won’t boot. I specified --host 0.0.0.0 while I was at it just in case, but it might be required.
  2. uvicorn doesn’t have an equivalent to --max-requests-jitter in gunicorn. --max-requests in gunicorn allows you to restart a worker after some number of requests, and this is available in uvicorn as --limit-max-requests. Without the jitter option, though, you potentially set yourself up for thundering herds of server reboots. That’s not a problem for a project with no traffic (like mine), but I couldn’t for instance deploy it to Pinecast in its current state. This is perhaps a good open source contribution that I could make in the coming weeks.
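Putting those together, the Procfile entry ends up looking something like this (a sketch, assuming a project named yourapp; --host is the just-in-case flag mentioned above):

web: uvicorn yourapp.asgi:application --host 0.0.0.0 --port $PORT --limit-max-requests=1200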

Deploying this change allowed my server to boot and start serving requests.

Making it run

I have been very frugal so far and have stuck around on the Hobby tier to get this project built before I start getting beta users into it. Being on this tier allowed me to catch an issue early: after I logged into the Django Admin panel and started adding data, my server mysteriously started 500ing. Running heroku logs --tail showed an error ending with:

app[web.1]: File "/app/.heroku/python/lib/python3.7/site-packages/django/utils/asyncio.py", line 26, in inner
app[web.1]: return func(*args, **kwargs)
app[web.1]: File "/app/.heroku/python/lib/python3.7/site-packages/django/db/backends/postgresql/base.py", line 185, in get_new_connection
app[web.1]: connection = Database.connect(**conn_params)
app[web.1]: File "/app/.heroku/python/lib/python3.7/site-packages/psycopg2/__init__.py", line 130, in connect
app[web.1]: conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
app[web.1]: django.db.utils.OperationalError: FATAL: too many connections for role "gihouwrmsatwpc"

Running heroku pg:info showed that the database had exceeded 20 connections, which is the limit for a hobby-tier database.

Under WSGI, you deploy some number of Heroku dynos, and specify WEB_CONCURRENCY to choose how many instances of your app each dyno runs. I’ve had WEB_CONCURRENCY set to 3, which is a very reasonable number for a small app with no traffic. With one dyno, that should come out to 3 database connections.

That’s not true for Django Channels. To run synchronous code, the async event loop spins up a thread pool, and by default the number of threads is five times the number of CPUs your server has (Python 3.7’s default ThreadPoolExecutor sizing). Each thread creates its own database connection. On my cheap Heroku dyno:

>>> import multiprocessing
>>> multiprocessing.cpu_count()
8

Yikes, that means up to 40 database connections (potentially up to 120, if that number is multiplied by WEB_CONCURRENCY). Rebooting the dynos and mashing F5 in the browser showed the number of connections shooting up and quickly exceeding the 20-connection limit.

To limit this, you need to set the ASGI_THREADS environment variable, which tells Channels to cap the number of threads it creates. I set mine to five, which seems sufficient for now. In the future, it’ll be worthwhile to reset this and install the pgbouncer buildpack.
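On Heroku, that’s a one-liner:

$ heroku config:set ASGI_THREADS=5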

Realtime

To make the app actually take advantage of the ASGI+Channels setup, I installed Graphene Subscriptions to go along with Graphene Django. With relatively little boilerplate, Graphene Subscriptions allows you to listen to Django signals for model save (create or update) and delete. For instance, the following code allows you to create a GraphQL subscription for newly created Messages in a Room in a chat application:

import graphene
from graphene_subscriptions.events import CREATED

from ..models import Message  # Our Django model for messages
from .message_type import MessageType  # Our Graphene Message type


# Our GraphQL subscription class
class Subscription(graphene.ObjectType):
    # Define what the user can subscribe to in our schema
    message_created = graphene.Field(
        graphene.NonNull(MessageType),
        room=graphene.Argument(graphene.ID, required=True),
    )

    def resolve_message_created(root, info, **kw):
        def filter(event):
            return (
                event.operation == CREATED and
                isinstance(event.instance, Message) and
                event.instance.room_id == kw.get('room')
            )

        return root.filter(filter).map(lambda e: e.instance)

You’ll want to follow the installation instructions for information on how to update your settings.py and routing.py files, add signal handlers, and link the Subscription class (above) to your Schema instance.
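For reference, a minimal routing.py might look something like this (a sketch based on the graphene-subscriptions README; the path is an assumption and must match whatever URL your client connects to):

from channels.routing import ProtocolTypeRouter, URLRouter
from django.urls import path
from graphene_subscriptions.consumers import GraphqlSubscriptionConsumer

application = ProtocolTypeRouter({
    # Plain HTTP requests fall through to ordinary Django views
    'websocket': URLRouter([
        path('graphql', GraphqlSubscriptionConsumer),
    ]),
})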

The Subscription class is defined as a Graphene object type, just like you’d do for a plain old query. The resolve_* methods, rather than receiving an object for root, receive an observable that emits events from signals generated in other instances of your application. So, for instance, if I save a new Message on one server, the act of saving that Message triggers a Django signal. Graphene Subscriptions puts an event object into a channel, which passes through the channel layer and pops out on the other server.

Meanwhile, our user should have established a GraphQL subscription. For instance:

subscription MessageCreated($room: ID!) {
  messageCreated(room: $room) {
    id
    created
    messageText
    sender {
      id
      name
      avatar(size: 32)
    }
  }
}

When the subscription is initiated, the resolve_message_created method fires. Graphene Subscriptions subscribes to the channel that receives events about new Messages. root is the observable that iterates over the events as they come in from the channel.

An observable behaves a lot like an array: you can map or filter the data, but the data isn’t readily available yet. Imagine if you had a Node event emitter that you could filter events with, and map to transform their bodies (just like you could with an array): you don’t know when the events will fire, but when they do you can perform some actions against them.

In our case, we use filter to filter out events that aren’t object creations, aren’t object creations for Messages, and aren’t events for Messages created in the chat room that we’re subscribed to. Note that in a production application, you’d want to check that info.context.user has access to the room in question, and perhaps periodically check that the user still has access to the room: if the user was kicked or blocked, their subscription won’t automatically close:

def resolve_message_created(root, info, **kw):
    def filter(event):
        return (
            event.operation == CREATED and
            isinstance(event.instance, Message) and
            event.instance.room_id == kw.get('room') and
            # nice and safe: confirm the requesting user can see this room
            check_user_has_access_to_room(info.context.user, kw.get('room'))
        )

    return root.filter(filter).map(lambda e: e.instance)

Making it work with Apollo

Getting this up and running with Apollo wasn’t very hard. First, our ApolloClient needs to be updated with a WebsocketLink to our back-end. This teaches Apollo how to connect a web socket to our server. After installing the necessary dependencies, defining the link for this was simple:

const wsLink = new WebSocketLink({
  uri: `${window.location.protocol === 'https:' ? 'wss' : 'ws'}://${
    window.location.host
  }/graphql`,
  options: {
    reconnect: true,
  },
});

Then, to use the link, we need to tell Apollo how to decide whether to use HTTP or the web socket. For that, we use the split method from apollo-link.

import {split} from 'apollo-link';

const link = split(
  ({query}) => {
    const definition = getMainDefinition(query);
    return (
      definition.kind === 'OperationDefinition' &&
      definition.operation === 'subscription'
    );
  },
  wsLink, // Our new web socket link
  httpLink, // Our existing HTTP link
);

This code looks at the operations our application is making. If the operation is a subscription, it uses the web socket link; otherwise, it uses the old-fashioned HTTP link.

In development, this worked great. As soon as I created a subscription, new messages arrived the moment they were created. Magical!

In production, this didn’t work. The production instances immediately started flapping when our client attempted to establish a web socket connection:

ValueError: Django can only handle ASGI/HTTP connections, not websocket.

We also need to make a change to our Django application: our asgi.py file needs to learn about Channels, otherwise web sockets will never reach it. Remember, Django can’t handle non-HTTP requests out of the box. The fix is just a few lines:

diff --git a/yourapp/asgi.py b/yourapp/asgi.py
index eb61b00..19e073e 100644
--- a/yourapp/asgi.py
+++ b/yourapp/asgi.py
@@ -9,8 +9,9 @@
 
 import os
 
-from django.core.asgi import get_asgi_application
+import django
+from channels.routing import get_default_application
 
 os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'yourapp.settings')
-
-application = get_asgi_application()
+django.setup()
+application = get_default_application()

Essentially, we’re swapping out Django’s get_asgi_application method for Channels’ get_default_application method. We’re also calling django.setup(), which get_asgi_application would have done for us.

Making it work on Heroku

I managed to get the application up and running on Heroku without much fuss. If you’re using Cloudflare, be sure you’ve enabled web sockets there!

After the application had been alive for tens of minutes, though, messages in my application started failing to send. I received some spooky tracebacks in my Heroku logs ([...] is where I snipped some wordy stack trace details):

ERROR:    An error occurred while resolving field Mutations.createMessage
Traceback (most recent call last):
File "[...]/site-packages/graphql/execution/executor.py", line 452, in resolve_or_error
return executor.execute(resolve_fn, source, info, **args)
[...]File "/app/messaging/schema/mutations/create_message.py", line 68, in mutate
message.save()
[...]File "[...]/site-packages/aioredis/connection.py", line 322, in execute
raise ConnectionClosedError(msg)
aioredis.errors.ConnectionClosedError: Reader at end of file

Weird! The application consistently works when it’s first deployed, then fails a short time later.

The first thing I noticed was some nearby log entries indicating that I was over the Redis connection limit for the free tier. Oops!

After upgrading to a Premium instance, I was seeing a very similar problem, but it took longer for the service to start failing. Double spooky!

I did some googling and came across this GitHub issue on the Redis channel layer. The channel layer, remember, is the software that distributes messages to Channels consumers on different hosts. If I send a chat message on one server, I expect it to be received by clients connected to all of my other servers, not just the one the message was sent from.

The comments indicate a few things:

  • The problem seems to be resolved by upgrading to the latest Redis channel layer. I was already on the latest version, so this was not the issue.
  • The problem seems to be mostly unique to Heroku (though some folks have experienced it locally), which indicates a configuration issue or bug in Heroku’s Redis implementation.
  • Folks have reported success when switching from Heroku’s first-party Redis offering to Redis Cloud (another Heroku add-on). Some folks reported intermittent issues even after moving to Redis Cloud, though.

The final comment (at the time of writing, from Oct 2019) suggests that the Redis idle timeout might be the culprit here. This would explain why the connection is active initially, but fails after a short time.

On Heroku, the idle timeout defaults to about five minutes, but this is easily changed. Disabling the idle timeout altogether is a single command:

$ heroku redis:timeout [your-redis-instance-0000] --seconds=0
Timeout for [your-redis-instance-0000] (REDIS_URL) set to 0 seconds.
Connections to the Redis instance can idle indefinitely.

Making it production-ready

The other part of this process is making sure the application is ready for production users! Having it work is only half of the battle.

The first big step is securing the Redis instance. Heroku recommends using the Stunnel buildpack for this. Note that this is not available for hobby-dev instances, so be sure to upgrade to a Premium Redis instance.

You can install the buildpack by running this command:

$ heroku buildpacks:add -i 1 heroku/redis

Now, configure your app to use Stunnel. You’ll want to add bin/start-stunnel to the start of your commands in your Procfile, like so:

web: bin/start-stunnel uvicorn ironcoach.asgi:application --limit-max-requests=1200 --port $PORT

You’ll do this for each dyno type you configure; if you have worker dynos that use Channels (or otherwise talk to Redis), they need the same treatment. One thing that the Heroku docs (and the buildpack docs) don’t mention is that you no longer want to use the REDIS_URL environment variable to connect to Redis. This is important: if you keep connecting with REDIS_URL, you’ll get a “connection refused” error on startup. In your settings.py file, you’ll want to do something like this:

# Remove this:
FOO = os.environ.get('REDIS_URL')

# And do this instead:
REDIS_URL = os.environ.get('REDIS_URL_STUNNEL') or \
    os.environ.get('REDIS_URL')
FOO = REDIS_URL
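For the Channels piece specifically, that resolved URL is what should feed your channel layer configuration. A minimal sketch, assuming the channels_redis backend (the exact shape of your settings will differ):

CHANNEL_LAYERS = {
    'default': {
        'BACKEND': 'channels_redis.core.RedisChannelLayer',
        'CONFIG': {
            # Use the stunnel-aware URL resolved above, not REDIS_URL directly
            'hosts': [REDIS_URL],
        },
    },
}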

When you deploy your application next, you’ll see this at the top of the build log:

-----> stunnel app detected
-----> Moving the configuration generation script into app/bin
-----> Moving the start-stunnel script into app/bin
-----> stunnel done

Other notes

While setting things up, I encountered a few other problems and hiccups along the way.

graphene-subscriptions only supports one group

I found that graphene-subscriptions uses the same Channels group for every broadcast. What this means is that every time a registered model is created, updated, or deleted (the events that you set up signals for), the event is broadcast to every server instance listening for messages.

For a toy project, this is no problem. And in fact, this probably wouldn’t be an issue for most projects that have a few requests per second. However, it poses a serious scalability issue for projects that have a meaningful volume of messages, or projects that have a large number of server instances.

Consider this: if each message is 100 bytes and you have 10 server instances, you’re broadcasting 1KB across the network for every message. If you have 30 server instances, you’re broadcasting 3KB across the network for every message. If your message is 500 bytes and you have 30 servers, each message results in 15KB of network traffic.

Beyond traffic volume, you end up dealing with a great deal of excess compute. Consider a system that processes ten messages per second. If every server instance receives a copy of every message, every instance must process ten messages per second—even if none of the messages are relevant to any active subscriptions. Receiving, parsing, and filtering the messages is expensive, and will quickly use up CPU time.

The first obvious fix is to broadcast each model’s messages through its own group. This means that if a server instance subscribes to ModelFoo changes, it won’t receive broadcasts for changes to instances of ModelBar. However, if you’re building a chat application and store messages with a Messages model, your application hasn’t improved much.

The second obvious fix is to broadcast each type of event to a model through its own group. That is, one group for creation of model instances, one group for updates, and one group for deletion (per model). A server instance subscribing to new ModelFoo instances won’t see deleted ModelFoo events. This is an improvement, but for applications where one type of event is far more common (in our chat example, creation), this still doesn’t scale very well.

The third and complete fix is to broadcast events on a group specific to the nature of the model. In a chat room application, this might mean broadcasting events to a group formatted as message-room:{room_id}. In this case, a server would only receive events for messages sent to the chat rooms that it has connected users participating in: a message sent to #random wouldn’t be broadcast to a server whose users are in #general and #announcements.
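To make that concrete, here’s a hypothetical sketch of the third fix written directly against the channel layer API (the group naming scheme and the broadcast_message_created helper are my own illustration, not graphene-subscriptions API):

from asgiref.sync import async_to_sync
from channels.layers import get_channel_layer

def broadcast_message_created(message):
    channel_layer = get_channel_layer()
    # Channels group names only allow alphanumerics, dashes, underscores,
    # and periods, so a period stands in for the colon used in prose above.
    group = 'message-room.{}'.format(message.room_id)
    async_to_sync(channel_layer.group_send)(group, {
        'type': 'message.created',  # dispatched to a consumer method
        'message_id': message.id,
    })

A consumer would call group_add with the same group name when one of its users joins the room, so only servers with a participant in that room ever receive the event.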

The first two changes will become the default behavior for graphene-subscriptions in version 2.0.

Additionally, graphene-subscriptions v2 opens the door to defining custom groups, which would allow you to implement the third fix in your own application. Thanks to Jayden Windle for his amazing efforts maintaining this project.

stunnel crashes on startup

For an unknown reason, stunnel was failing to start when my dynos restarted, with a spooky error:

INTERNAL ERROR: systemd initialization failed at stunnel.c, line 101

It took a customer support ticket to Heroku to track this down. Two environment variables had been automatically set on my dyno: LISTEN_PID and LISTEN_FDS. The former was causing stunnel to crash. If you encounter this error, just remove these two environment variables.
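If they show up as config vars (an assumption; check heroku config first), removing them is one command:

$ heroku config:unset LISTEN_PID LISTEN_FDS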

What’s next?

I hope you found this useful! If you enjoyed it, please let me know what you’d like to see me write about in the future.
