Django Channels, ASGI, and Heroku

This past week, to satisfy my urge to build something new, I’ve started putting together a project which has a real-time chat component, which means websockets. In a server-side JavaScript project, this is not something I’d think twice about: setting up a project with ws or another similar library is easy. But I love Django’s ORM and I’ve got a pile of useful Python from Pinecast, and Django 3 was recently released with ASGI support, so I decided to play with Django Channels for the first time.

Channels is a project from Django that allows you to build applications that interact with clients over protocols other than HTTP. Simply put, it’s a first-party way to continue to use Django for your project while exposing services over protocols like websockets and server-sent events. Folks outside of the Python world do this regularly, and folks within the Python world are busy writing crazy tools with Tornado or Twisted or even raw sockets. Until Channels came out, building asynchronous apps on the web in Python meant giving up many of the niceties that popular frameworks like Django offer.

Heroku is — for better or worse — my choice of host for new projects. I’ve recently bemoaned Heroku’s lack of recent public-facing product features (thanks Salesforce). And frankly, the lack of any meaningful developer-facing changes is disheartening, but Heroku still remains the easiest way — in my opinion — to get a new project off the ground and scale it without breaking the bank (or going insane with AWS boilerplate, or betting on a new startup).

In 2016, Heroku blogged about Django Channels. They even released some sample code (now archived). However, that was built for very old versions of both Django and Channels, and quite honestly, example code that old and unmaintained is almost certainly a bad idea to use in production.

The following is a walkthrough of what I did to get things up and running.

Getting it up and running

Locally, I’m using a sqlite3 database, so aside from the obvious steps to get Channels installed, simply following the latest installation guide got python manage.py runserver doing what it was supposed to without any fuss.
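For reference, the Channels side of my settings.py ends up looking roughly like this: a minimal sketch, with yourapp standing in for the project name and the in-memory channel layer standing in for Redis during local development:

# settings.py: Channels 2.x setup ('yourapp' is a placeholder)
INSTALLED_APPS = [
    'channels',
    # ... your other apps ...
]

# Point Channels at the application defined in yourapp/routing.py
ASGI_APPLICATION = 'yourapp.routing.application'

# The in-memory layer is fine locally; production swaps in channels_redis
CHANNEL_LAYERS = {
    'default': {
        'BACKEND': 'channels.layers.InMemoryChannelLayer',
    },
}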

The next step was getting the project running on Heroku. This is more challenging.

Deploying to Heroku

Some basic research indicates that there are three choices: Daphne, Uvicorn, and Hypercorn (still in beta). Uvicorn is the only stable alternative to Daphne, and it’s built on uvloop, which shares its underpinnings (libuv) with Node’s event loop, so I’m reassured that it’s probably fast.

The biggest change when switching from WSGI to ASGI (gunicorn to uvicorn) is to add the latter to your Pipfile and update your Procfile to use the new server; a sketch of the resulting Procfile follows the list below.

Two things stuck out:

  1. uvicorn doesn’t pull from $PORT. You must specify --port $PORT or it won’t boot. I specified --host 0.0.0.0 while I was at it just in case, though it may well be required.
  2. uvicorn doesn’t have an equivalent to --max-requests-jitter in gunicorn. --max-requests in gunicorn allows you to restart a worker after some number of requests, and this is available in uvicorn as --limit-max-requests. Without the jitter option, though, you potentially set yourself up for thundering herds of server reboots. That’s not a problem for a project with no traffic (like mine), but I couldn’t, for instance, deploy it to Pinecast in its current state. This is perhaps a good open source contribution that I could make in the coming weeks.
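With those flags in place, the Procfile ends up looking something like this (yourapp is a placeholder for your project’s module):

web: uvicorn yourapp.asgi:application --host 0.0.0.0 --port $PORT --limit-max-requests=1200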

Deploying this change allowed my server to boot and start serving requests.

Making it run

Not long after deploying, though, the logs started filling with tracebacks like this one:

app[web.1]:   File "/app/.heroku/python/lib/python3.7/site-packages/django/utils/asyncio.py", line 26, in inner
app[web.1]:     return func(*args, **kwargs)
app[web.1]:   File "/app/.heroku/python/lib/python3.7/site-packages/django/db/backends/postgresql/base.py", line 185, in get_new_connection
app[web.1]:     connection = Database.connect(**conn_params)
app[web.1]:   File "/app/.heroku/python/lib/python3.7/site-packages/psycopg2/__init__.py", line 130, in connect
app[web.1]:     conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
app[web.1]: django.db.utils.OperationalError: FATAL: too many connections for role "gihouwrmsatwpc"

Running heroku pg:info showed that the database had exceeded 20 connections, which is the limit for a hobby-tier database.

Under WSGI, you deploy some number of Heroku dynos, and specify WEB_CONCURRENCY to choose how many instances of your app each dyno runs. I’ve had WEB_CONCURRENCY set to 3, which is a very reasonable number for a small app with no traffic. With one dyno, that should come out to 3 database connections.

That’s not true for Django Channels. The async event loop creates threads, and the number of threads is equal to the number of CPUs your server has, times five. Each thread creates its own database connection. On my cheap Heroku dyno:

>>> import multiprocessing
>>> multiprocessing.cpu_count()
8

Yikes, that means up to 40 database connections (potentially up to 120, if that number is multiplied by WEB_CONCURRENCY). Rebooting the dynos and mashing F5 in the browser showed the number of connections shooting up and quickly exceeding the 20-connection limit.

To limit this, you need to set the ASGI_THREADS environment variable, which tells Channels to limit the number of threads. I set mine to five, which seems sufficient for now. In the future, it’ll be worthwhile to revisit this and install the pgbouncer buildpack.
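Setting the variable is a one-liner with the Heroku CLI:

$ heroku config:set ASGI_THREADS=5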

Realtime

For the realtime layer itself, I’m using graphene-subscriptions, which bridges Django signals to GraphQL subscriptions over Channels. Here’s my subscription definition:

import graphene
from graphene_subscriptions.events import CREATED

from ..models import Message  # Our Django model for messages
from .message_type import MessageType  # Our Graphene Message type


# Our GraphQL subscription class
class Subscription(graphene.ObjectType):
    # Define what the user can subscribe to in our schema
    message_created = graphene.Field(
        graphene.NonNull(MessageType),
        room=graphene.Argument(graphene.ID, required=True),
    )

    def resolve_message_created(root, info, **kw):
        def filter(event):
            return (
                event.operation == CREATED and
                isinstance(event.instance, Message) and
                event.instance.room_id == kw.get('room')
            )

        return root.filter(filter).map(lambda e: e.instance)

You’ll want to follow the installation instructions for information on how to update your settings.py and routing.py files, add signal handlers, and link the Subscription class (above) to your Schema instance.
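The signal-handler step, for reference, looks roughly like this (a sketch based on the graphene-subscriptions docs):

# signals.py: forward Django's post_save signal into graphene-subscriptions
from django.db.models.signals import post_save

from graphene_subscriptions.signals import post_save_subscription

from .models import Message

# Import this module from your AppConfig.ready() so the handler registers
post_save.connect(
    post_save_subscription,
    sender=Message,
    dispatch_uid='message_post_save',
)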

The Subscription class is defined as a Graphene object type, just like you’d do for a plain old query. The resolve_* methods, rather than receiving an object for root, receive an observable that emits events from signals generated in other instances of your application. So, for instance, if I save a new Message on one server, the act of saving that Message triggers a Django signal. Graphene Subscriptions puts an event object into a channel, which passes through the channel layer and pops out on the other server.

Meanwhile, our user should have established a GraphQL subscription. For instance:

subscription MessageCreated($room: ID!) {
  messageCreated(room: $room) {
    id
    created
    messageText
    sender {
      id
      name
      avatar(size: 32)
    }
  }
}

When the subscription is initiated, the resolve_message_created method fires. Graphene Subscriptions subscribes to the channel that receives events about new Messages. root is the observable that iterates over the events as they come in from the channel.

An observable behaves a lot like an array: you can map or filter the data, but the data isn’t readily available yet. Imagine if you had a Node event emitter that you could filter events with, and map to transform their bodies (just like you could with an array): you don’t know when the events will fire, but when they do you can perform some actions against them.

In our case, we use filter to filter out events that aren’t object creations, aren’t object creations for Messages, and aren’t events for Messages created in the chat room that we’re subscribed to. Note that in a production application, you’d want to check that info.context.user has access to the room in question, and perhaps periodically check that the user still has access to the room: if the user was kicked or blocked, their subscription won’t automatically close:

def resolve_message_created(root, info, **kw):
    user = info.context.user

    def filter(event):
        return (
            event.operation == CREATED and
            isinstance(event.instance, Message) and
            event.instance.room_id == kw.get('room') and
            # nice and safe
            check_user_has_access_to_room(user, kw.get('room'))
        )

    return root.filter(filter).map(lambda e: e.instance)

Making it work with Apollo

First, we create a web socket link on the client (WebSocketLink comes from the apollo-link-ws package):

import {WebSocketLink} from 'apollo-link-ws';

const wsLink = new WebSocketLink({
  uri: `${window.location.protocol === 'https:' ? 'wss' : 'ws'}://${
    window.location.host
  }/graphql`,
  options: {
    reconnect: true,
  },
});

Then, to use the link, we need to tell Apollo how to decide whether to use HTTP or the web socket. For that, we use the split method from apollo-link.

import {split} from 'apollo-link';
import {getMainDefinition} from 'apollo-utilities';

const link = split(
  ({query}) => {
    const definition = getMainDefinition(query);
    return (
      definition.kind === 'OperationDefinition' &&
      definition.operation === 'subscription'
    );
  },
  wsLink, // Our new web socket link
  httpLink, // Our existing HTTP link
);

This code looks at the operations our application is making. If the operation is a subscription, it uses the web socket link; if it’s not, it uses the old-fashioned HTTP link.

In development, this worked great. Creating a new subscription immediately saw messages arriving when they were created. Magical!

In production, this didn’t work. The production instances immediately started flapping when our client attempted to establish a web socket connection:

ValueError: Django can only handle ASGI/HTTP connections, not websocket.

We also need to make a change to our Django application: our asgi.py file needs to learn about Channels. Otherwise, web socket connections get handed to plain Django, which can’t handle non-HTTP requests out of the box. The fix is just a few lines:

diff --git a/yourapp/asgi.py b/yourapp/asgi.py
index eb61b00..19e073e 100644
--- a/yourapp/asgi.py
+++ b/yourapp/asgi.py
@@ -9,8 +9,9 @@

import os

-from django.core.asgi import get_asgi_application
+import django
+from channels.routing import get_default_application

os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'yourapp.settings')
-
-application = get_asgi_application()
+django.setup()
+application = get_default_application()

Essentially, we’re swapping out Django’s get_asgi_application method for Channels’ get_default_application method. We’re also calling django.setup(), which get_asgi_application would have done for us.

Making it work on Heroku

After the application had been alive for tens of minutes, messages in my application started failing to send. I received some spooky tracebacks in my Heroku logs ([...] is where I snipped some wordy stack trace details):

ERROR:    An error occurred while resolving field Mutations.createMessage
Traceback (most recent call last):
File "[...]/site-packages/graphql/execution/executor.py", line 452, in resolve_or_error
return executor.execute(resolve_fn, source, info, **args)
[...]File "/app/messaging/schema/mutations/create_message.py", line 68, in mutate
message.save()
[...]File "[...]/site-packages/aioredis/connection.py", line 322, in execute
raise ConnectionClosedError(msg)
aioredis.errors.ConnectionClosedError: Reader at end of file

Weird! The application consistently works when it’s first deployed, then fails a short time later.

The first thing that I noticed was some nearby log entries indicating that I was over the Redis connection limit for the free tier. Oops!

After upgrading to a Premium instance, I was seeing a very similar problem, but it took longer for the service to start failing. Double spooky!

I did some googling and came across this GitHub issue on the Redis channel layer. The channel layer, remember, is the software that distributes messages to Channels consumers on different hosts. If I send a chat message on one server, I expect it to be received by clients connected to all of my other servers, not just the one the message was sent from.

The comments indicate a few things:

  • The problem seems to be resolved by upgrading to the latest Redis channel layer. I was already on the latest version, so this was not the issue.
  • The problem seems to be mostly unique to Heroku (though some folks have experienced it locally), which indicates a configuration issue or bug in Heroku’s Redis implementation.
  • Folks have reported success when switching from Heroku’s first-party Redis offering to Redis Cloud (another Heroku add-on). Some folks reported intermittent issues even after moving to Redis Cloud, though.

The final comment (at the time of writing, from Oct 2019) suggests that the Redis idle timeout might be the culprit here. This would explain why the connection is active initially, but fails after a short time.

On Heroku, the idle timeout defaults to about five minutes. Disabling it altogether is as easy as running

$ heroku redis:timeout [your-redis-instance-0000] --seconds=0
Timeout for [your-redis-instance-0000] (REDIS_URL) set to 0 seconds.
Connections to the Redis instance can idle indefinitely.

Making it production-ready

The first big step is securing the Redis instance. Heroku recommends using the Stunnel buildpack for this. Note that this is not available for hobby-dev instances, so be sure to upgrade to a Premium Redis instance.

You can install the buildpack by running this command:

$ heroku buildpacks:add -i 1 heroku/redis

Now, configure your app to use Stunnel. You’ll want to add bin/start-stunnel to the start of your commands in your Procfile, like so:

web: bin/start-stunnel uvicorn ironcoach.asgi:application --limit-max-requests=1200 --port $PORT

You’ll do this for each dyno type you configure. If you have worker dynos that use Channels (or otherwise talk to Redis), you’ll do the same for each of them. One thing that neither the Heroku docs nor the buildpack docs mention is that you no longer want to use the REDIS_URL environment variable to connect to Redis. This is important: if you keep using it, you’ll get a “connection refused” error on startup. In your settings.py file, you’ll want to do something like this:

# Remove this:
FOO = os.environ.get('REDIS_URL')

# And do this:
REDIS_URL = os.environ.get('REDIS_URL_STUNNEL') or \
    os.environ.get('REDIS_URL')
FOO = REDIS_URL
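In my app, the main consumer of this value is the channel layer configuration. That looks roughly like this, assuming channels_redis as the channel layer backend:

CHANNEL_LAYERS = {
    'default': {
        'BACKEND': 'channels_redis.core.RedisChannelLayer',
        'CONFIG': {
            # Prefers the stunnel URL, per the fallback above
            'hosts': [REDIS_URL],
        },
    },
}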

When you deploy your application next, you’ll see this at the top of the build log:

-----> stunnel app detected
-----> Moving the configuration generation script into app/bin
-----> Moving the start-stunnel script into app/bin
-----> stunnel done

Other notes

graphene-subscriptions only supports one group

For a toy project, this is no problem. And in fact, this probably wouldn’t be an issue for most projects that have a few requests per second. However, it poses a serious scalability issue for projects that have a meaningful volume of messages, or projects that have a large number of server instances.

Consider this: if each message is 100 bytes and you have 10 server instances, you’re broadcasting 1kb across the network for every message. If you have 30 server instances, you’re broadcasting 3kb across the network for every message. If your message is 500 bytes and you have 30 servers, each message results in 15kb of network traffic.

Beyond traffic volume, you end up dealing with a great deal of excess compute. Consider a system that processes ten messages per second. If every server instance receives a copy of every message, every instance must process ten messages per second—even if none of the messages are relevant to any active subscriptions. Receiving, parsing, and filtering the messages is expensive, and will quickly use up CPU time.

The first obvious fix is to broadcast each model’s messages through its own group. This means that if a server instance subscribes to ModelFoo changes, it won’t receive broadcasts for changes to instances of ModelBar. However, if you’re building a chat application and store messages with a Messages model, your application hasn’t improved much.

The second obvious fix is to broadcast each type of event to a model through its own group. That is, one group for creation of model instances, one group for updates, and one group for deletion (per model). A server instance subscribing to new ModelFoo instances won’t see deleted ModelFoo events. This is an improvement, but for applications where one type of event is far more common (in our chat example, creation), this still doesn’t scale very well.

The third and complete fix is to broadcast events on a group specific to the nature of the model. In a chat room application, this might mean broadcasting events to a group formatted as message-room:{room_id}. In this case, a server would only receive events for messages sent to the chat rooms that it has connected users participating in: a message sent to #random wouldn’t be broadcast to a server whose users are in #general and #announcements.
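To make that concrete, here’s a hypothetical sketch of broadcasting to a per-room group over the raw channel layer. The function name and message shape are my own, not part of graphene-subscriptions; note also that Channels restricts group names to alphanumerics, hyphens, underscores, and periods, so the sketch uses a dot where the text above uses a colon:

# A hypothetical sketch: broadcast a new Message only to its room's group
from asgiref.sync import async_to_sync
from channels.layers import get_channel_layer

def broadcast_message_created(message):
    channel_layer = get_channel_layer()
    async_to_sync(channel_layer.group_send)(
        'message-room.%s' % message.room_id,
        # 'type' is routed to the matching handler method on each
        # consumer subscribed to this group
        {'type': 'message.created', 'message_id': message.id},
    )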

The first two changes will become the default behavior for graphene-subscriptions in version 2.0.

Additionally, graphene-subscriptions v2 opens the door to defining custom groups, which would allow you to implement the third fix in your own application. Thanks to Jayden Windle for his amazing efforts maintaining this project.

stunnel crashes on startup

INTERNAL ERROR: systemd initialization failed at stunnel.c, line 101

It took a customer support ticket to Heroku to track this down. Two environment variables had been automatically set on my dyno: LISTEN_PID and LISTEN_FDS. The former was causing stunnel to crash. If you encounter this error, just remove these two environment variables.
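If the variables show up in your config vars, removing them is a one-liner with the Heroku CLI:

$ heroku config:unset LISTEN_PID LISTEN_FDS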

What’s next?
