Django Channels, ASGI, and Heroku
This past week, to satisfy my urge to build something new, I’ve started putting together a project with a real-time chat component, which means websockets. In a server-side JavaScript project, this is not something I’d think twice about: setting up a project with ws or another similar library is easy. But I love Django’s ORM and I’ve got a pile of useful Python from Pinecast, and Django 3 was recently released with ASGI support, so I decided to play with Django Channels for the first time.
Channels is a project from Django that allows you to build applications which interact with the consumer over protocols other than HTTP. Simply put, it’s a first-party way to continue to use Django for your project, but expose services over protocols like websockets, server-sent events, or other APIs. Folks outside of the Python world do this regularly, and folks within the Python world are busy writing crazy tools with Tornado or Twisted or even raw sockets. Until Channels came out, building asynchronous apps on the web in Python meant giving up many of the niceties that popular frameworks like Django offered.
Heroku is — for better or worse — my choice of host for new projects. I’ve recently bemoaned Heroku’s lack of public-facing product improvements (thanks, Salesforce), and frankly, the lack of any meaningful developer-facing changes is disheartening. But Heroku remains the easiest way — in my opinion — to get a new project off the ground and scale it without breaking the bank (or going insane with AWS boilerplate, or betting on a new startup).
In 2016, Heroku blogged about Django Channels. They even released some sample code (now archived). However, that was built for a very old version of Django and an old version of Channels, and quite honestly, example code that old and unmaintained is almost certainly a bad idea to rely on in production.
The following is a walkthrough of what I did to get things up and running.
Getting it up and running
The first thing to do was get the service up and running. I started my project on Django 3.0.3 and Channels 2.4.0. The most important first step is getting things running locally, and this is straightforward.
Locally, I’m using a sqlite3 database, so aside from the obvious steps to get Channels installed, simply following the latest installation guide got python manage.py runserver doing what it was supposed to be doing without any fuss.
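For Channels 2.x, “following the installation guide” mostly boils down to two settings. Here’s a minimal sketch, assuming a project named yourapp with a routing.py:

# settings.py
INSTALLED_APPS = [
    'channels',
    # ...the rest of your apps
]

# Point Channels at the routing application defined in routing.py
ASGI_APPLICATION = 'yourapp.routing.application'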
The next step was getting the project running on Heroku. This is more challenging.
Deploying to Heroku
The article from 2016 recommends using Daphne. Daphne is an HTTP/websocket protocol server for ASGI, maintained by the Django Project. Though it’s seemingly well-maintained, numerous posts online indicate that it’s not very fast (taking seconds to load a route that would otherwise load almost instantly under gunicorn/WSGI).
Some basic research indicates that there are three choices: Daphne, Uvicorn, and Hypercorn (beta). Uvicorn is the only non-Daphne “stable” choice, and it’s built on uvloop, a wrapper around the same library that powers Node’s event loop (libuv), so I’m reassured that it’s probably fast.
The biggest change in switching from WSGI to ASGI (gunicorn to uvicorn) is to install the latter in your Pipfile and change your Procfile to use the new server; a sample Procfile line follows the notes below.
Two things stuck out:

- uvicorn doesn’t pull from $PORT. You must specify --port $PORT or it won’t boot. I specified --host 0.0.0.0 while I was at it just in case, but it might be required.
- uvicorn doesn’t have an equivalent to --max-requests-jitter in gunicorn. --max-requests in gunicorn allows you to restart a worker after some number of requests, and this is available in uvicorn as --limit-max-requests. Without the jitter option, though, you potentially set yourself up for thundering herds of server reboots. That’s not a problem for a project with no traffic (like mine), but I couldn’t, for instance, deploy it to Pinecast in its current state. This is perhaps a good open source contribution that I could make in the coming weeks.
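Putting those flags together, the Procfile line ends up looking something like this (yourapp is a placeholder for your project’s ASGI module):

web: uvicorn yourapp.asgi:application --host 0.0.0.0 --port $PORT --limit-max-requests=1200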
Deploying this change allowed my server to boot and start serving requests.
Making it run
I have been very frugal so far and have stuck around on the Hobby tier to get this project built before I start getting beta users into it. Being on this tier allowed me to catch an issue early: after I logged into the Django admin panel and started adding data, my server mysteriously started 500ing. Running heroku logs --tail showed an error ending with:
app[web.1]: File "/app/.heroku/python/lib/python3.7/site-packages/django/utils/asyncio.py", line 26, in inner
app[web.1]: return func(*args, **kwargs)
app[web.1]: File "/app/.heroku/python/lib/python3.7/site-packages/django/db/backends/postgresql/base.py", line 185, in get_new_connection
app[web.1]: connection = Database.connect(**conn_params)
app[web.1]: File "/app/.heroku/python/lib/python3.7/site-packages/psycopg2/__init__.py", line 130, in connect
app[web.1]: conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
app[web.1]: django.db.utils.OperationalError: FATAL: too many connections for role "gihouwrmsatwpc"
Running heroku pg:info showed that the database had exceeded 20 connections, which is the limit for a hobby-tier database.
Under WSGI, you deploy some number of Heroku dynos and specify WEB_CONCURRENCY to choose how many instances of your app each dyno runs. I’ve had WEB_CONCURRENCY set to 3, which is a very reasonable number for a small app with no traffic. With one dyno, that should come out to 3 database connections.
That’s not true for Django Channels. The async event loop creates threads, and the number of threads is equal to the number of CPUs your server has, times five. Each thread creates its own database connection. On my cheap Heroku dyno:
>>> import multiprocessing
>>> multiprocessing.cpu_count()
8
Yikes, that means up to 40 database connections (potentially up to 120, if that number is multiplied by WEB_CONCURRENCY). Rebooting the dynos and mashing F5 in the browser showed the number of connections shooting up and quickly exceeding the 20-connection limit.
To limit this, you need to set the ASGI_THREADS environment variable, which tells Channels to limit the number of threads. I set mine to five, which seems sufficient for now. In the future, it’ll be worthwhile to revisit this and install the pgbouncer buildpack.
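On Heroku, that’s a single config var:

$ heroku config:set ASGI_THREADS=5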
Realtime
To make the app actually take advantage of the ASGI + Channels setup, I installed Graphene Subscriptions to go along with Graphene Django. With relatively little boilerplate, Graphene Subscriptions allows you to listen to Django signals for model save (create or update) and delete. For instance, the following code allows you to create a GraphQL subscription for changes to any of a Room’s Messages in a chat application:
import graphene
from graphene_subscriptions.events import CREATED

from ..models import Message  # Our Django model for messages
from .message_type import MessageType  # Our Graphene Message type

# Our GraphQL subscription class
class Subscription(graphene.ObjectType):
    # Define what the user can subscribe to in our schema
    message_created = graphene.Field(
        graphene.NonNull(MessageType),
        room=graphene.Argument(graphene.ID, required=True),
    )

    def resolve_message_created(root, info, **kw):
        def filter(event):
            return (
                event.operation == CREATED and
                isinstance(event.instance, Message) and
                event.instance.room_id == kw.get('room')
            )
        return root.filter(filter).map(lambda e: e.instance)
You’ll want to follow the installation instructions for information on how to update your settings.py and routing.py files, add signal handlers, and link the Subscription class (above) to your Schema instance.
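As a rough sketch of the signal handler step (defer to the library’s installation guide for the authoritative version; the dispatch_uid values here are arbitrary), assuming we only care about Message events:

# signals.py, imported from an AppConfig.ready() hook
from django.db.models.signals import post_save, post_delete
from graphene_subscriptions.signals import (
    post_save_subscription,
    post_delete_subscription,
)

from .models import Message

# Forward Message saves and deletes into the channel layer
post_save.connect(post_save_subscription, sender=Message,
                  dispatch_uid='message_post_save')
post_delete.connect(post_delete_subscription, sender=Message,
                    dispatch_uid='message_post_delete')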
The Subscription class is defined as a Graphene object type, just like you’d do for a plain old query. The resolve_* methods, rather than receiving an object for root, receive an observable that emits events from signals generated in other instances of your application. So, for instance, if I save a new Message on one server, the act of saving that Message triggers a Django signal. Graphene Subscriptions puts an event object into a channel, which passes through the channel layer and pops out on the other server.
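The channel layer is whatever you’ve configured in settings.py. For a setup like mine, that’s the Redis-backed layer from channels_redis; a typical configuration (assumed here, not copied verbatim from my project) looks roughly like:

import os

CHANNEL_LAYERS = {
    'default': {
        'BACKEND': 'channels_redis.core.RedisChannelLayer',
        'CONFIG': {
            # On Heroku, REDIS_URL is set by the Redis add-on
            'hosts': [os.environ.get('REDIS_URL', 'redis://localhost:6379')],
        },
    },
}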
Meanwhile, our user should have established a GraphQL subscription. For instance:
subscription MessageCreated($room: ID!) {
messageCreated(room: $room) {
id
created
messageText
sender {
id
name
avatar(size: 32)
}
}
}
When the subscription is initiated, the resolve_message_created method fires. Graphene Subscriptions subscribes to the channel that receives events about new Messages. root is the observable that iterates over the events as they come in from the channel.
An observable behaves a lot like an array: you can map or filter the data, but the data isn’t readily available yet. Imagine a Node event emitter whose events you could filter, and map to transform their bodies (just like you could with an array): you don’t know when the events will fire, but when they do, you can perform some actions against them.
In our case, we use filter to discard events that aren’t object creations, that aren’t creations of Messages, and that aren’t for Messages created in the chat room we’re subscribed to. Note that in a production application, you’d want to check that info.context.user has access to the room in question, and perhaps periodically re-check that the user still has access: if the user was kicked or blocked, their subscription won’t automatically close:
def resolve_message_created(root, info, **kw):
    def filter(event):
        return (
            event.operation == CREATED and
            isinstance(event.instance, Message) and
            event.instance.room_id == kw.get('room') and
            # nice and safe
            check_user_has_access_to_room(kw.get('room'))
        )
    return root.filter(filter).map(lambda e: e.instance)
Making it work with Apollo
Getting this up and running with Apollo wasn’t very hard. First, our ApolloClient needs to be updated with a WebSocketLink to our back end. This teaches Apollo how to connect a web socket to our server. After installing the necessary dependencies, defining the link for this was simple:
const wsLink = new WebSocketLink({
uri: `${window.location.protocol === 'https:' ? 'wss' : 'ws'}://${
window.location.host
}/graphql`,
options: {
reconnect: true,
},
});
Then, to use the link, we need to tell Apollo how to decide whether to use HTTP or the web socket. For that, we use the split method from apollo-link.
import {split} from 'apollo-link';
import {getMainDefinition} from 'apollo-utilities';

const link = split(
({query}) => {
const definition = getMainDefinition(query);
return (
definition.kind === 'OperationDefinition' &&
definition.operation === 'subscription'
);
},
wsLink, // Our new web socket link
httpLink, // Our existing HTTP link
);
This code looks at the operations our application is making. If the operation is a subscription, it uses the web socket link. If it’s not, it uses the old-fashioned HTTP link.
In development, this worked great. As soon as I created a subscription, messages arrived the moment they were created. Magical!
In production, this didn’t work. The production instances immediately started flapping when our client attempted to establish a web socket connection:
ValueError: Django can only handle ASGI/HTTP connections, not websocket.
We also need to make a change to our Django application: our asgi.py file needs to learn about Channels, otherwise Channels can’t handle web sockets. Remember, Django can’t handle non-HTTP requests out of the box. To do that, we just need to make a few changes:
diff --git a/yourapp/asgi.py b/yourapp/asgi.py
index eb61b00..19e073e 100644
--- a/yourapp/asgi.py
+++ b/yourapp/asgi.py
@@ -9,8 +9,9 @@
 import os

-from django.core.asgi import get_asgi_application
+import django
+from channels.routing import get_default_application

 os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'yourapp.settings')

-
-application = get_asgi_application()
+django.setup()
+application = get_default_application()
Essentially, we’re swapping out Django’s get_asgi_application method for Channels’ get_default_application method. We’re also calling django.setup(), which get_asgi_application would have done for us.
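After the change, the entire asgi.py is only a handful of lines:

import os

import django
from channels.routing import get_default_application

os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'yourapp.settings')

django.setup()
application = get_default_application()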
Making it work on Heroku
I managed to get the application up and running on Heroku without much fuss. Be sure to enable web sockets in Cloudflare if you’re using it!
After the application had been alive for tens of minutes, messages in my application started to fail to send. I received some spooky tracebacks in my Heroku logs ([...] is where I snipped some wordy stack trace details):
ERROR: An error occurred while resolving field Mutations.createMessage
Traceback (most recent call last):
  File "[...]/site-packages/graphql/execution/executor.py", line 452, in resolve_or_error
    return executor.execute(resolve_fn, source, info, **args)
[...]
  File "/app/messaging/schema/mutations/create_message.py", line 68, in mutate
    message.save()
[...]
  File "[...]/site-packages/aioredis/connection.py", line 322, in execute
    raise ConnectionClosedError(msg)
aioredis.errors.ConnectionClosedError: Reader at end of file
Weird! The application consistently works when it’s first deployed, then fails a short time later.
The first thing I noticed were some nearby log entries indicating that I was over the Redis connection limit for the free tier. Oops!
After upgrading to a Premium instance, I saw a very similar problem, but it took longer for the service to start failing. Double spooky!
I did some googling and came across this GitHub issue on the Redis channel layer. The channel layer, remember, is the software that distributes messages to Channels consumers on different hosts. If I send a chat message on one server, I expect it to be received by clients connected to all of my other servers, not just the one the message was sent from.
The comments indicate a few things:
- The problem seems to be resolved by upgrading to the latest Redis channel layer. I was already on the latest version, so this was not the issue.
- The problem seems to be mostly unique to Heroku (though some folks have experienced it locally), which indicates a configuration issue or bug in Heroku’s Redis implementation.
- Folks have reported success when switching from Heroku’s first-party Redis offering to Redis Cloud (another Heroku add-on). Some folks reported intermittent issues even after moving to Redis Cloud, though.
The final comment (at the time of writing, from Oct 2019) suggests that the Redis idle timeout might be the culprit here. This would explain why the connection is active initially, but fails after a short time.
On Heroku, the idle timeout defaults to about five minutes. This is easily changed. Disabling the idle timeout altogether is as easy as running
$ heroku redis:timeout [your-redis-instance-0000] --seconds=0
Timeout for [your-redis-instance-0000] (REDIS_URL) set to 0 seconds.
Connections to the Redis instance can idle indefinitely.
Making it production-ready
The other part of this process is making sure the application is ready for production users! Having it work is only half of the battle.
The first big step is securing the Redis instance. Heroku recommends using the Stunnel buildpack for this. Note that this is not available for hobby-dev instances, so be sure to upgrade to a Premium Redis instance.
You can install the buildpack by running this command:
$ heroku buildpacks:add -i 1 heroku/redis
Now, configure your app to use Stunnel. You’ll want to add bin/start-stunnel to the start of your commands in your Procfile, like so:
web: bin/start-stunnel uvicorn ironcoach.asgi:application --limit-max-requests=1200 --port $PORT
You’ll do this for each dyno type you configure. If you have worker dynos that use Channels (or otherwise talk to Redis), you’ll do the same for each of them. One thing that the Heroku docs (and the buildpack docs) don’t mention is that you no longer want to use the REDIS_URL environment variable to connect to Redis. This is important: if you don’t make this change, you’ll get a “connection refused” error on startup. In your settings.py file, you’ll want to do something like this:
# Remove this:
FOO = os.environ.get('REDIS_URL')

# And do this:
REDIS_URL = os.environ.get('REDIS_URL_STUNNEL') or \
    os.environ.get('REDIS_URL')
FOO = REDIS_URL
When you deploy your application next, you’ll see this at the top of the build log:
-----> stunnel app detected
-----> Moving the configuration generation script into app/bin
-----> Moving the start-stunnel script into app/bin
-----> stunnel done
Other notes
While setting things up, I encountered some problems and other hiccups along the way.
graphene-subscriptions only supports one group
I found that graphene-subscriptions was using the same Channels group for every message being broadcast. What this means is that every time a registered model is created, updated, or deleted (the events that you set up signals for), the event is broadcast to every other server instance listening for messages.
For a toy project, this is no problem. And in fact, this probably wouldn’t be an issue for most projects that have a few requests per second. However, it poses a serious scalability issue for projects that have a meaningful volume of messages, or projects that have a large number of server instances.
Consider this: if each message is 100 bytes and you have 10 server instances, you’re broadcasting 1kb across the network for every message. If you have 30 server instances, you’re broadcasting 3kb across the network for every message. If your message is 500 bytes and you have 30 servers, each message results in 15kb of network traffic.
Beyond traffic volume, you end up dealing with a great deal of excess compute. Consider a system that processes ten messages per second. If every server instance receives a copy of every message, every instance must process ten messages per second—even if none of the messages are relevant to any active subscriptions. Receiving, parsing, and filtering the messages is expensive, and will quickly use up CPU time.
The first obvious fix is to broadcast each model’s messages through its own group. This means that if a server instance subscribes to ModelFoo changes, it won’t receive broadcasts for changes to instances of ModelBar. However, if you’re building a chat application and store messages with a Message model, your application hasn’t improved much.
The second obvious fix is to broadcast each type of event for a model through its own group. That is, one group for creation of model instances, one group for updates, and one group for deletion (per model). A server instance subscribing to new ModelFoo instances won’t see deleted ModelFoo events. This is an improvement, but for applications where one type of event is far more common (in our chat example, creation), this still doesn’t scale very well.
The third and complete fix is to broadcast events on a group specific to the nature of the model. In a chat room application, this might mean broadcasting events to a group formatted as message-room:{room_id}. In this case, a server would only receive events for messages sent to the chat rooms that its connected users are participating in: a message sent to #random wouldn’t be broadcast to a server whose users are in #general and #announcements.
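To sketch the idea in plain Channels terms (this is illustrative, not the graphene-subscriptions API; and since Channels doesn’t permit colons in group names, I’ve used a hyphen instead):

from asgiref.sync import async_to_sync
from channels.layers import get_channel_layer

def broadcast_message_created(message):
    # Publish only to the group for this message's room, so servers
    # with no subscribers in that room never see the event.
    channel_layer = get_channel_layer()
    async_to_sync(channel_layer.group_send)(
        'message-room-{}'.format(message.room_id),
        # Hypothetical payload; a consumer would re-fetch the instance
        # by primary key when it receives the event.
        {'type': 'message.created', 'pk': message.pk},
    )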
The first two changes will become the default behavior for graphene-subscriptions in version 2.0.
Additionally, graphene-subscriptions v2 opens the door to defining custom groups, which would allow you to implement the third fix in your own application. Thanks to Jayden Windle for his amazing efforts maintaining this project.
stunnel crashes on startup
For an unknown reason, stunnel was failing to start when my dynos restarted, with a spooky error:
INTERNAL ERROR: systemd initialization failed at stunnel.c, line 101
It took a customer support ticket to Heroku to track this down. Two environment variables had been automatically set on my dyno: LISTEN_PID and LISTEN_FDS. The former was causing stunnel to crash. If you encounter this error, just remove these two environment variables.
What’s next?
I hope you found this useful! If you enjoyed it, please let me know what you’d like to see me write about in the future.