The load increase of a Mastodon server
I help administer the Mastodon server tooting.ch, which is maintained by the non-profit FairSocialNet. In this context, after the acquisition of a social network whose logo was a blue bird and the first wave of chaos unleashed on that platform, I had to manage the allocation of resources to the different underlying components, whose requirements had inevitably increased.
This article won't go into the details of what Mastodon is, or of the global network, the Fediverse, also called "the federation". There are plenty of articles and videos on the web that explain these two concepts.
This article was written around one year after the events. Some fine details are omitted, either to keep the article simple or because I don't recall exactly how things went down.
Mastodon's software stack includes the following components:
- A web server to answer HTTP queries from users and servers
- A Websocket server to send posts in real time to the connected users
- A Redis server (external service) for caching and for holding the queues of the next component
- A piece of software managing queues to run background tasks
The last component is crucial to the good working condition of the service. It sends posts to remote servers, finishes receiving remote posts, finalizes image uploads, translates posts on demand and generates preview cards for links. In other words, this component does a lot.
Users feel an overload of this component more or less directly, for example when file uploads take an eternity to finalize.
The calm before the storm
Tooting was a small-scale service (about a hundred users, I think?) compared to its current size. Each software component was running its "default" configuration, which was sufficient back then. The storage was spacious enough to hold the remote media and the PostgreSQL data.
Since sign-ups go through manual validation by administrators and moderators, to stop spam at the source, they have a rough idea of what is coming and whether more resources should be planned. The number of sign-ups had already jumped a few times in response to what was happening in another realm of the Internet. However, it was nothing to write home about.
The weather is getting dark, windy and cloudy
When the purchase of Twitter was concluded, many people rushed to the servers of the global network, creating a more or less brutal load increase for everybody. Some servers, reaching capacity for technical or sizing reasons, had to close open sign-ups or limit them to invitations.
To give an idea of the load increase on Tooting: there used to be around 20 new users per month; during the wave, about 100 users per day were creating accounts.
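To put those two figures on the same scale, here is a quick back-of-the-envelope calculation. It assumes a 30-day month, which is a simplification of mine, and it uses the approximate numbers quoted above rather than exact statistics.

```python
# Rough comparison of the sign-up rate before and during the wave.
DAYS_PER_MONTH = 30  # assumption: an average month

before_per_month = 20  # "around 20 new users per month"
during_per_day = 100   # "about 100 users per day"

# Convert the daily rate to a monthly one so both figures share a unit.
during_per_month = during_per_day * DAYS_PER_MONTH
growth_factor = during_per_month / before_per_month

print(f"{during_per_month} sign-ups per month, "
      f"about {growth_factor:.0f} times the usual rate")
```

In other words, the wave brought roughly two orders of magnitude more sign-ups than usual, and every one of them triggers background work such as the confirmation email.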
The new owner then began making modifications to his newly acquired platform, which eventually resulted in the site we know today. These changes helped amplify the migration wave already sweeping across the Fediverse. For Tooting, the interesting problems were just starting to appear.
The storm; the river wanders out of its riverbed
Initially, the consequence was an increase in the volume of emails to send, a volume that blew through the quota of the mail hosting service. The SMTP[1] server rejected the emails, and each sending task went back into the queue of things to do. To accommodate the exploding volume, a request to increase the quota was sent to Infomaniak and was eventually granted. While waiting for the answer, there was not much to do about emails: disable the superfluous moderation and administration emails, and wait. At a pace of 100 users per day, this at least let the mail inboxes breathe.
In a nutshell: you register on Tooting, then you can go have a coffee or run your errands; the email address confirmation will arrive by physical mail!
Emails struggling to go out were not the only consequence of an overloaded queue. Do you remember this sentence fragment?
The last component [of Mastodon's software stack] is crucial to the good working condition of the service
It so happens that the counter of pending tasks was breaking records and kept on increasing. Sidekiq's default configuration (I'll explain that later!) was no longer enough. Here are a few issues encountered with a queue that wouldn't stop growing:
- Federated posts arrived "late"
- Relaying of local posts was delayed
- Sending images took considerably longer. Technically, the image or attachment was already on the server; the task representing its processing was just at the bottom of the to-do list.
- The generation of link preview cards was delayed
- Emails, which need no further explanation!
The list is not exhaustive, of course.
I could waffle on about the various issues that were cropping up on Tooting, but you get the picture: the river had left its bed and was causing flooding a bit everywhere.
How Sidekiq, the queue manager, works in a nutshell
Before we continue, it is worth knowing how the queue manager, named Sidekiq, works. When the administrator starts a Sidekiq process, the main process creates as many secondary tasks (also called threads) as configured. The main process is in charge of receiving the tasks to perform from an external service (Redis) and distributing them to the threads. The threads do the work.
Once the requested action has been executed without errors, a counter is incremented and the thread takes the next job or waits. In case of problems, the task is put in another queue, the "Retries" one, and will be reattempted later. When a task has failed too many times, it ends up in the "Dead" queue and won't be attempted again unless an administrator manually puts it back in the active queues.
Therefore, it is in the administrator's interest to keep the waiting queue short so that the platform stays responsive. Otherwise, the various symptoms I described above will appear.
Sidekiq can also execute scheduled tasks, but that doesn't matter much in this context.
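The flow described in this section (a pool of threads, retries on failure, a "Dead" list after too many failures) can be sketched in a few lines of Python. This is only a toy model and not Sidekiq's code, which is written in Ruby: the names `run_pool` and `max_attempts` are invented for the illustration, the limit of 3 attempts is arbitrary, and failed jobs go straight back into the same queue instead of waiting in a separate retry list.

```python
import queue
import threading


def run_pool(jobs, thread_count, max_attempts=3):
    """Run callables on worker threads; return (processed_count, dead_jobs)."""
    pending = queue.Queue()
    for job in jobs:
        pending.put((job, 1))  # (callable, attempt number)

    processed = 0
    dead = []
    lock = threading.Lock()

    def worker():
        nonlocal processed
        while True:
            item = pending.get()
            if item is None:  # sentinel: no more work, stop this thread
                pending.task_done()
                return
            job, attempt = item
            try:
                job()
                with lock:
                    processed += 1  # the "processed" counter
            except Exception:
                if attempt < max_attempts:
                    pending.put((job, attempt + 1))  # try again later
                else:
                    with lock:
                        dead.append(job)  # too many failures: "Dead"
            finally:
                pending.task_done()

    threads = [threading.Thread(target=worker) for _ in range(thread_count)]
    for thread in threads:
        thread.start()
    pending.join()  # wait until every job succeeded or died
    for _ in threads:
        pending.put(None)
    for thread in threads:
        thread.join()
    return processed, dead


def send_email():
    raise RuntimeError("quota exceeded")  # a job that always fails


def build_preview_card():
    pass  # a job that succeeds


processed, dead = run_pool([send_email, build_preview_card], thread_count=4)
print(processed, len(dead))
```

The model also shows why a rejected email hurts twice: each failed attempt occupies a thread for nothing, and the job comes back to lengthen the queue.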
Managing the load increase
The priority was to shorten the queue in Sidekiq. It could take as long as needed, but the task counter had to go down.
The first idea: substantially increase the number of worker threads and keep a single main process. In theory it's a good idea; in practice, the conductor finds itself fully occupied and unable to keep all the threads busy. The other solution was to start multiple main Sidekiq processes, each with a limited number of threads.
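Whichever layout is chosen, the queue only shrinks when tasks are processed faster than new ones arrive. Here is a rough way to reason about it. The function `drain_time_seconds` and all the rates below are hypothetical; none of them were measured on Tooting. The model also assumes every thread is kept busy, which is exactly what a single overloaded main process could not guarantee.

```python
import math


def drain_time_seconds(backlog, arrival_rate, workers, jobs_per_worker_per_sec):
    """Estimate how long a backlog takes to empty, in seconds.

    Returns math.inf when jobs arrive at least as fast as they are processed.
    """
    service_rate = workers * jobs_per_worker_per_sec
    net_rate = service_rate - arrival_rate
    if net_rate <= 0:
        return math.inf  # the queue keeps growing (or stays flat)
    return backlog / net_rate


# Hypothetical figures: 100,000 pending jobs, 40 new jobs per second,
# each thread handling 2 jobs per second.
for workers in (10, 25, 60):
    seconds = drain_time_seconds(100_000, 40, workers, 2)
    print(f"{workers} threads: {seconds} s")
```

With these made-up numbers, 10 threads never catch up, 25 threads need close to three hours, and 60 threads need about twenty minutes. Adding threads pays off only once the total processing rate clearly exceeds the arrival rate.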
The other components didn't need particular attention; they seemed to cope with what was happening without external intervention. Their default configuration was mostly kept as is.
During the "purge" of Sidekiq's to-do list, Tooting was… usable? What I mean is that pages loaded properly, and we could read and send posts. Sending posts to remote servers was another story, of course, and the same goes for receiving remote posts. Attaching a file to a post was mission impossible, for the reasons I described above. At one point, I did force things a bit to shorten Sidekiq's to-do list by releasing the kraken!
A new cruising pace
As the mass migration was winding down, the resources allocated to Tooting were increased: CPU, RAM, storage space. The Sidekiq workers were configured as follows: 4 processes with 15 threads each, for the magical total of 60. This is enough to process new tasks and to quickly absorb small bursts of extra work when they happen.
The systemd service configuration was slightly tweaked so that one can start a group of 15 Sidekiq threads "on demand"; it is a templated systemd service.
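To illustrate that layout, the sketch below builds the command line for each of the four instances. The unit name `mastodon-sidekiq@N.service` is a hypothetical name for the templated service, since the real one is not given here. `sidekiq -c` is Sidekiq's concurrency option, and `DB_POOL` is the environment variable Mastodon uses to size the database connection pool of a process; setting it to the thread count is an assumption about how these instances are configured.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SidekiqInstance:
    unit: str      # hypothetical systemd instance name
    env: dict      # environment variables for the process
    command: list  # command line that starts the process


def plan(processes=4, threads=15):
    """Describe `processes` Sidekiq processes with `threads` threads each."""
    instances = []
    for index in range(1, processes + 1):
        instances.append(SidekiqInstance(
            unit=f"mastodon-sidekiq@{index}.service",
            env={"DB_POOL": str(threads)},  # one DB connection per thread
            command=["bundle", "exec", "sidekiq", "-c", str(threads)],
        ))
    return instances


instances = plan()
total_threads = sum(int(instance.env["DB_POOL"]) for instance in instances)

for instance in instances:
    print(instance.unit, "->", " ".join(instance.command))
print("total threads:", total_threads)
```

One consequence worth keeping in mind: if every thread can hold a database connection, PostgreSQL has to accept at least those 60 connections in addition to the ones used by the web server.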
PostgreSQL, the database server, didn't see much change, and neither did the HTTP[2] server. The current configuration seems sufficient to respond to inbound HTTP queries, as far as I'm aware.
In the end, Tooting "was hit" but was able to withstand it, considering it could have been hit more heavily. This migration wave is behind us, and we're ready to manage another one, be it a bit larger or smaller.
[1] SMTP: the protocol used to send and transport emails to their recipients; it's the internet's postal service!
[2] HTTP: the protocol of the web. It lets you view web pages from your favorite internet sites in your web browser. By the way, this protocol was used to request the page you're currently reading from the web server (without going into the details).