Update: The plugin discussed in this post has been packaged into the delayed_job_heartbeat_plugin gem.
Previously we've blogged about how to write Delayed Job plugins and how to aggregate jobs into job groups. In this post we'll explore how to proactively detect failed Delayed Job workers so their jobs can be retried in a timely manner. This is useful if a worker crashes, is automatically restarted by your platform provider, or is shut down by auto-scaling infrastructure.
The Delayed Job Lock Model
Let's start off with a brief introduction to how Delayed Job implements locking. We'll only consider the Active Record Delayed Job backend, but the Mongoid backend uses a similar scheme. Jobs are stored in a delayed_jobs table that includes a YAML encoding of the object that will do the job's work, the time the job was locked, the name of the worker that locked the job, and some additional metadata like the number of job attempts. When a worker picks up a job, it sets the job's locked_at to the current time and sets the job's locked_by to the worker's name (which should be unique across your pool of workers). Jobs are eligible to be picked up by a worker if they are not locked by another worker or they've been locked by another worker for more than max_run_time seconds.
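To make the eligibility rule concrete, here is a minimal in-memory sketch of it in plain Ruby. It is not Delayed Job's actual query: JobRow and available_to? are hypothetical names used only for illustration, and the four-hour max_run_time in the usage note is just an example value.

```ruby
# Illustrative stand-in for a row in delayed_jobs (only the lock columns).
JobRow = Struct.new(:locked_at, :locked_by, keyword_init: true)

# Returns true if the worker named worker_name may pick up the job.
# max_run_time is in seconds.
def available_to?(job, worker_name, now:, max_run_time:)
  return true if job.locked_by.nil?            # nobody holds the lock
  return true if job.locked_by == worker_name  # the lock is ours, not another worker's

  # Another worker holds the lock: eligible only once the lock has expired.
  now - job.locked_at > max_run_time
end
```

For example, with max_run_time set to 4 * 60 * 60, a job locked ten minutes ago by another worker is not available, even if that worker died nine minutes ago. That gap is what the rest of this post closes.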
At this point you might be wondering why Delayed Job's max_run_time setting isn't sufficient to unlock jobs that have been locked by failed workers. Well, it is, as long as you don't mind waiting that long for jobs to be unlocked. That doesn't work for us since we have max_run_time set pretty high to accommodate some long-running bulk import and export jobs.
The Delayed Job Heartbeat Plugin Overview
There are two parts to how we'll unlock jobs for failed workers:
- A Delayed Job plugin that runs on each worker and periodically updates a database table with heartbeat information
- A reaper process that periodically unlocks jobs locked by workers that haven't updated their heartbeat recently
Now let's dive into some details on each of these components.
The Delayed Job Heartbeat Plugin
Our heartbeat plugin will consist of a few classes:
- Delayed::Heartbeat::WorkerModel - A persistent model with the name and last heartbeat timestamp of each worker. In the future this could be extended to include additional information about workers, like the version of the source code they're running. (Note we chose the name WorkerModel rather than just Worker to avoid confusion with Delayed Job's Worker class.)
- Delayed::Heartbeat::WorkerHeartbeat - Asynchronously updates a worker's heartbeat timestamp.
- Delayed::Heartbeat::Plugin - A Delayed Job plugin that plugs into the worker's lifecycle to start and stop the WorkerHeartbeat.
First we'll need a database migration to create the table for our worker models:
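A migration along these lines would do. Treat it as a sketch: the delayed_workers table name, the column names, and the migration version in brackets are our assumptions here, so adjust them to your schema and Rails version.

```ruby
# Sketch of a Rails migration file (e.g. db/migrate/..._create_delayed_workers.rb).
# Table and column names are assumptions, not a required schema.
class CreateDelayedWorkers < ActiveRecord::Migration[6.1]
  def change
    create_table :delayed_workers do |t|
      t.string :name, null: false                 # the worker's unique name
      t.datetime :last_heartbeat_at, null: false  # when we last heard from it
    end

    # One row per worker name.
    add_index :delayed_workers, :name, unique: true
  end
end
```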
Next let's create the WorkerModel class that provides methods for updating the worker's heartbeat and some ActiveRecord scopes that we'll need for the reaper:
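Here is a rough sketch of what such a model could look like. The delayed_workers table, the last_heartbeat_at column, the heartbeat! method, and the dead scope are all assumed names for illustration.

```ruby
require 'active_record'

module Delayed
  module Heartbeat
    # Sketch only: table, column, method, and scope names are assumptions.
    class WorkerModel < ActiveRecord::Base
      self.table_name = 'delayed_workers'

      # Workers that have not sent a heartbeat within timeout_seconds.
      scope :dead, lambda { |timeout_seconds|
        where('last_heartbeat_at < ?', Time.now.utc - timeout_seconds)
      }

      # Creates the worker's row on its first heartbeat, then keeps
      # refreshing the timestamp on every later call.
      def self.heartbeat!(worker_name)
        worker = find_or_initialize_by(name: worker_name)
        worker.update!(last_heartbeat_at: Time.now.utc)
        worker
      end
    end
  end
end
```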
So far this has all been standard Rails stuff. Now things get a little more interesting with the WorkerHeartbeat that uses a background thread to periodically update the worker's heartbeat:
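The sketch below shows one way to build such a thread in plain Ruby. To keep it runnable without a database, it takes a block to call on each beat instead of touching a model directly; the constructor arguments, the stop and alive? methods, and the 60 second default are our assumptions for illustration.

```ruby
# Sketch of a heartbeat thread with an interruptible sleep.
class WorkerHeartbeat
  def initialize(worker_name, interval_seconds: 60, &on_beat)
    raise ArgumentError, 'a heartbeat block is required' unless on_beat

    @worker_name = worker_name
    @interval_seconds = interval_seconds
    @on_beat = on_beat
    # The thread sleeps in IO.select on the read end of this pipe, so
    # writing a byte to the write end wakes it up immediately.
    @stop_reader, @stop_writer = IO.pipe
    @thread = Thread.new { run }
  end

  def alive?
    @thread.alive?
  end

  # Wakes the thread, waits for it to exit, then closes the pipe.
  # Calling stop a second time is a no-op.
  def stop
    return if @stop_writer.closed?

    @stop_writer.write_nonblock('.')
    @thread.join
    @stop_reader.close
    @stop_writer.close
  end

  private

  def run
    loop do
      beat
      break if stop_requested_within?(@interval_seconds)
    end
  end

  # A failed beat is reported but does not kill the thread.
  def beat
    @on_beat.call(@worker_name)
  rescue StandardError => e
    warn "heartbeat for #{@worker_name} failed: #{e.message}"
  end

  # Sleeps up to `seconds`; returns true if stop was requested meanwhile.
  def stop_requested_within?(seconds)
    readable, = IO.select([@stop_reader], nil, nil, seconds)
    !readable.nil?
  end
end
```

In a worker, the block would be where the heartbeat timestamp gets written to the database, for example WorkerHeartbeat.new(name) { |n| record_heartbeat(n) }, where record_heartbeat stands for whatever updates your worker table.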
Some of the code for telling the heartbeat thread to shut down might look a bit funky, but it's just using the self-pipe trick to perform an interruptible sleep.
We've done all the heavy lifting required to implement the plugin. Now let's plug into the worker's lifecycle to start/stop the heartbeat:
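A plugin for this can be quite small. The sketch below hooks Delayed Job's execute lifecycle event, which wraps the worker's run loop. It assumes a WorkerHeartbeat class whose constructor takes the worker's name and starts the background thread, and which responds to stop; adapt those calls to your own class.

```ruby
require 'delayed_job'

module Delayed
  module Heartbeat
    class Plugin < Delayed::Plugin
      callbacks do |lifecycle|
        # Runs once, just before the worker enters its run loop.
        lifecycle.before(:execute) do |worker|
          # Assumed interface: new(name) starts the heartbeat thread.
          @heartbeat = WorkerHeartbeat.new(worker.name)
        end

        # Runs once, after the worker leaves its run loop.
        lifecycle.after(:execute) do |_worker|
          @heartbeat&.stop
          @heartbeat = nil
        end
      end
    end
  end
end
```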
Finally let's register the plugin with Delayed Job in an appropriate initializer:
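Registration is a one-liner that appends the plugin class to Delayed::Worker.plugins. The initializer file name below is just a suggestion.

```ruby
# config/initializers/delayed_job.rb (file name is a suggestion)
Delayed::Worker.plugins << Delayed::Heartbeat::Plugin
```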
The Reaper
Now that we have a Delayed Job plugin that periodically updates a worker's heartbeat, we can unlock jobs that have been locked by workers that we haven't heard from recently:
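The core of the reaper is easiest to see on plain Ruby objects. The sketch below is an in-memory illustration rather than database code: WorkerRow, LockedJob, unlock_orphaned_jobs, and the heartbeat_timeout parameter are hypothetical names.

```ruby
# Illustrative stand-ins for rows in the worker and job tables.
WorkerRow = Struct.new(:name, :last_heartbeat_at, keyword_init: true)
LockedJob = Struct.new(:id, :locked_at, :locked_by, keyword_init: true)

# Clears the lock on every job held by a worker whose last heartbeat is
# older than heartbeat_timeout seconds. Returns the jobs it unlocked.
def unlock_orphaned_jobs(workers, jobs, now:, heartbeat_timeout:)
  dead_names = workers
               .select { |w| now - w.last_heartbeat_at > heartbeat_timeout }
               .map(&:name)

  orphaned = jobs.select { |job| dead_names.include?(job.locked_by) }
  orphaned.each do |job|
    job.locked_at = nil # clearing both lock columns makes the job
    job.locked_by = nil # eligible for any worker again
  end
  orphaned
end
```

With the Active Record backend, the same step amounts to setting locked_at and locked_by to NULL on the delayed_jobs rows whose locked_by matches a worker with a stale heartbeat.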
We're running on Heroku, so we've configured a clockwork process to periodically unlock orphaned jobs:
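A clockwork file for this could look roughly like the sketch below. The one minute period and the job name are arbitrary choices, and Delayed::Heartbeat.unlock_orphaned_jobs is a hypothetical entry point standing in for however you invoke your reaper.

```ruby
# clock.rb (sketch). Load your application environment before this
# point so the reaper code is available.
require 'clockwork'

module Clockwork
  # Period is in seconds; the job name is only a label.
  every(60, 'delayed_job.unlock_orphaned_jobs') do
    # Hypothetical entry point: replace with your reaper call.
    Delayed::Heartbeat.unlock_orphaned_jobs
  end
end
```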
You should be able to do something similar with your favorite scheduler.
Finally, update your application.rb with the appropriate configuration for the heartbeat plugin and reaper process:
That's it! We can now unlock orphaned jobs in a few minutes rather than waiting for a job's maximum run time to elapse.
Other Solutions to Detecting Failed Delayed Job Workers?
Have you had similar problems with retrying jobs locked by failed Delayed Job workers? We'd love to hear how you solved them and what you think about our solution.