The two main strategies for monitoring scheduled work look superficially similar but produce very different operational outcomes. Heartbeat monitoring waits for a signal from the job and alarms on absence. Scheduled checks run a separate process on a separate schedule and verify that the job's work happened. Most teams pick one and stop. The teams that get this right pick both, deliberately, for different reasons.
This is the working comparison and the case for using each.
What heartbeat monitoring catches
Heartbeat monitoring is the simpler of the two. The job, on completion, sends a small signal to a monitoring service. The monitoring service knows the expected schedule and alarms when a signal does not arrive within the expected window.
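In code, the monitor side reduces to a timestamp comparison. Here is a minimal sketch, assuming heartbeats are recorded as timestamps somewhere the monitor can read; the function and its parameters are illustrative, not a real service's API.

```python
from datetime import datetime, timedelta, timezone

def heartbeat_overdue(last_seen: datetime, expected_interval: timedelta,
                      grace: timedelta) -> bool:
    """Return True when the job should alarm: no heartbeat has arrived
    within the expected window plus a grace period for normal jitter."""
    deadline = last_seen + expected_interval + grace
    return datetime.now(timezone.utc) > deadline

# A nightly job that last checked in 26 hours ago is overdue
overdue = heartbeat_overdue(
    last_seen=datetime.now(timezone.utc) - timedelta(hours=26),
    expected_interval=timedelta(hours=24),
    grace=timedelta(minutes=30),
)
# overdue is True, so the monitor pages
```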
This catches the failure modes where the job did not run, did not complete, or did not check in. The scheduler died, the container could not start, the credential expired, the script crashed before the heartbeat send. Each of these produces silence, and silence is the heartbeat alarm.
What heartbeats do not catch is the case where the job ran, completed normally, and produced no useful work. The heartbeat is sent because the job thinks it succeeded. The downstream system is starving anyway. This is the failure mode that requires a different approach.
What scheduled checks catch
A scheduled check is a separate process that runs on its own cadence and verifies the world. The nightly invoice job is supposed to produce invoices. A scheduled check, run a few hours after the invoice job, queries the invoice store and confirms that today's expected invoices exist. If the check finds that invoices are missing, it alarms.
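A minimal sketch of such a check, assuming the invoice store is reachable over SQL; the table name, column name, and alert() integration are hypothetical stand-ins.

```python
import sqlite3
from datetime import date

def alert(message: str) -> None:
    # Stand-in for the team's real paging integration
    print(f"ALERT: {message}")

def check_todays_invoices(db_path: str, expected_minimum: int) -> None:
    """Outcome check: confirm today's invoices actually exist.

    Runs on its own schedule, a few hours after the invoice job.
    Table and column names are hypothetical.
    """
    conn = sqlite3.connect(db_path)
    try:
        (count,) = conn.execute(
            "SELECT COUNT(*) FROM invoices WHERE created_date = ?",
            (date.today().isoformat(),),
        ).fetchone()
    finally:
        conn.close()
    if count < expected_minimum:
        alert(f"expected >= {expected_minimum} invoices today, found {count}")
```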
Scheduled checks catch the failure modes where the work itself did not happen, regardless of whether the job thought it succeeded. The invoice job ran, sent its heartbeat, and produced zero invoices because the customer query returned empty. The heartbeat monitoring is happy. The scheduled check sees the empty invoice store and pages.
Scheduled checks have a higher operational cost. They are themselves scheduled jobs, and they need their own monitoring to make sure they ran. They require domain knowledge to know what to check for. They produce false positives when the underlying business state legitimately produced no work, and those have to be filtered out.
The benefit is that they catch the failure modes that heartbeats do not. For any scheduled work whose failure would silently affect customers, scheduled checks are not optional.
Choosing for a given job
Not every scheduled job needs both. The right framework is to think about what the job is actually for, and to choose the monitoring approach that catches the failure modes that matter.
Operational maintenance jobs. Log rotations, cache warmers, scheduled cleanups. These have minimal customer impact when they fail. A heartbeat alone is usually enough. The cost of a missed run is small.
Data movement jobs. Hourly syncs, ETL pipelines, replication. These have moderate customer impact when they fail. A heartbeat plus a scheduled freshness check on the destination is the right combination. The freshness check verifies that data is arriving, not just that the job ran; see the sketch after these categories.
Customer-facing scheduled work. Invoices, billing, scheduled emails, scheduled reports, scheduled deliveries. These have direct customer impact when they fail. A heartbeat plus a scheduled outcome check is required. The outcome check verifies that the customer-facing artifact was produced.
Compliance and security jobs. Backups, audit log rotations, key rotations, access reviews. These have severe regulatory consequences when they fail silently. A heartbeat plus a scheduled outcome check plus an additional scheduled verification check is appropriate. The third check is essentially a check on the second check, run by a different system, because the cost of all three failing the same way is too high to accept.
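The freshness check mentioned under data movement jobs is the simplest of these to sketch: compare the newest record in the destination against an acceptable lag. The names and the two-hour lag below are illustrative.

```python
from datetime import datetime, timedelta, timezone

def destination_is_stale(newest_record_at: datetime,
                         max_lag: timedelta) -> bool:
    """Freshness check for a sync destination.

    newest_record_at would come from something like
    SELECT MAX(updated_at) FROM synced_table on the destination,
    queried independently of the sync job itself.
    """
    return datetime.now(timezone.utc) - newest_record_at > max_lag

# An hourly sync whose newest record is three hours old should alarm
stale = destination_is_stale(
    newest_record_at=datetime.now(timezone.utc) - timedelta(hours=3),
    max_lag=timedelta(hours=2),
)
# stale is True
```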
The pattern is that the importance of the work determines how many independent observers should be watching it. For routine work, one heartbeat is enough. For critical work, multiple independent observers, each verifying the same outcome from a different angle, are what make silent failure functionally impossible.
Common mistakes in implementation
Several patterns reliably break monitoring of scheduled work.
The heartbeat is sent at the start of the job. This means a job that crashes mid-execution has still sent its heartbeat. Send it at the end, after the work is done, and ideally with a small payload describing the work.
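A sketch of the correct ordering, with the heartbeat as the literal last step of the job. The endpoint URL and the job body are placeholders; hosted services such as Healthchecks.io expose per-job ping URLs that work in this general way.

```python
import json
import urllib.request

# Placeholder URL for a per-job heartbeat endpoint
HEARTBEAT_URL = "https://monitoring.example.com/heartbeat/nightly-invoices"

def generate_invoices() -> list:
    # Stand-in for the job's actual work
    return []

def run_invoice_job() -> None:
    invoices = generate_invoices()
    # The heartbeat is the last step. A crash anywhere above this line
    # produces silence, which is exactly what the monitor alarms on.
    payload = json.dumps({"invoices_created": len(invoices)}).encode()
    request = urllib.request.Request(
        HEARTBEAT_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(request, timeout=10)
```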
The scheduled check uses the same database connection or the same code path as the job it is checking. A failure in the connection or the code affects both. The check needs to be independent enough that the failure mode it is watching for cannot also cause the check to silently succeed. Use a different connection, ideally a different process, ideally a different system.
The alarm is routed to the same on-call as the rest of the production system. For heartbeat absence in particular, the alarm needs to fire even when other systems are also failing. A heartbeat alarm that gets buried under fifty other pages during a major incident is a heartbeat alarm that does not work. Some teams route scheduled-job alarms to a separate channel with its own escalation, specifically to avoid this.
The alarm threshold is too tight. A job scheduled for 3:00 that alarms if it has not checked in by 3:05 will fire constantly under normal jitter. The threshold needs to accommodate the realistic distribution of run times plus a margin. A useful starting threshold is the ninety-fifth percentile run duration plus fifty percent.
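That starting threshold is easy to derive from run history. A sketch, assuming the team records per-run durations; statistics.quantiles with n=20 returns nineteen cut points at five-percent steps, so index 18 is the ninety-fifth percentile.

```python
import statistics
from datetime import timedelta

def alarm_threshold(run_durations: list[timedelta]) -> timedelta:
    """Starting threshold: 95th-percentile run duration plus fifty percent."""
    seconds = sorted(d.total_seconds() for d in run_durations)
    # n=20 yields cut points at 5% steps; index 18 is the 95th percentile
    p95 = statistics.quantiles(seconds, n=20)[18]
    return timedelta(seconds=p95 * 1.5)

# Ten recent runs of a job that usually takes about ten minutes
history = [timedelta(minutes=m) for m in (8, 9, 9, 10, 10, 11, 12, 14, 15, 22)]
threshold = alarm_threshold(history)  # comfortably above the typical run
```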
The alarm threshold is too loose. A nightly job that alarms only after twenty-four hours of absence is a job that can fail silently for an entire business day before anyone hears about it. For customer-facing work, the alarm should fire as soon as the missed window is statistically meaningful, which is usually within an hour of the expected completion.
Each of these is the kind of mistake that does not show up until the first incident. The team that has set up the system right does not learn about these mistakes during the incident. The team that has set up the system wrong learns about all of them at once.
A reasonable default
For a team installing scheduled job monitoring for the first time, a reasonable default looks like this.
Every production scheduled job has a heartbeat sent on completion, with a small payload describing the run. The heartbeat goes to a centralized monitoring service. The service alarms on absence within an appropriate window for each job.
Customer-facing and compliance-relevant jobs additionally have a scheduled outcome check, run independently a few hours after the job is expected to complete, that verifies the customer-visible effect of the work. The outcome check has its own heartbeat, monitored the same way.
Both heartbeats and outcome checks produce audit log entries that are retained for the period the team's compliance posture requires.
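Written down as configuration, the default posture might look like the following. This is a sketch only; the job names, fields, and values are illustrative, not any real tool's schema.

```python
# Illustrative configuration for the default posture described above
SCHEDULED_JOB_MONITORING = {
    "log-rotation": {  # operational maintenance: heartbeat alone
        "heartbeat": {"expected_every": "24h", "grace": "30m"},
    },
    "hourly-sync": {  # data movement: heartbeat plus freshness check
        "heartbeat": {"expected_every": "1h", "grace": "10m"},
        "outcome_check": {"type": "freshness", "max_lag": "2h"},
    },
    "nightly-invoices": {  # customer-facing: heartbeat plus outcome check
        "heartbeat": {"expected_every": "24h", "grace": "1h"},
        "outcome_check": {
            "type": "outcome",
            "runs_at": "06:00",
            "verifies": "today's invoices exist",
        },
        "audit_retention": "as required by the team's compliance posture",
    },
}
```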
A team with this posture in place has a good answer to the question that nobody wants to be asked the day a customer files a complaint about a missing invoice: "How did the system not notice?" The honest answer is that the system did notice. The on-call was paged. The work was caught while it was still recoverable. The customer never got the chance to be the monitoring layer.
That is what the work is for.