What to do when a batch job goes silent

An operator's guide to recognizing the silent failures of scheduled jobs, and the small set of practices that prevent the next one from being a customer-facing surprise.

Every operator has the same story. A scheduled job that ran for years stopped running on a Tuesday and nobody noticed for three days. By the time the issue surfaced, the data was stale, the downstream system had been quietly working on bad inputs, and somebody on the customer side had already escalated. The postmortem was short. The job had been silent. The monitoring assumed silent meant healthy.

Silent jobs are the most common production failure that nobody monitors. The reason they go uncaught is straightforward. Most monitoring is built around things that happen. A failed job emits an error. An overloaded service emits high latency. A crashed container emits a nonzero exit code. A cron job that simply does not run emits nothing at all, and a pipeline whose first step does not run is a pipeline whose entire dashboard goes quiet rather than red.

This is the practical view of how to find silent failures and how to keep the next one from costing you a customer.

The failure modes that produce silence

Silent failure of scheduled work falls into a small number of repeating patterns.

The scheduler did not fire. The cron service was down, the Kubernetes CronJob was moved to a namespace whose configuration had drifted, the cloud function's trigger expired, the on-prem orchestrator was held up by a separate maintenance task. Whatever the cause, the job that should have run did not. The system is operating exactly as designed. The design just did not account for the trigger itself failing.

The job started but exited early. A condition early in the script took the safe path and skipped the work, often without logging it as an error because the engineer who wrote it considered the early exit normal. Six months later, that early exit is firing every run because something upstream changed.

The job ran but did nothing. The query that pulls work to do returned zero rows because of a subtle change in an upstream system. Zero rows is not an error. The job exits successfully. The downstream system that expected work continues to wait.

The job ran but ran the wrong thing. The configuration that selects the work changed. A wrong environment variable, a stale config file, a credential rotation that was supposed to be transparent. The job runs every cycle and does work, just not the work that mattered. The dashboard is green. The downstream system is starving.

Each of these patterns produces a green status on most monitoring dashboards. The dashboard is asking the wrong question. It is asking "did the last run fail" when it should be asking "did the last run produce the correct result, on schedule, with the expected effect."

Heartbeats are the cheapest insurance

The single highest-leverage piece of monitoring for scheduled work is a heartbeat. The job, on every successful completion, sends a small signal somewhere. The monitoring system, on a separate schedule, expects that signal to arrive within a known window. If the signal does not arrive, the monitoring system pages.
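As a job-side sketch of the pattern, in Python, assuming a hypothetical check-in endpoint at monitoring.example.com; substitute whatever URL or API your monitoring system actually exposes:

    # End-of-run heartbeat (sketch; the endpoint URL is a placeholder).
    import urllib.request

    HEARTBEAT_URL = "https://monitoring.example.com/ping/nightly-billing"

    def do_the_work():
        # The job's existing body goes here.
        pass

    def main():
        do_the_work()
        # Sent only after the work completed. If this request never
        # arrives, the absence alarm fires once the expected window passes.
        urllib.request.urlopen(HEARTBEAT_URL, timeout=10)

    if __name__ == "__main__":
        main()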

The cost of installing heartbeats is small. A few minutes of work per job. The benefit is that every silent failure mode above becomes visible. The job that did not run does not send the heartbeat. The job that exited early does not send the heartbeat. The job that ran but did nothing also does not send the heartbeat, if the heartbeat is sent only when work was actually produced.

The mistake most teams make is sending the heartbeat at the start of the job rather than at successful completion. A heartbeat at start tells you the job started. A heartbeat at end with a short summary of what happened tells you the job finished correctly. The end-of-job heartbeat is what catches the silent failures.

A second mistake is sending heartbeats only on success. Send them on every termination, with a status code. The monitoring system can distinguish a successful run from a known failure. Both are better than silence.
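One way to wire that up is to put the heartbeat on every exit path, with the status encoded in the request. A sketch under the same assumptions; the /ok and /fail suffixes are a convention the receiving side would have to understand, not a standard:

    # Heartbeat on every termination, not only on success (sketch).
    import urllib.request

    HEARTBEAT_URL = "https://monitoring.example.com/ping/nightly-billing"

    def send_heartbeat(status):
        # Append the status to the ping path so the monitoring system can
        # tell a known failure from silence.
        urllib.request.urlopen(f"{HEARTBEAT_URL}/{status}", timeout=10)

    def run_job():
        ...  # the actual work

    if __name__ == "__main__":
        try:
            run_job()
        except Exception:
            send_heartbeat("fail")  # a known failure, better than silence
            raise
        send_heartbeat("ok")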

Define what success looks like

A heartbeat that says "I ran" is better than nothing. A heartbeat that says "I ran and produced the expected effect" is better still.

For most scheduled work, success is not the absence of an error. It is the production of a result. The nightly billing job is successful when invoices were generated and sent. The hourly sync job is successful when records moved between systems. The weekly report is successful when the report landed in the destination.

The heartbeat that records the count of rows processed, the count of items emitted, or the bytes written gives the monitoring system enough to alarm on more than absence. It can alarm on "the job ran but produced zero rows when it normally produces ten thousand." That alarm catches the configuration drift and the upstream change long before the customer-facing system notices.
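In practice this can be as simple as posting a small JSON payload instead of a bare ping. A sketch; the field names are illustrative, not a standard:

    # Heartbeat with a result payload (sketch; field names are illustrative).
    import json
    import urllib.request

    HEARTBEAT_URL = "https://monitoring.example.com/ping/nightly-billing"

    def send_heartbeat(rows_processed, invoices_sent):
        # The counts let the monitor alarm on "ran but produced nothing",
        # not just on absence.
        payload = json.dumps({
            "status": "ok",
            "rows_processed": rows_processed,
            "invoices_sent": invoices_sent,
        }).encode("utf-8")
        req = urllib.request.Request(
            HEARTBEAT_URL,
            data=payload,
            headers={"Content-Type": "application/json"},
        )
        urllib.request.urlopen(req, timeout=10)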

This is the difference between monitoring whether the job ran and monitoring whether the work happened. The second is what your operations team actually needs.

Audit trails for the boring questions

Many regulated environments require that scheduled work produce an audit trail showing what ran, when, with what configuration, and to what effect. The teams that do this from the beginning have a much easier time at the next audit. The teams that bolt it on later spend a quarter reconstructing run history from whatever logs the cloud provider still retains.

A reasonable audit trail for a scheduled job records the schedule that triggered the run, the configuration in effect at the time, the time of start and the time of completion, the result of the run including counts and any handled errors, and a hash of the input and the output where applicable. The total cost is a few extra log lines and a destination that retains them for the period your compliance posture requires.
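A sketch of one such record written as a JSON line; the field names are illustrative, and the hashing assumes the inputs and outputs are files you can digest:

    # One audit record per run, appended as a JSON line (sketch; field
    # names are illustrative, not a standard).
    import hashlib
    import json

    def sha256_of(path):
        with open(path, "rb") as f:
            return hashlib.sha256(f.read()).hexdigest()

    def write_audit_record(log_path, schedule, config, started_at, finished_at,
                           counts, handled_errors, input_path=None, output_path=None):
        record = {
            "schedule": schedule,            # what triggered the run
            "config": config,                # configuration in effect at the time
            "started_at": started_at,
            "finished_at": finished_at,
            "counts": counts,                # e.g. {"rows_processed": 10000}
            "handled_errors": handled_errors,
            "input_sha256": sha256_of(input_path) if input_path else None,
            "output_sha256": sha256_of(output_path) if output_path else None,
        }
        with open(log_path, "a") as f:
            f.write(json.dumps(record) + "\n")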

The trail pays for itself outside of audits as well. A team debugging a six-week-old data quality issue is much faster when they can see exactly which configuration ran on which day. The "we'll find out by reading the code" approach is slower than people remember.

What to alarm on

Once heartbeats and audit trails are in place, the alarms that actually pay for themselves are a small set.

Heartbeat absent. The job did not run, did not finish, or did not check in. This is the alarm that catches the most silent failures.

Heartbeat present but result outside the expected band. The job ran but produced an unusually high or unusually low count. This is the alarm that catches the configuration drift.

Sustained drift in run duration. The job is now taking three times as long as it used to. This is the alarm that catches creeping data growth and slow upstream systems before they become outages.

Sustained drift in error rate within the job. The job is succeeding overall, but a higher fraction of records are being skipped due to handled errors. This is the alarm that catches the gradual poisoning of upstream data.

These four alarms together catch most of the silent failure modes, with a manageable false positive rate, and without requiring sophisticated tooling.
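To make the four checks concrete, here is a sketch of how the monitoring side might evaluate them against recent heartbeat records; the record shape and the thresholds are assumptions to be tuned per job:

    # Sketch of the four alarm checks over recent heartbeat records.
    # Each record is assumed to look like: {"at": epoch_seconds,
    # "count": items_produced, "duration": seconds, "skipped": handled
    # errors, "total": records seen}. Thresholds are illustrative.
    import time

    def evaluate_alarms(records, window_seconds, now=None):
        now = now or time.time()
        alarms = []

        # 1. Heartbeat absent: nothing arrived within the expected window.
        if not records or now - records[-1]["at"] > window_seconds:
            alarms.append("heartbeat absent")
            return alarms  # nothing else to evaluate without a recent run

        latest, history = records[-1], records[:-1]
        if history:
            # 2. Result outside the expected band (here: 2x either way
            # of the median of past runs).
            baseline = sorted(r["count"] for r in history)[len(history) // 2]
            if baseline and not (baseline / 2 <= latest["count"] <= baseline * 2):
                alarms.append("result outside expected band")

            # 3. Sustained duration drift (here: 3x the typical run).
            typical = sum(r["duration"] for r in history) / len(history)
            if latest["duration"] > 3 * typical:
                alarms.append("run duration drift")

            # 4. Rising handled-error rate within the job.
            past_rate = sum(r["skipped"] for r in history) / max(1, sum(r["total"] for r in history))
            latest_rate = latest["skipped"] / max(1, latest["total"])
            if latest_rate > 2 * past_rate + 0.01:
                alarms.append("handled-error rate drift")

        return alarms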

A reasonable starting posture

A team that has not invested in scheduled job monitoring before can get most of the benefit by doing the following in a single quarter. Add a heartbeat to every production scheduled job, sent on completion with a small payload. Centralize the heartbeats into a monitoring system that alarms on absence. Add a result-based alarm to the small handful of jobs whose silent failure would directly affect customers. Set up an audit trail destination that retains the run records for at least a year.

The first time one of these alarms fires for a real incident, it pays for the rest of the work. The team that has lived through one silent failure remembers the cost. The team that has not lived through one yet will, eventually, and the question is whether they will hear about it from their monitoring or from their customers.