Cron Job Monitoring: How to Know When Your Scheduled Tasks Fail Silently

Learn how cron job monitoring works, why scheduled tasks fail silently, and how to set up heartbeat monitoring to catch failures before they cause real damage.

Written by Timothy Bramlett

Your website is up. Your API responds in 200ms. Every dashboard is green. But somewhere in the background, a cron job that processes payments failed six hours ago and nobody noticed.

This is the cron job problem: they fail silently. There's no user refreshing a page and seeing an error. There's no spike in your error logs (because the job never ran at all). The first sign of trouble is usually a customer email asking why their data hasn't updated, or worse, a finance team discovering that invoices haven't been sent for a week.

Cron job monitoring solves this by flipping the model. Instead of checking if something is up, it checks if something happened. If your job doesn't report in on time, you get an alert.

The Silent Problem With Scheduled Tasks

Traditional uptime monitoring works by sending requests to your server and checking the response. That catches web server failures, but it completely misses background processes. Your server can return a perfect 200 OK while every scheduled task behind it is broken.

Consider what runs on cron in a typical application:

  • Database backups that run nightly. If they fail, you won't know until you need a restore.
  • Email queue processing that runs every minute. If it stops, transactional emails pile up silently.
  • Data synchronization between services. A failed sync means stale data across your platform.
  • Report generation for dashboards or stakeholders. Stale numbers lead to bad decisions.
  • SSL certificate renewal via Certbot or similar tools. A failed renewal means your site goes down when the cert expires.
  • Cleanup tasks that purge temp files or expired sessions. If they fail, disk fills up slowly until the application crashes.

Every one of these can fail without any visible symptoms for hours or days. By the time someone notices, the damage is done: lost backups, missed emails, stale data, or a full disk bringing down your entire application.

How Cron Job Monitoring Works

Cron job monitoring uses a concept called heartbeat monitoring (also known as a "dead man's switch"). The idea is simple:

  1. You get a unique URL from your monitoring service (e.g., https://hb.example.com/abc123)
  2. Your cron job pings that URL after it completes successfully
  3. The monitoring service expects that ping within a defined window (e.g., every hour)
  4. If the ping doesn't arrive, the service sends you an alert

This is the opposite of uptime monitoring. Instead of the monitoring service reaching out to your server, your server reaches out to the monitoring service. If your job fails, crashes, or never starts, the ping never arrives, and you get notified.

A Simple Example

Say you have a database backup that runs every night at 2:00 AM. Here's how you'd add heartbeat monitoring:

Before (no monitoring):

0 2 * * * /usr/local/bin/backup-database.sh

After (with heartbeat monitoring):

0 2 * * * /usr/local/bin/backup-database.sh && curl -fsS --retry 3 https://hb.example.com/abc123

The && is critical. It means the curl only runs if the backup succeeds. If the backup script exits with a non-zero code (failure), the ping never fires, and you get alerted.
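
The exit-code gating is easy to demonstrate with shell built-ins: `false` stands in for a failing job and `true` for a succeeding one (the `|| true` on the failure line only keeps the demo's own exit code clean).

```shell
#!/bin/sh
# `false` simulates a job that fails (non-zero exit): && short-circuits,
# so the ping never fires.
ping_fired="no"
false && ping_fired="yes" || true
echo "after failed job: ping_fired=$ping_fired"

# `true` simulates a job that succeeds (exit 0): && fires the ping.
ping_fired="no"
true && ping_fired="yes"
echo "after successful job: ping_fired=$ping_fired"
```

Run it and the first line reports the ping was skipped, the second that it fired — exactly the behavior the crontab entry above relies on.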

Heartbeat vs. Uptime Monitoring

| Aspect | Uptime Monitoring | Heartbeat Monitoring |
| --- | --- | --- |
| Direction | Service checks your server | Your server checks in with service |
| Detects | Server/website is down | Background job didn't run |
| Setup | Just add a URL | Add a curl/HTTP call to your job |
| Catches silent failures | No | Yes |
| Best for | Websites, APIs, services | Cron jobs, backups, batch processing |

Most teams need both. Uptime monitoring tells you when your website or API goes down. Heartbeat monitoring tells you when your background jobs stop running. Together, they cover your entire application.

9 Reasons Cron Jobs Fail Silently

Understanding failure modes helps you decide what to monitor and how to configure alerting thresholds.

1. The Job Never Starts

The cron daemon itself can fail. After a server reboot, cron might not restart automatically. On managed platforms like Heroku or AWS ECS, the scheduler can silently drop tasks during deployments. If the job never starts, there's nothing in the logs to investigate.

2. Environment Variable Issues

Cron jobs run in a minimal shell environment. Variables from .bashrc or .profile aren't loaded. A script that works perfectly when you run it manually fails under cron because $PATH, database credentials, or API keys aren't set. This is one of the most common cron failures and one of the hardest to debug without monitoring.
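
A common mitigation is to declare the environment explicitly at the top of the crontab, and, when debugging, capture the environment cron actually provides. A sketch (the paths and email address are placeholders):

```shell
# Crontab fragment: declare the environment explicitly instead of relying
# on .bashrc/.profile, which cron never loads.
SHELL=/bin/bash
PATH=/usr/local/bin:/usr/bin:/bin
MAILTO=ops@example.com   # where cron emails stderr (placeholder address)

# Debugging aid: dump the environment cron actually runs with,
# then diff it against `env` from your interactive shell.
* * * * * env > /tmp/cron-env.txt 2>&1
```

Diffing `/tmp/cron-env.txt` against your interactive `env` output usually reveals the missing variable within minutes.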

3. Resource Exhaustion

The job starts but gets killed mid-execution. Common causes: out of memory (OOM killer), disk full, too many open file handles, or hitting a process limit. The job exits with a signal rather than a clean error, so error handling in your script never triggers.

4. Overlapping Runs

A job scheduled every 5 minutes takes 7 minutes to complete. Now two instances run simultaneously. They compete for the same resources, corrupt shared data, or deadlock on database locks. Each instance might "succeed" individually while the data they produce is garbage.

5. Dependency Failures

Your job depends on an external API, a database, or a third-party service. If that dependency is down when the job runs, it fails. Unlike a web request that can be retried by the user, a cron job that fails at 3:00 AM won't be retried until the next scheduled run (which might also fail if the dependency is still down).

6. Timezone and DST Confusion

A job scheduled for 2:30 AM skips execution during spring daylight saving time (2:00 AM jumps to 3:00 AM). In fall, it runs twice. Servers in UTC avoid this, but if your crontab uses local time, DST transitions can silently skip or duplicate jobs.

7. Permission Changes

A deploy changes file permissions. A security update modifies directory ownership. The cron user can no longer write to the output directory or read the config file. The job fails with a permission denied error that goes to /dev/null because the crontab redirects stderr.

8. Partial Execution

The job runs but only completes part of its work. It processes 500 of 10,000 records before timing out. It writes a backup file but the file is truncated. The exit code is 0 (success) because the script doesn't validate its own output. This is the sneakiest failure mode because everything looks fine until someone inspects the actual results.

9. Platform Scheduler Failures

On managed platforms (Heroku Scheduler, AWS EventBridge, Google Cloud Scheduler, Kubernetes CronJobs), the scheduler itself can fail. Heroku Scheduler is explicitly documented as "best effort" and can skip executions. Kubernetes CronJobs can miss their schedule if the cluster is under pressure. These failures are invisible to your application code.

Which Cron Jobs Should You Monitor?

Not every cron job needs monitoring. Focus on jobs where a silent failure has real consequences.

High Priority (Monitor These First)

| Job Type | Impact of Failure | Time to Notice Without Monitoring |
| --- | --- | --- |
| Database backups | Data loss during incident | Days or weeks (until you need a restore) |
| Payment processing | Revenue loss, customer complaints | Hours (when customers report) |
| Email queue | Undelivered transactional emails | Hours (when users don't get confirmations) |
| SSL certificate renewal | Site goes down when cert expires | Days (when browser shows security warning) |
| Data sync between services | Stale data, inconsistent state | Hours to days (when users report wrong data) |

Medium Priority

  • Report generation: Stakeholders get stale dashboards, but no immediate user impact
  • Search index updates: Search results become outdated but the site still works
  • Cache warming: Pages load slower until the cache rebuilds
  • Analytics aggregation: Metrics lag behind, but raw data is still captured

Lower Priority

  • Log rotation: Disk fills up slowly; you have time to fix it
  • Temp file cleanup: Same as log rotation, a slow burn
  • Non-critical notifications: Internal Slack digests, weekly summaries

How to Set Up Cron Job Monitoring

Regardless of which tool you use, the setup follows the same pattern. Here's a step-by-step guide.

Step 1: Create a Heartbeat Monitor

In your monitoring tool, create a new heartbeat/cron monitor. You'll need to set:

  • Name: Something descriptive (e.g., "Nightly DB Backup" or "Email Queue Processor")
  • Expected interval: How often the job should check in (e.g., every 24 hours for a nightly job)
  • Grace period: Extra time before alerting (e.g., 10 minutes for a job that usually takes 5 minutes)

The service gives you a unique URL to ping.

Step 2: Add the Ping to Your Cron Job

Add a curl call to the end of your cron job. The exact approach depends on your setup:

Simple shell script:

#!/bin/bash
set -e

# Your actual job
/usr/local/bin/backup-database.sh

# Only runs if the above succeeded (set -e exits on error)
curl -fsS --retry 3 https://hb.example.com/abc123

Python script:

import requests

def main():
    # Your actual job logic
    process_email_queue()

    # Signal success
    requests.get("https://hb.example.com/abc123", timeout=10)

if __name__ == "__main__":
    main()

Inline crontab:

# Ping only on success (&&)
0 2 * * * /usr/local/bin/backup.sh && curl -fsS --retry 3 https://hb.example.com/abc123

# Ping with status (send exit code)
0 2 * * * /usr/local/bin/backup.sh; curl -fsS "https://hb.example.com/abc123/$?"
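
If several jobs need the same treatment, a small wrapper function keeps the crontab readable. This is a sketch, not a specific tool's API: the URL is a placeholder, and the `/<exit-code>` suffix follows the convention shown above (Healthchecks.io-style), which not every service supports.

```shell
#!/bin/bash
PING_URL="https://hb.example.com/abc123"   # placeholder heartbeat URL

# run_with_heartbeat JOB [ARGS...]: run the job, then report its exit code
# by appending it to the ping URL.
run_with_heartbeat() {
  "$@"
  local status=$?
  # A ping failure must never change the job's outcome.
  curl -fsS --retry 3 --max-time 10 "$PING_URL/$status" >/dev/null 2>&1 || true
  return "$status"
}
```

A wrapper script would end by calling `run_with_heartbeat "$@"`, so the crontab entry becomes `0 2 * * * /usr/local/bin/run_with_heartbeat.sh /usr/local/bin/backup.sh` (hypothetical paths).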

Step 3: Configure Alerts

Set up your alert channels based on severity:

  • Critical jobs (backups, payments): SMS or phone call alerts so you wake up at 3 AM
  • Important jobs (data sync, email queue): Slack or email alerts during business hours
  • Low-priority jobs (cleanup, reports): Email digest or Slack notification

Step 4: Verify It Works

Don't just set it and forget it. Test the monitoring by intentionally breaking the job:

  1. Comment out the curl/ping line and wait for the expected interval to pass
  2. Verify you receive the alert through your configured channels
  3. Restore the curl line and confirm the monitor shows as healthy again

Cron Job Monitoring Tools Compared

Several tools offer heartbeat/cron monitoring. Here's how they compare.

| Tool | Free Tier | Paid From | Focus |
| --- | --- | --- | --- |
| Healthchecks.io | 20 checks | $20/mo | Cron monitoring only. Open source, self-hostable. |
| Cronitor | 5 monitors | $12/mo | Cron + uptime + telemetry. CLI tool for crontab integration. |
| Better Stack | 10 heartbeats | $24/mo | Uptime + heartbeats + incident management. Full platform. |
| Dead Man's Snitch | 1 snitch | $5/mo | Cron monitoring only. Simple and minimal. |
| UptimeRobot | Limited | $7/mo | Primarily uptime monitoring. Heartbeats on paid plans. |
| Uptime Kuma | Free (self-hosted) | $0 | Open source. Supports push monitors (heartbeat equivalent). |

Which Tool Should You Choose?

  • If you only need cron monitoring: Healthchecks.io is purpose-built for this. The free tier covers most small teams, and you can self-host the open source version if you prefer.
  • If you need cron + uptime monitoring: Cronitor or Better Stack bundle both. Cronitor is more affordable; Better Stack adds incident management.
  • If you want full control: Uptime Kuma's push monitors work as heartbeat checks and the whole thing is self-hosted and free.
  • If you want the simplest possible setup: Dead Man's Snitch does one thing and does it well.

Pair Heartbeat Monitoring With Uptime Monitoring

Heartbeat monitoring covers your background jobs, but you still need traditional uptime monitoring for your website and API. The best setup uses both types of monitoring together. Notifier already handles uptime monitoring for your web-facing services (with a free tier of 10 monitors, status pages, and SMS/phone alerts), and heartbeat monitoring for cron jobs is coming soon. Once available, you'll be able to monitor both your website and your scheduled tasks from a single dashboard.

Cron Job Monitoring Best Practices

Always Ping After Success, Not Before

Place the heartbeat ping at the end of your job, gated behind a success check. If you ping at the start, you'll get a "healthy" signal even when the job crashes halfway through.

# Wrong: pings before the job runs, so a crashed job still looks healthy
curl https://hb.example.com/abc123 && /usr/local/bin/backup.sh

# Right: only pings on success
/usr/local/bin/backup.sh && curl -fsS --retry 3 https://hb.example.com/abc123

Set Grace Periods Carefully

If your backup usually takes 10 minutes but sometimes takes 30 during heavy load, set the grace period to at least 35 minutes. Too tight and you get false alarms. Too loose and you miss real failures. Start generous and tighten over time as you learn the job's typical duration.
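
One way to ground the grace period in data is to log each run's duration for a week or two and set the threshold from the observed maximum. A minimal sketch, with hypothetical paths:

```shell
#!/bin/sh
# log_duration LOGFILE JOB [ARGS...]: run the job and append how long it took,
# so the grace period can be set from real durations instead of guesses.
log_duration() {
  logfile=$1; shift
  start=$(date +%s)
  "$@"
  status=$?
  end=$(date +%s)
  echo "$(date -Iseconds) duration=$((end - start))s exit=$status" >> "$logfile"
  return "$status"
}

# Crontab usage (placeholder paths):
#   0 2 * * * /usr/local/bin/log_duration.sh /var/log/backup-durations.log /usr/local/bin/backup.sh
```

After a week, the longest logged `duration=` value plus a margin is a defensible grace period.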

Prevent Overlapping Runs

Use a lock file or flock to prevent two instances of the same job from running simultaneously:

# Use flock to prevent overlapping runs
* * * * * /usr/bin/flock -n /tmp/email-queue.lock /usr/local/bin/process-email-queue.sh && curl -fsS https://hb.example.com/abc123

Don't Let Monitoring Break Your Job

The curl ping should never prevent your job from completing. Use flags like -fsS (-f makes curl exit non-zero on HTTP errors, -s suppresses progress output, -S still prints errors when they occur) and --retry 3 to handle transient network issues. If the monitoring service is down, your job should still run.

# Good: the ping fires only on success, and a failed ping can't affect the job
/usr/local/bin/backup.sh && (curl -fsS --retry 3 --max-time 10 https://hb.example.com/abc123 || true)

Use UTC for All Cron Schedules

Avoid DST-related failures entirely by running your cron daemon in UTC and scheduling all jobs in UTC. Add CRON_TZ=UTC to the top of your crontab or set the system timezone to UTC on servers that only run background jobs.
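
A minimal crontab fragment for this, assuming a cron daemon that honors CRON_TZ (cronie, the default on most Linux distributions, does; classic Vixie cron does not):

```shell
# Run this crontab's schedules in UTC regardless of the system timezone.
# CRON_TZ is supported by cronie; on crons without it, set the system
# timezone to UTC instead.
CRON_TZ=UTC

# 02:00 UTC every night, immune to DST transitions
0 2 * * * /usr/local/bin/backup.sh && curl -fsS --retry 3 https://hb.example.com/abc123
```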

Monitor the Full Pipeline, Not Just the Trigger

For multi-step jobs, don't just ping after the first step. If your job extracts data, transforms it, and loads it into a database (ETL), place the heartbeat ping after the final load step. Better yet, validate the output before pinging:

#!/bin/bash
set -e

# Extract
python extract.py --output /tmp/data.csv

# Transform
python transform.py --input /tmp/data.csv --output /tmp/clean.csv

# Load
python load.py --input /tmp/clean.csv --db production

# Validate (check row count is reasonable)
ROW_COUNT=$(python count_rows.py --db production --table imports)
if [ "$ROW_COUNT" -gt 100 ]; then
    curl -fsS --retry 3 https://hb.example.com/abc123
else
    echo "WARNING: Only $ROW_COUNT rows loaded, expected 100+. Not signaling success."
    exit 1
fi

Frequently Asked Questions

Can I use regular uptime monitoring for cron jobs?

Not directly. Uptime monitoring checks if a URL responds, but cron jobs don't have URLs. You could create a health check endpoint that reports the last successful run time of each job, then monitor that endpoint with a tool like Notifier. But purpose-built heartbeat monitoring is simpler and more reliable for this use case. Notifier is adding heartbeat monitoring soon, which will let you track both your website uptime and cron job health from one place.

What if my cron job runs on a server without internet access?

If your server can't reach external URLs, you have a few options. You can self-host Healthchecks.io or Uptime Kuma on your internal network. You can route the heartbeat through an internal proxy. Or you can have the job write a timestamp file and monitor that file's age from a server that does have internet access.
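
The timestamp-file option can be sketched as two small pieces; the paths and the freshness threshold are placeholders you'd adapt:

```shell
#!/bin/sh
# Piece 1, on the offline server: the job records a timestamp on success.
mark_success() {
  date +%s > "$1"   # e.g. /var/run/nightly-backup.last-success (placeholder)
}

# Piece 2, on a machine that can read the file AND reach the internet:
# succeed only if the last recorded success is recent enough.
check_freshness() {
  stamp_file=$1
  max_age_seconds=$2
  last=$(cat "$stamp_file" 2>/dev/null) || return 1   # no successful run recorded
  now=$(date +%s)
  [ $((now - last)) -le "$max_age_seconds" ]
}

# The checking side runs from cron and pings the heartbeat URL only while
# the stamp stays fresh, e.g. 90000s (~25h) for a nightly job:
#   check_freshness /mnt/backup/last-success 90000 && curl -fsS https://hb.example.com/abc123
```

If the offline job stops running, the stamp goes stale, the checker stops pinging, and the heartbeat monitor alerts — the absence-of-signal model relayed through a file.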

How is this different from log monitoring?

Log monitoring catches errors that are logged. Heartbeat monitoring catches jobs that never run at all. If your cron daemon fails to start the job, there's nothing to log. If the server reboots and cron doesn't restart, there's no error message anywhere. Heartbeat monitoring covers these blind spots because the absence of a signal is the signal.

How many heartbeat monitors do I need?

One per critical cron job. Most applications have 3 to 10 cron jobs worth monitoring. Start with the ones in the "High Priority" table above and expand from there. Don't monitor every single scheduled task; focus on the ones where a failure has real business impact.

What about Kubernetes CronJobs?

Kubernetes CronJobs have the same silent failure problem. The scheduler can miss executions under cluster pressure, and pods can be evicted mid-job. Add a heartbeat ping to your container's entrypoint script, or use a sidecar pattern that pings the monitoring service after the main container exits successfully. Tools like Cronitor and Healthchecks.io both support Kubernetes-native integrations.
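
A hedged sketch of the entrypoint approach: `JOB_CMD` and `HEARTBEAT_URL` are hypothetical names you'd supply via the pod spec (the URL ideally from a Secret).

```shell
#!/bin/sh
# Entrypoint sketch for a Kubernetes CronJob container.
# JOB_CMD and HEARTBEAT_URL come from the pod spec (hypothetical names).
JOB_CMD="${JOB_CMD:-/app/run-sync}"
HEARTBEAT_URL="${HEARTBEAT_URL:-https://hb.example.com/abc123}"

run_and_ping() {
  "$JOB_CMD" || return 1   # propagate failure; no ping, so the monitor alerts
  # Ping only on success; a ping failure shouldn't fail the pod.
  curl -fsS --retry 3 --max-time 10 "$HEARTBEAT_URL" || true
}
```

The script would end by calling `run_and_ping`, and the CronJob spec's `concurrencyPolicy` handles overlapping runs at the scheduler level.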

Should I combine cron monitoring with uptime monitoring?

Yes. They solve different problems. Uptime monitoring catches server and website failures. Heartbeat monitoring catches background job failures. Together, they give you complete coverage of your application. Notifier already offers uptime monitoring and will soon support heartbeat monitoring as well, so you can manage everything from a single dashboard.

Uptime + Heartbeat Monitoring, One Dashboard

Notifier already monitors your website. Heartbeat monitoring for cron jobs is coming soon. Sign up free and be first to know when it launches.

Written by Timothy Bramlett, Founder of Notifier.so. Software engineer and entrepreneur building tools for website monitoring and uptime tracking.