Your website is up. Your API responds in 200ms. Every dashboard is green. But somewhere in the background, a cron job that processes payments failed six hours ago and nobody noticed.
This is the cron job problem: they fail silently. There's no user refreshing a page and seeing an error. There's no spike in your error logs (because the job never ran at all). The first sign of trouble is usually a customer email asking why their data hasn't updated, or worse, a finance team discovering that invoices haven't been sent for a week.
Cron job monitoring solves this by flipping the model. Instead of checking if something is up, it checks if something happened. If your job doesn't report in on time, you get an alert.
The Silent Problem With Scheduled Tasks
Traditional uptime monitoring works by sending requests to your server and checking the response. That catches web server failures, but it completely misses background processes. Your server can return a perfect 200 OK while every scheduled task behind it is broken.
Consider what runs on cron in a typical application:
- Database backups that run nightly. If they fail, you won't know until you need a restore.
- Email queue processing that runs every minute. If it stops, transactional emails pile up silently.
- Data synchronization between services. A failed sync means stale data across your platform.
- Report generation for dashboards or stakeholders. Stale numbers lead to bad decisions.
- SSL certificate renewal via Certbot or similar tools. A failed renewal means your site goes down when the cert expires.
- Cleanup tasks that purge temp files or expired sessions. If they fail, disk fills up slowly until the application crashes.
Every one of these can fail without any visible symptoms for hours or days. By the time someone notices, the damage is done: lost backups, missed emails, stale data, or a full disk bringing down your entire application.
How Cron Job Monitoring Works
Cron job monitoring uses a concept called heartbeat monitoring (also known as a "dead man's switch"). The idea is simple:
1. You get a unique URL from your monitoring service (e.g., https://hb.example.com/abc123)
2. Your cron job pings that URL after it completes successfully
3. The monitoring service expects that ping within a defined window (e.g., every hour)
4. If the ping doesn't arrive, the service sends you an alert
This is the opposite of uptime monitoring. Instead of the monitoring service reaching out to your server, your server reaches out to the monitoring service. If your job fails, crashes, or never starts, the ping never arrives, and you get notified.
A Simple Example
Say you have a database backup that runs every night at 2:00 AM. Here's how you'd add heartbeat monitoring:
Before (no monitoring):
0 2 * * * /usr/local/bin/backup-database.sh
After (with heartbeat monitoring):
0 2 * * * /usr/local/bin/backup-database.sh && curl -fsS --retry 3 https://hb.example.com/abc123
The && is critical. It means the curl only runs if the backup succeeds. If the backup script exits with a non-zero code (failure), the ping never fires, and you get alerted.
Heartbeat vs. Uptime Monitoring
| Aspect | Uptime Monitoring | Heartbeat Monitoring |
|---|---|---|
| Direction | Service checks your server | Your server checks in with service |
| Detects | Server/website is down | Background job didn't run |
| Setup | Just add a URL | Add a curl/HTTP call to your job |
| Catches silent failures | No | Yes |
| Best for | Websites, APIs, services | Cron jobs, backups, batch processing |
Most teams need both. Uptime monitoring tells you when your website or API goes down. Heartbeat monitoring tells you when your background jobs stop running. Together, they cover your entire application.
9 Reasons Cron Jobs Fail Silently
Understanding failure modes helps you decide what to monitor and how to configure alerting thresholds.
1. The Job Never Starts
The cron daemon itself can fail. After a server reboot, cron might not restart automatically. On managed platforms like Heroku or AWS ECS, the scheduler can silently drop tasks during deployments. If the job never starts, there's nothing in the logs to investigate.
2. Environment Variable Issues
Cron jobs run in a minimal shell environment. Variables from .bashrc or .profile aren't loaded. A script that works perfectly when you run it manually fails under cron because $PATH, database credentials, or API keys aren't set. This is one of the most common cron failures and one of the hardest to debug without monitoring.
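The usual fix is to declare the environment inside the script itself instead of relying on .bashrc or .profile, which cron never loads. A minimal sketch, where the paths and the BACKUP_DIR variable are illustrative assumptions:

```shell
#!/bin/bash
# Declare everything the job needs explicitly; cron's environment is
# nearly empty, so nothing from your login shell can be assumed.
export PATH=/usr/local/bin:/usr/bin:/bin
export BACKUP_DIR="${BACKUP_DIR:-/var/backups}"   # hypothetical setting

# Fail loudly if a required variable is missing rather than running
# half-configured under cron's minimal environment.
: "${BACKUP_DIR:?BACKUP_DIR must be set}"
echo "running with PATH=$PATH"
```

Testing the script with `env -i /bin/bash script.sh` approximates cron's empty environment and catches these failures before they hit the schedule.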
3. Resource Exhaustion
The job starts but gets killed mid-execution. Common causes: out of memory (the OOM killer), a full disk, too many open file handles, or hitting a process limit. The job is terminated by a signal rather than exiting with a clean error, so the error handling in your script never runs.
4. Overlapping Runs
A job scheduled every 5 minutes takes 7 minutes to complete. Now two instances run simultaneously. They compete for the same resources, corrupt shared data, or deadlock on database locks. Each instance might "succeed" individually while the data they produce is garbage.
5. Dependency Failures
Your job depends on an external API, a database, or a third-party service. If that dependency is down when the job runs, it fails. Unlike a web request that can be retried by the user, a cron job that fails at 3:00 AM won't be retried until the next scheduled run (which might also fail if the dependency is still down).
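One mitigation worth sketching is a small retry loop, so a brief dependency outage doesn't cost you the whole run. The `retry` helper below is illustrative, not part of any particular tool, and the script path and URL in the usage comment are placeholders:

```shell
#!/bin/bash
# Hypothetical retry helper: run a command up to N times and only
# report success if one attempt succeeds.
retry() {
    local attempts=$1
    shift
    local n
    for n in $(seq 1 "$attempts"); do
        if "$@"; then
            return 0
        fi
        if [ "$n" -lt "$attempts" ]; then
            sleep "$n"   # linear backoff between attempts: 1s, 2s, ...
        fi
    done
    return 1
}

# Usage sketch:
#   retry 3 /usr/local/bin/sync-data.sh && curl -fsS https://hb.example.com/abc123
```

Because the heartbeat ping still sits behind `&&`, you get alerted only when all attempts fail, not on a transient blip.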
6. Timezone and DST Confusion
A job scheduled for 2:30 AM skips execution during spring daylight saving time (2:00 AM jumps to 3:00 AM). In fall, it runs twice. Servers in UTC avoid this, but if your crontab uses local time, DST transitions can silently skip or duplicate jobs.
7. Permission Changes
A deploy changes file permissions. A security update modifies directory ownership. The cron user can no longer write to the output directory or read the config file. The job fails with a permission denied error that goes to /dev/null because the crontab redirects stderr.
8. Partial Execution
The job runs but only completes part of its work. It processes 500 of 10,000 records before timing out. It writes a backup file but the file is truncated. The exit code is 0 (success) because the script doesn't validate its own output. This is the sneakiest failure mode because everything looks fine until someone inspects the actual results.
9. Platform Scheduler Failures
On managed platforms (Heroku Scheduler, AWS EventBridge, Google Cloud Scheduler, Kubernetes CronJobs), the scheduler itself can fail. Heroku Scheduler is explicitly documented as "best effort" and can skip executions. Kubernetes CronJobs can miss their schedule if the cluster is under pressure. These failures are invisible to your application code.
Which Cron Jobs Should You Monitor?
Not every cron job needs monitoring. Focus on jobs where a silent failure has real consequences.
High Priority (Monitor These First)
| Job Type | Impact of Failure | Time to Notice Without Monitoring |
|---|---|---|
| Database backups | Data loss during incident | Days or weeks (until you need a restore) |
| Payment processing | Revenue loss, customer complaints | Hours (when customers report) |
| Email queue | Undelivered transactional emails | Hours (when users don't get confirmations) |
| SSL certificate renewal | Site goes down when cert expires | Days (when browser shows security warning) |
| Data sync between services | Stale data, inconsistent state | Hours to days (when users report wrong data) |
Medium Priority
- Report generation: Stakeholders get stale dashboards, but no immediate user impact
- Search index updates: Search results become outdated but the site still works
- Cache warming: Pages load slower until the cache rebuilds
- Analytics aggregation: Metrics lag behind, but raw data is still captured
Lower Priority
- Log rotation: Disk fills up slowly; you have time to fix it
- Temp file cleanup: Same as log rotation, a slow burn
- Non-critical notifications: Internal Slack digests, weekly summaries
How to Set Up Cron Job Monitoring
Regardless of which tool you use, the setup follows the same pattern. Here's a step-by-step guide.
Step 1: Create a Heartbeat Monitor
In your monitoring tool, create a new heartbeat/cron monitor. You'll need to set:
- Name: Something descriptive (e.g., "Nightly DB Backup" or "Email Queue Processor")
- Expected interval: How often the job should check in (e.g., every 24 hours for a nightly job)
- Grace period: Extra time before alerting (e.g., 10 minutes for a job that usually takes 5 minutes)
The service gives you a unique URL to ping.
Step 2: Add the Ping to Your Cron Job
Add a curl call to the end of your cron job. The exact approach depends on your setup:
Simple shell script:
#!/bin/bash
set -e
# Your actual job
/usr/local/bin/backup-database.sh
# Only runs if the above succeeded (set -e exits on error)
curl -fsS --retry 3 https://hb.example.com/abc123
Python script:
import requests

def main():
    # Your actual job logic
    process_email_queue()
    # Signal success
    requests.get("https://hb.example.com/abc123", timeout=10)

if __name__ == "__main__":
    main()
Inline crontab:
# Ping only on success (&&)
0 2 * * * /usr/local/bin/backup.sh && curl -fsS --retry 3 https://hb.example.com/abc123
# Ping with status (send exit code)
0 2 * * * /usr/local/bin/backup.sh; curl -fsS "https://hb.example.com/abc123/$?"
Step 3: Configure Alerts
Set up your alert channels based on severity:
- Critical jobs (backups, payments): SMS or phone call alerts so you wake up at 3 AM
- Important jobs (data sync, email queue): Slack or email alerts during business hours
- Low-priority jobs (cleanup, reports): Email digest or Slack notification
Step 4: Verify It Works
Don't just set it and forget it. Test the monitoring by intentionally breaking the job:
- Comment out the curl/ping line and wait for the expected interval to pass
- Verify you receive the alert through your configured channels
- Restore the curl line and confirm the monitor shows as healthy again
Cron Job Monitoring Tools Compared
Several tools offer heartbeat/cron monitoring. Here's how they compare.
| Tool | Free Tier | Paid From | Focus |
|---|---|---|---|
| Healthchecks.io | 20 checks | $20/mo | Cron monitoring only. Open source, self-hostable. |
| Cronitor | 5 monitors | $12/mo | Cron + uptime + telemetry. CLI tool for crontab integration. |
| Better Stack | 10 heartbeats | $24/mo | Uptime + heartbeats + incident management. Full platform. |
| Dead Man's Snitch | 1 snitch | $5/mo | Cron monitoring only. Simple and minimal. |
| UptimeRobot | Limited | $7/mo | Primarily uptime monitoring. Heartbeats on paid plans. |
| Uptime Kuma | Free (self-hosted) | $0 | Open source. Supports push monitors (heartbeat equivalent). |
Which Tool Should You Choose?
- If you only need cron monitoring: Healthchecks.io is purpose-built for this. The free tier covers most small teams, and you can self-host the open source version if you prefer.
- If you need cron + uptime monitoring: Cronitor or Better Stack bundle both. Cronitor is more affordable; Better Stack adds incident management.
- If you want full control: Uptime Kuma's push monitors work as heartbeat checks and the whole thing is self-hosted and free.
- If you want the simplest possible setup: Dead Man's Snitch does one thing and does it well.
Pair Heartbeat Monitoring With Uptime Monitoring
Heartbeat monitoring covers your background jobs, but you still need traditional uptime monitoring for your website and API. The best setup uses both types of monitoring together. Notifier already handles uptime monitoring for your web-facing services (with a free tier of 10 monitors, status pages, and SMS/phone alerts), and heartbeat monitoring for cron jobs is coming soon. Once available, you'll be able to monitor both your website and your scheduled tasks from a single dashboard.
Cron Job Monitoring Best Practices
Always Ping After Success, Not Before
Place the heartbeat ping at the end of your job, gated behind a success check. If you ping at the start, you'll get a "healthy" signal even when the job crashes halfway through.
# Wrong: pings even if the job fails
curl https://hb.example.com/abc123 && /usr/local/bin/backup.sh
# Right: only pings on success
/usr/local/bin/backup.sh && curl -fsS --retry 3 https://hb.example.com/abc123
Set Grace Periods Carefully
If your backup usually takes 10 minutes but sometimes takes 30 during heavy load, set the grace period to at least 35 minutes. Too tight and you get false alarms. Too loose and you miss real failures. Start generous and tighten over time as you learn the job's typical duration.
Prevent Overlapping Runs
Use a lock file or flock to prevent two instances of the same job from running simultaneously:
# Use flock to prevent overlapping runs
* * * * * /usr/bin/flock -n /tmp/email-queue.lock /usr/local/bin/process-email-queue.sh && curl -fsS https://hb.example.com/abc123
Don't Let Monitoring Break Your Job
The curl ping should never prevent your job from completing. Use flags like -fsS (-f treats HTTP errors as failures, -s silences progress output, -S still prints an error if curl itself fails) and --retry 3 to handle transient network issues. If the monitoring service is down, your job should still run.
# Good: the job's success still gates the ping, but a failed ping never breaks the job
/usr/local/bin/backup.sh && curl -fsS --retry 3 --max-time 10 https://hb.example.com/abc123 || true
Use UTC for All Cron Schedules
Avoid DST-related failures entirely by running your cron daemon in UTC and scheduling all jobs in UTC. Add CRON_TZ=UTC to the top of your crontab or set the system timezone to UTC on servers that only run background jobs.
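As a sketch, the top of the crontab might look like this (CRON_TZ is supported by cronie and some other cron implementations; check yours before relying on it):

```
# Interpret every schedule below in UTC regardless of the system timezone
CRON_TZ=UTC

# 02:30 UTC every night, immune to DST transitions
30 2 * * * /usr/local/bin/backup.sh && curl -fsS https://hb.example.com/abc123
```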
Monitor the Full Pipeline, Not Just the Trigger
For multi-step jobs, don't just ping after the first step. If your job extracts data, transforms it, and loads it into a database (ETL), place the heartbeat ping after the final load step. Better yet, validate the output before pinging:
#!/bin/bash
set -e

# Extract
python extract.py --output /tmp/data.csv

# Transform
python transform.py --input /tmp/data.csv --output /tmp/clean.csv

# Load
python load.py --input /tmp/clean.csv --db production

# Validate (check row count is reasonable)
ROW_COUNT=$(python count_rows.py --db production --table imports)
if [ "$ROW_COUNT" -gt 100 ]; then
    curl -fsS --retry 3 https://hb.example.com/abc123
else
    echo "WARNING: Only $ROW_COUNT rows loaded, expected 100+. Not signaling success."
    exit 1
fi
Frequently Asked Questions
Can I use regular uptime monitoring for cron jobs?
Not directly. Uptime monitoring checks if a URL responds, but cron jobs don't have URLs. You could create a health check endpoint that reports the last successful run time of each job, then monitor that endpoint with a tool like Notifier. But purpose-built heartbeat monitoring is simpler and more reliable for this use case. Notifier is adding heartbeat monitoring soon, which will let you track both your website uptime and cron job health from one place.
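If you do go the health-endpoint route, the core of it is just a freshness check on each job's last-success timestamp. A minimal sketch, assuming the job writes a Unix timestamp to a file at the end of each successful run (the path and threshold here are illustrative):

```python
import time

def job_is_healthy(timestamp_path, max_age_seconds):
    """Return True if the job's last-success timestamp is fresh enough.

    The job itself writes time.time() to timestamp_path after each
    successful run; a health endpoint exposes this check per job.
    """
    try:
        with open(timestamp_path) as f:
            last_success = float(f.read().strip())
    except (OSError, ValueError):
        return False  # never ran, or the timestamp file is unreadable
    return (time.time() - last_success) <= max_age_seconds
```

An uptime monitor pointed at that endpoint then alerts when any job goes stale, which approximates heartbeat monitoring without a dedicated tool.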
What if my cron job runs on a server without internet access?
If your server can't reach external URLs, you have a few options. You can self-host Healthchecks.io or Uptime Kuma on your internal network. You can route the heartbeat through an internal proxy. Or you can have the job write a timestamp file and monitor that file's age from a server that does have internet access.
How is this different from log monitoring?
Log monitoring catches errors that are logged. Heartbeat monitoring catches jobs that never run at all. If your cron daemon fails to start the job, there's nothing to log. If the server reboots and cron doesn't restart, there's no error message anywhere. Heartbeat monitoring covers these blind spots because the absence of a signal is the signal.
How many heartbeat monitors do I need?
One per critical cron job. Most applications have 3 to 10 cron jobs worth monitoring. Start with the ones in the "High Priority" table above and expand from there. Don't monitor every single scheduled task; focus on the ones where a failure has real business impact.
What about Kubernetes CronJobs?
Kubernetes CronJobs have the same silent failure problem. The scheduler can miss executions under cluster pressure, and pods can be evicted mid-job. Add a heartbeat ping to your container's entrypoint script, or use a sidecar pattern that pings the monitoring service after the main container exits successfully. Tools like Cronitor and Healthchecks.io both support Kubernetes-native integrations.
Should I combine cron monitoring with uptime monitoring?
Yes. They solve different problems. Uptime monitoring catches server and website failures. Heartbeat monitoring catches background job failures. Together, they give you complete coverage of your application. Notifier already offers uptime monitoring and will soon support heartbeat monitoring as well, so you can manage everything from a single dashboard.