On-Call Culture and IT Burnout
We lost another senior engineer last month. The exit interview cited work-life balance concerns, but when I dug deeper, the real issue was our on-call rotation. This engineer had been paged sixteen times in the past month, including three incidents that required working through the night.
This is the third person we’ve lost to on-call burnout in eighteen months. The problem isn’t just turnover. It’s that we’re burning out our best people, the ones who actually know how to fix complex production issues. We had to change something.
How We Got Here
Our on-call model evolved organically over years. As we added services, we added people to rotation. As we grew more complex, incidents became more frequent. Nobody designed this system intentionally. It just accumulated into an unsustainable burden.
We had engineers on-call 24/7 on a weekly rotation. Every week, someone was primary on-call for all infrastructure and applications. A second person served as backup if the primary couldn’t respond, and that person carried the same 24/7 responsibility.
The alert frequency was crushing. Monitoring systems generated hundreds of alerts daily. Many were noise, but on-call engineers had to investigate each one to determine severity. Even if ninety percent were false alarms, responding to alerts constantly destroyed any semblance of normal life.
Incidents didn’t respect business hours. Production issues happened at 2am, on weekends, during holidays. Whoever was on-call dealt with it regardless of personal circumstances. We paid an on-call stipend, but no amount of money compensates for chronic sleep deprivation and disrupted personal life.
The Impact on People
The turnover was the visible symptom. The hidden cost was degraded performance from people who stayed.
Engineers coming off on-call weeks were exhausted. They’d been interrupted throughout the week, lost sleep handling overnight incidents, and spent mental energy in constant alert mode. They needed recovery time but had to immediately return to project work.
Quality of work suffered. Tired engineers make mistakes. Code reviews were less thorough. Design decisions were rushed. Technical debt accumulated because people were too burned out to do things properly.
Team morale deteriorated. People dreaded their on-call weeks. They’d plan weeks in advance to keep family commitments clear of their on-call rotation. Some stopped taking real vacations because they knew they’d be paged.
We were asking people to maintain constant availability without any real boundary between work and personal life. That’s not sustainable, and we were paying the price in turnover and reduced effectiveness.
What Other Companies Do
I talked to IT leaders at a dozen other companies about their on-call practices. The approaches varied, but successful teams had some common patterns.
They limited alert volume aggressively. Only incidents requiring immediate human intervention triggered pages. Everything else went into monitoring dashboards for business-hours investigation. They’d rather miss some edge-case issues than page engineers for things that could wait.
They had follow-the-sun coverage where possible. Teams distributed across time zones meant that “on-call” happened during someone’s normal working hours. This eliminated overnight pages for most engineers.
They built better automation. Common incident types had automated remediation. Engineers still got alerted, but the system often fixed itself before anyone needed to take action. Automation reduced toil and let humans focus on genuinely complex problems.
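None of those teams showed me their actual tooling, but the pattern is simple enough to sketch. The Python below is a hypothetical remediation hook, not anyone’s real implementation: the alert names, restart commands, and the page_oncall stub are placeholders. The idea is just that a known fix runs first, and a human gets paged only if it doesn’t stick.

```python
# Hypothetical auto-remediation hook: try a known fix before paging anyone.
# Alert names, restart commands, and page_oncall are illustrative placeholders.
import logging
import subprocess
import time

log = logging.getLogger("remediation")

KNOWN_FIXES = {
    # alert name -> command that usually resolves it
    "web-worker-oom": ["systemctl", "restart", "web-worker"],
    "cache-node-stale": ["systemctl", "restart", "cache-node"],
}

def page_oncall(alert_name: str, reason: str) -> None:
    # Stand-in for a real paging integration (PagerDuty, Opsgenie, etc.)
    log.warning("Paging on-call for %s: %s", alert_name, reason)

def handle_alert(alert_name: str, healthy) -> None:
    """Run the known fix for an alert; page a human only if it doesn't work."""
    fix = KNOWN_FIXES.get(alert_name)
    if fix is None:
        page_oncall(alert_name, reason="no automated fix available")
        return

    log.info("Attempting automated fix for %s: %s", alert_name, " ".join(fix))
    subprocess.run(fix, check=False)
    time.sleep(30)  # give the service a moment to recover

    if healthy():
        log.info("%s recovered without human intervention", alert_name)
    else:
        page_oncall(alert_name, reason="automated fix did not resolve the issue")
```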
Finally, they protected engineers’ time with an explicit policy that on-call time counted against project commitments. If you were on-call, you weren’t expected to deliver project work at the same velocity. Recovery time was built into the schedule.
Our Changes
We made several structural changes to on-call practices based on what we learned.
First, we dramatically reduced alert volume. We reviewed every alert rule and asked whether it required an immediate human response. Roughly half of our alerts were reclassified as monitoring-only. We still track them, but they don’t page anyone.
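To make that review concrete: conceptually, every alert rule now carries an explicit routing decision, something like the sketch below. The rule names, rationales, and the send_page/record_on_dashboard stubs are made-up examples, not our actual monitoring configuration.

```python
# Illustrative alert routing: only alerts needing an immediate human response page.
# Rule names, rationales, and the two stub functions are hypothetical examples.
from dataclasses import dataclass
from enum import Enum

class Route(Enum):
    PAGE = "page"            # wakes someone up
    DASHBOARD = "dashboard"  # reviewed during business hours, pages nobody

@dataclass
class AlertRule:
    name: str
    route: Route
    rationale: str

ALERT_RULES = [
    AlertRule("checkout-error-rate-high", Route.PAGE,
              "customers are actively impacted"),
    AlertRule("disk-usage-80-percent", Route.DASHBOARD,
              "days of headroom; business-hours cleanup is fine"),
    AlertRule("nightly-batch-job-slow", Route.DASHBOARD,
              "no user impact; review next morning"),
]

def send_page(rule: AlertRule) -> None:
    print(f"PAGE: {rule.name} ({rule.rationale})")

def record_on_dashboard(rule: AlertRule) -> None:
    print(f"DASHBOARD: {rule.name} ({rule.rationale})")

def dispatch(rule: AlertRule) -> None:
    if rule.route is Route.PAGE:
        send_page(rule)
    else:
        record_on_dashboard(rule)

for rule in ALERT_RULES:
    dispatch(rule)
```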
Second, we split on-call by service domain. Instead of one person covering everything, we have specialists covering their areas. Database issues go to database engineers. Application issues go to application teams. This reduced breadth of knowledge required and ensured the right expertise handled each incident.
Third, we implemented proper escalation. Primary on-call handles the incident during working hours and minor off-hours issues. Major incidents automatically escalate to a broader team. Nobody is expected to handle a critical production outage alone at 3am.
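Taken together, the domain split and the escalation policy boil down to routing rules roughly like this sketch. The domains, team names, and severity labels are placeholders rather than our real rotation; the point is that an incident reaches a specialist first and fans out to a broader group only when severity demands it.

```python
# Sketch of domain-based routing with severity escalation.
# Domains, team names, and severity labels are placeholders, not our rotation.
DOMAIN_ONCALL = {
    "database": "dba-oncall",
    "application": "app-team-oncall",
    "network": "netops-oncall",
}

MAJOR_INCIDENT_GROUP = "incident-response-bridge"

def route_incident(domain: str, severity: str) -> list:
    """Return who gets notified for an incident."""
    primary = DOMAIN_ONCALL.get(domain, "platform-oncall")  # fallback owner
    if severity == "critical":
        # Major incidents fan out to a broader group; nobody handles a
        # production outage alone at 3am.
        return [primary, MAJOR_INCIDENT_GROUP]
    return [primary]

# A critical database incident notifies the DBA on-call plus the response bridge;
# a minor application issue stays with the application team's on-call.
assert route_incident("database", "critical") == ["dba-oncall", "incident-response-bridge"]
assert route_incident("application", "minor") == ["app-team-oncall"]
```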
We also established explicit boundaries. Engineers take on-call during business hours only unless they explicitly opt into overnight coverage for additional compensation. We hired a third-party service to provide basic overnight support and escalate to our team only when genuinely necessary.
Finally, we changed project planning. On-call weeks count as fifty percent project availability. If you’re on-call, you’re expected to deliver less project work. This gives people permission to focus on operational support without feeling behind on projects.
The Pushback We Got
Finance questioned the cost of third-party overnight support. I showed them the cost of replacing senior engineers who quit. The overnight support was far cheaper.
Some senior engineers initially resisted the changes. They’d survived the old system and saw it as a rite of passage. I reminded them that just because we did something historically doesn’t mean it was good or should continue.
Product teams worried that reduced on-call availability would slow incident response. We showed them that actual resolution times improved because the right specialists handled issues instead of whoever happened to be on-call.
Results Six Months Later
We haven’t lost anyone to burnout since implementing these changes six months ago. Morale has improved noticeably. People take actual vacations now without fear of being paged throughout.
Incident resolution times are actually faster despite more limited on-call coverage. Having specialists handle their own domains means faster diagnosis and resolution. The overnight support service handles routine issues well and escalates when genuinely needed.
Engineer productivity has improved. People coming off on-call rotation aren’t exhausted. They’re delivering better quality work because they’re not chronically sleep-deprived.
The changes cost us money in third-party support and slightly reduced project velocity due to on-call time protection. That cost is completely justified by improved retention and effectiveness.
What I’d Tell Other IT Leaders
If you’re burning people out with on-call, fix it before they quit. The best engineers have options. They won’t tolerate unsustainable work conditions indefinitely.
Reduce alert noise aggressively. Most alerts don’t require immediate response. Page people only for genuine emergencies.
Split on-call responsibility by domain. Don’t expect one person to cover your entire technology stack.
Set boundaries around on-call expectations. Constant availability isn’t sustainable. People need personal lives.
Protect on-call time in project planning. You can’t expect full project velocity from someone also handling operational support.
The changes require investment and a cultural shift. They’re worth it. Your team’s wellbeing and effectiveness depend on sustainable operational practices. That’s not optional; it’s foundational.