
Running a Linux NOC: Chasing 99.9% Uptime

What it actually takes to keep Linux infrastructure running at 99.9% uptime, from the perspective of someone who lives in the NOC

I have been working in a Network Operations Center for a while now, and I want to write about what that actually looks like day to day. Not the sanitized version you read in vendor whitepapers, but the real thing: the 3 AM phone calls, the monitoring dashboards, the slow creep of entropy that tries to take your systems down one misconfiguration at a time.

Our team manages a fleet of Linux servers. We are responsible for keeping them running, and our target is 99.9% uptime. That number sounds impressive until you do the math: 99.9% uptime means you are allowed 8 hours and 45 minutes of downtime per year. That is roughly 43 minutes per month. One bad deployment, one overlooked disk filling up, one network switch deciding to reboot itself, and you have burned through your error budget for the month.
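The arithmetic is simple enough to sketch in a few lines, and worth internalizing:

```python
# Error budget for an availability target, assuming a 365-day year.

def error_budget_minutes(availability: float, days: int = 365) -> float:
    """Allowed downtime in minutes over the period."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - availability)

yearly = error_budget_minutes(0.999)   # ~525.6 min, i.e. about 8h 46m
monthly = yearly / 12                  # ~43.8 min per month
print(f"yearly:  {yearly:.1f} min ({yearly / 60:.2f} h)")
print(f"monthly: {monthly:.1f} min")
```

Run the same function with 0.9999 and the monthly budget drops to about four minutes, which is why "one more nine" is never a casual promise.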

The Monitoring Stack

Everything starts with monitoring. If you cannot see what is happening on your systems, you cannot keep them running. Our monitoring setup is built around Nagios, and while Nagios is not glamorous, it works.
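For flavor, a single Nagios service check looks something like this. The host name and the NRPE command are placeholders, not our actual configuration:

```
define service {
    use                   generic-service
    host_name             web01
    service_description   Disk Usage /
    check_command         check_nrpe!check_disk
    check_interval        5
    retry_interval        1
    max_check_attempts    3
    notification_options  w,c,r
}
```

Multiply that stanza by every host and every check, and you have a NOC's nervous system: plain text, version-controlled, and boring in the best possible way.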

We monitor the obvious things: CPU, memory, disk, network. But the obvious things are rarely what causes outages. The real problems are sneakier.

  • A log file growing because someone set the wrong log level in production
  • A file descriptor leak that slowly exhausts the system over days
  • Clock drift on a host whose NTP sync silently broke, causing authentication failures because tokens expire "too early"
  • A DNS resolver that works fine until a specific upstream server rotates out of the pool

We have learned to monitor the things that precede failures, not just the failures themselves. Disk trending is a good example. I do not want to know when a disk is at 90%. I want to know when a disk that was at 40% last week is at 60% this week, because at that rate it will blow past 90% within two weeks and I need to understand why.
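That kind of projection is just a linear extrapolation. A sketch of the math (two samples is crude; real trending tools fit over many):

```python
# Linear projection of disk usage: given two weekly samples, estimate
# when usage crosses an alert threshold.

def weeks_until(threshold: float, last_week: float, this_week: float) -> float:
    """Weeks from now until usage reaches threshold, assuming linear growth."""
    growth_per_week = this_week - last_week
    if growth_per_week <= 0:
        return float("inf")  # flat or shrinking: no projected crossing
    return (threshold - this_week) / growth_per_week

# The 40% -> 60% disk from the text crosses 90% in another week and a half:
print(weeks_until(90, 40, 60))  # 1.5
```

The point of the projection is not precision. It is that "1.5 weeks" turns a warning into a work item with a deadline.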

We use Cacti for trending and capacity planning. The graphs are not pretty, but they tell you stories if you learn to read them. A gradual upward slope in memory usage across a fleet of servers usually means a leak somewhere. A sudden step change in network traffic means someone deployed something without telling ops. Again.

The Runbook

Every service we manage has a runbook. This is non-negotiable. When someone gets paged at 3 AM, they should not have to think creatively. They should open the runbook, follow the steps, and resolve the issue.

A good runbook entry has:

  • Symptoms: What the alert looks like, what metrics are affected
  • Diagnosis steps: Commands to run to confirm the issue
  • Resolution steps: Exact commands to fix it, with copy-paste accuracy
  • Escalation path: Who to call if the resolution steps do not work
  • Post-incident: What to check after the fix to confirm stability
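To make that concrete, a skeleton entry might look like the following. Every service name, path, and command here is a placeholder, not lifted from a real runbook:

```
SERVICE: app-frontend / "HTTP 5xx rate high" alert

SYMPTOMS
  - Nagios CRITICAL on http_5xx_rate; user reports of error pages

DIAGNOSIS
  1. tail -n 100 /var/log/app/error.log
  2. df -h        # rule out a full disk first
  3. ss -s        # check for socket exhaustion

RESOLUTION
  1. service app-frontend restart
  2. Confirm: curl -fsS http://localhost:8080/healthz

ESCALATION
  - No recovery in 15 minutes: page the application on-call

POST-INCIDENT
  - Watch the 5xx rate for 30 minutes; file a ticket for root cause
```

Notice that the diagnosis steps are ordered by likelihood and cheapness. At 3 AM you want the most probable cause checked first with the least typing.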

We write runbooks after every incident. Not during, after. During the incident, you fix the problem. After, you document what happened, how you fixed it, and how to fix it faster next time. This documentation discipline is probably the single most important practice we follow.

The 3 AM Rotation

On-call rotation is a reality of NOC work. We rotate weekly, and when you are on call, your phone is your lifeline. The pager goes off, you respond. There is no "I will check it in the morning."

The hardest part is not the waking up. It is the context switching. You go from deep sleep to "the payment processing server is throwing disk I/O errors" in thirty seconds, and you need to make good decisions immediately. Sleep-deprived decisions at 3 AM with a production system down are not the same quality as decisions at 2 PM with a cup of coffee. This is why runbooks matter so much.

We have gotten better at reducing the noise over time. Early on, we alerted on everything. Every CPU spike, every brief network blip, every process restart. The result was alert fatigue. When everything is critical, nothing is critical. Your pager goes off so often that you start ignoring it, and then you miss the one alert that actually matters.

Now we have three tiers:

  • Page-worthy: The service is down or degraded for users
  • Warning: Something is trending in a bad direction, needs attention during business hours
  • Informational: Logged for analysis, no notification
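The tiering logic itself is trivial; the hard part is deciding which bucket each alert belongs in. A sketch of the routing, with illustrative field names (no particular tool's schema):

```python
# Three-tier alert routing as described above. The Alert fields are
# illustrative stand-ins for whatever your monitoring system exposes.
from dataclasses import dataclass

@dataclass
class Alert:
    service_down: bool  # users impacted right now
    bad_trend: bool     # heading the wrong way, not yet impacting users

def classify(alert: Alert) -> str:
    if alert.service_down:
        return "page"           # wake someone up now
    if alert.bad_trend:
        return "warning"        # ticket for business hours
    return "informational"      # log it, notify nobody

print(classify(Alert(service_down=False, bad_trend=True)))  # warning
```

The monthly review described below is effectively auditing whether reality agrees with this function: every page that was not actionable is a misclassified alert.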

Getting this classification right is an ongoing process. We review every page monthly and ask: was this page actionable? Did it require immediate human intervention? If the answer to either question is no, we reclassify it.

The Failures That Taught Me

I have seen some failures that I will never forget.

There was the time a junior admin ran a recursive permission change from the wrong directory. One command, chmod -R 755 / instead of chmod -R 755 /opt/app. It took us six hours to rebuild the system from backups because half the system binaries had their setuid bits stripped.

There was the cascading failure when a load balancer health check started failing on one backend, which shifted all traffic to the remaining backends, which overloaded them, which caused their health checks to fail, which meant the load balancer had no healthy backends, which meant total service outage. The fix was a one-line configuration change. The cleanup took two days.
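One defense against this failure mode, similar in spirit to the "panic threshold" modern load balancers like Envoy ship with, is to refuse to shrink the pool below a floor even when health checks say otherwise. A toy sketch of the idea (nothing here reflects our actual load balancer configuration):

```python
# Keep a minimum fraction of backends in rotation even when their health
# checks fail, so the survivors are never asked to absorb the whole load.

def backends_in_rotation(backends: dict[str, bool], min_fraction: float = 0.5) -> list[str]:
    """backends maps name -> whether its health check is passing."""
    healthy = [name for name, ok in backends.items() if ok]
    floor = max(1, int(len(backends) * min_fraction))
    if len(healthy) >= floor:
        return healthy
    # Fewer healthy backends than the floor: fail open and keep everyone
    # in rotation, on the theory that a slow backend beats no backend.
    return list(backends)

pool = {"web1": False, "web2": False, "web3": False, "web4": True}
print(backends_in_rotation(pool))  # all four stay in rotation
```

With only web4 healthy, routing everything to it would have reproduced our cascade; failing open spreads the pain and buys time to fix the health check.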

There was the silent data corruption from a failing RAID controller that passed all its self-tests but was flipping bits on writes. We did not catch it until a backup restore failed and we realized the backups had been corrupted for three weeks.

Each of these taught me something that I could not have learned from a textbook. The permission change taught me about change management and the danger of running commands as root without double-checking. The cascading failure taught me about circuit breakers and graceful degradation. The RAID controller taught me about verifying backups, not just running them.
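On the backup lesson: verification can start as simply as recording a checksum when the backup is written and re-checking it on a schedule. A minimal sketch (the paths and the idea of a stored digest are illustrative; a failing RAID controller flipping bits is exactly what this catches):

```python
# Record a checksum at backup time, re-verify it before trusting a restore.
import hashlib
import os
import tempfile
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Stream the file through SHA-256 so large backups don't need to fit in RAM."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def verify(path: Path, recorded_digest: str) -> bool:
    """True if the file on disk still matches the digest taken at backup time."""
    return sha256_of(path) == recorded_digest

# Usage: checksum a "backup", then confirm it has not silently changed.
fd, name = tempfile.mkstemp()
os.write(fd, b"backup payload")
os.close(fd)
digest = sha256_of(Path(name))
print(verify(Path(name), digest))  # True
```

This only proves the file is the one you wrote, not that the restore procedure works; that still takes periodic restore drills. But it would have caught our corrupted backups in days instead of weeks.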

What I Have Learned About Uptime

99.9% uptime is not a technical achievement. It is an organizational one. The technology matters, but what matters more is the discipline around it.

Automated deployments matter because manual deployments introduce human error. Configuration management matters because drift is the enemy of reliability. Monitoring matters because you cannot fix what you cannot see. Documentation matters because knowledge trapped in one person's head is a single point of failure.

The best NOC teams I have seen share a few traits. They treat every incident as a learning opportunity. They automate relentlessly. They build systems that fail gracefully instead of catastrophically. They practice their recovery procedures before they need them.

I have been reading about how companies like Google approach this problem. They have a concept they call Site Reliability Engineering, where they apply software engineering principles to operations. They set error budgets, they do capacity planning with mathematical rigor, they write postmortems for every significant incident. It is operations elevated to a discipline.

We are not at that level yet, but we are getting better. Every quarter, our uptime numbers improve. Not because we bought better hardware or fancier software, but because we got better at the fundamentals: monitoring, documentation, automation, and learning from failures.

The NOC is not glamorous work. Nobody writes blog posts about the beauty of a well-configured Nagios check. But when the systems are running, when the dashboards are green, when a month passes without a single user-impacting incident, there is a quiet satisfaction in knowing that the infrastructure is solid because you made it solid.

That is what keeps me doing this work.
