
Leading Cloud Infrastructure: Lessons After Four Years

A retrospective on four years of leading cloud infrastructure teams, from technical decisions to organizational dynamics

Four years ago, I joined a major entertainment company to help build and lead its cloud infrastructure practice. I came in as a senior engineer. Within a year, I was leading a team. Within two, I was responsible for the cloud platform that supported multiple lines of business. Now, four years in, I want to capture what I have learned, because the lessons that mattered most were rarely technical.

That is the first surprise. I expected the hard problems to be technical: Kubernetes at scale, multi-region architectures, CI/CD pipelines, infrastructure as code. Those problems were real and demanding, but they were solvable with engineering discipline. The harder problems were organizational, cultural, and deeply human.

Lesson 1: Trust Is Built in Production Incidents

The fastest way to earn trust from skeptical stakeholders is to perform well during an outage. Not before. Not during planning sessions or architecture reviews. During the actual incident, when the pressure is real and the business is losing money.

I learned this early. Six months into my role, we had a major service disruption that affected customer-facing commerce systems. The war room was full of directors and VPs who barely knew my name. Over the next three hours, I led the incident response: methodical triage, clear communication, no guessing, no finger-pointing. We found the root cause (a misconfigured auto-scaling policy that drained connections faster than the database could handle), applied the fix, and restored service.

The next morning, I had credibility I could not have earned in six months of presentations. People remember how you behave when things break, not what you say when things are fine.

After that, I made incident response a team competency, not an individual skill. We ran regular game days (controlled failure injection), maintained runbooks for known failure modes, and practiced blameless postmortems religiously. The goal was not to prevent incidents; it was to make the team excellent at handling them.

Lesson 2: Say No More Than You Say Yes

As the cloud infrastructure team, we were the enabler for every other engineering team. Everyone wanted something from us: a new Kubernetes cluster, a database migration, a CI/CD pipeline, a VPN connection, a cost optimization review. The demand was always three times our capacity.

The instinct, especially early in a leadership role, is to say yes to everything. You want to be helpful. You want to be seen as a partner. You want to demonstrate the value of your team. But saying yes to everything is a guarantee of mediocre results across the board.

I learned to evaluate requests against three criteria: business impact, technical risk, and team capacity. If a request scored high on the first two and we had capacity, yes. If it was low impact or we were already stretched, no, with a clear explanation and an alternative timeline. The discipline of saying no, clearly and respectfully, was one of the hardest things I developed as a leader.
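The triage logic above can be sketched as a small decision function. This is a hedged illustration of the three criteria, not the actual process we used; the thresholds, field names, and the interpretation of "high risk" as a reason to defer are all assumptions.

```python
# Sketch of the three-criteria triage: business impact, technical risk,
# team capacity. Thresholds and names are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Request:
    name: str
    business_impact: int  # 1 (low) .. 5 (high)
    technical_risk: int   # 1 (low) .. 5 (high)

def triage(req: Request, capacity_available: bool) -> str:
    """Mirror the prose: high impact, manageable risk, and available
    capacity -> yes; low impact -> no; otherwise defer with a timeline."""
    if req.business_impact >= 4 and req.technical_risk <= 3 and capacity_available:
        return "yes"
    if req.business_impact <= 2:
        return "no"    # decline clearly, with an explanation
    return "defer"     # worth doing, but offer an alternative timeline

print(triage(Request("DB migration", business_impact=5, technical_risk=2), True))
```

The point is not the scoring itself but that the criteria are written down, so a "no" comes with a reason rather than a vibe.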

The counterintuitive result: teams respected us more when we said no. A team that says yes to everything and delivers late is less trustworthy than a team that says "we can do this in Q3" and actually delivers in Q3.

Lesson 3: Documentation Is a Leadership Responsibility

Engineers do not naturally write documentation. Left to their own devices, even excellent engineers will build remarkable systems and leave zero record of how they work. This is not a character flaw; it is a rational response to incentives. Writing docs takes time, and the reward for writing good docs is invisible (things work) while the penalty for not writing docs is distant (the engineer who built it leaves, and knowledge walks out the door).

As a leader, I made documentation a team norm by doing three things:

First, I wrote docs myself. Leaders who delegate documentation but never write it signal that docs are beneath them. I wrote architecture decision records, runbooks, and onboarding guides. When the team saw their lead writing docs, the cultural resistance dropped.

Second, I made docs part of the definition of done. A feature or infrastructure change was not complete until the documentation was updated. Pull requests without doc updates were not approved.
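The "no doc update, no approval" rule can be enforced mechanically in CI rather than by reviewer vigilance. A minimal sketch, assuming docs live under a `docs/` directory and infrastructure code is identified by file extension (both assumptions about layout, not a description of our actual setup):

```python
# Hedged sketch: fail a pull request that touches infrastructure code
# without touching any documentation. The docs/ location and the set of
# code extensions are assumptions for illustration.
def docs_updated(changed_files: list[str]) -> bool:
    """True if the change set includes a doc update (or touches no code)."""
    touches_code = any(f.endswith((".tf", ".yaml", ".py")) for f in changed_files)
    touches_docs = any(f.startswith("docs/") or f.endswith(".md") for f in changed_files)
    return touches_docs or not touches_code

print(docs_updated(["modules/vpc/main.tf"]))                 # code, no docs
print(docs_updated(["modules/vpc/main.tf", "docs/vpc.md"]))  # code + docs
```

In CI, the changed-file list would come from something like `git diff --name-only` against the target branch, with the check failing the build when the function returns False.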

Third, I invested in making docs discoverable. We consolidated everything into a single Confluence space with a clear taxonomy. The best documentation in the world is useless if nobody can find it.

Lesson 4: The Cloud Cost Conversation Never Ends

Cloud cost management is not a project with a start and end date. It is a permanent discipline. We learned this the hard way when our AWS bill grew 40% in a single quarter because three teams launched services without right-sizing their compute resources.

I established a monthly cost review cadence where each team's cloud spend was visible and discussed. Not to shame anyone, but to create awareness. Many engineers had never seen an AWS bill and had no intuition for how their architectural choices translated to dollars. A developer who launches ten m5.4xlarge instances for a staging environment is not being wasteful on purpose; they just never had a reason to think about it.

The most effective cost optimization lever was not technical. It was organizational: giving engineering teams visibility into their own costs and making cost efficiency a factor in architecture reviews. Tagging resources by team and project, generating weekly cost reports, and including estimated cost in design documents made spending a first-class consideration rather than an afterthought.
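The tag-and-report loop is simple enough to sketch. In practice the data would come from the AWS Cost and Usage Report or the Cost Explorer API grouped by tag; here the records are hard-coded (with invented teams and amounts) to keep the example self-contained:

```python
# Hedged sketch: roll tagged spend up into a per-team report, as
# described above. Records and amounts are invented for illustration.
from collections import defaultdict

records = [
    {"team": "commerce", "project": "checkout", "usd": 1840.50},
    {"team": "commerce", "project": "search",   "usd": 920.00},
    {"team": "platform", "project": "ci",       "usd": 410.25},
]

def weekly_report(records: list[dict]) -> dict[str, float]:
    """Sum spend per team tag so each team sees its own bill."""
    totals: dict[str, float] = defaultdict(float)
    for r in records:
        totals[r["team"]] += r["usd"]
    return dict(totals)

for team, usd in sorted(weekly_report(records).items()):
    print(f"{team}: ${usd:,.2f}")
```

The mechanism only works if tagging is enforced at provisioning time; untagged resources end up in an "unattributed" bucket that nobody feels responsible for.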

Lesson 5: Hire for Curiosity, Train for Skills

The best infrastructure engineers I hired during this period were not the ones with the most impressive resumes. They were the ones who were relentlessly curious. The engineer who had never used Kubernetes but spent her weekends building clusters on Raspberry Pis outperformed the engineer with three years of Kubernetes experience who had stopped learning.

Cloud infrastructure changes too fast for static expertise to stay valuable. The skills that mattered three years ago (CloudFormation, ECS, Jenkins) are being superseded by new tools (Terraform, Kubernetes, GitOps). An engineer who is curious and adaptable will learn whatever comes next. An engineer who is skilled but static will become obsolete.

In interviews, I stopped asking "tell me about your experience with X" and started asking "tell me about something you learned recently that changed how you think about infrastructure." The answers were far more revealing.

Lesson 6: Your Platform Is a Product

The internal platform your infrastructure team builds is a product, and the engineering teams who use it are your customers. This reframing changed how we operated.

We started treating our platform with product discipline: gathering user feedback, prioritizing a backlog, measuring adoption, and iterating based on usage data. When we launched a new self-service tool for provisioning Kubernetes namespaces, we tracked how many teams used it, how long the process took, and where people got stuck. We iterated on the UX of our Terraform modules the same way a product team iterates on a web application.
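Finding "where people got stuck" amounts to a funnel analysis over the provisioning flow. A minimal sketch, with step names and counts invented for illustration:

```python
# Hedged sketch: step-to-step conversion through a self-service
# provisioning flow. Steps and counts are hypothetical examples.
steps = ["requested", "approved", "provisioned", "first_deploy"]
counts = {"requested": 40, "approved": 35, "provisioned": 33, "first_deploy": 21}

# The step with the lowest conversion rate is where users get stuck.
for prev, cur in zip(steps, steps[1:]):
    rate = counts[cur] / counts[prev]
    print(f"{prev} -> {cur}: {rate:.0%}")
```

A sharp drop at one step (here, between provisioning and first deploy) tells you where to spend the next iteration, the same way a product team reads a signup funnel.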

This mindset shift had a practical consequence: we stopped building features that nobody asked for. Infrastructure teams have a tendency to over-engineer, building for theoretical future requirements rather than actual current needs. Treating the platform as a product forced us to validate demand before investing engineering time.

Lesson 7: Relationships Scale Better Than Technology

The most impactful thing I did in four years was not a technical achievement. It was building relationships with engineering leaders across the organization. Regular one-on-ones with the leads of the teams we supported, coffee chats with directors in adjacent organizations, and hallway conversations with engineers who had ideas or frustrations.

These relationships created a feedback loop that no monitoring system could replicate. I learned about problems before they became incidents, understood priorities before they became urgent requests, and built the political capital necessary to push for infrastructure investments that had long-term payoffs but short-term costs.

Technology enables scale. Relationships enable trust. And at the enterprise level, trust is the currency that determines whether your team is seen as a strategic partner or a cost center.

Looking Forward

The cloud landscape has changed dramatically since I started: Kubernetes went from experimental to standard, serverless became viable for production workloads, and multi-cloud went from a buzzword to a real strategy.

But the leadership lessons are durable. Trust is earned in crises. Saying no is a skill. Documentation is a leadership act. Costs never manage themselves. Curiosity outperforms experience. Platforms are products. Relationships matter more than technology.

The best infrastructure leaders understand that infrastructure is ultimately about people, not machines.
