Grad School, Research, and Cloud Computing
Reflecting on a year of graduate research, distributed systems learnings, and the bridge between academia and industry
The fall semester ended last week. I have been in graduate school for almost a year now, and it feels like an appropriate moment to take stock of where I am, what I have learned, and where I am heading.
Research Progress
My thesis topic has crystallized over the past few months. I am working on resource management strategies for cloud environments, specifically focusing on workload placement and migration in multi-tenant Infrastructure-as-a-Service platforms. The question, stated simply, is: how do you decide where to run workloads in a shared cloud infrastructure to maximize efficiency while maintaining performance guarantees for each tenant?
This is not a new question. The research community has been working on it for years. But the landscape keeps shifting. When I started my literature review earlier this year, most of the work focused on virtual machine placement. Now, containers are changing the conversation. The scheduling problem is similar in structure but different in dynamics: containers are lighter, faster to start, shorter-lived, and far more numerous than VMs. Algorithms designed for VM placement need to be adapted, or in some cases completely rethought, for container workloads.
I have been developing a simulation framework that lets me evaluate different placement algorithms under various workload conditions. The framework models a cluster of physical machines, a stream of incoming workload requests, and a scheduler that assigns workloads to machines based on configurable policies. I can vary the workload characteristics (CPU-intensive, memory-intensive, mixed), the cluster size, the overcommitment ratio, and the scheduling algorithm, then measure metrics like resource utilization, SLA violations, and migration overhead.
Building this framework has been its own education. I initially wrote it in Python, which was straightforward for prototyping but too slow for large-scale simulations. I rewrote the core scheduling loop in C and kept Python for the orchestration and analysis layers. The hybrid approach works well: fast where it needs to be fast, flexible where it needs to be flexible.
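To make the framework's shape concrete, here is a toy Python sketch of the kind of scheduling loop it revolves around. This is an illustration of the general idea, not my actual code: the `Machine` model, the `first_fit` policy, and the rejection metric are simplified stand-ins for the configurable policies and metrics described above.

```python
class Machine:
    """A physical machine with free CPU and memory capacity."""
    def __init__(self, cpu, mem):
        self.cpu_free = cpu
        self.mem_free = mem

def first_fit(machines, cpu, mem):
    """A trivial placement policy: return the index of the first
    machine with enough free capacity, or None if nothing fits."""
    for i, m in enumerate(machines):
        if m.cpu_free >= cpu and m.mem_free >= mem:
            return i
    return None

def simulate(machines, workloads, policy):
    """Feed a stream of (cpu, mem) workload requests to a pluggable
    placement policy and count how many requests must be rejected."""
    rejected = 0
    for cpu, mem in workloads:
        i = policy(machines, cpu, mem)
        if i is None:
            rejected += 1
        else:
            machines[i].cpu_free -= cpu
            machines[i].mem_free -= mem
    return rejected
```

Swapping `first_fit` for another function with the same signature is all it takes to compare policies, which is the whole point of making the scheduler pluggable.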
Distributed Systems Learnings
The coursework this year has deepened my understanding of distributed systems in ways I did not expect.
My distributed computing class covered the fundamental impossibility results and theoretical foundations. The FLP impossibility theorem (you cannot guarantee consensus in an asynchronous distributed system if even one process can fail) is one of those results that reshapes how you think about everything. It does not mean consensus is impossible in practice; it means you cannot guarantee it in all cases. Real systems like Paxos and Raft work around this by making timing assumptions that hold in practice, even though they are not guaranteed in theory.
The CAP theorem, which states that a distributed data store can provide at most two of three guarantees (consistency, availability, partition tolerance), came alive when I was working on my WordPress POC on AWS. Amazon S3 chose availability and partition tolerance over strong consistency. DynamoDB offers tunable consistency. Aurora provides strong consistency within a region but eventual consistency across regions. These are not abstract design choices; they are engineering decisions with real consequences for application behavior.
I also developed a much deeper appreciation for the complexity of distributed consensus and coordination. Problems that are trivial on a single machine, like maintaining a counter or acquiring a lock, become deeply challenging when spread across multiple machines connected by an unreliable network. The elegance of algorithms like Paxos and the practical engineering of systems like ZooKeeper and etcd are remarkable.
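The counter example can be made concrete even in a single process. The sketch below (purely illustrative) simulates two "nodes" doing a read-modify-write on a shared counter with no coordination: both read before either writes, and one increment is silently lost. This is exactly the lost-update anomaly that locks, compare-and-set, or consensus protocols exist to prevent.

```python
def unsafe_increment_both(counter):
    """Simulate two uncoordinated nodes each incrementing a shared
    counter. Both read the old value before either writes back,
    mimicking concurrent read-modify-write over a network."""
    read_a = counter["value"]        # node A reads 0
    read_b = counter["value"]        # node B also reads 0
    counter["value"] = read_a + 1    # node A writes 1
    counter["value"] = read_b + 1    # node B overwrites with 1, losing A's update
    return counter["value"]
```

Two increments, final value 1. On one machine a lock fixes this in a line of code; across machines, the "lock" itself becomes a distributed coordination problem.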
The Theory-Practice Bridge
One of the recurring themes in my graduate school experience has been the tension between theoretical elegance and practical utility.
In theory, we can formulate the workload placement problem as an integer linear program and solve it optimally. In practice, the problem is NP-hard, the input data is noisy and incomplete, the environment changes faster than the optimizer can run, and the solution needs to be computed in milliseconds, not hours.
This means that the most practically useful research often involves designing heuristics and approximation algorithms that are "good enough" in reasonable time, rather than optimal in unreasonable time. The art is in understanding which simplifying assumptions are safe (they hold in practice) versus which are dangerous (they silently break under real-world conditions).
My advisor has been invaluable in developing this judgment. He has seen enough research come and go to know which problems are real and which are academic artifacts. When I proposed an elegant but impractical approach to workload migration, he asked me three questions: What is the computational overhead of your algorithm relative to the cost of a bad placement? How does it behave when the input data is wrong, which it always is? Can you implement it in a real system, or only in a simulator? The elegant approach failed all three questions.
That conversation taught me more about research methodology than any textbook.
What the Industry Looks Like from Here
Sitting in a university lab, watching the cloud computing industry evolve, is a peculiar vantage point. You are close enough to understand the technical details but far enough to see patterns that practitioners, focused on their immediate problems, might miss.
Here is what I see.
The cloud market is consolidating around a few major providers. AWS is the clear leader, but Azure is growing fast, and Google Cloud Platform is making aggressive moves (especially with Kubernetes and machine learning). Smaller providers are finding it increasingly difficult to compete on breadth of services, though they can compete on specialization, pricing, or geography.
The abstraction level keeps rising. VMs gave way to containers. Containers are giving way to serverless functions. Each level of abstraction removes operational responsibility from the customer and shifts it to the provider. The logical endpoint is a world where developers write business logic and the cloud handles everything else. We are not there yet, but the direction is unmistakable.
Infrastructure as code is becoming standard practice. Tools like Terraform, Ansible, CloudFormation, and Chef have moved from "nice to have" to "how do you not use this." The ability to define, version, and reproduce infrastructure through code is now considered a baseline competency for operations teams.
Microservices architecture is gaining momentum, enabled by containers and orchestration platforms. The idea of decomposing monolithic applications into small, independently deployable services is appealing in theory but complex in practice. The distributed systems challenges (service discovery, load balancing, circuit breaking, distributed tracing) are significant. I suspect we will see many organizations attempt microservices, struggle with the operational complexity, and settle on a pragmatic middle ground.
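Circuit breaking is a good example of the patterns those challenges demand. The sketch below is a deliberately minimal toy, not a production pattern library: after a run of consecutive failures it "opens" and fails fast instead of hammering a struggling downstream service, then allows a trial call after a cooldown. Real implementations add half-open states, rolling failure windows, and metrics.

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: open after `threshold` consecutive
    failures, fail fast while open, allow a trial call after
    `reset_after` seconds. Illustrative only."""

    def __init__(self, threshold=3, reset_after=30.0):
        self.threshold = threshold
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args):
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.reset_after:
                # Open circuit: do not even attempt the downstream call.
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # cooldown elapsed; allow a trial call
        try:
            result = fn(*args)
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.time()
            raise
        self.failures = 0  # any success resets the failure count
        return result
```

The pattern trades a small amount of availability (rejecting calls that might have succeeded) for protecting the rest of the system from cascading failure, which is the kind of trade-off microservices force you to make explicitly.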
What I Have Become
When I arrived at grad school a year ago, I was an engineer who wanted to learn about cloud computing. Now I am starting to think like a researcher.
The difference is subtle but important. An engineer asks "how do I solve this problem?" A researcher asks "why does this problem exist, what is the space of possible solutions, how do we evaluate them rigorously, and what can we learn that generalizes beyond this specific instance?"
Research requires a tolerance for ambiguity that engineering does not. In engineering, you need a working solution by Friday. In research, you might spend months exploring an approach only to discover it does not work, and that negative result is itself valuable because it tells you something about the problem space.
I am also learning to write. Not code (though I write plenty of that), but prose. Research papers have a specific structure and style: clearly stating the problem, precisely describing the approach, rigorously evaluating the results, honestly discussing the limitations. Good academic writing is hard. I have drafted and discarded more pages this year than I care to count.
Looking Ahead
Next semester I will begin writing my thesis in earnest. The simulation framework is largely built. The literature review is done (or as done as it can be; new papers keep appearing). I need to design experiments, run them, analyze the results, and write them up.
I also need to start thinking about what comes after graduation. The job market for people with cloud computing expertise is robust. My research gives me a depth of understanding that pure industry experience does not, and my hands-on projects (like the AWS WordPress POC) give me practical skills that pure research does not. The combination, I hope, will be valuable.
But I am not rushing toward the finish line. Graduate school has been one of the most intellectually intense periods of my life. The reading, the thinking, the discussions, the slow process of understanding something deeply: these are experiences I want to appreciate while they are happening.
It has been a good year. I am looking forward to the next one.