My First Steps in AI Research
Documenting my journey from infrastructure engineer to hands-on AI researcher and builder
Five months ago, I wrote about pivoting my career toward AI. Today, I want to document where that journey has taken me, what I have learned, and what I have built. The transition from cloud infrastructure architect to someone who works with AI systems daily has been challenging, humbling, and deeply rewarding.
From Consumer to Builder
My relationship with AI has gone through distinct phases this year. In January, I was a fascinated observer watching ChatGPT take the world by storm. By March, I was an active user experimenting with GPT-4 and evaluating how it might apply to my work. By May, I was building prototypes with LangChain and vector databases. And now, I am doing something that I would have considered unlikely a year ago: conducting my own applied AI research.
I want to be precise about what I mean by "research" here. I am not training foundation models or publishing papers in machine learning conferences. That requires resources and expertise I do not have. What I am doing is systematic experimentation with how these models can be applied to real enterprise problems, documenting results rigorously, and developing novel approaches to challenges that existing frameworks and tutorials do not adequately address.
The Research Questions
The questions driving my work are practical rather than theoretical, but they are genuinely open questions that do not have established answers:
How do you evaluate LLM outputs at scale? When your application makes thousands of model calls per day, you cannot manually review every response. I have been developing automated evaluation frameworks that use a combination of heuristic checks, reference-based comparison, and model-based evaluation (using one LLM to evaluate another's output). The challenge is calibrating these systems so they catch genuine errors without generating excessive false positives.
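The layered approach described above can be sketched in a few lines. This is a minimal illustration, not the actual framework: the heuristic rules and the 1-5 grading prompt are assumptions, and `call_judge` is a hypothetical stand-in for whatever client sends a prompt to the grader model.

```python
import re

def heuristic_checks(answer: str) -> list[str]:
    """Cheap first-pass checks that flag obviously bad outputs."""
    flags = []
    if not answer.strip():
        flags.append("empty")
    if len(answer) > 4000:
        flags.append("too_long")
    if re.search(r"as an ai language model", answer, re.IGNORECASE):
        flags.append("refusal_boilerplate")
    return flags

def judge_prompt(question: str, answer: str) -> str:
    """Prompt for a second model acting as grader (model-based evaluation)."""
    return (
        "Rate the answer to the question on a 1-5 scale for factual accuracy.\n"
        f"Question: {question}\nAnswer: {answer}\n"
        "Reply with only the number."
    )

def evaluate(question: str, answer: str, call_judge) -> dict:
    """Combine heuristic flags with a model-based score.

    `call_judge` is any callable that sends a prompt to a grader LLM
    and returns its text reply (hypothetical here).
    """
    flags = heuristic_checks(answer)
    score = None
    if not flags:  # only spend a judge call on answers that pass cheap checks
        reply = call_judge(judge_prompt(question, answer))
        match = re.search(r"[1-5]", reply)
        score = int(match.group()) if match else None
    return {"flags": flags, "judge_score": score}
```

Running the cheap checks first keeps judge-model costs proportional to the number of plausible answers rather than the total call volume.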
What retrieval strategies work best for heterogeneous enterprise data? Our internal knowledge base includes everything from architectural decision records to Terraform module documentation to incident post-mortems to Confluence pages written five years ago. These documents have different structures, different levels of currency, and different levels of reliability. I have been experimenting with retrieval strategies that account for these differences, incorporating metadata, recency, and source reliability into the ranking.
How do you build trust in AI-assisted workflows? This is partly a technical question (accuracy, reliability, explainability) and partly an organizational question (change management, training, governance). I have been studying how engineers interact with AI tools, where they trust the output and where they do not, and what factors influence that trust. The findings are informing how I design AI-assisted tools for our teams.
The Technical Deep Dive
To do this work effectively, I had to develop a much deeper understanding of how these models work under the hood. Not at the level of someone training models from scratch, but at the level needed to make informed decisions about architecture, configuration, and optimization.
I spent several weeks studying transformer architectures in detail: how attention mechanisms work, why they are effective, and what their limitations are. Understanding attention helped me reason about why models struggle with certain types of tasks (very long inputs, precise numerical computation) and excel at others (pattern matching, synthesis, translation between formats).
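The attention mechanism at the heart of those architectures is compact enough to write out. This is the standard scaled dot-product attention in NumPy, stripped of batching, masking, and multiple heads:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V.

    Q, K, V: arrays of shape (seq_len, d). Each output row is a
    weighted average of the value rows, weighted by query-key similarity.
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                 # (seq, seq) similarity logits
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # each row sums to 1
    return weights @ V
```

Seeing that every output is a convex combination of value vectors makes the failure modes concrete: the mechanism is excellent at mixing and matching existing content, and has no primitive for exact digit-level arithmetic.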
I studied embedding models and vector similarity in depth. The choice of embedding model, the dimensionality of the vector space, the similarity metric, and the indexing algorithm all affect retrieval quality. I ran experiments comparing different embedding models on our internal data and found that the best-performing model for general benchmarks was not the best for our specific content types.
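The comparison methodology can be reduced to a simple retrieval metric. The sketch below uses cosine similarity and recall@k on a labeled query-document set; it is a generic illustration of the approach, not the actual experiment harness.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def recall_at_k(query_vecs, doc_vecs, relevant, k=5):
    """Fraction of queries whose relevant document appears in the top-k
    results by cosine similarity. `relevant[i]` is the index of the
    correct document for query i. Running this per embedding model on
    your own data gives an apples-to-apples comparison.
    """
    hits = 0
    for i, q in enumerate(query_vecs):
        sims = [cosine_similarity(q, d) for d in doc_vecs]
        topk = np.argsort(sims)[::-1][:k]
        hits += relevant[i] in topk
    return hits / len(query_vecs)
```

A brute-force loop like this is fine for evaluation-sized corpora; production retrieval would swap in an approximate nearest-neighbor index.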
I explored fine-tuning, both full fine-tuning and parameter-efficient methods like LoRA. I fine-tuned a Llama 2 7B model on a curated dataset of our internal documentation and operational knowledge. The resulting model was measurably better at answering infrastructure-specific questions than the base model, though it still fell short of GPT-4 on complex reasoning tasks.
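The core idea behind LoRA is small enough to show directly: the frozen weight matrix W gets a trainable low-rank update B·A, scaled by alpha/r. This NumPy sketch shows the math only; real fine-tuning would use a library such as PEFT, and the initialization scales here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_lora(d_out, d_in, r=8):
    """LoRA factors: A is small random, B starts at zero, so at step
    zero the adapted layer is exactly the frozen base layer."""
    A = rng.normal(scale=0.01, size=(r, d_in))
    B = np.zeros((d_out, r))
    return A, B

def lora_forward(x, W, A, B, alpha=16.0):
    """y = x @ (W + (alpha/r) * B @ A).T  -- only A and B are trained.

    W has shape (d_out, d_in) and stays frozen; the update touches
    r * (d_in + d_out) parameters instead of d_in * d_out.
    """
    r = A.shape[0]
    delta = (alpha / r) * (B @ A)   # rank-r update to the frozen weight
    return x @ (W + delta).T
```

For a 7B-parameter model, training only these rank-r factors is what makes fine-tuning feasible on a single GPU.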
What I Have Built
Several concrete projects have emerged from this research:
InfraBot: An AI-powered assistant for infrastructure engineers that can answer questions about our systems, explain recent changes, diagnose common issues, and generate infrastructure code. It uses a RAG pipeline over our documentation combined with function calling to query our monitoring and configuration management systems. Engineers on the team have started using it daily, which is the best validation I could ask for.
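The function-calling half of a design like this boils down to a registry of permitted tools and a dispatcher for the model's structured call requests. The sketch below is a generic illustration, not InfraBot's code: the tool name, its stubbed data, and the JSON call format are all assumptions.

```python
import json

# Hypothetical tool registry; real tools would query monitoring
# and configuration management systems.
TOOLS = {}

def tool(fn):
    """Register a function the model is allowed to call."""
    TOOLS[fn.__name__] = fn
    return fn

@tool
def get_alert_count(service: str) -> int:
    """Stub standing in for a query to the monitoring system."""
    return {"payments": 2}.get(service, 0)

def dispatch(model_reply: str) -> str:
    """Execute a tool call the model emitted as JSON, e.g.
    {"name": "get_alert_count", "arguments": {"service": "payments"}},
    and return a JSON result to feed back on the next turn.
    """
    call = json.loads(model_reply)
    fn = TOOLS[call["name"]]
    return json.dumps({"result": fn(**call["arguments"])})
```

Keeping an explicit allowlist of tools, rather than letting the model name arbitrary functions, is the main safety property of the pattern.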
Eval Framework: A systematic evaluation harness for testing LLM outputs against expected results. It supports multiple evaluation strategies (exact match, semantic similarity, rubric-based scoring) and generates reports that track quality metrics over time. This has been essential for making data-driven decisions about model selection, prompt optimization, and retrieval configuration.
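Supporting multiple evaluation strategies usually comes down to a common scoring interface plus a strategy registry. This sketch is illustrative of that structure (token-level F1 stands in here for a proper semantic-similarity scorer, which would use embeddings):

```python
def exact_match(output: str, expected: str) -> float:
    """Strict equality after whitespace/case normalization."""
    return float(output.strip().lower() == expected.strip().lower())

def token_f1(output: str, expected: str) -> float:
    """Bag-of-words F1: a cheap stand-in for semantic similarity."""
    o, e = set(output.lower().split()), set(expected.lower().split())
    common = len(o & e)
    if common == 0:
        return 0.0
    p, r = common / len(o), common / len(e)
    return 2 * p * r / (p + r)

STRATEGIES = {"exact": exact_match, "f1": token_f1}

def run_case(case: dict) -> float:
    """case = {"output": ..., "expected": ..., "strategy": "exact" | "f1"}.
    Every strategy maps to [0, 1], so scores aggregate across a suite."""
    return STRATEGIES[case["strategy"]](case["output"], case["expected"])
```

Normalizing every strategy to a 0-1 score is what lets the harness track one quality metric over time across heterogeneous test cases.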
Knowledge Sync: A pipeline that automatically ingests and indexes internal documentation, keeping the vector store current as documents are created and updated. Getting this right required solving practical problems around incremental indexing, document deduplication, and handling format conversions from various source systems.
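A common way to handle the incremental-indexing and deduplication problems is to fingerprint each document's normalized content and only re-embed what changed. The sketch below shows that planning step under assumed data shapes; the actual pipeline's format conversions are out of scope here.

```python
import hashlib

def content_hash(text: str) -> str:
    """Stable fingerprint of a document's whitespace-normalized text,
    so cosmetic reformatting does not trigger re-indexing."""
    return hashlib.sha256(" ".join(text.split()).encode()).hexdigest()

def plan_sync(source_docs: dict[str, str], indexed: dict[str, str]):
    """Decide which documents to (re)index or delete.

    source_docs: {doc_id: text} pulled from the source systems.
    indexed:     {doc_id: content_hash} currently in the vector store.
    Returns (to_upsert, to_delete): only new or changed documents are
    re-embedded, keeping incremental runs cheap.
    """
    to_upsert = [
        doc_id for doc_id, text in source_docs.items()
        if indexed.get(doc_id) != content_hash(text)
    ]
    to_delete = [doc_id for doc_id in indexed if doc_id not in source_docs]
    return to_upsert, to_delete
```

The delete list matters as much as the upsert list: stale chunks left in the vector store quietly degrade retrieval quality.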
Lessons Learned
Several insights have emerged from this journey that I think are broadly applicable:
Infrastructure skills transfer directly. The problems of running AI systems (model serving, data pipeline management, monitoring and observability, cost optimization) are infrastructure problems. My years of building cloud platforms have been directly relevant every day.
The gap between demo and production is enormous. Building a chatbot that works in a demo takes a weekend. Building one that works reliably in production, with proper error handling, evaluation, security, and operational visibility, takes months. Most of the work is in the parts that are not visible to the end user.
Evaluation is the hardest part. It is surprisingly difficult to define and measure "good" for language model outputs. Human evaluation does not scale. Automated evaluation requires its own development and calibration. This is an unsolved problem at the industry level, and it deserves more attention than it gets.
The pace of change is real. Tools, models, and best practices evolve weekly. A technique I adopted in June was superseded by a better approach in August. Staying current requires continuous learning and acceptance that some of your work will be obsolete before it reaches production.
What Comes Next
I am continuing to deepen my AI expertise while maintaining my infrastructure foundation. The intersection of these two domains is where I see the most valuable and interesting work ahead.
Specifically, I am focusing on agent architectures. The ability for AI systems to autonomously plan and execute multi-step tasks using tools is, in my view, the most important capability on the horizon. Building reliable agent systems requires everything I have been learning: prompt engineering, tool integration, evaluation, and the infrastructure to run it all at scale.
I did not plan to become an AI researcher. But looking back at the path from my first Linux server to cloud architecture to where I am now, each step followed naturally from the last. The technology frontier keeps moving, and I keep following it.