How I Actually Learned AI + DevOps by Building a Real System (Not Just Watching Tutorials)

By Roshan Kr Singh · Founder, TheOpskart


A few months ago, I was sitting late one night with five browser tabs open. One was a YouTube tutorial on prompt engineering. Another had a LangChain quickstart. A third was a course on building AI agents. And two were Stack Overflow questions I'd opened, read, and then forgotten why I'd opened in the first place. I closed everything. Not because I was tired. Because I realized I'd been "learning AI" for weeks and still couldn't answer a basic question: if I deployed any of this stuff to production tomorrow, would it actually work? That's when I stopped watching and started building.



The Problem I Faced

When I started learning AI, I followed the usual path. Watched tutorials. Explored tools. Tried a few examples. Very quickly, I realized something. There was a lot of information, but very little clarity. Some content focused on prompt engineering. Some showed how to call an API and print a response. Some explained concepts, but never showed how things actually work in a real system.

As a DevOps engineer, my question was simple: Where is the infrastructure behind all of this?

Most AI content today focuses on:
  • models
  • prompts
  • frameworks
But real systems are built on:
  • secure access layers
  • cost control
  • observability
  • data pipelines
  • architecture decisions
That gap creates confusion. Especially for engineers like me who don't come from an ML background but understand systems, deployments, and production behavior.

What Actually Worked

The turning point wasn't a tutorial. It was a problem. I decided to build one small thing: an LLM gateway for my own experiments. Nothing fancy. Just an API layer that:
  • routed my requests to OpenAI
  • logged every call
  • tracked tokens used per request
  • enforced a basic spending limit
  • gave me visibility when something went wrong
And immediately, the real questions showed up. Where do I store the API keys? How do I rotate them? What happens when the model is slow? How do I prevent one buggy script from burning $200 in a day? How do I know if a prompt is returning garbage in production? None of these questions came up while watching tutorials. All of them came up the moment I tried to make something real. That's when AI stopped being a "topic" and started being a system.
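
To make the shape of that gateway concrete, here is a minimal sketch, not the actual implementation. Everything in it is an assumption for illustration: the `Gateway` class, the `fake_provider` stand-in (a real version would call the model API and read token counts from its response), and the flat `usd_per_1k_tokens` price. It covers three of the bullets above: logging every call, tracking tokens per request, and enforcing a basic spending limit.

```python
import logging
import time
from dataclasses import dataclass, field
from typing import Callable, Tuple

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("llm_gateway")

# Hypothetical provider interface: prompt -> (reply_text, tokens_used).
Provider = Callable[[str], Tuple[str, int]]


class BudgetExceeded(RuntimeError):
    """Raised when a call is attempted after the spending limit is reached."""


@dataclass
class Gateway:
    provider: Provider
    usd_per_1k_tokens: float   # assumed flat price, not a real rate
    daily_limit_usd: float
    spent_usd: float = 0.0
    calls: list = field(default_factory=list)

    def complete(self, prompt: str) -> str:
        # The check runs before the call, so the call that crosses the
        # limit still goes through; every call after it is refused.
        if self.spent_usd >= self.daily_limit_usd:
            raise BudgetExceeded(
                f"spent ${self.spent_usd:.4f} of ${self.daily_limit_usd:.4f}"
            )
        start = time.perf_counter()
        reply, tokens = self.provider(prompt)
        latency_s = time.perf_counter() - start

        cost_usd = tokens / 1000 * self.usd_per_1k_tokens
        self.spent_usd += cost_usd

        record = {"tokens": tokens, "cost_usd": cost_usd, "latency_s": latency_s}
        self.calls.append(record)
        log.info("llm_call %s", record)
        return reply


def fake_provider(prompt: str) -> Tuple[str, int]:
    """Stand-in for a real model call; pretends each word is one token."""
    return f"echo: {prompt}", len(prompt.split())


if __name__ == "__main__":
    gw = Gateway(fake_provider, usd_per_1k_tokens=1.0, daily_limit_usd=0.005)
    print(gw.complete("one two three"))
```

The sketch keeps spend in memory, so it resets on restart and never resets daily. A real gateway would need persistent counters and a reset schedule, which is exactly the kind of question that only shows up once you build it.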

The Real Building Blocks

Once I started building, the building blocks of AI infrastructure became obvious. They look surprisingly similar to what every DevOps engineer already knows:
Authentication and access control. Who can call your AI service? How do you stop your API key from leaking? This isn't an AI problem. It's the same problem you've solved a hundred times for normal APIs.
Logging and observability. Every AI call needs to be traceable. Input, output, latency, tokens used, cost. Without this, you're flying blind. Sound familiar? It's the same observability discipline you apply to microservices, just with different metrics.
Cost control. This is the one that catches everyone off guard. A loop that mistakenly runs 10,000 times against GPT-4 doesn't just slow down — it can cost you ₹50,000 before you notice. Token tracking is the new bill alerting.
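
The arithmetic behind a runaway loop is simple enough to sketch. The token counts and per-1K-token prices below are placeholder assumptions for illustration only (check your provider's current pricing), and both function names are hypothetical.

```python
import math


def cost_per_call(input_tokens: int, output_tokens: int,
                  in_price_per_1k: float, out_price_per_1k: float) -> float:
    """Cost of one request when input and output tokens are priced separately."""
    return (input_tokens / 1000 * in_price_per_1k
            + output_tokens / 1000 * out_price_per_1k)


def calls_until_budget(budget: float, per_call: float) -> int:
    """How many whole calls fit inside a budget at a given per-call cost."""
    if per_call <= 0:
        raise ValueError("per_call must be positive")
    return math.floor(budget / per_call)


if __name__ == "__main__":
    # Assumed: 1,000 input tokens and 500 output tokens per call,
    # at $0.03 and $0.06 per 1K tokens (placeholder prices).
    per_call = cost_per_call(1000, 500, 0.03, 0.06)
    print(f"per call:     ${per_call:.2f}")           # per call:     $0.06
    print(f"10,000 calls: ${per_call * 10_000:.2f}")  # 10,000 calls: $600.00
    print(calls_until_budget(200, per_call))          # 3333
```

Under these assumed numbers, a $200 budget is gone after about 3,300 calls, which a tight loop can burn through in minutes. That is why the alert has to exist before the loop does.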
Routing and fallback. What happens when OpenAI is down? When the model is rate-limited? When you want to switch from GPT-4 to Claude for a specific use case? You need a gateway pattern. This is just load balancing wearing a new hat.
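
A minimal sketch of that fallback idea: try providers in order and return the first one that answers. The provider functions here are hypothetical stand-ins, not real SDK calls, and the broad `except Exception` is a simplification. A real gateway would catch only the provider's timeout and rate-limit errors.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical provider interface: prompt -> reply text.
Provider = Callable[[str], str]


class AllProvidersFailed(RuntimeError):
    """Raised when every provider in the chain has failed."""


def complete_with_fallback(
    prompt: str, providers: Sequence[Tuple[str, Provider]]
) -> Tuple[str, str]:
    """Return (provider_name, reply) from the first provider that succeeds."""
    errors: List[str] = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # simplification; narrow this in real code
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))


def primary(prompt: str) -> str:
    """Stand-in for a provider that is currently down."""
    raise TimeoutError("primary timed out")


def backup(prompt: str) -> str:
    """Stand-in for a healthy secondary provider."""
    return f"backup reply to: {prompt}"


if __name__ == "__main__":
    print(complete_with_fallback("hello", [("primary", primary), ("backup", backup)]))
```

Returning the provider name along with the reply matters for observability: you want your logs to show which model actually served each request.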
Data pipelines. RAG (retrieval-augmented generation) is just a search-and-stuff-context pipeline. The hard part isn't the AI. It's the indexing, chunking, embedding storage, and refresh strategy. DevOps engineers understand pipelines.
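
Chunking is the easiest of those steps to show. The sketch below splits text into fixed-size word windows with overlap, which is one common approach among several. The function name and default sizes are assumptions, and real pipelines often chunk by tokens or by document structure instead of words.

```python
from typing import List


def chunk_words(text: str, size: int = 200, overlap: int = 50) -> List[str]:
    """Split text into chunks of `size` words, each sharing `overlap` words
    with the previous chunk so context is not cut off at a boundary."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


if __name__ == "__main__":
    print(chunk_words("a b c d e f g h i j", size=4, overlap=1))
    # ['a b c d', 'd e f g', 'g h i j']
```

Each chunk would then be embedded and stored. The refresh question (what happens when the source document changes) is where the pipeline discipline comes in.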
Deployment patterns. Where does the agent run? In a container? As a function? Stateless or stateful? These are decisions you already know how to make for any other service.

When you look at AI systems through this lens, the fear disappears. It's not magic. It's infrastructure with a new layer on top.

Lessons That Stuck

After building, breaking, and rebuilding several of these systems over a few months, a few lessons became unshakeable:

Building beats watching. Every hour I spent in a tutorial taught me less than every fifteen minutes I spent debugging my own broken AI pipeline. Production behavior is the teacher.

Cost will surprise you. I learned the hard way that AI APIs don't have soft limits by default. Set up token tracking and budget alerts on day one, not after the first scare.
Observability is not optional. Without logs, you can't tell if a prompt regression broke your output quality. You'll just notice users complaining a week later. That's too late.
Agents are not microservices, but they need the same discipline. AI agents have memory, tools, retries, and side effects. You can't ship them like a stateless web service and pray. They need health checks, runtime visibility, and clear failure modes, same as anything else in production.
The DevOps skill set is your biggest advantage, not your gap. Every "AI engineer" job posting I see is really asking for someone who can build reliable systems. Models change every six months. Reliability principles don't.

How I Now Teach This at TheOpskart

This is the gap I'm building TheOpskart to close. Not another course that shows you how to call an API. Not theoretical lectures on neural networks. Real, hands-on labs that show DevOps, Cloud, and SRE engineers how AI behaves as a production system, because that's what we'll actually be operating in our jobs.

A few things you can start with for free:

If you want personal guidance (career direction, architecture review, or help debugging your own AI infrastructure), I do 1-1 mentorship calls.

The Real Shift

The biggest thing that changed for me wasn't a new framework or a new model. It was a realization: AI is not a separate skill from DevOps. It's a new surface area on top of it. If you can deploy services, monitor systems, control costs, and reason about failures, you already have 80% of what's needed to operate AI in production. The remaining 20% is learnable, but only by building, not by watching.

So if you're a DevOps, Cloud, or SRE engineer trying to figure out where AI fits into your career, my suggestion is simple: Pick one small AI system. Build it badly. Watch it break. Fix it. That's the path. That's what worked for me, and that's what I'm building TheOpskart to help others do.

I first explored this idea in a shorter post on Medium. This is the expanded version, with the lessons I've kept learning since. Got questions about your own AI infrastructure journey? Connect with me or drop a comment below.

Roshan Kr Singh
DevOps Engineer and Founder of TheOpskart®. Helping engineers level up in Cloud, DevOps and AI infrastructure through hands-on labs, practical lessons, and 1-1 mentorship. 13+ years in IT.

  • LinkedIn: https://www.linkedin.com/in/roshan-singh-82985629/
  • YouTube: https://www.youtube.com/@TheOpsKart-r8q
  • Instagram: https://www.instagram.com/er.roshansingh_/