Practical AI engineering: shipping LLM features that hold up

There is a wide gap between a demo that works in front of an audience and a feature that survives real users for a month. I have shipped a few LLM features now, and almost everything I learned the hard way lives in that gap.

The demo is the easy 80 percent

Wiring up a model and getting a good answer takes an afternoon. The remaining work is everything that happens when the input is weird, the model is confidently wrong, or the user asks something you never tested. That part takes the other three weeks, and it is the part that decides whether anyone keeps using the thing.

So plan for it. Budget more time for evaluation and guardrails than for the happy path, because the happy path mostly builds itself.

RAG: retrieval is the hard part, not generation

Most useful LLM features need your data, not just what the model absorbed in training. Retrieval-augmented generation is the standard answer: find the relevant chunks, put them in the prompt, let the model answer from them. Simple to describe, fiddly to get right.

The quality of a RAG system is almost entirely the quality of its retrieval. If you fetch the wrong chunks, no amount of prompt cleverness saves you. Spend your time on chunking strategy, on whether you actually need embeddings or whether plain keyword search wins for your data, and on measuring whether the retrieved context contains the answer before you ever look at the generation step.
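Here is a minimal sketch of measuring retrieval on its own, before any generation happens. Everything in it is an assumption for illustration: the three toy documents, the word-overlap retriever, and the rule that a case counts as a hit when some retrieved chunk contains a known answer snippet. Swap in your own retriever and real query/answer pairs.

```python
from typing import Callable

# Hypothetical corpus; in practice these are your real chunks.
DOCS = [
    "Refunds are processed within 5 business days.",
    "Support is available Monday through Friday.",
    "Passwords must be at least 12 characters long.",
]


def _tokens(text: str) -> set[str]:
    """Lowercase word set with basic punctuation removed."""
    return set(text.lower().replace(".", " ").replace("?", " ").split())


def keyword_retrieve(query: str, k: int = 1) -> list[str]:
    """Toy retriever: rank chunks by how many query words they share."""
    terms = _tokens(query)
    ranked = sorted(DOCS, key=lambda d: len(terms & _tokens(d)), reverse=True)
    return ranked[:k]


def retrieval_hit_rate(
    cases: list[tuple[str, str]],
    retrieve: Callable[[str], list[str]],
) -> float:
    """Fraction of (query, answer_snippet) cases where a retrieved chunk contains the snippet."""
    if not cases:
        return 0.0
    hits = 0
    for query, answer_snippet in cases:
        chunks = retrieve(query)
        if any(answer_snippet.lower() in chunk.lower() for chunk in chunks):
            hits += 1
    return hits / len(cases)


# Hypothetical test cases: a query plus text the right chunk must contain.
CASES = [
    ("when are refunds processed", "5 business days"),
    ("minimum password length", "12 characters"),
]

if __name__ == "__main__":
    print(f"hit rate: {retrieval_hit_rate(CASES, keyword_retrieve):.2f}")
```

The second case misses with this toy retriever because "password" never matches "passwords" exactly. That is the kind of failure this number surfaces, and no prompt change downstream would fix it.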

One concrete tip: log the retrieved chunks for every query in development. Half my RAG bugs were obvious the moment I saw what the retriever actually pulled.
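One way to do this without touching the retriever itself is a thin wrapper, sketched below with the standard logging module. The logger name "rag.retrieval", the function name, and the 200-character preview limit are my own choices for illustration, not a convention from any library.

```python
import logging
from typing import Callable

logger = logging.getLogger("rag.retrieval")  # hypothetical logger name


def with_chunk_logging(
    retrieve: Callable[[str], list[str]],
    preview_chars: int = 200,
) -> Callable[[str], list[str]]:
    """Wrap a retriever so every call logs the query and each chunk it returned."""

    def wrapped(query: str) -> list[str]:
        chunks = retrieve(query)
        logger.debug("query=%r returned %d chunk(s)", query, len(chunks))
        for rank, chunk in enumerate(chunks, start=1):
            # Log a preview only, so long chunks do not flood the log.
            logger.debug("  #%d: %s", rank, chunk[:preview_chars])
        return chunks

    return wrapped
```

In development, call logging.basicConfig(level=logging.DEBUG) once at startup and wrap whatever retriever you already have. In production, leaving the level above DEBUG silences the output without a code change.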

You cannot improve what you do not measure

“It seems better” is not a metric. Before you tune anything, build a small evaluation set: thirty to fifty real inputs with known good outputs. Run it on every change. It feels like overkill until the day a prompt tweak that “obviously improved things” quietly broke a third of your cases.

Evals do not need to be fancy. A spreadsheet of inputs, expected behavior, and a pass or fail you check by eye beats no evals at all. Automate it later once you know what you are measuring.
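When you do get to automating it, the first version can be very small. The sketch below assumes the crudest possible check, that the output must contain an expected phrase. That is a stand-in for whatever "expected behavior" means in your spreadsheet, and the names EvalCase and run_evals are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    input_text: str
    must_contain: str  # assumption: a substring check stands in for "expected behavior"


def run_evals(
    cases: list[EvalCase],
    generate: Callable[[str], str],
) -> tuple[float, list[EvalCase]]:
    """Run every case through `generate`; return the pass rate and the failing cases."""
    failures = [
        case
        for case in cases
        if case.must_contain.lower() not in generate(case.input_text).lower()
    ]
    passed = len(cases) - len(failures)
    pass_rate = passed / len(cases) if cases else 0.0
    return pass_rate, failures
```

Here generate is whatever function wraps your prompt and model call. Run this on every prompt change and read the failures list, not just the rate, because which cases broke usually tells you more than how many.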

Treat the model output as untrusted

This is where LLM engineering meets security. Model output is just text, and if you feed it into a database query, a shell command, or another system, it can do damage the same way user input can. If an agent reads untrusted content, that content can carry instructions the model may follow, which is the prompt-injection problem I cover in agentic AI in cybersecurity.

Validate structured output against a schema. Never pass raw model text into anything that executes. The same “input is hostile” mindset from my developer security checklist applies directly to what comes out of the model, not just what goes in.
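As a rough illustration, here is what that validation can look like with only the standard library. The schema is invented for the example: a JSON object with exactly two fields, an action drawn from a fixed allowlist and a positive integer order_id. A schema library would do the same job with less code; the point is that nothing downstream sees the text until it has passed these checks.

```python
import json

# Hypothetical allowlist: the only actions downstream code is willing to run.
ALLOWED_ACTIONS = {"lookup_order", "cancel_order"}


class InvalidModelOutput(ValueError):
    """Raised when model text does not match the expected structure."""


def parse_action(raw: str) -> dict:
    """Parse model text into a validated action dict, or raise InvalidModelOutput."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise InvalidModelOutput(f"not valid JSON: {exc}") from exc

    if not isinstance(data, dict):
        raise InvalidModelOutput("expected a JSON object")
    if set(data) != {"action", "order_id"}:
        raise InvalidModelOutput(f"unexpected fields: {sorted(data)}")

    action = data["action"]
    order_id = data["order_id"]
    if not isinstance(action, str) or action not in ALLOWED_ACTIONS:
        raise InvalidModelOutput(f"action not allowed: {action!r}")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(order_id, bool) or not isinstance(order_id, int) or order_id <= 0:
        raise InvalidModelOutput(f"order_id must be a positive integer: {order_id!r}")

    return {"action": action, "order_id": order_id}
```

The caller then dispatches on the validated action and passes order_id as a bound query parameter, never by formatting it into SQL or a shell string.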

Cost and latency are product decisions

The biggest model is rarely the right default. A smaller model that answers in 400 milliseconds often beats a larger one that takes four seconds, because users feel latency immediately and judge quality slowly. Cache aggressively. Route easy queries to cheap models and save the expensive one for the hard cases.

Pick your model tier on purpose. I default to the most capable model while building, then drop down once I know which calls actually need the horsepower.

Where this leaves you

Shipping AI features is mostly normal engineering with a probabilistic component bolted on. The model is the fun part and the smallest part. Retrieval, evaluation, validation, and the plumbing around them are the job. If you are building the surrounding system from scratch, the patterns in modern full-stack architecture describe the system the model actually has to live in.