A working demo proves a model can do something interesting once. Production proves it can do that thing thousands of times a day, predictably, within a budget, while failing safely. The gap between those two is almost entirely engineering, not prompting.
Treat the model as an unreliable dependency
An LLM is a network call to a non-deterministic service with variable latency and occasional nonsense output. Once you frame it that way, the right patterns follow: timeouts, retries with backoff, fallbacks, and strict output validation at the boundary.
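A minimal sketch of how those patterns might combine around a single model call. The names `callWithFallback`, `backoffDelay`, and `RetryOptions` are hypothetical and not taken from any SDK; the sketch assumes the model call can be expressed as a function that accepts an `AbortSignal`.

```typescript
// Hypothetical wrapper: per-attempt timeout, exponential backoff, then a fallback.
type ModelCall<T> = (signal: AbortSignal) => Promise<T>;

interface RetryOptions {
  attempts: number;    // total tries, including the first
  timeoutMs: number;   // deadline for each attempt
  baseDelayMs: number; // first backoff delay; doubles after each failure
}

const sleep = (ms: number) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Pure helper so the backoff schedule is easy to inspect and test.
export function backoffDelay(attempt: number, baseDelayMs: number): number {
  return baseDelayMs * 2 ** attempt;
}

export async function callWithFallback<T>(
  call: ModelCall<T>,
  fallback: () => T,
  opts: RetryOptions,
): Promise<T> {
  for (let attempt = 0; attempt < opts.attempts; attempt++) {
    const controller = new AbortController();
    let timer: ReturnType<typeof setTimeout> | undefined;
    // Rejects when the deadline passes and signals the call to abort.
    const deadline = new Promise<never>((_, reject) => {
      timer = setTimeout(() => {
        controller.abort();
        reject(new Error(`timed out after ${opts.timeoutMs} ms`));
      }, opts.timeoutMs);
    });
    try {
      return await Promise.race([call(controller.signal), deadline]);
    } catch {
      // Swallowed so the loop can retry; a real system would log the error.
    } finally {
      clearTimeout(timer);
    }
    if (attempt < opts.attempts - 1) {
      await sleep(backoffDelay(attempt, opts.baseDelayMs));
    }
  }
  // Every attempt failed: degrade to a safe default instead of throwing.
  return fallback();
}
```

The fallback here is a plain value factory. Depending on the feature, it could equally be a cached answer, a cheaper model, or a "try again later" message. Adding random jitter to the delay is a common refinement that this sketch leaves out.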
Never trust raw model output
Validate every structured response against a schema before it touches your database or business logic. Treat a schema violation as a failed request, not a value to coerce.
import { z } from "zod";

const Result = z.object({
  category: z.enum(["bug", "feature", "question"]),
  confidence: z.number().min(0).max(1),
});

export function parseModelOutput(raw: string) {
  const json = JSON.parse(raw);
  // Throws on any drift; the caller retries or falls back.
  return Result.parse(json);
}

Evaluation is your test suite
You cannot ship changes to a prompt or model with confidence unless you can measure quality. Build an evaluation set of representative inputs with known-good outcomes, and run it on every change — the same discipline you'd apply to any other critical code path.
- Curate 50–200 real, representative inputs
- Define a measurable success criterion per case
- Run evals in CI on every prompt or model change
- Track quality, latency and cost as a trend over time
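The checklist above could be wired into CI roughly as follows. This is a sketch under stated assumptions: `EvalCase`, `runEvals`, and `gate` are hypothetical names, the success criterion is a simple exact match on a label, and `classify` stands in for whatever function wraps your prompt and model.

```typescript
// Hypothetical eval harness: exact-match scoring plus a pass-rate gate for CI.
interface EvalCase {
  id: string;
  input: string;
  expected: string; // known-good label for this input
}

interface EvalReport {
  passed: number;
  total: number;
  passRate: number;
  meanLatencyMs: number;
  failures: string[]; // ids of failing cases, for quick triage
}

export async function runEvals(
  cases: EvalCase[],
  classify: (input: string) => Promise<string>,
): Promise<EvalReport> {
  const failures: string[] = [];
  let totalLatencyMs = 0;
  for (const c of cases) {
    const start = Date.now();
    let actual: string | null = null;
    try {
      actual = await classify(c.input);
    } catch {
      // An exception counts as a failed case, not a crashed run.
    }
    totalLatencyMs += Date.now() - start;
    if (actual !== c.expected) failures.push(c.id);
  }
  const total = cases.length;
  const passed = total - failures.length;
  return {
    passed,
    total,
    passRate: total === 0 ? 0 : passed / total,
    meanLatencyMs: total === 0 ? 0 : totalLatencyMs / total,
    failures,
  };
}

// Throws when quality drops below the threshold, which fails the CI job.
export function gate(report: EvalReport, minPassRate: number): void {
  if (report.passRate < minPassRate) {
    throw new Error(
      `eval pass rate ${report.passRate.toFixed(2)} is below ${minPassRate}; ` +
        `failing cases: ${report.failures.join(", ")}`,
    );
  }
}
```

Storing each `EvalReport` alongside the commit that produced it is one way to get the trend over time. Exact match suits classification; free-text outputs usually need a different criterion per case.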
Observe everything
Log every request and response, along with token counts, latency, and cost. When a user reports a bad answer, you need to reconstruct exactly what happened. Without traces, debugging an AI system is guesswork.
"If you can't see what your AI system did, you don't operate it — you hope."
— NWARRAH Engineering
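One way to capture those fields is a thin wrapper that emits a trace record for every call, successful or not. The names below (`TraceRecord`, `Pricing`, `traced`, `estimateCostUsd`) are hypothetical, the per-million-token prices are placeholders you would supply yourself, and the sketch assumes the model client reports token counts with each response.

```typescript
// Hypothetical tracing wrapper: one record per call, on success and on failure.
interface Pricing {
  inputPerMTok: number;  // assumed price in USD per million input tokens
  outputPerMTok: number; // assumed price in USD per million output tokens
}

interface ModelResponse {
  text: string;
  inputTokens: number;
  outputTokens: number;
}

interface TraceRecord {
  traceId: string;
  model: string;
  prompt: string;
  response: string | null;
  error: string | null;
  inputTokens: number;
  outputTokens: number;
  latencyMs: number;
  costUsd: number;
}

export function estimateCostUsd(
  inputTokens: number,
  outputTokens: number,
  pricing: Pricing,
): number {
  return (
    (inputTokens * pricing.inputPerMTok + outputTokens * pricing.outputPerMTok) /
    1_000_000
  );
}

export async function traced(
  meta: { traceId: string; model: string; prompt: string; pricing: Pricing },
  call: (prompt: string) => Promise<ModelResponse>,
  sink: (record: TraceRecord) => void, // e.g. write to your log pipeline
): Promise<string> {
  const start = Date.now();
  const base = { traceId: meta.traceId, model: meta.model, prompt: meta.prompt };
  try {
    const r = await call(meta.prompt);
    sink({
      ...base,
      response: r.text,
      error: null,
      inputTokens: r.inputTokens,
      outputTokens: r.outputTokens,
      latencyMs: Date.now() - start,
      costUsd: estimateCostUsd(r.inputTokens, r.outputTokens, meta.pricing),
    });
    return r.text;
  } catch (err) {
    // Failed calls are traced too; token counts are unknown, so they stay at 0.
    sink({
      ...base,
      response: null,
      error: String(err),
      inputTokens: 0,
      outputTokens: 0,
      latencyMs: Date.now() - start,
      costUsd: 0,
    });
    throw err;
  }
}
```

Returning the `traceId` to the client, or attaching it to the user-visible response, makes it possible to go from a bug report straight to the matching record.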
Control cost before it controls you
Token costs scale with usage in ways that surprise teams. Cache aggressively, choose the smallest model that passes your evals, and set hard per-user and per-org budgets enforced in code.
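Enforcing budgets "in code" can be as simple as a guard that refuses a call before it is made, paired with a cache that skips the call entirely on a repeat request. The sketch below is illustrative only: `BudgetGuard` and `cachedCall` are hypothetical names, state lives in memory, and a real deployment would likely need a shared store and a reset schedule.

```typescript
// Hypothetical in-memory budget guard, keyed by strings such as "user:42" or "org:7".
export class BudgetGuard {
  private readonly spentUsd = new Map<string, number>();
  private readonly limitUsd: number;

  constructor(limitUsd: number) {
    this.limitUsd = limitUsd;
  }

  remaining(key: string): number {
    return this.limitUsd - (this.spentUsd.get(key) ?? 0);
  }

  // Reserve the estimated cost up front; throw if it would exceed the limit.
  reserve(key: string, estimatedCostUsd: number): void {
    if (estimatedCostUsd > this.remaining(key)) {
      throw new Error(`budget exceeded for ${key}`);
    }
    this.spentUsd.set(key, (this.spentUsd.get(key) ?? 0) + estimatedCostUsd);
  }
}

// Returns a cached response when one exists; otherwise computes and stores it.
// The caller builds the key, typically from model, prompt, and parameters.
export async function cachedCall(
  cache: Map<string, string>,
  key: string,
  compute: () => Promise<string>,
): Promise<string> {
  const hit = cache.get(key);
  if (hit !== undefined) return hit;
  const value = await compute();
  cache.set(key, value);
  return value;
}
```

A per-user and a per-org limit would be two guards checked in sequence before the call. Checking the cache before reserving budget means repeat requests cost nothing against either limit.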
None of this is glamorous. But it's the difference between an AI feature that quietly works for years and one that becomes a support burden the week after launch.



