AI · 9 min read

Building AI systems that survive production

Demos are easy. Reliable, observable, cost-controlled AI in production is the real engineering problem.

Misbah Uddin Hasanat

May 18, 2026 · Updated June 2, 2026

A working demo proves a model can do something interesting once. Production proves it can do that thing thousands of times a day, predictably, within a budget, while failing safely. The gap between those two is almost entirely engineering, not prompting.

Treat the model as an unreliable dependency

An LLM is a network call to a non-deterministic service with variable latency and occasional nonsense output. Once you frame it that way, the right patterns follow: timeouts, retries with backoff, fallbacks, and strict output validation at the boundary.
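
Those patterns can be sketched in a few dozen lines. This is a minimal illustration, not a production client: `callModel` stands in for whatever SDK call you use, and the helper names (`withTimeout`, `backoffDelayMs`, `callWithRetry`) and default numbers are assumptions made for the example.

```typescript
// Hypothetical model client: any async function that takes a prompt.
type ModelCall = (prompt: string) => Promise<string>;

// Reject if the wrapped promise does not settle within `ms` milliseconds.
function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error(`timed out after ${ms} ms`)), ms);
    work.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (err) => { clearTimeout(timer); reject(err); },
    );
  });
}

// Exponential backoff: base, 2x base, 4x base, ... capped at `capMs`.
function backoffDelayMs(attempt: number, baseMs: number, capMs: number): number {
  return Math.min(capMs, baseMs * 2 ** attempt);
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Try the model up to `maxAttempts` times, then return the fallback value.
async function callWithRetry(
  callModel: ModelCall,
  prompt: string,
  fallback: string,
  opts = { maxAttempts: 3, timeoutMs: 10_000, baseMs: 200, capMs: 5_000 },
): Promise<string> {
  for (let attempt = 0; attempt < opts.maxAttempts; attempt++) {
    try {
      return await withTimeout(callModel(prompt), opts.timeoutMs);
    } catch {
      // Wait before the next attempt, but not after the last one.
      if (attempt < opts.maxAttempts - 1) {
        await sleep(backoffDelayMs(attempt, opts.baseMs, opts.capMs));
      }
    }
  }
  return fallback;
}
```

Note that the timeout here only stops waiting; it does not cancel the underlying request. A client that accepts an `AbortSignal` could cancel it as well.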

Never trust raw model output

Validate every structured response against a schema before it touches your database or business logic. Treat a schema violation as a failed request, not a value to coerce.

validateResponse.ts
import { z } from "zod";

const Result = z.object({
  category: z.enum(["bug", "feature", "question"]),
  confidence: z.number().min(0).max(1),
});

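export function parseModelOutput(raw: string): z.infer<typeof Result> {
  // JSON.parse throws a SyntaxError if the model returned malformed JSON.
  const json: unknown = JSON.parse(raw);
  // Result.parse throws a ZodError on any schema drift; the caller retries or falls back.
  return Result.parse(json);
}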
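
On the calling side, one way to treat a violation as a failed request is to convert the thrown error into an explicit failed attempt that retry logic can act on. The sketch below is dependency-free and hypothetical: `Attempt`, `tryParse`, and the stand-in `parseCategory` are names invented for this example, and in practice you would pass `parseModelOutput` as the parser.

```typescript
// Outcome of one attempt: either a validated value or the reason it failed.
type Attempt<T> =
  | { ok: true; value: T }
  | { ok: false; error: string };

// Run any throwing parser and convert the exception into a failed attempt.
function tryParse<T>(raw: string, parse: (raw: string) => T): Attempt<T> {
  try {
    return { ok: true, value: parse(raw) };
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    return { ok: false, error: message };
  }
}

// Stand-in parser so the example runs without dependencies.
const CATEGORIES = ["bug", "feature", "question"];

function parseCategory(raw: string): { category: string } {
  const json = JSON.parse(raw);
  if (typeof json?.category !== "string" || !CATEGORIES.includes(json.category)) {
    throw new Error("unexpected category");
  }
  return { category: json.category };
}

const good = tryParse('{"category":"bug"}', parseCategory);
const drifted = tryParse('{"category":"praise"}', parseCategory);
const malformed = tryParse("Sure! Here is the JSON:", parseCategory);
```

Here `good.ok` is true, while `drifted` and `malformed` both come back with `ok: false`, so the caller handles schema drift and unparseable text through the same failure path.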

Evaluation is your test suite

You cannot ship changes to a prompt or model with confidence unless you can measure quality. Build an evaluation set of representative inputs with known-good outcomes, and run it on every change — the same discipline you'd apply to any other critical code path.

  • Curate 50–200 real, representative inputs
  • Define a measurable success criterion per case
  • Run evals in CI on every prompt or model change
  • Track quality, latency and cost as a trend over time
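
The checklist above can be sketched as a small runner. This is an illustration under assumptions: the `EvalCase` and `EvalReport` shapes, `runEvals`, and `meetsThreshold` are hypothetical names, and `model` is any async function that maps an input to an output.

```typescript
// One evaluation case: an input plus a check that decides pass or fail.
interface EvalCase {
  name: string;
  input: string;
  passes: (output: string) => boolean;
}

interface EvalReport {
  total: number;
  passed: number;
  passRate: number;
  failures: string[];
  meanLatencyMs: number;
}

// Run every case sequentially and summarize quality and latency.
async function runEvals(
  cases: EvalCase[],
  model: (input: string) => Promise<string>,
): Promise<EvalReport> {
  const failures: string[] = [];
  let totalLatencyMs = 0;
  for (const c of cases) {
    const started = Date.now();
    let ok = false;
    try {
      ok = c.passes(await model(c.input));
    } catch {
      ok = false; // A thrown error counts as a failed case.
    }
    totalLatencyMs += Date.now() - started;
    if (!ok) failures.push(c.name);
  }
  const total = cases.length;
  const passed = total - failures.length;
  return {
    total,
    passed,
    passRate: total === 0 ? 0 : passed / total,
    failures,
    meanLatencyMs: total === 0 ? 0 : totalLatencyMs / total,
  };
}

// Gate for CI: true only when there are cases and the pass rate meets the bar.
function meetsThreshold(report: EvalReport, minPassRate: number): boolean {
  return report.total > 0 && report.passRate >= minPassRate;
}
```

In CI, a script would call `runEvals`, store the report so the trend is visible over time, and exit non-zero when `meetsThreshold` returns false.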

Observe everything

Log every request and response, along with its token counts, latency and cost. When a user reports a bad answer, you need to be able to reconstruct exactly what happened. Without traces, debugging an AI system is guesswork.
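
A minimal sketch of such a trace, assuming a hypothetical `ModelResponse` shape that carries token counts and illustrative per-million-token prices. Real SDKs report usage under their own field names, and `tracedCall`, `estimateCostUsd`, and `sink` are names made up for this example.

```typescript
// Hypothetical response shape; real SDKs report usage under their own field names.
interface ModelResponse {
  text: string;
  inputTokens: number;
  outputTokens: number;
}

// Illustrative prices in USD per million tokens; substitute your provider's rates.
interface Pricing {
  inputUsdPerMillion: number;
  outputUsdPerMillion: number;
}

// Everything needed to reconstruct one request after the fact.
interface TraceRecord {
  requestId: string;
  prompt: string;
  response: string | null;
  error: string | null;
  inputTokens: number;
  outputTokens: number;
  latencyMs: number;
  costUsd: number;
}

function estimateCostUsd(inputTokens: number, outputTokens: number, p: Pricing): number {
  return (inputTokens * p.inputUsdPerMillion + outputTokens * p.outputUsdPerMillion) / 1_000_000;
}

// Wrap a model call so that both success and failure leave a trace.
async function tracedCall(
  requestId: string,
  prompt: string,
  callModel: (prompt: string) => Promise<ModelResponse>,
  pricing: Pricing,
  sink: (record: TraceRecord) => void,
): Promise<ModelResponse> {
  const started = Date.now();
  let res: ModelResponse;
  try {
    res = await callModel(prompt);
  } catch (err) {
    sink({
      requestId, prompt, response: null,
      error: err instanceof Error ? err.message : String(err),
      inputTokens: 0, outputTokens: 0,
      latencyMs: Date.now() - started, costUsd: 0,
    });
    throw err; // Tracing must not swallow the failure.
  }
  sink({
    requestId, prompt, response: res.text, error: null,
    inputTokens: res.inputTokens, outputTokens: res.outputTokens,
    latencyMs: Date.now() - started,
    costUsd: estimateCostUsd(res.inputTokens, res.outputTokens, pricing),
  });
  return res;
}
```

The `sink` is deliberately just a function: in a sketch it can push to an array, and in a real system it would write to whatever logging or tracing backend you already run.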

"If you can't see what your AI system did, you don't operate it — you hope."

NWARRAH Engineering

Control cost before it controls you

Token costs scale with usage in ways that surprise teams: long prompts, growing context and retries all multiply the bill for a single request. Cache aggressively, choose the smallest model that passes your evals, and set hard per-user and per-org budgets enforced in code.
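
As a sketch of what "enforced in code" can look like, the example below keeps an in-memory token budget per user and per org, plus an exact-match response cache. The class names, the choice of tokens as the budget unit, and the in-memory storage are all assumptions for illustration; budget resets and shared storage across processes are left out.

```typescript
// Tracks token spend per user and per org against hard limits.
class TokenBudget {
  private readonly userLimit: number;
  private readonly orgLimit: number;
  private readonly usedByUser = new Map<string, number>();
  private readonly usedByOrg = new Map<string, number>();

  constructor(userLimit: number, orgLimit: number) {
    this.userLimit = userLimit;
    this.orgLimit = orgLimit;
  }

  // Record the spend and return true, or return false and record nothing
  // if either limit would be exceeded.
  tryReserve(userId: string, orgId: string, tokens: number): boolean {
    const userTotal = (this.usedByUser.get(userId) ?? 0) + tokens;
    const orgTotal = (this.usedByOrg.get(orgId) ?? 0) + tokens;
    if (userTotal > this.userLimit || orgTotal > this.orgLimit) return false;
    this.usedByUser.set(userId, userTotal);
    this.usedByOrg.set(orgId, orgTotal);
    return true;
  }
}

// Exact-match cache keyed on model name and prompt.
class ResponseCache {
  private readonly entries = new Map<string, string>();

  private key(model: string, prompt: string): string {
    return JSON.stringify([model, prompt]);
  }

  get(model: string, prompt: string): string | undefined {
    return this.entries.get(this.key(model, prompt));
  }

  set(model: string, prompt: string, response: string): void {
    this.entries.set(this.key(model, prompt), response);
  }
}
```

A caller would check the cache first, call `tryReserve` with an estimated token count before any model call, and refuse the request outright when it returns false.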


None of this is glamorous. But it's the difference between an AI feature that quietly works for years and one that becomes a support burden the week after launch.

Written by

Misbah Uddin Hasanat

Senior Engineer

Systems and platform engineering — distributed systems, data pipelines and the unglamorous reliability work that keeps software running.
