Do I need evals if I'm just using a hosted model?

Yes. Hosted models change behavior across versions. Evals are how you detect regressions before your users do.

How do I control AI feature costs?

Cache repeated requests, use the smallest model that passes evaluation, and enforce per-user and per-organization budgets directly in code.

Building AI Systems That Survive Production

A working demo proves a model can do something interesting once. Production proves it can do that thing thousands of times a day, predictably, within a budget, while failing safely. The gap between those two is almost entirely engineering, not prompting.

Treat the model as an unreliable dependency

An LLM is a network call to a non-deterministic service with variable latency and occasional nonsense output. Once you frame it that way, the right patterns follow: timeouts, retries with backoff, fallbacks, and strict output validation at the boundary.

Never trust raw model output

Validate every structured response against a schema before it touches your database or business logic. Treat a schema violation as a failed request, not a value to coerce.

validateResponse.ts

import { z } from "zod";

const Result = z.object({
  category: z.enum(["bug", "feature", "question"]),
  confidence: z.number().min(0).max(1),
});

export function parseModelOutput(raw: string) {
  const json = JSON.parse(raw);
  // Throws on any drift — caller retries or falls back.
  return Result.parse(json);
}

Evaluation is your test suite

You cannot ship changes to a prompt or model with confidence unless you can measure quality. Build an evaluation set of representative inputs with known-good outcomes, and run it on every change — the same discipline you'd apply to any other critical code path.

Curate 50–200 real, representative inputs
Define a measurable success criterion per case
Run evals in CI on every prompt or model change
Track quality, latency and cost as a trend over time

Observe everything

Log every request and response, token counts, latency and cost. When a user reports a bad answer, you need to reconstruct exactly what happened. Without traces, debugging an AI system is guesswork.

"If you can't see what your AI system did, you don't operate it — you hope."
— NWARRAH Engineering

Control cost before it controls you

Token costs scale with usage in ways that surprise teams. Cache aggressively, choose the smallest model that passes your evals, and set hard per-user and per-org budgets enforced in code.

None of this is glamorous. But it's the difference between an AI feature that quietly works for years and one that becomes a support burden the week after launch.

#ai #system-design #reliability #llm

Written by

Misbah Uddin Hasanat

Senior Engineer

Systems and platform engineering — distributed systems, data pipelines and the unglamorous reliability work that keeps software running.

LinkedIn GitHub

Keep reading