Engineering by Brian Lopez

Zero Parse Failures at Scale: Structured Output with Local LLMs

On-premises AI pipelines need the same reliability guarantees as cloud APIs. We validated Ollama's grammar-constrained structured output at scale — 557 consecutive calls, zero parse failures.

A recurring requirement in client work — especially in healthcare, legal, and finance — is AI that processes sensitive data without that data ever leaving the client's environment. Cloud APIs are off the table. That means on-premises inference, which means the reliability and developer-experience guarantees you get with managed APIs need to be rebuilt from scratch.

This post documents one piece of that: getting structured output from a locally-hosted LLM to be as reliable as calling OpenAI’s API directly.

The Problem

We run an on-premises inference server for AI R&D — a dedicated machine we use to benchmark local model performance, validate architectures before recommending them to clients, and build pipelines that can operate entirely within a client’s firewall. One active workstream is a voice content processing pipeline: classifying podcast transcript segments, detecting speaker changes, filtering non-dialogue content like audience laughter.

The pipeline runs hundreds of LLM calls per batch. The question: would Ollama’s grammar-level schema enforcement actually guarantee zero parse failures at that scale, or would edge cases slip through under real production conditions? This matters beyond the specific pipeline — it’s a prerequisite for any on-premises AI deployment where downstream code needs to trust the LLM’s output.

What We Built

We migrated from raw requests calls against Ollama's /api/generate endpoint to the openai Python SDK pointed at Ollama's OpenAI-compatible /v1/ endpoint. This unlocked the structured outputs interface.

Instead of prompting for JSON and parsing the text response, we define a Pydantic model and pass it as the response schema:

from pydantic import BaseModel
from openai import OpenAI

# Ollama's OpenAI-compatible endpoint; the api_key is required by the SDK but unused.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

class TranscriptClassification(BaseModel):
    is_monologue: bool
    speaker_count: int
    confidence_score: int
    contains_audience_reaction: bool

# `segment` holds the transcript text to classify.
response = client.beta.chat.completions.parse(
    model="qwen3:8b",
    messages=[{"role": "user", "content": segment}],
    response_format=TranscriptClassification,
)
result = response.choices[0].message.parsed  # a TranscriptClassification instance

Ollama uses grammar-constrained generation under the hood. The model can only produce tokens that conform to your schema — not approximately, not usually, but as a hard constraint at the token level.
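The same mechanism is reachable from the native endpoints: recent Ollama versions accept a JSON schema in the request's format field, and Pydantic can emit that schema directly. A minimal sketch, reusing the model above (the prompt text and requests call are illustrative, not our production code):

```python
from pydantic import BaseModel

class TranscriptClassification(BaseModel):
    is_monologue: bool
    speaker_count: int
    confidence_score: int
    contains_audience_reaction: bool

# Pydantic emits a standard JSON Schema that Ollama can enforce token-by-token.
schema = TranscriptClassification.model_json_schema()

# Native /api/generate payload: the schema goes in the "format" field.
payload = {
    "model": "qwen3:8b",
    "prompt": "Classify this segment: ...",
    "format": schema,
    "stream": False,
}
# requests.post("http://localhost:11434/api/generate", json=payload)
# then: TranscriptClassification.model_validate_json(resp.json()["response"])
```

Either way, validation happens twice: once at generation time via the grammar, and once at parse time via Pydantic.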

Alongside this, we built a 15-case eval harness to stress-test the classification logic itself: clean monologues, sitcom [laughter] tags, mid-sentence cuts, and multi-speaker exchanges. We ran two prompt variants and measured pass rate against typed assertions.
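The harness itself is a simple pattern: each case pairs an input segment with expected typed field values, and pass rate is measured per prompt variant. A sketch under stated assumptions (the case names, segments, and the injected classify callable are illustrative; the real harness has 15 cases):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    name: str
    segment: str
    expected: dict  # field name -> expected typed value

CASES = [
    EvalCase("clean_monologue",
             "So I was thinking about this the other day, and it struck me...",
             {"is_monologue": True, "contains_audience_reaction": False}),
    EvalCase("laughter_tag",
             "And then he just walked out! [laughter] Anyway, where was I...",
             {"is_monologue": True, "contains_audience_reaction": True}),
]

def run_evals(classify: Callable[[str], object], cases=CASES) -> float:
    """Run each case through a classifier and assert on its typed fields."""
    passed = 0
    for case in cases:
        result = classify(case.segment)
        if all(getattr(result, field) == want for field, want in case.expected.items()):
            passed += 1
        else:
            print(f"FAIL {case.name}")
    return passed / len(cases)
```

Because the classifier returns a typed Pydantic object rather than raw text, the assertions compare real booleans and ints, never string fragments.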

What We Found

557 consecutive production calls. Zero parse failures.

That’s the full run of our eval harness plus a production batch. Not a cherry-picked result.

The eval work surfaced something useful too: the baseline prompt was false-rejecting [laughter] annotations as multi-speaker dialogue (0/4 passing). Adding a single explicit clarification brought it to 4/4. Similarly, multi-speaker detection improved from 0/5 to 5/5 by naming bare dash dialogue markers as a rejection criterion. The schema enforcement handles structure; prompt quality still drives classification accuracy.

The score threshold needed one calibration pass — adjusted from ≤3 to ≤4 after observing consistent borderline scoring on genuinely ambiguous embedded dialogue.
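Downstream, that calibration is a one-line filter. A sketch (the helper name and routing behavior are ours, not the pipeline's actual code):

```python
LOW_CONFIDENCE_THRESHOLD = 4  # calibrated up from 3 after the eval pass

def needs_review(result) -> bool:
    """Route borderline classifications to manual review instead of trusting the label.

    `result` is any object with a `confidence_score` int field, e.g. the
    parsed TranscriptClassification from the structured-output call.
    """
    return result.confidence_score <= LOW_CONFIDENCE_THRESHOLD
```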

Why It Matters

The Pydantic + Ollama /v1/ pattern is now our default for any on-premises inference task that produces structured data. It eliminates an entire class of runtime errors and makes pipelines deterministic in a way that prompt-based parsing can’t match.

For regulated industries: Healthcare, legal, and financial services clients increasingly need AI that operates entirely within their own infrastructure. This architecture delivers the same reliability you’d expect from a managed cloud API — zero parse failures, typed outputs, schema validation — with zero data leaving the building. That’s not a nice-to-have for a HIPAA-covered entity; it’s the only viable path.

For on-premises AI infrastructure work: This is part of a larger research program into what it takes to deploy production-grade local inference. Knowing that grammar-constrained generation is reliable at hundreds of calls is a building block — it means we can architect pipelines for clients where the LLM layer is on-prem without adding a reliability tax.

On cost economics: At scale, the break-even math between on-premises GPU inference and cloud API costs shifts decisively. Validating that local inference is reliable enough to trust with production workloads is a prerequisite to making that argument to clients with confidence.

If you’re building AI pipelines where data governance matters and you’re still parsing text responses from a local model, this is the upgrade path.

Ready to Build Something That Works?

Every engagement starts with a free 30-minute consultation. Let's talk about your project.

Start the Conversation