Preventing LLM Hallucinations in Domain-Critical Applications: A JSON-First Architecture Approach
We present a practical architecture pattern for eliminating LLM hallucinations in procurement and financial systems. By constraining AI outputs to structured JSON validated against real catalogue data, we achieved a zero error rate on critical fields across 500+ maritime catalogue items. This paper documents the three-stage pipeline (Analyst, Generator, Validator) deployed in a production maritime RFQ system serving real procurement teams.
Clinton Onyekwere
Clinton AI Ltd
Abstract
Large language models have demonstrated remarkable fluency in generating natural language, but this fluency creates a dangerous illusion of competence in domain-critical applications. When an LLM fabricates a part number that looks plausible but does not correspond to any real product, the error is nearly impossible to catch through human review alone. In industries like maritime procurement, where a single incorrect specification can trigger thousands of pounds in wasted shipping, vessel delays, and equipment returns, the consequences of hallucination extend far beyond a bad user experience.
This paper presents a JSON-first architecture for eliminating LLM hallucinations in structured document generation, developed and validated in a production maritime procurement system. Rather than attempting to make language models more truthful through prompting strategies or fine-tuning, our approach restructures the problem itself: we constrain the model to operate exclusively on validated catalogue data, output structured JSON rather than free text, and validate every field against ground-truth sources before delivery. The system has been running in production since January 2026, processing commercial Request for Quotation (RFQ) documents across 500+ catalogue items with zero hallucination incidents on critical fields.
Our contribution is primarily architectural. We demonstrate that for a well-defined class of document generation problems, those where outputs must reference real-world entities with verifiable identifiers, the hallucination problem can be solved through system design rather than model improvement. We describe the three-stage pipeline (Analyst, Generator, Validator), discuss key implementation decisions, and offer practical recommendations for engineers building LLM-powered systems in error-critical domains.
1. Introduction
The hallucination problem in large language models is well-documented. Models generate text that is fluent, confident, and wrong. In conversational applications, this is an inconvenience. In domain-critical applications (healthcare, legal, financial services, industrial procurement), it is a liability.
The standard response to the hallucination problem has followed two broad strategies. The first is model-level: improve the training data, apply reinforcement learning from human feedback (RLHF), or fine-tune on domain-specific corpora. The second is prompt-level: instruct the model to cite sources, use chain-of-thought reasoning, or express uncertainty. Both strategies reduce hallucination rates but do not eliminate them. A system that hallucinates 2% of the time instead of 10% is still unsuitable for applications where errors have direct financial or safety consequences.
We propose a third strategy: architectural elimination. Rather than reducing the probability that a model will hallucinate, we design systems where hallucinations are structurally impossible on critical fields. The model never generates identifiers, specifications, or reference data from its parameters. Instead, it operates on validated source data, outputs structured JSON with discrete fields, and every field is verified against ground truth before reaching the end user.
This is not a general-purpose solution. It applies specifically to document generation tasks where outputs reference real-world entities: part numbers, product codes, legal citations, medical procedure codes, financial instrument identifiers. For this class of problems, which represents a substantial share of enterprise AI use cases, the architecture we describe eliminates the category of hallucination that causes the most damage: confident fabrication of specific, verifiable facts.
We developed and validated this architecture in a production maritime procurement system that generates Request for Quotation documents. The system processes equipment requests against a database of 500+ real catalogue items across 11 equipment catalogues, producing commercially binding documents that are sent to suppliers. It has been live since January 2026 with zero hallucination incidents on critical fields: part numbers, manufacturer names, model numbers, and equipment specifications.
2. Background: Why Maritime Procurement Is Unforgiving
2.1 The Cost of Getting It Wrong
The global commercial shipping fleet comprises approximately 60,000 vessels, each requiring continuous procurement of spare parts, maintenance materials, and equipment throughout an operational life of 25 to 30 years. The ship spares and equipment market is valued at approximately $9 billion annually and is projected to grow to $12-14 billion by 2032.
Maritime procurement follows a well-established workflow. A vessel or technical superintendent identifies a need. A procurement officer prepares an RFQ document listing the required items with precise part numbers, manufacturer details, specifications, and quantities. That document is distributed to multiple suppliers for competitive pricing. Quotes are compared, and an order is placed.
The bottleneck, and the risk, sits in the document preparation step. Writing an RFQ requires cross-referencing equipment catalogues, matching part numbers to the correct vessel's machinery, and structuring the output in a format suppliers can act on. A single vessel refit can generate 50 to 200 RFQ groups, each taking 30 to 60 minutes to prepare manually.
The financial exposure from errors is severe. A wrong part number means the wrong equipment gets shipped to a vessel that might be 8,000 miles from the nearest return facility. Vessel idle time runs $15,000 to $30,000 per day. The cost of a single procurement error (return shipping, reordering, and vessel delay) can easily reach tens of thousands of pounds.
2.2 Why General-Purpose LLMs Are Dangerous Here
A procurement officer could paste an equipment list into a general-purpose LLM and request an RFQ. The result would look professional and read convincingly. It would also be dangerous to use.
General-purpose models do not have access to real equipment catalogues. They generate part numbers that are syntactically plausible but correspond to no real product. They guess at manufacturer names based on training data that may be outdated or drawn from a different equipment class entirely. The output reads well, which is precisely the problem. The hallucinations are embedded in fluent prose, making them nearly impossible to detect without line-by-line verification against source catalogues.
This is the core challenge: in free-text generation, hallucinations hide. A fabricated part number inside a well-written paragraph looks identical to a correct one. The error becomes visible only when the wrong equipment arrives at a port thousands of miles away.
2.3 The Limitations of Prompt-Based Mitigation
Prompt engineering can reduce hallucination rates. Instructions like "only reference data provided in the context" or "if you are unsure, say so" improve reliability in general use. But they provide probabilistic improvement, not guarantees. A model instructed to only cite provided data will still occasionally synthesise plausible-looking information from its parameters, particularly when the provided context is ambiguous or incomplete.
For many applications, reducing hallucination from 10% to 2% is a meaningful improvement. For maritime procurement, 2% is still an unacceptable failure rate. If a system generates 500 RFQ items per month and 2% contain hallucinated part numbers, that is 10 potentially costly errors, roughly two per week. No procurement team would adopt a system with that error profile.
The question, then, is not how to make the model more truthful. It is how to build a system where the model's truthfulness on critical fields is irrelevant because those fields are never generated from the model's parameters in the first place.
3. Architecture: The JSON-First Approach
3.1 Design Principles
Our architecture rests on three principles:
1. Separation of generation and reference. The LLM generates structure, relationships, and natural language descriptions. It never generates identifiers, specifications, or reference data. Those come exclusively from validated source databases.
2. Structured output over free text. Every output is a JSON object with discrete, typed fields. There is no prose in which errors can hide. Every field is individually addressable and verifiable.
3. Independent validation. A separate pipeline stage verifies every critical field against source data. The validator does not trust the generator. It treats generator output as untrusted input and checks it independently.
These principles compose into a system where hallucinations on critical fields are not merely unlikely but structurally impossible. The model cannot hallucinate a part number because it never generates part numbers. It selects them from a validated catalogue, and the selection is verified before delivery.
3.2 The Three-Stage Pipeline
Stage 1: Analyst
The Analyst receives the user's equipment request (typically a set of keywords like "fuel injector nozzle", "cylinder head", or "turbocharger bearing") along with vessel type and equipment context.
Its job is to understand the request and plan the generation strategy. It searches the equipment catalogue database using a combination of vector similarity search (pgvector), keyword matching, and deterministic rule-based lookups. For each requested item, it determines the sourcing approach:
- Direct catalogue match: An exact or near-exact match exists in the catalogue with a verified part number. The Analyst retrieves the catalogue entry and passes it forward.
- Specification-based sourcing: The item is a standard consumable or a component without catalogue history. The Analyst classifies it for description-based generation, with explicit flags indicating that no verified part number is available.
- Vessel equipment correlation: The Analyst cross-references the vessel's equipment records (engine make, model, serial number) to ensure the correct manufacturer and model context is carried forward.
The Analyst's output is a structured plan: a JSON document mapping each requested item to its sourcing strategy, associated catalogue data, and relevant vessel equipment context. No document content is generated at this stage. The Analyst's sole purpose is to assemble the verified data that the Generator will use.
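The Analyst's branching between catalogue-matched and specification-based sourcing can be sketched as follows. The field names, score threshold, and plan schema here are illustrative assumptions, not the production implementation:

```python
# Illustrative sketch of the Analyst's sourcing decision. The schema and
# threshold are assumptions for illustration only.

def plan_item(request: str, catalogue_hits: list[dict], threshold: float = 0.85) -> dict:
    """Map one requested item to a sourcing strategy."""
    best = max(catalogue_hits, key=lambda h: h["score"], default=None)
    if best is not None and best["score"] >= threshold:
        # Direct catalogue match: carry the verified entry forward unchanged.
        return {
            "request": request,
            "strategy": "catalogue_match",
            "catalogue_entry": best["entry"],
            "verified_part_number": True,
        }
    # No verified part number exists: flag for description-based sourcing
    # so downstream stages know this item has no ground-truth identifier.
    return {
        "request": request,
        "strategy": "specification_based",
        "catalogue_entry": None,
        "verified_part_number": False,
    }

hits = [{"score": 0.92, "entry": {"part_number": "MAN-FI-2876"}}]
plan = plan_item("fuel injector nozzle", hits)
```

The explicit `verified_part_number` flag is what lets the Validator later apply different checks to catalogue-matched and specification-based items.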
Stage 2: Generator
The Generator receives the Analyst's structured plan and produces the RFQ content. Critically, it does not generate from scratch. It assembles content from the data the Analyst has already retrieved and validated.
For each item in the RFQ, the Generator produces a JSON object with discrete fields:
{
  "item_number": 1,
  "part_number": "MAN-FI-2876",
  "manufacturer": "MAN Energy Solutions",
  "model": "48/60CR",
  "description": "Fuel injector nozzle assembly",
  "quantity": 4,
  "unit": "PCS",
  "sourcing_method": "catalogue_match",
  "catalogue_reference": "cat_marine_engines_001"
}
The part_number, manufacturer, and model fields are not generated by the LLM. They are copied from the catalogue data the Analyst retrieved. The LLM's role is constrained to:
- Selecting which catalogue entry best matches each request item (from the Analyst's shortlist)
- Generating natural language descriptions for items without catalogue matches
- Structuring items into enquiry rounds (typically five rounds of four to five items each)
- Ensuring the overall document meets formatting and completeness requirements
This is the key architectural decision. The LLM does what it is good at (understanding context, making selection decisions, generating descriptive language) while being structurally prevented from doing what it is bad at: inventing specific identifiers and technical specifications.
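A minimal sketch of that constraint, assuming a simple dict-based catalogue entry (the names are illustrative, not the production schema): the critical fields are copied verbatim from the entry the Analyst retrieved, so the model's only decision is which entry to use.

```python
def assemble_item(item_number: int, quantity: int, unit: str, entry: dict) -> dict:
    """Build one RFQ line item from a validated catalogue entry.

    part_number, manufacturer, and model are copied from the catalogue
    record, never produced by the model.
    """
    return {
        "item_number": item_number,
        "part_number": entry["part_number"],      # copied, not generated
        "manufacturer": entry["manufacturer"],    # copied, not generated
        "model": entry["model"],                  # copied, not generated
        "description": entry["description"],
        "quantity": quantity,
        "unit": unit,
        "sourcing_method": "catalogue_match",
        "catalogue_reference": entry["catalogue_reference"],
    }

entry = {
    "part_number": "MAN-FI-2876",
    "manufacturer": "MAN Energy Solutions",
    "model": "48/60CR",
    "description": "Fuel injector nozzle assembly",
    "catalogue_reference": "cat_marine_engines_001",
}
item = assemble_item(1, 4, "PCS", entry)
```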
Stage 3: Validator
The Validator receives the Generator's output and treats it as untrusted. It performs independent verification:
- Part number verification. Every part number in the output is cross-referenced against the catalogue database. If a part number does not exist in the catalogue, the item is flagged.
- Manufacturer and model verification. Manufacturer names and model numbers are checked against vessel equipment records. Mismatches are flagged.
- Structural completeness. Each enquiry round is checked for the correct number of items. Missing fields are flagged.
- Description appropriateness. For items sourced by specification rather than catalogue match, the Validator checks that descriptions are consistent with the equipment category and do not contain claims that cannot be verified.
- Confidence scoring. Each item receives a confidence score from 0 to 10. The overall RFQ receives a composite score and a binary verdict: pass or fail.
Items that fail validation are either corrected automatically (if the correction is unambiguous, for example, a part number that is off by one character and has a clear match in the catalogue) or flagged for human review. In practice, the correction rate has been extremely low because the Generator is working from validated data in the first place.
Only validated output reaches the final document. The Validator acts as a structural guarantee, not a probabilistic filter.
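A simplified sketch of the fail-closed check, substituting an in-memory mapping for the production database (names and flag strings are illustrative):

```python
def validate_item(item: dict, catalogue: dict) -> list[str]:
    """Cross-check one generated item against ground truth; fail closed.

    `catalogue` maps part numbers to their verified entries. Any field
    that cannot be verified produces a flag; an empty list means pass.
    """
    flags = []
    entry = catalogue.get(item.get("part_number"))
    if entry is None:
        # Fail closed: an unverifiable part number is a failure, not a warning.
        flags.append("part_number not in catalogue")
    else:
        if item.get("manufacturer") != entry["manufacturer"]:
            flags.append("manufacturer mismatch")
        if item.get("model") != entry["model"]:
            flags.append("model mismatch")
    for field in ("description", "quantity", "unit"):
        if not item.get(field):
            flags.append(f"missing field: {field}")
    return flags

catalogue = {
    "MAN-FI-2876": {"manufacturer": "MAN Energy Solutions", "model": "48/60CR"},
}
good = {"part_number": "MAN-FI-2876", "manufacturer": "MAN Energy Solutions",
        "model": "48/60CR", "description": "Fuel injector nozzle assembly",
        "quantity": 4, "unit": "PCS"}
bad = dict(good, part_number="MAN-FI-9999")  # a fabricated identifier
```

Because the check runs against the same database the Analyst draws from, a fabricated identifier cannot pass regardless of how plausible it looks.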
3.3 How Structured Outputs Prevent Hallucinations
The critical insight is that hallucinations in free-text generation are dangerous because they are invisible. A fabricated part number embedded in a paragraph of otherwise accurate text is indistinguishable from a correct one without external verification.
JSON-first architecture makes every claim individually addressable. A part number is not buried in a sentence; it is a discrete field that can be programmatically checked against a database in milliseconds. The surface area for undetectable hallucination shrinks to near zero.
This does not mean the LLM cannot make errors. It can select the wrong catalogue entry, generate an inappropriate description, or structure items incorrectly. But these errors are visible, detectable, and correctable. They are not hidden in fluent prose where they pass unnoticed until they cause real-world damage.
The distinction matters: we do not claim to have built an error-free system. We claim to have built a system where errors on critical fields are structurally prevented, and errors on non-critical fields are surfaced rather than concealed.
4. Implementation: Key Technical Decisions
4.1 Why JSON Over Free Text
The decision to use JSON as the primary output format was the single most consequential architectural choice. It was also the least intuitive.
The natural approach to automating document generation with LLMs is to generate documents. Give the model a prompt, receive a formatted document. This approach leverages the model's greatest strength, fluent text generation, and produces outputs that look immediately useful.
We rejected this approach because it optimises for the wrong thing. In procurement, the goal is not a document that reads well. It is a document that is correct. A beautifully written RFQ with one wrong part number is worse than an ugly spreadsheet with all correct part numbers, because the beautiful document will be trusted and acted upon.
JSON forces every piece of information into a discrete, typed field. There is nowhere for hallucinations to hide. A part number is either in the part_number field and verifiable, or it is absent. There is no middle ground where a plausible-looking number is woven into a sentence and escapes detection.
The tradeoff is that JSON output requires a transformation step to produce human-readable documents. We generate the final document from the validated JSON as a separate, deterministic process. This adds complexity but ensures that the document generation step cannot introduce errors: it is a pure transformation of already-validated data.
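One way to sketch that deterministic step (the output format here is illustrative, not the production template):

```python
def render_rfq_line(item: dict) -> str:
    # A pure transformation of already-validated JSON: no model call is
    # involved, so this step cannot introduce new factual errors.
    return (
        f"{item['item_number']:>3}. {item['description']} | "
        f"PN: {item['part_number']} | "
        f"{item['manufacturer']} {item['model']} | "
        f"{item['quantity']} {item['unit']}"
    )

line = render_rfq_line({
    "item_number": 1,
    "part_number": "MAN-FI-2876",
    "manufacturer": "MAN Energy Solutions",
    "model": "48/60CR",
    "description": "Fuel injector nozzle assembly",
    "quantity": 4,
    "unit": "PCS",
})
```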
4.2 Catalogue Data as Ground Truth
The system maintains a PostgreSQL database containing 500+ equipment items across 11 catalogues, plus 200+ vessel equipment entries mapping each vessel type to its specific machinery.
This database serves as the single source of truth. Every identifier in the system (part numbers, manufacturer names, model numbers) originates from this database, not from the LLM's parameters. The database is maintained by domain experts (procurement officers and technical superintendents) and is subject to its own quality assurance process independent of the AI pipeline.
We use pgvector for semantic search, enabling the Analyst to find relevant catalogue items even when the user's terminology does not exactly match the catalogue's. A search for "fuel injector" also surfaces "injection nozzle" and related components. This bridges the gap between how procurement officers describe what they need and how catalogues classify what they sell, without requiring the LLM to guess at the mapping.
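A hypothetical pgvector similarity query for this lookup might look like the following; the table and column names are assumptions for illustration. pgvector's `<=>` operator computes cosine distance between the stored embedding and a precomputed query embedding, so "fuel injector" ranks "injection nozzle" entries nearby even without keyword overlap. In practice it would be executed through a driver such as psycopg with the embedding passed as a parameter.

```python
# Illustrative semantic-search query over the catalogue table
# (schema names are assumptions, not the production database).
CATALOGUE_SEARCH_SQL = """
SELECT part_number, manufacturer, model, description,
       embedding <=> %(query_embedding)s::vector AS cosine_distance
FROM catalogue_items
ORDER BY cosine_distance
LIMIT %(top_k)s;
"""
```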
4.3 The Validation Pipeline
The Validator is implemented as a separate module with its own database connections and its own logic. It does not share state with the Generator. This independence is deliberate: if the Generator has a systematic bug that produces a particular class of error, the Validator catches it because it is checking against source data independently, not against the Generator's internal representation.
The Validator operates on the following principles:
- Fail closed. If a field cannot be verified, the item is flagged, not passed through. Ambiguity is treated as failure.
- Corrections are auditable. When the Validator corrects an item, the original (incorrect) value and the correction are both logged. This creates a feedback loop for improving the Generator over time.
- Confidence scores are conservative. The scoring algorithm penalises uncertainty more than it rewards correctness. A score of 8/10 means high confidence, not "probably fine."
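The paper does not specify the scoring algorithm, but a toy sketch of a conservative scheme, where unresolved flags cost more than auto-corrections and the floor is zero, might look like:

```python
def confidence_score(flags: list[str], corrections: int) -> int:
    """Score one item from 0 to 10, penalising uncertainty heavily.

    The weights here are illustrative assumptions: each unresolved flag
    costs 3 points, each auto-correction costs 2, and the score never
    goes below 0.
    """
    return max(0, 10 - 3 * len(flags) - 2 * corrections)
```

Under this scheme a clean item scores 10, while a single mismatch plus one correction already drops the item to 5, reflecting the "8/10 means high confidence, not probably fine" stance.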
5. Results
5.1 Production Performance
The system has been running in production since January 2026, processing commercial RFQ documents for a maritime procurement company. The following results reflect production usage, not benchmarks or test scenarios.
Zero hallucination incidents on critical fields. In all RFQs processed since deployment, there have been zero instances of fabricated part numbers, incorrect manufacturer names, or wrong model numbers reaching the final output. This is not a statistical claim about a low rate; it is a structural outcome of the architecture. Critical fields are sourced from the catalogue database and verified by the Validator. The LLM does not generate them.
Processing time reduction. RFQ document preparation has been reduced from 30 to 60 minutes of manual work to 60 to 90 seconds of automated processing. This represents a reduction of approximately 95 to 97 percent.
Catalogue coverage. The system currently indexes 500+ equipment items across 11 catalogues, with 200+ vessel equipment entries. For items with catalogue matches, the accuracy of part number retrieval is 100% (verified by the Validator against source data).
Zero-manual-QA operation. The validation pipeline has eliminated the need for human review of generated documents on critical fields. Procurement officers review outputs for business context (quantities, priorities, supplier selection) but do not need to verify part numbers, manufacturer names, or equipment specifications. This represents a fundamental shift from quality-by-inspection to quality-by-design.
5.2 Validation Pipeline Metrics
The Validator catches and corrects a small number of Generator errors in each batch. These are predominantly:
- Selection errors where the Generator chose a plausible but suboptimal catalogue match
- Structural errors such as duplicate items across enquiry rounds
- Description inconsistencies for specification-based items
These errors would have been invisible in a free-text generation system. In the JSON-first architecture, they are discrete, flagged, and corrected before delivery.
6. Discussion
6.1 Lessons Learned
The most important design decision was the earliest one. Choosing JSON over free text as the primary output format shaped every subsequent decision. It made validation possible, made errors visible, and made the system auditable.
Separation of concerns is not just good engineering; it is a safety mechanism. The three-stage pipeline is more complex than a single-prompt approach. That complexity is the point. Each stage has a narrow responsibility and can be tested, monitored, and improved independently.
Domain expertise is irreplaceable. The catalogue database is maintained by people who understand maritime equipment. The AI system is only as good as the domain knowledge encoded in its data and instructions. No amount of model capability compensates for wrong or missing catalogue data.
The LLM is a reasoning engine, not a knowledge base. Our architecture treats the LLM as a tool for understanding context, making selection decisions, and generating natural language: tasks where it excels. We do not treat it as a source of factual information about specific equipment, a task where it fails unpredictably.
6.2 Limitations
This approach requires structured source data. The JSON-first architecture works because maritime equipment has identifiers and a catalogue system. It would not work as cleanly in domains where the "correct answer" is subjective or where ground-truth data does not exist in a structured format.
The system is only as accurate as its catalogue. If the catalogue contains errors, the system will confidently produce documents with those errors. The system prevents AI-generated errors; it does not prevent human-generated data errors.
Non-critical fields remain probabilistic. Natural language descriptions for items without catalogue matches are generated by the LLM and are subject to the usual limitations of language model output.
6.3 When This Approach Works
The JSON-first architecture is well-suited to document generation tasks with the following characteristics:
- Outputs reference real-world entities with verifiable identifiers. Part numbers, product codes, legal citations, medical procedure codes, financial instrument tickers.
- Errors have significant consequences. Financial loss, safety risk, legal liability, regulatory non-compliance.
- Ground-truth data exists in a structured format. Catalogues, databases, registries, reference tables.
- The document structure is predictable. RFQs, invoices, compliance reports, clinical documentation, regulatory filings.
7. Conclusion: Practical Recommendations
For engineers and architects building LLM-powered systems in domain-critical applications, we offer the following recommendations based on our production experience:
1. Start with the error model, not the generation model. Before choosing an LLM or designing prompts, define what errors look like in your domain and what they cost.
2. Separate what the LLM generates from what it references. Identify which fields in your output are verifiable against external sources. Route those fields through lookup and validation rather than generation.
3. Use structured output formats. JSON, not prose. Discrete fields, not paragraphs. Make every claim individually addressable and programmatically verifiable.
4. Build independent validation. Your validator should not trust your generator. It should have its own data access, its own logic, and its own failure modes.
5. Fail closed on critical fields. If a field cannot be verified, do not pass it through with a warning. Flag it, hold it, and require explicit resolution.
6. Invest in your ground-truth data. The most sophisticated AI pipeline in the world cannot compensate for a bad catalogue. Your domain data is your competitive moat and your quality floor.
7. Log everything the Validator catches. Corrections are not just error handling; they are training data.
The hallucination problem in LLMs is real, well-documented, and unlikely to be fully solved at the model level in the near term. But for a significant class of enterprise document generation tasks, it does not need to be solved at the model level. It can be solved at the architecture level, by designing systems where the model's tendency to hallucinate simply does not matter on the fields where accuracy is critical.
Our production system demonstrates that this is not theoretical. It is practical, deployable, and commercially viable today.
Clinton Onyekwere is the founder of Clinton AI Ltd, a UK-based AI product studio building production AI systems for enterprise applications.