Preventing LLM Hallucinations in Domain-Critical Applications: A JSON-First Architecture Approach
We present a practical architecture pattern for eliminating LLM hallucinations in procurement and financial systems. By constraining AI outputs to structured JSON validated against real catalogue data, we achieved a zero error rate on critical fields across 500+ maritime catalogue items. This paper documents the three-stage pipeline (Analyst, Generator, Validator) deployed in a production maritime RFQ system serving real procurement teams.
Clinton Onyekwere
Clinton AI Ltd
Abstract
Large language models have demonstrated remarkable fluency in generating natural language, but this fluency creates a dangerous illusion of competence in domain-critical applications. When an LLM fabricates a part number that looks plausible but does not correspond to any real product, the error is nearly impossible to catch through human review alone. In industries like maritime procurement, where a single incorrect specification can trigger thousands of pounds in wasted shipping, vessel delays, and equipment returns, the consequences of hallucination extend far beyond a bad user experience.
This paper presents a JSON-first architecture for eliminating LLM hallucinations in structured document generation, developed and validated in a production maritime procurement system. Rather than attempting to make language models more truthful through prompting strategies or fine-tuning, our approach restructures the problem itself: we constrain the model to operate exclusively on validated catalogue data, output structured JSON rather than free text, and validate every field against ground-truth sources before delivery. The system has been running in production since January 2026, processing commercial Request for Quotation (RFQ) documents across 500+ catalogue items with zero hallucination incidents on critical fields.
Our contribution is primarily architectural. We demonstrate that for a well-defined class of document generation problems, those where outputs must reference real-world entities with verifiable identifiers, the hallucination problem can be solved through system design rather than model improvement. We describe the three-stage pipeline (Analyst, Generator, Validator), discuss key implementation decisions, and offer practical recommendations for engineers building LLM-powered systems in error-critical domains.
1. Introduction
The hallucination problem in large language models is well-documented. Models generate text that is fluent, confident, and wrong. In conversational applications, this is an inconvenience. In domain-critical applications (healthcare, legal, financial services, industrial procurement), it is a liability.
The standard response to the hallucination problem has followed two broad strategies. The first is model-level: improve the training data, apply reinforcement learning from human feedback (RLHF), or fine-tune on domain-specific corpora. The second is prompt-level: instruct the model to cite sources, use chain-of-thought reasoning, or express uncertainty. Both strategies reduce hallucination rates but do not eliminate them. A system that hallucinates 2% of the time instead of 10% is still unsuitable for applications where errors have direct financial or safety consequences.
We propose a third strategy: architectural elimination. Rather than reducing the probability that a model will hallucinate, we design systems where hallucinations are structurally impossible on critical fields. The model never generates identifiers, specifications, or reference data from its parameters. Instead, it operates on validated source data, outputs structured JSON with discrete fields, and every field is verified against ground truth before reaching the end user.
This is not a general-purpose solution. It applies specifically to document generation tasks where outputs reference real-world entities: part numbers, product codes, legal citations, medical procedure codes, financial instrument identifiers. For this class of problems, which represents a substantial share of enterprise AI use cases, the architecture we describe eliminates the category of hallucination that causes the most damage: confident fabrication of specific, verifiable facts.
We developed and validated this architecture in a production maritime procurement system that generates Request for Quotation documents. The system processes equipment requests against a database of 500+ real catalogue items across 11 equipment catalogues, producing commercially binding documents that are sent to suppliers. It has been live since January 2026 with zero hallucination incidents on critical fields: part numbers, manufacturer names, model numbers, and equipment specifications.
2. Background: Why Maritime Procurement Is Unforgiving
2.1 The Cost of Getting It Wrong
The global commercial shipping fleet comprises approximately 60,000 vessels, each requiring continuous procurement of spare parts, maintenance materials, and equipment throughout an operational life of 25 to 30 years. The ship spares and equipment market is valued at approximately $9 billion annually and is projected to grow to $12-14 billion by 2032.
Maritime procurement follows a well-established workflow. A vessel or technical superintendent identifies a need. A procurement officer prepares an RFQ document listing the required items with precise part numbers, manufacturer details, specifications, and quantities. That document is distributed to multiple suppliers for competitive pricing. Quotes are compared, and an order is placed.
The bottleneck, and the risk, sits in the document preparation step. Writing an RFQ requires cross-referencing equipment catalogues, matching part numbers to the correct vessel's machinery, and structuring the output in a format suppliers can act on. A single vessel refit can generate 50 to 200 RFQ groups, each taking 30 to 60 minutes to prepare manually.
The financial exposure from errors is severe. A wrong part number means the wrong equipment gets shipped to a vessel that might be 8,000 miles from the nearest return facility. Vessel idle time runs $15,000 to $30,000 per day. The cost of a single procurement error (return shipping, reordering, and vessel delay) can easily reach tens of thousands of pounds.
2.2 Why General-Purpose LLMs Are Dangerous Here
A procurement officer could paste an equipment list into a general-purpose LLM and request an RFQ. The result would look professional and read convincingly. It would also be dangerous to use.
General-purpose models do not have access to real equipment catalogues. They generate part numbers that are syntactically plausible but correspond to no real product. They guess at manufacturer names based on training data that may be outdated or drawn from a different equipment class entirely. The output reads well, which is precisely the problem. The hallucinations are embedded in fluent prose, making them nearly impossible to detect without line-by-line verification against source catalogues.
This is the core challenge: in free-text generation, hallucinations hide. A fabricated part number inside a well-written paragraph looks identical to a correct one. The error becomes visible only when the wrong equipment arrives at a port thousands of miles away.
2.3 The Limitations of Prompt-Based Mitigation
Prompt engineering can reduce hallucination rates. Instructions like "only reference data provided in the context" or "if you are unsure, say so" improve reliability in general use. But they provide probabilistic improvement, not guarantees. A model instructed to only cite provided data will still occasionally synthesise plausible-looking information from its parameters, particularly when the provided context is ambiguous or incomplete.
For many applications, reducing hallucination from 10% to 2% is a meaningful improvement. For maritime procurement, 2% is still an unacceptable failure rate. If a system generates 500 RFQ items per month and 2% contain hallucinated part numbers, that is 10 potentially costly errors, roughly two per week. No procurement team would adopt a system with that error profile.
The question, then, is not how to make the model more truthful. It is how to build a system where the model's truthfulness on critical fields is irrelevant because those fields are never generated from the model's parameters in the first place.
3. Architecture: The JSON-First Approach
3.1 Design Principles
Our architecture rests on three principles:
1. Separation of generation and reference. The LLM generates structure, relationships, and natural language descriptions. It never generates identifiers, specifications, or reference data. Those come exclusively from validated source databases.
2. Structured output over free text. Every output is a JSON object with discrete, typed fields. There is no prose in which errors can hide. Every field is individually addressable and verifiable.
3. Independent validation. A separate pipeline stage verifies every critical field against source data. The validator does not trust the generator. It treats generator output as untrusted input and checks it independently.
These principles compose into a system where hallucinations on critical fields are not merely unlikely but structurally impossible. The model cannot hallucinate a part number because it never generates part numbers. It selects them from a validated catalogue, and the selection is verified before delivery.
3.2 The Three-Stage Pipeline
Stage 1: Analyst
The Analyst receives the user's equipment request (typically a set of keywords like "fuel injector nozzle", "cylinder head", or "turbocharger bearing") along with vessel type and equipment context.
Its job is to understand the request and plan the generation strategy. It searches the equipment catalogue database using a combination of vector similarity search (pgvector), keyword matching, and deterministic rule-based lookups. For each requested item, it determines the sourcing approach:
- Direct catalogue match: An exact or near-exact match exists in the catalogue with a verified part number. The Analyst retrieves the catalogue entry and passes it forward.
- Specification-based sourcing: The item is a standard consumable or a component without catalogue history. The Analyst classifies it for description-based generation, with explicit flags indicating that no verified part number is available.
- Vessel equipment correlation: The Analyst cross-references the vessel's equipment records (engine make, model, serial number) to ensure the correct manufacturer and model context is carried forward.
The Analyst's output is a structured plan: a JSON document mapping each requested item to its sourcing strategy, associated catalogue data, and relevant vessel equipment context. No document content is generated at this stage. The Analyst's sole purpose is to assemble the verified data that the Generator will use.
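The Analyst's branching between catalogue-matched and specification-based sourcing can be sketched as follows. The field names, score threshold, and plan schema here are illustrative assumptions, not the production implementation:

```python
# Illustrative sketch of the Analyst's sourcing decision. The schema and
# threshold are assumptions for illustration only.

def plan_item(request: str, catalogue_hits: list[dict], threshold: float = 0.85) -> dict:
    """Map one requested item to a sourcing strategy."""
    best = max(catalogue_hits, key=lambda h: h["score"], default=None)
    if best is not None and best["score"] >= threshold:
        # Direct catalogue match: carry the verified entry forward unchanged.
        return {
            "request": request,
            "strategy": "catalogue_match",
            "catalogue_entry": best["entry"],
            "verified_part_number": True,
        }
    # No verified part number exists: flag for description-based sourcing
    # so downstream stages know this item has no ground-truth identifier.
    return {
        "request": request,
        "strategy": "specification_based",
        "catalogue_entry": None,
        "verified_part_number": False,
    }

hits = [{"score": 0.92, "entry": {"part_number": "MAN-FI-2876"}}]
plan = plan_item("fuel injector nozzle", hits)
```

The explicit `verified_part_number` flag is what lets the Validator later apply different checks to catalogue-matched and specification-based items.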
Stage 2: Generator
The Generator receives the Analyst's structured plan and produces the RFQ content. Critically, it does not generate from scratch. It assembles content from the data the Analyst has already retrieved and validated.
For each item in the RFQ, the Generator produces a JSON object with discrete fields:
{
  "item_number": 1,
  "part_number": "MAN-FI-2876",
  "manufacturer": "MAN Energy Solutions",
  "model": "48/60CR",
  "description": "Fuel injector nozzle assembly",
  "quantity": 4,
  "unit": "PCS",
  "sourcing_method": "catalogue_match",
  "catalogue_reference": "cat_marine_engines_001"
}
The part_number, manufacturer, and model fields are not generated by the LLM. They are copied from the catalogue data the Analyst retrieved. The LLM's role is constrained to:
- Selecting which catalogue entry best matches each request item (from the Analyst's shortlist)
- Generating natural language descriptions for items without catalogue matches
- Structuring items into enquiry rounds (typically five rounds of four to five items each)
- Ensuring the overall document meets formatting and completeness requirements
This is the key architectural decision. The LLM does what it is good at (understanding context, making selection decisions, generating descriptive language) while being structurally prevented from doing what it is bad at: inventing specific identifiers and technical specifications.
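A minimal sketch of that constraint, assuming a simple dict-based catalogue entry (the names are illustrative, not the production schema): the critical fields are copied verbatim from the entry the Analyst retrieved, so the model's only decision is which entry to use.

```python
def assemble_item(item_number: int, quantity: int, unit: str, entry: dict) -> dict:
    """Build one RFQ line item from a validated catalogue entry.

    part_number, manufacturer, and model are copied from the catalogue
    record, never produced by the model.
    """
    return {
        "item_number": item_number,
        "part_number": entry["part_number"],      # copied, not generated
        "manufacturer": entry["manufacturer"],    # copied, not generated
        "model": entry["model"],                  # copied, not generated
        "description": entry["description"],
        "quantity": quantity,
        "unit": unit,
        "sourcing_method": "catalogue_match",
        "catalogue_reference": entry["catalogue_reference"],
    }

entry = {
    "part_number": "MAN-FI-2876",
    "manufacturer": "MAN Energy Solutions",
    "model": "48/60CR",
    "description": "Fuel injector nozzle assembly",
    "catalogue_reference": "cat_marine_engines_001",
}
item = assemble_item(1, 4, "PCS", entry)
```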
Stage 3: Validator
The Validator receives the Generator's output and treats it as untrusted. It performs independent verification:
- Part number verification. Every part number in the output is cross-referenced against the catalogue database. If a part number does not exist in the catalogue, the item is flagged.
- Manufacturer and model verification. Manufacturer names and model numbers are checked against vessel equipment records. Mismatches are flagged.
- Structural completeness. Each enquiry round is checked for the correct number of items. Missing fields are flagged.
- Description appropriateness. For items sourced by specification rather than catalogue match, the Validator checks that descriptions are consistent with the equipment category and do not contain claims that cannot be verified.
- Confidence scoring. Each item receives a confidence score from 0 to 10. The overall RFQ receives a composite score and a binary verdict: pass or fail.
Items that fail validation are either corrected automatically (if the correction is unambiguous, for example, a part number that is off by one character and has a clear match in the catalogue) or flagged for human review. In practice, the correction rate has been extremely low because the Generator is working from validated data in the first place.
Only validated output reaches the final document. The Validator acts as a structural guarantee, not a probabilistic filter.
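A simplified sketch of the fail-closed check, substituting an in-memory mapping for the production database (names and flag strings are illustrative):

```python
def validate_item(item: dict, catalogue: dict) -> list[str]:
    """Cross-check one generated item against ground truth; fail closed.

    `catalogue` maps part numbers to their verified entries. Any field
    that cannot be verified produces a flag; an empty list means pass.
    """
    flags = []
    entry = catalogue.get(item.get("part_number"))
    if entry is None:
        # Fail closed: an unverifiable part number is a failure, not a warning.
        flags.append("part_number not in catalogue")
    else:
        if item.get("manufacturer") != entry["manufacturer"]:
            flags.append("manufacturer mismatch")
        if item.get("model") != entry["model"]:
            flags.append("model mismatch")
    for field in ("description", "quantity", "unit"):
        if not item.get(field):
            flags.append(f"missing field: {field}")
    return flags

catalogue = {
    "MAN-FI-2876": {"manufacturer": "MAN Energy Solutions", "model": "48/60CR"},
}
good = {"part_number": "MAN-FI-2876", "manufacturer": "MAN Energy Solutions",
        "model": "48/60CR", "description": "Fuel injector nozzle assembly",
        "quantity": 4, "unit": "PCS"}
bad = dict(good, part_number="MAN-FI-9999")  # a fabricated identifier
```

Because the check runs against the same database the Analyst draws from, a fabricated identifier cannot pass regardless of how plausible it looks.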
3.3 How Structured Outputs Prevent Hallucinations
The critical insight is that hallucinations in free-text generation are dangerous because they are invisible. A fabricated part number embedded in a paragraph of otherwise accurate text is indistinguishable from a correct one without external verification.
JSON-first architecture makes every claim individually addressable. A part number is not buried in a sentence; it is a discrete field that can be programmatically checked against a database in milliseconds. The surface area for undetectable hallucination shrinks to near zero.
This does not mean the LLM cannot make errors. It can select the wrong catalogue entry, generate an inappropriate description, or structure items incorrectly. But these errors are visible, detectable, and correctable. They are not hidden in fluent prose where they pass unnoticed until they cause real-world damage.
The distinction matters: we do not claim to have built an error-free system. We claim to have built a system where errors on critical fields are structurally prevented, and errors on non-critical fields are surfaced rather than concealed.
4. Implementation: Key Technical Decisions
4.1 Why JSON Over Free Text
The decision to use JSON as the primary output format was the single most consequential architectural choice. It was also the least intuitive.
The natural approach to automating document generation with LLMs is to generate documents. Give the model a prompt, receive a formatted document. This approach leverages the model's greatest strength, fluent text generation, and produces outputs that look immediately useful.
We rejected this approach because it optimises for the wrong thing. In procurement, the goal is not a document that reads well. It is a document that is correct. A beautifully written RFQ with one wrong part number is worse than an ugly spreadsheet with all correct part numbers, because the beautiful document will be trusted and acted upon.
JSON forces every piece of information into a discrete, typed field. There is nowhere for hallucinations to hide. A part number is either in the part_number field and verifiable, or it is absent. There is no middle ground where a plausible-looking number is woven into a sentence and escapes detection.
The tradeoff is that JSON output requires a transformation step to produce human-readable documents. We generate the final document from the validated JSON as a separate, deterministic process. This adds complexity but ensures that the document generation step cannot introduce errors: it is a pure transformation of already-validated data.
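One way to sketch that deterministic step (the output format here is illustrative, not the production template):

```python
def render_rfq_line(item: dict) -> str:
    # A pure transformation of already-validated JSON: no model call is
    # involved, so this step cannot introduce new factual errors.
    return (
        f"{item['item_number']:>3}. {item['description']} | "
        f"PN: {item['part_number']} | "
        f"{item['manufacturer']} {item['model']} | "
        f"{item['quantity']} {item['unit']}"
    )

line = render_rfq_line({
    "item_number": 1,
    "part_number": "MAN-FI-2876",
    "manufacturer": "MAN Energy Solutions",
    "model": "48/60CR",
    "description": "Fuel injector nozzle assembly",
    "quantity": 4,
    "unit": "PCS",
})
```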
4.2 Catalogue Data as Ground Truth
The system maintains a PostgreSQL database containing 500+ equipment items across 11 catalogues, plus 200+ vessel equipment entries mapping each vessel type to its specific machinery.
This database serves as the single source of truth. Every identifier in the system (part numbers, manufacturer names, model numbers) originates from this database, not from the LLM's parameters. The database is maintained by domain experts (procurement officers and technical superintendents) and is subject to its own quality assurance process independent of the AI pipeline.
We use pgvector for semantic search, enabling the Analyst to find relevant catalogue items even when the user's terminology does not exactly match the catalogue's. A search for "fuel injector" also surfaces "injection nozzle" and related components. This bridges the gap between how procurement officers describe what they need and how catalogues classify what they sell, without requiring the LLM to guess at the mapping.
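A hypothetical pgvector similarity query for this lookup might look like the following; the table and column names are assumptions for illustration. pgvector's `<=>` operator computes cosine distance between the stored embedding and a precomputed query embedding, so "fuel injector" ranks "injection nozzle" entries nearby even without keyword overlap. In practice it would be executed through a driver such as psycopg with the embedding passed as a parameter.

```python
# Illustrative semantic-search query over the catalogue table
# (schema names are assumptions, not the production database).
CATALOGUE_SEARCH_SQL = """
SELECT part_number, manufacturer, model, description,
       embedding <=> %(query_embedding)s::vector AS cosine_distance
FROM catalogue_items
ORDER BY cosine_distance
LIMIT %(top_k)s;
"""
```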
4.3 The Validation Pipeline
The Validator is implemented as a separate module with its own database connections and its own logic. It does not share state with the Generator. This independence is deliberate: if the Generator has a systematic bug that produces a particular class of error, the Validator catches it because it is checking against source data independently, not against the Generator's internal representation.
The Validator operates on the following principles:
- Fail closed. If a field cannot be verified, the item is flagged, not passed through. Ambiguity is treated as failure.
- Corrections are auditable. When the Validator corrects an item, the original (incorrect) value and the correction are both logged. This creates a feedback loop for improving the Generator over time.
- Confidence scores are conservative. The scoring algorithm penalises uncertainty more than it rewards correctness. A score of 8/10 means high confidence, not "probably fine."
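The paper does not specify the scoring algorithm, but a toy sketch of a conservative scheme, where unresolved flags cost more than auto-corrections and the floor is zero, might look like:

```python
def confidence_score(flags: list[str], corrections: int) -> int:
    """Score one item from 0 to 10, penalising uncertainty heavily.

    The weights here are illustrative assumptions: each unresolved flag
    costs 3 points, each auto-correction costs 2, and the score never
    goes below 0.
    """
    return max(0, 10 - 3 * len(flags) - 2 * corrections)
```

Under this scheme a clean item scores 10, while a single mismatch plus one correction already drops the item to 5, reflecting the "8/10 means high confidence, not probably fine" stance.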
5. Results
5.1 Production Performance
The system has been running in production since January 2026, processing commercial RFQ documents for a maritime procurement company. The following results reflect production usage, not benchmarks or test scenarios.
Zero hallucination incidents on critical fields. In all RFQs processed since deployment, there have been zero instances of fabricated part numbers, incorrect manufacturer names, or wrong model numbers reaching the final output. This is not a statistical claim about a low rate; it is a structural outcome of the architecture. Critical fields are sourced from the catalogue database and verified by the Validator. The LLM does not generate them.
Processing time reduction. RFQ document preparation has been reduced from 30 to 60 minutes of manual work to 60 to 90 seconds of automated processing. This represents a reduction of approximately 95 to 97 percent.
Catalogue coverage. The system currently indexes 500+ equipment items across 11 catalogues, with 200+ vessel equipment entries. For items with catalogue matches, the accuracy of part number retrieval is 100% (verified by the Validator against source data).
Zero-manual-QA operation. The validation pipeline has eliminated the need for human review of generated documents on critical fields. Procurement officers review outputs for business context (quantities, priorities, supplier selection) but do not need to verify part numbers, manufacturer names, or equipment specifications. This represents a fundamental shift from quality-by-inspection to quality-by-design.
5.2 Validation Pipeline Metrics
The Validator catches and corrects a small number of Generator errors in each batch. These are predominantly:
- Selection errors where the Generator chose a plausible but suboptimal catalogue match
- Structural errors such as duplicate items across enquiry rounds
- Description inconsistencies for specification-based items
These errors would have been invisible in a free-text generation system. In the JSON-first architecture, they are discrete, flagged, and corrected before delivery.
6. Discussion
6.1 Lessons Learned
The most important design decision was the earliest one. Choosing JSON over free text as the primary output format shaped every subsequent decision. It made validation possible, made errors visible, and made the system auditable.
Separation of concerns is not just good engineering; it is a safety mechanism. The three-stage pipeline is more complex than a single-prompt approach. That complexity is the point. Each stage has a narrow responsibility and can be tested, monitored, and improved independently.
Domain expertise is irreplaceable. The catalogue database is maintained by people who understand maritime equipment. The AI system is only as good as the domain knowledge encoded in its data and instructions. No amount of model capability compensates for wrong or missing catalogue data.
The LLM is a reasoning engine, not a knowledge base. Our architecture treats the LLM as a tool for understanding context, making selection decisions, and generating natural language: tasks where it excels. We do not treat it as a source of factual information about specific equipment, a task where it fails unpredictably.
6.2 Limitations
This approach requires structured source data. The JSON-first architecture works because maritime equipment has identifiers and a catalogue system. It would not work as cleanly in domains where the "correct answer" is subjective or where ground-truth data does not exist in a structured format.
The system is only as accurate as its catalogue. If the catalogue contains errors, the system will confidently produce documents with those errors. The system prevents AI-generated errors; it does not prevent human-generated data errors.
Non-critical fields remain probabilistic. Natural language descriptions for items without catalogue matches are generated by the LLM and are subject to the usual limitations of language model output.
6.3 When This Approach Works
The JSON-first architecture is well-suited to document generation tasks with the following characteristics:
- Outputs reference real-world entities with verifiable identifiers. Part numbers, product codes, legal citations, medical procedure codes, financial instrument tickers.
- Errors have significant consequences. Financial loss, safety risk, legal liability, regulatory non-compliance.
- Ground-truth data exists in a structured format. Catalogues, databases, registries, reference tables.
- The document structure is predictable. RFQs, invoices, compliance reports, clinical documentation, regulatory filings.
7. Conclusion: Practical Recommendations
For engineers and architects building LLM-powered systems in domain-critical applications, we offer the following recommendations based on our production experience:
1. Start with the error model, not the generation model. Before choosing an LLM or designing prompts, define what errors look like in your domain and what they cost.
2. Separate what the LLM generates from what it references. Identify which fields in your output are verifiable against external sources. Route those fields through lookup and validation rather than generation.
3. Use structured output formats. JSON, not prose. Discrete fields, not paragraphs. Make every claim individually addressable and programmatically verifiable.
4. Build independent validation. Your validator should not trust your generator. It should have its own data access, its own logic, and its own failure modes.
5. Fail closed on critical fields. If a field cannot be verified, do not pass it through with a warning. Flag it, hold it, and require explicit resolution.
6. Invest in your ground-truth data. The most sophisticated AI pipeline in the world cannot compensate for a bad catalogue. Your domain data is your competitive moat and your quality floor.
7. Log everything the Validator catches. Corrections are not just error handling; they are training data.
The hallucination problem in LLMs is real, well-documented, and unlikely to be fully solved at the model level in the near term. But for a significant class of enterprise document generation tasks, it does not need to be solved at the model level. It can be solved at the architecture level, by designing systems where the model's tendency to hallucinate simply does not matter on the fields where accuracy is critical.
Our production system demonstrates that this is not theoretical. It is practical, deployable, and commercially viable today.
Clinton Onyekwere is the founder of Clinton AI Ltd, a UK-based AI product studio building production AI systems for enterprise applications.