Ask GPT-5 or Claude Opus 4.7 a specific question about a small B2B brand (its pricing, team size, or competitors), and one of three things happens: the model gets it right, the model says it does not know, or the model confidently invents an answer that sounds correct but is not. The third outcome is a hallucination, and for any company below the Wikipedia-popularity threshold it is the default behaviour, not the edge case.

Most "fix your brand hallucinations" guides treat the symptom by adding more schema, more facts pages, more knowledge-graph signals. That framing is incomplete. Silent fields stay silent no matter how much surrounding content you add, and the model keeps guessing on the slots you never filled. The fix has to address the gap-fill step itself, not the volume of content around it.

This post explains why the guess happens, what do_not_infer does to stop it, and why the directive belongs in every LLM-facing schema, including the ones that do not exist yet.

How does an LLM answer a question about your brand?

When an LLM answers a brand question it runs a four-step chain: retrieve candidate sources, parse them, match each source field to the question intent, and fill any remaining gaps from parametric memory. The hallucination happens at step four, silently. Vectara's November 2025 leaderboard found leading frontier reasoning models, including GPT-5, Claude Sonnet 4.5, Grok-4, and Gemini-3-Pro, all exceeded 10% hallucination on grounded summarization (Vectara HHEM, Nov 2025), which means the failure mode is shared across the frontier, not isolated to a single lab.

Step four is the failure mode worth studying. The ReDeEP paper traces it mechanically: "hallucinations occur when the Knowledge FFNs in LLMs overemphasize parametric knowledge in the residual stream, while Copying Heads fail to effectively retain or integrate external knowledge from retrieved content" (ReDeEP, OpenReview 2025). Translated into plain language: the model has the retrieved passage sitting in context and still defers to what its weights remember, which is not a bug in a single model but the dominant pattern across the leaderboard.

Three causes account for most hallucinations, in roughly equal shares. Data limitations account for around 30% of hallucinations, training bias 25%, and probabilistic generation another 25% (Future AGI compilation via SQ Magazine, 2026), so no single fix addresses all three. The gap-fill step is where retrieval-time directives can move the needle, because that is the only step where the data layer can intervene before the token is generated.
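The four-step chain described above can be sketched in a few lines of Python. This is a conceptual toy under stated assumptions, not how any production model is implemented: `RETRIEVED_PROFILE`, `parametric_guess`, and `answer` are hypothetical names, and the founding year is an illustrative value for a fictional brand. The point is only to show that a silent field falls straight through to step four.

```python
# Hypothetical retrieved profile for a fictional brand; None marks a silent field.
RETRIEVED_PROFILE = {
    "founding_year": 2021,  # illustrative value only
    "headcount": None,      # silent: nothing retrieved for this field
}


def parametric_guess(field: str) -> str:
    """Stand-in for parametric memory: always returns something plausible-looking."""
    return f"<unverified guess for {field}>"


def answer(field: str, sources: list) -> tuple:
    """Return (answer, provenance) for one question field."""
    # Step 1: retrieve candidate sources (passed in here for simplicity).
    for source in sources:
        # Step 2: parse the source field by field.
        # Step 3: match the field to the question intent.
        value = source.get(field)
        if value is not None:
            return str(value), "retrieved"
    # Step 4: no source supplied a value, so the gap is filled from memory.
    return parametric_guess(field), "gap-filled"


print(answer("founding_year", [RETRIEVED_PROFILE]))
print(answer("headcount", [RETRIEVED_PROFILE]))
```

Nothing in the sketch distinguishes "we chose not to say" from "we forgot to say", which is the gap the rest of this post is about.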

Figure: The four-step inference chain. Hallucinations happen at step four, silently. A silent field at step three becomes a confident guess at step four.
Conceptual model based on the ReDeEP analysis of retrieval-augmented generation (OpenReview, 2025). The gap-fill step is where directive-level controls intervene.

The four-step LLM inference chain

Step  Action
1     Retrieve sources
2     Parse content
3     Match fields to question intent
4     Fill gaps from parametric memory (hallucination step)

Why is your brand specifically high-risk?

Hallucination rates are not uniform across questions. They spike on long-tail entities, the low-frequency names for which memorization has been shown to fail (Kandpal et al., ICML 2023), and that bucket contains nearly every B2B brand under 200 employees. Dextra Labs measured that 31.4% of real-world LLM interactions contain a hallucination, with the rate climbing to 60% in complex or domain-specific queries (Dextra Labs via SQ Magazine, 2026).

The PopQA paper at ACL 2023 measured this directly. LLMs memorize high-frequency facts and miss low-frequency ones, with accuracy dropping sharply as Wikipedia popularity declines (Mallen et al., PopQA, ACL 2023). If your brand sits in the long tail of Wikipedia popularity, which most B2B brands under 200 employees do, the model has very little to remember about you. Whatever it does not remember, it invents.

The benchmark spread tells the story. Across a 37-model evaluation, hallucination rates ranged from 15% at the low end (Grok-4) to 52% at the high end (qwen3-235b) (Dextra Labs via SQ Magazine, 2026), so the model your buyer happens to use is essentially a coin flip on whether you fall in the better or worse half. Our team has not found a credible technique that flattens this curve from the content side alone, and the practical conclusion is that the fix has to operate at retrieval time rather than at the content layer.

We walked through the broader retrieval mechanic in the GEO field guide. The short version: if a model cannot retrieve a clean fact, it will infer one, and long-tail entities give models less to retrieve so they infer more.

What does do_not_infer do?

do_not_infer: true is a field-level directive in our structured brand intelligence schema. When the model retrieves a brand profile and reaches a field marked this way, the directive converts the field from "silent," which triggers gap-fill, to "explicitly unknown," which suppresses it. Structured prompting using the same pattern reduces medical hallucinations by 33%, and RAG itself cuts hallucinations by 30 to 70% across domains (SQ Magazine compilation, 2026). Those figures suggest the ceiling on directive-driven gains is meaningful even before LLMs natively understand the primitive.

What the Builder writes for the model to read

The shift is small in JSON terms, but the behaviour change is large. Before the directive is set, an AI crawler retrieves a silent field that looks like this:

// Without directive, the model will guess
{
  "headcount": null,
  "pricing_starting_at_usd": null
}

After you flip the toggle in Ooky's Builder, the crawler retrieves this instead:

// With directive, the model is told to refuse
{
  "headcount": {
    "value": null,
    "do_not_infer": true,
    "reason": "Not publicly disclosed"
  },
  "pricing_starting_at_usd": {
    "value": null,
    "do_not_infer": true,
    "reason": "Custom pricing, contact sales"
  }
}

The failure mode shifts from a silent absence, which the model fills, to a positive instruction, which the model can treat as a guardrail. The same trick powers the "I do not know, please do not guess" prompt patterns documented in the SQ Magazine compilation, just encoded at the data layer the model retrieves instead of the prompt layer, so it survives every retrieval.
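The before-and-after shapes above can be produced mechanically. The sketch below is a minimal illustration of that transformation, assuming the wrapped object shape shown in the second JSON sample. `mark_do_not_infer` is a hypothetical helper name, not the Builder's actual code.

```python
import json


def mark_do_not_infer(profile: dict, reasons: dict) -> dict:
    """Wrap each silent field named in `reasons` in an explicit refusal object.

    Fields that already carry a value, or that have no stated reason,
    are passed through unchanged. The input profile is not modified.
    """
    marked = {}
    for field, value in profile.items():
        if value is None and field in reasons:
            marked[field] = {
                "value": None,
                "do_not_infer": True,
                "reason": reasons[field],
            }
        else:
            marked[field] = value
    return marked


before = {"headcount": None, "pricing_starting_at_usd": None}
after = mark_do_not_infer(
    before,
    {
        "headcount": "Not publicly disclosed",
        "pricing_starting_at_usd": "Custom pricing, contact sales",
    },
)
print(json.dumps(after, indent=2))
```

Requiring a reason for every refusal is a deliberate choice in this sketch: a field without one stays silent rather than being marked, so an unexplained refusal cannot slip into the published profile.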

A worked example on cloudweld.ai

Take cloudweld.ai, the parent company building Ooky. Ask any frontier model for cloudweld's exact employee count and you will get confident numbers that range across runs from "around 50" to "more than 200," none of which match anything cloudweld publishes on its site. The field is silent, so the model fills it from whatever adjacent signals it can reach (LinkedIn samples, news coverage, similar-name confusion), and the answer is wrong in a different way each time.

Now imagine cloudweld shipped a brand intelligence profile with headcount flagged do_not_infer: true, with the reason "not publicly disclosed." A retrieving model that reads the directive answers "cloudweld.ai does not publicly disclose its headcount." That answer is correct and citable: the model goes from inventing a number to attributing a refusal, and the refusal is the truth.
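The three possible outcomes for a field (value, explicit refusal, silence) can be made concrete with a deterministic stand-in for the reading side. This is only a sketch: `render_answer` is a hypothetical function, the wording of its outputs is an assumption, and a real model's phrasing will vary.

```python
def render_answer(brand: str, field: str, entry) -> str:
    """Turn one profile entry into the sentence a directive-aware reader could give."""
    # Explicit refusal: attribute the refusal instead of producing a number.
    if isinstance(entry, dict) and entry.get("do_not_infer"):
        reason = entry.get("reason", "no reason given")
        return f"{brand} has not published a value for {field} (stated reason: {reason})."
    # Unwrap object-style entries that carry a plain value.
    if isinstance(entry, dict):
        entry = entry.get("value")
    # Silent field: this is the case where gap-fill would normally kick in.
    if entry is None:
        return f"No retrieved value for {field}; this is where a model would guess."
    return f"{brand} lists {field} as {entry}."


refusal = {"value": None, "do_not_infer": True, "reason": "Not publicly disclosed"}
print(render_answer("cloudweld.ai", "headcount", refusal))
print(render_answer("cloudweld.ai", "headcount", None))
```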

What did schema.org miss?

Schema.org was built for search engines that index, not for LLMs that retrieve, parse, and infer. The vocabulary missed a primitive: a way to mark a field as "deliberately unknown, do not infer." That gap is the one do_not_infer fills. Search Atlas analysed schema markup adoption against citation outcomes and found no correlation between schema coverage and LLM citation frequency (Search Atlas, 2025), which is consistent with the gap-fill mechanism: more parseable fields do not help if the silent ones still get invented.

The pattern has precedent. Gatsby's GraphQL layer ships a @dontInfer directive that tells the type system not to invent field types from sample data, which is the same idea applied to a different domain. Our team did not invent the concept; we ported it to the brand-intelligence layer and gave it a name LLMs can read.

The bigger claim is this: do_not_infer should be a category-level directive, not a vendor feature. We would rather see it land in schema.org and JSON-LD than stay proprietary to Ooky. The honest limit today is that directives only help models that respect them, and current LLMs do not natively parse directive fields. They treat them as data, so the directive works only because we surface it inside the brand intelligence payload the model retrieves and because our prompt scaffolding tells the model how to read it. Wider adoption needs LLM providers to standardise directive semantics, and until then the retrieval-time surface is where the work has to live.
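To make "prompt scaffolding" less abstract, here is one way such a layer could translate directive fields into plain instructions a model can follow. This is an assumption-laden sketch, not Ooky's actual scaffolding: `build_scaffold` and the instruction wording are invented for illustration.

```python
def build_scaffold(brand: str, profile: dict) -> str:
    """Render a profile as instructions, separating known facts from refusals."""
    known, refused = [], []
    for field, entry in profile.items():
        if isinstance(entry, dict) and entry.get("do_not_infer"):
            reason = entry.get("reason", "no reason given")
            refused.append(f"- {field}: not disclosed ({reason}). Do not estimate.")
        elif entry is not None and not isinstance(entry, dict):
            known.append(f"- {field}: {entry}")
    lines = [f"Facts about {brand}. Use only what is listed here."]
    lines += known
    if refused:
        lines.append("For the fields below, state that the value is not disclosed:")
        lines += refused
    return "\n".join(lines)


profile = {
    "founding_year": 2021,  # illustrative value only
    "headcount": {"value": None, "do_not_infer": True, "reason": "Not publicly disclosed"},
}
print(build_scaffold("Example Co", profile))
```

The directive itself stays plain data in the payload. The scaffold is the part that turns it into behaviour, which is why it has to travel with every retrieval until models read the field natively.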

What should you do this week?

Inventory your silent fields, decide which ones should be filled and which should be explicitly unknown, then mark the second group. The exercise takes about 30 minutes for a well-scoped brand profile, and it directly targets the gap-fill step, the failure mode this post ties to the double-digit hallucination rates Vectara measured across the leading frontier reasoning models (Vectara HHEM, Nov 2025).

Step 1: inventory the silent fields

List every fact an LLM might assert about your company: pricing, headcount, founders, founding year, integrations, customers, locations, certifications, security posture, and revenue. Any field a buyer might ask about is a candidate, and most teams find 20 to 40 fields in the first pass before the list starts to feel exhaustive.
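If the profile already exists as structured data, the first pass of the inventory can be automated. The sketch below assumes the checklist from the paragraph above as field names (the snake_case keys are an assumption) and treats a field as silent when it is missing, empty, or null without a directive. `find_silent_fields` is a hypothetical helper.

```python
# Checklist taken from the paragraph above; key spellings are assumptions.
CANDIDATE_FIELDS = (
    "pricing", "headcount", "founders", "founding_year", "integrations",
    "customers", "locations", "certifications", "security_posture", "revenue",
)


def is_silent(entry) -> bool:
    """A field is silent when it has no value and no explicit refusal."""
    if isinstance(entry, dict):
        return entry.get("value") in (None, "", []) and not entry.get("do_not_infer")
    return entry in (None, "", [])


def find_silent_fields(profile: dict, candidates=CANDIDATE_FIELDS) -> list:
    """Return the candidate fields a model would have to guess at."""
    return [field for field in candidates if is_silent(profile.get(field))]


profile = {
    "pricing": None,
    "founders": ["A. Example"],  # placeholder name
    "headcount": {"value": None, "do_not_infer": True, "reason": "Not publicly disclosed"},
    "integrations": [],
}
print(find_silent_fields(profile))
```

A field that is absent from the profile counts as silent here, because from the model's side a missing key and a null value look the same.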

Step 2: decide value or refusal

For each field, pick one of two outcomes: a confident value with a source, or an explicit refusal. Refusals are valid and sometimes the only honest answer, because pre-announcement pricing, sensitive headcount, undisclosed customer lists, and pending certifications all deserve a refusal rather than a guess. Trying to fill every field is the mistake, because some silences are correct.
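The value-or-refusal rule can be enforced with a small check before publishing. In this sketch each entry must be either a value with a source or a refusal with a reason. The `source` key and the `validate_entry` name are assumptions introduced for illustration and do not appear in the schema samples above.

```python
def validate_entry(field: str, entry: dict) -> list:
    """Return a list of problems; an empty list means the entry is publishable."""
    has_value = entry.get("value") is not None
    refused = entry.get("do_not_infer") is True
    if has_value and refused:
        return [f"{field}: has both a value and a refusal, pick one"]
    if has_value and not entry.get("source"):
        return [f"{field}: value given without a source"]
    if refused and not entry.get("reason"):
        return [f"{field}: refusal given without a reason"]
    if not has_value and not refused:
        return [f"{field}: still silent, needs a value or a refusal"]
    return []


entries = {
    "founding_year": {"value": 2021, "source": "https://example.com/about"},  # placeholder
    "headcount": {"value": None, "do_not_infer": True, "reason": "Not publicly disclosed"},
    "revenue": {"value": None},
}
for name, entry in entries.items():
    print(name, validate_entry(name, entry))
```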

Step 3: mark the refusal list in the Builder

In Ooky, marking a field is a per-field switch in the Builder. Flip the toggle, hit publish, and the marked profile gets served to AI crawlers at the edge with the directive attached. The reason this lives in Ooky's intelligence layer rather than as a static JSON file: the directive has to be served at retrieval time, on bot-detected traffic, with edge cache invalidation when a field changes. The gap-fill step only has something to react to if the directive travels with the data the model retrieves.
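The retrieval-time requirements named above (bot detection, a different payload for crawlers, and cache invalidation when a field changes) can be sketched without any framework. This is not Ooky's implementation: the function names are hypothetical, the user-agent list is a small illustrative sample of publicly documented crawler tokens that should be checked against each vendor's current documentation, and a content hash stands in for whatever invalidation mechanism a real edge layer uses.

```python
import hashlib
import json

# Illustrative sample of AI crawler user-agent tokens; not an exhaustive list.
AI_CRAWLER_TOKENS = ("GPTBot", "ClaudeBot", "PerplexityBot")


def is_ai_crawler(user_agent: str) -> bool:
    """Case-insensitive substring match against known crawler tokens."""
    ua = user_agent.lower()
    return any(token.lower() in ua for token in AI_CRAWLER_TOKENS)


def profile_etag(profile: dict) -> str:
    """Hash a canonical serialization so any field change yields a new tag."""
    canonical = json.dumps(profile, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


def respond(user_agent: str, profile: dict, human_html: str) -> dict:
    """Serve the marked profile to crawlers and the normal page to everyone else."""
    if is_ai_crawler(user_agent):
        return {
            "content_type": "application/json",
            "etag": profile_etag(profile),
            "body": json.dumps(profile),
        }
    return {"content_type": "text/html", "etag": None, "body": human_html}
```

Because the tag is derived from the profile content, flipping a single field produces a new tag, so a cache keyed on it cannot keep serving the old, unmarked version.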

For the broader stack, bot interception, edge-served intelligence, and perception tracking, see the GEO field guide. Once the fields are marked, the five-question protocol in the wince test is how we verify the LLMs are respecting the directive.

What changes once you mark the refusal list?

Hallucinations aren't random. They happen at a specific step in a four-step inference chain, on a specific kind of entity, in a measurable band that hasn't closed since Vectara started tracking it. More content alone doesn't fix that. A directive that turns silent fields into explicit refusals, encoded at the data layer the model retrieves, does.

Open your brand profile this week and mark the fields you would rather the model refuse than guess on. The exercise takes 30 minutes manually, and the next time a buyer asks ChatGPT about your pricing the answer is your refusal rather than someone else's invention.

Want the directive without writing the JSON?

Ooky's free tier surfaces do_not_infer as a per-field toggle in the Builder, serves the marked profile to AI crawlers automatically, and re-publishes the moment you flip a switch. Marking the fields takes about 30 minutes by hand and needs no ongoing work.

Sources

  1. Vectara. "Introducing the next generation of Vectara's Hallucination Leaderboard." November 19, 2025. Retrieved 2026-05-14. https://www.vectara.com/blog/introducing-the-next-generation-of-vectaras-hallucination-leaderboard
  2. SQ Magazine, citing Dextra Labs. "LLM hallucination statistics." April 27, 2026. Retrieved 2026-05-14. https://sqmagazine.co.uk/llm-hallucination-statistics/
  3. SQ Magazine, citing Future AGI. "LLM hallucination statistics (cause split)." April 27, 2026. Retrieved 2026-05-14. https://sqmagazine.co.uk/llm-hallucination-statistics/
  4. SQ Magazine compilation. "Structured prompting reduces medical hallucinations by 33%; RAG cuts hallucinations 30 to 70%." 2026. Retrieved 2026-05-14. https://sqmagazine.co.uk/llm-hallucination-statistics/
  5. Mallen, A. et al. "When Not to Trust Language Models: Investigating Effectiveness of Parametric and Non-Parametric Memories (PopQA)." ACL 2023. Retrieved 2026-05-14. https://aclanthology.org/2023.acl-long.546.pdf
  6. Search Atlas. "The limits of schema markup for AI search." 2025. Retrieved 2026-05-14. https://searchatlas.com/blog/limits-of-schema-markup-for-ai-search/
  7. ReDeEP. "Detecting hallucinations in retrieval-augmented generation via mechanistic interpretability." OpenReview, 2025. Retrieved 2026-05-14. https://openreview.net/forum?id=ztzZDzgfrh