5 prompts to see what AI really says about your company

When a buyer asks an AI assistant "who should I use for X?" before they ever type into Google, your brand is being described in a way you have not seen and cannot directly control. 94% of B2B buyers now use LLMs during their purchase journey (6sense, Nov 2025), which means whatever ChatGPT, Claude, or Perplexity says about your company on a given afternoon is the homepage your next buyer reads first.

Most founders have never looked. The ones who do tend to look once, see a weird answer, close the tab, and walk away with a hunch instead of a backlog.

Here's a protocol that turns the hunch into a backlog: five prompts, three models, one scoring rubric. About 20 minutes for a fast pass, two to four hours for a thorough one. I'll walk you through what it looked like when we ran it on cloudweld.ai, the parent company behind Ooky, so the findings come from a real brand and not a hypothetical. If you'd rather skip the manual run, Ooky's free tier puts this on a schedule for you (see what the free tier covers).

Why does it matter what AI says about your brand?

This is not curiosity traffic; it is the new shortlist surface. 6sense found that buyers now contact vendors 3.5 weeks earlier than they did before AI tools were widespread, at 26.4 weeks into a project instead of 30 (6sense, Nov 2025). By the time a buyer fills out your contact form, the LLM has already framed the choice and decided which competitors get mentioned alongside you.

What should you set up before you start?

About ten minutes of prep. You need a fresh chat session for every prompt on every model so no prior context leaks in, a plan to repeat each prompt three times because LLMs are non-deterministic, and a simple scoring sheet. A 2,961-run SparkToro study found a less than 1% chance of identical recommendation lists across repeat queries (SparkToro via Passionfruit, Feb 2026).

The setup checklist

Pick three models. ChatGPT (GPT-4.1 or newer), Claude (Sonnet 4 or newer), Perplexity (Sonar Pro). Add Gemini 2.5 if you have time.
New chat per prompt. No memory, no prior context. Run one pass with browse mode on, one with browse mode off so you can see the difference between retrieval and pure training data.
Run each prompt three times. SE Ranking found only 9.2% URL consistency across three identical queries on the same day (SE Ranking via Passionfruit, Feb 2026). One run is an anecdote; three runs are a pattern.

Our team got this wrong the first time around. We ran each prompt once, wrote down a verdict, and called it a day, which gave us five data points and a strong opinion about each one but no actual pattern. The pattern only emerges across the 15 prompt-model pairs (5 prompts times 3 models), each run three times for 45 responses in total. A single answer is an anecdote; the shape across the runs is the actual audit.
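To keep prompt-model pairs and repeat runs from blurring together, here is a minimal Python sketch of the run plan. The model and prompt labels are illustrative placeholders, and the dictionary layout is our assumption, not a required format.

```python
from itertools import product

# Illustrative labels only; substitute the models and prompts you actually use.
MODELS = ["ChatGPT", "Claude", "Perplexity"]
PROMPTS = ["discovery", "identity", "comparison", "recommendation", "reputation"]
REPEATS = 3  # each prompt is run three times per model


def build_run_plan(models=MODELS, prompts=PROMPTS, repeats=REPEATS):
    """Return one row per fresh chat session: (model, prompt, attempt)."""
    return [
        {"model": m, "prompt": p, "attempt": a}
        for m, p, a in product(models, prompts, range(1, repeats + 1))
    ]


plan = build_run_plan()
pairs = {(row["model"], row["prompt"]) for row in plan}
print(len(pairs), "prompt-model pairs,", len(plan), "responses to collect")
```

Each row is one fresh chat session, so the list doubles as a checklist while you work through the models.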

What are the five prompts?

Each prompt targets a different layer of buyer behavior: discovery, identity, comparison, recommendation, reputation. Together they reconstruct what an LLM "thinks" about your brand. Hallucination rates for frontier models cluster around 22%, with a range of 15 to 52% across a 37-model benchmark (SQ Magazine, citing Dextra Labs, 2026), so expect drift on at least one dimension.

1. Discovery prompt

What are the best tools for [your category] in 2026?

Tests whether you appear at all when buyers do not know your name yet. Watch for: top-3 ranking, missing entirely, listed below competitors that no longer exist, or grouped into the wrong category. This is the cold-start question.

2. Identity prompt

What does [your company] do?

Tests factual recall when a buyer already has your name. Watch for: wrong category, missing flagship feature, wrong founding year, invented product lines, or a description that matches your 2023 positioning instead of your 2026 one.

3. Comparison prompt

[Your company] vs [your closest competitor]: which is better for [your ICP]?

Tests how the model frames you against a named alternative. Watch for: pricing claims (often wrong), feature attribution errors, biased framing toward the competitor, or comparison tables that mix and match data points from different years.

4. Recommendation prompt

I'm a [your ICP role] at a [company size] B2B brand. Should I use [your company]?

Tests purchase-intent surfacing. Watch for: hedges, "I do not have enough information," recommended alternatives instead of you, or qualifiers that contradict your positioning. This is the prompt closest to what a real buyer types.

5. Reputation prompt

Are there any concerns or complaints about [your company]?

Tests the due-diligence surface and hallucination risk. Watch for: invented complaints, real-but-stale issues from a 2022 Reddit thread, false competitor attribution, or outage incidents that belong to someone else.
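If you want every model to receive the same wording, the five prompts can be kept as templates and filled in once. The sketch below is a hedged example in Python: the `fill` helper and every sample value (ExampleCo, RivalCo, and so on) are hypothetical stand-ins for your own details.

```python
import re

# The five prompts from this post, with their bracketed placeholders intact.
PROMPT_TEMPLATES = {
    "discovery": "What are the best tools for [your category] in 2026?",
    "identity": "What does [your company] do?",
    "comparison": "[Your company] vs [your closest competitor]: which is better for [your ICP]?",
    "recommendation": "I'm a [your ICP role] at a [company size] B2B brand. Should I use [your company]?",
    "reputation": "Are there any concerns or complaints about [your company]?",
}


def fill(template, values):
    """Replace each [placeholder] with its value; fail loudly if one is left over."""
    out = template
    for name, value in values.items():
        out = out.replace(f"[{name}]", value)
        # Also cover the sentence-initial form, e.g. "[Your company]".
        out = out.replace(f"[{name[0].upper()}{name[1:]}]", value)
    leftover = re.findall(r"\[[^\]]+\]", out)
    if leftover:
        raise ValueError(f"unfilled placeholders: {leftover}")
    return out


# Hypothetical values for illustration only.
values = {
    "your category": "AI brand visibility",
    "your company": "ExampleCo",
    "your closest competitor": "RivalCo",
    "your ICP": "B2B SaaS marketing teams",
    "your ICP role": "head of marketing",
    "company size": "50-person",
}

prompts = {name: fill(t, values) for name, t in PROMPT_TEMPLATES.items()}
for name, text in prompts.items():
    print(f"{name}: {text}")
```

Raising on a leftover placeholder is a deliberate choice: a prompt sent with "[your ICP]" still in it would quietly skew that run.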

How do you score each response? The 4-dimension rubric

Score each response on four dimensions, zero to three each, for a total out of 12. Across the 15 prompt-model pairs (5 prompts x 3 models), with one score per pair summarizing its three repeat runs, the maximum possible is 180. Score 150 or above and you are strong. Land between 90 and 149 and you have a backlog. Below 90 is urgent. This rubric is first-party Ooky work.

4-dimension AI brand audit scoring rubric
Dimension	0	1	2	3
Mention	Not mentioned	Mentioned, low position	Mentioned, mid-list	Top-3
Accuracy	Factually wrong	Partly wrong	Mostly right	Fully accurate
Sentiment	Negative	Neutral-negative	Neutral-positive	Positive
Completeness	Missing core info	Partial	Most info	Comprehensive
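The arithmetic behind the rubric is small enough to sketch in Python. The dimension names and the 150 and 90 thresholds come from this post. Collapsing the three repeat runs of a prompt-model pair into their median is our assumption here, since a mean or a worst-case score would also be defensible.

```python
from statistics import median

DIMENSIONS = ("mention", "accuracy", "sentiment", "completeness")


def response_score(scores):
    """Sum the four 0-3 dimension scores for one response (maximum 12)."""
    for dim in DIMENSIONS:
        if not 0 <= scores[dim] <= 3:
            raise ValueError(f"{dim} must be between 0 and 3")
    return sum(scores[dim] for dim in DIMENSIONS)


def pair_score(repeat_scores):
    """Collapse the repeat runs of one prompt-model pair (assumption: median)."""
    return median(response_score(s) for s in repeat_scores)


def band(total):
    """Map an audit total out of 180 to the three bands used in this post."""
    if total >= 150:
        return "strong"
    if total >= 90:
        return "backlog"
    return "urgent"


# Made-up scores for the three repeats of a single prompt-model pair.
repeats = [
    {"mention": 2, "accuracy": 3, "sentiment": 2, "completeness": 2},  # 9
    {"mention": 3, "accuracy": 3, "sentiment": 2, "completeness": 2},  # 10
    {"mention": 3, "accuracy": 3, "sentiment": 3, "completeness": 3},  # 12
]

# Pretend all 15 pairs scored the same way, purely to show the totals.
audit_total = sum(pair_score(repeats) for _ in range(15))
print(audit_total, band(audit_total))  # 150 strong
```

The median keeps one unusually good or bad run from moving a pair's score, which fits the idea of scoring the pattern instead of a single answer.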

Most AI-audit guides skip the scoring layer entirely. They tell you to "evaluate the response" without saying how. The unscored version of this audit collapses into vibes inside three weeks, and vibes don't survive a quarterly re-run: last quarter's vibe is gone the moment you sit down to score again. Numbers do survive. They hold their meaning across runs and reviewers, and they tell you whether you actually improved or just remembered the wins. A manual pass takes 2 to 4 hours of focused work (Passionfruit, Jun 2026), which is why Ooky's Brand Tracker runs the same rubric on a schedule and trends the sentiment over time.

What it looked like on cloudweld.ai

We ran this protocol on cloudweld.ai before writing the post, because publishing a rubric we hadn't survived ourselves felt dishonest. The exact scores don't matter. The shape of the findings does, and it's the same shape I've kept seeing in B2B audits since.

On the Discovery prompt, cloudweld.ai landed mid-list everywhere, because the category itself (AI brand visibility, GEO tooling) doesn't have stable named entities yet. On the Identity prompt, GPT-4.1 nailed it, Claude invented a pricing tier that has never existed, and Perplexity attributed a competitor's case study to us. Three models, three different failure modes, and the diff between them was the actual GEO backlog we walked out with.

The shape most founders see: Discovery and Identity score well, while Comparison and Reputation drift. The biggest wince is almost always Reputation, where a model invents a complaint or pins someone else's outage on you. That's the kind of finding worth fixing before your next buyer sees it, because the model will keep repeating it until something cleaner shows up in its retrieval window. Hallucination rates cluster around 22% (SQ Magazine / Dextra Labs, 2026), so even a strong brand should expect drift on at least one prompt.

What do you do with the results?

Sort findings into three buckets: manual fixes you can do this week, structural fixes that need a layer between the AI bot and your site, and ongoing perception tracking that keeps the audit alive past the first run. Most teams skip the triage step and end up with a spreadsheet that nobody opens again, so the buckets are what turn an audit into a backlog. Passionfruit recommends monthly re-runs as the ideal cadence and quarterly as the floor (Passionfruit, Jun 2026).

The three buckets: what you can do, and what Ooky does on top

This week (manual, you do these yourself). Rewrite your homepage hero so it leads with your category and ICP in one sentence. Make sure your About page states your founding year, current product line, and headline customer use case in plain prose. Allow GPTBot, OpenAI's crawler, in your robots.txt and declare your sitemap in the same file. These are copy and configuration fixes on your own site, and they are the part Ooky cannot do for you because they need your judgement about positioning.
Structural (Ooky handles this for you). Putting a clean, machine-readable description of your brand at the edge so AI crawlers stop guessing is a build, not a copy fix. Ooky's Brand Intelligence editor handles the structure, the do_not_infer toggles on sensitive fields, and the publish step for you, so the audit findings turn into an actual fix instead of a follow-up ticket that nobody picks up next quarter.
Ongoing (Ooky's Brand Tracker). Manual re-runs work for the first month and decay after that, because nobody wants to redo a 4-hour spreadsheet exercise on a calendar reminder. Ooky's Brand Tracker schedules the five prompts (and more buyer-intent variants) across every major model, scopes them to personas, regions, and named competitors, scores sentiment automatically, and trends the visibility score over time so you can see whether last month's fix actually moved the number.
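For the robots.txt item in the first bucket, it is worth checking that your rules say what you think they say. The sketch below uses Python's standard `urllib.robotparser` on a sample file. The example.com URLs and the `/admin/` rule are placeholders, and the check only shows how the rules parse, not how any particular crawler behaves in practice.

```python
from urllib.robotparser import RobotFileParser

# Sample robots.txt; the domain and paths are placeholders, not recommendations.
ROBOTS_TXT = """\
User-agent: GPTBot
Allow: /

User-agent: *
Disallow: /admin/

Sitemap: https://www.example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# GPTBot has its own group that allows everything.
print(parser.can_fetch("GPTBot", "https://www.example.com/about"))
# Other crawlers fall back to the wildcard group.
print(parser.can_fetch("SomeOtherBot", "https://www.example.com/admin/settings"))
# Sitemap URLs declared in the file (site_maps() needs Python 3.8 or newer).
print(parser.site_maps())
```

To test your live file instead of a pasted sample, `set_url()` followed by `read()` on the same parser fetches it over the network.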

FAQ

How often should I run this audit?

Monthly is ideal, quarterly is the floor (Passionfruit, Jun 2026). Less often and you are auditing a model that has already moved on. Schedule the next run before you close the spreadsheet from this one, calendar it for the 1st of the month, and protect the slot.

Why do I get different answers when I run the same prompt twice?

LLMs are non-deterministic. A 2,961-run SparkToro study across ChatGPT, Claude, and Google AI found a less than 1% chance of identical recommendation lists and less than 0.1% chance of identical order (SparkToro via Passionfruit, Feb 2026). Run each prompt three times and score the pattern.
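One way to put a number on "the pattern" is to measure how much the recommended brands overlap across your three runs. The Python sketch below uses Jaccard overlap, which is our choice of metric and not the method used in the SparkToro study. The brand names are made up.

```python
from itertools import combinations


def jaccard(a, b):
    """Share of brands two runs have in common, ignoring order."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def mean_pairwise_overlap(runs):
    """Average Jaccard overlap over every pair of runs."""
    pairs = list(combinations(runs, 2))
    if not pairs:
        raise ValueError("need at least two runs to compare")
    return sum(jaccard(x, y) for x, y in pairs) / len(pairs)


# Made-up recommendation lists from three runs of the same prompt.
runs = [
    ["BrandA", "BrandB", "BrandC"],
    ["BrandB", "BrandA", "BrandD"],
    ["BrandA", "BrandC", "BrandE"],
]

overlap = mean_pairwise_overlap(runs)
always_present = set.intersection(*(set(r) for r in runs))
print(round(overlap, 2), sorted(always_present))  # 0.4 ['BrandA']
```

A brand that shows up in every run is a steadier signal than any single ranking, which is why the intersection is printed alongside the overlap.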

Do I need a paid tool to do this?

Not for the first audit. A manual run of 5 prompts across 3 models with a rubric takes roughly 2 to 4 hours of focused work (Passionfruit, Jun 2026). Paid AEO tools earn their keep once you need monthly history, competitor benchmarking, or alerting on score changes.

Sources

6sense. "94% of B2B buyers use AI for research: here's why your demand-gen team doesn't need to panic." November 2025. Retrieved 2026-05-15. https://6sense.com/blog/94-of-b2b-buyers-use-ai-for-research-heres-why-your-demand-gen-team-doesnt-need-to-panic/
6sense. "Buyers contact vendors 3.5 weeks earlier (26.4 vs 30 weeks)." November 2025. Retrieved 2026-05-15. https://6sense.com/blog/94-of-b2b-buyers-use-ai-for-research-heres-why-your-demand-gen-team-doesnt-need-to-panic/
SparkToro, via Passionfruit. "Why AI brand recommendations change with every query: 2,961-run study across ChatGPT, Claude, and Google AI." February 2026. Retrieved 2026-05-15. https://www.getpassionfruit.com/blog/why-ai-brand-recommendations-change-with-every-query-research-analysis-and-strategic-implications
Passionfruit. "How to audit brand visibility in ChatGPT, Perplexity and Gemini." June 10, 2026. Retrieved 2026-06-10. https://www.getpassionfruit.com/blog/how-to-audit-brand-visibility-in-chatgpt-perplexity-and-gemini
SQ Magazine, citing Dextra Labs. "LLM hallucination statistics (22% cluster, 15-52% range across 37 models)." 2026. Retrieved 2026-05-15. https://sqmagazine.co.uk/llm-hallucination-statistics/
SE Ranking, cited via Passionfruit. "Only 9.2% URL consistency across 3 identical queries same day." February 2026. Retrieved 2026-05-15. https://www.getpassionfruit.com/blog/why-ai-brand-recommendations-change-with-every-query-research-analysis-and-strategic-implications