10,800 Queries across 5 AI models
23,068 Unique businesses named
5.6% Average cross-model overlap
0.0% ChatGPT reproducibility

Abstract

We queried five major AI models (ChatGPT, Claude, Gemini, Perplexity, Grok) 10,800 times across 18 U.S. cities, 5 home service categories, 8 question phrasings, and 3 repeated waves to determine whether AI-powered search produces consistent, reliable business recommendations. It does not. The models recommend almost entirely different businesses from each other (Fleiss' kappa = −0.172; average pairwise Jaccard overlap = 5.6%). ChatGPT never returned the same answer twice across waves. Even slight changes in question phrasing significantly altered which businesses were named (Friedman p < 0.001). These findings demonstrate that AI-generated recommendations are neither stable nor convergent, with direct implications for businesses seeking visibility in AI-powered search and for consumers relying on these tools for local decisions.

Executive Summary: 10 Numbers That Tell the Story

  1. 23,068 unique businesses were named across 10,800 AI queries — and the five models can't agree on which ones matter.
  2. 5.6% overlap — the average proportion of businesses any two AI models have in common. They're not reading from the same playbook.
  3. Fleiss' kappa = −0.172 — the models don't just disagree, they actively anti-agree. Worse than random chance.
  4. 0.0% reproducibility for ChatGPT — of 720 unique queries asked 3 times, ChatGPT never returned the same answer twice. Not once.
  5. Only 4.2% of all queries produced identical results in all three waves, pooled across the five models.
  6. Claude is the most consistent at 10.1% exact match and 56.7% majority match — and even that means 43% of the time it changes its mind.
  7. Question phrasing changes everything — asking “who's the best plumber?” vs. “I need a plumber for an old house” produces only 17.5% business overlap (Friedman p < 0.001).
  8. Larger cities get 10% more recommendations per response (6.8 vs. 6.2 businesses), but city size doesn't fix the disagreement problem.
  9. ChatGPT names the most businesses (10,266 unique) while Grok names the fewest (2,088) — a 5x difference in recommendation breadth.
  10. 16 of 20 businesses in a typical city+category appear in only one model's top 5. The AI models live in parallel universes.

Methodology

This study was pre-registered on OSF before data collection. The analysis plan, including all statistical tests and alpha thresholds, was filed in advance. No post-hoc modifications were made to the primary analyses.

Models (5): ChatGPT (GPT-5.2), Claude (Sonnet 4.5), Gemini (2.5 Pro), Perplexity (Sonar Pro), Grok (4)
Cities (18): Stratified by population tier (large/mid/small), U.S. region, and income level
Categories (5): Plumbing, HVAC, Electrical, Roofing, Landscaping
Query Variants (8): Best-of, older homes, emergency, affordable, trustworthy, top 3, near downtown, conversational
Waves (3): Same queries repeated 3 times for reproducibility testing
Collection Window: 72 hours (Feb 9–11, 2026). 100% completion. Zero failures.

Total: 5 models × 18 cities × 5 categories × 8 variants × 3 waves = 10,800 queries
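
As a sanity check, the full factorial grid can be enumerated directly; a minimal Python sketch (the city and variant labels are placeholders, not the study's exact strings):

```python
from itertools import product

models = ["chatgpt", "claude", "gemini", "perplexity", "grok"]
cities = [f"city_{i}" for i in range(18)]        # placeholder labels
categories = ["plumbing", "hvac", "electrical", "roofing", "landscaping"]
variants = [f"variant_{i}" for i in range(8)]    # placeholder labels
waves = [1, 2, 3]

# Every (model, city, category, variant, wave) combination is one query
grid = list(product(models, cities, categories, variants, waves))
assert len(grid) == 10_800
```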

Cities by Tier

Tier            Cities
Large metros    Denver, Memphis, Boston, Birmingham, Minneapolis, Columbus
Mid metros      Portland ME, Scranton, Boise, Shreveport, Madison, Dayton
Small metros    Burlington, Elmira, Hilton Head, Hattiesburg, Waukesha, Terre Haute

Statistical framework: Alpha = 0.05 with Benjamini-Hochberg FDR correction for all multiple comparisons. Effect sizes reported alongside significance tests. Both parametric and non-parametric alternatives computed. All code version-controlled; deterministic pipeline with fixed random seed (42).
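
For readers unfamiliar with the procedure, this is how a Benjamini-Hochberg correction is typically applied with statsmodels; the p-values below are placeholders, not results from the study:

```python
from statsmodels.stats.multitest import multipletests

pvals = [0.0004, 0.012, 0.17, 0.031, 0.66]  # hypothetical raw p-values

# method="fdr_bh" applies the Benjamini-Hochberg step-up procedure at alpha=0.05
reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
for p, padj, r in zip(pvals, p_adj, reject):
    print(f"raw p = {p:.4f}  adjusted p = {padj:.4f}  reject: {r}")
```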

Finding 1: The Models Actively Disagree

Research Question: Do AI models agree on which businesses to recommend?

No. They actively anti-agree. Fleiss' kappa across all five models was −0.172, indicating worse-than-chance agreement. A negative kappa means the models don't just fail to converge — they systematically recommend different businesses from each other.

The average pairwise Jaccard similarity was 0.056 — meaning any two models share only about 5.6% of the businesses they name. Every pairwise Cohen's kappa was negative. The worst disagreement: ChatGPT vs. Gemini (k = −0.327). The highest raw overlap was also ChatGPT and Gemini, at 6.6%; the lowest was Claude and Perplexity, at 3.5%. The same pair can post the worst kappa and the highest overlap because kappa corrects for chance: ChatGPT and Gemini name the most businesses, so their expected chance overlap is the largest, and falling far short of that expectation drives kappa deeply negative.

Figure 1: Pairwise Jaccard similarity between models. Maximum overlap between any two models is 6.6%.
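
The overlap metric itself is plain set arithmetic; a minimal sketch with hypothetical business names (not ones from the dataset):

```python
from itertools import combinations

# Hypothetical per-model sets of canonical business names for one city+category
recs = {
    "chatgpt": {"Acme Plumbing", "Budget Rooter", "City Pipe Pros"},
    "gemini":  {"Acme Plumbing", "Drain Masters", "Elite Plumbing"},
    "claude":  {"Flow Right", "Drain Masters", "Gold Star Plumbing"},
}

def jaccard(a: set, b: set) -> float:
    """Size of the intersection over size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

for m1, m2 in combinations(recs, 2):
    print(f"{m1} vs {m2}: {jaccard(recs[m1], recs[m2]):.1%}")
```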

Businesses Named by Model

Model         Unique Businesses   Share of Total
ChatGPT       10,266              44.5%
Gemini        6,335               27.5%
Claude        5,111               22.2%
Perplexity    3,606               15.6%
Grok          2,088               9.1%

(Shares sum to more than 100% of the 23,068 total because a business can be named by more than one model.)

What this means: A business optimized for visibility in ChatGPT has essentially no guarantee of appearing in Claude, Gemini, or any other model. There is no single “AI optimization” strategy — there are five different realities.

Finding 2: ChatGPT's Reproducibility Is 0%

Research Question: Are AI recommendations reproducible?

Barely. ChatGPT never returned the same answer twice. When we asked the exact same question once in each of three waves, only 4.2% of queries returned identical business lists all three times.

Model         Exact Match (3/3 waves)   Majority Match (≥2/3 waves)
Claude        10.1%                     56.7%
Gemini        5.0%                      51.9%
Grok          5.3%                      22.4%
Perplexity    0.4%                      13.1%
ChatGPT       0.0%                      0.0%
Figure 2: Cross-wave reproducibility by model. ChatGPT produced completely different recommendations every time.

ChatGPT produced a completely different set of recommended businesses every single time — across all 720 unique queries, not one matched across all three waves, and not one even achieved 2-of-3 agreement. Claude was the most stable, but even Claude changes its mind 43% of the time.
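
A sketch of how the two reproducibility metrics can be computed, assuming each query's three waves are stored as sets of canonical business names (the storage format is an assumption):

```python
def wave_agreement(w1: frozenset, w2: frozenset, w3: frozenset) -> tuple[bool, bool]:
    """Return (exact_match, majority_match) for one query's three waves.

    exact_match: all three waves returned the identical business set.
    majority_match: at least two of the three waves agreed.
    """
    exact = w1 == w2 == w3
    majority = exact or w1 == w2 or w1 == w3 or w2 == w3
    return exact, majority

# Hypothetical query where waves 1 and 3 agree but wave 2 differs
print(wave_agreement(
    frozenset({"Acme Plumbing", "Budget Rooter"}),
    frozenset({"City Pipe Pros"}),
    frozenset({"Acme Plumbing", "Budget Rooter"}),
))  # -> (False, True)
```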

What this means: AI recommendations are more like a slot machine than a phone book. The results change with every pull of the lever.

Finding 3: Question Phrasing Changes Everything

Research Question: Does question phrasing affect who gets recommended?

Dramatically. A Friedman test confirmed that query phrasing significantly changes which businesses are named (chi-square = 125.2, p < 0.001). Average overlap between any two phrasings: only 19.2%.

The more specific or situational the question, the more the recommendations diverge. Emergency queries (“24/7 plumber who can come tonight”) and niche queries (“best plumber for older homes”) produce the most unique recommendation sets.
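
The write-up does not include its test code, but a Friedman test over the 8 phrasings can be run with SciPy roughly as follows, assuming one outcome measurement per city×category block under each phrasing. The matrix below is random placeholder data and will not reproduce the reported chi-square of 125.2:

```python
import numpy as np
from scipy.stats import friedmanchisquare

rng = np.random.default_rng(42)

# Placeholder: rows = city x category blocks (18 x 5 = 90),
# columns = the 8 query phrasings, values = some per-block outcome
blocks = rng.random((90, 8))

stat, p = friedmanchisquare(*(blocks[:, j] for j in range(8)))
print(f"chi-square = {stat:.1f}, p = {p:.4g}")
```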

Figure 3: Jaccard overlap by query phrasing, ranging from 15.5% to 21.9% across the eight variants. On average, two different phrasings share less than 20% of their recommended businesses.

What this means: A business that appears when a consumer asks “who is the best plumber” may disappear entirely when the same consumer asks “I need a plumber for my old house.” The question itself is a variable as important as the answer.

Finding 4: Market Size Matters, Income Doesn't

Research Question: Do market characteristics affect AI recommendations?

City size matters. Income level does not. Large cities get ~10% more businesses named per response (F = 32.8, p < 0.001). But income tier has no significant effect on cross-model agreement (F = 1.9, p = 0.17).

Figure 4: Businesses named per response by city and model. Larger markets get slightly more recommendations, but the disagreement problem is universal.

City Tier   Avg Businesses per Response   Avg Models Agreeing
Large       6.80                          1.21
Mid         6.45                          1.18
Small       6.19                          1.17

What this means: Businesses in smaller markets face a thinner recommendation pool, but the fundamental problem — models disagreeing with each other — exists everywhere. Market size doesn't fix the divergence.
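
The reported F statistics are consistent with a one-way ANOVA on businesses named per response; a minimal sketch with simulated data (the group means follow the table above, while the spread and per-tier sample sizes are assumptions):

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(42)

# 10,800 queries split evenly across three tiers -> 3,600 responses per tier
large = rng.normal(6.80, 1.5, 3600)
mid   = rng.normal(6.45, 1.5, 3600)
small = rng.normal(6.19, 1.5, 3600)

F, p = f_oneway(large, mid, small)
print(f"F = {F:.1f}, p = {p:.3g}")
```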

What This Means for Local Businesses

If you run a local service business and you're thinking about “getting recommended by AI,” here's what the data actually says:

There is no single AI optimization strategy.

The five models operate in parallel universes. What works for ChatGPT may be invisible to Claude. Any consultant selling you a one-size-fits-all “AEO package” either doesn't have data or isn't sharing it. The real work requires testing across models, tracking what actually moves, and adjusting.

Yesterday's recommendation is already gone.

With 0% reproducibility from the most popular AI model and an overall exact-match rate of just 4.2%, a recommendation today means very little about tomorrow. The businesses winning this game will be the ones who monitor it continuously — not the ones who optimize once and walk away.

How your customer asks matters more than how you answer.

The same business appears or disappears depending on whether the customer says “best plumber” vs. “plumber for an old house.” You can't control how people ask. But you can ensure your business's entity data (how your business is described across the web) is rich enough to surface across different framings.

This study doesn't exist to sell you something. It exists because the data should be public. But if you're a local business owner looking at these numbers and thinking “I need someone who actually understands this” — that's what we do.

Limitations & Future Work

Temporal scope: All data was collected in a 3-day window (February 9–11, 2026). Longer intervals would likely show even greater instability. Longitudinal tracking is planned.

Category scope: Five home service trades only. Generalizability to other verticals — legal, medical, restaurants, retail — is untested but planned.

Phase 2 (pending): Tests 3–7 from the pre-registered analysis plan require business-level enrichment data (Google Places ratings, schema markup, domain age, review counts). These tests answer the core question: what predicts whether a business gets recommended? Phase 2 is designed and estimated at ~$1,400 in API costs for 23,068 businesses.

Parser accuracy: ChatGPT's unique markdown formatting required post-hoc correction, reducing total mentions from 81,363 to 71,835 (−11.7%). The correction was documented in the integrity log before statistical analysis.

Temperature settings: All models used temperature=0 for reproducibility (except GPT-5.2, which requires temperature=1 due to an API limitation). Perplexity's search-augmented nature means its results still vary.

Full Study Design

Pre-registration: OSF (osf.io/sr3fy), filed before data collection.

Design: 5 models × 18 cities × 5 categories × 8 query variants × 3 waves = 10,800 queries.

Models: ChatGPT (GPT-5.2-chat-latest), Claude (Sonnet 4.5, claude-sonnet-4-5-20250929), Gemini (2.5 Pro), Perplexity (Sonar Pro), Grok (4).

City stratification: 18 cities selected by population tier (large/mid/small), U.S. Census region, and income level (higher/lower). Each tier contains 6 cities spanning 4 regions.

Integrity: SHA-256 hash verification on all 10,800 raw JSON files. Deterministic analysis pipeline with fixed random seed (42). All code version-controlled in git.
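
A sketch of what the verification step could look like, assuming a manifest file that maps each raw JSON path to its recorded digest (the manifest format and file names are assumptions):

```python
import hashlib
import json
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Stream a file through SHA-256 and return the hex digest."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

# Hypothetical manifest: {"raw/wave1/chatgpt_denver_plumbing_v1.json": "ab12...", ...}
manifest = json.loads(Path("manifest.json").read_text())
bad = [p for p, digest in manifest.items() if sha256_of(Path(p)) != digest]
print(f"{len(manifest) - len(bad)}/{len(manifest)} files verified, {len(bad)} mismatches")
```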

Analysis pipeline: 10,800 raw JSON files → regex-based business name extraction (71,835 mentions) → canonical name matching (rapidfuzz, threshold 85) → statistical tests per pre-registered plan → automated executive summary generation.
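
The canonical-matching step uses rapidfuzz with a threshold of 85; a minimal sketch of how that lookup typically works (the scorer choice and example names are assumptions, since the pipeline description does not specify them):

```python
from rapidfuzz import fuzz, process

# Hypothetical canonical name list built during deduplication
canonical = ["Acme Plumbing Co", "Budget Rooter LLC", "City Pipe Pros"]

def canonicalize(raw_name: str) -> str | None:
    """Map an extracted mention to a canonical business name, or None if no match."""
    match = process.extractOne(
        raw_name,
        canonical,
        scorer=fuzz.token_sort_ratio,  # assumed scorer; the study only states threshold 85
        score_cutoff=85,
    )
    return match[0] if match else None

print(canonicalize("Acme Plumbing"))   # -> "Acme Plumbing Co"
print(canonicalize("Joes Handyman"))   # -> None (below threshold)
```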

Total cost: ~$105 in API fees. 100% completion rate. Zero failures across all three waves.

How to Cite This Research

APA (7th Edition): Carr, T. R. (2026). What makes AI name you? Determinants of local business recommendation across five generative AI models. The Midnight Garden. https://themidnightgarden.club/research/what-makes-ai-name-you/
MLA (9th Edition): Carr, Troy Richard. “What Makes AI Name You? Determinants of Local Business Recommendation Across Five Generative AI Models.” The Midnight Garden, Feb. 2026, themidnightgarden.club/research/what-makes-ai-name-you/.
Chicago (17th Edition): Carr, Troy Richard. “What Makes AI Name You? Determinants of Local Business Recommendation Across Five Generative AI Models.” The Midnight Garden. February 2026. https://themidnightgarden.club/research/what-makes-ai-name-you/.

Press & Citation Kit

Writing about AI recommendations, AEO, or AI search reliability? Use these resources. Attribution to The Midnight Garden and a link back to this page is appreciated.

Research Charts (High-Res)

All 5 figures from this study, ready for embedding in articles and presentations.

Pre-Registration

Full study design and analysis plan filed on OSF before data collection.

View on OSF →

Press & Research Inquiries

For interviews, data access requests, or collaboration:

troyrichardcarr@gmail.com
262.391.8137

As Seen On / Discussed On

This research has been shared and discussed across multiple platforms. Follow the conversation:

Discussion threads will be linked here as they're published on Reddit, X, and LinkedIn. If you've written about or referenced this study, let us know and we'll add your link.

Want to Know If AI Recommends Your Business?

The data shows AI recommendations are unpredictable, inconsistent, and different across every model. We help local businesses navigate that reality — with data, not guesswork.

Contact The Midnight Garden