AI Visibility API Data Models Explained: What Prompt, Citation, and Crawler Fields Actually Mean for Your Content Strategy in 2026

Prompt fields, citation objects, and crawler logs aren't just API jargon — they're the raw signals that tell you exactly why AI models cite your competitors instead of you. Here's what each field actually means and how to act on it.

Key takeaways

  • Prompt fields (volume, difficulty, fan-outs) tell you which questions to target and how hard it will be to win them — not just what people are asking.
  • Citation objects reveal the exact pages, domains, and content formats AI models trust enough to quote, which is more actionable than any backlink metric.
  • Crawler logs show whether AI bots can even find and read your content — a problem most teams discover far too late.
  • The three data types work together: crawler logs tell you if you're reachable, citation data tells you if you're trustworthy, and prompt data tells you if you're relevant.
  • Tools like Promptwatch surface all three in one place and connect them to actual content gaps and traffic attribution.

If you've started poking around an AI visibility platform's API or data export, you've probably hit a wall of fields that look deceptively simple. prompt_text. citation_url. crawler_agent. What do these actually mean? And more importantly, what are you supposed to do with them?

This guide breaks down the three core data models you'll encounter in AI visibility APIs in 2026 — prompt data, citation data, and crawler data — and explains what each field is actually measuring, why it matters, and how it should change the way you create and structure content.


The three data models, briefly

Before going field by field, it helps to understand what each model represents at a high level.

Prompt data describes the questions users are actually asking AI models. Citation data describes the sources AI models choose to quote when answering those questions. Crawler data describes how AI bots are discovering and reading your website.

These three models form a chain. If AI crawlers can't read your pages, you won't appear in citation data. If you don't appear in citation data, you won't show up when prompts are answered. Most content teams only think about the last step (visibility in answers) without realizing the problem starts much earlier.
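To make that chain concrete, here's a minimal sketch of what records from each model might look like, using the field names covered in this guide. Exact shapes and names vary by platform, so treat this as an illustrative schema rather than any vendor's actual API.

```python
from dataclasses import dataclass, field

@dataclass
class PromptRecord:
    """One tracked question and its demand signals."""
    prompt_text: str
    prompt_volume: int           # estimated ask frequency across AI platforms
    prompt_difficulty: int       # 0-100: how entrenched the current sources are
    fan_out_queries: list[str] = field(default_factory=list)
    model_coverage: list[str] = field(default_factory=list)  # e.g. ["perplexity"]

@dataclass
class CitationRecord:
    """One source an AI model quoted when answering a prompt."""
    prompt_text: str
    citation_url: str            # the exact page, not just the domain
    citation_position: str       # "inline" | "reference_list" | "sources_panel"
    citation_frequency: float    # share of responses citing this URL, 0.0-1.0
    source_type: str             # "brand_site" | "reddit" | "youtube" | ...
    llm_model: str               # which model produced the response

@dataclass
class CrawlRecord:
    """One AI-crawler visit to one of your pages."""
    url: str
    crawler_agent: str           # e.g. "GPTBot"
    crawl_timestamp: str         # ISO 8601
    http_status_code: int
    content_extracted: bool      # visited is not the same as read
    render_type: str             # "html_only" | "javascript"
```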


Prompt data fields

prompt_text

This is the literal question or query being tracked. It sounds obvious, but there's nuance here. Prompt text in AI visibility platforms isn't the same as a keyword. It's a full natural-language question — often conversational, sometimes multi-sentence — that mirrors how real users interact with ChatGPT, Perplexity, or Claude.

"Best CRM for small business" is a keyword. "What's the best CRM for a 10-person sales team that doesn't want to deal with Salesforce?" is a prompt. The difference matters because AI models answer the latter with a very different set of sources than they'd surface for the former.

When you're building a prompt set to track, you want prompt_text values that match how your actual customers phrase questions — not how your SEO team phrases them.

prompt_volume

This field estimates how often a given prompt (or semantically similar variants) is being asked across AI platforms. Volume estimates in AI visibility tools are inherently less precise than traditional search volume data — AI platforms don't publish query logs the way Google does — but they're useful for prioritization.

A high-volume prompt that you're not appearing in is a gap worth fixing urgently. A low-volume prompt where you're already cited is probably fine to leave alone. Ranking your tracked prompts by that gap between demand and current presence is where your content roadmap should come from.

prompt_difficulty

Difficulty scores estimate how competitive a prompt is — specifically, how many authoritative sources are already being cited for it and how consistently. A prompt with a difficulty score of 90 means well-established domains are dominating that answer, and displacing them will take significant content investment. A score of 30 means the AI models are pulling from a fragmented set of sources, and a single well-structured page could break in.

This is one of the most underused fields in AI visibility data. Most teams focus on whether they appear, not on how hard it would be to appear. Difficulty-adjusted prioritization is what separates a realistic content roadmap from wishful thinking.
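As a sketch of what difficulty-adjusted prioritization can look like in practice (the scoring formula here is an illustrative assumption, not a standard):

```python
def priority_score(volume: int, difficulty: int, already_cited: bool) -> float:
    """Rank prompts: high demand, low competition, not yet won."""
    if already_cited:
        return 0.0  # already winning; no new investment needed
    # Scale difficulty (0-100) into a multiplier that penalizes entrenched prompts.
    return volume * (1 - difficulty / 100)

prompts = [
    ("best CRM for a 10-person sales team", 900, 90, False),
    ("CRM data migration checklist", 400, 30, False),
    ("what is a CRM", 5000, 95, True),
]
for text, vol, diff, cited in sorted(
    prompts, key=lambda p: priority_score(p[1], p[2], p[3]), reverse=True
):
    print(f"{priority_score(vol, diff, cited):7.1f}  {text}")
```

Note how the low-volume, low-difficulty prompt outranks the high-volume prompt that established domains already own. That reordering is exactly what keyword-driven roadmaps tend to miss.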

fan_out_queries

This is where prompt data gets genuinely interesting. A fan-out is the set of sub-queries an AI model generates internally when processing a prompt. When a user asks "how do I improve my AI search visibility?", the model doesn't just answer that question directly — it expands it into sub-questions: "what is AI search visibility?", "which AI models cite web sources?", "what content formats get cited by Perplexity?", and so on.

Fan-out data shows you the full cluster of questions your content needs to answer to be considered a comprehensive source. A page that only answers the surface-level prompt will lose to a page that addresses the entire fan-out cluster. This is why thin content keeps failing even when it's technically "optimized" for the main keyword.
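One hedged sketch of how to act on this: score each page's coverage of a prompt's fan-out cluster, assuming you've already mapped which sub-queries each page answers (that mapping is editorial judgment, not an API field):

```python
def fanout_coverage(fan_out_queries: list[str], answered: set[str]) -> float:
    """Fraction of a prompt's fan-out cluster that a page actually addresses."""
    if not fan_out_queries:
        return 1.0
    return sum(q in answered for q in fan_out_queries) / len(fan_out_queries)

fan_out = [
    "what is AI search visibility?",
    "which AI models cite web sources?",
    "what content formats get cited by Perplexity?",
]
page_answers = {"what is AI search visibility?"}  # a surface-level page
print(f"coverage: {fanout_coverage(fan_out, page_answers):.0%}")  # coverage: 33%
```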

model_coverage

This field tells you which AI models (ChatGPT, Perplexity, Claude, Gemini, etc.) are answering a given prompt and which ones are citing sources at all. Not every model cites external sources for every prompt type. Perplexity almost always cites sources. ChatGPT's behavior depends on whether web browsing is active. Google AI Overviews have their own citation logic.

Knowing which models are answering your target prompts — and which ones are actually pulling external citations — helps you prioritize where to focus your optimization efforts.


Citation data fields

citation_url

The most fundamental field in citation data: the exact URL being cited in an AI response. This is not the domain — it's the specific page. That distinction matters enormously.

A lot of teams look at domain-level citation data and conclude "we're getting cited." But when you drill into the URL level, you often find that a single page (usually a listicle, a comparison, or a detailed how-to) is doing all the work, while the rest of the site is invisible. That's a content architecture problem, and you can't see it without URL-level data.
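Here's an illustrative way to surface that pattern from URL-level citation data, using only Python's standard library (the URLs are placeholders):

```python
from collections import Counter

cited_urls = [  # placeholder URL-level citation data for one domain
    "https://example.com/blog/best-crm-comparison",
    "https://example.com/blog/best-crm-comparison",
    "https://example.com/blog/best-crm-comparison",
    "https://example.com/pricing",
]

total = len(cited_urls)
top_url, top_count = Counter(cited_urls).most_common(1)[0]

# Domain-level data says "we're getting cited"; the URL-level view shows
# whether a single page is carrying the whole site.
print(f"domain citations: {total}")
print(f"top page share:   {top_count / total:.0%} ({top_url})")
```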

citation_position

Where in the response the citation appears. Citation position data typically distinguishes between citations that appear inline (the AI is directly quoting or paraphrasing the source), citations that appear in a reference list at the end, and citations that appear in a "sources" panel (common in Perplexity and Google AI Overviews).

Inline citations carry more weight than reference-list citations. If your URL is consistently appearing in position 4 or 5 of a reference list but never inline, that's a signal that AI models are aware of your content but don't trust it enough to quote directly. The fix is usually structural — answer-first formatting, clearer claims, more specific data.
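One way to operationalize that difference is to weight citations by where they appear when scoring pages. The weights below are illustrative assumptions, not published model behavior:

```python
# Illustrative weights: inline quotes signal more trust than list mentions.
POSITION_WEIGHTS = {"inline": 1.0, "sources_panel": 0.5, "reference_list": 0.25}

def weighted_citations(positions: list[str]) -> float:
    """Sum position-weighted credit for a page across tracked responses."""
    return sum(POSITION_WEIGHTS.get(p, 0.0) for p in positions)

ours = ["reference_list"] * 5             # seen, but never quoted directly
theirs = ["inline", "inline", "sources_panel"]
print(weighted_citations(ours), weighted_citations(theirs))  # 1.25 3.0
```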

citation_frequency

How often a given URL is cited across all responses to a given prompt (or across all prompts in your tracked set). A page with high citation frequency is doing something right — it's consistently being selected as a trustworthy source. A page with low frequency despite being indexed is a candidate for content improvement.

Frequency data is also useful for competitive analysis. If a competitor's page is being cited 80% of the time for a prompt you care about, you need to understand what that page does differently before you can displace it.

citation_sentiment

Some platforms include sentiment analysis on how the AI is using a citation — whether it's being cited as a positive example, a cautionary example, or a neutral reference. This is a newer field and not universally available, but it's worth paying attention to. Being cited as "an example of what not to do" is technically a citation, but it's not the kind of visibility you want.

source_type

This field classifies what kind of source is being cited: a brand website, a Reddit thread, a YouTube video, a news article, a Wikipedia page, an industry report, etc. Source type data reveals something important: AI models don't just cite brand websites. They cite wherever the best answer lives.

If Reddit threads are consistently being cited for prompts in your industry, that's a channel you need to be active in. If YouTube videos are being cited, that's a content format gap. Source type analysis often reveals that the competition for AI citations is much broader than just your direct competitors' websites.

[Figure: flowchart of the AI search process, from understanding and expanding a query through fetching, filtering, and synthesizing an answer.]

The AI search process — from query understanding to synthesized answer — is what citation data is actually measuring at each step.

llm_model

Which AI model produced the response containing this citation. Citation patterns differ significantly across models. A page that gets cited by Perplexity 60% of the time might barely register on Claude. Understanding per-model citation behavior helps you tailor content for the specific models your audience uses most.


Crawler data fields

crawler_agent

The user agent string identifying which AI crawler visited your site. Common ones in 2026 include GPTBot (OpenAI), ClaudeBot (Anthropic), PerplexityBot, and meta-externalagent (Meta AI); Google-Extended controls Google AI products but is a robots.txt token rather than a crawler with its own user agent, so it won't appear in logs directly. Each crawler has different crawl behaviors, different politeness settings, and different content preferences.

Knowing which crawlers are hitting your site — and which aren't — is the starting point for diagnosing AI visibility problems. If GPTBot hasn't visited your site in 30 days, that's a problem worth investigating before you spend time optimizing content.
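If your platform doesn't expose crawler data directly, your own server access logs are a reasonable first approximation. A minimal sketch that tallies visits per AI crawler, assuming the user-agent substrings above appear in each log line:

```python
from collections import Counter

AI_CRAWLERS = ["GPTBot", "ClaudeBot", "PerplexityBot", "meta-externalagent"]

def count_ai_crawler_hits(log_path: str) -> Counter:
    """Tally visits per AI crawler by scanning user-agent strings in a log."""
    hits = Counter()
    with open(log_path) as log:
        for line in log:
            for bot in AI_CRAWLERS:
                if bot in line:
                    hits[bot] += 1
    return hits

# Example (the path is hypothetical):
# print(count_ai_crawler_hits("/var/log/nginx/access.log"))
# A bot that never appears here hasn't discovered your site at all.
```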

crawl_timestamp

When the crawler last visited a specific page. Recency matters. AI models that use real-time web retrieval (like Perplexity) need to have crawled your page recently to include it in responses. A page that was last crawled six months ago is effectively stale for retrieval-augmented generation.

Crawl frequency is influenced by your site's crawl budget, page speed, internal linking, and how often you update content. Pages that are updated regularly tend to get crawled more frequently.
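A quick staleness check over last-crawl timestamps, assuming ISO 8601 strings as in the sketch schema earlier:

```python
from datetime import datetime, timedelta, timezone

def stale_pages(last_crawled: dict[str, str], max_age_days: int = 30) -> list[str]:
    """Return URLs whose last AI-crawler visit is older than max_age_days."""
    cutoff = datetime.now(timezone.utc) - timedelta(days=max_age_days)
    return [
        url for url, ts in last_crawled.items()
        if datetime.fromisoformat(ts) < cutoff
    ]

crawls = {  # placeholder crawl_timestamp values
    "https://example.com/guide": "2026-01-02T08:15:00+00:00",
    "https://example.com/pricing": "2025-07-14T11:40:00+00:00",
}
print(stale_pages(crawls))
```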

http_status_code

The response code the crawler received when it visited the page. A 200 means the page loaded successfully. A 404 means the page doesn't exist. A 403 means the crawler was blocked. A 503 means the server was temporarily unavailable.

This field catches a surprisingly common problem: pages that are blocked from AI crawlers via robots.txt or server-side rules, often accidentally. If you've added Disallow: / for GPTBot in your robots.txt (a common recommendation during the early AI panic of 2023-2024), you may have inadvertently made your entire site invisible to ChatGPT's retrieval system.
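You can audit this directly with Python's standard-library robots.txt parser before spending any time on content:

```python
from urllib.robotparser import RobotFileParser

def bot_allowed(site: str, bot: str = "GPTBot", path: str = "/") -> bool:
    """Check whether a site's live robots.txt lets an AI crawler fetch a path."""
    parser = RobotFileParser()
    parser.set_url(f"{site}/robots.txt")
    parser.read()  # fetches and parses the robots.txt over the network
    return parser.can_fetch(bot, f"{site}{path}")

# A leftover "Disallow: /" rule for GPTBot shows up here as False:
# print(bot_allowed("https://example.com"))
```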

content_extracted

Whether the crawler was able to extract meaningful content from the page. This field distinguishes between a crawler that visited a page and a crawler that actually read it. JavaScript-heavy pages, pages behind login walls, and pages with content loaded via client-side rendering often get visited but not extracted.

If your content_extracted field is consistently false for certain pages, those pages are functionally invisible to AI models regardless of how well-optimized the content is. This is a technical problem, not a content problem — and it requires a technical fix (server-side rendering, pre-rendering, or dynamic rendering).

render_type

Related to content extraction: whether the crawler rendered the page as HTML only or executed JavaScript. Most AI crawlers don't execute JavaScript by default, which means any content that only appears after JavaScript runs — dynamic product descriptions, client-rendered blog posts, lazy-loaded FAQs — is invisible to them.

This is one of the most underappreciated technical issues in AI visibility. A site that looks perfectly functional to a human visitor can be nearly empty from an AI crawler's perspective.
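A crude but useful test: fetch the raw HTML without executing JavaScript, which is roughly what most AI crawlers see, and check whether a phrase from your main content is actually present. A standard-library sketch:

```python
from urllib.request import Request, urlopen

def visible_without_js(url: str, phrase: str) -> bool:
    """True if a content phrase appears in the raw HTML, before any JS runs."""
    req = Request(url, headers={"User-Agent": "render-check"})
    html = urlopen(req, timeout=10).read().decode("utf-8", errors="replace")
    return phrase in html

# If this returns False for copy a human can see in the browser, the page is
# client-rendered and likely invisible to crawlers that don't execute JS.
# print(visible_without_js("https://example.com/guide", "answer-first structure"))
```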

crawl_errors

Specific errors encountered during the crawl: timeout errors, redirect loops, SSL certificate issues, blocked resources. Each error type has a different fix, but all of them reduce the probability that your content gets indexed and cited.


How the three data models connect

The real value of these data models isn't in any single field — it's in the relationships between them.

| Signal | What it tells you | What to do |
| --- | --- | --- |
| High prompt volume + low citation frequency | You're missing a high-value opportunity | Create or improve content targeting that prompt cluster |
| High citation frequency + low crawler recency | Your content is trusted but going stale | Update the page and improve crawl frequency |
| Crawler visits but no content extracted | Technical rendering problem | Fix JavaScript rendering or add server-side rendering |
| Competitor cited inline, you cited in reference list | Your content is less authoritative | Restructure with answer-first formatting, add specific data |
| Reddit/YouTube cited more than brand sites | Off-site content gap | Publish on those platforms, not just your own site |
| High fan-out count + single page targeting prompt | Content architecture gap | Build supporting pages for each sub-query |
| Low prompt difficulty + zero citations | Quick win available | Publish a targeted page immediately |

This table is essentially a content strategy framework built from API fields. Each row is a diagnostic pattern with a clear action.
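To show how the table translates into practice, here's an illustrative diagnostic that applies a few of these patterns to one prompt/page pair; the thresholds are arbitrary assumptions you'd tune to your own data:

```python
def diagnose(volume: int, cited_share: float, days_since_crawl: int,
             content_extracted: bool) -> list[str]:
    """Apply a few of the diagnostic patterns above to one prompt/page pair."""
    findings = []
    if not content_extracted:
        findings.append("crawled but not extracted -> fix rendering")
    if volume > 500 and cited_share < 0.05:
        findings.append("high volume, rarely cited -> create or improve content")
    if cited_share > 0.30 and days_since_crawl > 90:
        findings.append("trusted but stale -> update page, improve crawl frequency")
    return findings or ["no diagnostic pattern matched"]

print(diagnose(volume=900, cited_share=0.02, days_since_crawl=10,
               content_extracted=True))
```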


What this means for content creation

Understanding these data models changes how you approach content in a few concrete ways.

First, you stop writing for keywords and start writing for prompt clusters. A single piece of content needs to address the main prompt, its fan-out sub-queries, and the specific claims AI models are looking for when they decide whether to cite a source inline.

Second, you start thinking about extractability as a first-class concern. Content that's buried in JavaScript, hidden behind tabs, or structured in ways that make it hard to parse programmatically is content that AI crawlers can't use. Answer-first structure — where the direct answer appears in the first paragraph, followed by supporting detail — consistently outperforms content that buries the answer.

Third, you treat citation data as a feedback loop, not a vanity metric. The question isn't "are we being cited?" — it's "which pages are being cited, for which prompts, by which models, and at what position?" That level of specificity is what turns citation data into an actionable editorial calendar.

Promptwatch is built around exactly this loop — it surfaces answer gaps, generates content targeting those gaps, and then tracks whether that content starts getting cited. The data models described in this guide are what power that workflow.


Tools that expose these data models

A few platforms worth knowing about if you're working with AI visibility data at any level of depth:

Profound covers enterprise-level prompt and citation tracking across multiple AI models.


Peec AI offers prompt and citation monitoring with a relatively clean interface for teams getting started.


AthenaHQ focuses on brand tracking across AI search engines, though it's primarily a monitoring tool without content generation.


For teams that need crawler log analysis specifically, xSeek tracks GPTBot activity and pairs it with rank tracking.


And if you need to fix the technical rendering issues that crawler data often surfaces, Prerender.io handles JavaScript pre-rendering so AI crawlers can actually read your content.


A note on data freshness and model behavior

One thing worth keeping in mind: AI models don't all behave the same way, and their citation behavior changes over time. A model update can shift which sources get cited for a given prompt without any change on your end. This is why tracking citation frequency over time — not just point-in-time snapshots — matters.

The crawl_timestamp and citation_frequency fields together give you a time-series view of your AI visibility. If your citation frequency drops after a model update, that's a signal to investigate what changed in the model's preferences, not necessarily a signal that your content got worse.
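A minimal way to turn those two fields into a trend signal, comparing citation frequency across two windows instead of reading a single snapshot (the window size is an arbitrary choice):

```python
def citation_trend(weekly_freq: list[float], window: int = 4) -> float:
    """Change in average citation frequency: recent window vs the one before."""
    recent = sum(weekly_freq[-window:]) / window
    prior = sum(weekly_freq[-2 * window:-window]) / window
    return recent - prior

freq = [0.42, 0.45, 0.44, 0.46, 0.31, 0.28, 0.30, 0.27]  # drop after a model update
print(f"trend: {citation_trend(freq):+.2f}")  # trend: -0.15
```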

The platforms that surface this kind of longitudinal data are more useful than those that only show you a current-state snapshot. Point-in-time data tells you where you are. Time-series data tells you whether you're moving in the right direction.


Putting it together

The API fields described in this guide aren't abstract technical concepts — they're the measurements that tell you whether your content strategy is working in an AI-first search environment.

Prompt volume and difficulty tell you where to focus. Fan-out data tells you how comprehensive your content needs to be. Citation URL and position data tell you which pages are working and which aren't. Crawler fields tell you whether AI models can even reach your content in the first place.

Most teams are making content decisions based on traditional SEO metrics that don't capture any of this. That's the gap. The teams winning AI visibility in 2026 are the ones treating these data models as primary inputs to their editorial process — not afterthoughts.

Start with crawler logs. Fix what's broken. Then look at citation data to understand what's working. Then use prompt data to find the gaps. That sequence is the right order of operations, and the data models described here are the tools that make it possible.
