Methodology & sources

How the audit works

Every audit ends with this section so you can see what we checked, where each recommendation comes from, and how confident we are in it. Share this page when you want to send the framework — not a specific run — to a client or developer.

What the audit checks and why

01. Can AI find your site?

We check robots.txt, sitemap discovery, and whether your CDN/WAF is letting documented AI crawlers reach your pages.
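The first of these checks can be sketched with Python's standard-library robots.txt parser. The sample file and paths below are made up for illustration; a real audit would fetch the site's live robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body; a live audit would download
# https://<site>/robots.txt instead of using an inline sample.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /private/

User-agent: *
Disallow:
"""

def crawler_allowed(robots_txt: str, user_agent: str, path: str) -> bool:
    """Parse a robots.txt body and ask whether user_agent may fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)

print(crawler_allowed(ROBOTS_TXT, "GPTBot", "/private/page"))  # False
print(crawler_allowed(ROBOTS_TXT, "GPTBot", "/blog/post"))     # True
```

The same call works for any documented crawler token (ClaudeBot, PerplexityBot, and so on); only the user-agent string changes.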

02. Can AI understand your content?

We check headings and page structure, structured data, and whether the main content is reachable without running JavaScript.
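A minimal sketch of this kind of static-HTML probe, using only the standard library: it records the headings and JSON-LD blocks visible without executing JavaScript. The sample page and class name are hypothetical, not the audit's actual parser.

```python
import json
from html.parser import HTMLParser

class StructureProbe(HTMLParser):
    """Collect headings and JSON-LD blocks from server-rendered HTML."""
    def __init__(self):
        super().__init__()
        self.headings = []   # (tag, text) pairs, in document order
        self.json_ld = []    # parsed application/ld+json payloads
        self._capture = None

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2", "h3", "h4", "h5", "h6"):
            self._capture = tag
        elif tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._capture = "ld+json"

    def handle_endtag(self, tag):
        self._capture = None

    def handle_data(self, data):
        if self._capture == "ld+json":
            self.json_ld.append(json.loads(data))
        elif self._capture:
            self.headings.append((self._capture, data.strip()))

HTML = """<html><body>
<h1>Example article</h1>
<script type="application/ld+json">{"@type": "Article"}</script>
<div id="app"></div>
</body></html>"""

probe = StructureProbe()
probe.feed(HTML)
print(probe.headings)  # [('h1', 'Example article')]
print(probe.json_ld)   # [{'@type': 'Article'}]
```

If the probe finds an empty body and a single mount-point div, the main content almost certainly requires JavaScript, which many AI crawlers do not run.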

03. Will AI cite you as a source?

We check trust signals (HTTPS, author markup, outbound citations, freshness) and the content-shape factors that nudge an AI assistant toward picking you when it answers.
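These trust signals can be sketched as a simple checklist over page metadata. The field names, the 18-month freshness window, and the input shape are illustrative assumptions, not the audit's actual scoring model.

```python
from datetime import date, timedelta

def trust_signals(url: str, metadata: dict, today: date) -> dict:
    """Evaluate the four trust signals for one page.

    metadata is assumed to hold JSON-LD-style fields; the 548-day
    (roughly 18-month) freshness window is an arbitrary example.
    """
    modified = metadata.get("dateModified")
    fresh = (
        modified is not None
        and (today - date.fromisoformat(modified)) <= timedelta(days=548)
    )
    return {
        "https": url.startswith("https://"),
        "author_markup": "author" in metadata,
        "outbound_citations": metadata.get("citation_count", 0) > 0,
        "fresh": fresh,
    }

page = {"author": {"name": "A. Writer"}, "dateModified": "2025-06-01",
        "citation_count": 3}
# Every signal is True for this hypothetical page.
print(trust_signals("https://example.com/post", page, date(2025, 12, 1)))
```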

Where these recommendations come from

We don't make up rules. Every check is grounded in one of four source types — and findings link back to the exact doc they came from.

Official documentation from AI companies

Crawler names, user agents, and opt-out mechanisms come straight from each provider's own documentation.

Established web standards

Checks like robots.txt parsing, structured data, and sitemap discovery follow the documented protocol — not heuristics.
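Sitemap discovery is a good example of following the documented protocol rather than a heuristic: the sitemaps protocol lets robots.txt declare any number of Sitemap: directives, and the field name is case-insensitive. A minimal stdlib sketch:

```python
def discover_sitemaps(robots_txt: str) -> list[str]:
    """Collect Sitemap: directives from a robots.txt body.

    Splits each line at the first colon only, so the colon inside the
    sitemap URL itself is left intact.
    """
    sitemaps = []
    for line in robots_txt.splitlines():
        key, _, value = line.partition(":")
        if key.strip().lower() == "sitemap" and value.strip():
            sitemaps.append(value.strip())
    return sitemaps

# Hypothetical robots.txt with two declared sitemaps.
ROBOTS = (
    "User-agent: *\n"
    "Disallow:\n"
    "Sitemap: https://example.com/sitemap.xml\n"
    "sitemap: https://example.com/news.xml\n"
)
print(discover_sitemaps(ROBOTS))
# ['https://example.com/sitemap.xml', 'https://example.com/news.xml']
```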

Emerging standards


Worth knowing about, but not yet officially adopted. We surface these as low-effort possible-future wins, not as critical fixes.
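One such emerging proposal is llms.txt: a Markdown file served from the site root that gives assistants a curated map of your most useful pages. The shape below follows the public llms.txt proposal; the company, pages, and URLs are invented.

```markdown
# Example Co

> Example Co makes widgets. This file lists the pages most useful
> to AI assistants answering questions about us.

## Docs

- [Product overview](https://example.com/overview.md): what the product does
- [API reference](https://example.com/api.md): endpoints and authentication

## Optional

- [Company history](https://example.com/history.md)
```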

Research and industry observation

Content-structure recommendations and trust-signal heuristics lean on published research and reporting from infrastructure providers, rather than on the AI companies' own documentation.

How confident we are in each finding

Every finding carries one of three confidence levels so you can tell at a glance which recommendations are based on documented standards and which are based on observed behaviour or emerging practice.

Definitive
Based on standards or official documentation.
Examples: robots.txt blocks GPTBot; no JSON-LD Article schema present.
Suggestive
Based on observed behaviour, not conclusive.
Example: CDN appears to block GPTBot — our audit IP received 403, but real GPTBot uses published IP ranges that may be treated differently.
Emerging best practice
Based on early signals or recent research, not established standards.
Example: llms.txt is recommended but not yet required by major AI providers.
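The suggestive level above can be made concrete with a tiny classifier: a 403 seen from the audit's own IP is only suggestive, because real GPTBot traffic comes from published IP ranges that a CDN may treat differently. The function name, labels, and logic are illustrative, not the audit's actual implementation.

```python
def classify_block_finding(status: int, from_published_crawler_ip: bool) -> str:
    """Map an observed HTTP response to a confidence level.

    Only a 403 observed from a provider's published crawler IP range
    (or an explicit robots.txt rule) counts as definitive; the same 403
    from an unrelated audit IP is merely suggestive.
    """
    if status == 403:
        return "definitive" if from_published_crawler_ip else "suggestive"
    return "no-block-observed"

print(classify_block_finding(403, from_published_crawler_ip=False))  # suggestive
```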

About this audit run

Reproducible

We show the exact provenance of every report so anyone receiving it can re-run it, verify it, or just trust where the numbers came from.

Audit timestamp: recorded per audit
Tool version: recorded per audit
Audit region: recorded per audit
User agent: recorded per audit
Pages sampled: recorded per audit
Duration: recorded per audit