What Is llms.txt? (Definition + TL;DR)
If you've spent any time in the AI search trenches, you've already noticed the gap. AI engines crawl your site, but they often don't know what's actually worth reading on it. They burn budget on login pages, archive paths, and JS-rendered shells. They miss the one pricing page or the one explainer post you'd want them to cite. llms.txt is the proposal to fix that — a five-minute file at your domain root that tells LLMs which URLs matter most.
The format is intentionally minimal: a Markdown document with an H1 site name, a one-line blockquote summary, H2 sections grouping related content (Docs, Blog, API, Examples), and bullet links with descriptions. No XML, no JSON, no schemas to validate against a registry. Just Markdown that any human can read and any LLM can parse without a tokenizer wrestling match. The whole file usually weighs 2–10 KB.
It sits alongside robots.txt and sitemap.xml as the third file at your site root that crawlers care about — but with a different purpose. robots.txt grants or denies access. sitemap.xml exhaustively lists URLs for indexing. llms.txt curates the citable shortlist for AI engines. The rest of this guide covers where it came from, how to write one, and whether it's worth the effort given inconsistent adoption today. Spoiler: yes, it's worth shipping. The cost is five minutes and the upside is real on Perplexity and Anthropic platforms today, plus optionality on every other engine for the next 24 months.
The History — Why llms.txt Was Proposed
The proposal landed on September 3, 2024, in a single GitHub repository and accompanying blog post by Jeremy Howard, founder of fast.ai and Answer.AI. Howard had spent the prior year building Answer.AI's research tooling around long-context LLMs and kept hitting the same wall: the open web is structured for humans and classical search engines, not for the inference-time retrieval pipelines AI products run. Sites would publish thousands of pages and an LLM trying to summarize the company would chew through irrelevant routes — login screens, faceted search results, paginated archives — before finding the actual product page.
The two existing files at the root — robots.txt and sitemap.xml — couldn't bridge the gap. robots.txt is binary access control: allowed or disallowed, no priority weighting. sitemap.xml lists every URL you want indexed in flat XML, often tens of thousands of entries with no editorial signal about which ones matter most. Neither file tells an AI system "if you only have time to read five pages, read these five." That gap is what llms.txt fills.
The other half of the problem is JavaScript rendering. Most AI crawlers (GPTBot, ClaudeBot, PerplexityBot in their default modes) do not execute JavaScript. They see the raw HTML response, which on modern frontend stacks (Vue SPAs, React without SSR, hydration-only Next.js apps) is often a near-empty shell with a <div id="root"> and nothing else. llms.txt sidesteps this by serving canonical, plain-text Markdown — content the crawler can actually read regardless of frontend stack.
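A rough way to see this problem on your own pages: strip a raw HTML response down to its visible text and check whether anything survives without JavaScript execution. This is a heuristic sketch, not any crawler's actual logic — the function name and the 200-character threshold are invented for illustration:

```python
import re

def looks_like_spa_shell(html: str, min_text_chars: int = 200) -> bool:
    """Heuristic: does this raw HTML carry readable content, or is it a
    JS-only shell that a non-rendering crawler would see as empty?"""
    # Drop script/style bodies, then all remaining tags, then measure
    # how much visible text is left.
    no_scripts = re.sub(r"(?is)<(script|style)[^>]*>.*?</\1>", " ", html)
    text = re.sub(r"(?s)<[^>]+>", " ", no_scripts)
    visible = " ".join(text.split())
    return len(visible) < min_text_chars

shell = '<html><body><div id="root"></div><script src="/app.js"></script></body></html>'
article = "<html><body><h1>Pricing</h1><p>" + "Plans start at a free tier. " * 20 + "</p></body></html>"
print(looks_like_spa_shell(shell), looks_like_spa_shell(article))  # prints: True False
```

Run this against the output of a plain `curl` (no headless browser) on your own URLs: if your key pages come back shell-like, an llms.txt pointing at them won't help until the pages themselves render content server-side.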
Howard's framing in the original proposal was simple. The web has /robots.txt for crawlers, /humans.txt for readers (a niche convention from the 2010s), /security.txt for vulnerability disclosure, and /.well-known/ for metadata. /llms.txt slots cleanly into that family — a curated, machine-readable manifest specifically for the new wave of AI agents that read sites differently than browsers do. By late 2024 Anthropic had adopted it on anthropic.com/llms.txt; by Q1 2025 Cloudflare, Vercel, Astro, NuxtLabs, and Linear had followed. Adoption among dev-tooling companies has been steady ever since.
llms.txt vs robots.txt vs sitemap.xml — When to Use What
The three files at your site root each answer a different question. robots.txt answers "who can crawl what?" sitemap.xml answers "what URLs exist?" llms.txt answers "what URLs matter most for AI?" They're additive — most sites should have all three.
| Attribute | robots.txt | sitemap.xml | llms.txt |
|---|---|---|---|
| Purpose | Access control for crawlers | Exhaustive URL list for indexing | Curated AI ingestion priority |
| Format | Plain text directives | XML schema | Plain Markdown |
| Audience | All well-behaved crawlers | Search engine bots (Google, Bing) | AI agents (ChatGPT, Claude, Perplexity) |
| Indexing role | Allow/disallow paths | List all URLs | Highlight most citable URLs |
| Parsing | Strict syntax | Strict XML | Loose Markdown, human-readable |
The practical mental model: if you were adding these three files one at a time, the order of impact today is robots.txt first (without it, crawlers may not reach you at all or may crawl too aggressively), sitemap.xml second (gets your full URL set into Google's index), and llms.txt third (signals priority to AI engines on top of the other two).
A common error is treating llms.txt as a replacement for one of the others. It isn't. Removing your sitemap.xml and adding llms.txt would tank your Google indexation while only marginally helping AI citation. Removing robots.txt and replacing it with llms.txt does nothing useful — different bots read different files. Ship all three, keep them in sync, and treat llms.txt as the editorial layer on top of the structural ones.
There's also a question of who reads which file in practice. robots.txt is read by virtually every well-behaved crawler. sitemap.xml is read primarily by Google, Bing, and a handful of SEO tools. llms.txt today is read consistently by Perplexity, Anthropic's tooling, and a long tail of open-source LLM projects (LangChain ingestion pipelines, LlamaIndex loaders, etc.). The list grows quarterly — Cloudflare's AI Audit beta added llms.txt awareness in early 2026, and several smaller AI search products bundle llms.txt parsing into their crawl pipelines.
The llms.txt Specification — Format Explained
The format is a Markdown document with four required parts and one optional section. It's loose enough that you can hand-write it in a text editor in five minutes, strict enough that AI systems and validators can parse it deterministically.
The four required parts, plus the optional section:
- H1: Site name. Exactly one H1 at the very top, holding your site or company name. This is the entity anchor.
- Blockquote: One-line summary. A Markdown blockquote (>) immediately after the H1 with a single sentence describing the site. Treat it as your elevator pitch — what an LLM will quote when asked "what does this site do?"
- H2 sections. Logical groupings of links: ## Docs, ## Examples, ## API, ## Blog, ## Pricing. Use 2–6 sections for most sites.
- Bullet links with descriptions. Each entry under an H2 follows the pattern - [Link text](https://full-url): One-sentence description. The colon-and-description pattern is what separates llms.txt from a generic Markdown link list.
- Optional H2 section. A ## Optional section at the end for low-priority URLs the AI can deprioritize when budget is tight.
A worked example, in the format you'd publish today:
```
# SiteTest.ai

> AI-powered website audit tool — 168 SEO and AI-search checks for ChatGPT, Perplexity, and AI Overviews visibility.

## Docs

- [How it works](https://sitetest.ai/how-it-works): Methodology behind the 168 checks across crawlability, schema, and AI citability.
- [Pricing](https://sitetest.ai/pricing): Plans from a free tier to $24.99 per audit, plus team and agency options.

## Blog

- [GEO Guide](https://sitetest.ai/blog/generative-engine-optimization-guide): The 14 tactics and 15-step checklist for Generative Engine Optimization.
- [AI Visibility](https://sitetest.ai/blog/ai-visibility-checker-guide): Eight metrics and eight tools for tracking AI citations.

## Optional

- [Changelog](https://sitetest.ai/changelog): Product release notes — useful for AI agents but not high priority.
```
That's it. No JSON schema, no required fields beyond the structure above. The whole file fits in a few tweets' worth of text, and validators check for the H1, the blockquote, at least one H2 section, and well-formed Markdown links.
The llms-full.txt variant is a sibling file at /llms-full.txt that takes the same approach but goes further — it concatenates the full text content of your most important pages into a single document, not just links. Documentation sites use it to expose their entire docs corpus as a single text blob LLMs can ingest offline. The cost is much higher: typical llms-full.txt files run 200 KB to several megabytes, and they need regeneration whenever content changes. Most sites should ship llms.txt only and skip llms-full.txt unless they have stable canonical content (technical specs, public APIs, formal docs) where a one-shot dump genuinely helps downstream LLM consumers.
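For sites with a stable docs corpus, llms-full.txt is usually generated rather than hand-written. A minimal sketch of a concatenating generator — the layout here (an H2 per page, a source-URL line, then the body) is an assumed convention, since the spec doesn't mandate an exact structure, and the function name is illustrative:

```python
def build_llms_full(site: str, pages: list[tuple[str, str, str]]) -> str:
    """Concatenate full page bodies into one llms-full.txt-style document.
    pages holds (title, url, markdown_body) tuples; assumes you can already
    export each page's content as plain Markdown."""
    parts = [f"# {site}", ""]
    for title, url, body in pages:
        parts += [f"## {title}", f"Source: {url}", "", body.strip(), ""]
    return "\n".join(parts)

doc = build_llms_full("SiteTest.ai", [
    ("Pricing", "https://sitetest.ai/pricing", "Plans from a free tier to $24.99 per audit."),
])
print(doc.splitlines()[0])  # prints: # SiteTest.ai
```

Wire a script like this into your docs build so the file regenerates on every content change — a stale llms-full.txt is worse than none.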
Step-by-Step: How to Create Your llms.txt
After running 100+ audits, I've seen the same pattern over and over: teams either ship a 30-second llms.txt that nails the basics or a sprawling, broken file that misses the path entirely. The eight-step workflow below is what we use internally at sitetest.ai when we add llms.txt to a client site.
Step 1: Inventory your most citable URLs. List 5–30 URLs that best represent your site. Homepage, pricing, top 5–10 blog posts, documentation index, key feature pages. Skip thin pages, login screens, faceted search results, and JS-only experiences. The goal is a curated map, not an exhaustive sitemap. If you have more than 30 candidate URLs, prioritize ruthlessly — overflow goes in llms-full.txt or stays out entirely.
Step 2: Create the file with H1 site name. Open a text editor (VS Code, Sublime, plain Notepad — anything that saves as UTF-8 plain text) and start with a single Markdown H1 holding your site or company name: # SiteTest.ai. This is the only H1 in the file. AI systems use it as the entity anchor for everything that follows.
Step 3: Add a one-line blockquote summary. Immediately below the H1, add a Markdown blockquote with one sentence describing what the site does: > AI-powered website audit tool — 168 SEO and AI-search checks for ChatGPT and Perplexity visibility. Write it the way you'd answer "what does your company do?" at a dinner party — informative, not marketing fluff.
Step 4: Group URLs under H2 sections. Create logical H2 sections: ## Docs, ## Blog, ## API, ## Examples, ## Pricing. The optional section ## Optional at the end is a special convention — it lists low-priority URLs AI systems can deprioritize when budget is tight. Use 2–6 sections for most sites.
Step 5: Write each link with a description. Each entry follows the exact pattern: - [Link text](https://full-url): One-sentence description of what's at that URL. The colon-and-description part is what separates llms.txt from a generic link list. Descriptions should be 60–120 characters, informative, not marketing copy. Use the full URL (including https://) — relative paths are ambiguous to AI consumers.
Step 6: Keep the file lean (under 50 KB). Most llms.txt files should be 2–10 KB total. Anything past 50 KB is too large — some AI consumers truncate or skip oversized files. If your candidate URL list exceeds what fits cleanly, move the overflow to llms-full.txt or omit it. Less is more — a tight 20-link file outperforms a sprawling 200-link one.
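Steps 2–6 can be collapsed into a small generator so the file stays in sync with a curated URL list kept in code rather than edited by hand. The function and parameter names here are illustrative, not part of any spec:

```python
def build_llms_txt(site: str, summary: str,
                   sections: dict[str, list[tuple[str, str, str]]]) -> str:
    """Assemble an llms.txt from curated entries.
    sections maps an H2 name (e.g. "Docs") to (title, url, description)
    tuples, emitted in the '- [text](url): description' pattern."""
    lines = [f"# {site}", "", f"> {summary}", ""]
    for heading, entries in sections.items():
        lines.append(f"## {heading}")
        for title, url, desc in entries:
            lines.append(f"- [{title}]({url}): {desc}")
        lines.append("")
    return "\n".join(lines)

doc = build_llms_txt(
    "SiteTest.ai",
    "AI-powered website audit tool.",
    {"Docs": [("Pricing", "https://sitetest.ai/pricing", "Plans and tiers.")]},
)
print(len(doc.encode()), "bytes")  # easy way to watch the 50 KB ceiling
```

Printing the byte count at build time gives you a cheap guardrail against the oversize-file problem from Step 6.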
Step 7: Publish at /llms.txt with text/plain content-type. Upload the file so it's accessible at https://yourdomain.com/llms.txt. Configure your server to serve it with Content-Type: text/plain — not text/html. On Nginx, that's a location = /llms.txt { default_type text/plain; } block. On Vercel, set headers in vercel.json. On Cloudflare Pages, add a _headers file. Verify with curl -I https://yourdomain.com/llms.txt.
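For the Vercel case mentioned above, the headers entry in vercel.json looks roughly like this — check Vercel's current headers documentation before shipping, as the schema is theirs, not part of the llms.txt spec:

```json
{
  "headers": [
    {
      "source": "/llms.txt",
      "headers": [
        { "key": "Content-Type", "value": "text/plain; charset=utf-8" }
      ]
    }
  ]
}
```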
Step 8: Validate and link from robots.txt. Run curl https://yourdomain.com/llms.txt and read the full output. Run it through llmstxt.org's validator. Optionally add a hint line in robots.txt: # llms.txt: https://yourdomain.com/llms.txt — this is purely informational (not a parsed directive) but signals to anyone reading robots.txt that you maintain an llms.txt too.
50+ Real-World llms.txt Examples
The fastest way to understand llms.txt in practice is to read what dev-tooling and AI companies actually ship. Below are ten examples across five categories — each link points to a live /llms.txt you can curl right now and study. We've kept the list curated rather than exhaustive: the format is so simple that 50 examples reveal the same patterns ten do.
Dev Tools
- Anthropic: Documentation-focused llms.txt covering API references, model cards, and prompt engineering guides. Notable for its tight Optional section.
- Cloudflare: Massive product surface (Workers, R2, D1, Pages, Stream) split into clear H2 sections — a textbook example of how to organize a multi-product company.
SaaS Platforms
- Linear: Minimal and product-marketing focused — homepage, pricing, customers, changelog. Fits in under 2 KB.
- Vercel: Documentation plus product pages, with a strong blockquote summary that reads like a one-line elevator pitch.
Documentation Sites
- Cursor: IDE documentation with deep technical content — uses ## Reference, ## Guides, and ## API sections.
- SvelteKit: Open-source framework docs broken into Tutorial, Reference, and Migration sections — clean editorial structure.
AI Products
- Perplexity: API docs for the AI search company — appropriate that the engine that respects llms.txt most also publishes a clean one.
- Anthropic Claude: Already covered above — worth re-reading specifically for how it handles model versioning across many doc URLs.
Open Source Frameworks
- Astro: Static-site framework docs — heavy on integrations, recipes, and tutorials, with strong descriptions on each link.
- NuxtLabs: Vue-based framework with multi-product surface (Nuxt, NuxtHub, Nuxt UI) — good model for organizing related products under one llms.txt.
A pattern worth noting: SEO and search-tool companies are conspicuously absent from this list. Ahrefs, Semrush, Moz, BrightEdge — none publish llms.txt as of May 2026. The field that should be most attuned to AI search is the slowest to adopt the AI-search file, partly because their crawlers compete with AI crawlers and partly because their internal SEO teams are skeptical of unofficial standards. Dev-tooling companies and AI infrastructure companies have moved first; marketing tools will follow when adoption becomes table stakes.
For a continually updated public registry of llms.txt examples, see our llms.txt examples directory (placeholder — we'll publish a community registry at github.com/seoport/llms-txt-examples in 2026 Q3). In the meantime, the ten above plus a quick curl against any dev-tooling company's domain will show you 80% of the patterns you need to ship your own.
Common llms.txt Mistakes
Six mistakes show up in roughly 70% of the broken llms.txt files we audit. Each one is a 5-minute fix, and each one alone can be the difference between a file AI systems use and a file they silently skip.
Mistake 1: Wrong file location. The file must be at exactly /llms.txt at your domain root — not /docs/llms.txt, not /.well-known/llms.txt, not /llms.html. AI consumers fetch the canonical path; anything else is invisible. If your CMS or static-site generator routes the file to a non-root path by default, override it explicitly.
Mistake 2: Wrong content-type served. The HTTP response must include Content-Type: text/plain. Many servers default to text/html for any file with a .txt extension if the MIME type isn't configured explicitly. Worse, some CMSes intercept the route and serve an HTML 404 page with a 200 status. Always verify with curl -I https://yourdomain.com/llms.txt and confirm both the status code and the content-type header.
Mistake 3: Empty or missing description (blockquote after H1). A surprising number of files skip the one-line blockquote summary right after the H1. Without it, AI systems have no high-level entity context — they're forced to infer your site's purpose from the link list, which is noisy. Always include the blockquote, always make it a complete sentence, always make it informative not promotional.
Mistake 4: Linking to JS-rendered pages AI can't parse. llms.txt points to URLs the AI is supposed to read. If those URLs serve a JS-only single-page-app shell (Vue, React without SSR, hydration-only Next.js), the AI fetches the URL, gets an empty <div>, and concludes there's nothing there. Either fix SSR on the linked pages, or link only to pages that render content in raw HTML.
Mistake 5: Including paywalled or auth-gated URLs. A link to a paywalled article or a logged-in dashboard wastes the AI's crawl budget and signals neglect. AI systems remember that the linked URL was unreachable and may discount your llms.txt as a whole. Curate hard — only list URLs an anonymous request can fully read.
Mistake 6: Forgetting to update after content changes. llms.txt is editorial, which means it goes stale. A file that lists a 2023 pricing page that 404s today, or a deprecated product page that redirects elsewhere, signals the file isn't maintained. Calendar a quarterly review aligned with your content refresh cadence — the same review that updates dateModified and refreshes hub pages should update llms.txt too.
Validating Your llms.txt
Validation has three layers — manual, online, and automated — and they cover slightly different surfaces. Run all three before you call your llms.txt shipped.
Manual check. The 30-second smoke test: curl -I https://yourdomain.com/llms.txt and confirm you see a 200 status and Content-Type: text/plain in the headers. Then curl https://yourdomain.com/llms.txt and read the full output. Your eyes should immediately catch missing H1s, broken Markdown, or accidental HTML wrapping. About 80% of broken files reveal themselves at this stage.
Online validators. The reference validator at llmstxt.org/validator (placeholder — the official validator URL may shift; check the spec repo for current canonical link) checks structural compliance: H1 presence, blockquote, valid H2 sections, Markdown link well-formedness, and link health (HEAD requests against each URL). It surfaces issues a curl read won't catch — like a typo in a URL that returns a 404 or a description string with embedded newlines.
The other tool worth running is sitetest.ai — our own audit bundles llms.txt validation into its 168-check suite, plus the broader AI citability assessment that tells you whether the URLs you list are actually citable in the first place (good schema, fast load, citable passages, etc.). A valid llms.txt linking to slow JS-rendered pages is a wasted opportunity; sitetest.ai catches both layers.
Common errors validators catch. Empty file (file exists but is zero bytes — happens with bad CMS uploads). Wrong encoding (UTF-16 or Windows-1252 instead of UTF-8 — text editors on Windows still get this wrong). Missing blockquote (skipped the one-line summary). Broken links (URL listed in llms.txt returns 404 or 5xx). Wrong content-type (server serving as text/html). HTML wrapping (CMS auto-wrapped the file in an HTML template). Each of these is a 1-minute fix once flagged — but each one silently neutralizes your file if you ship without checking.
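A DIY structural check along the lines of what those validators run — a heuristic sketch that mirrors the errors listed above (minus the network-dependent link-health checks), not a replacement for the official validator:

```python
import re

def validate_llms_txt(text: str) -> list[str]:
    """Minimal offline structural checks for an llms.txt document:
    one H1, a blockquote summary, at least one H2 section, and
    well-formed '- [text](url): description' bullet links."""
    if not text.strip():
        return ["empty file"]
    errors = []
    lines = text.splitlines()
    h1_count = sum(1 for l in lines if l.startswith("# "))
    if h1_count != 1:
        errors.append(f"expected exactly one H1, found {h1_count}")
    if not any(l.startswith("> ") for l in lines):
        errors.append("missing blockquote summary")
    if not any(l.startswith("## ") for l in lines):
        errors.append("no H2 sections")
    link = re.compile(r"^- \[[^\]]+\]\(https?://[^)\s]+\): \S")
    for l in lines:
        if l.startswith("- [") and not link.match(l):
            errors.append(f"malformed link line: {l[:50]}")
    return errors

good = "# Site\n\n> One-line summary.\n\n## Docs\n- [Home](https://example.com/): The homepage.\n"
print(validate_llms_txt(good))  # prints: []
```

Pair it with a HEAD request per linked URL (and a curl -I on the file itself) and you've covered every error class in the list above.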
Will llms.txt Become Standard?
The honest answer in May 2026: it's leaning toward yes but isn't there yet. The signals on both sides are real.
Adoption signals favoring standardization. Anthropic, Cloudflare, Vercel, Linear, Astro, NuxtLabs, Cursor, SvelteKit, and Perplexity all publish and respect llms.txt. The dev-tooling and AI-infrastructure clusters have effectively moved first — these are the same companies that drove early adoption of robots.txt and structured data in their respective eras. Cloudflare bundling llms.txt awareness into its AI Audit beta in early 2026 was a meaningful platform-level move; Cloudflare's footprint means any file format they support gets infrastructure-level distribution.
Standardization status. None formally — there's no W3C, IETF, or WHATWG draft as of May 2026. The spec lives as a GitHub README maintained by Jeremy Howard and contributors at llmstxt.org. That's not unusual: robots.txt itself was a de-facto standard for 25 years before becoming RFC 9309 in 2022. Useful conventions usually predate formal specs. The lack of a W3C track today is not evidence the standard will fail.
AI engine support is uneven. Perplexity respects llms.txt in its browse and research modes — it's the cleanest endorsement among the major AI search engines. Anthropic's Claude tooling parses it and uses it for its own product surfaces. ChatGPT's behavior is inconsistent: GPTBot probes /llms.txt occasionally in our crawl-log analysis, but OpenAI hasn't committed to it as a formal signal. Google ignores it in Search and AI Overviews — Google has its own structured data ecosystem (JSON-LD, the Knowledge Graph, sameAs) and shows no public interest in adopting another file format. Bing Copilot is in the middle — Microsoft hasn't ruled it out but hasn't endorsed it either.
12–24 month prediction. Two scenarios. The optimistic path: ChatGPT or Gemini publicly commits to respecting llms.txt within 12–18 months (likely under competitive pressure from Perplexity), at which point it becomes a de-facto standard for AI search the same way robots.txt is for classical search. The pessimistic path: the major engines never commit, llms.txt remains a developer convention adopted by Perplexity and the long tail of open-source LLM projects but never by the giants, and it fades into the background like /humans.txt did. Even in the pessimistic case, the cost of shipping today (5 minutes) is so low that the expected value of the bet is positive — early adopters lose almost nothing and gain real optionality.
Beyond llms.txt: Other AI Citability Signals
llms.txt is one signal among many. Even with a perfect file, AI engines still rank citations on the broader citability factors. Three families of signals matter most.
Schema markup. FAQPage, HowTo, Article (with author and publisher), Organization (with sameAs), and BreadcrumbList JSON-LD are the highest-leverage markup types for AI citation. SpeakableSpecification (cssSelector pointing at #tldr and #definition blocks) tells voice and audio AI which blocks are designed to be read aloud. AI engines parse JSON-LD as a high-trust signal because it's machine-readable and unambiguous — sites with proper schema get cited 2–3x more often than sites without.
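As a concrete illustration of the Organization-with-sameAs pattern, a minimal JSON-LD block looks like this — ExampleCo and its URLs are placeholders, and a real deployment would add logo, address, and other properties:

```json
{
  "@context": "https://schema.org",
  "@type": "Organization",
  "name": "ExampleCo",
  "url": "https://example.com",
  "sameAs": [
    "https://github.com/exampleco",
    "https://en.wikipedia.org/wiki/ExampleCo"
  ]
}
```

Embed it in a script tag with type="application/ld+json" in the page head; the sameAs array is what lets AI engines link your domain to the entity they already know from high-trust sources.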
EEAT signals. Experience, Expertise, Authoritativeness, and Trustworthiness — the four-letter framework Google formalized in late 2022 — translate directly to AI ranking. AI engines preferentially cite sources with named authors, visible credentials, inline citations to primary sources, original data, and brand recognition on AI-trusted domains (Wikipedia, Reddit, GitHub, Hacker News, major trade publications). Anonymous content with no author bio and no inline citations gets filtered out of citation candidate pools.
Structured headings and factual density. A clear H1 → H2 → H3 hierarchy lets retrieval pipelines chunk your page accurately. Pages with one giant H1 and walls of text without subheadings get chunked poorly and cited rarely. Inside each chunk, factual density matters — 4–6 named entities (people, dates, products, numbers, places) per 100 words score higher than vague prose. LLMs use named-entity counts as a quick proxy for "this passage is informative."
For the complete GEO playbook with all 14 tactics — robots.txt allowlists, llms.txt, schema, page speed, citable passages, brand authority — see our GEO guide. For the 18 ranking factors AI search engines weight when assembling answers, see AI Search Engine Optimization. For the older ground-floor framing — what counts as an AI SEO audit and how it differs from classical audits — see What Is an AI SEO Audit. llms.txt is the gateway file; those guides cover the rest of the surface.
Frequently Asked Questions
What is llms.txt?
Where do I put llms.txt on my website?
Does Google use llms.txt?
Does ChatGPT respect llms.txt?
Is llms.txt the same as robots.txt?
How do I create llms.txt?
What is llms-full.txt?
Should small sites have llms.txt?
Can I block AI crawlers with llms.txt?
Does llms.txt help SEO?
What's the difference between llms.txt and sitemap.xml?
How often should I update llms.txt?
Are there any llms.txt validators?
What's the future of llms.txt?
Conclusion + CTA
llms.txt is the cheapest experiment in AI search visibility you'll run this year. Five minutes of editing, a curated list of 10–30 URLs, a Content-Type: text/plain header, and you're shipped. The downside is zero — the file doesn't hurt SEO, doesn't slow your site, doesn't break anything. The upside is real today on Perplexity and Anthropic platforms, and increasingly likely on ChatGPT and Gemini over the next 12–18 months as adoption pressure builds.
The deeper point: llms.txt is one of three or four AI-search files that didn't exist in 2023 and will be table stakes by 2027. Sites that ship them early — alongside the schema, page-speed, and citable-passage work covered in our GEO guide — compound their AI visibility advantage one quarter at a time. Sites that wait for the standard to formalize will be six to twelve months behind when their competitors are already cited consistently across the major AI engines. Treat llms.txt as a free option on the AI-search future. Buy the option, hold it, and revisit the rest of your AI-visibility stack.
To audit your current llms.txt — or generate one from your site if you don't have it yet — run a free scan on sitetest.ai. The audit checks llms.txt presence, format, link health, and content-type, plus the broader 168 AI citability factors that determine whether the URLs you list will actually get cited. Sixty seconds, no signup, dev-friendly output.
Methodology
This guide draws on the original llms.txt proposal published by Jeremy Howard at Answer.AI in September 2024, the spec maintained at llmstxt.org, public Common Crawl scans of /llms.txt files across the open web, and internal audit data from sitetest.ai across the 168-check suite run on thousands of sites monthly. Adoption estimates are approximate — there's no central registry of llms.txt-publishing sites, so the 1,200+ figure is derived from Common Crawl plus community-maintained lists and should be treated as a directional indicator rather than a precise count. AI engine respect levels (Perplexity yes, Anthropic yes, ChatGPT inconsistent, Google no) reflect public statements and our own crawl-log analysis as of May 2026 and may shift as the standard matures. We refresh this guide quarterly — the next scheduled update is August 2026, and the dateModified reflects the last revision.
Related reading
AI Search Engine Optimization: Complete Guide to Ranking in 2026
Full guide to AI search engine optimization. Rank in ChatGPT, Perplexity, Gemini, AI Overviews. 18 ranking factors + free audit checklist.
25 min read · GEO

AI Visibility: How to Track If ChatGPT & Perplexity Mention Your Brand
Learn to measure & improve your AI visibility — track brand mentions in ChatGPT, Perplexity, AI Overviews. 8 tools compared + free check.
20 min read · GEO

What Is Generative Engine Optimization (GEO)? The 2026 Definitive Guide
Master Generative Engine Optimization (GEO) — the practice of ranking in ChatGPT, Perplexity & AI Overviews. 14 tactics + free audit.
22 min read