What Is llms.txt and Why Should You Care?

llms.txt is a new robots.txt-style file designed specifically to tell AI crawlers, such as those operated by OpenAI, Anthropic, and Perplexity, how you want your content used. Unlike robots.txt, which controls crawler access for search engines like Google and Bing, llms.txt is meant to govern how large language models access and train on your data.

The file is placed in your root directory (yoursite.com/llms.txt) and contains directives telling AI systems whether they can scrape your content, whether they can use it for training, and whether they must attribute your work when they reference it. Think of it as a permission slip for the AI era.
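To make that concrete, here is one illustrative llms.txt using the directive style described in this article. Note that the format is still emerging and not standardized, so the exact directive names below are an assumption, not a spec:

```
# /llms.txt - illustrative only; the format is not yet standardized
User-agent: *
Disallow: /premium/
Training: no
Attribution: required
```

Treat this as a statement of intent rather than an enforceable policy; as covered below, most AI companies don't read it yet.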

But here’s what most marketers get wrong: llms.txt is not legally binding, not universally respected, and not a replacement for robots.txt. As of early 2025, most AI companies are ignoring it entirely. OpenAI doesn’t respect it. Neither does Google’s Gemini crawler. Perplexity respects some llms.txt directives, but compliance isn’t guaranteed.

Key Takeaway: llms.txt exists, it’s gaining visibility, but it’s not the game-changer some vendors are marketing. It’s an optional, unenforceable file that ethical AI companies might respect—nothing more.

How Does robots.txt Actually Control AI Crawlers?

robots.txt has been the standard for crawler control since 1994, and it remains the primary tool for blocking AI access—even though it wasn’t designed for that purpose. Most AI crawlers still respect robots.txt directives, making it your only real leverage today.

Here’s what you can do with robots.txt:

  • Block specific user agents (e.g., User-agent: GPTBot or User-agent: anthropic-ai)
  • Disallow entire directories or specific file types
  • Set crawl-delay rates to reduce server load
  • Use Disallow: / to block all crawlers
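The directives above can be combined in a single robots.txt and sanity-checked before deployment. Here's a minimal sketch using Python's standard-library robots.txt parser; the `example.com` URLs and the specific rules are illustrative assumptions:

```python
from urllib.robotparser import RobotFileParser

# A sample robots.txt combining the directives above: block GPTBot
# everywhere, throttle Bingbot, and keep /private/ off-limits to everyone.
robots_txt = """\
User-agent: GPTBot
Disallow: /

User-agent: Bingbot
Crawl-delay: 10

User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# GPTBot is blocked site-wide; Googlebot falls through to the * group.
print(parser.can_fetch("GPTBot", "https://example.com/blog/post"))    # blocked
print(parser.can_fetch("Googlebot", "https://example.com/blog/post")) # allowed
print(parser.can_fetch("Googlebot", "https://example.com/private/x")) # blocked
print(parser.crawl_delay("Bingbot"))                                  # the delay set above
```

Running a check like this catches typos in user-agent groups before a bad rule accidentally blocks Googlebot.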

Real AI Crawler User Agents You Can Block

The major players all have distinct user-agent strings. If you want to block OpenAI’s GPT crawler, add this to your robots.txt:

User-agent: GPTBot
Disallow: /

For Anthropic's Claude (the crawler token is ClaudeBot):

User-agent: ClaudeBot
Disallow: /

For Google’s Gemini crawler:

User-agent: Google-Extended
Disallow: /

The problem? These companies have minimal financial incentive to respect robots.txt when it comes to training data. OpenAI has stated that it doesn't consider itself bound by robots.txt for data collected before its updated policy in 2024, and Perplexity was caught openly ignoring robots.txt directives in 2024.

Key Takeaway: robots.txt works, but only if the AI company chooses to respect it. For the major players, enforcement is spotty at best.

llms.txt vs robots.txt: The Direct Comparison

| Aspect | robots.txt | llms.txt |
| --- | --- | --- |
| Age | 30+ years, industry standard | 2024-2025, emerging |
| Legally binding | No, but widely respected by search engines | No, unenforceable by law |
| AI company compliance | Inconsistent; GPTBot ignores it for training data | Minimal; mostly ignored |
| What it controls | Crawler access and indexing | Training data usage and attribution |
| Enforcement mechanism | Social pressure, brand reputation, policy updates | None; relies on company ethics |
| File location | /robots.txt | /llms.txt |
| Current adoption | ~99% of sites; search engines expect it | <5% adoption; AI companies don't enforce it |
| Search engine use | Google, Bing, and others read it daily | Ignored by Google Search, Bing, and most major AI firms |

Key Takeaway: robots.txt is the established standard with real-world impact on search visibility. llms.txt is a well-intentioned but toothless experiment that almost nobody enforces.

Should You Even Create an llms.txt File?

No, not yet—unless you’re specifically concerned about a handful of smaller AI startups.

Here’s why:

  1. Zero leverage: If OpenAI, Google, and Anthropic aren’t respecting llms.txt, blocking them there doesn’t prevent them from using your content. It’s security theater.

  2. Search visibility risk: Putting a strict llms.txt file in place doesn’t affect Google Search rankings directly, but creating uncertainty about your content permissions could confuse honest crawlers in the future.

  3. Time investment: You have 1,000 better growth levers than writing a file that 99.9% of AI companies ignore.

When You Should Create an llms.txt File

  • You’re blocking Perplexity specifically. Perplexity is one of the few AI companies actively respecting llms.txt. If you don’t want your content in Perplexity’s search results and training set, an llms.txt file with User-agent: perplexitybot directives can work.

  • You’re in a competitive niche. If you sell B2B software, AI training data, or premium research, blocking Perplexity prevents them from indexing your proprietary content as public knowledge.

  • You’re building trust with your audience. Some brands use llms.txt as a signal that they’re thoughtful about AI scraping—even if the file itself doesn’t do much yet.

  • You expect regulation. If EU AI Act enforcement expands, having a clear llms.txt file in place now positions you better legally than scrambling to add one later.

Key Takeaway: Focus on robots.txt if you want to actually block crawlers today. Create llms.txt only if you’re explicitly blocking Perplexity or preparing for future regulatory compliance.

The Real Impact on Your Traffic and SEO

Adding llms.txt doesn’t hurt your search rankings, but it also doesn’t help them. Google doesn’t read llms.txt. Neither does Bing. Your SEO is entirely unaffected.

What does affect your traffic:

The Perplexity Effect

Perplexity's AI-powered search engine handled roughly 500 million queries in 2024, a scale that matters. When Perplexity indexes your content, it can drive traffic (through citations) or steal it (through direct answers that don't link back).

Research from Semrush found that 15-25% of the traffic loss marketers attributed to AI-generated answers came from Perplexity, not ChatGPT. Perplexity cites its sources, but users often don't click through.

If Perplexity is a meaningful traffic drain for your business, blocking them via llms.txt makes sense:

User-agent: perplexitybot
Disallow: /

Or use robots.txt instead (more effective):

User-agent: perplexitybot
Disallow: /

Key Takeaway: Block Perplexity if it’s eating your traffic. Ignore llms.txt for GPT, Claude, and Gemini—the file won’t protect you anyway.

How to Actually Block AI Crawlers From Your Site

If you want real control, use robots.txt, test it with Google Search Console and Bing Webmaster Tools, and monitor your crawler activity logs.

Step 1: Identify Your Threat

Check your server logs or edge analytics (Cloudflare, AWS) to see which AI crawlers are actually hitting your site; Google Analytics 4 won't catch most of this traffic, since crawlers rarely execute its JavaScript. You might find:

  • GPTBot (OpenAI)
  • anthropic-ai (Anthropic)
  • perplexitybot (Perplexity)
  • Google-Extended (Google’s Gemini)
  • Scrapy, curl, or custom bots (malicious or unauthorized scrapers)
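A quick way to run this audit is to count user-agent matches in your raw access logs. The sketch below uses simple substring matching against the crawler tokens listed above; the two log lines are made-up samples in a common access-log shape, and in practice you'd read your real log file instead:

```python
from collections import Counter

# User-agent tokens for the crawlers listed above.
AI_CRAWLERS = ["GPTBot", "anthropic-ai", "PerplexityBot", "Google-Extended", "Scrapy"]

def count_ai_crawler_hits(log_lines):
    """Count hits per AI crawler by case-insensitive substring
    matching against raw access-log lines."""
    hits = Counter()
    for line in log_lines:
        for bot in AI_CRAWLERS:
            if bot.lower() in line.lower():
                hits[bot] += 1
    return hits

# Two fabricated sample lines for illustration:
sample_log = [
    '203.0.113.7 - - [10/May/2025:12:00:01 +0000] "GET /blog/post HTTP/1.1" 200 5123 "-" "Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)"',
    '198.51.100.4 - - [10/May/2025:12:00:09 +0000] "GET /pricing HTTP/1.1" 200 2210 "-" "Mozilla/5.0 (compatible; PerplexityBot/1.0; +https://perplexity.ai/perplexitybot)"',
]

print(count_ai_crawler_hits(sample_log))  # one hit each for GPTBot and PerplexityBot
```

Run this weekly and the crawlers worth blocking (or ignoring) become obvious from the counts.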

Step 2: Block in robots.txt

Add directives for specific crawlers:

User-agent: GPTBot
Disallow: /

User-agent: perplexitybot
Disallow: /

User-agent: anthropic-ai
Disallow: /

Or block all bots except Google and Bing:

User-agent: *
Disallow: /

User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /
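Before deploying an allowlist like the one above, it's worth verifying it behaves as intended, since one wrong rule can block Googlebot. A minimal check with Python's standard-library parser (the `example.com` URL is a placeholder):

```python
from urllib.robotparser import RobotFileParser

# The allowlist policy from above: block everyone by default,
# then re-allow Googlebot and Bingbot explicitly.
policy = """\
User-agent: *
Disallow: /

User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /
"""

parser = RobotFileParser()
parser.parse(policy.splitlines())

# Search engine bots match their own groups; everything else
# falls through to the blanket "User-agent: *" block.
for agent in ["Googlebot", "Bingbot", "GPTBot", "PerplexityBot"]:
    verdict = "allowed" if parser.can_fetch(agent, "https://example.com/") else "blocked"
    print(f"{agent}: {verdict}")
```

If Googlebot or Bingbot ever shows up as blocked here, fix the file before it goes live.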

Step 3: Validate in Search Console

Upload your updated robots.txt and test specific URLs in Google Search Console. Use the URL Inspection tool to confirm Googlebot can still crawl your site.

Step 4: Monitor Crawler Activity

Set up alerts in your server logs or use Cloudflare Analytics to track:

  • How many GPTBot requests you get weekly
  • Whether blocked crawlers respect the directive (most do, eventually)
  • Any spikes in unauthorized scrapers

Key Takeaway: robots.txt + monitoring is your actual control mechanism. llms.txt is optional window dressing.

FAQ: What Marketers Really Want to Know

1. Will blocking AI crawlers hurt my Google rankings?

No. Google uses Googlebot (which respects your robots.txt Allow directives) and doesn’t care if you block GPTBot or perplexitybot. You can block every AI company simultaneously and your organic search traffic stays exactly the same. Test it in Search Console first, but you’ll be fine.

2. Can I legally force OpenAI to stop training on my content?

Not with llms.txt, and robots.txt isn't legally binding either. OpenAI has already trained on much of the public internet and claims fair use. Your main legal levers are the CFAA (Computer Fraud and Abuse Act), which prohibits unauthorized access, and the DMCA (Digital Millennium Copyright Act), which can be used against scrapers that bypass authentication.

Send OpenAI a cease-and-desist letter if you believe your copyrighted content was used without permission. They’ll likely ignore it, but it creates a paper trail for litigation.

3. Should I block Perplexity?

Only if traffic loss is measurable. If Perplexity is a <2% traffic source and cites you properly, leaving them unblocked gains you more SEO authority links than you lose to their abstracts. If they’re 15%+ of your traffic loss, block them.

4. What’s the difference between blocking a crawler and delisting from an index?

Blocking (via robots.txt) prevents future crawling. Delisting removes you from a search index immediately. You can delist from Perplexity's index directly via their removal form. For OpenAI, a DMCA takedown is the main formal option, though it won't remove your content from models that have already been trained on it.

The Bottom Line

llms.txt is not the tool that controls AI crawler access. It’s a well-intentioned, mostly-ignored experiment that works only if the AI company voluntarily respects it—and almost none of them do yet.

robots.txt is your actual control mechanism. It’s been working for 30 years, it’s respected by Google and Bing, and it works on Perplexity. Use it to block crawlers that matter to your business.

Here’s what to do today:

  1. Audit your server logs to see which AI crawlers are actually hitting your site
  2. Measure the traffic impact of each crawler (especially Perplexity)
  3. Block only the ones that matter in robots.txt (not llms.txt)
  4. Test in Search Console to confirm Google can still crawl
  5. Monitor weekly to ensure blocks are working

You don’t need llms.txt unless you’re explicitly preparing for EU AI Act compliance or want to block Perplexity’s crawler specifically. Save your engineering time for growth levers that actually move the needle—and use the proven tools that actually work.

The AI scraping problem is real. The solution is robots.txt, not new files that nobody enforces yet.