Robots.txt in 2026: a new paradigm

The robots.txt file was invented in 1994 to tell web crawlers which parts of your site not to visit. For 30 years, it primarily served to manage Googlebot. In 2026, it has become a strategic AI visibility tool.
According to an analysis by Originality.AI (2025), 35% of the top 1,000 websites worldwide block GPTBot. In Europe, the figure rises to 42% — largely due to GDPR concerns and worries about scraping. But here is the problem: these sites are voluntarily depriving themselves of citations in ChatGPT, Perplexity and other AI engines.
Martin Splitt, Developer Advocate at Google (Zurich): "The robots.txt is a gentleman's agreement, not a security wall. Well-behaved bots respect it. Bad ones ignore it. Your strategy should reward bots that play by the rules."
The key AI bots to know in 2026
| Bot | Operator | User-Agent | Respects robots.txt | Benefit if allowed |
|---|---|---|---|---|
| GPTBot | OpenAI | GPTBot | Yes | ChatGPT citations |
| ChatGPT-User | OpenAI | ChatGPT-User | Yes | Real-time browsing |
| ClaudeBot | Anthropic | ClaudeBot | Yes | Claude citations |
| PerplexityBot | Perplexity | PerplexityBot | Yes | Perplexity citations |
| Google-Extended | Google | Google-Extended | Yes | Gemini training data (no citations) |
| Bytespider | ByteDance | Bytespider | Partially | Minimal (low reach in Europe) |
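A quick way to see which of these bots actually visit you is to scan your server logs for their user-agent strings. Here is a minimal Python sketch, assuming Apache/nginx combined log format; the sample log lines, user-agent strings, and function name are illustrative:

```python
import re

# User-agent substrings from the table above (matched case-insensitively).
AI_BOTS = ["GPTBot", "ChatGPT-User", "ClaudeBot", "PerplexityBot",
           "Google-Extended", "Bytespider"]

def ai_bot_hits(log_lines):
    """Count hits per AI bot in combined-log-format lines."""
    counts = {bot: 0 for bot in AI_BOTS}
    for line in log_lines:
        # The user-agent is the last quoted field in combined log format.
        m = re.search(r'"([^"]*)"\s*$', line)
        if not m:
            continue
        ua = m.group(1).lower()
        for bot in AI_BOTS:
            if bot.lower() in ua:
                counts[bot] += 1
    return counts

sample = [
    '1.2.3.4 - - [10/Jan/2026:12:00:00 +0000] "GET /blog/ HTTP/1.1" 200 512 "-" "Mozilla/5.0; compatible; GPTBot/1.2; +https://openai.com/gptbot"',
    '5.6.7.8 - - [10/Jan/2026:12:01:00 +0000] "GET / HTTP/1.1" 200 1024 "-" "Mozilla/5.0 (compatible; PerplexityBot/1.0)"',
]
print(ai_bot_hits(sample))  # GPTBot and PerplexityBot each counted once
```

Running this over a week of logs tells you whether a bot you are about to block was bringing you any crawl activity at all.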
The AISOS configuration strategy
Our approach is simple: allow AI bots that cite their sources, block those that do not. Here is the standard configuration we recommend:
```txt
# Googlebot - always allowed
User-agent: Googlebot
Allow: /

# AI bots that cite sources - allowed
User-agent: GPTBot
Allow: /blog/
Allow: /glossaire/
Disallow: /admin/
Disallow: /api/

User-agent: ChatGPT-User
Allow: /

User-agent: ClaudeBot
Allow: /blog/
Allow: /glossaire/
Disallow: /admin/
Disallow: /api/

User-agent: PerplexityBot
Allow: /

# Training bots without citation - blocked
User-agent: Google-Extended
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: CCBot
Disallow: /

# Sitemap
Sitemap: https://your-site.com/sitemap.xml
```
The logic: GPTBot and ClaudeBot are explicitly allowed on your blog and glossary (the pages you want cited) and blocked on private sections. Google-Extended (Gemini training), Bytespider (ByteDance/TikTok) and CCBot (Common Crawl, whose corpus feeds many training datasets) are blocked because they collect content for training without citing sources. One caveat: paths not covered by any rule are allowed by default, so add a Disallow: / to a bot's group if you want a strict allow-list limited to the listed sections.
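You can sanity-check these rules before deploying them with Python's standard-library robots.txt parser. A minimal sketch using an abridged copy of the configuration above (your-site.com is a placeholder):

```python
from urllib.robotparser import RobotFileParser

# Abridged copy of the recommended configuration above.
RULES = """\
User-agent: Googlebot
Allow: /

User-agent: GPTBot
Allow: /blog/
Allow: /glossaire/
Disallow: /admin/
Disallow: /api/

User-agent: Google-Extended
Disallow: /
"""

rp = RobotFileParser()
rp.parse(RULES.splitlines())

# GPTBot may read the blog but not the admin area.
print(rp.can_fetch("GPTBot", "https://your-site.com/blog/post"))  # True
print(rp.can_fetch("GPTBot", "https://your-site.com/admin/"))     # False
# Unlisted paths default to allowed.
print(rp.can_fetch("GPTBot", "https://your-site.com/pricing"))    # True
# Google-Extended is shut out entirely.
print(rp.can_fetch("Google-Extended", "https://your-site.com/blog/post"))  # False
```

The third check makes the default-allow behaviour visible: if that surprises you, your group needs an explicit Disallow: /.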
The risks of blocking AI bots
Before blocking all AI bots "as a precaution", measure what you are losing:
- Lost citations — if GPTBot cannot read your site, ChatGPT will not cite you. Full stop.
- Perplexity visibility — Perplexity crawls in real time; a block means instant invisibility.
- Snowball effect — fewer AI citations mean fewer brand mentions, less authority, and ultimately weaker classic SEO.
For the complementary llms.txt file (which points AI bots to your key content), see our llms.txt guide. For broader technical context, see our 2026 technical SEO guide. And to optimise what AI bots crawl, read our article on crawl budget.
GDPR and robots.txt: the false debate
Many European companies block AI bots "because of GDPR". This is a dangerous shortcut. The GDPR protects personal data, not the public content of your website. If your blog is public, it is accessible to everyone anyway — including AI bots.
The real question is not legal but strategic: do you want to be cited by AI, or not? If yes, allow. If not, block. But do not confuse data protection with fear of change.
FAQ — Robots.txt and AI
Does robots.txt actually prevent scraping?
No. Robots.txt is a voluntary protocol — well-behaved bots respect it, malicious scrapers ignore it. To actually prevent scraping, you need technical measures (rate limiting, Cloudflare, etc.).
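To make that concrete, here is a minimal sketch of server-side enforcement as a WSGI middleware that returns 403 to blocklisted user-agents. The blocklist and function names are illustrative, not a specific product's API; in production this usually lives in nginx, a WAF, or Cloudflare, and since user-agents can be spoofed, pair it with rate limiting:

```python
# Illustrative substring blocklist for bots that ignore robots.txt.
BLOCKED_UA_SUBSTRINGS = ("bytespider", "ccbot")

def ua_firewall(app):
    """Wrap a WSGI app; reject requests from blocklisted user-agents."""
    def middleware(environ, start_response):
        ua = environ.get("HTTP_USER_AGENT", "").lower()
        if any(bad in ua for bad in BLOCKED_UA_SUBSTRINGS):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden"]
        return app(environ, start_response)
    return middleware
```

Unlike robots.txt, this actually refuses the request, whatever the bot's intentions.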
Does blocking GPTBot stop ChatGPT from mentioning my company?
Not completely. ChatGPT has already absorbed content in its training data. But blocking GPTBot prevents updates and real-time browsing, which progressively reduces your citations in new conversations.
Can you allow GPTBot on specific pages only?
Yes, that is exactly the recommended approach. Use specific Allow/Disallow directives for each AI bot, only allowing the sections you want to be cited.
Do you need a different robots.txt for each AI bot?
No — a single robots.txt file covers them all, but each bot should get its own User-agent group with specific directives. GPTBot and ClaudeBot can share the same rules, while Google-Extended and Bytespider warrant different treatment.
How often should you update your robots.txt?
With every change in AI strategy or site structure. In practice, a quarterly review is sufficient, unless new AI bots appear (which happens regularly in 2026).
Does robots.txt affect Google ranking?
Not directly. Robots.txt affects crawling, not ranking. Be careful, though: if you accidentally block important pages, Google cannot read their content; they may still be indexed from external links, but as empty entries that rank poorly, if at all.
What is the difference between Disallow and noindex?
Disallow (robots.txt) prevents compliant bots from crawling a page at all. Noindex (meta tag) allows crawling but asks search engines not to index the page. For AI bots, Disallow is the relevant lever: a bot that honours it never reads the content, whereas noindex only takes effect if the page is crawled — and it is a search-engine convention that AI crawlers may not honour.
Is your robots.txt configured for the AI era?
We audit your robots.txt and define the optimal strategy to maximise your AI citations while protecting your sensitive content.
Configure my robots.txt

