Robots.txt in 2026: a new paradigm

The robots.txt file was invented in 1994 to tell web crawlers which parts of your site not to visit. For 30 years, it primarily served to manage Googlebot. In 2026, it has become a strategic AI visibility tool.
According to an analysis by Originality.AI (2025), 35% of the top 1,000 websites worldwide block GPTBot. In Europe, the figure rises to 42% — largely due to GDPR concerns and worries about scraping. But here is the problem: these sites are voluntarily depriving themselves of citations in ChatGPT, Perplexity and other AI engines.
Martin Splitt, Developer Advocate at Google (Zurich): "The robots.txt is a gentleman's agreement, not a security wall. Well-behaved bots respect it. Bad ones ignore it. Your strategy should reward bots that play by the rules."
The key AI bots to know in 2026
| Bot | Operator | User-Agent | Respects robots.txt | Benefit if allowed |
|---|---|---|---|---|
| GPTBot | OpenAI | GPTBot | Yes | ChatGPT citations |
| ChatGPT-User | OpenAI | ChatGPT-User | Yes | Real-time browsing |
| ClaudeBot | Anthropic | ClaudeBot | Yes | Claude citations |
| PerplexityBot | Perplexity | PerplexityBot | Yes | Perplexity citations |
| Google-Extended | Google | Google-Extended | Yes | Gemini training data (no citations) |
| Bytespider | ByteDance | Bytespider | Partially | Minimal (low reach in Europe) |
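A quick way to see which of these bots actually visit you is to scan your server logs for their user-agent strings. Here is a minimal Python sketch, assuming Apache/nginx combined log format; the sample log lines, user-agent strings, and function name are illustrative:

```python
import re

# User-agent substrings from the table above (matched case-insensitively).
AI_BOTS = ["GPTBot", "ChatGPT-User", "ClaudeBot", "PerplexityBot",
           "Google-Extended", "Bytespider"]

def ai_bot_hits(log_lines):
    """Count hits per AI bot in combined-log-format lines."""
    counts = {bot: 0 for bot in AI_BOTS}
    for line in log_lines:
        # The user-agent is the last quoted field in combined log format.
        m = re.search(r'"([^"]*)"\s*$', line)
        if not m:
            continue
        ua = m.group(1).lower()
        for bot in AI_BOTS:
            if bot.lower() in ua:
                counts[bot] += 1
    return counts

sample = [
    '1.2.3.4 - - [10/Jan/2026:12:00:00 +0000] "GET /blog/ HTTP/1.1" 200 512 "-" "Mozilla/5.0; compatible; GPTBot/1.2; +https://openai.com/gptbot"',
    '5.6.7.8 - - [10/Jan/2026:12:01:00 +0000] "GET / HTTP/1.1" 200 1024 "-" "Mozilla/5.0 (compatible; PerplexityBot/1.0)"',
]
print(ai_bot_hits(sample))  # GPTBot and PerplexityBot each counted once
```

Running this over a week of logs tells you whether a bot you are about to block was bringing you any crawl activity at all.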
The AISOS configuration strategy
Our approach is simple: allow AI bots that cite their sources, block those that do not. Here is the standard configuration we recommend:
```txt
# Googlebot - always allowed
User-agent: Googlebot
Allow: /

# AI bots that cite sources - allowed
User-agent: GPTBot
Allow: /blog/
Allow: /glossaire/
Disallow: /admin/
Disallow: /api/

User-agent: ChatGPT-User
Allow: /

User-agent: ClaudeBot
Allow: /blog/
Allow: /glossaire/
Disallow: /admin/
Disallow: /api/

User-agent: PerplexityBot
Allow: /

# Training bots without citation - blocked
User-agent: Google-Extended
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: CCBot
Disallow: /

# Sitemap
Sitemap: https://your-site.com/sitemap.xml
```
The logic: GPTBot and ClaudeBot are explicitly allowed on your blog and glossary (the pages you want cited) and blocked on private sections. Google-Extended (Gemini training), Bytespider (ByteDance/TikTok) and CCBot (Common Crawl, whose corpus feeds many training datasets) are blocked because they collect content for training without citing sources. One caveat: paths not covered by any rule are allowed by default, so add a Disallow: / to a bot's group if you want a strict allow-list limited to the listed sections.
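You can sanity-check these rules before deploying them with Python's standard-library robots.txt parser. A minimal sketch using an abridged copy of the configuration above (your-site.com is a placeholder):

```python
from urllib.robotparser import RobotFileParser

# Abridged copy of the recommended configuration above.
RULES = """\
User-agent: Googlebot
Allow: /

User-agent: GPTBot
Allow: /blog/
Allow: /glossaire/
Disallow: /admin/
Disallow: /api/

User-agent: Google-Extended
Disallow: /
"""

rp = RobotFileParser()
rp.parse(RULES.splitlines())

# GPTBot may read the blog but not the admin area.
print(rp.can_fetch("GPTBot", "https://your-site.com/blog/post"))  # True
print(rp.can_fetch("GPTBot", "https://your-site.com/admin/"))     # False
# Unlisted paths default to allowed.
print(rp.can_fetch("GPTBot", "https://your-site.com/pricing"))    # True
# Google-Extended is shut out entirely.
print(rp.can_fetch("Google-Extended", "https://your-site.com/blog/post"))  # False
```

The third check makes the default-allow behaviour visible: if that surprises you, your group needs an explicit Disallow: /.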
The risks of blocking AI bots
Before blocking all AI bots "as a precaution", measure what you are losing:
- Lost citations — if GPTBot cannot read your site, ChatGPT will not cite you. Full stop.
- Perplexity visibility — Perplexity crawls in real time; a block means instant invisibility.
- Snowball effect — fewer AI citations mean fewer brand mentions, less authority, and ultimately weaker classic SEO.
For the complementary llms.txt file (which points AI bots to your key content), see our llms.txt guide. For broader technical context, see our 2026 technical SEO guide. And to optimise what AI bots crawl, read our article on crawl budget.
GDPR and robots.txt: the false debate
Many European companies block AI bots "because of GDPR". This is a dangerous shortcut. The GDPR protects personal data, not the public content of your website. If your blog is public, it is accessible to everyone anyway — including AI bots.
The real question is not legal but strategic: do you want to be cited by AI, or not? If yes, allow. If not, block. But do not confuse data protection with fear of change.
FAQ — Robots.txt and AI
Does robots.txt actually prevent scraping?
No. Robots.txt is a voluntary protocol — well-behaved bots respect it, malicious scrapers ignore it. To actually prevent scraping, you need technical measures (rate limiting, Cloudflare, etc.).
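To make that concrete, here is a minimal sketch of server-side enforcement as a WSGI middleware that returns 403 to blocklisted user-agents. The blocklist and function names are illustrative, not a specific product's API; in production this usually lives in nginx, a WAF, or Cloudflare, and since user-agents can be spoofed, pair it with rate limiting:

```python
# Illustrative substring blocklist for bots that ignore robots.txt.
BLOCKED_UA_SUBSTRINGS = ("bytespider", "ccbot")

def ua_firewall(app):
    """Wrap a WSGI app; reject requests from blocklisted user-agents."""
    def middleware(environ, start_response):
        ua = environ.get("HTTP_USER_AGENT", "").lower()
        if any(bad in ua for bad in BLOCKED_UA_SUBSTRINGS):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden"]
        return app(environ, start_response)
    return middleware
```

Unlike robots.txt, this actually refuses the request, whatever the bot's intentions.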
Does blocking GPTBot stop ChatGPT from mentioning my company?
Not completely. ChatGPT has already absorbed content in its training data. But blocking GPTBot prevents updates and real-time browsing, which progressively reduces your citations in new conversations.
Can you allow GPTBot on specific pages only?
Yes, that is exactly the recommended approach. Use specific Allow/Disallow directives for each AI bot, only allowing the sections you want to be cited.
Do you need a different robots.txt for each AI bot?
No — a single robots.txt file covers them all, but each bot should get its own User-agent group with specific directives. GPTBot and ClaudeBot can share the same rules, while Google-Extended and Bytespider warrant different treatment.
How often should you update your robots.txt?
With every change in AI strategy or site structure. In practice, a quarterly review is sufficient, unless new AI bots appear (which happens regularly in 2026).
Does robots.txt affect Google ranking?
Not directly. Robots.txt affects crawling, not ranking. Be careful, though: if you accidentally block important pages, Google cannot read their content; they may still be indexed from external links, but as empty entries that rank poorly, if at all.
What is the difference between Disallow and noindex?
Disallow (robots.txt) prevents compliant bots from crawling a page at all. Noindex (meta tag) allows crawling but asks search engines not to index the page. For AI bots, Disallow is the relevant lever: a bot that honours it never reads the content, whereas noindex only takes effect if the page is crawled — and it is a search-engine convention that AI crawlers may not honour.
Is your robots.txt configured for the AI era?
We audit your robots.txt and define the optimal strategy to maximise your AI citations while protecting your sensitive content.
Configure my robots.txt

