Key Takeaways
AI bots are simply visitors to your website, and robots.txt acts as your virtual reception desk to manage them.
There are three main types of bots (student, librarian, concierge) with distinct jobs, run by major AI companies.
For most experts, allowing all three bot types while requiring clear, verifiable attribution and reserving your rights is the best strategy for visibility and being cited.
The idea that blocking training bots protects your content is a myth; blocking generally doesn't prevent citation, and it erodes your long-term mindshare with future AI models.
True protection involves keeping private data behind authentication, legally reserving your rights (e.g., via EU directives), and requiring attribution for your content.
Start with a picture you already know
Imagine your practice or firm has a reception desk.
Different kinds of people walk through the door:
- Someone dropping off a package
- A prospective client coming in for a consultation
- A researcher asking to read your published papers
- A competitor's assistant pretending to be someone else
You don't treat them all the same. You have rules. Some get buzzed in. Some get asked to wait. Some get politely turned away.
AI bots are just visitors to your website. The internet version of that reception desk is a tiny file called robots.txt. It sits quietly on your website and tells each visitor what they're allowed to do.
That's it. That's the whole concept.
Why there are so many bots now
Here's what tripped everyone up in the last two years.
Until recently, there was basically one type of visitor worth thinking about: Google. Google sent a bot to read your site, put you in search results, and sent you traffic. Simple.
Then AI arrived. And the AI companies didn't send one bot. They sent three.
Each of the major AI companies - OpenAI (ChatGPT), Anthropic (Claude), Google (Gemini) - now runs three different bots that do three different jobs.
Bot type 1: The student
This bot reads your website to help train future AI models. Think of it as a student copying notes from a library. It doesn't send anyone to you. It just learns from what you've written, and that learning gets baked into the next version of the AI.
- OpenAI calls theirs GPTBot
- Anthropic calls theirs ClaudeBot
- Google calls theirs Google-Extended
Bot type 2: The librarian
This bot reads your website to build a search index - a catalogue the AI can check when someone asks a question. When ChatGPT says "According to [your website]..." with a link, that's because the librarian bot indexed you beforehand.
- OpenAI calls theirs OAI-SearchBot
- Anthropic calls theirs Claude-SearchBot
- Google uses Googlebot (the same one you already know)
Bot type 3: The concierge
This bot only shows up when a specific human asks about you. A prospective patient types "tell me about Dr. Chen's approach to knee replacement" into ChatGPT, and the concierge bot runs over to your website that very second to read it.
- OpenAI calls theirs ChatGPT-User
- Anthropic calls theirs Claude-User
These are the most valuable visits on the internet. A real human, asking a real question, in real time.
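These names aren't just labels - they are the literal User-agent tokens the bots identify themselves with, which is what robots.txt keys on. Using OpenAI's three as an example, a robots.txt file can address each job separately (a sketch; check the company's current documentation for the up-to-date token list):

```text
# The student - training crawler
User-agent: GPTBot
Allow: /

# The librarian - search-index crawler
User-agent: OAI-SearchBot
Allow: /

# The concierge - fetches a page live when a user asks about you
User-agent: ChatGPT-User
Allow: /
```

Each block is read independently, which is exactly why the Allow/Block decision below is made per bot type, not per company.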
The single decision you actually have to make
For each of those three bot types, you're choosing between Allow and Block.
| Bot type | If you Allow | If you Block |
|---|---|---|
| The student (training) | Your expertise becomes part of future AI models | Your work won't shape future models |
| The librarian (search) | You get cited in AI answers, with links back to your site | You don't appear in AI search results |
| The concierge (user) | AI can fetch your page when someone asks about you | AI can't see you when a specific person is researching you |
For 95% of experts - the ones whose business depends on being found and recommended - the answer for all three is Allow.
The 5% who block training bots are usually large publishers (New York Times, etc.) whose content itself is the product people pay for. If your content is thought leadership designed to build your authority, you are not in that group.
The myth that's costing experts visibility
Here's what you might hear in a marketing meeting, and why it's wrong:
"We should block the training bots so AI doesn't steal our content."
This sounds cautious. It sounds protective. But it's based on a misunderstanding.
Blocking the student bot does not affect the librarian bot. The AI companies have said this in their official documentation - OpenAI, Anthropic, and Google all explicitly confirm it. And when researchers analysed four million real AI citations in early 2026, they found that over 88% of websites blocking training bots were still being cited in AI answers anyway.
Translation: blocking training bots doesn't really protect you from anything, and it definitely doesn't help you.
What it does do is slow your long-term mindshare. If you're absent from every future AI model's training data, then five years from now, when someone asks an AI "who are the leading experts in cardiothoracic surgery in Singapore," the AI's default knowledge won't include you. It'll include whoever didn't block.
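You can verify this directive independence yourself with Python's standard-library robots.txt parser. In this sketch (the ruleset is made up for illustration), the training bot is blocked while the search bot stays allowed - blocking one has no effect on the other:

```python
from urllib import robotparser

# Hypothetical ruleset: training bot blocked, search bot explicitly allowed
RULES = """\
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: *
Allow: /
"""

def bot_can_read(rules: str, user_agent: str) -> bool:
    """Return True if the named bot may fetch the site root under these rules."""
    parser = robotparser.RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(user_agent, "https://example.com/")

print(bot_can_read(RULES, "GPTBot"))         # False - training blocked
print(bot_can_read(RULES, "OAI-SearchBot"))  # True - still indexable and citable
```

The same check, pointed at your own domain's live robots.txt, is a quick way to audit what you're actually telling each bot.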
Where your actual protection lives
This is the part that gets lost. Real protection isn't in robots.txt. It's in three other places:
- Keep private things actually private. Client records. Patient data. Gated content. Member areas. Checkout flows. These belong behind a login, not behind a bot instruction. Bots can be ignored; authentication can't.
- Reserve your rights legally. This is where the European Union has recently done something genuinely useful for experts. Under Article 4(3) of the EU Copyright Directive (2019/790) and Article 53(1)(c) of the EU AI Act, you as a rights holder can formally reserve your work from being used to train AI - and AI companies placing models on the EU market are legally required to respect that reservation. This applies whether you're in Europe or not, as long as your content reaches European users.
In plain English: you can say "yes, AI can summarise me and cite me, but no, you cannot use my work to train your models" - and that's now a legal statement, not just a preference.
- Require attribution. This is the one that actually matters for your business. You don't care if an AI summarises you; you care whether it credits you and sends the client to your door. Setting "attribution required" in your content rights is the signal that turns "AI used my knowledge" into "AI recommended me."
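The legal reservation in the second point can also be expressed in machine-readable form. One emerging convention is the W3C community-drafted TDM Reservation Protocol (TDMRep), which uses meta tags in a page's head; a hedged sketch (the policy URL is hypothetical, and the protocol is still a draft convention rather than a universal standard):

```html
<head>
  <!-- TDMRep: reserve text-and-data-mining rights (Article 4(3), EU 2019/790) -->
  <meta name="tdm-reservation" content="1">
  <!-- Hypothetical URL to your human- and machine-readable usage policy -->
  <meta name="tdm-policy" content="https://yourdomain.com/tdm-policy.json">
</head>
```

A visible, written rights statement on your site (see the action list below's wording) remains the primary signal; the meta tags simply make the same reservation legible to compliant crawlers.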
The whole thing in one sentence
AI bots are just website visitors. Most of them, if you let them in, will help prospective clients find you - and modern rights frameworks now let you say "come in, cite me, send people my way - but don't train on me and don't use me commercially without asking."
That's the posture. That's the playbook.
The experts who will win in AI search over the next five years aren't the ones hiding their expertise from the bots. They're the ones letting the bots in, under clear conditions, with their name attached to the answer.
The five-minute action list
If you only do five things this week:
- Check if your website has a robots.txt file. Type yourdomain.com/robots.txt into a browser. If you see a page, you have one. If you get an error, you don't - and your developer or web team needs to know.
- Make sure search bots are allowed. At minimum: Googlebot, OAI-SearchBot, Claude-SearchBot, PerplexityBot. These are the ones that get you cited.
- Decide your position on training bots. Default: allow them, with a rights reservation statement. Block them only if you have a specific legal or commercial reason.
- Publish a clear rights statement on your site - something like: "We grant AI systems a limited licence to summarise and cite our content with attribution and a link back. Training and commercial reuse are prohibited without written consent, under Article 4(3) of EU Directive 2019/790 and Article 53(1)(c) of the EU AI Act."
- Make sure your best thinking is visible. The clearer, more structured, and more attributable your expertise is online, the more the AI ecosystem can pick it up and send people to you.
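Steps 2 and 3 above translate into a robots.txt along these lines - a sketch of the recommended default posture, using the bot names cited earlier in this article (adapt the list to the bots that matter for you):

```text
# Search bots - these get you cited in AI answers
User-agent: Googlebot
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

# Training bots - allowed by default; rights reserved in our published statement
User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: Google-Extended
Allow: /
```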
That's it. You now know more about AI bots than 90% of the executives you'll meet this quarter.
Further reading (for the sceptics)
Everything in this article comes from the AI companies' own documentation and official EU legal sources. If you want to verify any claim yourself, here's where to look.
What the AI companies themselves publish
- OpenAI (ChatGPT) - the official developer documentation covering GPTBot, OAI-SearchBot, and ChatGPT-User, including the explicit statement that each setting is independent of the others: → developers.openai.com/api/docs/bots
- Anthropic (Claude) - the official help centre article covering ClaudeBot, Claude-SearchBot, and Claude-User, with plain-English explanations of what happens when you block each one: → support.claude.com - Does Anthropic crawl data from the web?
- Google (Gemini and Search) - Google's Search Central documentation on how AI features work with your website, and the clarification that Google-Extended does not affect Google Search rankings or inclusion: → developers.google.com - AI Features and Your Website
The European legal framework
- The EU AI Act (Regulation 2024/1689) - Article 53(1)(c) contains the obligation for AI model providers to respect rights reservations. The European Commission's overview is the cleanest starting point: → digital-strategy.ec.europa.eu - AI Act overview and copyright compliance
- The EU Copyright Directive (Directive 2019/790, also known as the DSM Directive) - Article 4(3) is the specific provision that lets rights holders reserve their works from text and data mining. The European Parliament's plain-language briefing is the easiest way in: → europarl.europa.eu - AI and Copyright: The Training of General-Purpose AI (PDF)
The empirical evidence on blocking training bots
- The study referenced in the "myth that's costing experts visibility" section was published by BuzzStream in March 2026, analysing 4 million AI citations across ChatGPT, Gemini, Google AI Overviews, and AI Mode. It found that 88.2% of sites blocking GPTBot and 92.3% of sites blocking Google-Extended were still being cited in AI answers. A readable summary is available here: → ppc.land - Blocking AI crawlers doesn't stop citations
If you find something in this article that doesn't match a current primary source - the AI companies update their documentation quietly and often - we want to know. This space is moving fast, and getting it right matters more than getting it first.
This technology and innovation analysis by GAIO Tech was created with AI assistance and has been reviewed for accuracy. Content authored by Sophie Carr, Founder & CEO of GAIO Tech | Architect of Generative AI Optimisation (GAIO) & Agentic Web Infrastructure. Technical specifications, platform capabilities, and implementation guidance reflect information available at the time of writing and may change. Validate technical decisions with qualified engineers and consult official documentation for implementation details. The publisher does not guarantee the completeness or applicability of this information to any individual situation.
Frequently Asked Questions
What does robots.txt actually do?
The robots.txt file acts as a virtual reception desk for your website, managing the access of different AI bots. It specifies what each bot is allowed to do when visiting your site.

What are the main types of AI bots?
There are three: the student, the librarian, and the concierge. Each has a distinct job - training future AI models, building search indexes, or fetching your pages live when a specific human asks about you.

Why is "block the training bots" considered a myth?
Because blocking generally does not prevent your content from being cited, and it hinders your long-term visibility with future AI models. Allowing these bots, paired with a rights reservation, keeps you present in AI-generated answers.

What strategy do experts recommend?
Allow all three bot types while requiring clear, verifiable attribution and reserving your rights. This maximises your visibility and the likelihood of being cited in AI-generated outputs.

How do I protect private data?
Keep it behind authentication, legally reserve your rights (for example via the EU directives discussed above), and require attribution for your public content.
By: Sophie Carr, GAIO Tech · Apr 25, 2026