TL;DR
- AI bots are simply visitors to your website, and robots.txt acts as your virtual reception desk to manage them.
- There are three main types of bots (student, librarian, concierge) with distinct jobs, run by major AI companies.
- For most experts, allowing all bot types and providing clear Verifiable Attribution and rights reservation is the best strategy for visibility and being cited.
- The idea that blocking training bots protects you is a myth; blocking generally doesn't prevent citation, and it hinders your long-term mindshare with future AI models.
- True protection involves keeping private data behind authentication, legally reserving your rights (e.g., via EU directives), and requiring attribution for your content.
Table of Contents
- Start with a picture you already know
- Why there are so many bots now
- The Legal Backbone: Your "Right to Say No"
- Automation: The GAIO Agentic Infrastructure
- The single decision you actually have to make
- The myth that's costing experts visibility
- Where your actual protection lives
- The whole thing in one sentence
- The five-minute action list
- Further reading (for the sceptics)
Start with a picture you already know
Imagine your practice or firm has a reception desk.
Different kinds of people walk through the door:
- Someone dropping off a package
- A prospective client coming in for a consultation
- A researcher asking to read your published papers
- A competitor's assistant pretending to be someone else
You don't treat them all the same. You have rules. Some get buzzed in. Some get asked to wait. Some get politely turned away.
AI bots are just visitors to your website. The internet version of that reception desk is a tiny file called robots.txt. It sits quietly on your website and tells each visitor what they're allowed to do.
That's it. That's the whole concept.
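To make the reception desk concrete, here is a minimal sketch using Python's standard-library robots.txt parser. The file contents, the visitor name ExampleBot, and the example.com URLs are made up for illustration; the point is only that each visitor gets its own rules, and anyone not named falls back to the default.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one rule set for a named visitor, one default.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /members/

User-agent: *
Allow: /
"""


def may_visit(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the rules in robots_txt let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for path in ("/articles/intro", "/members/area"):
        url = "https://example.com" + path
        print(path, may_visit(ROBOTS_TXT, "ExampleBot", url))
```

Keep in mind that robots.txt is a set of instructions, not a lock: this only shows what the file asks of each visitor.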
Why there are so many bots now
Here's what tripped everyone up in the last two years.
Until recently, there was basically one type of visitor worth thinking about: Googlebot. It read your site, put you in search results, and sent you traffic. Simple.
Then AI arrived. And the AI companies didn't send one bot; they sent three. Each of the major AI players - OpenAI (ChatGPT), Anthropic (Claude), and Google (Gemini) - now runs a "three-headed" fleet of bots, each with a different job.
Bot Type 1: The Student (Training)
This bot reads your website to help train future AI models. Think of it as a student copying notes from a library to get smarter for an exam. It doesn't send traffic to you; it just "learns" from your writing, and that knowledge gets baked into the next version of the AI. Blocking these is how you exercise your opt-out right under the EU Copyright Directive, which the EU AI Act obliges AI providers to respect.
- OpenAI: GPTBot
- Anthropic: ClaudeBot (Note: Anthropic uses this name for high-volume training crawls)
- Google: Google-Extended
Bot Type 2: The Librarian (Indexing)
This bot reads your site to build a search index, a catalog the AI can check when someone asks a question. When ChatGPT or Claude says "According to [your website]..." and provides a link, it's because the Librarian bot indexed you beforehand. Allow these to ensure you get cited and linked.
- OpenAI: OAI-SearchBot
- Anthropic: Claude-SearchBot
- Google: Googlebot (the same one you already know)
Bot Type 3: The Concierge (Real-time)
This bot only shows up when a specific human asks about you right now. A prospective client types a specific query into an AI assistant, and the Concierge bot runs over to your website that very second to read the latest info.
- OpenAI: ChatGPT-User
- Anthropic: Claude-User
- Google: Googlebot (Google is so fast they don't need a separate bot for this!)
These are the most valuable visits on the internet. Each one represents a real human, asking a real question, in real time.
Knowing which bot is which allows you to decide: do you want to help the Student get smarter for free, or just help the Concierge find your front door and send you a lead?
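If you want to tell the three roles apart in practice, one simple approach is to match the bot names listed above against the User-Agent string a visitor sends. The sketch below assumes those names appear somewhere in that string; the sample strings in the usage lines are illustrative, not copied from any vendor's documentation. One caveat: as far as Google documents it, Google-Extended is a robots.txt control token rather than a crawler with its own User-Agent string, so it is left out of the lookup here.

```python
from typing import Optional

# Names taken from the lists above. Googlebot is filed under "librarian",
# although the article also lists it as Google's concierge.
BOT_ROLES = {
    "GPTBot": "student",
    "ClaudeBot": "student",
    "OAI-SearchBot": "librarian",
    "Claude-SearchBot": "librarian",
    "Googlebot": "librarian",
    "ChatGPT-User": "concierge",
    "Claude-User": "concierge",
}


def classify(user_agent: str) -> Optional[str]:
    """Return the role of a known AI bot, or None for anything else."""
    lowered = user_agent.lower()
    for name, role in BOT_ROLES.items():
        if name.lower() in lowered:
            return role
    return None


if __name__ == "__main__":
    print(classify("Mozilla/5.0 (compatible; GPTBot/1.0)"))
    print(classify("Mozilla/5.0 (Windows NT 10.0) Firefox/120.0"))
```

A User-Agent string is self-declared, so treat the result as a label, not as proof of identity.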
The Legal Backbone: Your "Right to Say No"
Managing your bots isn't just about website performance; it's about exercising your legal rights. Two specific European laws have changed the game, giving you the power to decide how your content is used.
1. The Right to Opt-Out (EU Copyright Directive)
Article 4(3) of the EU Copyright Directive (2019/790) qualifies the Directive's "Text and Data Mining" (TDM) exception. In essence, AI companies may mine lawfully accessible web content by default unless the rights holder has "expressly reserved" their rights.
- The "Machine-Readable" Rule: For websites, the law requires this reservation to be made in a way that computers can understand.
- The Solution: In the tech world, this means using your robots.txt file. By blocking "Student" bots like GPTBot or ClaudeBot, you are making that reservation in a form machines can read.
2. The Enforcement "Teeth" (EU AI Act)
While the Copyright Directive gave you the right, the EU AI Act (Article 53) provides the enforcement. This brand-new law mandates that any "General Purpose AI" provider (like OpenAI, Anthropic, or Google) must have a policy in place to respect those copyright opt-outs.
This applies even if the companies are based in the US. If they want to offer their AI models to the European market, they are legally obligated to honor the "no-training" signals you set on your website.
The Bottom Line: When you block a "Student" bot, you aren't just tweaking your code; you are setting a legal boundary that AI companies must respect if they want to operate globally.
Automation: The GAIO Agentic Infrastructure
Manual management of these bots is nearly impossible because names and behaviors shift constantly. The GAIO Delivery Dashboard provides the technical infrastructure to turn your legal preferences into machine-readable reality.
Dynamic Robots.txt
We automatically update your file as AI companies launch or rename bots. You don't have to track which name belongs to which "job"; the infrastructure handles the technical handshakes for you.
Real-Time Analytics
Stop guessing who is visiting. Our dashboard provides a live "reception log" where you can see exactly who is visiting (e.g., Anthropic's ClaudeBot or Google's Googlebot), how many requests they've made, and how much data they've consumed.
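For a sense of what a "reception log" boils down to, here is a rough sketch that counts requests and bytes per bot from web server access logs. It is not how the GAIO dashboard works internally; it assumes logs in the common "combined" format, and the sample lines, IP addresses, and User-Agent strings are invented for the example.

```python
import re
from collections import defaultdict

BOT_TOKENS = (
    "GPTBot", "OAI-SearchBot", "ChatGPT-User",
    "ClaudeBot", "Claude-SearchBot", "Claude-User", "Googlebot",
)

# Matches: "request" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'"[^"]*" (?P<status>\d{3}) (?P<bytes>\d+|-) "[^"]*" "(?P<agent>[^"]*)"'
)

# Invented sample lines in combined log format.
SAMPLE_LOG = [
    '203.0.113.5 - - [26/Apr/2026:10:00:00 +0000] "GET /articles/a HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; ClaudeBot/1.0)"',
    '203.0.113.5 - - [26/Apr/2026:10:00:05 +0000] "GET /articles/b HTTP/1.1" 200 2048 "-" "Mozilla/5.0 (compatible; ClaudeBot/1.0)"',
    '198.51.100.7 - - [26/Apr/2026:10:01:00 +0000] "GET / HTTP/1.1" 200 1024 "-" "Mozilla/5.0 (compatible; Googlebot/2.1)"',
    '192.0.2.9 - - [26/Apr/2026:10:02:00 +0000] "GET / HTTP/1.1" 200 4096 "-" "Mozilla/5.0 (Windows NT 10.0) Firefox/120.0"',
]


def tally(lines):
    """Return {bot name: {"requests": n, "bytes": n}} for known bots only."""
    totals = defaultdict(lambda: {"requests": 0, "bytes": 0})
    for line in lines:
        match = LOG_LINE.search(line)
        if not match:
            continue  # skip lines that are not in the assumed format
        agent = match.group("agent").lower()
        for token in BOT_TOKENS:
            if token.lower() in agent:
                size = match.group("bytes")
                totals[token]["requests"] += 1
                totals[token]["bytes"] += 0 if size == "-" else int(size)
                break
    return dict(totals)


if __name__ == "__main__":
    for bot, stats in tally(SAMPLE_LOG).items():
        print(bot, stats)
```

Because User-Agent strings can be faked (the competitor's assistant pretending to be someone else), counts like these show who claims to be visiting, nothing more.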
Rights Infrastructure
Under Article 4(3) of the EU Copyright Directive, you must reserve your work in a "machine-readable" way. Our dashboard acts as the system of record to set your own rights, automatically publishing the necessary legal signals that major AI providers (like OpenAI and Google) are now legally required to detect and honor under the EU AI Act.
The single decision you actually have to make
For each of those three bot types, you're choosing between Allow and Block.
| Bot type | If you Allow | If you Block |
|---|---|---|
| The student (training) | Your expertise becomes part of future AI models | Your work won't shape future models |
| The librarian (search) | You get cited in AI answers, with links back to your site | You don't appear in AI search results |
| The concierge (user) | AI can fetch your page when someone asks about you | AI can't see you when a specific person is researching you |
For 95% of experts - the ones whose business depends on being found and recommended - the answer for all three is Allow.
The 5% who block training bots are usually large publishers (New York Times, etc.) whose content itself is the product people pay for. If your content is thought leadership designed to build your authority, you are not in that group.
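Once you have made the three Allow/Block choices, turning them into a robots.txt file is mechanical. The sketch below is one way to do it, assuming the bot names listed earlier in this article; the function name and the policy format are made up for the example, and real files often carry other rules (sitemaps, default groups) that are left out here.

```python
# Bot names as listed earlier in the article, grouped by role.
FLEET = {
    "student": ["GPTBot", "ClaudeBot", "Google-Extended"],
    "librarian": ["OAI-SearchBot", "Claude-SearchBot", "Googlebot"],
    "concierge": ["ChatGPT-User", "Claude-User"],
}


def build_robots_txt(policy: dict) -> str:
    """Build robots.txt text from {"student": bool, ...}, where True means Allow."""
    groups = []
    for role, agents in FLEET.items():
        rule = "Allow: /" if policy[role] else "Disallow: /"
        lines = [f"# {role}"]
        lines += [f"User-agent: {agent}" for agent in agents]
        lines.append(rule)
        groups.append("\n".join(lines))
    # Groups are separated by a blank line, one group per role.
    return "\n\n".join(groups) + "\n"


if __name__ == "__main__":
    # The default recommended above: allow all three roles.
    print(build_robots_txt({"student": True, "librarian": True, "concierge": True}))
```

Flipping "student" to False gives the publisher-style setup: training bots blocked, search and real-time bots still welcome.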
The myth that's costing experts visibility
Here's what you might hear in a marketing meeting, and why it's wrong:
"We should block the training bots so AI doesn't steal our content."
This sounds cautious. It sounds protective. But it's based on a misunderstanding.
Blocking the student bot does not affect the librarian bot. The AI companies have said this in their official documentation - OpenAI, Anthropic, and Google all explicitly confirm it. And when researchers analysed four million real AI citations in early 2026, they found that over 88% of websites blocking training bots were still being cited in AI answers anyway.
Translation: blocking training bots doesn't really protect you from anything, and it definitely doesn't help you.
What it does do is slow your long-term mindshare. If you're absent from every future AI model's training data, then five years from now, when someone asks an AI "who are the leading experts in cardiothoracic surgery in Europe," the AI's default knowledge won't include you. It'll include whoever didn't block.
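The independence of the student and the librarian is easy to see in the robots.txt rules themselves. This small check uses Python's standard-library parser on a file that blocks only GPTBot; the example.com URL is a placeholder. It shows what the file says, which is separate from whether any given crawler chooses to obey it.

```python
from urllib.robotparser import RobotFileParser

# A robots.txt that blocks only OpenAI's training bot.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /
"""


def verdicts(robots_txt, agents, url="https://example.com/insights/"):
    """Return {agent: allowed?} for each agent under the given rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


if __name__ == "__main__":
    print(verdicts(ROBOTS_TXT, ["GPTBot", "OAI-SearchBot", "ChatGPT-User"]))
```

Only GPTBot comes back as blocked; the rule for the student says nothing about the librarian or the concierge, so both remain allowed.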
Where your actual protection lives
This is the part that gets lost. Real protection isn't in robots.txt. It's in three other places:
Keep private things actually private.
Client records. Patient data. Gated content. Member areas. Checkout flows. These belong behind a login, not behind a bot instruction. Bots can be ignored; authentication can't.
Reserve your rights legally.
This is where the European Union has recently done something genuinely useful for experts. Under Article 4(3) of the EU Copyright Directive (2019/790) and Article 53(1)(c) of the EU AI Act, you as a rights holder can formally reserve your work from being used to train AI - and AI companies placing models on the EU market are legally required to respect that reservation. This applies whether you're in Europe or not, as long as your content reaches European users.
In plain English: you can say "yes, AI can summarise me and cite me, but no, you cannot use my work to train your models" - and that's now a legal statement, not just a preference.
Require attribution.
This is the one that actually matters for your business. You don't care if an AI summarises you; you care whether it credits you and sends the client to your door. Setting "attribution required" in your content rights is the signal that turns "AI used my knowledge" into "AI recommended me."
The whole thing in one sentence
AI bots are just website visitors.
Most of them, if you let them in, will help prospective clients find you - and modern rights frameworks now let you say "come in, cite me, send people my way - but don't train on me and don't use me commercially without asking."
That's the posture. That's the playbook.
The experts who will win in AI search over the next five years aren't the ones hiding their expertise from the bots. They're the ones letting the bots in, under clear conditions, with their name attached to the answer.
The five-minute action list
If you only do five things this week:
- Check if your website has a robots.txt file. Type yourdomain.com/robots.txt into a browser. If you see a page, you have one. If you get an error, you don't - and your developer or web team needs to know.
- Make sure search bots are allowed. At minimum: Googlebot, OAI-SearchBot, Claude-SearchBot, PerplexityBot. These are the ones that get you cited.
- Decide your position on training bots. Default: allow them, with a rights reservation statement. Block them only if you have a specific legal or commercial reason.
- Publish a clear rights statement on your site - something like: "We grant AI systems a limited licence to summarise and cite our content with attribution and a link back. Training and commercial reuse are prohibited without written consent, under Article 4(3) of EU Directive 2019/790 and Article 53(1)(c) of the EU AI Act."
- Make sure your best thinking is visible. The clearer, more structured, and more attributable your expertise is online, the more the AI ecosystem can pick it up and send people to you.
That's it. You now know more about AI bots than 90% of the executives you'll meet this quarter.
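If you would rather hand the first two checks to a script, the sketch below is one way to do it with Python's standard library. The function names are made up for this example, and the search bot list is the one from the action list above. Note that fetch_robots_txt returns None both when the file is missing and when the site cannot be reached, so a None result means "go and look", not "definitely absent".

```python
from typing import List, Optional
import urllib.request
from urllib.robotparser import RobotFileParser

# The search bots named in the action list above.
SEARCH_BOTS = ("Googlebot", "OAI-SearchBot", "Claude-SearchBot", "PerplexityBot")


def fetch_robots_txt(domain: str, timeout: float = 10.0) -> Optional[str]:
    """Download https://<domain>/robots.txt, or return None on any failure."""
    url = f"https://{domain}/robots.txt"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read().decode("utf-8", errors="replace")
    except OSError:
        # Covers HTTP errors such as 404, DNS failures, and timeouts.
        return None


def blocked_search_bots(robots_txt: str) -> List[str]:
    """Return the search bots that the rules keep away from the home page."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [bot for bot in SEARCH_BOTS if not parser.can_fetch(bot, "/")]
```

A typical use would be to call fetch_robots_txt("yourdomain.com") and, if it returns text, pass that text to blocked_search_bots; an empty list means none of the four search bots is shut out of your home page.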
Further reading (for the sceptics)
Everything in this article comes from the AI companies' own documentation and official EU legal sources. If you want to verify any claim yourself, here's where to look.
What the AI companies themselves publish
- OpenAI (ChatGPT) - the official developer documentation covering GPTBot, OAI-SearchBot, and ChatGPT-User, including the explicit statement that each setting is independent of the others: → developers.openai.com/api/docs/bots
- Anthropic (Claude) - the official help centre article covering ClaudeBot, Claude-SearchBot, and Claude-User, with plain-English explanations of what happens when you block each one: → support.claude.com - Does Anthropic crawl data from the web?
- Google (Gemini and Search) - Google's Search Central documentation on how AI features work with your website, and the clarification that Google-Extended does not affect Google Search rankings or inclusion: → developers.google.com - AI Features and Your Website
The European legal framework
- The EU AI Act (Regulation 2024/1689) - Article 53(1)(c) contains the obligation for AI model providers to respect rights reservations. The European Commission's overview is the cleanest starting point: → digital-strategy.ec.europa.eu - AI Act overview and copyright compliance
- The EU Copyright Directive (Directive 2019/790, also known as the DSM Directive) - Article 4(3) is the specific provision that lets rights holders reserve their works from text and data mining. The European Parliament's plain-language briefing is the easiest way in: → europarl.europa.eu - AI and Copyright: The Training of General-Purpose AI (PDF)
The empirical evidence on blocking training bots
- The study referenced in the "myth that's costing experts visibility" section was published by BuzzStream in March 2026, analysing 4 million AI citations across ChatGPT, Gemini, Google AI Overviews, and AI Mode. It found that 88.2% of sites blocking GPTBot and 92.3% of sites blocking Google-Extended were still being cited in AI answers. A readable summary is available here: → ppc.land - Blocking AI crawlers doesn't stop citations
If you find something in this article that doesn't match a current primary source - the AI companies update their documentation quietly and often - we want to know. This space is moving fast, and getting it right matters more than getting it first.
This technology and digital innovation content is by GAIO Tech and is informed by expertise in Generative AI Optimisation (GAIO), AI Visibility Infrastructure, and Generative Engine Optimization (GEO). It reflects AI-assisted synthesis and technical analysis that has been reviewed for accuracy, not a guaranteed implementation outcome. Validate recommendations against your system architecture and constraints. It is provided for informational and educational purposes only and does not constitute professional, legal, financial, medical, or other regulated advice. Readers should consult qualified professionals for guidance specific to their circumstances. The publisher does not guarantee the completeness or applicability of this information to any individual situation.
Frequently Asked Questions
What is the purpose of the robots.txt file?
The robots.txt file acts as a virtual reception desk for your website, managing the access of different AI bots. It specifies what each bot is allowed to do when visiting your site.
How many types of AI bots are there?
There are three main types of AI bots: the student, the librarian, and the concierge. Each type has a distinct role, such as training future AI models, building search indexes, or responding to specific human inquiries.
Why is blocking training bots considered a myth?
The myth is that blocking training bots protects your content. In practice, blocking generally does not prevent citation of your content and can hinder your long-term visibility with future AI models. Allowing these bots can enhance your presence in AI-generated content.
What is the best strategy for visibility regarding AI bots?
Experts recommend allowing all types of bots while providing clear Verifiable Attribution and rights reservation. This approach maximizes your visibility and the likelihood of being cited in AI-generated outputs.
How can I protect my private data from AI bots?
To protect your private data, keep it behind authentication measures, legally reserve your rights (such as through EU directives), and require attribution for your content.
By: Sophie Carr, GAIO Tech · Apr 26, 2026




