
Your website and AI bots: what they crawl, what they ignore, and what you should be controlling right now

Gecko Studio · Updated Apr 2026
[Image: data centre corridor with illuminated cabling, representing the AI bots that visit your website]

Two years ago, the only bot we were seriously worried about was Googlebot. Today there are dozens: OpenAI's GPTBot, Anthropic's ClaudeBot, PerplexityBot, Google-Extended, Applebot-Extended, Bytespider, CCBot, Amazonbot, FacebookBot, cohere-ai, Meta-ExternalAgent — the list keeps growing every month. Each one visits your site, decides what to read, what to keep and what to ignore, and uses what it keeps to build the answers it gives your potential customers when they ask about what you sell. And more importantly: you can control a lot more than you think about what they do with your content — but only if you know where to look. This article is a direct guide on how to manage AI bots on your website: what they're crawling, what you should block, what you should allow, and how to know when something changes.

Infographic · AI bots on your website
The 3 categories of bots visiting your site in 2026

  • Training — they take your content for datasets that feed future models: GPTBot, ClaudeBot, Google-Extended, Applebot-Extended, CCBot, Bytespider.
  • Real-time response — they read your site the moment someone asks the chatbot something: PerplexityBot, ChatGPT-User, Claude-User, OAI-SearchBot.
  • Generative index — they build their own indexes for conversational search engines: PerplexityBot, Meta-ExternalAgent, Amazonbot.

Who the AI bots visiting your site right now actually are

Let's start with the basics: not all bots are the same, and they don't all have the same purpose. There are three categories worth distinguishing clearly, because the decision to block or allow them is different in each case.

Training bots

These are the ones that collect your content to include it in the datasets AI models are trained on. What they crawl today may show up in a model's answers 6–12 months from now, when that model is released or updated. The main ones are GPTBot (OpenAI), ClaudeBot (Anthropic; an older anthropic-ai token still appears in some robots.txt files), Google-Extended (a training opt-out token for Gemini models, separate from Googlebot), Applebot-Extended, cohere-ai, CCBot (Common Crawl, which feeds many models), Bytespider (ByteDance/TikTok) and Amazonbot.

Real-time response bots

These are the ones that, when a user asks the chatbot a question, go to your site right then to read your content and respond. They don't use it for training; they use it to answer right now. This group includes PerplexityBot, ChatGPT-User (when ChatGPT browses the web for you), Claude-User, OAI-SearchBot and similar. Blocking them means those engines can't respond with your information when someone asks about you or your sector.

Generative index bots

These are the ones building their own indexes for generative search engines. They resemble Googlebot, but they feed conversational experiences rather than classic SERPs. PerplexityBot also works this way, and Meta-ExternalAgent is the most recent addition to this space.

What each bot is doing on your site right now (and how to find out)

Before making decisions, the first thing is to know what's happening. Most websites don't know which bots are visiting, how many pages they're reading or what kind of content they're taking away. You can find out, and it's easier than it looks.

How to spot bots in your logs

If you have access to your server logs (access.log on Apache/Nginx, or your hosting panel), you can filter by user-agent. Bots identify themselves with clear names. A typical GPTBot entry, for example, looks like this:

203.0.113.45 - - [10/Apr/2026:12:34:56 +0200] "GET /servicios/auditoria-seo HTTP/1.1" 200 34521 "-" "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.2; +https://openai.com/gptbot)"

With a basic analysis of 30 days of logs, you can get a good picture of which bots are visiting, how often and which pages. If you don't have access, there are plugins for WordPress (Wordfence, WP Activity Log) and solutions for other CMSs that give you the same information. In projects where we've measured it, a medium-sized website gets anywhere between 500 and 5,000 AI bot visits per month — something unthinkable until 2023.
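A 30-day log analysis like the one described can be sketched in a few lines of Python. This is a minimal example, assuming the standard combined log format of Apache/Nginx; the sample lines below are synthetic and the bot list is the one named in this article:

```python
from collections import Counter

# User-agent tokens of the main AI bots (names taken from the article).
AI_BOTS = ["GPTBot", "ClaudeBot", "Google-Extended", "Applebot-Extended",
           "CCBot", "Bytespider", "PerplexityBot", "ChatGPT-User",
           "OAI-SearchBot", "Amazonbot", "Meta-ExternalAgent", "cohere-ai"]

def count_ai_bot_hits(log_lines):
    """Count hits per AI bot across access-log lines (combined format)."""
    hits = Counter()
    for line in log_lines:
        for bot in AI_BOTS:
            if bot in line:
                hits[bot] += 1
                break  # count at most one bot per request line
    return hits

# Two synthetic log lines for illustration:
sample = [
    '203.0.113.45 - - [10/Apr/2026:12:34:56 +0200] "GET /services HTTP/1.1" 200 34521 "-" "Mozilla/5.0; compatible; GPTBot/1.2"',
    '198.51.100.7 - - [10/Apr/2026:12:35:01 +0200] "GET /blog HTTP/1.1" 200 12000 "-" "PerplexityBot/1.0"',
]
print(count_ai_bot_hits(sample))
```

Run it against your real access.log (one line per request) to get the per-bot visit counts described above.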

Which pages they're reading

A pattern we see: AI bots particularly often read service pages, product landings, blog posts with clear questions, and the "about us" page. They tend to skip legal pages, checkout, private dashboards and static resources. That makes sense: they're looking for content that answers real user questions.

How much content they take

One interesting data point you can see in the logs is the response size. When a bot reads a complete 50KB page, it takes all that information with it. If your main content sits behind JavaScript or requires interaction, many bots end up with a degraded version. This matters when deciding how to serve your content (more on this in a section below).
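A rough way to estimate what a non-rendering bot actually extracts is to strip tags from the raw HTML and measure the remaining text. This is a crude sketch (a real check would use a proper HTML parser); the `visible_text` helper and the sample page are hypothetical:

```python
import re

def visible_text(html):
    """Crude estimate of the text a non-rendering bot extracts:
    drop <script>/<style> bodies, then strip the remaining tags."""
    html = re.sub(r"(?is)<(script|style).*?</\1>", " ", html)
    text = re.sub(r"(?s)<[^>]+>", " ", html)
    return re.sub(r"\s+", " ", text).strip()

raw = "<html><body><h1>Pricing</h1><script>render()</script><p>Plans from 29 EUR</p></body></html>"
print(visible_text(raw))  # "Pricing Plans from 29 EUR"
```

If the visible text of your raw HTML is a small fraction of what a browser shows, the important content is being injected by JavaScript and many bots never see it.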

robots.txt: your first line of control

robots.txt is still the standard way to tell a bot what it can and can't visit. Major AI bots respect it (OpenAI, Anthropic, Google, Meta, Apple), while malicious or ethically questionable bots ignore it (more on that later). But for managing the bulk of legitimate automated traffic, it's your main tool.

Basic syntax for blocking AI bots

A robots.txt example that blocks the main training bots but allows real-time response bots would look like this:

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: cohere-ai
Disallow: /

User-agent: FacebookBot
Disallow: /

User-agent: PerplexityBot
Allow: /

User-agent: ChatGPT-User
Allow: /

This example illustrates a concrete decision: "I don't want my content used for training, but I do want to be cited when someone asks right now." It's a very common stance for businesses that create their own content (media outlets, consultancies, agencies) who don't want their know-how feeding models for free but do want referral traffic from those models when someone consults them.

Typical mistake: blocking everything without thinking

One of the most damaging things we've seen with clients is people who, spooked by the "AI is stealing my content" narrative, indiscriminately block every bot with a blanket User-agent: * Disallow: /. Because User-agent: * applies to every crawler, that doesn't just mean zero citations in AI Overviews, zero mentions in ChatGPT and zero traffic from Perplexity — it also locks out Googlebot and kills your classic search traffic. For a business looking to capture B2B leads or high-consideration customers, shutting every door is self-destructive. The right call is almost never blocking everything: it's choosing which bots you block and which you don't, based on your business model.

Typical mistake: copying someone else's robots.txt

Every business has its own logic. The robots.txt of a paywalled publisher doesn't work for an ecommerce site, which doesn't work for a service agency, which doesn't work for a specialised database. Copying someone else's without understanding the reasoning behind it is a guaranteed way to end up blocking or allowing things you didn't mean to.

Four AI bot strategies based on your type of business

There's no single answer to what you should do. There are different strategies depending on what you sell and how you monetise. These are the four most common.

Strategy 1: "Content as intellectual property" (blocking training)

If your business model depends on producing original content that would be costly for others to replicate (consultancy, research, original reporting, databases), it makes sense to block training bots so they don't put your content into datasets. You do, however, allow real-time response bots, because every citation is potentially a customer. This is the strategy most leading publishers and research-driven consultancies follow.

Strategy 2: "Maximum visibility" (allow everything)

If your business depends on discovery and every mention is gold, opening the doors to every bot is the sensible call. Local services, high-volume ecommerce, SMEs that need to capture leads. Here the robots.txt is permissive, and you also want to make sure your content is easy to read (plain HTML, not JavaScript-only). For these businesses, visibility in AI responses more than makes up for the "loss" of someone training on your content.

Strategy 3: "Hybrid by section"

A smart and underused strategy: allow AI bots on your blog and informational pages (because you want to be cited) and block them on your product or detailed pricing pages (because you don't want third parties using them to compete or compare). robots.txt lets you discriminate by path. This takes a bit of thought, but the freedom it gives you is huge.
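A section-based setup can look like the snippet below. The paths are hypothetical — adjust them to your own URL structure (anything not disallowed is allowed by default):

```
# Block training bots on commercial pages, let them read the blog
User-agent: GPTBot
Disallow: /pricing/
Disallow: /products/

User-agent: ClaudeBot
Disallow: /pricing/
Disallow: /products/
```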

Strategy 4: "Wait and see with monitoring"

If you're not sure what to do, the reasonable default is to not block anything but do monitor what the bots are doing. In 6 months, with real data in hand, you can make an informed decision. This is our recommendation for most SMEs we work with: understand first, decide later. Blocking out of generic fear almost always costs more than it saves.

Beyond robots.txt: llms.txt and positive signals

Something new from the past few months: llms.txt. It's a kind of "robots.txt for language models", designed to give AI models a curated guide to your site: which sections are important, which pages are representative, what structured information you offer. Unlike robots.txt (which is a negative filter: "not this"), llms.txt is a positive filter ("this yes, and here's the summary").

As of today (April 2026), llms.txt isn't an official standard accepted by every major model, but some are starting to read it and others are considering implementing it. The effort to generate one is low if your site is well structured, and it can give you a competitive edge over sites that only have robots.txt. Our recommendation: if your competition doesn't have one yet and your site produces content relevant to your sector, implement it now. The barrier to entry is minimal and the potential return is high.
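A minimal llms.txt, following the markdown shape of the llms.txt proposal (an H1 with the site name, a blockquote summary, then link sections); the URLs and descriptions here are purely illustrative:

```
# Gecko Studio
> SEO agency. The pages below are the best entry points to our services and expertise.

## Services
- [SEO audit](https://example.com/services/seo-audit): what our full audit covers

## Blog
- [AI bots guide](https://example.com/blog/ai-bots): how to manage AI crawlers on your site
```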

How to make your content readable to AI bots

Even if you allow bots in, if your content sits behind heavy JavaScript, behind a login or is loaded dynamically through API calls without server-side rendering, many bots won't be able to read it fully. This is a technical issue that can be fixed, but almost no website reviews it. These are the basic checks.

  • Server-side rendering or pre-rendering. If your site is a React/Vue/Angular SPA, make sure your main content is served in the initial HTML, not only after JS execution. Frameworks like Next.js, Nuxt, Astro or SvelteKit do this by default in their standard modes.
  • Content that doesn't depend on interactions. Accordions, tabs, collapsed "read more" sections — anything that only loads on click won't be read by many bots. If you have relevant information hidden behind a button, you're losing citations.
  • Structured metadata. Schema.org is how you tell the bot "this is a product, this is a price, this is an author, this is a FAQ". AI bots are reading schema more and more, and pages with proper schema are more likely to be cited with accurate information.
  • Clear titles and structure. A single H1, coherent H2s and H3s, readable paragraphs, lists where appropriate. Models extract information better from well-structured text. The good news is that this is exactly what classic SEO also asks for: the two are aligned.
  • Content in your target user's language. If your business is local, your main content should be in the market's language. Bots don't automatically translate to cite you in another language; they simply don't cite you if you're not in that language.
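On the structured-metadata point, a minimal FAQPage block in JSON-LD looks like this (the question and answer text are illustrative — swap in your own):

```
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "FAQPage",
  "mainEntity": [{
    "@type": "Question",
    "name": "Do AI bots respect robots.txt?",
    "acceptedAnswer": {
      "@type": "Answer",
      "text": "The major ones (OpenAI, Anthropic, Google, Meta, Apple) do; aggressive scrapers often ignore it."
    }
  }]
}
</script>
```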

Malicious bots and aggressive scrapers: a different story

So far we've talked about the legitimate bots that respect robots.txt. But there's another layer: aggressive scrapers, unauthorised competitive analysis bots, crawlers that ignore all the rules and can take gigabytes of your content per day without permission. You don't stop these with robots.txt — you stop them with active measures.

  • Server-side rate limiting. Configure your server (or Cloudflare, Sucuri or similar) to limit requests per IP over a short period. A human user doesn't make 200 requests to your site in 30 seconds.
  • Cloudflare Bot Management. If you use Cloudflare, it has a specific bot module that identifies and blocks malicious patterns with ML. A basic plan covers a lot, and you can tune it.
  • Detecting fake user-agents. Many malicious bots lie about who they are. A simple rule is to verify that requests claiming to be "Googlebot" actually come from Google's IP ranges. The ones that don't get blocked.
  • Honeypots. A hidden link in your HTML that a human would never click but a link-following bot would. Whichever bot touches it, you block. It's a classic defensive technique and it still works well.
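The fake-user-agent check above — reverse DNS, then forward-confirm — can be sketched in Python. The suffix check is the standard one Google documents (hostnames end in .googlebot.com or .google.com); the function names are our own:

```python
import socket

GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")

def hostname_is_google(hostname):
    """Pure check: does a reverse-DNS hostname belong to Google?"""
    return hostname.endswith(GOOGLE_SUFFIXES)

def verify_googlebot(ip):
    """Reverse-resolve the IP, check the Google suffix, then confirm the
    hostname resolves back to the same IP (this part needs network access)."""
    try:
        hostname = socket.gethostbyaddr(ip)[0]
    except socket.herror:
        return False
    if not hostname_is_google(hostname):
        return False
    try:
        return socket.gethostbyname(hostname) == ip
    except socket.gaierror:
        return False

print(hostname_is_google("crawl-66-249-66-1.googlebot.com"))  # True
print(hostname_is_google("fake-googlebot.example.com"))       # False
```

Requests claiming to be "Googlebot" that fail this two-step check get blocked; the same pattern works for other bots that publish their hostnames or IP ranges.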

The practical difference matters: robots.txt is for telling "not this" to polite bots; rate limiting and active blocking is for protecting yourself from the ones that aren't. The two layers aren't mutually exclusive — they complement each other.
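As an illustration of the rate-limiting layer, this is roughly what a per-IP limit looks like in Nginx (zone name and thresholds are illustrative; limit_req_zone belongs in the http context):

```
# Allow ~2 requests/second per IP, with a burst of up to 20
limit_req_zone $binary_remote_addr zone=per_ip:10m rate=2r/s;

server {
    location / {
        limit_req zone=per_ip burst=20 nodelay;
    }
}
```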

How we audit this in an SEO audit

When a new client comes on board, one of the 60 reviews we run in our full SEO audit is specifically about bots and crawling. We look at 4 concrete things:

  1. The current robots.txt content and whether it matches the client's business strategy (often it doesn't).
  2. Which AI bots are showing up in the last 30 days of logs, how often and to which URLs.
  3. Whether the main content is readable without JavaScript or requires advanced rendering to extract information.
  4. Whether an llms.txt exists and, if not, whether implementing one makes sense for that specific client.

The output of this section is usually a short but very profitable deliverable: in many cases we find the bots are reading pages the client doesn't care about (and should block) or that pages that would actually matter are invisible because they sit behind heavy JavaScript. These are cheap fixes with a direct impact on visibility in generated responses.

How to take control of AI bots on your website this week

Properly managing AI bots on your website isn't a months-long project and doesn't require restructuring anything. It's a sequence of 4 actions that fit into a working week:

  1. Download 30 days of server logs and filter by user-agent to see which bots are visiting.
  2. Pick one of the 4 strategies from the article (selective blocking, full opening, hybrid by section, or wait-and-see with monitoring) based on your business model.
  3. Update the robots.txt to reflect that decision.
  4. Check that important content is served as plain HTML without depending on heavy JavaScript.

If you publish original content and want to stand out, add an llms.txt before your competition does: the standard isn't official yet but some engines are already reading it, and it takes half an hour to implement properly. And protect whatever needs protecting with rate limiting or Cloudflare for the scrapers that don't respect robots.txt — they're a different category. Once you've done all this, you've got a foundation that'll carry you for the next 12 months and that takes 30 minutes to update whenever a new bot appears. It's probably the best cost/impact ratio you can get right now for AI visibility.


Want to apply this to your business?

We analyse your case and propose a tailored SEO strategy. Free consultation.

Talk to an expert
