Key facts at a glance
AI crawler access in 2026
Last updated
- Two layers, check both
- A polite layer, robots.txt and noindex, that compliant bots obey, and a hard layer, an edge or server block that returns a 403 the bot cannot bypass. You can be open on one and closed on the other.
- The block people miss most
- Since July 1, 2025, Cloudflare has blocked AI crawlers by default on newly added domains. The block sits at the network edge, so it overrides robots.txt. A site can welcome GPTBot in robots.txt and still block it at the door. Check Cloudflare AI Crawl Control.
- The WordPress switch that blocks all
- The Discourage search engines from indexing this site checkbox, under Settings then Reading, tells every bot, search and AI alike, to stay off the whole site. It is frequently left on by accident after launch.
- Training bots vs answer bots
- Training crawlers (GPTBot, ClaudeBot, Google-Extended) absorb your content with no traffic back. Answer bots (OAI-SearchBot, Claude-SearchBot, PerplexityBot) fetch live and cite and link to you. The answer bots are the ones that send traffic.
- Blocking training does not hurt Google
- GPTBot and Google-Extended are separate from Googlebot, which powers search. You can block AI training crawlers and keep your Google rankings exactly as they are.
- robots.txt only asks
- It is a request compliant bots honor, not a wall. To genuinely stop a scraper that ignores it you need an edge or firewall block. And some assistant browsers use a plain Chrome user agent with no token, so robots.txt cannot target them at all. The only way to know your real status is to test a live request from an AI user agent and read the status code, not to read robots.txt.
Source: the Cloudflare announcement and documentation on default AI crawler blocking and managed robots.txt, the OpenAI bots documentation, the WordPress reading settings documentation, and reporting on managed WordPress AI blocking.
Why this matters now
A growing share of the people who would once have found you through a Google search now ask an AI assistant instead, and the assistant answers by reading live web pages and citing them. If your page is the one ChatGPT, Claude, or Perplexity reads and links when someone asks a question in your field, you get the visit and the credibility, the same role a top Google ranking used to play. That only happens if the AI answer bots can actually reach your content. If they are blocked, you are invisible in the channel that is quietly replacing part of search.
The catch is that blocking AI crawlers has become easy to do by accident. Through 2025 the tools to control AI access multiplied, and several of them block by default or were toggled on during setup. The WordPress discourage-indexing checkbox, meant for sites under construction, blocks everything and is constantly left on after launch. Cloudflare flipped to blocking AI crawlers by default. Security plugins and managed hosts added AI-bot controls. The result is a lot of live, public WordPress sites that look open but are closed to the AI bots, often in a layer the owner never thinks to check.
This is not an argument that you must allow every AI bot. Plenty of publishers reasonably block the training crawlers that take content to train models without sending anything back. The point is that it should be a decision, not an accident. You want to know exactly which bots can reach your site, set that on purpose, and make sure the answer bots that send you traffic are not being blocked by a switch you forgot was on.
The two layers of blocking
To audit AI access you have to understand that there are two completely different layers, and a site can be open on one while closed on the other. Confusing them is why so many owners get the wrong answer.
The polite layer: robots.txt and noindex
A request, not a wall. Your robots.txt and any noindex signals tell well-behaved crawlers what you want crawled. The major AI companies' bots honor them. But this layer physically blocks nothing, so a crawler that ignores the rules can still read your pages. This is the layer the WordPress discourage setting and your robots.txt live in.
The hard layer: edge and server blocks
An enforced block. A CDN like Cloudflare, a firewall, a security plugin, or your host can return a 403 to an AI user agent before the request reaches WordPress. This actually stops the bot, and it overrides robots.txt completely. A welcoming robots.txt means nothing if the edge is returning 403.
The practical upshot: to allow a bot you must be open on both layers, and to truly block one you usually need the hard layer. Reading robots.txt alone tells you only half the story, which is why the audit below starts with a live request test.
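The two-layer rule can be written down as a tiny decision function. This is a minimal sketch, not a tool: the function name `effective_access` and its two inputs are hypothetical, and it assumes you already know what robots.txt says for the bot and which final status code a live request returned.

```python
def effective_access(robots_allows: bool, http_status: int) -> str:
    """Combine the polite layer and the hard layer into one verdict.

    robots_allows: what robots.txt says for this bot (the polite layer).
    http_status:   final status of a live request sent with the bot's
                   user agent (the hard layer).
    """
    hard_layer_open = 200 <= http_status < 300
    if not hard_layer_open:
        # An enforced block wins regardless of what robots.txt says.
        return "blocked at the edge or server"
    if not robots_allows:
        # Only compliant bots will respect this.
        return "blocked by robots.txt (compliant bots only)"
    return "open"


# A welcoming robots.txt does not help if the edge returns 403.
print(effective_access(robots_allows=True, http_status=403))
```

The order of the checks mirrors the text: the hard layer is evaluated first because it overrides the polite one.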
Training bots vs answer bots
Not all AI crawlers do the same thing, and the difference decides what you should allow. There are three groups.
Training crawlers
They collect your content to train AI models, generally with no attribution and no traffic back. The common ones are GPTBot, ClaudeBot, Google-Extended, CCBot, and Meta-ExternalAgent. Blocking these is a content-rights choice and does not affect search or AI answers.
Answer and search bots
They fetch your page live to answer a user question now, and they cite and link back, which can send you referral traffic. Examples are OAI-SearchBot, Claude-SearchBot, and PerplexityBot. These are the ones most sites want to allow, because being cited is the new visibility.
Assistant agents
They browse on a specific user request, like ChatGPT-User. Useful to allow, though some newer assistant browsers use a plain Chrome user agent with no token, so robots.txt cannot single them out.
The widely adopted posture is block training, allow search: disallow the training crawlers if you wish, while keeping the answer bots open so AI assistants can still find and cite you. Some sites that want maximum reach simply allow them all. Either is a valid choice. Blocking everything by accident is not.
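If you read server logs, the three groups can be sketched as a lookup on the user-agent string. This is an illustrative sketch only: the grouping follows the lists above, the helper name `classify_user_agent` is hypothetical, and the sample user-agent strings are made up for the example rather than copied from any vendor. Google-Extended is left out on purpose because it is a robots.txt control token, not a user-agent string that shows up in requests.

```python
from typing import Optional

# Tokens as named in this article, grouped by what the bot does.
BOT_GROUPS = {
    "training": ["GPTBot", "ClaudeBot", "CCBot", "Meta-ExternalAgent"],
    "answer": ["OAI-SearchBot", "Claude-SearchBot", "PerplexityBot"],
    "assistant": ["ChatGPT-User"],
}


def classify_user_agent(user_agent: str) -> Optional[str]:
    """Return the group whose token appears in the user agent, else None."""
    ua = user_agent.lower()
    for group, tokens in BOT_GROUPS.items():
        for token in tokens:
            if token.lower() in ua:
                return group
    # No known token: for example an assistant browser using plain Chrome.
    return None


# Illustrative strings, not real vendor user agents.
print(classify_user_agent("Mozilla/5.0 (compatible; GPTBot/1.0)"))
print(classify_user_agent("Mozilla/5.0 (Windows NT 10.0) Chrome/120.0 Safari/537.36"))
```

A `None` result is the case the text warns about: with no token to match, neither robots.txt nor a user-agent rule can single the visitor out.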
Which symptom matches yours
Find the row that matches your situation. Each points at a layer to check.
| Symptom | Most likely cause |
|---|---|
| Your content never appears or gets cited in ChatGPT, Claude, or Perplexity answers | The answer bots may be blocked at some layer. Test a live request from an AI user agent, then check robots.txt, the discourage-indexing setting, and Cloudflare. |
| Your site is behind Cloudflare and was set up in mid-2025 or later | Cloudflare has blocked AI crawlers by default since July 1, 2025, at the network edge, which overrides robots.txt. Check Cloudflare AI Crawl Control. |
| Discourage search engines is checked under Settings then Reading | That one box adds a noindex signal and a robots.txt rule telling all bots, search and AI alike, to stay off the entire site. Uncheck it on a live site. |
| Your robots.txt has Disallow rules naming GPTBot, ClaudeBot, and others, or Disallow: / | A robots.txt level block, intentional or inherited. Compliant AI bots will honor it. Edit it to the posture you actually want. |
| A live request from an AI user agent returns 403 even though robots.txt looks open | A hard block at the edge or server: Cloudflare, a security plugin, a firewall rule, or your managed host blocking AI user agents. |
| You want to stop AI scraping but training bots keep reading your content | robots.txt is only a request, so non-compliant scrapers ignore it. You need an enforcing block at the edge or server to actually stop them. |
| You allowed the bots in robots.txt but still are not cited anywhere | Access is necessary but not sufficient. The bots also need clean, structured content and ideally an llms.txt to understand and cite you. |
The five ways WordPress blocks AI crawlers
A WordPress site can block AI bots in five places. An audit checks each, because a site is only as open as its most restrictive layer.
1. The discourage-indexing setting
The Discourage search engines from indexing this site checkbox under Settings then Reading is the bluntest block in WordPress. When on, it adds a noindex signal and serves a robots.txt that asks every bot to avoid the whole site, with no distinction between Google, training crawlers, or answer bots. It is meant for development and is routinely left checked after a site goes live, quietly telling everything to stay away. Always check this first.
2. Cloudflare blocking at the edge by default
This is the one people miss. Since July 1, 2025, Cloudflare blocks AI crawlers by default, and every domain added since is set to block all known AI bots unless told otherwise. The block happens at the network edge, before the request reaches WordPress, so it overrides your robots.txt entirely. You can have a robots.txt that explicitly welcomes GPTBot and still block it completely. If you are behind Cloudflare, the answer lives in Cloudflare AI Crawl Control, not in your WordPress files.
3. robots.txt Disallow rules
Your robots.txt may carry Disallow rules naming AI bots, whether added on purpose, inherited from a theme or plugin, or copied from a template. Compliant AI crawlers honor these, so they are the right place to express a block-training-allow-search posture. The thing to watch is an overly broad Disallow: / or a block on an answer bot you actually wanted. Note that WordPress generates a virtual robots.txt unless a real file or an SEO plugin overrides it, so confirm which one is actually being served.
4. A security plugin, firewall, or header rule
Security plugins and server firewalls can block AI user agents with a hard 403, independent of robots.txt. Some ship bot-blocking lists that include AI crawlers, and a header rule or an X-Robots-Tag can also carry a noindex. Because this is an enforced block, it is what you reach for if you genuinely want to stop a scraper, and also what you must remove if it is blocking a bot you wanted to allow.
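An X-Robots-Tag is easy to overlook because it lives in the response headers rather than the page. The sketch below shows one way to test a header value for noindex once you have it, for example from the curl -I output. The helper name `has_noindex` is hypothetical and the parsing is deliberately simplified: it handles comma-separated directives and an optional bot-name prefix, and treats `none` as equivalent to noindex.

```python
def has_noindex(x_robots_tag: str) -> bool:
    """Return True if an X-Robots-Tag header value carries noindex (or none)."""
    directives = []
    for part in x_robots_tag.lower().split(","):
        part = part.strip()
        # Drop an optional "botname:" prefix such as "googlebot: noindex".
        if ":" in part:
            part = part.split(":", 1)[1].strip()
        directives.append(part)
    return "noindex" in directives or "none" in directives


print(has_noindex("noindex, nofollow"))        # sitewide noindex
print(has_noindex("max-image-preview:large"))  # harmless directive
```

If this returns True on a page you want cited, look for the plugin or server rule that sets the header.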
5. A managed host blocking at the platform level
Some managed WordPress hosts block AI crawlers in their own infrastructure, which you cannot see from the WordPress dashboard or your robots.txt at all. The symptom is a live AI user-agent request returning 403 while every layer you can see looks open. When robots.txt and Cloudflare are both clear but the bot is still blocked, the host is the remaining suspect, and you confirm it by checking the host control panel or asking their support.
How to audit your site
Run these in order. The first step is the one that tells the truth, because it tests the real, enforced status rather than what robots.txt claims.
1. From a terminal, request your homepage with an AI user agent and read the status code. A 200 means the bot is getting through. A 403 or a challenge means a hard block at the edge or server.
# Pretend to be an AI crawler and check the response
curl -A "GPTBot" -I https://yoursite.com
curl -A "OAI-SearchBot" -I https://yoursite.com
# 200 = allowed through, 403 = blocked at the edge or server
2. In wp-admin, open Settings then Reading and confirm Discourage search engines from indexing this site is unchecked on a live site. If it is checked, that alone is closing the whole site to crawlers.
3. Visit yoursite.com/robots.txt. Look for Disallow: / or Disallow rules naming AI bots, and confirm whether the answer bots you want are blocked. Note whether the file is the WordPress virtual one or an SEO-plugin override.
4. In the Cloudflare dashboard, open AI Crawl Control and the bot settings. If the default AI block or managed robots.txt is on, that is overriding your site robots.txt at the edge. Decide which bots to allow here.
5. If the live request was blocked but robots.txt and Cloudflare look clear, check your security plugin bot rules, any firewall or X-Robots-Tag header, and your managed host AI-bot settings. One of these is returning the 403.
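The live-request test can also be scripted so that every bot you care about is checked in one pass. This is a rough sketch using only the Python standard library. The function names are hypothetical, the agent list is taken from the bots named in this article, and sending the bare token as the user agent is the same simplification the curl commands make. Unlike curl -I, it sends a GET request and reports the status after any redirects.

```python
import urllib.error
import urllib.request

# Bare tokens, as in the curl test; real crawlers send longer strings.
AI_USER_AGENTS = [
    "GPTBot", "ClaudeBot",
    "OAI-SearchBot", "Claude-SearchBot", "PerplexityBot",
    "ChatGPT-User",
]


def fetch_status(url: str, user_agent: str, timeout: float = 10.0) -> int:
    """Request url with the given user agent and return the HTTP status."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # 4xx and 5xx arrive as exceptions; the code is what we want.
        return error.code


def describe(status: int) -> str:
    """Translate a status code into the verdicts used in this article."""
    if 200 <= status < 300:
        return "allowed through"
    if status == 403:
        return "blocked at the edge or server"
    return "inconclusive, inspect manually"


def audit(url: str) -> None:
    """Print one line per AI user agent for the given URL."""
    for agent in AI_USER_AGENTS:
        status = fetch_status(url, agent)
        print(f"{agent:18} {status}  {describe(status)}")
```

Calling `audit("https://yoursite.com")` prints one line per agent. A network failure such as a DNS error is not caught here and will raise, which is usually what you want during a manual audit.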
How to set the access you want
Decide your posture first, then set it on every layer. The most common goal is to be visible to the AI answer engines that send traffic, so that is the example below.
Step 1: turn off the accidental blocks
Uncheck Discourage search engines under Settings then Reading if your site is live. Remove any security-plugin or firewall rule that is blocking a bot you want, and clear an overly broad Disallow: / from robots.txt if it is not intentional.
Step 2: set robots.txt to the posture you want
For block training, allow search, which keeps you citable while opting out of model training, a robots.txt like this expresses it. To allow everything for maximum reach, set them all to Allow instead.
# Allow the AI answer engines that cite and link back
User-agent: OAI-SearchBot
Allow: /
User-agent: ChatGPT-User
Allow: /
User-agent: PerplexityBot
Allow: /
User-agent: Claude-SearchBot
Allow: /
# Opt out of AI model training (optional, your choice)
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: CCBot
Disallow: /
If a real robots.txt file or an SEO plugin is serving your robots rules, edit it there. Confirm the change at yoursite.com/robots.txt afterward.
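Before you rely on a robots.txt, it helps to check that it says what you think it says. The sketch below feeds the block-training, allow-search rules above into Python's standard `urllib.robotparser` and asks, per bot, whether the homepage may be fetched. The helper name `robots_verdicts` is hypothetical, and this only models how a compliant parser reads the file. It says nothing about the edge or server layer.

```python
from typing import Dict, List
from urllib.robotparser import RobotFileParser

# The block-training, allow-search posture from this section.
ROBOTS_TXT = """\
User-agent: OAI-SearchBot
Allow: /
User-agent: ChatGPT-User
Allow: /
User-agent: PerplexityBot
Allow: /
User-agent: Claude-SearchBot
Allow: /
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: CCBot
Disallow: /
"""


def robots_verdicts(robots_txt: str, agents: List[str]) -> Dict[str, bool]:
    """Map each agent to True if robots.txt lets it fetch the homepage."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, "/") for agent in agents}


verdicts = robots_verdicts(
    ROBOTS_TXT, ["OAI-SearchBot", "GPTBot", "Google-Extended", "Googlebot"]
)
for agent, allowed in verdicts.items():
    print(f"{agent:16} {'allowed' if allowed else 'disallowed'}")
```

Googlebot is included to show that the training opt-outs leave search crawling untouched: with no rule naming it and no wildcard group, it stays allowed.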
Step 3: fix the Cloudflare edge, the layer that overrides robots.txt
If you are behind Cloudflare, open AI Crawl Control and explicitly allow the answer bots you want, rather than leaving the default block to decide. Because the edge overrides robots.txt, this step is what actually lets the bots through. If managed robots.txt is on and conflicts with your intent, reconcile the two so they agree.
Step 4: give the bots something to cite
Access is necessary but not sufficient. Once the answer bots can reach you, they still have to understand and trust your content to cite it. Publish clean, well-structured pages, valid structured data, and an llms.txt file at your site root that summarizes what your site is and points to your best pages. This is the difference between being crawlable and being citable.
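For the llms.txt part, here is a small sketch that builds the file from a list of pages. It assumes the commonly described layout of the llms.txt proposal: a title heading, a one-line summary in a blockquote, then a section of markdown links. The function name, the section title "Key pages", and the example site and URLs are all placeholders for illustration.

```python
from typing import List, Tuple


def build_llms_txt(site_name: str, summary: str,
                   pages: List[Tuple[str, str, str]]) -> str:
    """Build llms.txt text from (title, url, note) tuples."""
    lines = [f"# {site_name}", "", f"> {summary}", "", "## Key pages", ""]
    for title, url, note in pages:
        lines.append(f"- [{title}]({url}): {note}")
    return "\n".join(lines) + "\n"


# Placeholder content; swap in your own site and cornerstone pages.
text = build_llms_txt(
    "Example Consulting",
    "Independent consultancy publishing guides for small businesses.",
    [
        ("Services", "https://example.com/services", "What we offer"),
        ("Guides", "https://example.com/guides", "Our cornerstone articles"),
    ],
)
print(text)
```

Save the result as llms.txt at the site root, the location the text above describes.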
If your goal is to block AI instead
To genuinely keep AI out, remember that robots.txt only asks. Disallow the bots in robots.txt for the compliant ones, then add an enforcing block at the edge or firewall, such as Cloudflare AI Crawl Control set to block, so that crawlers ignoring robots.txt are stopped with a 403. The polite layer plus the hard layer together is what actually prevents scraping.
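To make the difference between asking and enforcing concrete, the sketch below shows what a hard block does in principle: it inspects the user agent and answers 403 before the site code runs. This is a concept illustration written as Python WSGI middleware, not something you can drop into WordPress, which runs on PHP. In practice the same check lives in Cloudflare, a firewall, or a security plugin. The token list and all function names are hypothetical.

```python
# Tokens to refuse, lowercased for case-insensitive matching.
BLOCKED_TOKENS = ("gptbot", "claudebot", "ccbot")


def block_ai_crawlers(app):
    """Wrap a WSGI app so matching user agents get a 403."""
    def middleware(environ, start_response):
        user_agent = environ.get("HTTP_USER_AGENT", "").lower()
        if any(token in user_agent for token in BLOCKED_TOKENS):
            # Enforced: the wrapped site never sees this request.
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        return app(environ, start_response)
    return middleware


def site(environ, start_response):
    """Stand-in for the real site."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"Hello\n"]


protected = block_ai_crawlers(site)


def status_for(user_agent: str) -> str:
    """Call the protected app directly and return the status line."""
    seen = []
    protected({"HTTP_USER_AGENT": user_agent},
              lambda status, headers: seen.append(status))
    return seen[0]


print(status_for("GPTBot"))         # 403 Forbidden
print(status_for("OAI-SearchBot"))  # 200 OK
```

Like any user-agent rule, this only stops bots that identify themselves. A scraper that sends a plain browser user agent passes straight through.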
A real AI-visibility audit
A composite from the AI-crawler audits we run, with identifying details removed.
A consultancy ranked well in Google but noticed it was never the source ChatGPT or Perplexity cited for questions in its niche, while competitors were. The marketing lead had added rules to the company robots.txt to welcome AI bots and could not understand why they were not working. The site had moved onto Cloudflare a few months earlier.
A live request with an OAI-SearchBot user agent returned a 403, even though robots.txt clearly allowed it. That pointed past WordPress to the edge. In Cloudflare, the default AI crawler block from the mid-2025 setup was still on, stopping every AI bot before it reached the site and overriding the welcoming robots.txt. On top of that, the discourage-indexing box had been left checked since the original build, adding a sitewide noindex.
We unchecked the discourage-indexing setting, then configured Cloudflare AI Crawl Control to allow the answer bots (OAI-SearchBot, Claude-SearchBot, PerplexityBot, and ChatGPT-User) while leaving the training crawlers blocked, per the client's preference. We re-ran the user-agent test and this time got 200s. Finally we published an llms.txt summarizing the firm and linking its cornerstone pages. Over the following weeks the answer engines began fetching and citing the site.
DIY vs hand it off
Running the user-agent test and unchecking a setting is well within reach. Reconciling robots.txt, the Cloudflare edge, and a host block into one coherent posture is where it gets fiddly. If the first list below matches, you can likely do this yourself. If the second list matches, get help.
Realistic on your own
- ✓ You can run the curl user-agent test and read the status code
- ✓ The block was just the discourage-indexing checkbox
- ✓ You can edit robots.txt or your SEO plugin robots settings
- ✓ You have Cloudflare access to open AI Crawl Control
- ✓ You know which posture you want and just need to set it
- ✓ You are comfortable publishing an llms.txt file
Hand it off, save the time
- ✗ The user-agent test is blocked and you cannot find which layer
- ✗ robots.txt, Cloudflare, and a host setting disagree with each other
- ✗ You are not sure which bots to allow or block for your goals
- ✗ A managed host is blocking at the platform level
- ✗ You want the full setup: access, structured data, and llms.txt
- ✗ You want it done right once rather than guessing across layers
AI crawler FAQ
How do I know if my WordPress site is blocking AI crawlers?
Reading your robots.txt is not enough, because a block can sit at a layer robots.txt cannot show you. The reliable test is to make a request pretending to be an AI crawler and see what comes back. From a terminal, run a request with an AI user agent against your homepage, for example using GPTBot or OAI-SearchBot as the agent string, and look at the status. A 200 means the crawler is being allowed through to your content. A 403 or a challenge page means something is blocking it before it ever reaches WordPress, almost always a CDN like Cloudflare or a security plugin. Then separately check your robots.txt for Disallow rules naming AI bots, and check Settings then Reading for the discourage-indexing box. Together those three checks, the live request, the robots.txt, and the WordPress setting, tell you the whole picture.
What is the difference between AI training crawlers and AI search bots?
It is the single most important distinction for deciding what to allow. Training crawlers collect your content to train AI models, generally without sending you any traffic or attribution, and the well-known ones are GPTBot, ClaudeBot, and Google-Extended. Search or answer bots fetch your pages live to answer a user question right now, and they typically cite your page and link back, which can send you referral traffic, examples being OAI-SearchBot, Claude-SearchBot, and PerplexityBot. A third group are assistant agents that browse on a specific user request. For most marketing and business sites the valuable visitors are the answer bots, because being the source a chatbot cites is the new version of ranking. So the common posture is to allow the answer bots even if you choose to block the training crawlers.
Will blocking AI training crawlers hurt my Google rankings?
No. This is a common worry and the answer is clear: blocking AI training crawlers like GPTBot and Google-Extended does not affect your Google search rankings, because Google Search is powered by a separate crawler called Googlebot. Google-Extended controls whether your content is used to train Google's AI models, and it is entirely separate from the Googlebot that crawls for search results. You can disallow GPTBot, ClaudeBot, and Google-Extended and keep ranking in Google exactly as before. The only thing you change by blocking training crawlers is whether your content feeds those companies' AI training datasets. Your search visibility, and your visibility to the AI answer bots if you allow them, are unaffected.
I am behind Cloudflare. Could it be blocking AI bots without my knowledge?
Very possibly, and this is the cause people miss most often. On July 1, 2025, Cloudflare began blocking AI crawlers by default. Since then, the owner of every new domain added to Cloudflare is asked about AI crawlers, and the domain blocks all known ones unless told otherwise. Critically, a Cloudflare block happens at the network edge, before the crawler ever reaches your site, which means it overrides your robots.txt entirely. So you can have a robots.txt that warmly welcomes GPTBot and still be blocking it completely, because Cloudflare stops the request at the edge first. If your site sits behind Cloudflare and was set up in mid-2025 or later, assume the default AI block may be on and check Cloudflare AI Crawl Control. This is the classic case of a site that looks open in its robots.txt but is closed at the door.
Does the WordPress discourage search engines setting block AI bots too?
Yes, it blocks everything. The Discourage search engines from indexing this site checkbox under Settings then Reading is meant for sites still in development. When it is on, WordPress adds a noindex signal and serves a robots.txt that tells all bots to stay away from the entire site. It does not distinguish between Google, AI training crawlers, or AI answer bots: it discourages all of them. The trap is that this box is very often left checked by accident after a site launches, having been set during the build. So a live site can be quietly asking every crawler, search and AI alike, not to index it. If you are auditing AI visibility, this is one of the first things to check, because it is a single switch that closes the whole site to crawlers.
Does robots.txt actually stop a determined scraper?
No, and it is important to understand the limit. A robots.txt file is a request, not a wall. Well-behaved crawlers, including the major AI companies' bots, read it and respect it, so it is the right tool for telling compliant bots what you do and do not want crawled. But it does not physically block anything, so a crawler that chooses to ignore it can still read your pages. If your goal is to genuinely prevent AI scraping, including by bots that do not honor robots.txt, you need an enforcing block at the edge or server, such as Cloudflare AI Crawl Control or a firewall rule that returns a 403 to those user agents. There is also a category that robots.txt cannot touch at all: some AI assistant browsers use a standard Chrome user agent with no identifying token, so there is nothing for robots.txt to match. Knowing the difference between a polite request and an enforced block is what lets you set the posture you actually want.
What is the recommended setup for a site that wants AI visibility?
If your goal is to be found and cited by AI answer engines, which is where AI referral traffic comes from, make sure nothing is blocking the answer bots at any layer. Uncheck the discourage-indexing setting if it is on. In robots.txt, allow the answer bots (OAI-SearchBot, Claude-SearchBot, and PerplexityBot) and assistant agents like ChatGPT-User, and decide separately whether to allow or block the training crawlers GPTBot, ClaudeBot, and Google-Extended based on your view of AI training. If you are behind Cloudflare, configure AI Crawl Control to allow the answer bots rather than relying on the default block. Then give those bots something good to read: clean, well-structured content, valid structured data, and an llms.txt file that summarizes your site. Allowing the bots is necessary but not sufficient. They still have to be able to understand and cite you once they are in.
My managed WordPress host might be blocking AI bots. How would I tell?
Some managed WordPress hosts block AI crawlers at the platform level, and because that block lives in the host's infrastructure rather than your robots.txt, you cannot see it in your WordPress dashboard at all. The symptom is the same as a Cloudflare edge block: a live request from an AI user agent comes back as a 403 or is challenged, even though your robots.txt looks permissive. To confirm, run the AI user-agent request test against your site. If it is blocked but your robots.txt and Cloudflare both look clear, the host is the remaining suspect. Check your host's control panel for any AI bot or bot protection setting, and if you cannot find one, ask their support directly whether AI crawlers are blocked at the platform level and how to allow the ones you want.
Sources and further reading
Every claim on this page traces back to Cloudflare announcements and documentation, the OpenAI bots documentation, the WordPress reading settings documentation, or industry reporting.
- Cloudflare: blocking AI crawlers by default (press release, July 1, 2025)
- Cloudflare: control content use for AI training and managed robots.txt
- OpenAI: bots documentation (GPTBot, OAI-SearchBot, ChatGPT-User)
- WordPress: Settings then Reading screen (discourage search engines)
- Search Engine Land: managed WordPress might be blocking AI bots