Today’s question looks beyond the typical traffic-driving goals of AI visibility to the value those large language models provide a website owner, and asks:

“AI crawlers are visiting my website increasingly often, but I can’t tell whether they provide any value. Should I allow them, block them, or treat different AI crawlers differently? How can I measure whether their activity leads to citations, referral traffic, or conversions before making that decision?”

Many SEOs don’t realize the cost of having bots visit their site. With the recent proliferation of AI bots, allowing anyone and everything to access your content is becoming an expensive business.

Types Of AI Crawlers

First, let’s look at the different types of bots that visit a website.

Common bots that will be visiting a website regularly include those we want to have access to our site, for example, search engine bots. These aren’t the only bots, but they are often some of the most prolific consumers of bandwidth. Alongside search bots, there will be tools. These can include bots from uptime monitors, search and analytics tools, and security and vulnerability scanners.

Overall, website owners have to decide whether the bots visiting their site should be allowed to continue or if they pose more harm than good. Examples of bots that site managers often block are those that are trying to scrape product information to feed another website’s database, or malicious bots looking for login vulnerabilities. Whether or not to block these bots is a fairly easy decision – they pose a risk to the intellectual property of the brand or the safety of the website.

AI bots might actually fall somewhere in between these “good” and “bad” bots.

AI Training Bots

These bots, OpenAI’s GPTBot, for example, are scouring the web for information to feed into AI model training. They are helping to create the knowledge base that the LLMs learn from, including entities and how they relate to each other.

For many website owners, these are the most controversial AI crawlers. Their primary purpose is not to send traffic back to your site, but to “read” and collect information that may be used to train and improve models. In some cases, that content may later be used to answer user questions without generating a visit to the original source. This makes it harder to draw a direct line between the crawler’s activity and business value.

Search Indexing Bots

These bots, OpenAI’s OAI-SearchBot, for example, are reviewing pages and collecting information to surface and link websites in LLM “search results,” not to train foundation models.

These are often easier to justify allowing because their purpose is closer to that of a traditional search engine. If they are indexing your content so that it can be cited in AI-generated answers, they have a more obvious route to creating visibility, referral traffic, and brand awareness.

User-Triggered Fetches

These bots, including OpenAI’s ChatGPT-User, retrieve pages on demand when users ask about specific websites or documents, rather than relying solely on a pre-built index or knowledge base.

These fetches represent genuine user interest in your site. The users behind them are specifically looking for additional information or context on your content, business, or products. This is a valuable indicator of their place within the purchase funnel: they have already discovered your brand and are now diving deeper into your content.

How To Block AI Bots

OpenAI updated its documentation so that ChatGPT-User, the user-triggered fetcher, no longer commits to honoring a website’s robots.txt. Perplexity behaves in a similar manner, with Perplexity-User. So the robots.txt, which SEOs have been reliably using for years to control major bots, now only blocks the compliant training and search crawlers. For user-triggered and non-compliant bots, you need server or WAF-level blocking. 
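For the compliant crawlers, you can check the effect of a robots.txt rule set before deploying it. Below is a minimal sketch using Python's standard-library urllib.robotparser. The rules and the example.com URL are illustrative assumptions, not a recommended policy; they block the training crawler GPTBot while leaving OAI-SearchBot allowed.

```python
from urllib.robotparser import RobotFileParser

# Illustrative rules only (an assumption, not a recommended policy):
# block the training crawler, allow the search indexing crawler.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: *
Allow: /
"""


def allowed_by_robots(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if a robots.txt-compliant crawler may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("GPTBot", "OAI-SearchBot"):
        print(agent, allowed_by_robots(ROBOTS_TXT, agent, "https://example.com/pricing"))
```

A result here only describes how a compliant crawler should behave. It says nothing about user-triggered fetchers such as ChatGPT-User, which is why server or WAF rules are still needed for those.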

WAF-Level Blocking

A WAF (web application firewall) sits in front of a website’s server and acts as an inspection checkpoint. A WAF can be configured to only allow certain bots, or to allow all but excluded bots. This is a very robust way of preventing unwanted bots from visiting a website.

Although this typically sits outside the purview of an SEO, you may be familiar with some of the brands that offer WAF-level blocking, like Cloudflare and AWS. If you know which tech stack your website runs on, you may be able to research WAF blocking before presenting the idea to your infrastructure team. However, most large companies will already have a variety of bots they are blocking, so enterprise teams will likely have a process in place for adding or removing bots from WAF lists.

Server Rules

Rules can be added directly to your server that examine the traffic hitting it and determine whether it comes from an unsafe bot. The server will check items like whether the request comes from a source using automation or lacks the proper headers. If it deems the request unsafe based on those rules, it will not let the bot reach the site.
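To make the idea of a server rule concrete, here is a rough sketch of the kind of check involved. In practice this logic usually lives in your web server or WAF configuration rather than in Python, and both the deny list and the set of "required" headers below are hypothetical choices made for illustration.

```python
from typing import Mapping

# Hypothetical deny list: user-agent substrings this site has chosen to refuse.
BLOCKED_AGENT_TOKENS = ("chatgpt-user", "examplescraperbot")

# Hypothetical definition of "proper headers" for this sketch.
REQUIRED_HEADERS = ("user-agent", "accept")


def should_block(headers: Mapping[str, str]) -> bool:
    """Return True if the request should be refused under the rules above."""
    normalized = {name.lower(): value for name, value in headers.items()}

    # Rule 1: refuse requests that lack (or send empty) expected headers.
    if any(not normalized.get(header) for header in REQUIRED_HEADERS):
        return True

    # Rule 2: refuse requests whose user-agent contains a denied token.
    user_agent = normalized["user-agent"].lower()
    return any(token in user_agent for token in BLOCKED_AGENT_TOKENS)
```

An allow-list version would invert rule 2, letting through only the bots you have named and refusing every other automated user-agent.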

The Risk Of Blocking All AI Bots

This is where the dilemma lies. Some of the AI bots are scraping your website’s intellectual property. However, if you block them, that means they may not surface your brand or products in their answers, putting you at a competitive disadvantage.

The primary risk with blocking AI bots is that you may find your site no longer cited in LLM answers. Given the low volume of referral traffic LLMs are passing, that may seem like a risk you are willing to take.

However, what we do know is that, although LLMs aren’t passing the same volume of traffic as traditional search engines, they are helpful in raising brand awareness. If your brand isn’t the one being cited, that means a competitor’s is.

With everything AI-related, we have to remember that the field is evolving quickly. LLMs may not be passing much traffic right now, but that doesn’t mean that will always be the case.

Preventing AI bots from crawling a site now might make the site functionally invisible in the future if LLMs become the primary discovery method.

In addition, blocking all AI bots removes your ability to test and learn. If you stop every AI crawler from accessing your site, you lose the opportunity to understand which platforms generate visibility, which cite your content accurately, and which have the potential to become meaningful traffic sources in the future.

The Risk Of Allowing All AI Bots

There is, of course, a very real threat that sites are facing from AI crawlers today. The two greatest risks come from the ferocity with which the bots are crawling and consuming content.

Training On Intellectual Property

Many website owners are uncomfortable with the idea that proprietary content or assets could be used to improve an AI model without any direct compensation or attribution. This is one of the loudest complaints that we hear from SEOs – you are visiting my site, taking my content, but I am not getting traffic in return.

The concern is particularly high for publishers and businesses whose competitive advantage comes from unique information or assets. If that content becomes part of a model’s training data, there is less need for users to visit the original website.

There is also the risk that bots may be scraping data or content that actually forms part of a product or service. For an LLM to repackage that information and serve it as an answer or generation can be devastating to businesses. For example, artists are seeing photos of their work being ingested by LLMs and used to generate images “in the style of” their own creations. This use of IP could be directly impacting a business’s profits.

Crawl Costs

AI crawlers can consume significant server resources. Large sites frequently report AI bots requesting pages at a much higher frequency than traditional search engine crawlers.

This cost is not always obvious because it is often absorbed into general hosting fees. However, at scale, excessive crawling can increase bandwidth consumption and impact the experience of real users if resources become constrained.

For some organizations, the direct financial cost of serving AI crawlers is the primary factor behind decisions to restrict or block them.

How To Identify Which Bots Are Visiting Your Site

The biggest blocker to understanding the risk and reward to your brand from AI bots is knowing which bots are even crawling your site.

This data isn’t always easy to come by. Let’s go through a couple of ways to identify whether a bot has crawled, or is still crawling, your site.

Log Files

Log files will be the most complete source of information on which bots are visiting your website. Downloading a sample of logs from the past 30 days could give you a good idea of what percentage of your bot traffic comes from AI crawlers.

The log files will likely have all manner of bots in them, and it might take a bit of research to identify which of them are AI crawlers. Once you have translated the user-agent information into something more human-readable, it will be a simple case of adding up the hits of each bot and working out what percentage of the whole is from AI crawlers.
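If you want to try the adding-up exercise by hand, it can be sketched in a few lines of Python. This assumes your logs are in the common "combined" format, where the user-agent is the last quoted field on each line. The bot lists are a small illustrative subset and the sample log lines are made up.

```python
import re
from collections import Counter

# Illustrative subsets; extend these with the bots you find in your own logs.
AI_BOTS = ("GPTBot", "OAI-SearchBot", "ChatGPT-User", "Perplexity-User")
OTHER_BOTS = ("Googlebot", "bingbot")

# Assumes combined log format: the user-agent is the final quoted field.
USER_AGENT_RE = re.compile(r'"([^"]*)"\s*$')

# Made-up sample lines for demonstration.
SAMPLE_LOG = [
    '203.0.113.5 - - [10/Jun/2025:10:00:00 +0000] "GET /a HTTP/1.1" 200 512 "-" "GPTBot/1.2"',
    '203.0.113.5 - - [10/Jun/2025:10:00:01 +0000] "GET /b HTTP/1.1" 200 512 "-" "GPTBot/1.2"',
    '198.51.100.7 - - [10/Jun/2025:10:00:02 +0000] "GET /a HTTP/1.1" 200 512 "-" "Googlebot/2.1"',
    '192.0.2.9 - - [10/Jun/2025:10:00:03 +0000] "GET /a HTTP/1.1" 200 512 "-" "OAI-SearchBot/1.0"',
    '192.0.2.44 - - [10/Jun/2025:10:00:04 +0000] "GET /a HTTP/1.1" 200 512 "-" "Mozilla/5.0 (Windows NT 10.0)"',
]


def bot_name(log_line):
    """Return the known bot named in the line's user-agent, or None."""
    match = USER_AGENT_RE.search(log_line)
    if not match:
        return None
    user_agent = match.group(1).lower()
    for token in AI_BOTS + OTHER_BOTS:
        if token.lower() in user_agent:
            return token
    return None


def ai_share_of_bot_hits(log_lines):
    """Return (hits per bot, percentage of known-bot hits from AI crawlers)."""
    hits = Counter()
    for line in log_lines:
        name = bot_name(line)
        if name:
            hits[name] += 1
    total = sum(hits.values())
    ai_hits = sum(hits[name] for name in AI_BOTS)
    share = 100 * ai_hits / total if total else 0.0
    return hits, share


if __name__ == "__main__":
    hits, share = ai_share_of_bot_hits(SAMPLE_LOG)
    print(dict(hits), f"{share:.1f}% of bot hits are from AI crawlers")
```

The sketch only counts bots it has been told about, so anything not in the two lists is ignored. That keeps it simple, but it also means an unfamiliar crawler will go unnoticed until you add it.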

There are a lot of tools available that will automate this, however. There are a couple of types that might help with this exercise – traditional log file analyzers and AI visibility tracking tools.

The log file analyzers will provide a breakdown of which bots are from traditional search engines, and which are from AI. The AI optimization tools, which are primarily for tracking and analyzing your site’s visibility in LLMs, often also have an AI agent tracking feature based on your log files.

You should also try to understand whether specific bots are concentrating on particular sections of the site. A crawler repeatedly accessing product pages may indicate that those assets are particularly valuable to the platform. This can help inform whether you allow access to the whole site or create more specific restrictions.
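One way to spot that kind of concentration is to group a single bot's hits by the top-level section of the URL. The sketch below again assumes combined-format logs, and it matches the bot token anywhere in the line, which is a simplification that could occasionally catch a URL or referrer containing the same text.

```python
import re
from collections import Counter

# Pulls the requested path out of the quoted request field, e.g. "GET /x HTTP/1.1".
REQUEST_RE = re.compile(r'"(?:GET|HEAD|POST) (\S+) HTTP/[\d.]+"')


def section_of(path):
    """Map '/products/shoes?size=9' to '/products'; the homepage maps to '/'."""
    first_segment = path.split("?", 1)[0].strip("/").split("/", 1)[0]
    return "/" + first_segment


def hits_by_section(log_lines, bot_token):
    """Count one bot's requests per top-level site section."""
    counts = Counter()
    for line in log_lines:
        if bot_token.lower() not in line.lower():
            continue
        match = REQUEST_RE.search(line)
        if match:
            counts[section_of(match.group(1))] += 1
    return counts
```

Calling hits_by_section(lines, "GPTBot").most_common(5) on your own log lines would list the five sections that bot requests most often.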


Referral Traffic

If you don’t have access to your log files, you can still get an idea of which bots have visited your site from the referral traffic they send.

Looking in your analytics software at referral sources, you may recognize a portion as LLMs, like ChatGPT or Perplexity. Google Analytics has recently deployed a new channel classification called “AI Assistant.” This new channel makes it easier to see what visitors have found your site via an LLM, but it only recognizes ChatGPT, Gemini, and Claude via referrer header and does not capture Perplexity. It is safe to assume that if an LLM has cited your website and provided a link for visitors to follow, its bot may have visited your site at some point.

This isn’t a foolproof method of seeing all the AI bots that have visited your site, because it will only reveal platforms that have sent referral traffic within the timeframe you are viewing. Any LLM bot that has crawled your site but not sent referral traffic will remain unknown to you. It is also possible that the citation that sent traffic to your site came from training data or a cached version of your page. However, if you are truly unable to access log file data, this can give you a fair approximation of the bots that have visited your website.
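If your analytics export gives you raw referrer URLs, a small lookup can label the LLM-referred sessions, including platforms your analytics channel grouping may miss. The hostname-to-platform mapping below is an assumption for illustration; check it against the referrers that actually appear in your own data.

```python
from urllib.parse import urlparse

# Assumed referrer hostnames for each platform; verify against your own reports.
LLM_REFERRER_HOSTS = {
    "chatgpt.com": "ChatGPT",
    "perplexity.ai": "Perplexity",
    "gemini.google.com": "Gemini",
    "claude.ai": "Claude",
}


def llm_source(referrer_url):
    """Return the LLM platform for a referrer URL, or None if it isn't one."""
    host = (urlparse(referrer_url).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    return LLM_REFERRER_HOSTS.get(host)
```

For example, llm_source("https://www.perplexity.ai/search?q=running+shoes") returns "Perplexity", while a referrer from a traditional search engine returns None.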

What Additional Data You Need

Beyond simply knowing if a bot has visited your site, it is necessary to know the impact of their visit. This means you need to find out from the log files, or landing pages of their referred traffic, which pages the AI bots have crawled.

This information will give you a better idea of where the bots are scraping data from, and whether they are pages you do or do not want them visiting.

Potentially the most important data point for this analysis is the cost of the AI bots hitting your site. This is likely information you will need to get from whoever manages your website server. They should be able to tell you which bots are crawling the site so heavily that the infrastructure team is already considering blocking them. This person should also be able to calculate how much money it is costing your company to allow bots to crawl the site. This is very helpful information when it comes to the next bit of the analysis – determining the value of AI bots.

How To Measure Value

This next step is critical in the decision-making process. The question of whether to allow, block, or restrict an AI bot from your site hinges on the value those bots provide.

Most website owners are aware that LLMs do not send as much traffic to websites as traditional search engines do. What is less well known is how much they crawl in return. Cloudflare data from June 2025 suggests that for every one visit it refers to a website, Anthropic’s Claude will have made 70,900 page requests, whereas for Google, that ratio is 9.4:1. This “crawl-to-refer” ratio is shockingly high for some LLMs.
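You can work out the same ratio for your own site by dividing a platform's crawler requests (from your log files) by the visits it referred (from your analytics) over the same period. A minimal sketch, with the function name chosen purely for illustration:

```python
from typing import Optional


def crawl_to_refer_ratio(crawler_requests: int, referred_visits: int) -> Optional[float]:
    """Crawler page requests per referred visit over the same time period.

    Returns None when the platform referred no visits, because the ratio
    is undefined in that case.
    """
    if referred_visits == 0:
        return None
    return crawler_requests / referred_visits
```

A result of 9.4 means the platform requested 9.4 pages for every visit it sent you. A result of None means it crawled without sending any measurable traffic in that window.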

What Value Is The Traffic The LLMs Send?

The first step is understanding whether visitors arriving from LLMs are actually valuable. Looking purely at session numbers can be misleading. AI platforms currently send significantly less traffic than traditional search engines, but the visitors they do send may be highly qualified.

Essentially, the key measures to consider here are engagement metrics. Are users from LLMs engaging positively with your site in a way that indicates they may become converting users? Even if they don’t purchase something on their first visit, they may return via another channel at a later date. Using your knowledge of user journeys on the site, compare the behavior of LLM-referred visitors with converting visitors from other channels.

Ultimately, the most persuasive argument for allowing an AI crawler is revenue generation that outweighs the cost of its crawling. If visitors arriving from a specific LLM go on to purchase products or complete lead forms, that shows a positive business impact.
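The revenue-versus-cost comparison can be roughed out as below. The pricing model is a loud assumption: it treats crawl cost as bandwidth alone, billed at a flat price per gigabyte. Real hosting costs may also include compute and request fees, so use the figure your infrastructure team gives you where possible.

```python
def crawl_cost(bytes_served: int, price_per_gb: float) -> float:
    """Estimated cost of serving a crawler, assuming flat per-GB bandwidth pricing."""
    return bytes_served / 1_000_000_000 * price_per_gb


def net_value(referral_revenue: float, bytes_served: int, price_per_gb: float) -> float:
    """Revenue attributed to a platform's referrals minus its estimated crawl cost."""
    return referral_revenue - crawl_cost(bytes_served, price_per_gb)
```

A positive net value supports keeping the crawler. A negative one does not settle the question on its own, because citations and brand visibility are not captured in the revenue figure.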

Citations And Mentions

Traffic is only one form of value. A platform that consistently cites your content may be increasing awareness of your brand even when users do not click through. As SEOs, we know that traffic isn’t the be-all and end-all of marketing. Just because a searcher has not clicked through to your website does not mean they will not jump in their car to visit the brick-and-mortar store they just discovered through a Google Business Profile.

Consider LLMs in a similar way.

Track how often your site appears in AI-generated answers for topics relevant to your business. The more frequently your content is surfaced, the greater the likelihood that your brand is becoming associated with those topics in users’ minds.

Sentiment

Being mentioned is not enough; understanding how your brand is being represented is equally important.

Review AI-generated answers to determine whether your company is being described accurately and positively. If a platform frequently references your content but misrepresents your products or expertise, that should form part of the decision-making process. An LLM that continually gets it wrong is not just costing your business in server fees; it could be costing your brand’s goodwill.

Query/Topic Coverage

Assess which topics, products, or services your brand appears for within AI platforms.

If competitors dominate important commercial topics while your brand rarely appears, allowing relevant crawlers may become strategically important. Conversely, if you already have strong visibility for key subjects, you may be more comfortable restricting certain types of crawlers.

Consider Future Value

One of the hardest aspects of this analysis is that today’s value may not reflect tomorrow’s value.

A crawler that generates little traffic today may belong to a platform that becomes a major discovery channel in the future. Equally, a crawler that appears expensive today may eventually justify its cost through improved visibility and referral traffic.

For this reason, avoid evaluating AI crawlers solely on short-term performance. Consider their potential strategic value over the next several years.

Build A Decision Matrix

The final part of the analysis is a decision matrix. It’s a simple way of organizing the AI crawlers into bots to “keep,” “restrict,” or “block.”

Using the information you have already gathered, ask the following series of questions of each bot:

Does This Bot Provide My Site With Converting Revenue Or Useful Visibility?

Does this crawler contribute to traffic, leads, revenue, or brand awareness? If it does, that is a strong reason to keep it. If it doesn’t seem to provide any traffic or visibility within the LLMs, then this is likely a “no” or “maybe.”

Is It Accessing Sensitive Information, Or Information We Want To Keep Proprietary?

This is where you analyze if it is safe to let the bot roam freely, or if you have caught it scraping content that is part of your company’s IP. If that is the case, you will likely want to block it or restrict it.

How Trustworthy Is This Bot?

Is this a bot from a well-known AI company? Is there publicly available documentation on how its crawlers work, what commands they respect, and their data retention policies? If there is, this is a stronger sign that this is a bot that can be allowed to crawl your site. If there isn’t, then it is likely one to block.

Is This Bot Costing Us Significant Money Or Impacting User Access To Our Site?

This is a question about the cost of letting the bot crawl your site freely. If it is hitting the site at a high frequency, it may well be costing you a lot in server fees. It could also be pushing the server past its capacity, which may prevent other helpful bots, or your actual site users, from being able to access the site.

Can We Afford The Competitive Disadvantage From Not Allowing This Bot To Access Our Site?

This centers on the risk of your site not being accessible to the bots.

If blocking a crawler would likely remove your brand from a major AI platform’s answers, then the strategic cost may outweigh the infrastructure savings. If there is little evidence that the platform references your content or competitors, then the downside may be limited.

The Final Decision

Once you have gathered all of your data and weighed up the pros and cons of each bot, you are ready to make a decision. The key to this decision-making is remembering that this may change over time. You may not need to block a bot today, but you may want to restrict it for now, knowing you can block it entirely at a later date.

Keep – Doesn’t Cost Much/Brings In More Value Than It Costs

These are bots that provide measurable value. This may be through traffic, citations, brand visibility, or future strategic importance, but importantly, this value outweighs the operational burden.

Monitor Or Restrict – Doesn’t Have Much Value But Doesn’t Cost Much

These are bots where the business case remains unclear. You may choose to limit crawl rates, restrict access to specific areas of the site, or continue gathering data before making a final decision.

Block – Low Value/High Risk

These are bots that create significant costs, access sensitive content, or provide little evidence of current or future value.
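The questions and outcomes above can be captured in a small helper so that every bot is assessed the same way. This is one possible reading of the matrix, not a definitive rule set: the field names and the order in which the questions are applied are assumptions you should adapt to your own priorities.

```python
from dataclasses import dataclass


@dataclass
class BotAssessment:
    """Yes/no answers to the five decision-matrix questions for one bot."""
    name: str
    provides_value: bool              # revenue, traffic, or useful visibility
    accesses_sensitive_content: bool  # touches proprietary or sensitive pages
    trustworthy: bool                 # known operator with public documentation
    high_cost: bool                   # significant server cost or user impact
    blocking_hurts_competitiveness: bool


def decide(bot: BotAssessment) -> str:
    """Return 'keep', 'restrict' (monitor or restrict), or 'block'."""
    strategic = bot.provides_value or bot.blocking_hurts_competitiveness

    if not bot.trustworthy:
        return "block"
    if bot.accesses_sensitive_content:
        # Restrict access to the sensitive areas if the bot is worth keeping.
        return "restrict" if strategic else "block"
    if strategic:
        return "restrict" if bot.high_cost else "keep"
    # Little value: tolerate and monitor if cheap, otherwise block.
    return "block" if bot.high_cost else "restrict"
```

Recording the answers in a structure like this also gives you something to revisit later, since a change to any single answer can move a bot from one category to another.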


Going Forward

A key point to remember is that this is not a case of “set it and forget it.” New AI bots will be created. Bots that you have blocked may increase in potential value over the next few months and years.

As part of your assessment you need to build in regular reviews. These might be triggered by the person who is responsible for server costs asking you if you really need ChatGPT to be accessing the site. Ideally, though, it will be something that you are proactively considering and that you can present to your stakeholders as both a brand protection and future-proofing plan.

Consider reviewing your block list once a quarter. This is a cadence that doesn’t put too much pressure on the person pulling the log files, and also gives you time to make strategic changes if needed.

The key takeaway is that there is rarely a good reason to either allow every AI crawler or block them all. Instead, treat each bot as an individual business case. Measure its cost, assess the visibility it provides, understand the risk it creates, and then make a deliberate decision. That approach is far more likely to protect both your current resources and your future discoverability.

Featured Image: Paulo Bobita/Search Engine Journal
