Robots.txt: role and key directives for SEO


The robots.txt file is one of the simpler SEO levers, yet it can have a high impact because it directly influences how search engines allocate their resources across your site. Used well, it helps you prioritise high-value pages (offers, customer case studies, pillar content) and reduce noise (parameters, faceted navigation, sorting pages, duplicates). Used poorly, it can cut off organic visibility at the source.

In this guide, we provide a complete overview: definition, directives, best practices, testing and specific cases (WordPress, multisite, multilingual). We also cover current challenges related to AI bots and the difference from llms.txt, so that your crawl governance remains aligned with a modern, data-driven GEO/SEO strategy.

 

Robots.txt: The Complete Guide to Optimising Crawling and Improving Your SEO

 

1. Understanding the Fundamentals

 

Definition: What This File Does (and What It Doesn't Do)

 

The robots.txt file is a public text file, placed at the root of a website, that tells crawling bots (crawlers) which URLs they are allowed to access. Google highlights an essential point: its main purpose is to manage crawl traffic and avoid overloading your server, not to prevent a page from appearing in search results.

What it does:

  • control bot access to different sections of your site;
  • optimise crawl budget allocation;
  • declare the location of your sitemaps.

What it doesn't do:

  • prevent a page from being indexed (a blocked URL can still appear in results if it is referenced elsewhere);
  • protect sensitive data (the file is public and readable by anyone);
  • replace a complete indexing strategy (meta robots, canonicals, architecture).

In B2B, this file becomes a governance tool: it helps channel crawling towards what generates business (solution pages, expert content, useful downloadable resources) and reduces crawling of low SEO value areas (internal search, parameters, technical environments).

 

Where to Find It: Root Location, Subdomains and Access

 

The file must be accessible at a fixed URL, for example: https://www.yourdomain.co.uk/robots.txt. It must:

  • be at the root (not in a subfolder);
  • be named in lowercase (robots.txt);
  • remain reasonable in size (some engines limit how much they read; common practice is to stay well below 512 KB).

Multisite and subdomains: each host must have its own file at its root (e.g. https://blog.domain.co.uk/robots.txt and https://www.domain.co.uk/robots.txt). A common mistake is to assume a single file covers all subdomains.

A good operational habit: version this file (Git) and document the purpose of rules (comments), to ensure transparency and continuity between SEO, content and the technical team.

 

How Bots Read Rules: Matching Logic and Priority

 

The exclusion protocol works on a simple principle: a bot arrives on a domain, first checks the file at the standard location, then applies the rules that match its user-agent. Each engine has its own bots (Googlebot, Bingbot, etc.). You can target a specific bot or apply general rules with User-agent: *.

A key point to remember: this mechanism is advisory. Reputable bots generally respect it, but nothing stops a malicious bot from ignoring it.

For priority, keep this logic in mind:

  • the bot chooses the most specific User-agent group that matches it;
  • within the same group, when there is a conflict, the rule with the most specific (longest) path generally wins;
  • Allow often acts as an exception to a broader Disallow.

Pragmatic advice: the simpler your rules, the more you reduce the risk of surprises during a migration, a CMS change, or adding new parameters.
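
To make this precedence concrete, here is a minimal illustration (the paths are hypothetical):

User-agent: *
Disallow: /resources/
Allow: /resources/seo-guide/

For a URL such as /resources/seo-guide/checklist.html, both rules match, but the Allow rule has the longer, more specific path, so Google crawls the page while the rest of /resources/ stays blocked. Most engines follow a similar logic, but test whenever your rules become subtle.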

 

Visual: Diagram of the "Discovery → Crawl → Indexing" Flow

 

Discovery
├─ internal links (menu, internal linking, pagination)
├─ external links (backlinks)
└─ sitemap.xml
     ↓
Crawling
├─ access allowed? (rules in the crawl control file)
├─ resources accessible? (CSS/JS/images needed for rendering)
└─ server responds correctly? (200, 3xx, 4xx, 5xx)
     ↓
Indexing
├─ allowed? (meta robots / X-Robots-Tag / canonicals)
├─ content understood? (rendering, duplication, quality)
└─ sufficient signals? (internal linking, authority, freshness)

Useful takeaway: this file mainly influences crawling. Indexing depends on other complementary signals.

 

2. Essential Directives

 

User-agent: Targeting One Bot or All Crawlers

 

The User-agent directive specifies the bot concerned. Then you define what it can or cannot crawl using Allow and Disallow. Basic example:

User-agent: *
Disallow:

Here, an empty Disallow: means "everything is allowed". It is the equivalent of having no file (for crawling). To target a specific bot:

User-agent: Googlebot
Disallow: /internal-search/

A good habit: only segment by bot if you have a real need (AI, aggressive crawling, specific sections). Otherwise, a global rule makes maintenance easier.

User-agent: * followed by an empty Disallow: is the most permissive configuration: it allows all bots to crawl all sections of the site. Conversely, to block certain bots whilst allowing others, you can add specific user-agent groups without affecting the indexing bots you need.
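
As a sketch (with a deliberately fictional bot name), such a configuration could look like this:

User-agent: *
Disallow:

User-agent: ExampleScraperBot
Disallow: /

Every crawler keeps full access except the one explicitly targeted; a well-behaved bot applies the group whose User-agent matches it most specifically and ignores the others.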

 

Disallow and Allow: Blocking, Allowing and Managing Exceptions

 

Disallow prevents crawling of a path; Allow creates an exception within a blocked area. Commented example:

User-agent: *
Disallow: /private/
Allow: /private/allowed-page.html

Watch out for common confusion: the Noindex directive is not a universal robots.txt standard and Google recommends not relying on it. If your goal is to prevent indexing, use instead:

  • the <meta name='robots' content='noindex'> tag (page level);
  • or the HTTP header X-Robots-Tag: noindex (useful for PDFs and files).

To prevent indexing of a page whilst also asking bots not to follow its links, use a meta tag (or header), not the crawl file. Example:

<meta name='robots' content='noindex,nofollow'>

A common audit trap: blocking a URL in the crawl file while also relying on a noindex for that page. If the bot cannot crawl the page, it cannot see the noindex. Result: the URL may still appear in the SERPs if other pages reference it, but without a meaningful snippet.

 

Sitemap: Declaring One or More Sitemaps with Absolute URLs

 

Declaring a sitemap helps engines discover important URLs faster. Recommended example:

Sitemap: https://www.yourdomain.co.uk/sitemap.xml

In practice, use an absolute URL rather than a relative URL. Some setups tolerate relative paths, but a full URL reduces ambiguity, especially in multisite or multi-environment contexts.

Tip: if you have multiple sitemaps (by language, by content type), declare a sitemap index, then monitor the impact on coverage and crawled pages.

You can declare multiple sitemaps, for example:

Sitemap: https://www.yourdomain.co.uk/sitemap_index.xml
Sitemap: https://www.yourdomain.co.uk/sitemap-blog.xml
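
If you use an index, the file itself follows the sitemaps.org protocol; here is a minimal sketch (the child file names are illustrative):

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://www.yourdomain.co.uk/sitemap-blog.xml</loc>
  </sitemap>
  <sitemap>
    <loc>https://www.yourdomain.co.uk/sitemap-products.xml</loc>
  </sitemap>
</sitemapindex>

You can then declare only the index in robots.txt and let engines discover the child sitemaps through it.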

 

Commented Example: A Simple, Safe File for a Brochure Site

 

Goal: do not block business pages, only limit technical areas, and declare the sitemap.

User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Sitemap: https://www.yourdomain.co.uk/sitemap.xml

This "minimal" model suits many brochure sites: it remains readable, reduces the risk of accidental blocking, and clarifies discovery via the sitemap.

 

3. Optimising Crawling and Crawl Budget

 

Why This File Influences SEO Performance (Crawling, Rendering and Signals)

 

In SEO, you seek a balance: have strategic pages crawled and understood quickly, whilst avoiding bots spending resources on low-value URLs. This file acts like a traffic plan for crawlers. For a business, the outcome is tangible: the more your key commercial and editorial pages are crawled and refreshed, the more you protect visibility and lead generation.

Beyond crawl volume, rendering matters too: if required resources (CSS/JS) are not accessible, an engine may interpret a page differently (structure, visible content, interactive elements), which weakens relevance signals.

 

Prioritising Strategic Sections: Categories, Content and Business Pages

 

To prioritise, think "business value + SEO value". Sections that deserve smooth crawling (and therefore, generally, should not be blocked):

  • solution / product pages;
  • category pages (if they target an intent);
  • pillar content, topic hubs, studies and customer case studies;
  • useful conversion pages (book a meeting, request a demo), if they have an SEO role (or at least support the journey and internal linking).

To secure indexing, think in threes: crawling (access), indexing (noindex, canonicals), and discovery (internal linking + sitemap). If a page is strategic, check:

  • it is not blocked;
  • it does not have unintended noindex (plugin, template);
  • it receives relevant internal links;
  • it is included in a coherent sitemap.

 

Managing Low-Value Pages: Internal Search, Sorting, Pagination and Duplicates

 

Crawl budget is the number of URLs an engine is willing to crawl over a given period. On large sites (catalogues, resources, help centres, multilingual content), it becomes critical. High-value actions include:

  • blocking infinite spaces (internal search, unlimited faceted combinations);
  • limiting tracking URLs (UTM, marketing parameters);
  • highlighting priority pages (offers, categories, hubs) via sitemap;
  • reducing duplicates (canonicals + internal architecture).

With pagination, be careful: blocking all paginated pages can hinder discovery of deeper products or articles. Often it is better to:

  • allow pagination to be crawled when it supports discovery;
  • handle duplication via canonicals, parameters and architecture (e.g. clean category pages, genuinely useful facets).

 

URL Parameters and E-commerce Facets: Methodology and Practical Examples

 

URL parameters (query parameters) often generate duplicates: sorting, filters, pagination, tracking. You can limit crawling with targeted rules. Example:

User-agent: *
Disallow: /*?sort=
Disallow: /*?filter=
Disallow: /*?utm_

For e-commerce facets, the goal is not to "block everything", but to distinguish:

  • facets that match a search intent (e.g. "safety shoes size 8");
  • infinite or low-value combinations (e.g. sort + multiple filters + tracking) that create noise.

Recommended methodology:

  1. map parameters (complete list + volumes in logs);
  2. identify those that create duplication and URL explosion (sorting, tracking, combinatorial filters);
  3. define an indexing strategy for truly strategic facets (dedicated pages, templates, content, canonicals);
  4. limit crawling of non-strategic parameters via pattern-type rules.

Common examples (adapt to your URL structure):

User-agent: *
Disallow: /*?utm_
Disallow: /*?gclid=
Disallow: /*?fbclid=
Disallow: /*?sort=
Disallow: /*?order=

Important: before deploying, test real URLs (categories, filters, product pages) and ensure you are not blocking a facet you actually want to rank.

 

Do Not Block Useful Resources (CSS/JS): Impact on Rendering and Rankings

 

With resources, be careful: Google states that blocking CSS/JS can prevent its systems from rendering and understanding pages correctly. Only block scripts and stylesheets if you are certain the block will not affect rendering or analysis. For media files (images, video, audio), a block prevents crawling, but it does not stop a third party from linking directly to those files.

In practice, if you block a directory such as /wp-includes/ or /assets/ without analysis, you risk degrading interpretation of the DOM, dynamically loaded content, and therefore relevance signals.

A directive such as Disallow: /*.js$ blocks crawling of every URL that ends in .js. Before applying such a rule, make sure those scripts are not required to render important pages; otherwise, search engines may misinterpret your content.
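
If you genuinely must restrict scripts, a safer pattern is to carve out an explicit exception for the directories your templates depend on (the paths below are hypothetical):

User-agent: *
Allow: /assets/critical/*.js$
Disallow: /*.js$

For Google, the longer Allow pattern wins when both rules match, so the critical scripts stay crawlable; verify the result with the URL inspection tool, as wildcard handling varies between engines.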

 

Visual: "Crawl / Limit / Block" Table by Page Type

 

Page / URL Type | Recommendation | Why
Solution pages, products, strategic categories | Allow crawling | Business value + SEO intent, needs freshness and coverage
Pillar articles, hubs, guides | Allow crawling | Organic acquisition and topical authority
Internal search | Block | Infinite space, low value, frequent duplication
Tracking (utm_, gclid, fbclid) | Limit / block | Duplication with no SEO value, wasted crawl
Sorting (sort, order) and filter combinations | Limit | URL explosion, relevance varies depending on facets
Pagination | Allow crawling (often) | Supports discovery; better handled via architecture/canonicals
CSS/JS required for rendering | Allow crawling | Proper page understanding (rendering, visible content, UX)
Basket, account, transactional steps | Limit / block | Low SEO value, personalised content, duplication risks

 

4. Advanced Use Cases and Compatibility

 

Wildcards (*, $): When to Use Them and Compatibility Limits

 

Google supports common wildcards (*) and end markers ($) in many cases. Always test, because not all engines interpret syntax in the same way.

Typical examples:

User-agent: *
Disallow: /*?utm_
Disallow: /*.pdf$

Important limitation: an overly broad rule can block useful URLs (e.g. indexable PDFs that generate leads). Before applying a pattern, list example URLs to "keep" and "exclude", then test.

 

Crawl-delay: Use Cases, Limits and Recommended Alternatives

 

Crawl-delay can, depending on the engine, space out requests. Google does not take it into account in the same way as some other engines; it relies on its own adaptive mechanisms. Recommendations:

  • improve server performance;
  • reduce low-value crawl spaces;
  • monitor crawling via logs.

If a non-critical bot overloads the site, a combination of "rules + server blocking" (reverse proxy, WAF) is more robust.
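
As an illustration of server-side blocking, here is one possible Apache sketch (assuming mod_rewrite is enabled; "AggressiveBot" is a fictional user-agent to replace with the one you see in your logs):

# Return 403 Forbidden to a specific, abusive user-agent
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} "AggressiveBot" [NC]
RewriteRule .* - [F]

A reverse proxy or WAF rule achieves the same result higher up the stack and is usually easier to manage at scale.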

 

Multilingual and Multisite: Rules, Subdirectories and Subdomains

 

Multilingual: avoid blocking language directories by default. Prefer a clean strategy: hreflang, sitemaps per language, and duplication control (parameters, partial translations).

Multisite and subdomains: each host must have its own file at its root. If you have a main site and a help centre, for example, check the file at https://www.domain.co.uk/robots.txt and at https://help.domain.co.uk/robots.txt.

 

Files (PDFs, Images) and Directories: Effective Blocking Scenarios

 

You can block crawling by path (e.g. Disallow: /docs/ or Disallow: /white-paper.pdf). To prevent indexing, use X-Robots-Tag: noindex on the file, or access control.

Examples:

User-agent: *
Disallow: /docs/
Disallow: /private-images/
Disallow: /*.pdf$

Common scenario "useful PDF but should not be indexed": allow crawling, then apply X-Robots-Tag: noindex server-side on the file. This way, the engine can read the instruction.
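
By way of example, one way to send that header from an Apache server (a sketch assuming mod_headers is enabled; adapt to your own stack or CDN):

<FilesMatch "\.pdf$">
  Header set X-Robots-Tag "noindex"
</FilesMatch>

The same header can be set from nginx or a CDN; what matters is that it travels with the file's HTTP response.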

 

5. Common Errors and Best Practices

 

Checklist: A Reliable, Maintainable File Structure

 

An effective file is readable, stable and goal-driven. Best practices:

  • group rules by category (general, specific bots, sitemaps);
  • add dated comments to explain intent (see the skeleton after this list);
  • avoid over-complexity (too many rules = more risk);
  • test every change before releasing to production;
  • keep a change history (versioning) to roll back quickly.
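
For instance, a grouped and commented skeleton could look like this (dates and paths are purely illustrative):

# --- General rules (reviewed 2025-06: limit internal search and tracking parameters) ---
User-agent: *
Disallow: /internal-search/
Disallow: /*?utm_

# --- AI training bots (see internal access policy) ---
User-agent: GPTBot
Disallow: /

# --- Sitemaps ---
Sitemap: https://www.yourdomain.co.uk/sitemap_index.xml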

 

Accidental Blocks: Symptoms, Causes and Fixes

 

The most costly mistakes:

  • global block in production (e.g. Disallow: /) after a staging phase;
  • blocking directories containing business pages (e.g. /solutions/, /resources/);
  • forgetting to create a dedicated file for a strategic subdomain (blog, help centre, app).

A simple check to add to your release checklists: verify the public URL of the file on every critical host and validate a few sensitive URLs.

Typical symptoms:

  • a sudden drop in crawled and/or indexed pages in Search Console;
  • a large number of "blocked" URLs appearing in coverage reports;
  • delayed indexing of new content.

Fixes: correct the rule, republish the file, then request re-crawling of key pages (and monitor logs + crawl reports).

 

Security and Confidentiality: Why This Is Not Data Protection

 

This file must not be used to "hide" sensitive information: it is public and can draw attention to the areas you want to conceal. For protection, prioritise:

  • authentication and access control (SSO, password);
  • server restrictions (403/401, IP allowlist);
  • removal of content exposed by mistake.

Because the file is public, it can "reveal" your sensitive areas: admin directories, endpoints, backup folders, etc. If a section must remain private, protect it properly (authentication, server rules) and do not rely on a crawl ban alone.

 

Ready-to-Use Examples: Blog, E-commerce, WordPress and WooCommerce

 

Example "B2B site with a blog + marketing parameters":

User-agent: *
Disallow: /search/
Disallow: /*?utm_
Disallow: /*?gclid=
Sitemap: https://www.yourdomain.co.uk/sitemap_index.xml

Example "allow everything, except a technical folder":

User-agent: *
Disallow: /tmp/
Disallow: /cgi-bin/

Example "block only AI training bots (without affecting SEO bots)":

User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: Google-Extended
Disallow: /

This last case must be handled rigorously: target only the training user-agents, and monitor the impact. Confusing AI bots with indexing bots can be very costly in terms of visibility.

For WordPress and WooCommerce, the most common approach is to limit internal search and certain dynamic URLs, without blocking resources needed for rendering. On WooCommerce, be cautious with SEO-relevant pages (categories, products) and target parameters, baskets, accounts, etc., depending on your strategy.
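
As a starting point only (a sketch to adapt, since exact paths depend on your permalink structure, theme and plugins), a WordPress/WooCommerce file often ends up close to this:

User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Disallow: /?s=
Disallow: /search/
Disallow: /cart/
Disallow: /checkout/
Disallow: /my-account/
Disallow: /*?orderby=
Disallow: /*add-to-cart=
Sitemap: https://www.yourdomain.co.uk/sitemap_index.xml

Note that nothing here blocks /wp-content/ or /wp-includes/, so themes, scripts and styles remain available for rendering.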

 

6. Testing, Validation and Monitoring

 

Testing the File: Control Methods and What to Check

 

Before and after every change, test. Google provides detailed documentation and reports to verify whether its systems can process your file. In practice, recommendations:

  • accessibility check (200 OK, expected content);
  • tests on representative URLs (strategic pages, pages to block, parameters);
  • checking indexing signals (coverage, excluded pages, anomalies).

A quick, simple check: open /robots.txt in your browser, then test a few URLs in an inspection tool (Search Console) to confirm accessibility and interpretation.
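
If you want to script part of this check, Python's standard library includes a basic parser. The sketch below only understands plain path rules (it does not implement Google-style * and $ wildcards, so test those directly in Search Console), and it reuses this guide's placeholder domain and example paths:

from urllib.robotparser import RobotFileParser

# Load the live file from the site
rp = RobotFileParser()
rp.set_url("https://www.yourdomain.co.uk/robots.txt")
rp.read()

# Spot-check representative URLs: strategic pages should be allowed,
# low-value areas should be blocked.
for url in [
    "https://www.yourdomain.co.uk/solutions/",         # strategic page
    "https://www.yourdomain.co.uk/internal-search/q",  # internal search (expected: blocked)
]:
    verdict = "allowed" if rp.can_fetch("Googlebot", url) else "blocked"
    print(f"{verdict:>7}  {url}")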

Official reference documentation (Google): https://developers.google.com/search/docs/crawling-indexing/robots/intro?hl=en.

 

Analysing Server Logs: Spotting Bots and Over-Crawled Areas

 

Server logs remain the most reliable source to understand who crawls what, how often, and with which HTTP status codes. Complement this with a crawler (such as Screaming Frog) and SEO tools that detect inconsistencies.

Recommended data-driven approach: identify the most crawled directories, then compare them to business value (traffic, conversions, depth, SEO potential). That is often where the quickest wins are found.
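
To make this concrete, here is a rough, self-contained sketch (assuming an Apache/nginx "combined" format access log; the file name and the 15-line cut-off are arbitrary) that counts hits per top-level directory for requests identifying as Googlebot:

from collections import Counter
import re

LOG_PATH = "access.log"  # adjust to your log location
# '"GET /path HTTP/1.1" ... "user-agent"' at the end of a combined-format line
line_re = re.compile(r'"(?:GET|HEAD) (?P<path>\S+) HTTP/[^"]*".*"(?P<ua>[^"]*)"\s*$')

hits = Counter()
with open(LOG_PATH, encoding="utf-8", errors="replace") as log:
    for line in log:
        match = line_re.search(line)
        if not match or "Googlebot" not in match.group("ua"):
            continue  # keep only requests claiming to be Googlebot
        # Group by first path segment, stripping any query string
        path = match.group("path").split("?", 1)[0]
        section = "/" + path.lstrip("/").split("/", 1)[0]
        hits[section] += 1

for section, count in hits.most_common(15):
    print(f"{count:>8}  {section}")

Remember that a user-agent string can be spoofed; for a strict audit, confirm genuine Googlebot hits via reverse DNS as Google documents.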

 

Updating Safely: Process During a Redesign or Release

 

This file is not a one-off deliverable. It must evolve with:

  • content growth (new sections, new languages);
  • CMS changes (WordPress migration, redesign);
  • new crawl sources (AI, scrapers, partners).

Recommended review cadence: quarterly, and systematically with every major release. It is also a collaborative topic: SEO, content, dev and security must share the same view.

Recommended redesign process:

  • prepare separate "staging" and "production" versions;
  • add a go-live checklist control (check the URL and ensure there is no Disallow: /), which can be scripted as sketched after this list;
  • monitor Search Console and logs in the days that follow.
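
The go-live control mentioned above can be automated. Here is a minimal sketch (placeholder URL, to wire into your own CI/CD pipeline) that fails when the file is unreachable or still carries a global block:

import sys
import urllib.request

URL = "https://www.yourdomain.co.uk/robots.txt"  # placeholder domain

try:
    with urllib.request.urlopen(URL, timeout=10) as resp:
        body = resp.read().decode("utf-8", errors="replace")
except Exception as exc:
    sys.exit(f"FAIL: could not fetch {URL}: {exc}")

# Walk the groups and flag a blanket "Disallow: /" applied to User-agent: *
group_agents, in_rules = set(), False
for raw in body.splitlines():
    line = raw.split("#", 1)[0].strip()  # drop comments and surrounding spaces
    if not line:
        continue
    lower = line.lower()
    if lower.startswith("user-agent:"):
        if in_rules:  # a new group starts once rules have been seen
            group_agents, in_rules = set(), False
        group_agents.add(line.split(":", 1)[1].strip())
    elif lower.startswith(("allow:", "disallow:")):
        in_rules = True
        if (lower.startswith("disallow:")
                and line.split(":", 1)[1].strip() == "/"
                and "*" in group_agents):
            sys.exit("FAIL: 'Disallow: /' found for User-agent: * (global block)")

print("OK: robots.txt reachable, no global block for all bots")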

 

Critical Case: What to Do If the File Is Missing, Empty, 404 or 5xx

 

Missing or 404: bots generally crawl as if it didn't exist. Empty (or empty Disallow:): everything is allowed. In the event of a server error (5xx), some bots may reduce or temporarily stop crawling as a precaution.

Priority actions:

  • 404: put the file back online if you had necessary rules (sitemaps, parameter limitations);
  • empty: check whether it is intentional (simple brochure site) or a regression (rules lost);
  • 5xx: treat as a technical incident (server stability), as the impact on crawling can be immediate.

 

Visual: Example Crawl Report Snapshot and Interpretation

 

Crawl report (example)

  • Crawled: 12,430 URLs
  • Blocked by a rule: 3,210 URLs
  • Server errors (5xx): 58 URLs
  • Client errors (404): 112 URLs

Quick interpretation

  • "Blocked": confirm these are truly low-value areas
  • "5xx": technical priority (risk of reduced crawling)
  • "404": check redirects and internal links

Goal: connect crawl anomalies to concrete actions (rules, server performance, internal linking).

 

7. Managing AI Bots, LLMs and New Crawlers

 

Which AI Bots Crawl the Web and Why It Changes the Stakes

 

AI bots differ from "classic" crawlers: they may collect data to train models or feed products. They identify themselves via specific user-agents (e.g. GPTBot, Google-Extended, CCBot). Because the protocol is advisory, some bots may ignore rules or spoof a user-agent.

In B2B, the issue goes beyond SEO: it affects data governance, intellectual property, and transparency around published content.

 

Management Best Practices: What Is Possible and Its Limits

 

Operational recommendations:

  • never block the indexing bots you need for visibility (Googlebot, Bingbot);
  • target only identified AI user-agents and monitor logs;
  • strengthen protection at server level (WAF, reverse proxy) for abusive behaviour;
  • document your access policy, in the spirit of transparency.

Limits to understand: it is declarative, therefore bypassable. For a robust policy, combine rules, server filtering, and access control where necessary.

 

When to Complement with Other Signals: Access, Authentication and Headers

 

If content truly must be restricted, the right approach is to apply enforceable controls:

  • authentication (SSO, password, tokens);
  • appropriate server responses (401/403);
  • WAF rules / rate limiting to prevent large-scale extraction;
  • indexing control headers for files (e.g. X-Robots-Tag on PDFs).

This combination reduces risk without compromising organic performance on public pages.

 

8. Comparison with Other Control Mechanisms

 

Meta Robots and X-Robots-Tag: Differences, Use Cases and Classic Scenarios

 

Robots.txt mainly controls crawling. Meta robots controls indexing at HTML level. X-Robots-Tag controls indexing at HTTP level, including for non-HTML files (PDFs, images).

A useful classic scenario: "allow crawling but prevent indexing". Allow crawling (do not block it), then add noindex via meta robots or X-Robots-Tag. If you block crawling, the bot may never see the noindex instruction.

 

Crawl Control File vs llms.txt: Goals, Complementarity and Scenarios

 

Robots.txt historically targets search engine crawlers: it governs access to paths and helps manage load and crawling. llms.txt is more about communicating usage preferences or guidance to systems connected to large language models (LLMs): recommended sources, scope, conditions, entry points.

In short: one manages crawling behaviour (with limits), the other aims to guide how AI systems consume content. They are complementary, but neither is a security control on its own.

When should you prioritise one or the other?

  • You want to optimise SEO crawling, reduce duplication, guide bots: prioritise robots.txt.
  • You want to provide clear entry points for AI (canonical pages, documentation, conditions): add llms.txt as a complement.
  • You want to protect data: implement server security (authentication, ACLs, WAF), then only use declarative files.

Example of combined use:

  • robots.txt: limit crawling of parameters, declare sitemaps, block targeted AI bots.
  • llms.txt: list your "source of truth" pages (documentation, glossary, pillar pages) to reduce ambiguous interpretations.
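
There is no formal standard yet, but the community proposal behind llms.txt (llmstxt.org) suggests a simple Markdown structure; a minimal, hypothetical sketch:

# YourDomain
> B2B platform for data-driven GEO/SEO; the pages below are the preferred sources.

## Documentation
- [SEO glossary](https://www.yourdomain.co.uk/glossary/): definitions used across the site
- [Robots.txt guide](https://www.yourdomain.co.uk/blog/robots-txt/): crawl governance reference

## Optional
- [Press resources](https://www.yourdomain.co.uk/press/)

Treat it as a declarative hint, exactly like robots.txt: it guides well-behaved systems but enforces nothing.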

In a GEO/SEO strategy, this combination improves access to information quality whilst preserving performance and governance.

 

Visual: Comparison Table of Solutions (Crawling, Indexing and Control)

 

Mechanism | Affects Crawling | Affects Indexing | Enforceable | Main Use Case
Crawl control file (robots.txt) | Yes | No (indirectly) | No | Manage crawl budget, avoid noise, declare sitemaps
Meta robots | No | Yes | Yes (for engines that respect it) | Control indexing of HTML pages
X-Robots-Tag | No | Yes | Yes | Control indexing of PDFs, images, files
Access control (401/403, authentication) | Yes (blocks access) | Yes (prevents collection) | Yes (technical) | Actually protect sensitive content
llms.txt | No (depending on systems) | No | No | Guide how AI systems use your content

 

FAQ: Quick Answers to Common Questions

 

Does This File Block Indexing or Only Crawling?

 

It blocks crawling, not indexing. Google states that a blocked URL can still appear in results if it is referenced elsewhere, often without a description.

 

How Do You Prevent a URL from Being Indexed Whilst Still Allowing Crawling?

 

Allow crawling (do not block it), then add noindex via meta robots or X-Robots-Tag. If you block crawling, the bot may never see the noindex instruction.

 

How Do You Block a PDF, an Image or a Specific Directory?

 

You can block crawling by path (e.g. Disallow: /docs/ or Disallow: /white-paper.pdf). To prevent indexing, use X-Robots-Tag: noindex on the file, or access control.

 

Where Should You Place the File and How Do You Manage Multiple Subdomains?

 

At the root of each host. Each subdomain must have its own file available at https://subdomain.example.co.uk/robots.txt.

 

How Do You Manage a Staging Environment Without Impacting Production?

 

On staging, block crawling (e.g. Disallow: /) and, above all, protect access (password, IP). In production, make sure no staging rule has been deployed by mistake (a common cause of traffic loss).

 

Can You Use Wildcards (*, $) and What Are the Limits?

 

Yes, Google supports common wildcards (*) and end markers ($) in many cases. Always test, because not all engines interpret syntax in the same way.

 

How Do You Declare Multiple Sitemaps in the Configuration File?

 

Add multiple Sitemap: lines, or better, declare a sitemap index that references your sitemaps (languages, content types).

 

How Do You Avoid Accidentally Blocking Useful Resources (CSS/JS)?

 

Do not block resource directories by default. If you must, check rendering via an inspection tool (and watch the impact on analysis and indexing). Google says blocking resources can prevent it from properly understanding a page.

 

How Can You Check That Bots Respect the Rules (Googlebot, Bingbot, etc.)?

 

Analyse server logs (user-agent, paths, frequency, HTTP status codes). Cross-check with crawl reports and your crawl audits to detect discrepancies.

 

What Are the Impacts on Crawl Budget and How Do You Prioritise Effectively?

 

The fewer useless URLs you expose, the more bots focus on your important pages. Prioritise by: blocking low-value areas, keeping a clean sitemap, using business-driven internal linking, and removing duplicates.

 

How Do You Manage AI Bots and What Limits Will You Face?

 

You can add targeted user-agent rules (e.g. GPTBot) and block access. Limits: it is advisory and bypassable. For stronger protection, combine with server filtering (reverse proxy, WAF) that can return 403 responses.

 

Is This File Mandatory for All Websites?

 

No, but it becomes highly recommended as soon as your site grows, you have URL parameters, multiple environments, or a need for crawl governance.

 

SEO Impacts of Poor Configuration: What Should You Monitor First?

 

Typical impacts include: an indexing drop, strategic pages disappearing, crawling wasted on useless URLs, delayed indexing of new content, and direct losses in organic traffic.

 

Optimise Your Crawling Strategy with Incremys

 

 

Configuration Audit and Crawl Budget: Actionable Recommendations and GEO/SEO Monitoring in the Platform

 

Incremys helps marketing and SEO teams turn crawl governance into measurable results: identifying over-crawled areas, prioritising high-value pages, delivering actionable recommendations on rules, sitemaps and architecture, and then tracking impact (coverage, rankings, ROI). To go further, you can explore the platform on incremys.com and build a data-driven GEO/SEO content strategy aligned with your growth objectives.
