
SEO Crawl Budget: Stakes and Method

Last updated on 19/2/2026


Building on our technical SEO audit, this article focuses on a topic that is often underestimated at scale: managing your SEO crawl budget. On large catalogues, the challenge is not simply to be crawled, but to be crawled usefully — getting bots onto the right URLs, at the right pace, without trapping them in low-value areas.

 

SEO Crawl Budget: Optimising Crawling for Large E-Commerce Websites

 

 

Why This Article Complements a Technical SEO Audit Without Repeating It

 

The main article sets the foundations: mapping the site like a bot, qualifying HTTP status codes, understanding depth, spotting duplication, redirects, dead ends and blocked resources. Here, we go further on one very specific point — how those symptoms translate into crawl prioritisation decisions at Google, and how to turn that understanding into crawl budget optimisation choices that make sense for e-commerce and high URL volumes.

One key nuance: Google stresses that crawling and indexing are two separate steps; not every crawled page ends up indexed (Google's official documentation on crawl budget). This changes how you interpret reports and helps you avoid false emergencies.

 

Which Sites Are Most Exposed: Large E-Commerce Sites, Marketplaces, Publishers and Directories

 

Google positions crawl budget optimisation as an advanced topic that is mainly relevant for:

  • very large sites (on the order of one million or more unique pages) whose content changes moderately often (around once a week);
  • sites with roughly 10,000 or more unique pages whose content changes very quickly (around daily);
  • sites with a large proportion of URLs marked as "discovered/crawled — currently not indexed" in Search Console (Google documentation on crawl budget).

In practice, this typically includes e-commerce sites (facets, sorting, pagination, variants), marketplaces (highly dynamic inventory), media publishers (freshness) and directories (volume plus duplication). On these sites, structural drift can create tens of thousands of parasitic URLs that absorb crawl activity at the expense of high-stakes categories and products.

 

What Crawl Budget Means: What Google Can Crawl and What It Chooses to Process

 

 

Crawling, Rendering and Indexing: Clarifying What Matters

 

According to Google, crawl budget is the set of URLs Google can and wants to crawl. After crawling, Google evaluates and consolidates signals before potential indexing; crawling does not guarantee eligibility to appear in results (Google documentation). Another structural point: crawl budget is defined per hostname; https://www.example.com/ and https://code.example.com/ do not share the same budget (Google documentation). For multi-subdomain architectures, this becomes a genuine SEO design consideration.

At this stage, it helps to distinguish:

  • Discovery: Google finds the URL (links, sitemap, historical knowledge).
  • Crawling: Googlebot fetches the resource (HTML and sometimes required assets).
  • Rendering: processing JavaScript and constructing the rendered page, where necessary.
  • Indexing: analysis, consolidation and potential inclusion in the index.

On JavaScript-heavy sites, the rendering cost can multiply the crawl cost: even if a URL is reached, it may be expensive to process, which mechanically reduces the coverage achievable within a given crawl budget.

 

Crawl Capacity and Crawl Demand: the Two Drivers Behind Your Crawl Budget

 

Google explains that crawl budget results from two components: crawl rate limit (capacity) and crawl demand (Google documentation). If demand is low, Google crawls less even if the server could handle more.

  • Capacity: depends in part on the number of parallel connections Google can open and the delay between fetches, in order to avoid overloading your servers (Google documentation). This is where the concept of crawl delay belongs semantically.
  • Demand: reflects Google's interest in revisiting your URLs (site size, freshness, perceived quality, relevance, popularity). Google also notes that different crawlers and products have different demands (Google documentation), which can explain differences across page types such as products and images.

 

Signals That Increase or Reduce How Often Your URLs Are Visited

 

Google highlights three main drivers of demand: perceived inventory, popularity and staleness (Google documentation). On large sites, perceived inventory is often the most actionable lever: too many duplicated, removed or undesirable URLs waste time and can lead Google's systems to decide it is not worth crawling the rest of the site.

Conversely, certain site-wide events — such as a migration — can temporarily increase crawl demand so Google can reprocess the site (Google documentation). That does not replace good hygiene: if the migration creates redirect chains and duplicates, you raise processing costs at the worst possible moment.

 

Identifying a Crawl Issue: Reliable Signals and False Positives

 

 

In Google Search Console: Crawl Stats, Errors, Redirects and Resources

 

The most useful reading comes from Google Search Console: indexing reports (valid, excluded, errors), the "discovered — currently not indexed" and "crawled — currently not indexed" signals, plus indicators related to server errors. A high volume of URLs in these states often signals an unfavourable trade-off around perceived quality, duplication or architecture, rather than a simple lack of crawling.

Common false positives include confusing "not indexed" with a purely technical problem, or assuming every discovered URL is worth saving. On a large site, the right question is: should this URL exist within the indexation strategy?

 

Discovered but Not Crawled: Depth, Internal Linking and Prioritisation

 

When Google discovers but does not crawl a page (or takes a long time to do so), two causes dominate on large catalogues: (1) the page is too deep or poorly linked; (2) it belongs to a set of URLs with low perceived value, such as facet combinations. Excessive depth acts as a priority penalty: the further a URL sits from the homepage and hubs, the more expensive it becomes to reach, and the weaker its internal signals tend to be.

A pragmatic approach is to ensure key commercial pages are reachable in roughly three clicks via a rational architecture and internal linking structure. This improves discovery and crawl prioritisation at scale.
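
To make depth measurable rather than anecdotal, you can compute the minimum click depth of every URL from an internal-link edge list exported by your crawler. The Python sketch below is a minimal illustration; the edge-list format, the homepage URL and the three-click threshold are assumptions to adapt to your own crawl exports.

from collections import deque

def click_depth(edges, start="https://www.example.com/"):
    """Minimum click depth of every URL reachable from the homepage.
    `edges` is an iterable of (source_url, target_url) pairs, e.g. the
    internal-link export of a crawler (format assumed for illustration)."""
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, set()).add(dst)
    depth = {start: 0}
    queue = deque([start])
    while queue:
        url = queue.popleft()
        for nxt in graph.get(url, ()):
            if nxt not in depth:  # first visit = shortest path in a BFS
                depth[nxt] = depth[url] + 1
                queue.append(nxt)
    return depth

# Flag strategic pages sitting deeper than roughly three clicks:
# depths = click_depth(crawl_edges)
# too_deep = [u for u, d in depths.items() if d > 3 and u in strategic_urls]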

 

Frequently Updated Content That Is Not Revisited: Spotting a Freshness Gap

 

If pages whose prices, stock or content change frequently do not appear to be revisited at an appropriate cadence, investigate crawl demand. Google recrawls to capture changes (staleness), but only if it perceives the visit is worth the cost. The signals that lower that perception are typically structural: massive duplication, slow templates, server instability or noise from infinite URL spaces.

Where visibility is concentrated in a small number of positions, the business impact is immediate. For context, the number one organic desktop position captures 34% of clicks (source: SEO.com, 2026, via our SEO statistics). Freshness that is poorly reflected in search results can harm CTR, conversion and trust even before rankings visibly decline.

 

Measuring and Segmenting Crawling on a Large Site: a Decision-Led Method

 

 

Segment by Page Type: Categories, Products, Facets, Editorial and Technical Pages

 

With thousands or millions of URLs, you do not manage crawl budget at the individual page level, but at the template and URL-family level. As a minimum, segment by:

  • categories and subcategories (strategic listing pages);
  • product detail pages (long tail plus seasonality);
  • facets, filters, sorting and parameters (often the main source of drift);
  • editorial content (guides, advice articles, hubs);
  • technical pages (account, basket, internal search, and so on).

Then, for each segment, compare: (1) URL volume generated, (2) expected indexability, (3) crawl signals, (4) performance (TTFB, stability) and (5) business value.
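
As a minimal illustration of template-level segmentation, the Python sketch below classifies URLs with a handful of regular expressions. The patterns (/p/, /c/, the parameter names) are assumptions about routing conventions, not a standard; substitute your own rules.

import re
from collections import Counter

# Hypothetical URL patterns: adapt them to your own routing conventions.
SEGMENTS = [
    ("facet_or_parameter", re.compile(r"[?&](sort|filter|colour|size|page)=")),
    ("product",            re.compile(r"/p/|/product/")),
    ("category",           re.compile(r"/c/|/category/")),
    ("editorial",          re.compile(r"/guide/|/blog/")),
    ("technical",          re.compile(r"/cart|/account|/search")),
]

def segment(url):
    for name, pattern in SEGMENTS:
        if pattern.search(url):
            return name
    return "other"

# counts = Counter(segment(u) for u in all_known_urls)
# print(counts.most_common())  # URL volume generated per segment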

 

Prioritise Pages With Business Value: Traffic, Conversion, Margin and Seasonality

 

In e-commerce, "important for SEO" must remain aligned with "important for the business". Prioritisation is more robust when it combines organic performance (impressions and clicks), conversion data from GA4, margin, availability and seasonality. This prevents overinvestment in low-contribution URLs.

It also helps to keep the wider visibility mix in mind: SEO accounts for 54% of web traffic versus 28% for paid search (source: Odiens, 2025, via our SEA statistics). On a large site, improving crawl efficiency often means protecting a major share of structural organic traffic.

 

Isolate URLs That Consume Crawl Budget Without Measurable SEO Benefit

 

The most actionable diagnosis is to identify URL families that:

  • are crawled frequently but remain not indexed or are repeatedly excluded;
  • generate no organic traffic and match no meaningful search intent;
  • introduce duplicate content through near-identical variants, parameters or mismanaged pagination;
  • reduce crawl capacity through slowness, errors or redirect chains.

The goal is to reduce undesirable perceived inventory — the factor Google identifies as the most positively influenceable lever (Google documentation).
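
One way to surface these families is to cross-reference Googlebot hits from your server logs with organic clicks per segment. The sketch below assumes you already have a list of crawled URLs, a clicks-per-URL export and a segmentation function such as the one sketched earlier; all inputs are illustrative.

from collections import defaultdict

def crawl_waste_report(googlebot_urls, organic_clicks, segment):
    """Cross-reference Googlebot hits (server logs) with organic clicks
    (e.g. a Search Console export) at the URL-family level.
    googlebot_urls : iterable of URLs fetched by Googlebot over a period
    organic_clicks : dict {url: clicks} over the same period
    segment        : callable mapping a URL to a segment name"""
    hits, clicks = defaultdict(int), defaultdict(int)
    for url in googlebot_urls:
        hits[segment(url)] += 1
    for url, c in organic_clicks.items():
        clicks[segment(url)] += c
    # Families crawled heavily but earning few clicks are cleanup candidates.
    return sorted(
        ((seg, hits[seg], clicks[seg]) for seg in hits),
        key=lambda row: row[1] - row[2],
        reverse=True,
    )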

 

Where Crawl Budget Gets Wasted: the Biggest Sources of Leakage

 

 

URL Parameters, Facets and Sorting Pages: Controlling Combinatorial Explosion

 

Faceted navigation can generate an effectively infinite number of URLs through filter combinations, sorting options and pagination. This is a classic crawler trap: Google discovers ever more URLs, but much of the content is redundant. The result is wasted crawl budget, repeated comparison of similar pages, and an unfavourable trade-off for truly strategic pages.

The right approach is not to block everything, but to decide which combinations deserve indexation — those with strong search demand, structuring product ranges or clear seasonality — and neutralise the rest with coherent rules covering directives, canonicals, URL patterns and internal linking.
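
A rule layer of this kind can be expressed very simply. The sketch below is illustrative only: the thresholds on search demand and product count are hypothetical, and the three outcomes stand for whatever directive mix your platform actually implements.

def facet_decision(active_facets, monthly_searches, product_count):
    """Return 'index', 'noindex-follow' or 'block' for one facet combination.
    Thresholds are illustrative; tune them against your own demand data."""
    if len(active_facets) == 1 and monthly_searches >= 50 and product_count >= 5:
        return "index"           # real demand and enough depth: worth indexing
    if len(active_facets) <= 2 and product_count >= 5:
        return "noindex-follow"  # useful to users and linking, not to the index
    return "block"               # combinatorial noise: keep bots out entirely

# facet_decision(["colour"], monthly_searches=320, product_count=48)  -> "index"
# facet_decision(["colour", "size", "brand"], 0, 12)                  -> "block"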

 

Low-Value Pages: Internal Search, Empty Pages, Unnecessary Variants and Technical URLs

 

Google recommends blocking certain areas via robots.txt when they are not important for the search engine, even if they serve a user purpose — for example, sorting variants of a page, or infinite scrolling that duplicates information already accessible via links (Google documentation). On e-commerce sites, internal search results, basket pages, login pages or empty pages can become crawl sinks if they are accessible to bots.

A key method note: avoid using noindex as a crawl-saving fix for areas you do not want crawled. Google points out that if it has to crawl a URL to see the noindex directive, crawl resources are still consumed; robots.txt blocking is specifically designed to prevent fetching (Google documentation).
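
Before deploying robots.txt changes at this scale, it is worth testing the rules against a sample of URLs. A minimal sketch using Python's standard-library parser follows; the blocked paths are assumptions about a typical e-commerce site, and note that this parser does simple prefix matching rather than Google's full wildcard semantics.

from urllib.robotparser import RobotFileParser

# Illustrative rules only; the blocked paths are assumptions about your site.
# The standard-library parser does plain prefix matching and does not implement
# Google-style wildcards, so keep wildcard rules out of this particular test.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search
Disallow: /cart
Disallow: /account
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

for url in (
    "https://www.example.com/search?q=trainers",
    "https://www.example.com/cart",
    "https://www.example.com/c/trainers",
):
    verdict = "crawlable" if rp.can_fetch("Googlebot", url) else "blocked"
    print(url, "->", verdict)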

 

Duplicate Pages, Canonicals and Contradictory Signals: Why Google Crawls More Than It Indexes

 

Google recommends consolidating or eliminating duplication so that crawling focuses on unique content rather than merely unique URLs (Google documentation). On large sites, duplicates commonly arise from URL variations (http/https, www/non-www, trailing slash), parameters, facets, near-identical categories, or templates that produce overly similar text.

Also watch out for overzealous canonicalisation: if internal links primarily point to URLs that canonicalise elsewhere, Google can repeatedly crawl and reprocess effectively lost pages, diluting internal signals and increasing consolidation cost. The target is full alignment — coherent canonical versions, internal linking towards final destination URLs, and a clean sitemap.
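
A simple way to spot this misalignment is to compare, for every internally linked URL, the canonical it declares in a crawl export. The sketch below is a minimal illustration; the input structures are assumptions about what your crawler exports.

def misaligned_links(internal_link_targets, canonical_of):
    """List internal links pointing at URLs whose canonical is a different URL.
    internal_link_targets : iterable of URLs receiving internal links
    canonical_of          : dict {url: declared_canonical} from a crawl export"""
    issues = []
    for url in internal_link_targets:
        canonical = canonical_of.get(url, url)
        if canonical != url:
            # Crawl activity and internal signals go to a URL Google will fold
            # into another one; relink straight to the canonical target.
            issues.append((url, canonical))
    return issues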

 

Redirect Chains and Loops: Hidden Cost and Effects on Discovery

 

Google recommends avoiding long redirect chains because they negatively affect crawling (Google documentation). On large catalogues, they commonly appear after migrations, URL rule changes, discontinued products being redirected in cascades, or late standardisation of slash conventions, parameters and tracking strings.

Each hop consumes a request and processing time; multiplied across thousands of URLs, this reduces the time available to cover genuinely useful pages returning a 200 status.
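
Auditing a sample of legacy URLs makes this hidden cost visible. The sketch below uses the Python requests library to count the hops behind each URL; the sampling logic and the one-redirect threshold are assumptions to adapt.

import requests

def redirect_chain(url, timeout=10):
    """Follow a URL and return every hop as (status_code, url) pairs,
    ending with the final response; each hop is a request Googlebot pays for."""
    response = requests.get(url, allow_redirects=True, timeout=timeout)
    hops = [(r.status_code, r.url) for r in response.history]
    hops.append((response.status_code, response.url))
    return hops

# for old_url in sample_of_legacy_urls:   # hypothetical URL sample
#     hops = redirect_chain(old_url)
#     if len(hops) > 2:                   # more than one redirect before the 200
#         print(old_url, hops)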

 

4xx/5xx Errors, Server Instability and Latency: When Google Slows Its Crawl

 

Crawl capacity varies with crawl health: if a site responds quickly over a sustained period, Google increases the limit; if the site slows down or returns server errors, Google reduces crawl intensity (Google documentation). This is the direct link between availability, performance and crawl cadence.

On speed, an operational illustration helps: if TTFB is 100ms, Googlebot can theoretically fetch 10 pages per second; at 500ms, closer to 2 (example cited in a performance analysis focused on crawling). Without claiming real-world parity, this illustrates the mechanism clearly: the slower the server and rendering pipeline, the smaller the useful window for covering a large inventory within a given crawl budget.
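
The arithmetic behind that illustration is simple enough to rerun with your own TTFB figures. Googlebot's real scheduling is far more complex, so treat this strictly as an order-of-magnitude sketch.

def theoretical_fetch_rate(ttfb_ms, parallel_connections=1):
    """Upper bound on fetches per second if response time were the only limit."""
    return parallel_connections * 1000 / ttfb_ms

# theoretical_fetch_rate(100) -> 10.0 fetches/second per connection
# theoretical_fetch_rate(500) ->  2.0 fetches/second per connection
# Over 24 hours that is roughly 864,000 versus 172,800 potential fetches.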

Finally, for permanently removed pages, Google recommends returning 404 or 410 rather than blocking them: a 404 is a strong signal not to recrawl, whereas a blocked URL may remain in the queue longer (Google documentation). Soft 404s should also be eliminated, as they continue consuming crawl resources (Google documentation).

 

Crawl Delay in SEO: What It Really Means and When It Matters

 

 

The Crawl-Delay Directive in robots.txt: Compatibility, Limitations and Impacts

 

Crawl delay historically refers to a Crawl-delay directive interpreted by some bots via robots.txt. In practice, treat it as a non-universal mechanism: not all crawlers respect it, and Google primarily relies on its own regulation — observed server capacity, errors and latency — to adjust crawl rate.

For Google, the useful concept behind crawl delay is the delay between fetches built into the crawl rate limit: Googlebot adjusts this delay to avoid overloading the server (Google documentation).

 

Distinguishing Deliberate Limitation From Server-Capacity Throttling

 

Two scenarios can look similar — less crawling overall — but require very different fixes:

  • Deliberate limitation: you have blocked areas via robots.txt or reduced the accessibility of certain URL families, which can be entirely desirable when those URLs serve no SEO purpose.
  • Throttling: Google slows down because it detects slowness or server errors (5xx, instability), thereby reducing crawl capacity (Google documentation).

One indicator mentioned by Google: if the URL Inspection tool returns a message such as Hostload exceeded, this may indicate an infrastructure-side limit; Google suggests adding server resources as a lever to increase crawl capacity (Google documentation).

 

Safer Alternatives to Protect Infrastructure Without Sacrificing SEO

 

Rather than applying blanket throttling, robust alternatives include:

  • reducing parasitic URLs (fewer pointless requests);
  • stabilising responses (fewer errors, better latency);
  • improving rendering efficiency (lower JavaScript overhead);
  • keeping sitemaps up to date for better guidance, in particular by populating <lastmod> as recommended by Google (Google documentation); see the sketch below.

In short: protect infrastructure by making crawling more efficient, not by reducing overall discovery capacity.
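
On the sitemap point above, the format is simple enough to generate directly from catalogue data. The sketch below builds a minimal sitemap carrying <lastmod>; the input pairs are assumptions, and the dates should reflect genuine content changes rather than automated timestamps.

from datetime import date
from xml.sax.saxutils import escape

def sitemap_xml(urls_with_lastmod):
    """Build a minimal XML sitemap with <lastmod> per URL.
    urls_with_lastmod : iterable of (url, datetime.date) pairs; lastmod should
    reflect genuine content changes, not an automated daily timestamp."""
    entries = "\n".join(
        f"  <url><loc>{escape(url)}</loc><lastmod>{d.isoformat()}</lastmod></url>"
        for url, d in urls_with_lastmod
    )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
        f"{entries}\n"
        "</urlset>"
    )

# print(sitemap_xml([("https://www.example.com/c/trainers", date(2026, 2, 19))]))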

 

Optimising Your SEO Crawl Budget: a Technical Action Plan for Large E-Commerce Sites

 

 

Focus Crawling on Strategic Pages

 

 

Strengthen Internal Linking to High-Impact Categories and Products

 

At scale, internal linking acts as a prioritisation system. High-impact pages — core categories, high-margin products, bestsellers and pillar content — should receive more links through navigation hubs, contextual links, top-categories blocks and links from already-strong pages. Conversely, limiting internal links to non-indexable URLs helps avoid crawl dead ends.

This extends the approach described in SEO crawling: observe the paths bots can actually follow, then adjust link distribution to guide exploration towards what matters.

 

Reduce Click Depth for Key Pages

 

Reducing depth does not mean placing everything in the header. Effective patterns at scale include hub pages by product universe, lateral links across categories, accessible pagination and crawlable HTML link blocks. A business page that sits too deep becomes a natural candidate for crawl delays, especially when the site simultaneously generates large volumes of faceted URLs.

 

Reduce Pointless Crawl Surface

 

 

Manage Facets: Indexation Rules, Canonicals and URL Patterns

 

Make an explicit decision about which facets are SEO-first (indexable, included in the sitemap, reinforced via internal linking) and which are UX-only. For UX-only areas:

  • avoid injecting them heavily into crawlable internal linking;
  • consider robots.txt blocking where crawling brings no value, in line with Google's recommendations on blocking unimportant URLs (Google documentation);
  • standardise parameters (order, casing, separators) to limit duplicate URL patterns.

A key nuance from Google: blocking URLs can reduce their processing by other systems, and any freed crawl budget is not automatically reallocated unless you were already at the serving limit (Google documentation). That is why you should address the root cause — URL explosion — rather than only the symptom.
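
For the parameter-standardisation bullet above, a normalisation routine applied wherever links are generated keeps one pattern per logical URL. The sketch below is illustrative: the allowed-parameter list and the example URL are assumptions about your own conventions.

from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def standardise_url(url, allowed_params=("page",)):
    """Normalise a parameterised URL: lowercase the host, drop parameters that
    should never be crawled and sort the rest, so one pattern = one URL.
    The allowed-parameter list is an illustrative assumption."""
    parts = urlsplit(url)
    params = [
        (key.lower(), value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() in allowed_params
    ]
    query = urlencode(sorted(params))
    return urlunsplit((parts.scheme, parts.netloc.lower(), parts.path, query, ""))

# standardise_url("https://WWW.Example.com/c/shoes?utm_source=x&Page=2")
#   -> "https://www.example.com/c/shoes?page=2"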

 

Clean Up Duplication: Near-Identical Pages, Variants and Copied Content

 

Duplication is not only an indexation issue: it increases comparison and consolidation cost. Google's recommendation is clear — consolidate or eliminate duplicated content so that crawling focuses on unique content (Google documentation). For e-commerce, this typically means:

  • reducing near-identical categories through real differentiation or consolidation;
  • limiting variant pages with no search demand;
  • rewriting overly similar templated content on strategic segments.

 

Clean Up Redirects

 

 

Remove Redirect Chains and Standardise URL Versions (http/https, www, Trailing Slash)

 

The goal is a single redirect at most between an old URL and its final destination — ideally zero for internal links. Standardising URL versions (http/https, www, trailing slash) also reduces duplicates and avoids crawling pointless technical variants. Google indicates that long chains negatively affect crawling (Google documentation), making this a high-yield housekeeping task for large inventories.
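
If your redirects live in a mapping file or table, chains can be collapsed offline before bots ever see them. The sketch below flattens a simple source-to-target map; the data structure is an assumption about how your platform stores redirects.

def flatten_redirects(redirect_map, max_hops=10):
    """Collapse chains so every source points straight at its final target.
    redirect_map : dict {source_url: target_url}, e.g. loaded from the
    rule file your platform serves (structure assumed for illustration)."""
    flattened = {}
    for source in redirect_map:
        target, hops = redirect_map[source], 0
        while target in redirect_map and hops < max_hops:
            target = redirect_map[target]
            hops += 1
        if target in redirect_map:
            raise ValueError(f"Probable redirect loop involving {source}")
        flattened[source] = target
    return flattened

# flatten_redirects({"/old-a": "/old-b", "/old-b": "/new"})
#   -> {"/old-a": "/new", "/old-b": "/new"}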

 

Avoid Long-Lived Temporary Redirects on SEO Landing Pages

 

On SEO landing pages such as categories, hubs and hero products, temporary redirects that persist (302/307) maintain uncertainty, encourage unnecessary recrawls and can create unstable URL paths over time. The rule is straightforward: if the destination is permanent, switch to a permanent redirect and update internal links to point directly to the final URL.

 

Stabilise Server Responses

 

 

Reduce Errors and Latency to Increase Crawl Capacity

 

Crawl capacity increases when the site responds quickly and reliably; it decreases when the site slows down or returns server errors (Google documentation). This is also an organisational argument: a performance backlog covering TTFB, caching and stability is equally a crawl budget backlog.

Google also notes that if pages load and render faster, Google may be able to read more content (Google documentation). On JavaScript architectures, reducing rendering costs — unused bundles, excessive hydration — prevents rendering from becoming the bottleneck that caps your effective crawl budget.

 

Optimise Delivery of Resources Needed for Rendering

 

If your pages depend on rendering, avoid blocking critical CSS, JavaScript or images in robots.txt: doing so can degrade rendering quality and comprehension, even if the URL itself is crawled. The principle remains: block areas you do not want crawled, not the resources required to analyse important pages correctly.

 

Ongoing Management: Tracking Impact and Preventing Regressions

 

 

A Minimal Dashboard: Crawl, Indexation and Performance Metrics to Correlate

 

To steer effectively, a minimal dashboard should correlate:

  • crawl signals (trends, server errors, redirects) via Search Console;
  • indexation states (valid, excluded, "discovered/crawled — not indexed");
  • technical performance (TTFB, stability, slow templates);
  • business value (impressions, clicks, conversions) to measure impact on business-critical areas.

To contextualise the visibility stakes, our GEO statistics underline that evolving SERP features such as generative overviews can reduce organic traffic (sources cited in the study). In that context, crawl efficiency and the prioritisation of genuinely useful pages within your crawl budget become even more strategically important.
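
As a minimal illustration of that correlation, the sketch below joins a log-derived crawl export with a Search Console export per segment using pandas. File names, column names and the segmentation rules are assumptions about your own pipelines, not a fixed schema.

import pandas as pd

# Hypothetical exports: file names and column names are assumptions about
# your own pipelines, not a fixed schema.
crawl = pd.read_csv("googlebot_log_hits.csv")    # columns: url, googlebot_hits
gsc = pd.read_csv("search_console_export.csv")   # columns: url, clicks, impressions

def segment(url):
    # Minimal template classifier; replace with your real routing rules.
    if "?" in url:
        return "parameterised"
    if "/p/" in url:
        return "product"
    if "/c/" in url:
        return "category"
    return "other"

merged = crawl.merge(gsc, on="url", how="left").fillna(0)
merged["segment"] = merged["url"].map(segment)

dashboard = (
    merged.groupby("segment")[["googlebot_hits", "clicks", "impressions"]]
    .sum()
    .sort_values("googlebot_hits", ascending=False)
)
print(dashboard)  # segments absorbing crawl activity without returning clicks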

 

Post-Release Checks: Redirects, Duplication and Newly Generated Pages

 

Regressions rarely stem from bad intent; they tend to come from a release that:

  • generates new parameterised URLs;
  • multiplies near-identical pages through template duplication;
  • introduces redirect chains;
  • changes page depth through navigation or pagination changes;
  • degrades latency or stability through 5xx spikes.

After each major release, prioritise these checks on the most crawled and most business-critical templates. A single mis-scoped facet rule can recreate a large perceived-inventory problem within just a few days.

 

When to Rerun the Analysis: Redesigns, Seasonality and Major Catalogue Additions

 

Rerun a deep analysis when you face:

  • a redesign affecting URLs, templates, JavaScript or performance;
  • a migration or rule change covering redirects, canonicals or the sitemap;
  • a seasonal peak involving catalogue or event pages;
  • a major addition of products or categories.

Google notes that a migration can increase site-wide crawl demand (Google documentation). It is worth capitalising on this phase with a clean architecture, an up-to-date sitemap and well-controlled redirects.

 

Operationalising Crawl Budget Analysis With Incremys (Alongside Google Search Console and Google Analytics)

 

 

Centralise Signals via API to Prioritise Fixes and Measure ROI

 

On large sites, the hard part is not finding anomalies — it is deciding what to fix first and how to measure the effect. Incremys helps centralise and cross-reference SEO signals by integrating via API with Google Search Console and Google Analytics 4, enabling you to prioritise initiatives by impact, effort and risk, and to track performance and ROI across URL segments covering categories, products and content.

 

FAQ on Crawl Budget and Crawl Delay in SEO

 

 

What Is Crawl Budget in SEO?

 

It is the set of URLs that a search engine like Google can and wants to crawl on a site. Google clarifies that crawling does not guarantee indexing: after crawling, pages are assessed and consolidated before potential inclusion in the index (Google documentation).

 

What Is Crawl Budget for, and When Does It Become a Constraint?

 

It becomes a constraint when high-value URLs — categories, products and pillar content — are not crawled quickly enough or frequently enough because the bot spends time on parasitic URLs such as facets, parameters, duplicates, redirects and errors, or because capacity is reduced by slowness or 5xx errors. Google provides orders of magnitude where the topic becomes especially relevant: around one million or more unique pages updated weekly, or 10,000 or more pages updated daily (Google documentation).

 

How Can You Tell if Crawl Budget Is a Problem on Your Site?

 

Common signals include:

  • high volumes of "discovered/crawled — currently not indexed" URLs in Search Console;
  • delays in new strategic pages appearing in search results;
  • price, stock or content updates not being reflected quickly;
  • crawling concentrated on parameters, sorting options and technical pages;
  • rising 5xx errors, instability and latency indicating falling capacity.

 

How Do You Optimise Crawl Budget Without Losing Important Pages?

 

Work through segmentation and rules: first define which URL families should be indexable, then align internal linking, sitemaps and directives accordingly. Avoid broad, irreversible actions such as wide-scale blocking without a precise inventory. For permanently removed pages, Google recommends 404 or 410 responses rather than robots.txt blocking (Google documentation).

 

What Is Crawl Delay in SEO?

 

It refers to slowing the frequency at which a bot requests pages. With Google, crawl rate mainly depends on crawl capacity, determined in part by parallel connections and the delay between fetches, adjusted to avoid overloading the server (Google documentation).

 

Does the Crawl-Delay Directive in robots.txt Actually Improve SEO?

 

Not as a primary lever. The Crawl-delay directive is not universal, and Google mostly adjusts crawl rate based on server health — latency, errors and stability. The most reliable lever for improving your SEO crawl budget remains reducing pointless URLs and improving performance, thereby increasing both capacity and crawl efficiency.

 

Can Duplicate Pages Reduce How Often Google Crawls High-Value Pages?

 

Yes. Google explains that if many known URLs are duplicated or undesirable, crawl time is wasted through inflated perceived inventory, and recommends consolidating or eliminating duplicated content so that crawling focuses on unique content (Google documentation).

 

Why Do Redirect Chains Consume So Much Crawl Budget?

 

Because a chain requires multiple requests and processing steps to reach a single destination. Google recommends avoiding long chains because they negatively affect crawling (Google documentation). On a large catalogue, the cumulative impact can become significant very quickly.

 

My Site Has Fewer Than 10,000 URLs: Should I Worry?

 

Usually there is less to worry about, but the risk is not zero. If your site is slow, unstable or generates many pointless URLs through parameters, you can still observe inefficient crawling. The biggest gains from optimising crawl budget generally appear on very large or highly dynamic sites, in line with the orders of magnitude Google provides in its advanced guide.

 

Should You Block Filters and Facets, or Leave Them Crawlable on a Large E-Commerce Site?

 

Neither block everything nor open everything. The right approach is to identify which combinations have genuine search value and clear intent — those worth indexing — then prevent combinatorial explosion for the remainder. Google recommends robots.txt blocking for certain unimportant URLs such as sorting variants, and notes that noindex does not prevent crawling (Google documentation).

 

How Do You Prioritise Fixes on a Catalogue With Hundreds of Thousands of Pages?

 

Combine segmentation by template, business value (traffic, conversion, margin and seasonality) and crawl and indexation signals. Start with the red zones: frequently crawled but slow templates, URL families generating significant duplication, and recurring redirects or errors. To keep exploring SEO, GEO and digital marketing topics in depth, visit the Incremys blog.
