When you scrape too fast, you lose access. When you scrape too slow, you lose value.

By: Victoire Habamungu

Nobody tells you about this tension when you start building a scraping system. You think about concurrency, about storage, about parsing. You think about making it fast. What you don't think about, until it costs you, is that speed and access are in direct conflict, and both sides of that conflict have a price tag.

Losing access is not a technical incident. It's a business loss.

When a website blocks your scraper, the conversation in most engineering teams sounds like this: "We got blocked, we'll rotate IPs and retry." That framing misses what actually happened.

The data from that window is gone. You cannot go back and scrape what you missed while you were blocked. If your client's pipeline depends on that source for pricing intelligence, competitive monitoring, or market data, you just created a permanent gap. The business made decisions during that window without the data they were paying for. That's not a retry problem. It's a revenue problem.

At 1,000+ concurrent jobs across hundreds of targets, getting blocked on even a fraction of your sources compounds fast. Each block is a gap. Each gap is a cost.

Losing value is quieter but just as expensive.

The other side of the tension is less visible but equally real. A scraper that moves too conservatively collects data outside the window where it's actionable.

In time-sensitive pipelines such as pricing, inventory, or anything else where the market moves faster than your data, slow data is not just late data. It's wrong data. Your client is looking at a snapshot of a reality that no longer exists, and the decisions they make are based on a false picture.

The business is often unaware that this is happening. The pipeline looks healthy. Data is arriving. Everything appears to be working. The cost is invisible until someone asks why the decisions made last Tuesday didn't match what the market was doing.
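One way to make that invisible cost visible is to measure the age of the data, not just whether it arrived. The following is a minimal sketch in Python; the function name `stale_fraction` and the idea of a single "actionable window" per pipeline are assumptions made for illustration, not something taken from the system described here.

```python
from datetime import datetime, timedelta, timezone


def stale_fraction(collected_at, now, actionable_window):
    """Return the share of records older than the actionable window.

    collected_at: list of timezone-aware datetimes, one per record.
    actionable_window: timedelta after which a record is too old to act on
    (a hypothetical, per-pipeline setting).
    """
    if not collected_at:
        return 0.0
    stale = sum(1 for ts in collected_at if now - ts > actionable_window)
    return stale / len(collected_at)


# Example: four price records, with a 30 minute actionable window.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
records = [
    now - timedelta(minutes=5),
    now - timedelta(minutes=20),
    now - timedelta(minutes=40),
    now - timedelta(hours=2),
]
share = stale_fraction(records, now, timedelta(minutes=30))
```

A pipeline can report every record as delivered and still have half of its data outside the window, which is the situation a check like this would surface.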

Why most scrapers get this wrong

Static rate limiting. You pick a number, whether requests per second or a delay between calls, and apply it uniformly across every target. It's simple to implement, and it feels safe.

The problem is that different websites respond completely differently. A target that could handle aggressive scraping without triggering blocks is throttled unnecessarily, costing you throughput and value. A target that needs careful handling is hit at the same rate as everything else, costing you access.

One setting applied uniformly to unpredictable targets is not a rate limiter. It's a guess held constant.
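To make the failure mode concrete, here is a minimal sketch in Python. The `Target` class and its `safe_rps` field are hypothetical: in practice nobody knows a site's tolerated rate in advance, which is exactly why a fixed number is a guess.

```python
from dataclasses import dataclass


@dataclass
class Target:
    name: str
    safe_rps: float  # hypothetical: highest request rate the site tolerates


def static_outcome(targets, static_rps):
    """Describe what one uniform rate does to each target."""
    report = {}
    for t in targets:
        if static_rps > t.safe_rps:
            # The uniform rate exceeds what this site tolerates.
            report[t.name] = "at risk of block"
        else:
            # The uniform rate leaves tolerated capacity on the table.
            unused = 1 - static_rps / t.safe_rps
            report[t.name] = f"throttled, {unused:.0%} of safe capacity unused"
    return report


targets = [Target("tolerant-site", 10.0), Target("fragile-site", 1.0)]
report = static_outcome(targets, static_rps=2.0)
```

With one rate of 2 requests per second, the tolerant site is underused and the fragile site is overrun. No single value fixes both at once.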

The decision that resolved the tension

The scraper needed to watch how each target responded and adjust in real time. Not a static setting but a dynamic speed per target, calibrated continuously based on live feedback from that specific source.

When a target responds cleanly, the scraper moves faster. When a target shows signs of resistance, such as slower responses, unusual patterns, or early warning signals before a block, the scraper pulls back automatically. Each source gets the fastest safe speed it can handle, determined by the source itself, not by a number someone configured weeks ago.

The system stops guessing. It listens and adjusts.
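One way such a feedback loop could look is sketched below in Python. This is not the production implementation described in this post; it is an illustrative controller that borrows the additive-increase, multiplicative-decrease idea from congestion control. The class name `TargetPacer`, the status codes treated as resistance (403, 429, 503), and every threshold are assumptions chosen for the example.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class TargetPacer:
    """Per-target pacing state. All defaults are illustrative assumptions."""
    delay: float = 1.0           # current seconds to wait between requests
    min_delay: float = 0.1       # fastest pace allowed
    max_delay: float = 60.0      # slowest pace allowed
    speedup_step: float = 0.05   # seconds removed from delay after a clean response
    backoff_factor: float = 2.0  # delay multiplier after a sign of resistance
    slow_ratio: float = 2.0      # latency above this multiple of baseline counts as resistance
    baseline_latency: Optional[float] = None

    def observe(self, status: int, latency: float) -> float:
        """Update the delay from one response and return the new delay."""
        if self.baseline_latency is None:
            self.baseline_latency = latency
        resisted = (
            status in (403, 429, 503)
            or latency > self.slow_ratio * self.baseline_latency
        )
        if resisted:
            # Pull back sharply: multiply the delay, capped at max_delay.
            self.delay = min(self.max_delay, self.delay * self.backoff_factor)
        else:
            # Speed up gently: shave a fixed step off, floored at min_delay.
            self.delay = max(self.min_delay, self.delay - self.speedup_step)
            # Track normal latency with a moving average of clean responses.
            self.baseline_latency = 0.9 * self.baseline_latency + 0.1 * latency
        return self.delay


# One independent pacer per target, created on first use.
pacers = defaultdict(TargetPacer)
pacers["example.com"].observe(status=200, latency=0.2)  # clean: slightly faster
pacers["example.com"].observe(status=429, latency=0.2)  # resistance: back off
```

The asymmetry is deliberate: speeding up in small steps and backing off in large ones means a target that starts to resist gets relief quickly, while a target that stays healthy is probed toward its limit gradually.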

What that produced

100% success rate across all targets. 300% throughput increase over the previous architecture. No permanently blocked sources. No lost data windows.

The business got faster data, more reliably, from every source without sacrificing access to a single one.

That's what resolves the tension. Not picking a side. Building a system that finds the right speed for each target dynamically, so you never have to choose between access and value.

Victoire Habamungu

Software engineer specialising in data systems, distributed architecture and platform engineering.
