A big spider update that takes the crawling framework to the next level 🕷️
🚀 New Stuff and quality of life changes
- Added a LinkExtractor primitive in scrapling.spiders.LinkExtractor to pull URLs out of a Response. It offers many filtering controls (check the docs).

    from scrapling.spiders import LinkExtractor
    extractor = LinkExtractor(allow=r"/posts/", deny_domains=["ads.example.com"])
- Added CrawlSpider and CrawlRule generic spider templates so you no longer have to hand-write the same "follow links matching this pattern" boilerplate. Override rules() to return a list of CrawlRule objects, each pairing a LinkExtractor with an optional callback (check the docs).

    from scrapling.spiders import CrawlSpider, CrawlRule, LinkExtractor

    class QuotesSpider(CrawlSpider):
        name = "blog"
        start_urls = ["https://quotes.toscrape.com/"]

        def rules(self):
            return [
                CrawlRule(LinkExtractor(allow=r"/author/"), callback=self.parse_author),
                CrawlRule(LinkExtractor(allow=r"/page/\d+/")),  # pagination, no callback
            ]

        async def parse_author(self, response):
            yield {
                "name": response.css(".author-title::text").get(),
                "birthday": response.css(".author-born-date::text").get(),
                "url": response.url,
            }
- Added a SitemapSpider template that seeds a crawl directly from sitemap or robots.txt URLs. It handles gzip-compressed sitemaps and offers many controls and options. URLs are dispatched via crawl rules, as shown above for CrawlSpider (check the docs).

    from scrapling.spiders import SitemapSpider, CrawlRule, LinkExtractor

    class NewsSitemap(SitemapSpider):
        name = "news"
        sitemap_urls = ["https://example.com/robots.txt"]

        def rules(self):
            return [
                CrawlRule(LinkExtractor(allow=r"/articles/"), callback=self.parse_article),
            ]

        async def parse_article(self, response):
            yield {"url": response.url, "title": response.css("h1::text").get()}
- Adaptive relocation now defaults to a 40% similarity threshold instead of 0 across all methods, so weak matches are no longer returned as results. When nothing crosses the threshold, a warning now tells you the top score it did see, so you can lower the percentage deliberately if needed.
- Updated all browsers and fingerprints. Run scrapling install --force after updating to refresh the browsers and fingerprints.
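To make the threshold behaviour above concrete, here is a minimal sketch of threshold-gated matching. It is not Scrapling's implementation: the function name `relocate` is hypothetical, `difflib.SequenceMatcher` stands in for whatever similarity scoring the library actually uses, and treating a score equal to the threshold as a pass is an assumption.

```python
import difflib
import warnings


def relocate(target, candidates, threshold=40.0):
    """Return the candidate most similar to `target`, or None when the best
    score (in percent) stays below `threshold`. Hypothetical helper."""
    best, best_score = None, 0.0
    for candidate in candidates:
        # Stand-in similarity metric, scaled to a 0-100 percentage.
        score = difflib.SequenceMatcher(None, target, candidate).ratio() * 100
        if score > best_score:
            best, best_score = candidate, score
    if best is None or best_score < threshold:
        # Report the top score seen so the caller can lower the threshold on purpose.
        warnings.warn(
            f"No candidate reached {threshold:.0f}% similarity; "
            f"top score was {best_score:.1f}%"
        )
        return None
    return best
```

With a threshold of 0, the weakest candidate would still be returned as a "match"; gating on a percentage turns that case into an explicit warning instead.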
🐛 Bug Fixes
- Fixed Fetcher.configure(...) not applying to per-request calls. The same fix was applied to AsyncFetcher.
- Fixed incorrect request fingerprinting that caused duplicate requests in spiders by @yetval in #255.
- Fixed the Adaptive scraping engine staying silent on weak matches. Combined with the threshold change above, you now get a warning instead of a misleading "best guess" element when relocation fails.
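For readers unfamiliar with request fingerprinting, the sketch below shows the general idea behind deduplicating requests in a crawler. The release notes do not say which fields Scrapling hashes, so the function name `request_fingerprint` and the choice of method, canonicalised URL, and body are assumptions made for illustration.

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def request_fingerprint(method, url, body=b""):
    """Hash a request so that equivalent requests map to the same key.
    Hypothetical helper, not Scrapling's implementation."""
    parts = urlsplit(url)
    # Sort query parameters so their order does not create false "new" requests.
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    # Lowercase scheme and host, default the path to "/", and drop the fragment.
    canonical = urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path or "/", query, "")
    )
    digest = hashlib.sha256()
    for chunk in (method.upper().encode(), canonical.encode(), body):
        digest.update(chunk)
        digest.update(b"\x00")  # separator so adjacent fields cannot run together
    return digest.hexdigest()


seen = set()


def is_duplicate(method, url, body=b""):
    """Return True if an equivalent request was already recorded in `seen`."""
    key = request_fingerprint(method, url, body)
    if key in seen:
        return True
    seen.add(key)
    return False
```

A fingerprint that is too strict (for example, sensitive to parameter order) lets duplicates through, while one that is too loose drops distinct requests; which fields to include is a design decision.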
Docs
- Refreshed older code examples across the documentation to match the current version.
- Improved the code copy-paste experience on the docs site and trimmed the agent skill so it uses fewer tokens per invocation.
🙏 Special thanks to the community for all the continuous testing and feedback
Big shoutout to our Platinum Sponsors
Detailed ChangeLog
A focused update on browser stealth, privacy, and developer experience 🔒
🚀 New Stuff and quality of life changes
- Added built-in ad blocking for browser fetchers. Pass block_ads=True to block requests to ~3,500 known ad and tracker domains at the route interception level, so blocked requests are aborted instantly with no DNS lookup or TCP connection. It can be combined with blocked_domains for custom lists. The MCP server and the CLI --ai-targeted mode enable this automatically to save tokens and speed up page loads.

    from scrapling.fetchers import StealthyFetcher
    page = StealthyFetcher.fetch('https://example.com', block_ads=True)

- Added DNS-over-HTTPS support to prevent DNS leaks when using proxies. Pass dns_over_https=True to route DNS queries through Cloudflare's DoH, so your real location isn't exposed through DNS resolution even when your HTTP traffic goes through a proxy.

    from scrapling.fetchers import StealthyFetcher
    page = StealthyFetcher.fetch('https://example.com', proxy='http://proxy:8080', dns_over_https=True)

- Added a page_setup callback for browser fetchers. It is a function that runs before page.goto(), letting you register event listeners, routes, or scripts that must be set up before the page navigates. It pairs with page_action, which runs after navigation. (Solves #237)

    from scrapling.fetchers import DynamicFetcher

    def capture_websockets(page):
        page.on("websocket", lambda ws: print(f"WS: {ws.url}"))

    page = DynamicFetcher.fetch('https://example.com', page_setup=capture_websockets)

- Added --block-ads and --dns-over-https CLI options to both the fetch and stealthy-fetch commands.
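As a rough illustration of the ad-blocking item above, the decision made during route interception boils down to a hostname check against a blocklist. The sketch below is an assumption about the general shape of that check, not Scrapling's code: `is_blocked` and the sample domains are hypothetical, and matching subdomains by suffix is one reasonable policy.

```python
from urllib.parse import urlsplit


def is_blocked(url, blocked_domains):
    """Return True when the URL's host is a blocked domain or one of its
    subdomains. Hypothetical helper for illustration."""
    host = (urlsplit(url).hostname or "").lower()
    # Match "example.net" itself and "*.example.net", but not "notexample.net".
    return any(host == domain or host.endswith("." + domain) for domain in blocked_domains)


# Hypothetical blocklist; the real list of ad and tracker domains ships with the library.
BLOCKLIST = {"doubleclick.net", "tracker.example"}
```

In a Playwright-style route handler, a check like this would decide between aborting and continuing the request before any connection is opened, which is why blocking at this level avoids the DNS lookup and TCP handshake.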
🐛 Bug Fixes
- Fixed the Seconds type alias rejecting float values. Passing wait=1.5 or timeout=500.0 to browser fetchers would fail with a type error because the type alias incorrectly treated float as metadata instead of a type. By @kuishou68 in #240.
- Fixed duplicate ID segments in full-path selector generation. Elements with id attributes had their selector appended twice when generating full CSS/XPath paths, producing selectors like body > #main > #main > #target > #target. Also fixed full-path XPath emitting bare [@id='x'] predicates (invalid XPath) instead of *[@id='x']. By @sjhddh in #241.
- Fixed missing shell signature parameters. The interactive shell was missing blocked_domains, block_ads, retries, retry_delay, capture_xhr, executable_path, and dns_over_https from its function signatures.
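The Seconds fix above describes a class of mistake that is easy to make with typing.Annotated: only the first argument is the type, and everything after it is metadata. The sketch below reproduces that pitfall in isolation. The alias names and the `accepted_types` helper are hypothetical, and the actual definition of Seconds in Scrapling is not shown in these notes.

```python
from typing import Annotated, Union, get_args, get_origin

# Pitfall: `float` is the second argument, so it is metadata, not an accepted type.
BrokenSeconds = Annotated[int, float]

# Intended form: the union is the type, and anything after it is metadata.
FixedSeconds = Annotated[Union[int, float], "seconds"]


def accepted_types(alias):
    """Return the runtime types an Annotated alias actually allows."""
    base, *_metadata = get_args(alias)
    if get_origin(base) is Union:
        return get_args(base)
    return (base,)
```

A validator that reads only the first argument of `BrokenSeconds` sees `int` alone, which is why a value such as 1.5 would be rejected.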
🙏 Special thanks to the community for all the continuous testing and feedback
Detailed ChangeLog
A focused update with one big quality-of-life feature for spider developers and a couple of important fixes 🎉
🚀 New Stuff and quality of life changes
- Spider Development Mode: Iterating on a spider's parse() logic used to mean re-hitting the target servers on every run, which is slow, noisy, and a great way to get rate-limited while you're still figuring out your selectors. The new development mode caches every response to disk on the first run and replays them from disk on every subsequent run, so you can tweak your callbacks and re-run as many times as you want without making a single network request. Enable it with one class attribute:

    from scrapling.spiders import Spider

    class MySpider(Spider):
        name = "my_spider"
        start_urls = ["https://example.com"]
        development_mode = True

        async def parse(self, response):
            yield {"title": response.css("title::text").get("")}

  The cache lives in .scrapling_cache/{spider.name}/ by default and can be redirected anywhere with development_cache_dir. Two new stat counters, cache_hits and cache_misses, let you see how the cache performed. Cache replay bypasses download_delay, rate limiting, and the blocked-request retry path so iteration is as fast as the disk allows. Don't ship a spider with development_mode = True -- it's a development tool, not a production cache. See the docs for the full story.
- Safer redirects by default: follow_redirects now defaults to "safe" across all HTTP fetchers, the MCP server, and the shell. Redirects are still followed, but ones targeting internal/private IPs (loopback, private networks, link-local) are rejected. This protects you from SSRF when scraping user-supplied URLs. Pass follow_redirects="all" to get the old behavior, or False to disable redirects entirely.
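The "safe" redirect policy above can be pictured as a check on the addresses a redirect target resolves to. The following is a minimal sketch under stated assumptions, not Scrapling's implementation: the function names are hypothetical, hostname resolution is left to the caller, and rejecting a target with no resolved addresses is a choice made here.

```python
import ipaddress


def is_internal_address(ip_text):
    """True for loopback, private-network, and link-local addresses."""
    ip = ipaddress.ip_address(ip_text)
    return ip.is_loopback or ip.is_private or ip.is_link_local


def allow_redirect(resolved_ips, mode="safe"):
    """Decide whether to follow a redirect whose target resolved to `resolved_ips`.

    mode mirrors the three values described above: "safe", "all", or False.
    """
    if mode is False:
        return False  # redirects disabled entirely
    if mode == "all":
        return True  # old behaviour: follow everything
    # "safe": every resolved address must be public; an empty list is rejected.
    return bool(resolved_ips) and not any(is_internal_address(ip) for ip in resolved_ips)
```

Checking every resolved address, rather than only the first, matters because a hostname can resolve to both a public and an internal address.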
🐛 Bug Fixes
- Force-stop no longer loses your checkpoint: Pressing Ctrl+C twice (force-stop) on a spider with crawldir enabled used to race against the checkpoint write -- the cancel scope would tear down the task before the pickle finished, leaving paused=False and triggering the cleanup path that deletes the previous checkpoint. The result was that force-stopping a long crawl could lose all the progress you were trying to save. The engine now writes the checkpoint before calling cancel_scope.cancel(), so a force-stop always preserves the latest pending state. By @voidborne-d in #230.
🙏 Special thanks to the community for all the continuous testing and feedback
Detailed ChangeLog
A new maintenance update with important changes
Bug fixes
- The function get_all_text() now captures tail text nodes. This lets the MCP server and commands see text that was missed before (#168). Thanks @mhillebrand
- Referer now returns a bare Google URL instead of a Google search URL. The previous logic was incorrect and may have produced a fingerprinting signal (#179). Thanks @Bortlesboat
- Fixed an issue with extra flags concatenation in all browsers. Thanks @rostchri
- Fixed a type-hint issue that caused crashes on Python versions below 3.12. (Solves #163)
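The get_all_text() fix above concerns "tail" text: in ElementTree-style trees, text that follows a child's closing tag is stored on that child as `tail`, not on the parent. The sketch below uses the standard library's xml.etree to show what is lost when tails are skipped. The two helper functions are hypothetical and are not Scrapling's code.

```python
import xml.etree.ElementTree as ET


def text_without_tails(element):
    """Collect only each element's own leading text, ignoring tails."""
    parts = [element.text or ""]
    for child in element:
        parts.append(text_without_tails(child))
    return "".join(parts)


def text_with_tails(element):
    """Collect leading text plus the tail text that follows each child."""
    parts = [element.text or ""]
    for child in element:
        parts.append(text_with_tails(child))
        parts.append(child.tail or "")  # text after the child's closing tag
    return "".join(parts)


root = ET.fromstring("<p>Price: <b>42</b> USD <i>only</i> today</p>")
```

Here " USD " is the tail of `<b>` and " today" is the tail of `<i>`, so a traversal that ignores tails drops both.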
Other
- Added an Agent Skill for Claude Code / OpenClaw and other AI agentic tools.
- Added the Agent Skill to Clawhub.
- Updated all browsers and Playwright to their latest versions.
- Added a French translation to the main README file.
🙏 Special thanks to the community for all the continuous testing and feedback
Detailed ChangeLog
A new update with many important changes
🚀 New Stuff and quality of life changes
- Improved regex precision for Cloudflare challenge detection (Thanks to @RinZ27 #133)
- Improved the speed and efficiency of the Cloudflare solver. Now it is nearly twice as fast.
- Improved the Cloudflare solver to handle the case where websites sometimes show the Cloudflare page twice before redirecting to the main website.
- Improved the stealthy browser's stealth mode and speed by removing the injected JS files.
- Improved the MCP schema so that OpenCode accepts it (Thanks to @robin-ede #137)
- Made the MCP schema stricter so that VS Code Copilot and other strict tools accept it. (Solves #150)
- Reduced the MCP server's token consumption by a large margin by stripping unneeded HTML tags when the main_content_only option is enabled.
- Fixed the PyPI page and added the files needed to register the MCP server in the MCP server registry.
- Added a new code snippet showing how to install the browser dependencies through code instead of the command line, to allow easier automation.
- Improved all workflows by using the latest actions versions (Thanks to @salmanmkc #143/#144)
🙏 Special thanks to the community for all the continuous testing and feedback
Detailed ChangeLog
The biggest release of Scrapling yet — introducing the Spider framework, proxy rotation, and major parser improvements
This release brings a fully async spider/crawling framework, intelligent proxy management, and significant API changes that make Scrapling more powerful and consistent. Please review the breaking changes section carefully before upgrading.
🕷️ Spider Framework
A new async crawling framework built on top of anyio for structured, large-scale scraping:
    from scrapling.spiders import Spider, Response

    class MySpider(Spider):
        name = "demo"
        start_urls = ["https://example.com/"]

        async def parse(self, response: Response):
            for item in response.css('.product'):
                yield {"title": item.css('h2::text').get()}

    MySpider().start()
- Scrapy-like Spider API: Define spiders with start_urls, async parse callbacks, Request/Response objects, and a priority queue.
- Concurrent Crawling: Configurable concurrency limits, per-domain throttling, and download delays.
- Multi-Session Support: Unified interface for HTTP requests and stealthy headless browsers in a single spider; route requests to different sessions by ID. Supports lazy session initialization.
- Pause & Resume: Checkpoint-based crawl persistence. Press Ctrl+C to shut down gracefully, then restart to resume from where you left off.
- Streaming Mode: Stream scraped items as they arrive via async for item in spider.stream() with real-time stats; ideal for UIs, pipelines, and long-running crawls.
- Blocked Request Detection: Automatic detection and retry of blocked requests with customizable logic.
- Built-in Export: Export results through hooks and your own pipeline, or use the built-in JSON/JSONL export with result.items.to_json() / result.items.to_jsonl() respectively.
- Lifecycle hooks: on_start(), on_close(), on_error(), on_scraped_item(), and more hooks for full control over the crawl lifecycle.
- Detailed crawl stats: Track requests, responses, bytes, status codes, proxies, per-domain/session breakdowns, log level counts, and more.
- uvloop support: Pass use_uvloop=True to spider.start() for faster async execution when available.
A new section with the full details has been added to the website.
🔄 Proxy Rotation
- New ProxyRotator class with thread-safe rotation. Works with all fetchers and sessions:

    from scrapling import ProxyRotator
    rotator = ProxyRotator(["http://proxy1:8080", "http://proxy2:8080"])
    Fetcher.get(url, proxy_rotator=rotator)

- Custom rotation strategies: Make your own proxy rotation logic.
- Per-request proxy override: Pass proxy= to any individual get()/post()/fetch() call to override the session proxy for that request.
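To illustrate what thread-safe rotation means in practice, here is a minimal round-robin sketch using only the standard library. It is not the ProxyRotator implementation: the class name `RoundRobinProxies` and its `next_proxy` method are hypothetical, and round-robin is only one possible strategy.

```python
import itertools
import threading


class RoundRobinProxies:
    """Hand out proxies in a fixed cycle, safely from multiple threads.
    Hypothetical class for illustration only."""

    def __init__(self, proxies):
        proxies = list(proxies)
        if not proxies:
            raise ValueError("at least one proxy is required")
        self._cycle = itertools.cycle(proxies)
        self._lock = threading.Lock()

    def next_proxy(self):
        # The lock ensures two threads never advance the iterator at the same time.
        with self._lock:
            return next(self._cycle)
```

A custom strategy (random choice, weighting by past failures, and so on) would replace the body of `next_proxy` while keeping the lock around any shared state.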
🌐 Browser Fetcher Improvements
- Domain blocking: New blocked_domains parameter on DynamicFetcher/StealthyFetcher to block requests to specific domains (subdomains matched automatically).
- Automatic retries: Browser fetchers now retry on failure with retries (default: 3) and retry_delay (default: 1s) parameters. Includes proxy-aware error detection.
- Response metadata: The Response.meta dict automatically stores the proxy used and merges request metadata.
- Response.follow(): Create follow-up Request objects with automatic referer flow, designed for the spider system.
- No autoplay: Browser sessions now block autoplay content, which previously caused issues.
- Speed: Improved stealth and speed by adjusting browser flags.
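The retries and retry_delay parameters above follow a familiar pattern, sketched below with the standard library. This is an illustration under assumptions rather than the fetchers' actual logic: `fetch_with_retries` is a hypothetical helper, `retries` is treated here as the total number of attempts, and the proxy-aware error detection mentioned above is not modelled.

```python
import time


def fetch_with_retries(fetch, retries=3, retry_delay=1.0, sleep=time.sleep):
    """Call `fetch()` up to `retries` times, pausing `retry_delay` seconds
    between attempts. Re-raises the last error if every attempt fails."""
    if retries < 1:
        raise ValueError("retries must be at least 1")
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return fetch()
        except Exception as error:  # a real fetcher would catch narrower errors
            last_error = error
            if attempt < retries:
                sleep(retry_delay)  # no pause after the final attempt
    raise last_error
```

The `sleep` argument is injectable so the delay can be replaced in tests without waiting.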
🔧 Bug Fixes & Improvements
- Parser optimization: Optimized the parser for repeated operations, improving performance.
- Errored pages: Fixed a bug that prevented the browser from closing when pages raised errors.
- Empty body: Responses with an empty body are now handled.
- Playwright loop: Fixed an issue that left the Playwright loop open when a CDP connection fails.
- Type safety: Fixed all mypy errors and added type hints across untyped function bodies. Added mypy and pyright to the CI workflow.
⚠️ Breaking Changes
- css_first/xpath_first removed: Use css('.selector').first, css('.selector')[0], or css('.selector').get() instead.
- All selections now return Selectors: css('::text'), xpath('//text()'), css('::attr(href)'), and xpath('//@href') now return Selectors (wrapping text nodes in Selector objects with tag="#text") instead of TextHandlers. This makes the API consistent across all selection methods and the type hints.
- Response.body is always bytes: Previously it could be str or bytes; now it always returns bytes.
- get()/getall() behavior: On Selector, get() returns a TextHandler (serialized HTML or text value) and getall() returns TextHandlers. Aliases: extract_first = get, extract = getall. The old get_all() on Selectors is removed.
- Selectors.first/.last: Safe accessors that return Selector | None instead of raising IndexError.
- Internal constants renamed: DEFAULT_FLAGS → DEFAULT_ARGS, DEFAULT_STEALTH_FLAGS → STEALTH_ARGS, HARMFUL_DEFAULT_ARGS → HARMFUL_ARGS, DEFAULT_DISABLED_RESOURCES → EXTRA_RESOURCES.
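The difference between indexing and the safe .first/.last accessors described above can be shown with a small stand-in class. `SafeList` is hypothetical and only mimics the accessor behaviour stated in the notes; it says nothing about how Selectors is implemented.

```python
class SafeList(list):
    """List with .first/.last accessors that return None when empty,
    instead of raising IndexError like [0] or [-1] would."""

    @property
    def first(self):
        return self[0] if self else None

    @property
    def last(self):
        return self[-1] if self else None


matches = SafeList()        # e.g. a selection that matched nothing
title = matches.first       # None, no exception
```

When migrating from css_first/xpath_first, this is the practical distinction: [0] raises on an empty result, while .first lets you test for None.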
🔨 Other Changes
- Dependency changes: Replaced tldextract with tld, removed internal _html_utils.py in favor of w3lib.html.replace_entities, and added typing_extensions as a hard requirement.
- Docs overhaul: Full switch from MkDocs to Zensical, new spider documentation section, updated all existing pages, and added new API references.
🙏 Special thanks to our Discord community for all the continuous testing and feedback
Detailed ChangeLog
A minor maintenance update to fix issues that happened with some devices in v0.3.13
- Disabled incognito mode in StealthyFetcher and its session classes because it prevented cookies from persisting across pages on Windows devices. The issue did not occur on macOS or Linux (Fixes #123, thanks to @frugality4121 for bringing it up and to @gembleman for pointing out the solution).
- Pinned browserforge to its latest version to fix the issue with old header models for users who already had an older browserforge version installed.
🙏 Special thanks to our Discord community for all the continuous testing and feedback
Detailed ChangeLog