contexts_by_config accumulated browser contexts unboundedly in long-running crawlers (Docker API). Two root causes fixed:

1. _make_config_signature() hashed ~60 CrawlerRunConfig fields, but only 7 affect the browser context (proxy_config, locale, timezone_id, geolocation, override_navigator, simulate_user, magic). Switched from a blacklist to a whitelist, so non-context fields like word_count_threshold, css_selector, screenshot, and verbose no longer cause unnecessary context creation.
2. No eviction mechanism existed between close() calls. Added refcount tracking (_context_refcounts, incremented under _contexts_lock in get_page, decremented in release_page_with_context) and LRU eviction (_evict_lru_context_locked) that caps contexts at _max_contexts=20, evicting only idle contexts (refcount==0) oldest-first.

Also fixed: the storage_state path leaked a temporary context on every request (now explicitly closed after clone_runtime_state).

Closes #943. Credit to @Martichou for the investigation in #1640.
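The two fixes described in this commit message can be illustrated with a minimal, synchronous sketch. It is not the project's actual code: the class ContextPool, its methods, and the dict-based config are hypothetical stand-ins, and the real implementation is asynchronous and guards its bookkeeping with _contexts_lock. Only the seven field names and the default cap of 20 are taken from the message above.

```python
import hashlib
import json
from collections import OrderedDict

# The seven fields the commit message lists as affecting the browser context.
CONTEXT_FIELDS = (
    "proxy_config", "locale", "timezone_id", "geolocation",
    "override_navigator", "simulate_user", "magic",
)


def make_config_signature(config: dict) -> str:
    """Whitelist approach: hash only context-relevant fields, ignore the rest."""
    relevant = {key: config.get(key) for key in CONTEXT_FIELDS}
    payload = json.dumps(relevant, sort_keys=True, default=str)
    return hashlib.sha256(payload.encode()).hexdigest()


class ContextPool:
    """Hypothetical pool: refcounted contexts with a cap and idle-only eviction."""

    def __init__(self, max_contexts: int = 20):
        self.max_contexts = max_contexts
        self.contexts = OrderedDict()  # signature -> context, least recently used first
        self.refcounts = {}            # signature -> number of pages in use
        self.closed = []               # evicted contexts (a real pool would close them)

    def acquire(self, config: dict, factory):
        sig = make_config_signature(config)
        if sig not in self.contexts:
            self.contexts[sig] = factory()
            self.refcounts[sig] = 0
        self.contexts.move_to_end(sig)  # mark as most recently used
        self.refcounts[sig] += 1        # increment first so it cannot be evicted below
        self._evict_idle()
        return sig, self.contexts[sig]

    def release(self, sig: str) -> None:
        if self.refcounts.get(sig, 0) > 0:
            self.refcounts[sig] -= 1

    def _evict_idle(self) -> None:
        # Walk oldest-first and drop only contexts nobody is using.
        for sig in list(self.contexts):
            if len(self.contexts) <= self.max_contexts:
                break
            if self.refcounts[sig] == 0:
                self.closed.append(self.contexts.pop(sig))
                del self.refcounts[sig]
```

In this sketch the cap is soft: if every context is still in use, the pool is allowed to exceed max_contexts rather than close a context that has live pages, which mirrors the "evicting only idle contexts" rule in the message.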
Contributors to Crawl4AI
We would like to thank the following people for their contributions to Crawl4AI:
Core Team
- Unclecode - Project Creator and Main Developer
- Nasrin - Project Manager and Developer
- Aravind Karnam - Head of Community and Product
Community Contributors
- aadityakanjolia4 - Fix for CustomHTML2Text is not defined.
- FractalMind - Created the first official Docker Hub image and fixed Dockerfile errors
- ketonkss4 - Identified Selenium's new capabilities, helping reduce dependencies
- jonymusky - JavaScript execution documentation, and wait_for
- datehoer - Add browser proxy support
Pull Requests
- dvschuyl - AsyncPlaywrightCrawlerStrategy page-evaluate context destroyed by navigation #304
- nelzomal - Enhance development installation instructions #286
- HamzaFarhan - Handled the cases where markdown_with_citations, references_markdown, and filtered_html might not be defined #293
- NanmiCoder - fix: crawler strategy exception handling and fixes #271
- paulokuong - fix: CRAWL4_AI_BASE_DIRECTORY should be Path object instead of string #298
- TheRedRad - feat: add force viewport screenshot option #1694
- ChiragBellara - fix: avoid Common Crawl calls for sitemap-only URL seeding #1746
- YuriNachos - fix: replace tf-playwright-stealth with playwright-stealth #1714, fix: respect <base> tag for relative link resolution #1721, fix: include GoogleSearchCrawler script.js in package #1719, fix: allow local embeddings by removing OpenAI fallback #1717
- christian-oudard - fix: deep-crawl CLI outputting only the first page #1667
- vladmandic - fix: VersionManager ignoring CRAWL4_AI_BASE_DIRECTORY env var #1296
- nnxiong - fix: script tag removal losing adjacent text in cleaned_html #1364
- RoyLeviLangware - fix: bs4 deprecation warning (text -> string) #1077
- garyluky - fix: proxy auth ERR_INVALID_AUTH_CREDENTIALS #1281
- Martichou - investigation: browser context memory leak under continuous load #1640, #943
Feb-Alpha-1
- sufianuddin - fix: Documentation for JsonCssExtractionStrategy
- tautikAg - fix: Markdown output has incorrect spacing
- cardit1 - fix: 'AsyncPlaywrightCrawlerStrategy' object has no attribute 'downloads_path'
- dmurat - fix: Incorrect rendering of inline code inside of links
- Sparshsing - fix: Relative URLs in the webpage not extracted properly
Other Contributors
Typo fixes
Acknowledgements
We also want to thank all the users who have reported bugs, suggested features, or helped in any other way to make Crawl4AI better.
If you've contributed to Crawl4AI and your name isn't on this list, please open a pull request with your name, link, and contribution, and we'll review it promptly.
Thank you all for your contributions!