crawl4ai

Author	SHA1	Message	Date
unclecode	0bfcf080dd	Add contributors from PRs #1133, #729. Credit chrizzly2309 and complete-dope for identifying bugs that were resolved on develop.	2026-02-02 07:56:37 +00:00
unclecode	b962699c0d	Add contributors from PRs #973, #1073, #931. Credit danyQe, saipavanmeruga7797, and stevenaldinger for identifying bugs that were resolved on develop.	2026-02-02 07:14:12 +00:00
unclecode	c790231aba	Fix browser context memory leak: signature shrink + LRU eviction (#943). contexts_by_config accumulated browser contexts unboundedly in long-running crawlers (Docker API). Two root causes fixed: 1. _make_config_signature() hashed ~60 CrawlerRunConfig fields but only 7 affect the browser context (proxy_config, locale, timezone_id, geolocation, override_navigator, simulate_user, magic). Switched from blacklist to whitelist, so non-context fields like word_count_threshold, css_selector, screenshot, verbose no longer cause unnecessary context creation. 2. No eviction mechanism existed between close() calls. Added refcount tracking (_context_refcounts, incremented under _contexts_lock in get_page, decremented in release_page_with_context) and LRU eviction (_evict_lru_context_locked) that caps contexts at _max_contexts=20, evicting only idle contexts (refcount==0) oldest-first. Also fixed: storage_state path leaked a temporary context every request (now explicitly closed after clone_runtime_state). Closes #943. Credit to @Martichou for the investigation in #1640.	2026-02-01 14:23:04 +00:00
unclecode	bb523b6c6c	Merge PRs #1077, #1281: bs4 deprecation and proxy auth fix - PR #1077: Fix bs4 deprecation warning (text -> string) - PR #1281: Fix proxy auth ERR_INVALID_AUTH_CREDENTIALS - Comment on PR #1081 guiding author on needed DFS/BFF fixes - Update CONTRIBUTORS.md and PR-TODOLIST.md	2026-02-01 07:06:39 +00:00
unclecode	a56dd07559	Merge PRs #1667, #1296, #1364: CLI deep-crawl, env var, script tags - PR #1667: Fix deep-crawl CLI outputting only the first page - PR #1296: Fix VersionManager ignoring CRAWL4_AI_BASE_DIRECTORY - PR #1364: Fix script tag removal losing adjacent text - Fix: restore .crawl4ai subfolder in VersionManager path - Close #1150 (already fixed on develop) - Update CONTRIBUTORS.md and PR-TODOLIST.md	2026-02-01 06:53:53 +00:00
unclecode	dc4ae73221	Merge PRs #1714, #1721, #1719, #1717 and fix base tag pipeline - PR #1714: Replace tf-playwright-stealth with playwright-stealth - PR #1721: Respect <base> tag in html2text for relative links - PR #1719: Include GoogleSearchCrawler script.js in package data - PR #1717: Allow local embeddings by removing OpenAI fallback - Fix: Extract <base href> from raw HTML before head gets stripped - Close duplicates: #1703, #1698, #1697, #1710, #1720 - Update CONTRIBUTORS.md and PR-TODOLIST.md	2026-02-01 05:41:33 +00:00
unclecode	ee717dc019	Add contributor for PR #1746 and fix test pytest marker - Add ChiragBellara to CONTRIBUTORS.md for sitemap seeding fix - Add missing @pytest.mark.asyncio decorator to seeder test	2026-02-01 03:10:32 +00:00
unclecode	5be0d2d75e	Add contributor and docs for force_viewport_screenshot feature - Add TheRedRad to CONTRIBUTORS.md for PR #1694 - Document force_viewport_screenshot in API parameters reference - Add viewport screenshot note in browser-crawler-config guide - Add viewport-only screenshot example in screenshot docs	2026-02-01 01:10:20 +00:00
Aravind	a9e24307cc	Release prep (#749 ) * fix: Update export of URLPatternFilter * chore: Add dependancy for cchardet in requirements * docs: Update example for deep crawl in release note for v0.5 * Docs: update the example for memory dispatcher * docs: updated example for crawl strategies * Refactor: Removed wrapping in if __name__==main block since this is a markdown file. * chore: removed cchardet from dependancy list, since unclecode is planning to remove it * docs: updated the example for proxy rotation to a working example * feat: Introduced ProxyConfig param * Add tutorial for deep crawl & update contributor list for bug fixes in feb alpha-1 * chore: update and test new dependancies * feat:Make PyPDF2 a conditional dependancy * updated tutorial and release note for v0.5 * docs: update docs for deep crawl, and fix a typo in docker-deployment markdown filename * refactor: 1. Deprecate markdown_v2 2. Make markdown backward compatible to behave as a string when needed. 3. Fix LlmConfig usage in cli 4. Deprecate markdown_v2 in cli 5. Update AsyncWebCrawler for changes in CrawlResult * fix: Bug in serialisation of markdown in acache_url * Refactor: Added deprecation errors for fit_html and fit_markdown directly on markdown. Now access them via markdown * fix: remove deprecated markdown_v2 from docker * Refactor: remove deprecated fit_markdown and fit_html from result * refactor: fix cache retrieval for markdown as a string * chore: update all docs, examples and tests with deprecation announcements for markdown_v2, fit_html, fit_markdown	2025-02-28 19:53:35 +08:00
UncleCode	357414c345	docs(readme): update version references and fix links Update version numbers to v0.4.3bx throughout README.md Fix contributing guidelines link to point to CONTRIBUTORS.md Update Aravind's role in CONTRIBUTORS.md to Head of Community and Product Add pre-release installation instructions Fix minor formatting in personal story section No breaking changes	2025-01-22 20:46:39 +08:00
UncleCode	8c76a8c7dc	docs: add contributor entry for dvschuyl regarding AsyncPlaywrightCrawlerStrategy issue	2024-11-29 21:14:49 +08:00
UncleCode	1d83c493af	Enhance setup process and update contributors list - Acknowledge contributor paulokuong for fixing CRAWL4_AI_BASE_DIRECTORY issue - Refine base directory handling in `setup.py` - Clarify Playwright installation instructions and improve error handling	2024-11-28 19:58:40 +08:00
UncleCode	24723b2f10	Enhance features and documentation - Updated version to 0.3.743 - Improved ManagedBrowser configuration with dynamic host/port - Implemented fast HTML formatting in web crawler - Enhanced markdown generation with a new generator class - Improved sanitization and utility functions - Added contributor details and pull request acknowledgments - Updated documentation for clearer usage scenarios - Adjusted tests to reflect class name changes	2024-11-28 12:45:05 +08:00
unclecode	4d48bd31ca	Push async version last changes for merge to main branch	2024-09-24 20:52:08 +08:00
unclecode	659c8cd953	refactor: Update image description minimum word threshold in get_content_of_website_optimized	2024-08-02 15:55:32 +08:00

15 Commits
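
The two fixes described in commit c790231aba (a whitelist-based config signature, plus refcounted LRU eviction of idle contexts) can be sketched roughly as follows. This is a simplified, synchronous illustration and not the project's actual code: `ContextPool`, `acquire`, `release`, and the `factory` callback are hypothetical names, configs are plain dicts rather than `CrawlerRunConfig` objects, and the locking the commit mentions (`_contexts_lock`) is left out. Only the seven whitelisted field names and the cap of 20 come from the commit message.

```python
import hashlib
import json
from collections import OrderedDict

# The seven fields the commit message says affect the browser context.
CONTEXT_FIELDS = (
    "proxy_config", "locale", "timezone_id", "geolocation",
    "override_navigator", "simulate_user", "magic",
)


def make_config_signature(config: dict) -> str:
    """Hash only whitelisted fields, so unrelated options share a context."""
    relevant = {name: config.get(name) for name in CONTEXT_FIELDS}
    payload = json.dumps(relevant, sort_keys=True, default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class ContextPool:
    """Hypothetical pool: reuse contexts by signature, evict idle ones LRU-first."""

    def __init__(self, max_contexts: int = 20):
        self.max_contexts = max_contexts
        self._contexts = OrderedDict()  # signature -> context, least recent first
        self._refcounts = {}            # signature -> number of active users
        self.closed = []                # evicted contexts, kept for inspection

    def acquire(self, config: dict, factory):
        sig = make_config_signature(config)
        if sig in self._contexts:
            self._contexts.move_to_end(sig)  # mark as most recently used
        else:
            self._evict_idle()
            self._contexts[sig] = factory()
            self._refcounts[sig] = 0
        self._refcounts[sig] += 1
        return sig, self._contexts[sig]

    def release(self, sig: str) -> None:
        if self._refcounts.get(sig, 0) > 0:
            self._refcounts[sig] -= 1

    def _evict_idle(self) -> None:
        # Make room for one new context; contexts still in use are never evicted,
        # so the pool can temporarily exceed the cap when everything is busy.
        for sig in list(self._contexts):
            if len(self._contexts) < self.max_contexts:
                break
            if self._refcounts[sig] == 0:
                self.closed.append(self._contexts.pop(sig))
                del self._refcounts[sig]
```

In this sketch, two configs that differ only in non-whitelisted options such as `verbose` or `screenshot` map to the same signature and therefore reuse one context, which is the effect the commit attributes to the blacklist-to-whitelist switch.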