crawl4ai

Author	SHA1	Message	Date
UncleCode	2d6b19e1a2	refactor(browser): improve browser path management Implement more robust browser executable path handling using playwright's built-in browser management. This change: - Adds async browser path resolution - Implements path caching in the home folder - Removes hardcoded browser paths - Adds httpx dependency - Removes obsolete test result files This change makes the browser path resolution more reliable across different platforms and environments.	2025-01-17 22:14:37 +08:00
UncleCode	ece9202b61	fix(dispatcher): adjust memory threshold and fix dispatcher initialization - Increase memory threshold from 70% to 90% for better resource utilization - Remove incorrect self parameter from MemoryAdaptiveDispatcher initialization These changes improve the crawler's performance by allowing more memory usage before throttling and fix a bug in dispatcher initialization.	2025-01-16 21:58:52 +08:00
UncleCode	9d694da939	fix(models): make model fields optional with default values Make fields in MediaItem and Link models optional with default values to prevent validation errors when data is incomplete. Also expose BaseDispatcher in __init__ and fix markdown field handling in database manager. BREAKING CHANGE: MediaItem and Link model fields are now optional with default values which may affect existing code expecting required fields.	2025-01-15 22:58:14 +08:00
UncleCode	20c027b79c	chore(cleanup): remove unused files and improve type hints - Remove .pre-commit-config.yaml and duplicate mkdocs configuration files - Add Optional type hint for proxy parameter in BrowserConfig - Fix type annotation for results list in AsyncWebCrawler - Move calculate_batch_size function import to model_loader - Update prompt imports in extraction_strategy.py No breaking changes.	2025-01-14 13:07:18 +08:00
UncleCode	8ec12d7d68	Apply Ruff Corrections	2025-01-13 19:19:58 +08:00
UncleCode	c3370ec5da	refactor(scraping): replace ScrapingMode enum with strategy pattern Replace the ScrapingMode enum with a proper strategy pattern implementation for content scraping. This change introduces: - New ContentScrapingStrategy abstract base class - Concrete WebScrapingStrategy and LXMLWebScrapingStrategy implementations - New Pydantic models for structured scraping results - Updated documentation reflecting the new strategy-based approach BREAKING CHANGE: ScrapingMode enum has been removed. Users should now use ContentScrapingStrategy implementations instead.	2025-01-13 17:53:12 +08:00
UncleCode	f3ae5a657c	feat(scraping): add LXML-based scraping mode for improved performance Adds a new ScrapingMode enum to allow switching between BeautifulSoup and LXML parsing. LXML mode offers 10-20x better performance for large HTML documents. Key changes: - Added ScrapingMode enum with BEAUTIFULSOUP and LXML options - Implemented LXMLWebScrapingStrategy class - Added LXML-based metadata extraction - Updated documentation with scraping mode usage and performance considerations - Added cssselect dependency BREAKING CHANGE: None	2025-01-12 20:46:23 +08:00
UncleCode	825c78a048	refactor(dispatcher): migrate to modular dispatcher system with enhanced monitoring Reorganize dispatcher functionality into separate components: - Create dedicated dispatcher classes (MemoryAdaptive, Semaphore) - Add RateLimiter for smart request throttling - Implement CrawlerMonitor for real-time progress tracking - Move dispatcher config from CrawlerRunConfig to separate classes BREAKING CHANGE: Dispatcher configuration moved from CrawlerRunConfig to dedicated dispatcher classes. Users need to update their configuration approach for multi-URL crawling.	2025-01-11 21:10:27 +08:00
UncleCode	3865342c93	Merge branch 'next' into next-cdp	2025-01-10 16:01:49 +08:00
UncleCode	ac5f461d40	feat(crawler): add memory-adaptive dispatcher with rate limiting Implements a new MemoryAdaptiveDispatcher class to manage concurrent crawling operations with memory monitoring and rate limiting capabilities. Changes include: - Added RateLimitConfig dataclass for configuring rate limiting behavior - Extended CrawlerRunConfig with dispatcher-related settings - Refactored arun_many to use the new dispatcher system - Added memory threshold and session permit controls - Integrated optional progress monitoring display BREAKING CHANGE: The arun_many method now uses MemoryAdaptiveDispatcher by default, which may affect concurrent crawling behavior	2025-01-10 16:01:18 +08:00
UncleCode	e8b4ac6046	docs(urls): update documentation URLs to new domain Update all documentation URLs from crawl4ai.com/mkdocs to docs.crawl4ai.com Improve badges styling and layout in documentation Increase code font size in documentation CSS BREAKING CHANGE: Documentation URLs have changed from crawl4ai.com/mkdocs to docs.crawl4ai.com	2025-01-09 16:22:41 +08:00
UncleCode	051a6cf974	docs(readme): update personal story and project vision Revise the README's personal story section to better reflect the project's origins, motivation, and vision for open-source data accessibility. Add more detail about the creator's background and the project's mission to democratize AI through open data access. Also includes a minor TODO comment addition in async crawler strategy.	2025-01-08 21:13:31 +08:00
UncleCode	1c9464b988	Update all documents	2025-01-08 19:31:31 +08:00
UncleCode	6838901788	Update All docs 2025 8th Jan	2025-01-08 19:31:17 +08:00
UncleCode	c110d459fb	Update .gitattributes	2025-01-07 21:20:17 +08:00
UncleCode	4d1975e0a7	Update .gitattributes	2025-01-07 21:18:45 +08:00
UncleCode	82734a750c	Update .gitattributes	2025-01-07 21:11:45 +08:00
UncleCode	56fa4e1e42	refactor(doc) Update README	2025-01-07 20:53:10 +08:00
UncleCode	ca3e33122e	refactor(docs): reorganize documentation structure and update styles Reorganize documentation into core/advanced/extraction sections for better navigation. Update terminal theme styles and add rich library for better CLI output. Remove redundant tutorial files and consolidate content into core sections. Add personal story to index page for project context. BREAKING CHANGE: Documentation structure has been significantly reorganized	2025-01-07 20:49:50 +08:00
UncleCode	ae376f15fb	docs(extraction): add clarifying comments for CSS selector behavior Add explanatory comments to JsonCssExtractionStrategy._get_elements() method to clarify that it returns all matching elements using select() instead of select_one(). This helps developers understand the method's behavior and its difference from single element selection. Removed trailing whitespace at end of file.	2025-01-05 19:39:15 +08:00
UncleCode	72fbdac467	fix(extraction): JsonCss selector and crawler improvements - Fix JsonCssExtractionStrategy._get_elements to return all matching elements instead of just one - Add robust error handling to page_need_scroll with default fallback - Improve JSON extraction strategies documentation - Refactor content scraping strategy - Update version to 0.4.247	2025-01-05 19:26:46 +08:00
UncleCode	0857c7b448	Merge branch 'main' of https://github.com/unclecode/crawl4ai into next	2025-01-05 17:05:59 +08:00
Guilume	07b4c1c0ed	fix: not working long page screenshot (#403)	2025-01-05 17:04:34 +08:00
UncleCode	196dc79ec7	fix: prevent memory leaks by ensuring proper closure of Playwright pages - Fixes critical memory leak issue where browser pages remained open - Ensures proper cleanup of Playwright resources after page operations - Improves resource management in browser farm implementation This is an urgent fix to address resource leakage that could impact system stability.	2025-01-03 21:17:23 +08:00
UncleCode	24b3da717a	refactor(): - Update hello world example	2025-01-02 17:53:30 +08:00
UncleCode	98acc4254d	refactor: - Update hello_world.py example	2025-01-01 19:47:22 +08:00
UncleCode	eac78c7993	Merge branch 'vr0.4.246'	2025-01-01 19:43:01 +08:00
UncleCode	da1bc0f7bf	Update version file	2025-01-01 19:42:35 +08:00
UncleCode	aa4f92f458	refactor(crawler): - Update hello_world example with proper content filtering	2025-01-01 19:39:42 +08:00
UncleCode	a96e05d4ae	refactor(crawler): optimize response handling and default settings - Set wait_for_images default to false for better performance - Simplify response attribute copying in AsyncWebCrawler - Update hello_world example with proper content filtering	2025-01-01 19:39:02 +08:00
UncleCode	5c95fd92b4	fix(browser): resolve merge conflicts in browser channel configuration	2025-01-01 19:05:47 +08:00
UncleCode	4cb2a62551	Update README	2025-01-01 18:59:55 +08:00
UncleCode	5b4fad9e25	- Bump version to 0.4.244	2025-01-01 18:58:43 +08:00
UncleCode	ea0ac25f38	refactor(browser): Update browser channel default to 'chromium' in BrowserConfig.from_args method	2025-01-01 18:58:15 +08:00
UncleCode	7688aca7d6	Update Version	2025-01-01 18:44:27 +08:00
UncleCode	a7215ad972	fix(browser): update default browser channel to chromium and simplify channel selection logic	2025-01-01 18:38:33 +08:00
Arno.Edwards	8e2403a7da	fix(browser)!: default to Chromium channel for new headless mode (#387) BREAKING CHANGE: Updated `chrome_channel` to "chromium" to fix compatibility with the new Chromium headless implementation. This resolves the error `playwright._impl._errors.Error: BrowserType.launch: Chromium distribution 'chrome' is not found`, caused by the removal of the old headless mode in Chromium. With this change, channels like "chrome" and "msedge" now default to the new headless mode, aligning with upstream updates in Playwright v1.49. The new headless mode uses the real Chrome browser, offering more authenticity, reliability, and feature parity with the full browser. Additionally, simplified fallback logic by directly assigning `chrome_channel` based on `browser_type` or defaulting to "chromium". Refer to: - https://playwright.dev/python/docs/browsers#chromium - https://github.com/microsoft/playwright/issues/33566	2025-01-01 18:37:50 +08:00
UncleCode	318554e6bf	Merge branch 'v0.4.243' v0.4.243	2025-01-01 18:11:15 +08:00
UncleCode	c64979b8dd	docs: update README	2025-01-01 18:10:38 +08:00
UncleCode	bfe21b29d4	build: streamline package discovery and bump to v0.4.243 - Replace explicit package listing with setuptools.find - Include all crawl4ai.* packages automatically - Use `packages = {find = {where = ["."], include = ["crawl4ai*"]}}` syntax - Bump version to 0.4.243 This change simplifies package maintenance by automatically discovering all subpackages under crawl4ai namespace instead of listing them manually.	2025-01-01 17:55:59 +08:00
UncleCode	e9d9a6ffe8	fix: ensure js_snippet files are included in package - Add js_snippet to packages list in pyproject.toml - Verified JS files are properly included in installed package - Bump version to 0.4.242	2025-01-01 17:38:59 +08:00
UncleCode	5313c71a0d	docs: update README browser installation command - Remove Chrome from manual installation command - Keep Chromium as the only default browser in docs	2025-01-01 17:24:44 +08:00
UncleCode	d36ef3d424	refactor(install): use chromium as default browser - Remove Chrome installation to reduce setup time - Keep Chromium as default browser for better cross-platform compatibility	2025-01-01 17:19:54 +08:00
UncleCode	4a4f613238	docs: simplify installation instructions - Add crawl4ai-doctor command to verify installation - Update browser installation instructions in README and docs - Move optional features to documentation - Add manual browser installation steps as fallback - Update getting-started guide with verification step	2025-01-01 16:54:03 +08:00
UncleCode	dc6a24618e	feat(install): add doctor command and force browser install - Add --force flag to Playwright browser installation - Add doctor command to test crawling functionality - Install Chrome and Chromium browsers explicitly - Add crawl4ai-doctor entry point in pyproject.toml - Implement simple health check focused on crawling test	2025-01-01 16:33:43 +08:00
UncleCode	74a7c6dbb6	feat(install): specify chrome and chromium for playwright - Install Chrome and Chromium browsers explicitly - Split browser installation into separate commands	2025-01-01 16:10:08 +08:00
UncleCode	67f65f958b	refactor(build): simplify setup.py configuration - Remove dependency management from setup.py - Remove entry points configuration (moved to pyproject.toml) - Keep minimal setup.py for backwards compatibility - Clean up package metadata structure	2025-01-01 15:52:01 +08:00
UncleCode	78b6ba5cef	build: modernize package configuration with pyproject.toml - Add pyproject.toml for PEP 517 build system support - Configure dependencies, scripts, and metadata in pyproject.toml - Set Python requirement to >=3.9 and add support up to 3.13 - Keep setup.py for backwards compatibility - Move package dependencies and entry points to pyproject.toml	2025-01-01 15:45:27 +08:00
UncleCode	3f019d34cc	docs: update project description emojis - Change project description emojis from 🔥🕷️ to 🚀🤖 - Update emojis consistently in both setup.py and pyproject.toml	2025-01-01 15:39:33 +08:00
UncleCode	304260e484	refactor(install): simplify Playwright installation error handling - Remove setup_docs() call from post_install() - Simplify error messages for Playwright installation failures - Use sys.executable for more accurate Python path in error messages - Add --with-deps flag to Playwright install command	2025-01-01 15:33:36 +08:00
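
Commit c3370ec5da above replaces the ScrapingMode enum with a strategy pattern built around a ContentScrapingStrategy abstract base class. The following is a minimal sketch of that pattern in general, not the project's actual code: only the name ContentScrapingStrategy comes from the commit message, while the `scrape` method, the `ScrapingResult` dataclass, the two concrete classes, and `run_scrape` are hypothetical and use only the Python standard library.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from html.parser import HTMLParser
from typing import List


@dataclass
class ScrapingResult:
    """Hypothetical result container (not crawl4ai's real model)."""
    title: str = ""
    links: List[str] = field(default_factory=list)


class ContentScrapingStrategy(ABC):
    """Base class named in the commit; the method name is an assumption."""

    @abstractmethod
    def scrape(self, html: str) -> ScrapingResult:
        ...


class _TitleLinkParser(HTMLParser):
    """Collects the <title> text and every <a href=...> value."""

    def __init__(self) -> None:
        super().__init__()
        self.title = ""
        self.links: List[str] = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


class StdlibScrapingStrategy(ContentScrapingStrategy):
    """Hypothetical strategy: extracts title and links."""

    def scrape(self, html: str) -> ScrapingResult:
        parser = _TitleLinkParser()
        parser.feed(html)
        parser.close()
        return ScrapingResult(title=parser.title.strip(), links=parser.links)


class TitleOnlyScrapingStrategy(ContentScrapingStrategy):
    """Hypothetical cheaper strategy: extracts the title and skips links."""

    def scrape(self, html: str) -> ScrapingResult:
        full = StdlibScrapingStrategy().scrape(html)
        return ScrapingResult(title=full.title)


def run_scrape(html: str, strategy: ContentScrapingStrategy) -> ScrapingResult:
    # The caller passes a strategy object instead of an enum value,
    # so new parsers can be added without touching this function.
    return strategy.scrape(html)
```

With an enum, every new parser would need another branch in the calling code; with strategy objects, the caller depends only on the base class, which is presumably why the enum introduced in f3ae5a657c was removed a day later.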

551 Commits