crawl4ai

Author	SHA1	Message	Date
Aravind Karnam	9109ecd8fc	chore: Raise an exception with clear messaging when body tag is missing in the fetched html. The message should warn users to add appropriate wait_for condition to wait until body tag is loaded into DOM. fixes: https://github.com/unclecode/crawl4ai/issues/804	2025-03-18 15:26:44 +05:30
Aravind Karnam	84883be513	Merge branch 'next' into 2025-MAR-ALPHA-1	2025-03-18 15:12:21 +05:30
UncleCode	a24799918c	feat(llm): add additional LLM configuration parameters Extend LLMConfig class to support more fine-grained control over LLM behavior by adding: - temperature control - max tokens limit - top_p sampling - frequency and presence penalties - stop sequences - number of completions These parameters allow for better customization of LLM responses.	2025-03-14 21:36:23 +08:00
UncleCode	a31d7b86be	feat(changelog): update CHANGELOG for version 0.5.0.post5 with new features, changes, fixes, and breaking changes	2025-03-14 15:26:37 +08:00
UncleCode	7884a98be7	feat(crawler): add experimental parameters support and optimize browser handling Add experimental parameters dictionary to CrawlerRunConfig to support beta features Make CSP nonce headers optional via experimental config Remove default cookie injection Clean up browser context creation code Improve code formatting in API handler BREAKING CHANGE: Default cookie injection has been removed from page initialization	2025-03-14 14:39:24 +08:00
Aravind Karnam	c190ba816d	refactor: Instead of custom validation of question, rely on the built in FastAPI validator, so generated API docs also reflects this expectation correctly	2025-03-14 09:40:50 +05:30
Aravind Karnam	a3954dd4c6	refactor: Move the checking of protocol and prepending protocol inside api handlers	2025-03-14 09:39:10 +05:30
UncleCode	6e3c048328	feat(api): refactor crawl request handling to streamline single and multiple URL processing	2025-03-13 22:30:38 +08:00
UncleCode	b750542e6d	feat(crawler): optimize single URL handling and add performance comparison Add special handling for single URL requests in Docker API to use arun() instead of arun_many() Add new example script demonstrating performance differences between sequential and parallel crawling Update cache mode from aggressive to bypass in examples and tests Remove unused dependencies (zstandard, msgpack) BREAKING CHANGE: Changed default cache_mode from aggressive to bypass in examples	2025-03-13 22:15:15 +08:00
Aravind Karnam	cbb8755972	Merge branch 'next' into 2025-MAR-ALPHA-1	2025-03-13 10:42:22 +05:30
UncleCode	dc36997a08	feat(schema): improve HTML preprocessing for schema generation Add new preprocess_html_for_schema utility function to better handle HTML cleaning for schema generation. This replaces the previous optimize_html function in the GoogleSearchCrawler and includes smarter attribute handling and pattern detection. Other changes: - Update default provider to gpt-4o - Add DEFAULT_PROVIDER_API_KEY constant - Make LLMConfig creation more flexible with create_llm_config helper - Add new dependencies: zstandard and msgpack This change improves schema generation reliability while reducing noise in the processed HTML.	2025-03-12 22:40:46 +08:00
UncleCode	1630fbdafe	feat(monitor): add real-time crawler monitoring system with memory management Implements a comprehensive monitoring and visualization system for tracking web crawler operations in real-time. The system includes: - Terminal-based dashboard with rich UI for displaying task statuses - Memory pressure monitoring and adaptive dispatch control - Queue statistics and performance metrics tracking - Detailed task progress visualization - Stress testing framework for memory management This addition helps operators track crawler performance and manage memory usage more effectively.	2025-03-12 19:05:24 +08:00
UncleCode	9547bada3a	feat(content): add target_elements parameter for selective content extraction Adds new target_elements parameter to CrawlerRunConfig that allows more flexible content selection than css_selector. This enables focusing markdown generation and data extraction on specific elements while still processing the entire page for links and media. Key changes: - Added target_elements list parameter to CrawlerRunConfig - Modified WebScrapingStrategy and LXMLWebScrapingStrategy to handle target_elements - Updated documentation with examples and comparison between css_selector and target_elements - Fixed table extraction in content_scraping_strategy.py BREAKING CHANGE: Table extraction logic has been modified to better handle thead/tbody structures	2025-03-10 18:54:51 +08:00
UncleCode	9d69fce834	feat(scraping): add smart table extraction and analysis capabilities Add comprehensive table detection and extraction functionality to the web scraping system: - Implement intelligent table detection algorithm with scoring system - Add table extraction with support for headers, rows, captions - Update models to include tables in Media class - Add table_score_threshold configuration option - Add documentation and examples for table extraction - Include crypto analysis example demonstrating table usage This change enables users to extract structured data from HTML tables while intelligently filtering out layout tables.	2025-03-09 21:31:33 +08:00
UncleCode	c6a605ccce	feat(filters): add reverse option to URLPatternFilter Adds a new 'reverse' parameter to URLPatternFilter that allows inverting the filter's logic. When reverse=True, URLs that would normally match are rejected and vice versa. Also removes unused 'scraped_html' from WebScrapingStrategy output to reduce memory usage. BREAKING CHANGE: WebScrapingStrategy no longer returns 'scraped_html' in its output dictionary	2025-03-08 18:54:41 +08:00
UncleCode	4aeb7ef9ad	refactor(proxy): consolidate proxy configuration handling Moves ProxyConfig from configs/ directory into proxy_strategy.py to improve code organization and reduce fragmentation. Updates all imports and type hints to reflect the new location. Key changes: - Moved ProxyConfig class from configs/proxy_config.py to proxy_strategy.py - Updated type hints in async_configs.py to support ProxyConfig - Fixed proxy configuration handling in browser_manager.py - Updated documentation and examples to use new import path BREAKING CHANGE: ProxyConfig import path has changed from crawl4ai.configs to crawl4ai.proxy_strategy	2025-03-07 23:14:11 +08:00
UncleCode	a68cbb232b	feat(browser): add standalone CDP browser launch and lxml extraction strategy Add new features to enhance browser automation and HTML extraction: - Add CDP browser launch capability with customizable ports and profiles - Implement JsonLxmlExtractionStrategy for faster HTML parsing - Add CLI command 'crwl cdp' for launching standalone CDP browsers - Support connecting to external CDP browsers via URL - Optimize selector caching and context-sensitive queries BREAKING CHANGE: LLMConfig import path changed from crawl4ai.types to crawl4ai	2025-03-07 20:55:56 +08:00
UncleCode	f78c46446b	feat(deep-crawling): improve URL normalization and domain filtering Enhance URL handling in deep crawling with: - New URL normalization functions for consistent URL formats - Improved domain filtering with subdomain support - Added URLPatternFilter to public API - Better URL deduplication in BFS strategy These changes improve crawling accuracy and reduce duplicate visits.	2025-03-06 22:45:57 +08:00
UncleCode	1b72880007	chore(version): bump version to 0.5.0.post3	2025-03-06 20:32:32 +08:00
UncleCode	29f7915b79	fix(models): support float timestamps in CrawlStats Modify CrawlStats class to handle both datetime and float timestamp formats for start_time and end_time fields. This change improves compatibility with different time formats while maintaining existing functionality. Other minor changes: - Add datetime import in async_dispatcher - Update JsonElementExtractionStrategy kwargs handling No breaking changes.	2025-03-06 20:30:57 +08:00
UncleCode	2327db6fdc	refactor(crawler): introduce CrawlResultContainer and simplify interfaces Introduces a new generic CrawlResultContainer class to standardize return types and improve type safety. Removes legacy parameter handling and simplifies method signatures. This change makes the API more consistent and easier to maintain. BREAKING CHANGE: Synchronous crawler methods now always return CrawlResultContainer instead of raw CrawlResult or List[CrawlResult]. Legacy parameters have been removed from method signatures.	2025-03-05 22:23:08 +08:00
UncleCode	3a234ec950	fix(auth): make JWT authentication optional with fallback Modify authentication system to gracefully handle cases where JWT is not enabled or token is missing. This includes: - Making HTTPBearer auto_error=False to prevent automatic 403 errors - Updating token dependency to return None when JWT is disabled - Fixing model deserialization in CrawlResult - Updating documentation links - Cleaning up imports BREAKING CHANGE: Authentication behavior changed to be more permissive when JWT is disabled	2025-03-05 17:14:42 +08:00
UncleCode	9e89d27fcd	chore(version): bump version to 0.5.0.post2	2025-03-05 14:18:29 +08:00
UncleCode	b3ec7ce960	Merge branch 'vr0.5.0.post1' into next	2025-03-05 14:17:19 +08:00
UncleCode	baee4949d3	refactor(llm): rename LlmConfig to LLMConfig for consistency Rename LlmConfig to LLMConfig across the codebase to follow consistent naming conventions. Update all imports and usages to use the new name. Update documentation and examples to reflect the change. BREAKING CHANGE: LlmConfig has been renamed to LLMConfig. Users need to update their imports and usage.	2025-03-05 14:17:04 +08:00
UncleCode	9c58e4ce2e	fix(docs): correct section numbering in deepcrawl_example.py tutorial v0.5.0.post1	2025-03-04 20:57:33 +08:00
UncleCode	df6a6d5f4f	refactor(docs): reorganize tutorial sections and update wrap-up example	2025-03-04 20:55:09 +08:00
UncleCode	e896c08f9c	chore(version): bump version to 0.5.0.post1	2025-03-04 20:29:27 +08:00
UncleCode	56bc3c6e45	refactor(cli): improve CLI default command handling Make 'crawl' the default command when no command is specified. This improves user experience by allowing direct URL input without explicitly specifying the 'crawl' command. Also removes unnecessary blank lines in example code for better readability.	2025-03-04 20:28:16 +08:00
UncleCode	cbef406f9b	docs: update README for version 0.5.0 release with new features and CLI commands	2025-03-04 19:24:46 +08:00
UncleCode	8a76563018	chore(docs): update site version to v0.5.x in mkdocs configuration	2025-03-04 18:30:03 +08:00
UncleCode	415c1c5bee	refactor(core): replace float('inf') with math.inf Replace float('inf') and float('-inf') with math.inf and -math.inf from the math module for better readability and performance. Also clean up imports and remove unused speed comparison code. No breaking changes.	2025-03-04 18:23:55 +08:00
UncleCode	f334daa979	feat(deep-crawling): add max_pages and score_threshold parameters for improved crawling control	2025-03-03 21:54:58 +08:00
Aravind Karnam	504207faa6	docs: update text in llm-strategies.md to reflect new changes in LlmConfig	2025-03-03 19:24:44 +05:30
UncleCode	d024749633	refactor(deep-crawl): add max_pages limit and improve crawl control Add max_pages parameter to all deep crawling strategies to limit total pages crawled. Add score_threshold parameter to BFS/DFS strategies for quality control. Remove legacy parameter handling in AsyncWebCrawler. Improve error handling and logging in crawl strategies. BREAKING CHANGE: Removed support for legacy parameters in AsyncWebCrawler.run_many()	2025-03-03 21:51:11 +08:00
Aravind	f14e4a4b67	Merge pull request #776 from jawshoeadan/patch-1 Fix LiteLLM branding and link	2025-03-03 19:01:30 +05:30
Aravind Karnam	1e819cdb26	fixes: https://github.com/unclecode/crawl4ai/issues/774	2025-03-03 11:53:15 +05:30
jawshoeadan	5edfea279d	Fix LiteLLM branding and link	2025-03-02 16:58:00 +01:00
UncleCode	c612f9a852	feat(profiles): add CLI command for crawling with browser profiles Adds new functionality to crawl websites using saved browser profiles directly from the CLI. This includes: - New CLI option to use profiles for crawling - Helper functions for profile-based crawling - Fixed type hints for config parameters - Updated example to show browser window by default This makes it easier for users to leverage saved browser profiles for crawling without writing code.	2025-03-02 21:33:33 +08:00
UncleCode	95175cb394	feat(cli): add browser profile management functionality Adds new interactive browser profile management system that allows users to: - Create and manage browser profiles for authenticated crawling - List existing profiles with detailed information - Delete unused profiles - Use profiles during crawling with the new -p/--profile flag Also restructures CLI to use Click groups and adds humanize dependency for better size formatting.	2025-03-02 20:54:45 +08:00
UncleCode	cba4a466e5	feat(browser): add BrowserProfiler class for identity-based browsing Adds a new BrowserProfiler class that provides comprehensive management of browser profiles for identity-based crawling. Features include: - Interactive profile creation and management - Profile listing, retrieval, and deletion - Guided console interface - Migration of profile management from ManagedBrowser - New example script for identity-based browsing ALSO: - Updates logging format in AsyncWebCrawler - Removes content filter from hello_world example - Relaxes httpx version constraint BREAKING CHANGE: Profile management methods from ManagedBrowser are now deprecated and delegate to BrowserProfiler	2025-03-02 20:32:29 +08:00
Aravind Karnam	7c1705712d	fix: https://github.com/unclecode/crawl4ai/issues/756	2025-03-01 18:17:11 +05:30
Aravind	a9e24307cc	Release prep (#749) * fix: Update export of URLPatternFilter * chore: Add dependancy for cchardet in requirements * docs: Update example for deep crawl in release note for v0.5 * Docs: update the example for memory dispatcher * docs: updated example for crawl strategies * Refactor: Removed wrapping in if __name__==main block since this is a markdown file. * chore: removed cchardet from dependancy list, since unclecode is planning to remove it * docs: updated the example for proxy rotation to a working example * feat: Introduced ProxyConfig param * Add tutorial for deep crawl & update contributor list for bug fixes in feb alpha-1 * chore: update and test new dependancies * feat:Make PyPDF2 a conditional dependancy * updated tutorial and release note for v0.5 * docs: update docs for deep crawl, and fix a typo in docker-deployment markdown filename * refactor: 1. Deprecate markdown_v2 2. Make markdown backward compatible to behave as a string when needed. 3. Fix LlmConfig usage in cli 4. Deprecate markdown_v2 in cli 5. Update AsyncWebCrawler for changes in CrawlResult * fix: Bug in serialisation of markdown in acache_url * Refactor: Added deprecation errors for fit_html and fit_markdown directly on markdown. Now access them via markdown * fix: remove deprecated markdown_v2 from docker * Refactor: remove deprecated fit_markdown and fit_html from result * refactor: fix cache retrieval for markdown as a string * chore: update all docs, examples and tests with deprecation announcements for markdown_v2, fit_html, fit_markdown	2025-02-28 19:53:35 +08:00
UncleCode	3a87b4e43b	fix(dependencies): update cchardet to faust-cchardet for compatibility	2025-02-26 18:25:58 +08:00
UncleCode	4bcd4cbda1	refactor(pdf): improve PDF processor dependency handling Make PyPDF2 an optional dependency and improve import handling in PDF processor. Move imports inside methods to allow for lazy loading and better error handling. Add new 'pdf' optional dependency group in pyproject.toml. Clean up unused imports and remove deprecated files. BREAKING CHANGE: PyPDF2 is now an optional dependency. Users need to install with 'pip install crawl4ai[pdf]' to use PDF processing features.	2025-02-25 22:27:55 +08:00
UncleCode	71ce01c9e1	feat(browser): add cdp_url parameter to BrowserManager initialization	2025-02-24 14:48:02 +08:00
UncleCode	c6d48080a4	feat(logger): add abstract logger base class and file logger implementation Add AsyncLoggerBase abstract class to standardize logger interface and introduce AsyncFileLogger for file-only logging. Remove deprecated always_bypass_cache parameter and clean up AsyncWebCrawler initialization. BREAKING CHANGE: Removed deprecated 'always_by_pass_cache' parameter. Use BrowserConfig cache settings instead.	2025-02-23 21:23:41 +08:00
UncleCode	46d2f12851	chore: remove old Dockerfile and server script	2025-02-22 13:45:04 +08:00
UncleCode	367cd71db9	feat(core): release version 0.5.0 with deep crawling and CLI This major release adds deep crawling capabilities, memory-adaptive dispatcher, multiple crawling strategies, Docker deployment, and a new CLI. It also includes significant improvements to proxy handling, PDF processing, and LLM integration. BREAKING CHANGES: - Add memory-adaptive dispatcher as default for arun_many() - Move max_depth to CrawlerRunConfig - Replace ScrapingMode enum with strategy pattern - Update BrowserContext API - Make model fields optional with defaults - Remove content_filter parameter from CrawlerRunConfig - Remove synchronous WebCrawler and old CLI - Update Docker deployment configuration - Replace FastFilterChain with FilterChain - Change license to Apache 2.0 with attribution clause	2025-02-21 19:55:02 +08:00
Aravind	2af958e12c	Feat/llm config (#724) * feature: Add LlmConfig to easily configure and pass LLM configs to different strategies * pulled in next branch and resolved conflicts * feat: Add gemini and deepseek providers. Make ignore_cache in llm content filter to true by default to avoid confusions * Refactor: Update LlmConfig in LLMExtractionStrategy class and deprecate old params * updated tests, docs and readme	2025-02-21 15:41:37 +08:00

675 Commits