crawl4ai

Author	SHA1	Message	Date
Aravind Karnam	0ff95c83bc	feat: change input params to scraper, Add asynchronous context manager to AsyncWebScraper, Optimise filter application	2025-01-27 18:13:33 +05:30
UncleCode	e6ef8d91ba	refactor(scraper): optimize URL validation and filter performance - Replace validators library with built-in urlparse for URL validation - Optimize filter statistics update logic for better performance - Add performance benchmarking suite for filters - Add execution time tracking to scraper examples - Update gitignore with windsurfrules BREAKING CHANGE: Removed dependency on validators library for URL validation	2025-01-22 19:45:56 +08:00
Aravind Karnam	6e78c56dda	Refactor: Removed all scheduling logic from scraper. From now scraper expects arun_many to handle all scheduling. Scraper will only do traversal, validations, compliance checks, URL filtering and scoring etc. Reformatted some of the scraper files with Black code formatter	2025-01-21 18:44:43 +05:30
Aravind Karnam	67fa06c09b	Refactor: Removed all scheduling logic from scraper. From now scraper expects arun_many to handle all scheduling. Scraper will only do traversal, validations, compliance checks, URL filtering and scoring etc. Reformatted some of the scraper files with Black code formatter	2025-01-21 17:49:51 +05:30
Aravind Karnam	26d78d8512	Merge branch 'next' into feature/scraper	2025-01-21 12:35:45 +05:30
Aravind	a677c2b61d	Merge pull request #496 from aravindkarnam/scraper-uc Trying to merge scraper on-going development with new developments in parallel processing	2025-01-20 16:55:41 +05:30
UncleCode	2cec527a22	feat(extraction): add LLM-powered schema generation utility Adds new static method generate_schema() to JsonElementExtractionStrategy classes that can automatically generate extraction schemas using LLM (OpenAI or Ollama). This provides a convenient way to bootstrap extraction schemas while maintaining the performance benefits of selector-based extraction. Key changes: - Added generate_schema() static method to base extraction strategy - Added support for both CSS and XPath schema generation - Updated documentation with examples and best practices - Added new prompt templates for schema generation	2025-01-20 17:28:00 +08:00
UncleCode	4b1309cbf2	feat(crawler): add URL redirection tracking Add capability to track and return final URLs after redirects in crawler responses. This enhancement helps users understand the actual destination of crawled URLs after any redirections. Changes include: - Added final_url tracking in AsyncPlaywrightCrawlerStrategy - Added redirected_url field to CrawlResult model - Updated AsyncWebCrawler to properly handle and store redirect URLs - Fixed typo in documentation signature	2025-01-19 19:53:38 +08:00
UncleCode	8b6fe6a98f	docs(api): add streaming mode documentation and examples Add comprehensive documentation for the new streaming mode feature in arun_many(): - Update arun_many() API docs to reflect streaming return type - Add streaming examples in quickstart and multi-url guides - Document stream parameter in configuration classes - Add clone() helper method documentation for configs This change improves documentation for processing large numbers of URLs efficiently.	2025-01-19 18:21:34 +08:00
UncleCode	1221be30a3	feat(browser): improve browser context management and add shared data support Add shared_data parameter to CrawlerRunConfig to allow data sharing between hooks. Implement browser context reuse based on config signatures to improve memory usage. Fix Firefox/Webkit channel settings. Add config parameter to hook callbacks for better context access. Remove debug print statements. BREAKING CHANGE: Hook callback signatures now include config parameter	2025-01-19 17:12:03 +08:00
UncleCode	3d09b6a221	feat(content-filter): add LLMContentFilter for intelligent markdown generation Add new LLMContentFilter class that uses LLMs to generate high-quality markdown content: - Implement intelligent content filtering with customizable instructions - Add chunk processing for handling large documents - Support parallel processing of content chunks - Include caching mechanism for filtered results - Add usage tracking and statistics - Update documentation with examples and use cases Also includes minor changes: - Disable Pydantic warnings in __init__.py - Add new prompt template for content filtering	2025-01-18 19:31:07 +08:00
UncleCode	8ec12d7d68	Apply Ruff Corrections	2025-01-13 19:19:58 +08:00
UncleCode	c3370ec5da	refactor(scraping): replace ScrapingMode enum with strategy pattern Replace the ScrapingMode enum with a proper strategy pattern implementation for content scraping. This change introduces: - New ContentScrapingStrategy abstract base class - Concrete WebScrapingStrategy and LXMLWebScrapingStrategy implementations - New Pydantic models for structured scraping results - Updated documentation reflecting the new strategy-based approach BREAKING CHANGE: ScrapingMode enum has been removed. Users should now use ContentScrapingStrategy implementations instead.	2025-01-13 17:53:12 +08:00
UncleCode	f3ae5a657c	feat(scraping): add LXML-based scraping mode for improved performance Adds a new ScrapingMode enum to allow switching between BeautifulSoup and LXML parsing. LXML mode offers 10-20x better performance for large HTML documents. Key changes: - Added ScrapingMode enum with BEAUTIFULSOUP and LXML options - Implemented LXMLWebScrapingStrategy class - Added LXML-based metadata extraction - Updated documentation with scraping mode usage and performance considerations - Added cssselect dependency BREAKING CHANGE: None	2025-01-12 20:46:23 +08:00
UncleCode	825c78a048	refactor(dispatcher): migrate to modular dispatcher system with enhanced monitoring Reorganize dispatcher functionality into separate components: - Create dedicated dispatcher classes (MemoryAdaptive, Semaphore) - Add RateLimiter for smart request throttling - Implement CrawlerMonitor for real-time progress tracking - Move dispatcher config from CrawlerRunConfig to separate classes BREAKING CHANGE: Dispatcher configuration moved from CrawlerRunConfig to dedicated dispatcher classes. Users need to update their configuration approach for multi-URL crawling.	2025-01-11 21:10:27 +08:00
UncleCode	3865342c93	Merge branch 'next' into next-cdp	2025-01-10 16:01:49 +08:00
UncleCode	ac5f461d40	feat(crawler): add memory-adaptive dispatcher with rate limiting Implements a new MemoryAdaptiveDispatcher class to manage concurrent crawling operations with memory monitoring and rate limiting capabilities. Changes include: - Added RateLimitConfig dataclass for configuring rate limiting behavior - Extended CrawlerRunConfig with dispatcher-related settings - Refactored arun_many to use the new dispatcher system - Added memory threshold and session permit controls - Integrated optional progress monitoring display BREAKING CHANGE: The arun_many method now uses MemoryAdaptiveDispatcher by default, which may affect concurrent crawling behavior	2025-01-10 16:01:18 +08:00
UncleCode	f9c601eb7e	docs(urls): update documentation URLs to new domain Update all documentation URLs from crawl4ai.com/mkdocs to docs.crawl4ai.com across README, examples, and documentation files. This change reflects the new documentation hosting domain. Also add todo/ directory to .gitignore.	2025-01-09 16:24:41 +08:00
UncleCode	e8b4ac6046	docs(urls): update documentation URLs to new domain Update all documentation URLs from crawl4ai.com/mkdocs to docs.crawl4ai.com Improve badges styling and layout in documentation Increase code font size in documentation CSS BREAKING CHANGE: Documentation URLs have changed from crawl4ai.com/mkdocs to docs.crawl4ai.com	2025-01-09 16:22:41 +08:00
UncleCode	1c9464b988	Update all documents	2025-01-08 19:31:31 +08:00
UncleCode	ca3e33122e	refactor(docs): reorganize documentation structure and update styles Reorganize documentation into core/advanced/extraction sections for better navigation. Update terminal theme styles and add rich library for better CLI output. Remove redundant tutorial files and consolidate content into core sections. Add personal story to index page for project context. BREAKING CHANGE: Documentation structure has been significantly reorganized	2025-01-07 20:49:50 +08:00
UncleCode	72fbdac467	fix(extraction): JsonCss selector and crawler improvements - Fix JsonCssExtractionStrategy._get_elements to return all matching elements instead of just one - Add robust error handling to page_need_scroll with default fallback - Improve JSON extraction strategies documentation - Refactor content scraping strategy - Update version to 0.4.247	2025-01-05 19:26:46 +08:00
UncleCode	24b3da717a	refactor(): - Update hello world example	2025-01-02 17:53:30 +08:00
UncleCode	98acc4254d	refactor: - Update hello_world.py example	2025-01-01 19:47:22 +08:00
UncleCode	aa4f92f458	refactor(crawler): - Update hello_world example with proper content filtering	2025-01-01 19:39:42 +08:00
UncleCode	4a4f613238	docs: simplify installation instructions - Add crawl4ai-doctor command to verify installation - Update browser installation instructions in README and docs - Move optional features to documentation - Add manual browser installation steps as fallback - Update getting-started guide with verification step	2025-01-01 16:54:03 +08:00
UncleCode	67f65f958b	refactor(build): simplify setup.py configuration - Remove dependency management from setup.py - Remove entry points configuration (moved to pyproject.toml) - Keep minimal setup.py for backwards compatibility - Clean up package metadata structure	2025-01-01 15:52:01 +08:00
UncleCode	bd66befcf0	Fix issue in 0.4.24 walkthrough	2024-12-31 21:07:58 +08:00
UncleCode	19b0a5ae82	Update 0.4.24 walkthrough	2024-12-31 21:01:46 +08:00
UncleCode	bd71f7f4ea	Add 0.4.24 walkthrough	2024-12-31 20:22:33 +08:00
UncleCode	5c3c05bf93	docs: update README badges and Docker section, reorganize documentation structure	2024-12-31 19:45:02 +08:00
UncleCode	67d0999bc3	chore: resolve merge conflicts for v0.4.24	2024-12-31 19:24:03 +08:00
UncleCode	0ec593fa90	Update the Tutorial section for new document version	2024-12-31 17:27:31 +08:00
UncleCode	fb33a24891	Added examples for Amazon product data extraction methods - Updated configuration options and enhanced documentation - Minor refactoring for improved performance and readability - Cleaned up version control settings.	2024-12-29 20:05:18 +08:00
Robin Singh	78768fd714	Update simple-crawling.md (#379) In the comprehensive example, AttributeError: type object 'CacheMode' has no attribute 'ENABLE'. Did you mean: 'ENABLED'?	2024-12-27 17:42:59 +08:00
UncleCode	f2d9912697	Renames browser_config param to config in AsyncWebCrawler Standardizes parameter naming convention across the codebase by renaming browser_config to the more concise config in AsyncWebCrawler constructor. Updates all documentation examples and internal usages to reflect the new parameter name for consistency. Also improves hook execution by adding url/response parameters to goto hooks and fixes parameter ordering in before_return_html hook.	2024-12-26 16:34:36 +08:00
UncleCode	9a4ed6bbd7	Enhance crawler capabilities and documentation - Added SSL certificate extraction in AsyncWebCrawler. - Introduced new content filters and chunking strategies for more robust data extraction. - Updated documentation management to streamline user experience.	2024-12-26 15:17:07 +08:00
UncleCode	d5ed451299	Enhance crawler capabilities and documentation - Add llm.txt generator - Added SSL certificate extraction in AsyncWebCrawler. - Introduced new content filters and chunking strategies for more robust data extraction. - Updated documentation.	2024-12-25 21:34:31 +08:00
Haopeng138	bacbeb3ed4	Fix #340 example llm_extraction (#358 ) @Haopeng138 Thank you so much. They are still part of the library. I forgot to update them since I moved the asynchronous versions years ago. I really appreciate it. I have to say that I feel weak in the documentation. That's why I spent a lot of time on it last week. Now, when you mention some of the things in the example folder, I realize I forgot about the example folder. I'll try to update it more. If you find anything else, please help and support. Thank you. I will add your name to contributor name as well.	2024-12-24 19:56:07 +08:00
UncleCode	84b311760f	Enhance Crawl4AI with CLI and documentation updates - Implemented Command-Line Interface (CLI) in `crawl4ai/cli.py` - Added chunking strategies and their documentation in `llm.txt`	2024-12-21 14:26:56 +08:00
UncleCode	849765712f	Enhance Crawl4AI with new features and documentation - Fix crawler text mode for improved performance; cover missing `srcset` and `data_srcset` attributes in image tags. - Introduced Managed Browsers for enhanced crawling experience. - Updated documentation for clearer navigation on configuration. - Changed 'text_only' to 'text_mode' in configuration and methods. - Improved performance and relevance in content filtering strategies.	2024-12-19 21:02:29 +08:00
Aravind Karnam	7a5f83b76f	fix: Added browser config and crawler run config from 0.4.22	2024-12-18 10:33:09 +05:30
aravind	7c0fa269a6	Merge pull request #9 from aravindkarnam/main Pulling version 0.4.22 from main into scraper	2024-12-17 18:43:36 +05:30
UncleCode	a11d9646e3	Enhance crawler features and improve documentation - Added detailed CrawlerRunConfig parameters documentation. - Introduced plans for real-time event-driven crawling. - Updated async logger default level to DEBUG for better insights. - Improved structure and readability in configuration file. - Enhanced documentation on future capabilities in new blog entries.	2024-12-16 18:52:51 +08:00
UncleCode	e9e5b5642d	Fix js_snippet issue in 0.4.21; bump to 0.4.22	2024-12-15 19:49:30 +08:00
UncleCode	7524aa7b5e	Feature: Add Markdown generation to CrawlerRunConfig - Added markdown generator parameter to CrawlerRunConfig in `async_configs.py`. - Implemented logic for Markdown generation in content scraping in `async_webcrawler.py`. - Updated version number to 0.4.21 in `__version__.py`.	2024-12-13 21:51:38 +08:00
UncleCode	4a72c5ea6e	Add release notes and documentation for version 0.4.2: Configurable Crawlers, Session Management, and Enhanced Screenshot/PDF features	2024-12-12 20:15:50 +08:00
UncleCode	0982c639ae	Enhance AsyncWebCrawler and related configurations - Introduced new configuration classes: BrowserConfig and CrawlerRunConfig. - Refactored AsyncWebCrawler to leverage the new configuration system for cleaner parameter management. - Updated AsyncPlaywrightCrawlerStrategy for better flexibility and reduced legacy parameters. - Improved error handling with detailed context extraction during exceptions. - Enhanced overall maintainability and usability of the web crawler.	2024-12-12 19:35:09 +08:00
UncleCode	5188b7a6a0	Add full-page screenshot and PDF export features - Introduced a new approach for capturing full-page screenshots by exporting them as PDFs first, enhancing reliability and performance. - Added documentation for the feature in `docs/examples/full_page_screenshot_and_pdf_export.md`. - Refactored `perform_completion_with_backoff` in `crawl4ai/utils.py` to include necessary extra parameters. - Updated `quickstart_async.py` to utilize LLM extraction with refined arguments.	2024-12-10 20:59:31 +08:00
UncleCode	5431fa2d0c	Add PDF & screenshot functionality, new tutorial - Added support for exporting pages as PDFs - Enhanced screenshot functionality for long pages - Created a tutorial on dynamic content loading with 'Load More' buttons. - Updated web crawler to handle PDF data in responses.	2024-12-10 20:10:39 +08:00

165 Commits
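
Commit e6ef8d91ba above mentions replacing the validators library with the built-in urlparse for URL validation. The commit log does not show the actual code, so the following is only a minimal sketch of what such a check might look like, using the standard library's `urllib.parse.urlparse`. The function name `is_valid_url` and the http/https-only rule are assumptions for illustration, not the project's implementation.

```python
from urllib.parse import urlparse


def is_valid_url(url: str) -> bool:
    """Hypothetical helper: accept only absolute http(s) URLs with a host."""
    try:
        parts = urlparse(url)
    except ValueError:
        # urlparse raises ValueError on some malformed inputs,
        # e.g. an unterminated IPv6 literal such as "http://[bad".
        return False
    # Assumed rule: require an http/https scheme and a non-empty network location.
    return parts.scheme in ("http", "https") and bool(parts.netloc)


if __name__ == "__main__":
    for candidate in ("https://docs.crawl4ai.com", "ftp://example.com", "not a url"):
        print(candidate, is_valid_url(candidate))
```

A check like this avoids a third-party dependency, at the cost of being looser than a dedicated validator: it confirms the URL's shape, not that the host name is well-formed or reachable.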