Commit Graph

642 Commits

Author SHA1 Message Date
Aravind Karnam
60ce8bbf55 Merge: with v-0.4.3b 2025-01-28 12:59:53 +05:30
Aravind Karnam
85847ff13f feat:
1. Make active_crawls a dict instead of a set and remove the jobs array, for efficient lookup and storage of active crawls and crawl control.
2. Put a lock on active_crawls, so simultaneous push and pop by coroutines don't cause a race condition.
3. Move the depth check logic outside the child link for loop, as source_url doesn't change in the loop.
2025-01-28 12:39:45 +05:30
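Note on 85847ff13f: the dict-plus-lock arrangement described in this commit might look roughly like the sketch below. This is not the scraper's actual code. The class `CrawlRegistry`, its method names, and the url-to-depth mapping are assumptions made for illustration; only the idea of a dict guarded by a lock comes from the commit message.

```python
import asyncio


class CrawlRegistry:
    """Hypothetical stand-in for the scraper's active_crawls bookkeeping."""

    def __init__(self) -> None:
        # Assumed layout: URL -> crawl depth. A dict gives O(1) lookup and
        # lets each active crawl carry its own metadata.
        self._active: dict[str, int] = {}
        self._lock = asyncio.Lock()

    async def add(self, url: str, depth: int) -> bool:
        """Register a crawl; return False if the URL is already active."""
        async with self._lock:
            if url in self._active:
                return False
            self._active[url] = depth
            return True

    async def remove(self, url: str) -> None:
        """Drop a finished crawl; unknown URLs are ignored."""
        async with self._lock:
            self._active.pop(url, None)

    async def snapshot(self) -> dict[str, int]:
        """Return a copy so callers never iterate the live dict."""
        async with self._lock:
            return dict(self._active)


async def demo() -> dict[str, int]:
    registry = CrawlRegistry()

    async def worker(url: str, depth: int) -> None:
        if await registry.add(url, depth):
            await asyncio.sleep(0)  # yield control, as a real fetch would

    urls = ["https://example.com/a", "https://example.com/b", "https://example.com/a"]
    await asyncio.gather(*(worker(u, 1) for u in urls))
    await registry.remove("https://example.com/b")
    return await registry.snapshot()
```

Holding the lock across both the membership check and the write is what keeps two coroutines from registering the same URL if an await is ever introduced between those steps.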
Aravind Karnam
f34b4878cf fix: code formatting 2025-01-28 10:00:01 +05:30
Aravind Karnam
d9324e3454 fix: Move the creation of crawler outside the main loop 2025-01-27 18:31:13 +05:30
Aravind Karnam
0ff95c83bc feat: change input params to scraper, Add asynchronous context manager to AsyncWebScraper, Optimise filter application 2025-01-27 18:13:33 +05:30
Aravind Karnam
bb6450f458 Remove robots.txt compliance from scraper 2025-01-27 11:58:54 +05:30
Aravind Karnam
513d008de5 feat: Merge reviews from unclecode for scorers and filters, and remove robots.txt compliance from the scraper, since that will now be handled by the crawler 2025-01-27 11:54:10 +05:30
UncleCode
dde14eba7d Update README.md (#562) 2025-01-26 11:00:28 +08:00
UncleCode
d0586f09a9 Merge branch 'vr0.4.3b3' 2025-01-25 21:57:29 +08:00
UncleCode
09ac7ed008 feat(demo): uncomment feature demos and add fake-useragent dependency
Uncomments demonstration code for memory dispatcher, streaming support,
content scraping, JSON schema generation, LLM markdown, and robots compliance
in the v0.4.3b2 features demo file. Also adds fake-useragent package as a
project dependency.

This change makes all feature demonstrations active by default and ensures
proper user agent handling capabilities.
2025-01-25 21:56:08 +08:00
UncleCode
97796f39d2 docs(examples): update proxy rotation demo and disable other demos
Modify proxy rotation example to include empty user agent setting and comment out other demo functions for focused testing. This change simplifies the demo file to focus specifically on proxy rotation functionality.

No breaking changes.
2025-01-25 21:52:35 +08:00
UncleCode
4d7f91b378 refactor(user-agent): improve user agent generation system
Redesign user agent generation to be more modular and reliable:
- Add abstract base class UAGen for user agent generation
- Implement ValidUAGenerator using fake-useragent library
- Add OnlineUAGenerator for fetching real-world user agents
- Update browser configurations to use new UA generation system
- Improve client hints generation

This change makes the user agent system more maintainable and provides better real-world user agent coverage.
2025-01-25 21:16:39 +08:00
UncleCode
69a77222ef feat(browser): add CDP URL configuration support
Add support for direct CDP URL configuration in BrowserConfig and ManagedBrowser classes. This allows connecting to remote browser instances using custom CDP endpoints instead of always launching a local browser.

- Added cdp_url parameter to BrowserConfig
- Added cdp_url support in ManagedBrowser.start() method
- Updated documentation for new parameters
2025-01-24 15:53:47 +08:00
UncleCode
0afc3e9e5e refactor(examples): update API usage in features demo
Update the demo script to use the new crawler.arun_many() API instead of dispatcher.run_urls()
and fix result access patterns. Also improve code formatting and remove
extra whitespace.

- Replace dispatcher.run_urls with crawler.arun_many
- Update streaming demo to use new API and correct result access
- Clean up whitespace and formatting
- Simplify result property access patterns
2025-01-23 22:37:29 +08:00
UncleCode
65d33bcc0f style(docs): improve code formatting in features demo
Clean up whitespace and improve readability in v0_4_3b2_features_demo.py:
- Remove excessive blank lines between functions
- Improve config formatting for better readability
- Uncomment memory dispatcher demo in main function

No breaking changes.
2025-01-23 22:36:58 +08:00
UncleCode
6a01008a2b docs(multi-url): improve documentation clarity and update examples
- Restructure multi-URL crawling documentation with better formatting and examples
- Update code examples to use new API syntax (arun_many)
- Add detailed parameter explanations for RateLimiter and Dispatchers
- Enhance CSS styling for better documentation readability
- Fix outdated method calls in feature demo script

BREAKING CHANGE: Updated dispatcher.run_urls() to crawler.arun_many() in examples
2025-01-23 22:33:36 +08:00
UncleCode
cf3e1e748d feat(scraper): add optimized URL scoring system
Implements a new high-performance URL scoring system with multiple scoring strategies:
- FastKeywordRelevanceScorer for keyword matching
- FastPathDepthScorer for URL depth analysis
- FastContentTypeScorer for file type scoring
- FastFreshnessScorer for date-based scoring
- FastDomainAuthorityScorer for domain reputation
- FastCompositeScorer for combining multiple scorers

Key improvements:
- Memory optimization using __slots__
- LRU caching for expensive operations
- Optimized string operations
- Pre-computed scoring tables
- Fast path optimizations for common cases
- Reduced object allocation

Includes comprehensive benchmarking and testing utilities.
2025-01-23 20:46:33 +08:00
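Note on cf3e1e748d: two of the optimizations listed in this commit, `__slots__` and LRU caching, can be illustrated with the sketch below. The classes `KeywordScorer` and `CompositeScorer` and the scoring formula are hypothetical and are not the library's FastKeywordRelevanceScorer or FastCompositeScorer; the real implementations may differ substantially.

```python
from functools import lru_cache


@lru_cache(maxsize=10_000)
def _keyword_hits(url: str, keywords: tuple[str, ...]) -> int:
    """Count how many keywords occur in the URL; cached per (url, keywords)."""
    lowered = url.lower()
    return sum(1 for kw in keywords if kw in lowered)


class KeywordScorer:
    """Hypothetical scorer: fraction of keywords found in the URL, times a weight."""

    # __slots__ removes the per-instance __dict__, which saves memory when
    # many scorer objects are alive at once.
    __slots__ = ("_keywords", "_weight")

    def __init__(self, keywords: list[str], weight: float = 1.0) -> None:
        # A tuple is hashable, so it can be part of the lru_cache key.
        self._keywords = tuple(kw.lower() for kw in keywords)
        self._weight = weight

    def score(self, url: str) -> float:
        if not self._keywords:
            return 0.0
        return self._weight * _keyword_hits(url, self._keywords) / len(self._keywords)


class CompositeScorer:
    """Hypothetical combiner: sums the scores of its child scorers."""

    __slots__ = ("_scorers",)

    def __init__(self, scorers) -> None:
        self._scorers = tuple(scorers)

    def score(self, url: str) -> float:
        return sum(s.score(url) for s in self._scorers)
```

Repeated calls for the same URL hit the cache instead of rescanning the string, which matters when the same links are rediscovered on many pages.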
UncleCode
6dc01eae3a refactor(core): improve type hints and remove unused file
- Add RelevantContentFilter to __init__.py exports
- Update version to 0.4.3b3
- Enhance type hints in async_configs.py
- Remove empty utils.scraping.py file
- Update mkdocs configuration with version info and GitHub integration

BREAKING CHANGE: None
2025-01-23 18:53:22 +08:00
UncleCode
7b7fe84e0d docs(readme): resolve merge conflict and update version info
Resolves merge conflict in README.md by removing outdated version 0.4.24x information and keeping current version 0.4.3bx details. Updates release notes description to reflect current features including Memory Dispatcher System, Streaming Support, and other improvements.

No breaking changes.
2025-01-22 20:52:42 +08:00
UncleCode
5c36f4308f Merge branch 'main' of https://github.com/unclecode/crawl4ai 2025-01-22 20:51:52 +08:00
UncleCode
45809d1c91 Merge branch 'vr0.4.3b2' 2025-01-22 20:51:46 +08:00
UncleCode
357414c345 docs(readme): update version references and fix links
Update version numbers to v0.4.3bx throughout README.md
Fix contributing guidelines link to point to CONTRIBUTORS.md
Update Aravind's role in CONTRIBUTORS.md to Head of Community and Product
Add pre-release installation instructions
Fix minor formatting in personal story section

No breaking changes
2025-01-22 20:46:39 +08:00
UncleCode
260b9120c3 docs(examples): update v0.4.3 features demo to v0.4.3b2
Rename and replace the features demo file to reflect the beta 2 version number.
The old v0.4.3 demo file is removed and replaced with a new beta 2 version.

Renames:
- docs/examples/v0_4_3_features_demo.py -> docs/examples/v0_4_3b2_features_demo.py
2025-01-22 20:41:43 +08:00
UncleCode
976ea52167 docs(examples): update demo scripts and fix output formats
Update example scripts to reflect latest API changes and improve demonstrations:
- Increase test URLs in dispatcher example from 20 to 40 pages
- Comment out unused dispatcher strategies for cleaner output
- Fix scraping strategies performance script to use correct object notation
- Update v0_4_3_features_demo with additional feature mentions and uncomment demo sections

These changes make the examples more current and better aligned with the actual API.
2025-01-22 20:40:03 +08:00
UncleCode
e6ef8d91ba refactor(scraper): optimize URL validation and filter performance
- Replace validators library with built-in urlparse for URL validation
- Optimize filter statistics update logic for better performance
- Add performance benchmarking suite for filters
- Add execution time tracking to scraper examples
- Update gitignore with windsurfrules

BREAKING CHANGE: Removed dependency on validators library for URL validation
2025-01-22 19:45:56 +08:00
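Note on e6ef8d91ba: URL validation with the standard library's `urlparse` could be done along these lines. The function name `is_valid_url` and the exact acceptance rules (http or https scheme, non-empty host) are assumptions; the commit message does not say which checks the scraper applies.

```python
from urllib.parse import urlparse


def is_valid_url(url: str) -> bool:
    """Accept only absolute http(s) URLs that have a network location."""
    try:
        parsed = urlparse(url)
    except ValueError:
        # urlparse raises ValueError for malformed input such as an
        # unterminated IPv6 literal.
        return False
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)
```

This is looser than a dedicated validation library, since it checks structure only and does not verify that the host name is well formed.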
UncleCode
2d69bf2366 refactor(models): rename final_url to redirected_url for consistency
Renames the final_url field to redirected_url across all components to maintain
consistent terminology throughout the codebase. This change affects:
- AsyncCrawlResponse model
- AsyncPlaywrightCrawlerStrategy
- Documentation and examples

No functional changes, purely naming consistency improvement.
2025-01-22 17:14:24 +08:00
UncleCode
dee5fe9851 feat(proxy): add proxy rotation support and documentation
Implements dynamic proxy rotation functionality with authentication support and IP verification. Updates include:
- Added proxy rotation demo in features example
- Updated proxy configuration handling in BrowserManager
- Added proxy rotation documentation
- Updated README with new proxy rotation feature
- Bumped version to 0.4.3b2

This change enables users to dynamically switch between proxies and verify IP addresses for each request.
2025-01-22 16:11:01 +08:00
UncleCode
88697c4630 docs(readme): update version and feature announcements for v0.4.3b1
Update README.md to announce version 0.4.3b1 release with new features including:
- Memory Dispatcher System
- Streaming Support
- LLM-Powered Markdown Generation
- Schema Generation
- Robots.txt Compliance

Add detailed version numbering explanation section to help users understand pre-release versions.
2025-01-21 21:20:04 +08:00
Aravind Karnam
6e78c56dda Refactor: Removed all scheduling logic from the scraper. From now on, the scraper expects arun_many to handle all scheduling. The scraper will only do traversal, validations, compliance checks, URL filtering, scoring, etc. Reformatted some of the scraper files with the Black code formatter 2025-01-21 18:44:43 +05:30
UncleCode
16b8d4945b feat(release): prepare v0.4.3 beta release
Prepare the v0.4.3 beta release with major feature additions and improvements:
- Add JsonXPathExtractionStrategy and LLMContentFilter to exports
- Update version to 0.4.3b1
- Improve documentation for dispatchers and markdown generation
- Update development status to Beta
- Reorganize changelog format

BREAKING CHANGE: Memory threshold in MemoryAdaptiveDispatcher increased to 90% and SemaphoreDispatcher parameter renamed to max_session_permit
2025-01-21 21:03:11 +08:00
Aravind Karnam
67fa06c09b Refactor: Removed all scheduling logic from the scraper. From now on, the scraper expects arun_many to handle all scheduling. The scraper will only do traversal, validations, compliance checks, URL filtering, scoring, etc. Reformatted some of the scraper files with the Black code formatter 2025-01-21 17:49:51 +05:30
UncleCode
d09c611d15 feat(robots): add robots.txt compliance support
Add support for checking and respecting robots.txt rules before crawling websites:
- Implement RobotsParser class with SQLite caching
- Add check_robots_txt parameter to CrawlerRunConfig
- Integrate robots.txt checking in AsyncWebCrawler
- Update documentation with robots.txt compliance examples
- Add tests for robot parser functionality

The cache uses WAL mode for better concurrency and has a default TTL of 7 days.
2025-01-21 17:54:13 +08:00
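Note on d09c611d15: a robots.txt cache backed by SQLite in WAL mode with a 7-day TTL might be sketched as follows. This is not the RobotsParser class from the commit. The class `RobotsCache`, the table layout, and the helper `can_fetch` are hypothetical, and fetching robots.txt over the network is left out so the sketch stays self-contained.

```python
import sqlite3
import time
from urllib.robotparser import RobotFileParser

TTL_SECONDS = 7 * 24 * 60 * 60  # 7 days, the default TTL named in the commit


class RobotsCache:
    """Hypothetical cache of raw robots.txt text, keyed by domain."""

    def __init__(self, db_path: str) -> None:
        self._db = sqlite3.connect(db_path)
        # WAL lets readers proceed while a writer is active.
        self._db.execute("PRAGMA journal_mode=WAL")
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS robots ("
            "domain TEXT PRIMARY KEY, rules TEXT NOT NULL, fetched_at REAL NOT NULL)"
        )
        self._db.commit()

    def store(self, domain: str, rules: str, now=None) -> None:
        ts = time.time() if now is None else now
        self._db.execute(
            "INSERT OR REPLACE INTO robots VALUES (?, ?, ?)", (domain, rules, ts)
        )
        self._db.commit()

    def load(self, domain: str, now=None):
        """Return cached rules, or None if missing or older than the TTL."""
        ts = time.time() if now is None else now
        row = self._db.execute(
            "SELECT rules, fetched_at FROM robots WHERE domain = ?", (domain,)
        ).fetchone()
        if row is None or ts - row[1] > TTL_SECONDS:
            return None
        return row[0]


def can_fetch(rules: str, user_agent: str, url: str) -> bool:
    """Evaluate robots.txt text with the standard-library parser."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(user_agent, url)
```

A caller would try `load()` first, fetch and `store()` the file on a miss, and then pass the rules to `can_fetch()` before crawling the URL.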
Aravind Karnam
26d78d8512 Merge branch 'next' into feature/scraper 2025-01-21 12:35:45 +05:30
Aravind Karnam
1079965453 refactor: Remove the URL processing logic out of scraper 2025-01-21 12:16:59 +05:30
UncleCode
9247877037 feat(proxy): add proxy configuration support to CrawlerRunConfig
Add proxy_config parameter to CrawlerRunConfig to support dynamic proxy configuration per crawl request. This enables users to specify different proxy settings for each crawl operation without modifying the browser config.

- Added proxy_config parameter to CrawlerRunConfig
- Updated BrowserManager to apply proxy settings from CrawlerRunConfig
- Updated proxy-security documentation with new usage examples
2025-01-20 22:14:05 +08:00
Aravind
a677c2b61d Merge pull request #496 from aravindkarnam/scraper-uc
Trying to merge scraper on-going development with new developments in parallel processing
2025-01-20 16:55:41 +05:30
UncleCode
2cec527a22 feat(extraction): add LLM-powered schema generation utility
Adds new static method generate_schema() to JsonElementExtractionStrategy classes
that can automatically generate extraction schemas using LLM (OpenAI or Ollama).
This provides a convenient way to bootstrap extraction schemas while maintaining
the performance benefits of selector-based extraction.

Key changes:
- Added generate_schema() static method to base extraction strategy
- Added support for both CSS and XPath schema generation
- Updated documentation with examples and best practices
- Added new prompt templates for schema generation
2025-01-20 17:28:00 +08:00
UncleCode
4b1309cbf2 feat(crawler): add URL redirection tracking
Add capability to track and return final URLs after redirects in crawler responses. This enhancement helps users understand the actual destination of crawled URLs after any redirections.

Changes include:
- Added final_url tracking in AsyncPlaywrightCrawlerStrategy
- Added redirected_url field to CrawlResult model
- Updated AsyncWebCrawler to properly handle and store redirect URLs
- Fixed typo in documentation signature
2025-01-19 19:53:38 +08:00
UncleCode
8b6fe6a98f docs(api): add streaming mode documentation and examples
Add comprehensive documentation for the new streaming mode feature in arun_many():
- Update arun_many() API docs to reflect streaming return type
- Add streaming examples in quickstart and multi-url guides
- Document stream parameter in configuration classes
- Add clone() helper method documentation for configs

This change improves documentation for processing large numbers of URLs efficiently.
2025-01-19 18:21:34 +08:00
UncleCode
91463e34f1 feat(config): add streaming support and config cloning
Add streaming capability to crawler configurations and introduce clone() methods
for both BrowserConfig and CrawlerRunConfig to support immutable config updates.
Move stream parameter from arun_many() method to CrawlerRunConfig.

BREAKING CHANGE: Removed stream parameter from AsyncWebCrawler.arun_many() method.
Use config.stream=True instead.
2025-01-19 17:51:47 +08:00
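Note on 91463e34f1: one common way to get the "immutable config update" behaviour that clone() is described as supporting is a frozen dataclass with `dataclasses.replace`, as sketched below. `RunConfig` is a hypothetical stand-in, not the library's CrawlerRunConfig, and the commit does not say how clone() is actually implemented. The field names `stream` and `check_robots_txt` are taken from commit messages in this log.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class RunConfig:
    """Hypothetical, simplified run configuration."""

    stream: bool = False
    check_robots_txt: bool = False

    def clone(self, **overrides) -> "RunConfig":
        """Return a copy with selected fields changed; the original is untouched."""
        return replace(self, **overrides)
```

For example, `RunConfig().clone(stream=True)` yields a new streaming configuration while the base object keeps `stream=False`, so one base config can be shared safely across concurrent crawls.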
UncleCode
1221be30a3 feat(browser): improve browser context management and add shared data support
Add shared_data parameter to CrawlerRunConfig to allow data sharing between hooks.
Implement browser context reuse based on config signatures to improve memory usage.
Fix Firefox/Webkit channel settings.
Add config parameter to hook callbacks for better context access.
Remove debug print statements.

BREAKING CHANGE: Hook callback signatures now include config parameter
2025-01-19 17:12:03 +08:00
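Note on 1221be30a3: reusing a browser context "based on config signatures" could work roughly as in this sketch, where a stable hash of the configuration selects a cached object. The function `config_signature`, the class `ContextPool`, and the use of a JSON-plus-SHA-256 signature are all assumptions; the commit message does not describe how the signature is computed.

```python
import hashlib
import json


def config_signature(config: dict) -> str:
    """Stable digest of a config dict; key order does not affect the result."""
    payload = json.dumps(config, sort_keys=True, default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class ContextPool:
    """Hypothetical pool: one context per distinct configuration."""

    def __init__(self, factory) -> None:
        self._factory = factory  # callable that builds a context from a config
        self._contexts = {}

    def get(self, config: dict):
        sig = config_signature(config)
        if sig not in self._contexts:
            # Only configurations not seen before pay the creation cost.
            self._contexts[sig] = self._factory(config)
        return self._contexts[sig]
```

Crawls that share the same settings then share one context instead of each allocating its own, which is the memory saving the commit refers to.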
Aravind
6dfa9cb703 Streamline Feature requests, bug reports and Forums with Forms & Templates (#465)
* config:Add bug report template and issue chooser

* config:Add bug report template and issue chooser

* config:Add bug report template and issue chooser

* config:Add bug report template and issue chooser

* config:Add bug report template and issue chooser

* config:Add bug report template and issue chooser

* config: updated new bugs to have needs-triage label by default

* Template for PR

* Template for PR

* Template for PR

* Template for PR

* Added FR template

* Added FR template

* Added FR template

* Added FR template

* Config: updated the text for new labels

* config: changed the order of steps to reproduce

* Config: shortened the form for feature request

* Config: Added a code snippet section to the bug report
2025-01-19 16:53:03 +08:00
UncleCode
e363234172 feat(dispatcher): add streaming support for URL processing
Add new streaming capability to the MemoryAdaptiveDispatcher and AsyncWebCrawler
to allow processing URLs with real-time result streaming. This enables
processing results as they become available rather than waiting for all
URLs to complete.

Key changes:
- Add run_urls_stream method to MemoryAdaptiveDispatcher
- Update AsyncWebCrawler.arun_many to support streaming mode
- Add result queue for better result handling
- Improve type hints and documentation

BREAKING CHANGE: The return type of arun_many now depends on the 'stream'
parameter, returning either List[CrawlResult] or AsyncGenerator[CrawlResult, None]
2025-01-19 14:03:34 +08:00
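Note on e363234172: the difference between the two return types named in this commit, a complete list versus an async generator that yields results as they finish, can be shown with plain asyncio. The functions `fetch`, `run_batch`, and `run_stream` are hypothetical and do not use the crawler or dispatcher; `fetch` merely simulates a crawl.

```python
import asyncio
from typing import AsyncGenerator, List


async def fetch(url: str) -> str:
    """Simulated crawl of one URL."""
    await asyncio.sleep(0)
    return f"result:{url}"


async def run_batch(urls: List[str]) -> List[str]:
    """Batch mode: wait for every URL, then return all results in input order."""
    return list(await asyncio.gather(*(fetch(u) for u in urls)))


async def run_stream(urls: List[str]) -> AsyncGenerator[str, None]:
    """Streaming mode: yield each result as soon as its task completes."""
    tasks = [asyncio.ensure_future(fetch(u)) for u in urls]
    for finished in asyncio.as_completed(tasks):
        yield await finished


async def demo(urls: List[str]):
    batch = await run_batch(urls)
    streamed = [r async for r in run_stream(urls)]
    return batch, streamed
```

In streaming mode the consumer can start processing the first result while later URLs are still in flight, at the cost of results arriving in completion order instead of input order.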
UncleCode
3d09b6a221 feat(content-filter): add LLMContentFilter for intelligent markdown generation
Add new LLMContentFilter class that uses LLMs to generate high-quality markdown content:
- Implement intelligent content filtering with customizable instructions
- Add chunk processing for handling large documents
- Support parallel processing of content chunks
- Include caching mechanism for filtered results
- Add usage tracking and statistics
- Update documentation with examples and use cases

Also includes minor changes:
- Disable Pydantic warnings in __init__.py
- Add new prompt template for content filtering
2025-01-18 19:31:07 +08:00
UncleCode
2d6b19e1a2 refactor(browser): improve browser path management
Implement more robust browser executable path handling using playwright's built-in browser management. This change:
- Adds async browser path resolution
- Implements path caching in the home folder
- Removes hardcoded browser paths
- Adds httpx dependency
- Removes obsolete test result files

This change makes the browser path resolution more reliable across different platforms and environments.
2025-01-17 22:14:37 +08:00
UncleCode
ece9202b61 fix(dispatcher): adjust memory threshold and fix dispatcher initialization
- Increase memory threshold from 70% to 90% for better resource utilization
- Remove incorrect self parameter from MemoryAdaptiveDispatcher initialization

These changes improve the crawler's performance by allowing more memory usage before throttling and fix a bug in dispatcher initialization.
2025-01-16 21:58:52 +08:00
UncleCode
9d694da939 fix(models): make model fields optional with default values
Make fields in MediaItem and Link models optional with default values to prevent validation errors when data is incomplete. Also expose BaseDispatcher in __init__ and fix markdown field handling in database manager.

BREAKING CHANGE: MediaItem and Link model fields are now optional with default values which may affect existing code expecting required fields.
2025-01-15 22:58:14 +08:00
UncleCode
20c027b79c chore(cleanup): remove unused files and improve type hints
- Remove .pre-commit-config.yaml and duplicate mkdocs configuration files
- Add Optional type hint for proxy parameter in BrowserConfig
- Fix type annotation for results list in AsyncWebCrawler
- Move calculate_batch_size function import to model_loader
- Update prompt imports in extraction_strategy.py

No breaking changes.
2025-01-14 13:07:18 +08:00
devatbosch
8878b3d032 Updated the correct link for "Contribution guidelines" in README.md (#445)
Thank you for pointing this out. I am creating a contributing guide, which is why I changed the name to the contributors, but I forgot to update some other places. Thanks again.
2025-01-13 20:57:31 +08:00
Jōnin bingi
1ab9d115cf Fixing minor typos in README (#440)
@mcam10 Thx for the support. Appreciate
2025-01-13 20:23:52 +08:00