Compare commits

...

112 Commits

Author SHA1 Message Date
Aravind Karnam
f7ce2d42c9 feat: Add deep crawl capabilities to arun_many function 2025-01-30 17:49:58 +05:30
Aravind Karnam
f6edb8342e Refactor: remove the old deep_crawl method 2025-01-30 16:22:41 +05:30
Aravind Karnam
ca3f0126d3 Refactor: Moved deep_crawl_strategy inside crawler run config 2025-01-30 16:18:15 +05:30
Aravind Karnam
858c18df39 fix: removed child_urls from CrawlResult 2025-01-29 18:08:34 +05:30
Aravind Karnam
2c8f2ec5a6 Refactor: Renamed scrape to traverse and deep_crawl in a few sections where it applies 2025-01-29 16:24:11 +05:30
Aravind Karnam
9ef43bc5f0 Refactor: Move adeep_crawl as method of crawler itself. Create attributes in CrawlResult to reconstruct the tree once deep crawling is completed 2025-01-29 15:58:21 +05:30
Aravind Karnam
84ffdaab9a Refactor: Move adeep_crawl as method of crawler itself. Create attributes in CrawlResult to reconstruct the tree once deep crawling is completed 2025-01-29 13:06:09 +05:30
Aravind Karnam
78223bc847 feat: create ScraperPageResult model to attach score and depth attributes to yielded/returned crawl results 2025-01-28 16:47:30 +05:30
Aravind Karnam
60ce8bbf55 Merge: with v-0.4.3b 2025-01-28 12:59:53 +05:30
Aravind Karnam
85847ff13f feat:
1. Make active_crawls a dict instead of a set and remove the jobs array, for efficient lookup and storage of active crawls and crawl control.
2. Put a lock on active_crawls, so simultaneous push and pop by coroutines don't cause a race condition
3. Move the depth check logic outside the child link for loop, as source_url doesn't change in the loop.
2025-01-28 12:39:45 +05:30
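A minimal sketch of the idea in points 1 and 2 above, assuming an asyncio-based scraper. The `ActiveCrawls` class and its method names are hypothetical illustrations, not the project's actual code: a dict keyed by URL gives constant-time lookup and removal, and an `asyncio.Lock` serialises push and pop.

```python
import asyncio


class ActiveCrawls:
    """Hypothetical registry of in-flight crawls, keyed by URL."""

    def __init__(self):
        self._crawls = {}            # url -> depth; a dict allows O(1) lookup and removal
        self._lock = asyncio.Lock()  # serialises push/pop across coroutines

    async def push(self, url, depth):
        async with self._lock:
            self._crawls[url] = depth

    async def pop(self, url):
        async with self._lock:
            return self._crawls.pop(url, None)

    async def size(self):
        async with self._lock:
            return len(self._crawls)


async def demo():
    registry = ActiveCrawls()
    urls = [f"https://example.com/page/{i}" for i in range(5)]
    # Concurrent pushes, then concurrent pops, all guarded by the same lock.
    await asyncio.gather(*(registry.push(u, 1) for u in urls))
    in_flight = await registry.size()
    await asyncio.gather(*(registry.pop(u) for u in urls))
    return in_flight, await registry.size()
```

Running `asyncio.run(demo())` pushes five URLs and pops them again, so the registry ends up empty.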
Aravind Karnam
f34b4878cf fix: code formatting 2025-01-28 10:00:01 +05:30
Aravind Karnam
d9324e3454 fix: Move the creation of crawler outside the main loop 2025-01-27 18:31:13 +05:30
Aravind Karnam
0ff95c83bc feat: change input params to scraper, Add asynchronous context manager to AsyncWebScraper, Optimise filter application 2025-01-27 18:13:33 +05:30
Aravind Karnam
bb6450f458 Remove robots.txt compliance from scraper 2025-01-27 11:58:54 +05:30
Aravind Karnam
513d008de5 feat: Merge reviews from unclecode for scorers and filters & Remove the robots.txt compliance from scraper since that will be now handled by crawler 2025-01-27 11:54:10 +05:30
UncleCode
dde14eba7d Update README.md (#562) 2025-01-26 11:00:28 +08:00
UncleCode
d0586f09a9 Merge branch 'vr0.4.3b3' 2025-01-25 21:57:29 +08:00
UncleCode
09ac7ed008 feat(demo): uncomment feature demos and add fake-useragent dependency
Uncomments demonstration code for memory dispatcher, streaming support,
content scraping, JSON schema generation, LLM markdown, and robots compliance
in the v0.4.3b2 features demo file. Also adds fake-useragent package as a
project dependency.

This change makes all feature demonstrations active by default and ensures
proper user agent handling capabilities.
2025-01-25 21:56:08 +08:00
UncleCode
97796f39d2 docs(examples): update proxy rotation demo and disable other demos
Modify proxy rotation example to include empty user agent setting and comment out other demo functions for focused testing. This change simplifies the demo file to focus specifically on proxy rotation functionality.

No breaking changes.
2025-01-25 21:52:35 +08:00
UncleCode
4d7f91b378 refactor(user-agent): improve user agent generation system
Redesign user agent generation to be more modular and reliable:
- Add abstract base class UAGen for user agent generation
- Implement ValidUAGenerator using fake-useragent library
- Add OnlineUAGenerator for fetching real-world user agents
- Update browser configurations to use new UA generation system
- Improve client hints generation

This change makes the user agent system more maintainable and provides better real-world user agent coverage.
2025-01-25 21:16:39 +08:00
UncleCode
69a77222ef feat(browser): add CDP URL configuration support
Add support for direct CDP URL configuration in BrowserConfig and ManagedBrowser classes. This allows connecting to remote browser instances using custom CDP endpoints instead of always launching a local browser.

- Added cdp_url parameter to BrowserConfig
- Added cdp_url support in ManagedBrowser.start() method
- Updated documentation for new parameters
2025-01-24 15:53:47 +08:00
UncleCode
0afc3e9e5e refactor(examples): update API usage in features demo
Update the demo script to use the new crawler.arun_many() API instead of dispatcher.run_urls()
and fix result access patterns. Also improve code formatting and remove
extra whitespace.

- Replace dispatcher.run_urls with crawler.arun_many
- Update streaming demo to use new API and correct result access
- Clean up whitespace and formatting
- Simplify result property access patterns
2025-01-23 22:37:29 +08:00
UncleCode
65d33bcc0f style(docs): improve code formatting in features demo
Clean up whitespace and improve readability in v0_4_3b2_features_demo.py:
- Remove excessive blank lines between functions
- Improve config formatting for better readability
- Uncomment memory dispatcher demo in main function

No breaking changes.
2025-01-23 22:36:58 +08:00
UncleCode
6a01008a2b docs(multi-url): improve documentation clarity and update examples
- Restructure multi-URL crawling documentation with better formatting and examples
- Update code examples to use new API syntax (arun_many)
- Add detailed parameter explanations for RateLimiter and Dispatchers
- Enhance CSS styling for better documentation readability
- Fix outdated method calls in feature demo script

BREAKING CHANGE: Updated dispatcher.run_urls() to crawler.arun_many() in examples
2025-01-23 22:33:36 +08:00
UncleCode
cf3e1e748d feat(scraper): add optimized URL scoring system
Implements a new high-performance URL scoring system with multiple scoring strategies:
- FastKeywordRelevanceScorer for keyword matching
- FastPathDepthScorer for URL depth analysis
- FastContentTypeScorer for file type scoring
- FastFreshnessScorer for date-based scoring
- FastDomainAuthorityScorer for domain reputation
- FastCompositeScorer for combining multiple scorers

Key improvements:
- Memory optimization using __slots__
- LRU caching for expensive operations
- Optimized string operations
- Pre-computed scoring tables
- Fast path optimizations for common cases
- Reduced object allocation

Includes comprehensive benchmarking and testing utilities.
2025-01-23 20:46:33 +08:00
UncleCode
6dc01eae3a refactor(core): improve type hints and remove unused file
- Add RelevantContentFilter to __init__.py exports
- Update version to 0.4.3b3
- Enhance type hints in async_configs.py
- Remove empty utils.scraping.py file
- Update mkdocs configuration with version info and GitHub integration

BREAKING CHANGE: None
2025-01-23 18:53:22 +08:00
UncleCode
7b7fe84e0d docs(readme): resolve merge conflict and update version info
Resolves merge conflict in README.md by removing outdated version 0.4.24x information and keeping current version 0.4.3bx details. Updates release notes description to reflect current features including Memory Dispatcher System, Streaming Support, and other improvements.

No breaking changes.
2025-01-22 20:52:42 +08:00
UncleCode
5c36f4308f Merge branch 'main' of https://github.com/unclecode/crawl4ai 2025-01-22 20:51:52 +08:00
UncleCode
45809d1c91 Merge branch 'vr0.4.3b2' 2025-01-22 20:51:46 +08:00
UncleCode
357414c345 docs(readme): update version references and fix links
Update version numbers to v0.4.3bx throughout README.md
Fix contributing guidelines link to point to CONTRIBUTORS.md
Update Aravind's role in CONTRIBUTORS.md to Head of Community and Product
Add pre-release installation instructions
Fix minor formatting in personal story section

No breaking changes
2025-01-22 20:46:39 +08:00
UncleCode
260b9120c3 docs(examples): update v0.4.3 features demo to v0.4.3b2
Rename and replace the features demo file to reflect the beta 2 version number.
The old v0.4.3 demo file is removed and replaced with a new beta 2 version.

Renames:
- docs/examples/v0_4_3_features_demo.py -> docs/examples/v0_4_3b2_features_demo.py
2025-01-22 20:41:43 +08:00
UncleCode
976ea52167 docs(examples): update demo scripts and fix output formats
Update example scripts to reflect latest API changes and improve demonstrations:
- Increase test URLs in dispatcher example from 20 to 40 pages
- Comment out unused dispatcher strategies for cleaner output
- Fix scraping strategies performance script to use correct object notation
- Update v0_4_3_features_demo with additional feature mentions and uncomment demo sections

These changes make the examples more current and better aligned with the actual API.
2025-01-22 20:40:03 +08:00
UncleCode
e6ef8d91ba refactor(scraper): optimize URL validation and filter performance
- Replace validators library with built-in urlparse for URL validation
- Optimize filter statistics update logic for better performance
- Add performance benchmarking suite for filters
- Add execution time tracking to scraper examples
- Update gitignore with windsurfrules

BREAKING CHANGE: Removed dependency on validators library for URL validation
2025-01-22 19:45:56 +08:00
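For illustration, URL validation with the standard library's urlparse might look like the sketch below. The function name `is_valid_url` and the exact acceptance rules (http/https scheme plus a non-empty host) are assumptions, not the code from this commit.

```python
from urllib.parse import urlparse


def is_valid_url(url):
    """Accept only http(s) URLs that carry a network location."""
    try:
        parsed = urlparse(url)
    except ValueError:
        # urlparse raises ValueError for some malformed inputs, e.g. a broken IPv6 host.
        return False
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)
```

This avoids a third-party dependency at the cost of a looser check than a dedicated validation library would apply.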
UncleCode
2d69bf2366 refactor(models): rename final_url to redirected_url for consistency
Renames the final_url field to redirected_url across all components to maintain
consistent terminology throughout the codebase. This change affects:
- AsyncCrawlResponse model
- AsyncPlaywrightCrawlerStrategy
- Documentation and examples

No functional changes, purely naming consistency improvement.
2025-01-22 17:14:24 +08:00
UncleCode
dee5fe9851 feat(proxy): add proxy rotation support and documentation
Implements dynamic proxy rotation functionality with authentication support and IP verification. Updates include:
- Added proxy rotation demo in features example
- Updated proxy configuration handling in BrowserManager
- Added proxy rotation documentation
- Updated README with new proxy rotation feature
- Bumped version to 0.4.3b2

This change enables users to dynamically switch between proxies and verify IP addresses for each request.
2025-01-22 16:11:01 +08:00
UncleCode
88697c4630 docs(readme): update version and feature announcements for v0.4.3b1
Update README.md to announce version 0.4.3b1 release with new features including:
- Memory Dispatcher System
- Streaming Support
- LLM-Powered Markdown Generation
- Schema Generation
- Robots.txt Compliance

Add detailed version numbering explanation section to help users understand pre-release versions.
2025-01-21 21:20:04 +08:00
Aravind Karnam
6e78c56dda Refactor: Removed all scheduling logic from scraper. From now on, the scraper expects arun_many to handle all scheduling. The scraper will only do traversal, validations, compliance checks, URL filtering, scoring, etc. Reformatted some of the scraper files with the Black code formatter 2025-01-21 18:44:43 +05:30
UncleCode
16b8d4945b feat(release): prepare v0.4.3 beta release
Prepare the v0.4.3 beta release with major feature additions and improvements:
- Add JsonXPathExtractionStrategy and LLMContentFilter to exports
- Update version to 0.4.3b1
- Improve documentation for dispatchers and markdown generation
- Update development status to Beta
- Reorganize changelog format

BREAKING CHANGE: Memory threshold in MemoryAdaptiveDispatcher increased to 90% and SemaphoreDispatcher parameter renamed to max_session_permit
2025-01-21 21:03:11 +08:00
Aravind Karnam
67fa06c09b Refactor: Removed all scheduling logic from scraper. From now on, the scraper expects arun_many to handle all scheduling. The scraper will only do traversal, validations, compliance checks, URL filtering, scoring, etc. Reformatted some of the scraper files with the Black code formatter 2025-01-21 17:49:51 +05:30
UncleCode
d09c611d15 feat(robots): add robots.txt compliance support
Add support for checking and respecting robots.txt rules before crawling websites:
- Implement RobotsParser class with SQLite caching
- Add check_robots_txt parameter to CrawlerRunConfig
- Integrate robots.txt checking in AsyncWebCrawler
- Update documentation with robots.txt compliance examples
- Add tests for robot parser functionality

The cache uses WAL mode for better concurrency and has a default TTL of 7 days.
2025-01-21 17:54:13 +08:00
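A rough sketch of the caching approach described above: robots.txt rules stored in SQLite with WAL mode and a 7-day TTL, then evaluated with the standard library's robotparser. The `RobotsCache` class, table layout, and helper names are hypothetical; the actual RobotsParser implementation may differ.

```python
import os
import sqlite3
import tempfile
import time
from urllib.robotparser import RobotFileParser

SEVEN_DAYS = 7 * 24 * 60 * 60  # default TTL from the commit message, in seconds


class RobotsCache:
    """Hypothetical SQLite-backed cache of robots.txt bodies, keyed by domain."""

    def __init__(self, db_path, ttl=SEVEN_DAYS):
        self.ttl = ttl
        self.conn = sqlite3.connect(db_path)
        # WAL lets readers proceed while a writer is active.
        self.conn.execute("PRAGMA journal_mode=WAL")
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS robots ("
            "domain TEXT PRIMARY KEY, rules TEXT, fetched_at REAL)"
        )
        self.conn.commit()

    def put(self, domain, rules, now=None):
        now = time.time() if now is None else now
        self.conn.execute(
            "INSERT OR REPLACE INTO robots VALUES (?, ?, ?)", (domain, rules, now)
        )
        self.conn.commit()

    def get(self, domain, now=None):
        now = time.time() if now is None else now
        row = self.conn.execute(
            "SELECT rules, fetched_at FROM robots WHERE domain = ?", (domain,)
        ).fetchone()
        if row is None or now - row[1] > self.ttl:
            return None  # missing or expired: the caller should refetch
        return row[0]


def can_fetch(rules, url, user_agent="*"):
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(user_agent, url)


def demo():
    rules = "User-agent: *\nDisallow: /private/"
    with tempfile.TemporaryDirectory() as tmp:
        cache = RobotsCache(os.path.join(tmp, "robots.db"))
        cache.put("example.com", rules, now=0.0)
        fresh = cache.get("example.com", now=60.0)
        expired = cache.get("example.com", now=SEVEN_DAYS + 1.0)
        cache.conn.close()
    return (
        can_fetch(fresh, "https://example.com/private/x"),
        can_fetch(fresh, "https://example.com/public"),
        expired,
    )
```

The `now` parameter exists only to make expiry testable without waiting; production code would rely on the wall clock.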
Aravind Karnam
26d78d8512 Merge branch 'next' into feature/scraper 2025-01-21 12:35:45 +05:30
Aravind Karnam
1079965453 refactor: Remove the URL processing logic out of scraper 2025-01-21 12:16:59 +05:30
UncleCode
9247877037 feat(proxy): add proxy configuration support to CrawlerRunConfig
Add proxy_config parameter to CrawlerRunConfig to support dynamic proxy configuration per crawl request. This enables users to specify different proxy settings for each crawl operation without modifying the browser config.

- Added proxy_config parameter to CrawlerRunConfig
- Updated BrowserManager to apply proxy settings from CrawlerRunConfig
- Updated proxy-security documentation with new usage examples
2025-01-20 22:14:05 +08:00
Aravind
a677c2b61d Merge pull request #496 from aravindkarnam/scraper-uc
Trying to merge scraper on-going development with new developments in parallel processing
2025-01-20 16:55:41 +05:30
UncleCode
2cec527a22 feat(extraction): add LLM-powered schema generation utility
Adds new static method generate_schema() to JsonElementExtractionStrategy classes
that can automatically generate extraction schemas using LLM (OpenAI or Ollama).
This provides a convenient way to bootstrap extraction schemas while maintaining
the performance benefits of selector-based extraction.

Key changes:
- Added generate_schema() static method to base extraction strategy
- Added support for both CSS and XPath schema generation
- Updated documentation with examples and best practices
- Added new prompt templates for schema generation
2025-01-20 17:28:00 +08:00
UncleCode
4b1309cbf2 feat(crawler): add URL redirection tracking
Add capability to track and return final URLs after redirects in crawler responses. This enhancement helps users understand the actual destination of crawled URLs after any redirections.

Changes include:
- Added final_url tracking in AsyncPlaywrightCrawlerStrategy
- Added redirected_url field to CrawlResult model
- Updated AsyncWebCrawler to properly handle and store redirect URLs
- Fixed typo in documentation signature
2025-01-19 19:53:38 +08:00
UncleCode
8b6fe6a98f docs(api): add streaming mode documentation and examples
Add comprehensive documentation for the new streaming mode feature in arun_many():
- Update arun_many() API docs to reflect streaming return type
- Add streaming examples in quickstart and multi-url guides
- Document stream parameter in configuration classes
- Add clone() helper method documentation for configs

This change improves documentation for processing large numbers of URLs efficiently.
2025-01-19 18:21:34 +08:00
UncleCode
91463e34f1 feat(config): add streaming support and config cloning
Add streaming capability to crawler configurations and introduce clone() methods
for both BrowserConfig and CrawlerRunConfig to support immutable config updates.
Move stream parameter from arun_many() method to CrawlerRunConfig.

BREAKING CHANGE: Removed stream parameter from AsyncWebCrawler.arun_many() method.
Use config.stream=True instead.
2025-01-19 17:51:47 +08:00
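The clone() idea above, deriving a modified config without mutating the original, can be sketched with a frozen dataclass. `RunConfig` here is a hypothetical stand-in for CrawlerRunConfig and only carries two illustrative fields.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class RunConfig:
    """Hypothetical stand-in for a run configuration with a stream flag."""

    stream: bool = False
    check_robots_txt: bool = False

    def clone(self, **overrides):
        # Return a new instance with selected fields changed; self is left untouched.
        return replace(self, **overrides)


base = RunConfig()
streaming = base.clone(stream=True)
```

With this shape, a caller enables streaming by passing a cloned config rather than a separate `stream` argument.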
UncleCode
1221be30a3 feat(browser): improve browser context management and add shared data support
Add shared_data parameter to CrawlerRunConfig to allow data sharing between hooks.
Implement browser context reuse based on config signatures to improve memory usage.
Fix Firefox/Webkit channel settings.
Add config parameter to hook callbacks for better context access.
Remove debug print statements.

BREAKING CHANGE: Hook callback signatures now include config parameter
2025-01-19 17:12:03 +08:00
Aravind
6dfa9cb703 Streamline Feature requests, bug reports and Forums with Forms & Templates (#465)
* config:Add bug report template and issue chooser

* config:Add bug report template and issue chooser

* config:Add bug report template and issue chooser

* config:Add bug report template and issue chooser

* config:Add bug report template and issue chooser

* config:Add bug report template and issue chooser

* config: updated new bugs to have needs-triage label by default

* Template for PR

* Template for PR

* Template for PR

* Template for PR

* Added FR template

* Added FR template

* Added FR template

* Added FR template

* Config: updated the text for new labels

* config: changed the order of steps to reproduce

* Config: shortened the form for feature request

* Config: Added a code snippet section to the bug report
2025-01-19 16:53:03 +08:00
UncleCode
e363234172 feat(dispatcher): add streaming support for URL processing
Add new streaming capability to the MemoryAdaptiveDispatcher and AsyncWebCrawler
to allow processing URLs with real-time result streaming. This enables
processing results as they become available rather than waiting for all
URLs to complete.

Key changes:
- Add run_urls_stream method to MemoryAdaptiveDispatcher
- Update AsyncWebCrawler.arun_many to support streaming mode
- Add result queue for better result handling
- Improve type hints and documentation

BREAKING CHANGE: The return type of arun_many now depends on the 'stream'
parameter, returning either List[CrawlResult] or AsyncGenerator[CrawlResult, None]
2025-01-19 14:03:34 +08:00
UncleCode
3d09b6a221 feat(content-filter): add LLMContentFilter for intelligent markdown generation
Add new LLMContentFilter class that uses LLMs to generate high-quality markdown content:
- Implement intelligent content filtering with customizable instructions
- Add chunk processing for handling large documents
- Support parallel processing of content chunks
- Include caching mechanism for filtered results
- Add usage tracking and statistics
- Update documentation with examples and use cases

Also includes minor changes:
- Disable Pydantic warnings in __init__.py
- Add new prompt template for content filtering
2025-01-18 19:31:07 +08:00
UncleCode
2d6b19e1a2 refactor(browser): improve browser path management
Implement more robust browser executable path handling using playwright's built-in browser management. This change:
- Adds async browser path resolution
- Implements path caching in the home folder
- Removes hardcoded browser paths
- Adds httpx dependency
- Removes obsolete test result files

This change makes the browser path resolution more reliable across different platforms and environments.
2025-01-17 22:14:37 +08:00
UncleCode
ece9202b61 fix(dispatcher): adjust memory threshold and fix dispatcher initialization
- Increase memory threshold from 70% to 90% for better resource utilization
- Remove incorrect self parameter from MemoryAdaptiveDispatcher initialization

These changes improve the crawler's performance by allowing more memory usage before throttling and fix a bug in dispatcher initialization.
2025-01-16 21:58:52 +08:00
UncleCode
9d694da939 fix(models): make model fields optional with default values
Make fields in MediaItem and Link models optional with default values to prevent validation errors when data is incomplete. Also expose BaseDispatcher in __init__ and fix markdown field handling in database manager.

BREAKING CHANGE: MediaItem and Link model fields are now optional with default values which may affect existing code expecting required fields.
2025-01-15 22:58:14 +08:00
UncleCode
20c027b79c chore(cleanup): remove unused files and improve type hints
- Remove .pre-commit-config.yaml and duplicate mkdocs configuration files
- Add Optional type hint for proxy parameter in BrowserConfig
- Fix type annotation for results list in AsyncWebCrawler
- Move calculate_batch_size function import to model_loader
- Update prompt imports in extraction_strategy.py

No breaking changes.
2025-01-14 13:07:18 +08:00
devatbosch
8878b3d032 Updated the correct link for "Contribution guidelines" in README.md (#445)
Thank you for pointing this out. I am creating a contributing guide, which is why I changed the name to the contributors, but I forgot to update some other places. Thanks again.
2025-01-13 20:57:31 +08:00
Jōnin bingi
1ab9d115cf Fixing minor typos in README (#440)
@mcam10 Thx for the support. Appreciate
2025-01-13 20:23:52 +08:00
UncleCode
8ec12d7d68 Apply Ruff Corrections 2025-01-13 19:19:58 +08:00
UncleCode
c3370ec5da refactor(scraping): replace ScrapingMode enum with strategy pattern
Replace the ScrapingMode enum with a proper strategy pattern implementation for content scraping.
This change introduces:
- New ContentScrapingStrategy abstract base class
- Concrete WebScrapingStrategy and LXMLWebScrapingStrategy implementations
- New Pydantic models for structured scraping results
- Updated documentation reflecting the new strategy-based approach

BREAKING CHANGE: ScrapingMode enum has been removed. Users should now use ContentScrapingStrategy implementations instead.
2025-01-13 17:53:12 +08:00
UncleCode
f3ae5a657c feat(scraping): add LXML-based scraping mode for improved performance
Adds a new ScrapingMode enum to allow switching between BeautifulSoup and LXML parsing.
LXML mode offers 10-20x better performance for large HTML documents.

Key changes:
- Added ScrapingMode enum with BEAUTIFULSOUP and LXML options
- Implemented LXMLWebScrapingStrategy class
- Added LXML-based metadata extraction
- Updated documentation with scraping mode usage and performance considerations
- Added cssselect dependency

BREAKING CHANGE: None
2025-01-12 20:46:23 +08:00
UncleCode
825c78a048 refactor(dispatcher): migrate to modular dispatcher system with enhanced monitoring
Reorganize dispatcher functionality into separate components:
- Create dedicated dispatcher classes (MemoryAdaptive, Semaphore)
- Add RateLimiter for smart request throttling
- Implement CrawlerMonitor for real-time progress tracking
- Move dispatcher config from CrawlerRunConfig to separate classes

BREAKING CHANGE: Dispatcher configuration moved from CrawlerRunConfig to dedicated dispatcher classes. Users need to update their configuration approach for multi-URL crawling.
2025-01-11 21:10:27 +08:00
UncleCode
3865342c93 Merge branch 'next' into next-cdp 2025-01-10 16:01:49 +08:00
UncleCode
ac5f461d40 feat(crawler): add memory-adaptive dispatcher with rate limiting
Implements a new MemoryAdaptiveDispatcher class to manage concurrent crawling operations with memory monitoring and rate limiting capabilities. Changes include:

- Added RateLimitConfig dataclass for configuring rate limiting behavior
- Extended CrawlerRunConfig with dispatcher-related settings
- Refactored arun_many to use the new dispatcher system
- Added memory threshold and session permit controls
- Integrated optional progress monitoring display

BREAKING CHANGE: The arun_many method now uses MemoryAdaptiveDispatcher by default, which may affect concurrent crawling behavior
2025-01-10 16:01:18 +08:00
UncleCode
f9c601eb7e docs(urls): update documentation URLs to new domain
Update all documentation URLs from crawl4ai.com/mkdocs to docs.crawl4ai.com across README, examples, and documentation files. This change reflects the new documentation hosting domain.

Also add todo/ directory to .gitignore.
2025-01-09 16:24:41 +08:00
UncleCode
e8b4ac6046 docs(urls): update documentation URLs to new domain
Update all documentation URLs from crawl4ai.com/mkdocs to docs.crawl4ai.com
Improve badges styling and layout in documentation
Increase code font size in documentation CSS

BREAKING CHANGE: Documentation URLs have changed from crawl4ai.com/mkdocs to docs.crawl4ai.com
2025-01-09 16:22:41 +08:00
UncleCode
051a6cf974 docs(readme): update personal story and project vision
Revise the README's personal story section to better reflect the project's
origins, motivation, and vision for open-source data accessibility. Add more
detail about the creator's background and the project's mission to
democratize AI through open data access.

Also includes a minor TODO comment addition in async crawler strategy.
2025-01-08 21:13:31 +08:00
UncleCode
1c9464b988 Update all documents 2025-01-08 19:31:31 +08:00
UncleCode
6838901788 Update All docs 2025 8th Jan 2025-01-08 19:31:17 +08:00
UncleCode
ad5e5d21ca Remove .codeiumignore from version control and add to .gitignore 2025-01-08 13:09:23 +08:00
UncleCode
26d821c0de Remove .codeiumignore from version control and add to .gitignore 2025-01-08 13:08:19 +08:00
UncleCode
010677cbee chore: add .gitattributes file
Add initial .gitattributes file to standardize line endings and file handling across different operating systems.

This will help prevent issues with line ending inconsistencies between developers working on different platforms.
2025-01-08 13:05:00 +08:00
UncleCode
c110d459fb Update .gitattributes 2025-01-07 21:20:17 +08:00
UncleCode
4d1975e0a7 Update .gitattributes 2025-01-07 21:18:45 +08:00
UncleCode
82734a750c Update .gitattributes 2025-01-07 21:11:45 +08:00
UncleCode
56fa4e1e42 refactor(doc)
Update README
2025-01-07 20:53:10 +08:00
UncleCode
ca3e33122e refactor(docs): reorganize documentation structure and update styles
Reorganize documentation into core/advanced/extraction sections for better navigation.
Update terminal theme styles and add rich library for better CLI output.
Remove redundant tutorial files and consolidate content into core sections.
Add personal story to index page for project context.

BREAKING CHANGE: Documentation structure has been significantly reorganized
2025-01-07 20:49:50 +08:00
Aravind Karnam
7a5f83b76f fix: Added browser config and crawler run config from 0.4.22 2024-12-18 10:33:09 +05:30
aravind
7c0fa269a6 Merge pull request #9 from aravindkarnam/main
Pulling version 0.4.22 from main into scraper
2024-12-17 18:43:36 +05:30
Aravind Karnam
2f5e0598bb updated definition of can_process_url to include depth as an argument, as it's needed to skip filters for start_url 2024-11-26 18:26:57 +05:30
Aravind Karnam
ff731e4ea1 fixed the final scraper_quickstart.py example 2024-11-26 17:08:32 +05:30
Aravind Karnam
9530ded83a fixed the final scraper_quickstart.py example 2024-11-26 17:05:54 +05:30
Aravind Karnam
155c756238 <Future pending> issue fix was incorrect. Reverting 2024-11-26 17:04:04 +05:30
Aravind Karnam
a888c91790 Fix "Future attached to a different loop" error by ensuring tasks are created in the correct event loop
- Explicitly retrieve and use the correct event loop when creating tasks to avoid cross-loop issues.
- Ensures proper task scheduling in environments with multiple event loops.
2024-11-26 14:05:02 +05:30
Aravind Karnam
a98d51a62c Remove the can_process_url check from _process_links since it's already being checked in process_url 2024-11-26 11:11:49 +05:30
Aravind Karnam
ee3001b1f7 fix: moved depth as a param to can_process_url and applying filter chain only when depth is not zero. This way
filter chain is skipped but other validations are in place even for start URL
2024-11-26 10:22:14 +05:30
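The behaviour described above can be sketched as follows, assuming filters are plain callables and that the basic validation is a scheme and host check. The signature shown is an assumption for illustration; only the name can_process_url comes from the commit message.

```python
from urllib.parse import urlparse


def can_process_url(url, depth, filters):
    """Validate every URL, but apply the filter chain only below the start URL."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        return False  # basic validation applies even at depth 0
    if depth == 0:
        return True   # the start URL skips the filter chain
    return all(f(url) for f in filters)


def only_blog(url):
    # Hypothetical filter: keep URLs under /blog/ only.
    return "/blog/" in url
```

Here a start URL such as a site's home page is accepted even though it would not pass `only_blog`, while the same URL discovered at depth 1 is rejected.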
Aravind Karnam
b13fd71040 chore: 1. Expose process_external_links as a param
2. Removed a few unused imports
3. Removed URL normalisation for external links separately as that won't be necessary
2024-11-26 10:07:11 +05:30
Aravind Karnam
2226ef53c8 fix: Exempting the start_url from can_process_url 2024-11-23 14:59:14 +05:30
aravind
3d52b551f2 Merge pull request #8 from aravindkarnam/main
Pulling in 0.3.74
2024-11-23 13:57:36 +05:30
Aravind Karnam
f8e85b1499 Fixed a bug in _process_links, handled condition for when url_scorer is passed as None, renamed the scrapper folder to scraper. 2024-11-23 13:52:34 +05:30
Aravind Karnam
c1797037c0 Fixed a few bugs, import errors and changed to asyncio wait_for instead of timeout to support python versions < 3.11 2024-11-23 12:39:25 +05:30
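As background for the change above: `asyncio.timeout()` was added in Python 3.11, while `asyncio.wait_for()` is available on older versions. A minimal sketch of the portable pattern, with hypothetical function names:

```python
import asyncio


async def fetch(delay):
    """Stand-in for a crawl that takes `delay` seconds."""
    await asyncio.sleep(delay)
    return "done"


async def fetch_with_timeout(delay, timeout):
    try:
        # wait_for cancels the inner coroutine if the timeout elapses.
        return await asyncio.wait_for(fetch(delay), timeout=timeout)
    except asyncio.TimeoutError:
        return None
```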
aravind
60670b2af6 Merge pull request #7 from aravindkarnam/main
pulling the main branch into scraper-uc
2024-11-15 20:43:54 +05:30
UncleCode
0d357ab7d2 feat(scraper): Enhance URL filtering and scoring systems
Implement comprehensive URL filtering and scoring capabilities:

Filters:
- Add URLPatternFilter with glob/regex support
- Implement ContentTypeFilter with MIME type checking
- Add DomainFilter for domain control
- Create FilterChain with stats tracking

Scorers:
- Complete KeywordRelevanceScorer implementation
- Add PathDepthScorer for URL structure scoring
- Implement ContentTypeScorer for file type priorities
- Add FreshnessScorer for date-based scoring
- Add DomainAuthorityScorer for domain weighting
- Create CompositeScorer for combined strategies

Features:
- Add statistics tracking for both filters and scorers
- Implement logging support throughout
- Add resource cleanup methods
- Create comprehensive documentation
- Include performance optimizations

Tests and docs included.
Note: Review URL normalization overlap with recent crawler changes.
2024-11-08 19:02:28 +08:00
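A simplified sketch of a glob-based URL filter combined in a chain with stats tracking. The class names mirror those in the commit message, but the constructors, the `apply` method, and the stats layout are assumptions made for this example.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlparse


class URLPatternFilter:
    def __init__(self, patterns):
        self.patterns = list(patterns)

    def apply(self, url):
        # fnmatchcase does glob matching without platform-specific case folding.
        return any(fnmatchcase(url, p) for p in self.patterns)


class DomainFilter:
    def __init__(self, blocked):
        self.blocked = set(blocked)

    def apply(self, url):
        return urlparse(url).netloc not in self.blocked


class FilterChain:
    def __init__(self, filters):
        self.filters = list(filters)
        self.stats = {"total": 0, "passed": 0, "rejected": 0}

    def apply(self, url):
        self.stats["total"] += 1
        # all() short-circuits, so later filters are skipped after the first rejection.
        passed = all(f.apply(url) for f in self.filters)
        self.stats["passed" if passed else "rejected"] += 1
        return passed
```

Ordering cheap filters first keeps the chain inexpensive, since a rejection stops evaluation early.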
UncleCode
bae4665949 feat(scraper): Enhance URL filtering and scoring systems
Implement comprehensive URL filtering and scoring capabilities:

Filters:
- Add URLPatternFilter with glob/regex support
- Implement ContentTypeFilter with MIME type checking
- Add DomainFilter for domain control
- Create FilterChain with stats tracking

Scorers:
- Complete KeywordRelevanceScorer implementation
- Add PathDepthScorer for URL structure scoring
- Implement ContentTypeScorer for file type priorities
- Add FreshnessScorer for date-based scoring
- Add DomainAuthorityScorer for domain weighting
- Create CompositeScorer for combined strategies

Features:
- Add statistics tracking for both filters and scorers
- Implement logging support throughout
- Add resource cleanup methods
- Create comprehensive documentation
- Include performance optimizations

Tests and docs included.
Note: Review URL normalization overlap with recent crawler changes.

- Quick Start is created and added
2024-11-08 18:45:12 +08:00
UncleCode
d11c004fbb Enhanced BFS Strategy: Improved monitoring, resource management & configuration
- Added CrawlStats for comprehensive crawl monitoring
- Implemented proper resource cleanup with shutdown mechanism
- Enhanced URL processing with better validation and politeness controls
- Added configuration options (max_concurrent, timeout, external_links)
- Improved error handling with retry logic
- Added domain-specific queues for better performance
- Created comprehensive documentation

Note: URL normalization needs review - potential duplicate processing
with core crawler for internal links. Currently commented out pending
further investigation of edge cases.
2024-11-08 15:57:23 +08:00
UncleCode
3d1c9a8434 Reviewing the BFS strategy. 2024-11-07 18:54:53 +08:00
UncleCode
be472c624c Refactored AsyncWebScraper to include comprehensive error handling and progress tracking capabilities. Introduced a ScrapingProgress data class to monitor processed and failed URLs. Enhanced scraping methods to log errors and track stats throughout the scraping process. 2024-11-06 21:09:47 +08:00
UncleCode
06b21dcc50 Update .gitignore to include new directories for issues and documentation 2024-11-06 18:44:03 +08:00
UncleCode
0f0f60527d Merge pull request #172 from aravindkarnam/scraper
Scraper
2024-11-06 07:00:44 +01:00
Aravind Karnam
8105fd178e Removed stubs for remove_from_future_crawls since the visited set is updated as soon as the URL is queued. Removed add_to_retry_queue(url) since retry with exponential backoff with the help of tenacity is going to take care of it. 2024-10-17 15:42:43 +05:30
Aravind Karnam
ce7fce4b16 1. Moved to asyncio.wait instead of gather so that results can be yielded as soon as they are ready, rather than in batches
2. Moved visited.add(url) to before the task is put in the queue rather than after the crawl is completed. This makes sure that duplicate crawls don't happen when the same URL is found at a different depth and gets queued too because the crawl is not yet completed and the visited set is not updated.
3. Renamed the yield_results attribute to stream, since that seems to be popularly used in other AI libraries for intermediate results.
2024-10-17 12:25:17 +05:30
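The changes described in ce7fce4b16 can be illustrated with a minimal sketch. This is not the scraper's actual code: `fetch_links`, the toy link graph, `crawl`, and `collect` are hypothetical stand-ins. Only the use of `asyncio.wait` with `FIRST_COMPLETED` and the early `visited.add` mirror what the commit message describes.

```python
import asyncio

# Hypothetical link graph standing in for real pages and their child links.
TOY_GRAPH = {
    "a": ["b", "c"],
    "b": ["c", "d"],
    "c": ["d"],
    "d": [],
}


async def fetch_links(url: str) -> list:
    """Hypothetical stand-in for a real crawl: returns the child URLs of `url`."""
    await asyncio.sleep(0)  # yield control, as real network I/O would
    return TOY_GRAPH.get(url, [])


async def crawl(start: str):
    """Yield each URL as soon as its crawl finishes, never crawling one twice."""
    visited = {start}  # mark as visited when queued, not when completed
    pending = {asyncio.create_task(fetch_links(start), name=start)}
    while pending:
        # Unlike gather, wait returns as soon as any task is done.
        done, pending = await asyncio.wait(
            pending, return_when=asyncio.FIRST_COMPLETED
        )
        for task in done:
            yield task.get_name()
            for child in task.result():
                if child not in visited:
                    visited.add(child)  # before queueing, so no duplicate task
                    pending.add(
                        asyncio.create_task(fetch_links(child), name=child)
                    )


async def collect(start: str) -> list:
    return [url async for url in crawl(start)]
```

In the toy graph, `d` is linked from both `b` and `c`. Because `d` is added to `visited` at queue time, whichever parent finishes second sees it as already visited and does not queue it again.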
Aravind Karnam
de28b59aca removed unused imports 2024-10-16 22:36:48 +05:30
Aravind Karnam
04d8b47b92 Exposed min_crawl_delay for BFSScraperStrategy 2024-10-16 22:34:54 +05:30
Aravind Karnam
2943feeecf 1. Added a flag to yield each crawl result as it becomes ready, with the final scraper result available as another option
2. Removed the ascrape_many method, as I'm currently not focusing on it in the first cut of the scraper
3. Added some error handling for cases where robots.txt cannot be fetched or parsed.
2024-10-16 22:05:29 +05:30
Aravind Karnam
8a7d29ce85 updated some comments and removed content type checking functionality from core as it's implemented as a filter 2024-10-16 15:59:37 +05:30
aravind
159bd875bd Merge pull request #5 from aravindkarnam/main
Merging 0.3.6
2024-10-16 10:41:22 +05:30
Aravind Karnam
d743adac68 Fixed some bugs in robots.txt processing 2024-10-03 15:58:57 +05:30
Aravind Karnam
7fe220dbd5 1. Introduced a bool flag to ascrape method to switch between sequential and concurrent processing
2. Introduced a dictionary for depth tracking across various tasks
3. Removed redundancy with crawled_urls variable. Instead created a list with visited set variable in returned object.
2024-10-03 11:17:11 +05:30
aravind
65e013d9d1 Merge pull request #3 from aravindkarnam/main
Merging latest changes from main branch
2024-10-03 09:52:12 +05:30
Aravind Karnam
7f3e2e47ed Parallel processing with retry on failure with exponential backoff - Simplified URL validation and normalisation - respecting Robots.txt 2024-09-19 12:34:12 +05:30
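For readers unfamiliar with the pattern, retry with exponential backoff can be sketched as below. A later commit in this list says the project relies on tenacity for this, so the hand-rolled, standard-library-only helper `retry_with_backoff` and its parameters are assumptions made for illustration, not the scraper's implementation.

```python
import asyncio
import random


async def retry_with_backoff(func, *, attempts=4, base_delay=0.01, max_delay=1.0):
    """Await func(); on failure wait base_delay * 2**attempt and try again.

    Hypothetical helper: the name and defaults are illustrative only.
    """
    for attempt in range(attempts):
        try:
            return await func()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            delay = min(max_delay, base_delay * 2 ** attempt)
            # A little random jitter avoids many tasks retrying in lockstep.
            await asyncio.sleep(delay + random.uniform(0, delay / 10))


# Usage: a coroutine that fails twice, then succeeds on the third call.
calls = {"n": 0}


async def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")
    return "ok"
```

Capping the delay with `max_delay` keeps a long run of failures from producing unbounded waits.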
aravind
78f26ac263 Merge pull request #2 from aravindkarnam/staging
Staging
2024-09-18 18:16:23 +05:30
Aravind Karnam
44ce12c62c Created scaffolding for Scraper as per the plan. Implemented the ascrape method in bfs_scraper_strategy 2024-09-09 13:13:34 +05:30
206 changed files with 22704 additions and 15973 deletions

@@ -1,220 +0,0 @@
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class
# C extensions
*.so
# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST
# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec
# Installer logs
pip-log.txt
pip-delete-this-directory.txt
# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/
cover/
# Translations
*.mo
*.pot
# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal
# Flask stuff:
instance/
.webassets-cache
# Scrapy stuff:
.scrapy
# Sphinx documentation
docs/_build/
# PyBuilder
.pybuilder/
target/
# Jupyter Notebook
.ipynb_checkpoints
# IPython
profile_default/
ipython_config.py
# pyenv
# For a library or package, you might want to ignore these files since the code is
# intended to run in multiple environments; otherwise, check them in:
# .python-version
# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock
# poetry
# Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
# This is especially recommended for binary packages to ensure reproducibility, and is more
# commonly ignored for libraries.
# https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
#poetry.lock
# pdm
# Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
#pdm.lock
# pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
# in version control.
# https://pdm.fming.dev/latest/usage/project/#working-with-version-control
.pdm.toml
.pdm-python
.pdm-build/
# PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
__pypackages__/
# Celery stuff
celerybeat-schedule
celerybeat.pid
# SageMath parsed files
*.sage.py
# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/
# Spyder project settings
.spyderproject
.spyproject
# Rope project settings
.ropeproject
# mkdocs documentation
/site
# mypy
.mypy_cache/
.dmypy.json
dmypy.json
# Pyre type checker
.pyre/
# pytype static type analyzer
.pytype/
# Cython debug symbols
cython_debug/
# PyCharm
# JetBrains specific template is maintained in a separate JetBrains.gitignore that can
# be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
# and can be added to the global gitignore or merged into this file. For a more nuclear
# option (not recommended) you can uncomment the following to ignore the entire idea folder.
#.idea/
Crawl4AI.egg-info/
Crawl4AI.egg-info/*
crawler_data.db
.vscode/
.tests/
.test_pads/
test_pad.py
test_pad*.py
.data/
Crawl4AI.egg-info/
requirements0.txt
a.txt
*.sh
.idea
docs/examples/.chainlit/
docs/examples/.chainlit/*
.chainlit/config.toml
.chainlit/translations/en-US.json
local/
.files/
a.txt
.lambda_function.py
ec2*
update_changelog.sh
.DS_Store
docs/.DS_Store
tmp/
test_env/
**/.DS_Store
**/.DS_Store
todo.md
todo_executor.md
git_changes.py
git_changes.md
pypi_build.sh
git_issues.py
git_issues.md
.next/
.tests/
.docs/
.gitboss/
todo_executor.md
protect-all-except-feature.sh
manage-collab.sh
publish.sh
combine.sh
combined_output.txt
tree.md

12
.gitattributes vendored Normal file

@@ -0,0 +1,12 @@
# Documentation
*.html linguist-documentation
docs/* linguist-documentation
docs/examples/* linguist-documentation
docs/md_v2/* linguist-documentation
# Explicitly mark Python as the main language
*.py linguist-detectable=true
*.py linguist-language=Python
# Exclude HTML from language statistics
*.html linguist-detectable=false

@@ -0,0 +1,59 @@
title: "[Feature Request]: "
labels: ["⚙️ New"]
body:
- type: markdown
attributes:
value: |
Thank you for your interest in suggesting a new feature! Before you submit, please take a moment to check whether it already exists in
this discussions category to avoid duplicates. 😊
- type: textarea
id: needs_to_be_done
attributes:
label: What needs to be done?
description: Please describe the feature or functionality you'd like to see.
placeholder: "e.g., Return alt text along with images scraped from a webpage in Result"
validations:
required: true
- type: textarea
id: problem_to_solve
attributes:
label: What problem does this solve?
description: Explain the pain point or issue this feature will help address.
placeholder: "e.g., Bypass Captchas added by cloudflare"
validations:
required: true
- type: textarea
id: target_users
attributes:
label: Target users/beneficiaries
description: Who would benefit from this feature? (e.g., specific teams, developers, users, etc.)
placeholder: "e.g., Marketing teams, developers"
validations:
required: false
- type: textarea
id: current_workarounds
attributes:
label: Current alternatives/workarounds
description: Are there any existing solutions or workarounds? How does this feature improve upon them?
placeholder: "e.g., Users manually select the css classes mapped to data fields to extract them"
validations:
required: false
- type: markdown
attributes:
value: |
### 💡 Implementation Ideas
- type: textarea
id: proposed_approach
attributes:
label: Proposed approach
description: Share any ideas you have for how this feature could be implemented. Point out any challenges you foresee
and the success metrics for this feature
placeholder: "e.g., Implement a breadth first traversal algorithm for scraper"
validations:
required: false

127
.github/ISSUE_TEMPLATE/bug_report.yml vendored Normal file

@@ -0,0 +1,127 @@
name: Bug Report
description: Report a bug with Crawl4AI.
title: "[Bug]: "
labels: ["🐞 Bug","🩺 Needs Triage"]
body:
- type: input
id: crawl4ai_version
attributes:
label: crawl4ai version
description: Specify the version of crawl4ai you are using.
placeholder: "e.g., 2.0.0"
validations:
required: true
- type: textarea
id: expected_behavior
attributes:
label: Expected Behavior
description: Describe what you expected to happen.
placeholder: "Provide a detailed explanation of the expected outcome."
validations:
required: true
- type: textarea
id: current_behavior
attributes:
label: Current Behavior
description: Describe what is happening instead of the expected behavior.
placeholder: "Describe the actual result or issue you encountered."
validations:
required: true
- type: dropdown
id: reproducible
attributes:
label: Is this reproducible?
description: Indicate whether this bug can be reproduced consistently.
options:
- "Yes"
- "No"
validations:
required: true
- type: textarea
id: inputs
attributes:
label: Inputs Causing the Bug
description: Provide details about the inputs causing the issue.
placeholder: |
- URL(s):
- Settings used:
- Input data (if applicable):
render: bash
- type: textarea
id: steps_to_reproduce
attributes:
label: Steps to Reproduce
description: Provide step-by-step instructions to reproduce the issue.
placeholder: |
1. Go to...
2. Click on...
3. Observe the issue...
render: bash
- type: textarea
id: code_snippets
attributes:
label: Code snippets
description: Provide code snippets (if any). Add comments as necessary.
placeholder: print("Hello world")
render: python
# Header Section with Title
- type: markdown
attributes:
value: |
## Supporting Information
Please provide the following details to help us understand and resolve your issue. This will assist us in reproducing and diagnosing the problem
- type: input
id: os
attributes:
label: OS
description: Please provide the operating system & distro where the issue occurs.
placeholder: "e.g., Windows, macOS, Linux"
validations:
required: true
- type: input
id: python_version
attributes:
label: Python version
description: Specify the Python version being used.
placeholder: "e.g., 3.8.5"
validations:
required: true
# Browser Field
- type: input
id: browser
attributes:
label: Browser
description: Provide the name of the browser you are using.
placeholder: "e.g., Chrome, Firefox, Safari"
validations:
required: false
# Browser Version Field
- type: input
id: browser_version
attributes:
label: Browser version
description: Provide the version of the browser you are using.
placeholder: "e.g., 91.0.4472.124"
validations:
required: false
# Error Logs Field (Text Area)
- type: textarea
id: error_logs
attributes:
label: Error logs & Screenshots (if applicable)
description: If you encountered any errors, please provide the error logs. Attach any relevant screenshots to help us understand the issue.
placeholder: "Paste error logs here and attach your screenshots"
validations:
required: false

8
.github/ISSUE_TEMPLATE/config.yml vendored Normal file

@@ -0,0 +1,8 @@
blank_issues_enabled: false
contact_links:
  - name: Feature Requests
    url: https://github.com/unclecode/crawl4ai/discussions/categories/feature-requests
    about: "Suggest new features or enhancements for Crawl4AI"
  - name: Forums - Q&A
    url: https://github.com/unclecode/crawl4ai/discussions/categories/forums-q-a
    about: "Ask questions or engage in general discussions about Crawl4AI"

19
.github/pull_request_template.md vendored Normal file

@@ -0,0 +1,19 @@
## Summary
Please include a summary of the change and/or which issues are fixed.
eg: `Fixes #123` (Tag GitHub issue numbers in this format, so it automatically links the issues with your PR)
## List of files changed and why
eg: quickstart.py - To update the example as per new changes
## How Has This Been Tested?
Please describe the tests that you ran to verify your changes.
## Checklist:
- [ ] My code follows the style guidelines of this project
- [ ] I have performed a self-review of my own code
- [ ] I have commented my code, particularly in hard-to-understand areas
- [ ] I have made corresponding changes to the documentation
- [ ] I have added/updated unit tests that prove my fix is effective or that my feature works
- [ ] New and existing unit tests pass locally with my changes

11
.gitignore vendored

@@ -227,3 +227,14 @@ tree.md
.do
/plans
plans/
# Codeium
.codeiumignore
todo/
# windsurf rules
.windsurfrules
# windsurf rules
.windsurfrules

@@ -7,23 +7,123 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
---
## [0.4.267] - 2025 - 01 - 06 ### Changed
Okay, here's a detailed changelog in Markdown format, generated from the provided git diff and commit history. I've focused on user-facing changes, fixes, and features, and grouped them as requested:
### Added ## Version 0.4.3b2 (2025-01-21)
This release introduces several powerful new features, including robots.txt compliance, dynamic proxy support, LLM-powered schema generation, and improved documentation.
### Features
- **Robots.txt Compliance:**
- Added robots.txt compliance support with efficient SQLite-based caching.
- New `check_robots_txt` parameter in `CrawlerRunConfig` to enable robots.txt checking before crawling a URL.
- Automated robots.txt checking is now integrated into `AsyncWebCrawler` with 403 status codes for blocked URLs.
- **Proxy Configuration:**
- Added proxy configuration support to `CrawlerRunConfig`, allowing dynamic proxy settings per crawl request.
- Updated documentation with examples for using proxy configuration in crawl operations.
- **LLM-Powered Schema Generation:**
- Introduced a new utility for automatic CSS and XPath schema generation using OpenAI or Ollama models.
- Added comprehensive documentation and examples for schema generation.
- New prompt templates optimized for HTML schema analysis.
- **URL Redirection Tracking:**
- Added URL redirection tracking to capture the final URL after any redirects.
- The final URL is now available in the `redirected_url` field of the `AsyncCrawlResponse` object.
- **Enhanced Streamlined Documentation:**
- Refactored and improved the documentation structure for clarity and ease of use.
- Added detailed explanations of new features and updated examples.
- **Improved Browser Context Management:**
- Enhanced the management of browser contexts and added shared data support.
- Introduced the `shared_data` parameter in `CrawlerRunConfig` to pass data between hooks.
- **Memory Dispatcher System:**
- Migrated to a memory dispatcher system with enhanced monitoring capabilities.
- Introduced `MemoryAdaptiveDispatcher` and `SemaphoreDispatcher` for improved resource management.
- Added `RateLimiter` for rate limiting support.
- New `CrawlerMonitor` for real-time monitoring of crawler operations.
- **Streaming Support:**
- Added streaming support for processing crawled URLs as they are processed.
- Enabled streaming mode with the `stream` parameter in `CrawlerRunConfig`.
- **Content Scraping Strategy:**
- Introduced a new `LXMLWebScrapingStrategy` for faster content scraping.
- Added support for selecting the scraping strategy via the `scraping_strategy` parameter in `CrawlerRunConfig`.
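To make the robots.txt entry above more concrete, here is a rough sketch of what a robots.txt check involves, using only Python's standard `urllib.robotparser`. It does not use or reproduce Crawl4AI's implementation (the changelog mentions SQLite-based caching and a 403 status for blocked URLs, neither of which is modelled here). The helper `is_allowed`, the sample rules, and the `mybot` user agent are all assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url.

    Hypothetical helper; takes the robots.txt body as text so that the
    example needs no network access.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Sample rules: every agent is barred from /private/, everything else is open.
SAMPLE_RULES = """User-agent: *
Disallow: /private/
"""

allowed = is_allowed(SAMPLE_RULES, "mybot", "https://example.com/public/page")
blocked = is_allowed(SAMPLE_RULES, "mybot", "https://example.com/private/data")
```

A crawler would typically fetch and cache the robots.txt body once per host and run a check like this before requesting each URL.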
### Bug Fixes
- **Browser Path Management:**
- Improved browser path management for consistent behavior across different environments.
- **Memory Threshold:**
- Adjusted the default memory threshold to improve resource utilization.
- **Pydantic Model Fields:**
- Made several model fields optional with default values to improve flexibility.
### Refactor
- **Documentation Structure:**
- Reorganized documentation structure to improve navigation and readability.
- Updated styles and added new sections for advanced features.
- **Scraping Mode:**
- Replaced the `ScrapingMode` enum with a strategy pattern for more flexible content scraping.
- **Version Update:**
- Updated the version to `0.4.248`.
- **Code Cleanup:**
- Removed unused files and improved type hints.
- Applied Ruff corrections for code quality.
- **Updated dependencies:**
- Updated dependencies to their latest versions to ensure compatibility and security.
- **Ignored certain patterns and directories:**
- Updated `.gitignore` and `.codeiumignore` to ignore additional patterns and directories, streamlining the development environment.
- **Simplified Personal Story in README:**
- Streamlined the personal story and project vision in the `README.md` for clarity.
- **Removed Deprecated Files:**
- Deleted several deprecated files and examples that are no longer relevant.
---
**Previous Releases:**
### 0.4.24x (2024-12-31)
- **Enhanced SSL & Security**: New SSL certificate handling with custom paths and validation options for secure crawling.
- **Smart Content Filtering**: Advanced filtering system with regex support and efficient chunking strategies.
- **Improved JSON Extraction**: Support for complex JSONPath, JSON-CSS, and Microdata extraction.
- **New Field Types**: Added `computed`, `conditional`, `aggregate`, and `template` field types.
- **Performance Boost**: Optimized caching, parallel processing, and memory management.
- **Better Error Handling**: Enhanced debugging capabilities with detailed error tracking.
- **Security Features**: Improved input validation and safe expression evaluation.
### 0.4.247 (2025-01-06)
#### Added
- **Windows Event Loop Configuration**: Introduced a utility function `configure_windows_event_loop` to resolve `NotImplementedError` for asyncio subprocesses on Windows. ([#utils.py](crawl4ai/utils.py), [#tutorials/async-webcrawler-basics.md](docs/md_v3/tutorials/async-webcrawler-basics.md))
- **`page_need_scroll` Method**: Added a method to determine if a page requires scrolling before taking actions in `AsyncPlaywrightCrawlerStrategy`. ([#async_crawler_strategy.py](crawl4ai/async_crawler_strategy.py))
### Changed #### Changed
- **Version Bump**: Updated the version from `0.4.246` to `0.4.247`. ([#__version__.py](crawl4ai/__version__.py))
- **Improved Scrolling Logic**: Enhanced scrolling methods in `AsyncPlaywrightCrawlerStrategy` by adding a `scroll_delay` parameter for better control. ([#async_crawler_strategy.py](crawl4ai/async_crawler_strategy.py))
- **Markdown Generation Example**: Updated the `hello_world.py` example to reflect the latest API changes and better illustrate features. ([#examples/hello_world.py](docs/examples/hello_world.py))
- **Documentation Update**:
- Added Windows-specific instructions for handling asyncio event loops. ([#async-webcrawler-basics.md](docs/md_v3/tutorials/async-webcrawler-basics.md))
### Removed #### Removed
- **Legacy Markdown Generation Code**: Removed outdated and unused code for markdown generation in `content_scraping_strategy.py`. ([#content_scraping_strategy.py](crawl4ai/content_scraping_strategy.py))
### Fixed #### Fixed
- **Page Closing to Prevent Memory Leaks**:
- **Description**: Added a `finally` block to ensure pages are closed when no `session_id` is provided.
- **Impact**: Prevents memory leaks caused by lingering pages after a crawl.
@@ -38,9 +138,11 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- **Multiple Element Selection**: Modified `_get_elements` in `JsonCssExtractionStrategy` to return all matching elements instead of just the first one, ensuring comprehensive extraction. ([#extraction_strategy.py](crawl4ai/extraction_strategy.py))
- **Error Handling in Scrolling**: Added robust error handling to ensure scrolling proceeds safely even if a configuration is missing. ([#async_crawler_strategy.py](crawl4ai/async_crawler_strategy.py))
### Other ## [0.4.267] - 2025 - 01 - 06
- **Git Ignore Update**: Added `/plans` to `.gitignore` for better development environment consistency. ([#.gitignore](.gitignore))
### Added
- **Windows Event Loop Configuration**: Introduced a utility function `configure_windows_event_loop` to resolve `NotImplementedError` for asyncio subprocesses on Windows. ([#utils.py](crawl4ai/utils.py), [#tutorials/async-webcrawler-basics.md](docs/md_v3/tutorials/async-webcrawler-basics.md))
- **`page_need_scroll` Method**: Added a method to determine if a page requires scrolling before taking actions in `AsyncPlaywrightCrawlerStrategy`. ([#async_crawler_strategy.py](crawl4ai/async_crawler_strategy.py))
## [0.4.24] - 2024-12-31

@@ -6,7 +6,7 @@ We would like to thank the following people for their contributions to Crawl4AI:
- [Unclecode](https://github.com/unclecode) - Project Creator and Main Developer
- [Nasrin](https://github.com/ntohidi) - Project Manager and Developer
- [Aravind Karnam](https://github.com/aravindkarnam) - Developer - [Aravind Karnam](https://github.com/aravindkarnam) - Head of Community and Product
## Community Contributors

@@ -21,9 +21,21 @@
Crawl4AI is the #1 trending GitHub repository, actively maintained by a vibrant community. It delivers blazing-fast, AI-ready web crawling tailored for LLMs, AI agents, and data pipelines. Open source, flexible, and built for real-time performance, Crawl4AI empowers developers with unmatched speed, precision, and deployment ease.
[✨ Check out latest update v0.4.24x](#-recent-updates) [✨ Check out latest update v0.4.3bx](#-recent-updates)
🎉 **Version 0.4.24x is out!** Major improvements in extraction strategies with enhanced JSON handling, SSL security, and Amazon product extraction. Plus, a completely revamped content filtering system! [Read the release notes →](https://crawl4ai.com/mkdocs/blog) 🎉 **Version 0.4.3bx is out!** This release brings exciting new features like a Memory Dispatcher System, Streaming Support, LLM-Powered Markdown Generation, Schema Generation, and Robots.txt Compliance! [Read the release notes →](https://docs.crawl4ai.com/blog)
<details>
<summary>🤓 <strong>My Personal Story</strong></summary>
My journey with computers started in childhood when my dad, a computer scientist, introduced me to an Amstrad computer. Those early days sparked a fascination with technology, leading me to pursue computer science and specialize in NLP during my postgraduate studies. It was during this time that I first delved into web crawling, building tools to help researchers organize papers and extract information from publications, a challenging yet rewarding experience that honed my skills in data extraction.
Fast forward to 2023, I was working on a tool for a project and needed a crawler to convert a webpage into markdown. While exploring solutions, I found one that claimed to be open-source but required creating an account and generating an API token. Worse, it turned out to be a SaaS model charging $16, and its quality didn't meet my standards. Frustrated, I realized this was a deeper problem. That frustration turned into turbo anger mode, and I decided to build my own solution. In just a few days, I created Crawl4AI. To my surprise, it went viral, earning thousands of GitHub stars and resonating with a global community.
I made Crawl4AI open-source for two reasons. First, it's my way of giving back to the open-source community that has supported me throughout my career. Second, I believe data should be accessible to everyone, not locked behind paywalls or monopolized by a few. Open access to data lays the foundation for the democratization of AI, a vision where individuals can train their own models and take ownership of their information. This library is the first step in a larger journey to create the best open-source data extraction and generation tool the world has ever seen, built collaboratively by a passionate community.
Thank you to everyone who has supported this project, used it, and shared feedback. Your encouragement motivates me to dream even bigger. Join us, file issues, submit PRs, or spread the word. Together, we can build a tool that truly empowers people to access their own data and reshape the future of AI.
</details>
## 🧐 Why Crawl4AI?
@@ -41,6 +53,9 @@ Crawl4AI is the #1 trending GitHub repository, actively maintained by a vibrant
# Install the package
pip install -U crawl4ai
# For pre release versions
pip install crawl4ai --pre
# Run post-installation setup
crawl4ai-setup
@@ -149,7 +164,7 @@ if __name__ == "__main__":
✨ Play around with this [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1SgRPrByQLzjRfwoRNq1wSGE9nYY_EE8C?usp=sharing)
✨ Visit our [Documentation Website](https://crawl4ai.com/mkdocs/) ✨ Visit our [Documentation Website](https://docs.crawl4ai.com/)
## Installation 🛠️
@@ -265,7 +280,7 @@ task_id = response.json()["task_id"]
result = requests.get(f"http://localhost:11235/task/{task_id}")
```
For more examples, see our [Docker Examples](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/docker_example.py). For advanced configuration, environment variables, and usage examples, see our [Docker Deployment Guide](https://crawl4ai.com/mkdocs/basic/docker-deployment/). For more examples, see our [Docker Examples](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/docker_example.py). For advanced configuration, environment variables, and usage examples, see our [Docker Deployment Guide](https://docs.crawl4ai.com/basic/docker-deployment/).
</details>
@@ -432,7 +447,7 @@ if __name__ == "__main__":
</details>
<details>
<summary>🤖 <strong>Using You own Browswer with Custome User Profile</strong></summary> <summary>🤖 <strong>Using You own Browser with Custom User Profile</strong></summary>
```python
import os, sys
@@ -470,24 +485,70 @@ async def test_news_crawl():
</details>
## ✨ Recent Updates
- 🔒 **Enhanced SSL & Security**: New SSL certificate handling with custom paths and validation options for secure crawling - **🚀 New Dispatcher System**: Scale to thousands of URLs with intelligent **memory monitoring**, **concurrency control**, and optional **rate limiting**. (See `MemoryAdaptiveDispatcher`, `SemaphoreDispatcher`, `RateLimiter`, `CrawlerMonitor`)
- 🔍 **Smart Content Filtering**: Advanced filtering system with regex support and efficient chunking strategies - **⚡ Streaming Mode**: Process results **as they arrive** instead of waiting for an entire batch to complete. (Set `stream=True` in `CrawlerRunConfig`)
- 📦 **Improved JSON Extraction**: Support for complex JSONPath, JSON-CSS, and Microdata extraction - **🤖 Enhanced LLM Integration**:
- 🏗️ **New Field Types**: Added `computed`, `conditional`, `aggregate`, and `template` field types - **Automatic schema generation**: Create extraction rules from HTML using OpenAI or Ollama, no manual CSS/XPath needed.
-**Performance Boost**: Optimized caching, parallel processing, and memory management - **LLM-powered Markdown filtering**: Refine your markdown output with a new `LLMContentFilter` that understands content relevance.
- 🐛 **Better Error Handling**: Enhanced debugging capabilities with detailed error tracking - **Ollama Support**: Use open-source or self-hosted models for private or cost-effective extraction.
- 🔐 **Security Features**: Improved input validation and safe expression evaluation - **🏎️ Faster Scraping Option**: New `LXMLWebScrapingStrategy` offers **10-20x speedup** for large, complex pages (experimental).
- **🤖 robots.txt Compliance**: Respect website rules with `check_robots_txt=True` and efficient local caching.
- **🔄 Proxy Rotation**: Built-in support for dynamic proxy switching and IP verification, with support for authenticated proxies and session persistence.
- **➡️ URL Redirection Tracking**: The `redirected_url` field now captures the final destination after any redirects.
- **🪞 Improved Mirroring**: The `LXMLWebScrapingStrategy` now has much greater fidelity, allowing for almost pixel-perfect mirroring of websites.
- **📈 Enhanced Monitoring**: Track memory, CPU, and individual crawler status with `CrawlerMonitor`.
- **📝 Improved Documentation**: More examples, clearer explanations, and updated tutorials.
Read the full details of this release in our [0.4.24 Release Notes](https://github.com/unclecode/crawl4ai/blob/main/CHANGELOG.md). Read the full details in our [0.4.3bx Release Notes](https://github.com/unclecode/crawl4ai/blob/main/CHANGELOG.md).
## Version Numbering in Crawl4AI
Crawl4AI follows standard Python version numbering conventions (PEP 440) to help users understand the stability and features of each release.
### Version Numbers Explained
Our version numbers follow this pattern: `MAJOR.MINOR.PATCH` (e.g., 0.4.3)
#### Pre-release Versions
We use different suffixes to indicate development stages:
- `dev` (0.4.3dev1): Development versions, unstable
- `a` (0.4.3a1): Alpha releases, experimental features
- `b` (0.4.3b1): Beta releases, feature complete but needs testing
- `rc` (0.4.3rc1): Release candidates, potential final version
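The ordering implied by these suffixes (dev before alpha, alpha before beta, beta before release candidate, all before the final release) can be sketched as a sort key. This is a deliberately simplified illustration covering only the `MAJOR.MINOR.PATCH` plus suffix shapes listed above, not a full PEP 440 parser (no epochs, post-releases, or local versions). The names `sort_key` and `is_prerelease` are hypothetical. For real use, `packaging.version.Version` handles the complete specification.

```python
import re

# Rank of each development stage; the empty string marks a final release.
_STAGE_RANK = {"dev": 0, "a": 1, "b": 2, "rc": 3, "": 4}

# MAJOR.MINOR.PATCH with an optional suffix such as dev1, a1, b1 or rc1.
_PATTERN = re.compile(r"^(\d+)\.(\d+)\.(\d+)(?:\.?(dev|a|b|rc)(\d+))?$")


def sort_key(version: str) -> tuple:
    """Return a tuple that orders versions from least to most final."""
    match = _PATTERN.match(version)
    if match is None:
        raise ValueError(f"unsupported version string: {version!r}")
    major, minor, patch, stage, number = match.groups()
    return (
        int(major),
        int(minor),
        int(patch),
        _STAGE_RANK[stage or ""],
        int(number or 0),
    )


def is_prerelease(version: str) -> bool:
    """True for dev, alpha, beta and release-candidate versions."""
    return sort_key(version)[3] < _STAGE_RANK[""]


versions = ["0.4.3", "0.4.3rc1", "0.4.3dev1", "0.4.3b1", "0.4.3a1"]
ordered = sorted(versions, key=sort_key)
```

By default pip skips pre-release versions when resolving a requirement, so a plain install will not select versions such as `0.4.3b1` unless `--pre` is passed or the version is pinned explicitly.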
#### Installation
- Regular installation (stable version):
```bash
pip install -U crawl4ai
```
- Install pre-release versions:
```bash
pip install crawl4ai --pre
```
- Install specific version:
```bash
pip install crawl4ai==0.4.3b1
```
#### Why Pre-releases?
We use pre-releases to:
- Test new features in real-world scenarios
- Gather feedback before final releases
- Ensure stability for production users
- Allow early adopters to try new features
For production environments, we recommend using the stable version. For testing new features, you can opt in to pre-releases using the `--pre` flag.
## 📖 Documentation & Roadmap ## 📖 Documentation & Roadmap
> 🚨 **Documentation Update Alert**: We're undertaking a major documentation overhaul next week to reflect recent updates and improvements. Stay tuned for a more comprehensive and up-to-date guide! > 🚨 **Documentation Update Alert**: We're undertaking a major documentation overhaul next week to reflect recent updates and improvements. Stay tuned for a more comprehensive and up-to-date guide!
For current documentation, including installation instructions, advanced features, and API reference, visit our [Documentation Website](https://crawl4ai.com/mkdocs/). For current documentation, including installation instructions, advanced features, and API reference, visit our [Documentation Website](https://docs.crawl4ai.com/).
To check our development plans and upcoming features, visit our [Roadmap](https://github.com/unclecode/crawl4ai/blob/main/ROADMAP.md). To check our development plans and upcoming features, visit our [Roadmap](https://github.com/unclecode/crawl4ai/blob/main/ROADMAP.md).
@@ -511,7 +572,7 @@ To check our development plans and upcoming features, visit our [Roadmap](https:
## 🤝 Contributing ## 🤝 Contributing
We welcome contributions from the open-source community. Check out our [contribution guidelines](https://github.com/unclecode/crawl4ai/blob/main/CONTRIBUTING.md) for more information. We welcome contributions from the open-source community. Check out our [contribution guidelines](https://github.com/unclecode/crawl4ai/blob/main/CONTRIBUTORS.md) for more information.
## 📄 License ## 📄 License

View File

@@ -2,45 +2,88 @@
from .async_webcrawler import AsyncWebCrawler, CacheMode from .async_webcrawler import AsyncWebCrawler, CacheMode
from .async_configs import BrowserConfig, CrawlerRunConfig from .async_configs import BrowserConfig, CrawlerRunConfig
from .extraction_strategy import ExtractionStrategy, LLMExtractionStrategy, CosineStrategy, JsonCssExtractionStrategy from .content_scraping_strategy import (
ContentScrapingStrategy,
WebScrapingStrategy,
LXMLWebScrapingStrategy,
)
from .extraction_strategy import (
ExtractionStrategy,
LLMExtractionStrategy,
CosineStrategy,
JsonCssExtractionStrategy,
JsonXPathExtractionStrategy
)
from .chunking_strategy import ChunkingStrategy, RegexChunking from .chunking_strategy import ChunkingStrategy, RegexChunking
from .markdown_generation_strategy import DefaultMarkdownGenerator from .markdown_generation_strategy import DefaultMarkdownGenerator
from .content_filter_strategy import PruningContentFilter, BM25ContentFilter from .content_filter_strategy import PruningContentFilter, BM25ContentFilter, LLMContentFilter, RelevantContentFilter
from .models import CrawlResult from .models import CrawlResult, MarkdownGenerationResult
from .__version__ import __version__ from .async_dispatcher import (
MemoryAdaptiveDispatcher,
SemaphoreDispatcher,
RateLimiter,
CrawlerMonitor,
DisplayMode,
BaseDispatcher
)
__all__ = [ __all__ = [
"AsyncWebCrawler", "AsyncWebCrawler",
"CrawlResult", "CrawlResult",
"CacheMode", "CacheMode",
'BrowserConfig', "ContentScrapingStrategy",
'CrawlerRunConfig', "WebScrapingStrategy",
'ExtractionStrategy', "LXMLWebScrapingStrategy",
'LLMExtractionStrategy', "BrowserConfig",
'CosineStrategy', "CrawlerRunConfig",
'JsonCssExtractionStrategy', "ExtractionStrategy",
'ChunkingStrategy', "LLMExtractionStrategy",
'RegexChunking', "CosineStrategy",
'DefaultMarkdownGenerator', "JsonCssExtractionStrategy",
'PruningContentFilter', "JsonXPathExtractionStrategy",
'BM25ContentFilter', "ChunkingStrategy",
"RegexChunking",
"DefaultMarkdownGenerator",
"RelevantContentFilter",
"PruningContentFilter",
"BM25ContentFilter",
"LLMContentFilter",
"BaseDispatcher",
"MemoryAdaptiveDispatcher",
"SemaphoreDispatcher",
"RateLimiter",
"CrawlerMonitor",
"DisplayMode",
"MarkdownGenerationResult",
] ]
def is_sync_version_installed(): def is_sync_version_installed():
try: try:
import selenium import selenium
return True return True
except ImportError: except ImportError:
return False return False
if is_sync_version_installed(): if is_sync_version_installed():
try: try:
from .web_crawler import WebCrawler from .web_crawler import WebCrawler
__all__.append("WebCrawler") __all__.append("WebCrawler")
except ImportError: except ImportError:
import warnings print(
print("Warning: Failed to import WebCrawler even though selenium is installed. This might be due to other missing dependencies.") "Warning: Failed to import WebCrawler even though selenium is installed. This might be due to other missing dependencies."
)
else: else:
WebCrawler = None WebCrawler = None
# import warnings # import warnings
# print("Warning: Synchronous WebCrawler is not available. Install crawl4ai[sync] for synchronous support. However, please note that the synchronous version will be deprecated soon.") # print("Warning: Synchronous WebCrawler is not available. Install crawl4ai[sync] for synchronous support. However, please note that the synchronous version will be deprecated soon.")
import warnings
from pydantic import warnings as pydantic_warnings
# Disable all Pydantic warnings
warnings.filterwarnings("ignore", module="pydantic")
# pydantic_warnings.filter_warnings()
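The `module` argument of `warnings.filterwarnings` is a regular expression matched against the start of the name of the module that triggers the warning, so `"pydantic"` also covers submodules such as `pydantic.fields`. A minimal stdlib-only sketch of that behaviour, assuming nothing about pydantic itself (it is not imported, and the module names passed to `warn_explicit` are illustrative stand-ins):

```python
import warnings


def emitted_from(module_name: str) -> int:
    """Return how many warnings get through when issued on behalf of `module_name`."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # start from a state where nothing is hidden
        warnings.filterwarnings("ignore", module="pydantic")  # same call as in the diff
        warnings.warn_explicit(
            "illustrative warning",
            category=UserWarning,
            filename="example.py",
            lineno=1,
            module=module_name,  # pretend the warning originates in this module
        )
        return len(caught)


print(emitted_from("pydantic.fields"))
print(emitted_from("my_app.models"))
```

Because the pattern is only anchored at the start, any module whose name begins with `pydantic` (for example `pydantic_core`) is silenced as well.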

View File

@@ -1,2 +1,2 @@
# crawl4ai/_version.py # crawl4ai/_version.py
__version__ = "0.4.247" __version__ = "0.4.3b3"

View File

@@ -5,13 +5,17 @@ from .config import (
PAGE_TIMEOUT, PAGE_TIMEOUT,
IMAGE_SCORE_THRESHOLD, IMAGE_SCORE_THRESHOLD,
SOCIAL_MEDIA_DOMAINS, SOCIAL_MEDIA_DOMAINS,
) )
from .user_agent_generator import UserAgentGenerator
from .user_agent_generator import UserAgentGenerator, UAGen, ValidUAGenerator, OnlineUAGenerator
from .extraction_strategy import ExtractionStrategy from .extraction_strategy import ExtractionStrategy
from .chunking_strategy import ChunkingStrategy from .chunking_strategy import ChunkingStrategy, RegexChunking
from .deep_crawl import DeepCrawlStrategy
from .markdown_generation_strategy import MarkdownGenerationStrategy from .markdown_generation_strategy import MarkdownGenerationStrategy
from typing import Union, List from .content_filter_strategy import RelevantContentFilter, BM25ContentFilter, LLMContentFilter, PruningContentFilter
from .content_scraping_strategy import ContentScrapingStrategy, WebScrapingStrategy
from typing import Optional, Union, List
from .cache_context import CacheMode
class BrowserConfig: class BrowserConfig:
@@ -29,6 +33,7 @@ class BrowserConfig:
Default: True. Default: True.
use_managed_browser (bool): Launch the browser using a managed approach (e.g., via CDP), allowing use_managed_browser (bool): Launch the browser using a managed approach (e.g., via CDP), allowing
advanced manipulation. Default: False. advanced manipulation. Default: False.
cdp_url (str): URL for the Chrome DevTools Protocol (CDP) endpoint. Default: "ws://localhost:9222/devtools/browser/".
debugging_port (int): Port for the browser debugging protocol. Default: 9222. debugging_port (int): Port for the browser debugging protocol. Default: 9222.
use_persistent_context (bool): Use a persistent browser context (like a persistent profile). use_persistent_context (bool): Use a persistent browser context (like a persistent profile).
Automatically sets use_managed_browser=True. Default: False. Automatically sets use_managed_browser=True. Default: False.
@@ -38,7 +43,7 @@ class BrowserConfig:
is "chromium". Default: "chromium". is "chromium". Default: "chromium".
channel (str): The channel to launch (e.g., "chromium", "chrome", "msedge"). Only applies if browser_type channel (str): The channel to launch (e.g., "chromium", "chrome", "msedge"). Only applies if browser_type
is "chromium". Default: "chromium". is "chromium". Default: "chromium".
proxy (str or None): Proxy server URL (e.g., "http://username:password@proxy:port"). If None, no proxy is used. proxy (Optional[str]): Proxy server URL (e.g., "http://username:password@proxy:port"). If None, no proxy is used.
Default: None. Default: None.
proxy_config (dict or None): Detailed proxy configuration, e.g. {"server": "...", "username": "..."}. proxy_config (dict or None): Detailed proxy configuration, e.g. {"server": "...", "username": "..."}.
If None, no additional proxy config. Default: None. If None, no additional proxy config. Default: None.
@@ -77,6 +82,7 @@ class BrowserConfig:
browser_type: str = "chromium", browser_type: str = "chromium",
headless: bool = True, headless: bool = True,
use_managed_browser: bool = False, use_managed_browser: bool = False,
cdp_url: str = None,
use_persistent_context: bool = False, use_persistent_context: bool = False,
user_data_dir: str = None, user_data_dir: str = None,
chrome_channel: str = "chromium", chrome_channel: str = "chromium",
@@ -87,7 +93,7 @@ class BrowserConfig:
viewport_height: int = 600, viewport_height: int = 600,
accept_downloads: bool = False, accept_downloads: bool = False,
downloads_path: str = None, downloads_path: str = None,
storage_state=None, storage_state : Union[str, dict, None]=None,
ignore_https_errors: bool = True, ignore_https_errors: bool = True,
java_script_enabled: bool = True, java_script_enabled: bool = True,
sleep_on_close: bool = False, sleep_on_close: bool = False,
@@ -95,23 +101,30 @@ class BrowserConfig:
cookies: list = None, cookies: list = None,
headers: dict = None, headers: dict = None,
user_agent: str = ( user_agent: str = (
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:109.0) AppleWebKit/537.36 " # "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:109.0) AppleWebKit/537.36 "
"(KHTML, like Gecko) Chrome/116.0.5845.187 Safari/604.1 Edg/117.0.2045.47" # "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
# "(KHTML, like Gecko) Chrome/116.0.5845.187 Safari/604.1 Edg/117.0.2045.47"
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 Chrome/116.0.0.0 Safari/537.36"
), ),
user_agent_mode: str = None, user_agent_mode: str = "",
user_agent_generator_config: dict = None, user_agent_generator_config: dict = {},
text_mode: bool = False, text_mode: bool = False,
light_mode: bool = False, light_mode: bool = False,
extra_args: list = None, extra_args: list = None,
debugging_port : int = 9222, debugging_port: int = 9222,
host: str = "localhost",
): ):
self.browser_type = browser_type self.browser_type = browser_type
self.headless = headless self.headless = headless
self.use_managed_browser = use_managed_browser self.use_managed_browser = use_managed_browser
self.cdp_url = cdp_url
self.use_persistent_context = use_persistent_context self.use_persistent_context = use_persistent_context
self.user_data_dir = user_data_dir self.user_data_dir = user_data_dir
self.chrome_channel = chrome_channel or self.browser_type or "chromium" self.chrome_channel = chrome_channel or self.browser_type or "chromium"
self.channel = channel or self.browser_type or "chromium" self.channel = channel or self.browser_type or "chromium"
if self.browser_type in ["firefox", "webkit"]:
self.channel = ""
self.chrome_channel = ""
self.proxy = proxy self.proxy = proxy
self.proxy_config = proxy_config self.proxy_config = proxy_config
self.viewport_width = viewport_width self.viewport_width = viewport_width
@@ -133,17 +146,15 @@ class BrowserConfig:
self.verbose = verbose self.verbose = verbose
self.debugging_port = debugging_port self.debugging_port = debugging_port
user_agenr_generator = UserAgentGenerator() fa_user_agenr_generator = ValidUAGenerator()
if self.user_agent_mode != "random" and self.user_agent_generator_config: if self.user_agent_mode == "random":
self.user_agent = user_agenr_generator.generate( self.user_agent = fa_user_agenr_generator.generate(
**(self.user_agent_generator_config or {}) **(self.user_agent_generator_config or {})
) )
elif self.user_agent_mode == "random":
self.user_agent = user_agenr_generator.generate()
else: else:
pass pass
self.browser_hint = user_agenr_generator.generate_client_hints(self.user_agent) self.browser_hint = UAGen.generate_client_hints(self.user_agent)
self.headers.setdefault("sec-ch-ua", self.browser_hint) self.headers.setdefault("sec-ch-ua", self.browser_hint)
# If persistent context is requested, ensure managed browser is enabled # If persistent context is requested, ensure managed browser is enabled
@@ -156,6 +167,7 @@ class BrowserConfig:
browser_type=kwargs.get("browser_type", "chromium"), browser_type=kwargs.get("browser_type", "chromium"),
headless=kwargs.get("headless", True), headless=kwargs.get("headless", True),
use_managed_browser=kwargs.get("use_managed_browser", False), use_managed_browser=kwargs.get("use_managed_browser", False),
cdp_url=kwargs.get("cdp_url"),
use_persistent_context=kwargs.get("use_persistent_context", False), use_persistent_context=kwargs.get("use_persistent_context", False),
user_data_dir=kwargs.get("user_data_dir"), user_data_dir=kwargs.get("user_data_dir"),
chrome_channel=kwargs.get("chrome_channel", "chromium"), chrome_channel=kwargs.get("chrome_channel", "chromium"),
@@ -183,6 +195,51 @@ class BrowserConfig:
extra_args=kwargs.get("extra_args", []), extra_args=kwargs.get("extra_args", []),
) )
def to_dict(self):
return {
"browser_type": self.browser_type,
"headless": self.headless,
"use_managed_browser": self.use_managed_browser,
"cdp_url": self.cdp_url,
"use_persistent_context": self.use_persistent_context,
"user_data_dir": self.user_data_dir,
"chrome_channel": self.chrome_channel,
"channel": self.channel,
"proxy": self.proxy,
"proxy_config": self.proxy_config,
"viewport_width": self.viewport_width,
"viewport_height": self.viewport_height,
"accept_downloads": self.accept_downloads,
"downloads_path": self.downloads_path,
"storage_state": self.storage_state,
"ignore_https_errors": self.ignore_https_errors,
"java_script_enabled": self.java_script_enabled,
"cookies": self.cookies,
"headers": self.headers,
"user_agent": self.user_agent,
"user_agent_mode": self.user_agent_mode,
"user_agent_generator_config": self.user_agent_generator_config,
"text_mode": self.text_mode,
"light_mode": self.light_mode,
"extra_args": self.extra_args,
"sleep_on_close": self.sleep_on_close,
"verbose": self.verbose,
"debugging_port": self.debugging_port,
}
def clone(self, **kwargs):
"""Create a copy of this configuration with updated values.
Args:
**kwargs: Key-value pairs of configuration options to update
Returns:
BrowserConfig: A new instance with the specified updates
"""
config_dict = self.to_dict()
config_dict.update(kwargs)
return BrowserConfig.from_kwargs(config_dict)
class CrawlerRunConfig: class CrawlerRunConfig:
""" """
@@ -221,6 +278,10 @@ class CrawlerRunConfig:
Default: False. Default: False.
parser_type (str): Type of parser to use for HTML parsing. parser_type (str): Type of parser to use for HTML parsing.
Default: "lxml". Default: "lxml".
scraping_strategy (ContentScrapingStrategy): Scraping strategy to use.
Default: WebScrapingStrategy.
proxy_config (dict or None): Detailed proxy configuration, e.g. {"server": "...", "username": "..."}.
If None, no additional proxy config. Default: None.
# Caching Parameters # Caching Parameters
cache_mode (CacheMode or None): Defines how caching is handled. cache_mode (CacheMode or None): Defines how caching is handled.
@@ -237,6 +298,8 @@ class CrawlerRunConfig:
Default: False. Default: False.
no_cache_write (bool): Legacy parameter, if True acts like CacheMode.READ_ONLY. no_cache_write (bool): Legacy parameter, if True acts like CacheMode.READ_ONLY.
Default: False. Default: False.
shared_data (dict or None): Shared data to be passed between hooks.
Default: None.
# Page Navigation and Timing Parameters # Page Navigation and Timing Parameters
wait_until (str): The condition to wait for when navigating, e.g. "domcontentloaded". wait_until (str): The condition to wait for when navigating, e.g. "domcontentloaded".
@@ -311,6 +374,20 @@ class CrawlerRunConfig:
Default: True. Default: True.
log_console (bool): If True, log console messages from the page. log_console (bool): If True, log console messages from the page.
Default: False. Default: False.
# Streaming Parameters
stream (bool): If True, enables streaming of crawled URLs as they are processed when used with arun_many.
Default: False.
# Optional Parameters
url (str or None): Not a compulsory parameter. Default: None.
check_robots_txt (bool): Whether to check robots.txt rules before crawling. Default: False
user_agent (str): Custom User-Agent string to use. Default: None
user_agent_mode (str or None): Mode for generating the user agent (e.g., "random"). If None, use the provided
user_agent as-is. Default: None.
user_agent_generator_config (dict or None): Configuration for user agent generation if user_agent_mode is set.
Default: None.
""" """
def __init__( def __init__(
@@ -318,9 +395,10 @@ class CrawlerRunConfig:
# Content Processing Parameters # Content Processing Parameters
word_count_threshold: int = MIN_WORD_THRESHOLD, word_count_threshold: int = MIN_WORD_THRESHOLD,
extraction_strategy: ExtractionStrategy = None, extraction_strategy: ExtractionStrategy = None,
chunking_strategy: ChunkingStrategy = None, chunking_strategy: ChunkingStrategy = RegexChunking(),
deep_crawl_strategy: DeepCrawlStrategy = None,
markdown_generator: MarkdownGenerationStrategy = None, markdown_generator: MarkdownGenerationStrategy = None,
content_filter=None, content_filter : RelevantContentFilter = None,
only_text: bool = False, only_text: bool = False,
css_selector: str = None, css_selector: str = None,
excluded_tags: list = None, excluded_tags: list = None,
@@ -329,18 +407,18 @@ class CrawlerRunConfig:
remove_forms: bool = False, remove_forms: bool = False,
prettiify: bool = False, prettiify: bool = False,
parser_type: str = "lxml", parser_type: str = "lxml",
scraping_strategy: ContentScrapingStrategy = None,
proxy_config: dict = None,
# SSL Parameters # SSL Parameters
fetch_ssl_certificate: bool = False, fetch_ssl_certificate: bool = False,
# Caching Parameters # Caching Parameters
cache_mode=None, cache_mode: CacheMode =None,
session_id: str = None, session_id: str = None,
bypass_cache: bool = False, bypass_cache: bool = False,
disable_cache: bool = False, disable_cache: bool = False,
no_cache_read: bool = False, no_cache_read: bool = False,
no_cache_write: bool = False, no_cache_write: bool = False,
shared_data: dict = None,
# Page Navigation and Timing Parameters # Page Navigation and Timing Parameters
wait_until: str = "domcontentloaded", wait_until: str = "domcontentloaded",
page_timeout: int = PAGE_TIMEOUT, page_timeout: int = PAGE_TIMEOUT,
@@ -350,7 +428,6 @@ class CrawlerRunConfig:
mean_delay: float = 0.1, mean_delay: float = 0.1,
max_range: float = 0.3, max_range: float = 0.3,
semaphore_count: int = 5, semaphore_count: int = 5,
# Page Interaction Parameters # Page Interaction Parameters
js_code: Union[str, List[str]] = None, js_code: Union[str, List[str]] = None,
js_only: bool = False, js_only: bool = False,
@@ -363,7 +440,6 @@ class CrawlerRunConfig:
override_navigator: bool = False, override_navigator: bool = False,
magic: bool = False, magic: bool = False,
adjust_viewport_to_content: bool = False, adjust_viewport_to_content: bool = False,
# Media Handling Parameters # Media Handling Parameters
screenshot: bool = False, screenshot: bool = False,
screenshot_wait_for: float = None, screenshot_wait_for: float = None,
@@ -372,18 +448,21 @@ class CrawlerRunConfig:
image_description_min_word_threshold: int = IMAGE_DESCRIPTION_MIN_WORD_THRESHOLD, image_description_min_word_threshold: int = IMAGE_DESCRIPTION_MIN_WORD_THRESHOLD,
image_score_threshold: int = IMAGE_SCORE_THRESHOLD, image_score_threshold: int = IMAGE_SCORE_THRESHOLD,
exclude_external_images: bool = False, exclude_external_images: bool = False,
# Link and Domain Handling Parameters # Link and Domain Handling Parameters
exclude_social_media_domains: list = None, exclude_social_media_domains: list = None,
exclude_external_links: bool = False, exclude_external_links: bool = False,
exclude_social_media_links: bool = False, exclude_social_media_links: bool = False,
exclude_domains: list = None, exclude_domains: list = None,
# Debugging and Logging Parameters # Debugging and Logging Parameters
verbose: bool = True, verbose: bool = True,
log_console: bool = False, log_console: bool = False,
# Streaming Parameters
stream: bool = False,
url: str = None, url: str = None,
check_robots_txt: bool = False,
user_agent: str = None,
user_agent_mode: str = None,
user_agent_generator_config: dict = {},
): ):
self.url = url self.url = url
@@ -391,6 +470,7 @@ class CrawlerRunConfig:
self.word_count_threshold = word_count_threshold self.word_count_threshold = word_count_threshold
self.extraction_strategy = extraction_strategy self.extraction_strategy = extraction_strategy
self.chunking_strategy = chunking_strategy self.chunking_strategy = chunking_strategy
self.deep_crawl_strategy = deep_crawl_strategy
self.markdown_generator = markdown_generator self.markdown_generator = markdown_generator
self.content_filter = content_filter self.content_filter = content_filter
self.only_text = only_text self.only_text = only_text
@@ -401,6 +481,8 @@ class CrawlerRunConfig:
self.remove_forms = remove_forms self.remove_forms = remove_forms
self.prettiify = prettiify self.prettiify = prettiify
self.parser_type = parser_type self.parser_type = parser_type
self.scraping_strategy = scraping_strategy or WebScrapingStrategy()
self.proxy_config = proxy_config
# SSL Parameters # SSL Parameters
self.fetch_ssl_certificate = fetch_ssl_certificate self.fetch_ssl_certificate = fetch_ssl_certificate
@@ -412,6 +494,7 @@ class CrawlerRunConfig:
self.disable_cache = disable_cache self.disable_cache = disable_cache
self.no_cache_read = no_cache_read self.no_cache_read = no_cache_read
self.no_cache_write = no_cache_write self.no_cache_write = no_cache_write
self.shared_data = shared_data
# Page Navigation and Timing Parameters # Page Navigation and Timing Parameters
self.wait_until = wait_until self.wait_until = wait_until
@@ -446,7 +529,9 @@ class CrawlerRunConfig:
self.exclude_external_images = exclude_external_images self.exclude_external_images = exclude_external_images
# Link and Domain Handling Parameters # Link and Domain Handling Parameters
self.exclude_social_media_domains = exclude_social_media_domains or SOCIAL_MEDIA_DOMAINS self.exclude_social_media_domains = (
exclude_social_media_domains or SOCIAL_MEDIA_DOMAINS
)
self.exclude_external_links = exclude_external_links self.exclude_external_links = exclude_external_links
self.exclude_social_media_links = exclude_social_media_links self.exclude_social_media_links = exclude_social_media_links
self.exclude_domains = exclude_domains or [] self.exclude_domains = exclude_domains or []
@@ -455,19 +540,41 @@ class CrawlerRunConfig:
self.verbose = verbose self.verbose = verbose
self.log_console = log_console self.log_console = log_console
# Streaming Parameters
self.stream = stream
# Robots.txt Handling Parameters
self.check_robots_txt = check_robots_txt
# User Agent Parameters
self.user_agent = user_agent
self.user_agent_mode = user_agent_mode
self.user_agent_generator_config = user_agent_generator_config
# Validate type of extraction strategy and chunking strategy if they are provided # Validate type of extraction strategy and chunking strategy if they are provided
if self.extraction_strategy is not None and not isinstance( if self.extraction_strategy is not None and not isinstance(
self.extraction_strategy, ExtractionStrategy self.extraction_strategy, ExtractionStrategy
): ):
raise ValueError("extraction_strategy must be an instance of ExtractionStrategy") raise ValueError(
"extraction_strategy must be an instance of ExtractionStrategy"
)
if self.deep_crawl_strategy is not None and not isinstance(
self.deep_crawl_strategy, DeepCrawlStrategy
):
raise ValueError(
"deep_crawl_strategy must be an instance of DeepCrawlStrategy"
)
if self.chunking_strategy is not None and not isinstance( if self.chunking_strategy is not None and not isinstance(
self.chunking_strategy, ChunkingStrategy self.chunking_strategy, ChunkingStrategy
): ):
raise ValueError("chunking_strategy must be an instance of ChunkingStrategy") raise ValueError(
"chunking_strategy must be an instance of ChunkingStrategy"
)
# Set default chunking strategy if None # Set default chunking strategy if None
if self.chunking_strategy is None: if self.chunking_strategy is None:
from .chunking_strategy import RegexChunking
self.chunking_strategy = RegexChunking() self.chunking_strategy = RegexChunking()
@staticmethod @staticmethod
@@ -476,7 +583,8 @@ class CrawlerRunConfig:
# Content Processing Parameters # Content Processing Parameters
word_count_threshold=kwargs.get("word_count_threshold", 200), word_count_threshold=kwargs.get("word_count_threshold", 200),
extraction_strategy=kwargs.get("extraction_strategy"), extraction_strategy=kwargs.get("extraction_strategy"),
chunking_strategy=kwargs.get("chunking_strategy"), chunking_strategy=kwargs.get("chunking_strategy", RegexChunking()),
deep_crawl_strategy=kwargs.get("deep_crawl_strategy"),
markdown_generator=kwargs.get("markdown_generator"), markdown_generator=kwargs.get("markdown_generator"),
content_filter=kwargs.get("content_filter"), content_filter=kwargs.get("content_filter"),
only_text=kwargs.get("only_text", False), only_text=kwargs.get("only_text", False),
@@ -487,10 +595,10 @@ class CrawlerRunConfig:
remove_forms=kwargs.get("remove_forms", False), remove_forms=kwargs.get("remove_forms", False),
prettiify=kwargs.get("prettiify", False), prettiify=kwargs.get("prettiify", False),
parser_type=kwargs.get("parser_type", "lxml"), parser_type=kwargs.get("parser_type", "lxml"),
scraping_strategy=kwargs.get("scraping_strategy"),
proxy_config=kwargs.get("proxy_config"),
# SSL Parameters # SSL Parameters
fetch_ssl_certificate=kwargs.get("fetch_ssl_certificate", False), fetch_ssl_certificate=kwargs.get("fetch_ssl_certificate", False),
# Caching Parameters # Caching Parameters
cache_mode=kwargs.get("cache_mode"), cache_mode=kwargs.get("cache_mode"),
session_id=kwargs.get("session_id"), session_id=kwargs.get("session_id"),
@@ -498,7 +606,7 @@ class CrawlerRunConfig:
disable_cache=kwargs.get("disable_cache", False), disable_cache=kwargs.get("disable_cache", False),
no_cache_read=kwargs.get("no_cache_read", False), no_cache_read=kwargs.get("no_cache_read", False),
no_cache_write=kwargs.get("no_cache_write", False), no_cache_write=kwargs.get("no_cache_write", False),
shared_data=kwargs.get("shared_data", None),
# Page Navigation and Timing Parameters # Page Navigation and Timing Parameters
wait_until=kwargs.get("wait_until", "domcontentloaded"), wait_until=kwargs.get("wait_until", "domcontentloaded"),
page_timeout=kwargs.get("page_timeout", 60000), page_timeout=kwargs.get("page_timeout", 60000),
@@ -508,7 +616,6 @@ class CrawlerRunConfig:
mean_delay=kwargs.get("mean_delay", 0.1), mean_delay=kwargs.get("mean_delay", 0.1),
max_range=kwargs.get("max_range", 0.3), max_range=kwargs.get("max_range", 0.3),
semaphore_count=kwargs.get("semaphore_count", 5), semaphore_count=kwargs.get("semaphore_count", 5),
# Page Interaction Parameters # Page Interaction Parameters
js_code=kwargs.get("js_code"), js_code=kwargs.get("js_code"),
js_only=kwargs.get("js_only", False), js_only=kwargs.get("js_only", False),
@@ -521,27 +628,38 @@ class CrawlerRunConfig:
override_navigator=kwargs.get("override_navigator", False), override_navigator=kwargs.get("override_navigator", False),
magic=kwargs.get("magic", False), magic=kwargs.get("magic", False),
adjust_viewport_to_content=kwargs.get("adjust_viewport_to_content", False), adjust_viewport_to_content=kwargs.get("adjust_viewport_to_content", False),
# Media Handling Parameters # Media Handling Parameters
screenshot=kwargs.get("screenshot", False), screenshot=kwargs.get("screenshot", False),
screenshot_wait_for=kwargs.get("screenshot_wait_for"), screenshot_wait_for=kwargs.get("screenshot_wait_for"),
screenshot_height_threshold=kwargs.get("screenshot_height_threshold", SCREENSHOT_HEIGHT_TRESHOLD), screenshot_height_threshold=kwargs.get(
"screenshot_height_threshold", SCREENSHOT_HEIGHT_TRESHOLD
),
pdf=kwargs.get("pdf", False), pdf=kwargs.get("pdf", False),
image_description_min_word_threshold=kwargs.get("image_description_min_word_threshold", IMAGE_DESCRIPTION_MIN_WORD_THRESHOLD), image_description_min_word_threshold=kwargs.get(
image_score_threshold=kwargs.get("image_score_threshold", IMAGE_SCORE_THRESHOLD), "image_description_min_word_threshold",
IMAGE_DESCRIPTION_MIN_WORD_THRESHOLD,
),
image_score_threshold=kwargs.get(
"image_score_threshold", IMAGE_SCORE_THRESHOLD
),
exclude_external_images=kwargs.get("exclude_external_images", False), exclude_external_images=kwargs.get("exclude_external_images", False),
# Link and Domain Handling Parameters # Link and Domain Handling Parameters
exclude_social_media_domains=kwargs.get("exclude_social_media_domains", SOCIAL_MEDIA_DOMAINS), exclude_social_media_domains=kwargs.get(
"exclude_social_media_domains", SOCIAL_MEDIA_DOMAINS
),
exclude_external_links=kwargs.get("exclude_external_links", False), exclude_external_links=kwargs.get("exclude_external_links", False),
exclude_social_media_links=kwargs.get("exclude_social_media_links", False), exclude_social_media_links=kwargs.get("exclude_social_media_links", False),
exclude_domains=kwargs.get("exclude_domains", []), exclude_domains=kwargs.get("exclude_domains", []),
# Debugging and Logging Parameters # Debugging and Logging Parameters
verbose=kwargs.get("verbose", True), verbose=kwargs.get("verbose", True),
log_console=kwargs.get("log_console", False), log_console=kwargs.get("log_console", False),
# Streaming Parameters
stream=kwargs.get("stream", False),
url=kwargs.get("url"), url=kwargs.get("url"),
check_robots_txt=kwargs.get("check_robots_txt", False),
user_agent=kwargs.get("user_agent"),
user_agent_mode=kwargs.get("user_agent_mode"),
user_agent_generator_config=kwargs.get("user_agent_generator_config", {}),
) )
# Create a function that returns a dict of the object # Create a function that returns a dict of the object
@@ -550,6 +668,7 @@ class CrawlerRunConfig:
"word_count_threshold": self.word_count_threshold, "word_count_threshold": self.word_count_threshold,
"extraction_strategy": self.extraction_strategy, "extraction_strategy": self.extraction_strategy,
"chunking_strategy": self.chunking_strategy, "chunking_strategy": self.chunking_strategy,
"deep_crawl_strategy": self.deep_crawl_strategy,
"markdown_generator": self.markdown_generator, "markdown_generator": self.markdown_generator,
"content_filter": self.content_filter, "content_filter": self.content_filter,
"only_text": self.only_text, "only_text": self.only_text,
@@ -560,6 +679,8 @@ class CrawlerRunConfig:
"remove_forms": self.remove_forms, "remove_forms": self.remove_forms,
"prettiify": self.prettiify, "prettiify": self.prettiify,
"parser_type": self.parser_type, "parser_type": self.parser_type,
"scraping_strategy": self.scraping_strategy,
"proxy_config": self.proxy_config,
"fetch_ssl_certificate": self.fetch_ssl_certificate, "fetch_ssl_certificate": self.fetch_ssl_certificate,
"cache_mode": self.cache_mode, "cache_mode": self.cache_mode,
"session_id": self.session_id, "session_id": self.session_id,
@@ -567,6 +688,7 @@ class CrawlerRunConfig:
"disable_cache": self.disable_cache, "disable_cache": self.disable_cache,
"no_cache_read": self.no_cache_read, "no_cache_read": self.no_cache_read,
"no_cache_write": self.no_cache_write, "no_cache_write": self.no_cache_write,
"shared_data": self.shared_data,
"wait_until": self.wait_until, "wait_until": self.wait_until,
"page_timeout": self.page_timeout, "page_timeout": self.page_timeout,
"wait_for": self.wait_for, "wait_for": self.wait_for,
@@ -599,5 +721,36 @@ class CrawlerRunConfig:
"exclude_domains": self.exclude_domains, "exclude_domains": self.exclude_domains,
"verbose": self.verbose, "verbose": self.verbose,
"log_console": self.log_console, "log_console": self.log_console,
"stream": self.stream,
"url": self.url, "url": self.url,
"check_robots_txt": self.check_robots_txt,
"user_agent": self.user_agent,
"user_agent_mode": self.user_agent_mode,
"user_agent_generator_config": self.user_agent_generator_config,
} }
def clone(self, **kwargs):
"""Create a copy of this configuration with updated values.
Args:
**kwargs: Key-value pairs of configuration options to update
Returns:
CrawlerRunConfig: A new instance with the specified updates
Example:
```python
# Create a new config with streaming enabled
stream_config = config.clone(stream=True)
# Create a new config with multiple updates
new_config = config.clone(
stream=True,
cache_mode=CacheMode.BYPASS,
verbose=True
)
```
"""
config_dict = self.to_dict()
config_dict.update(kwargs)
return CrawlerRunConfig.from_kwargs(config_dict)
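
The `clone` method above follows a dump, overlay, rebuild pattern: serialize with `to_dict()`, apply the overrides, then reconstruct through `from_kwargs()`. A minimal sketch of that pattern, assuming a hypothetical three-field stand-in class (`RunConfigSketch` is illustrative and not part of the library):

```python
from dataclasses import dataclass, asdict


@dataclass
class RunConfigSketch:
    """Hypothetical stand-in for CrawlerRunConfig with three illustrative fields."""

    stream: bool = False
    verbose: bool = True
    page_timeout: int = 60000

    @staticmethod
    def from_kwargs(kwargs: dict) -> "RunConfigSketch":
        # Missing keys fall back to defaults; unknown keys are ignored.
        return RunConfigSketch(
            stream=kwargs.get("stream", False),
            verbose=kwargs.get("verbose", True),
            page_timeout=kwargs.get("page_timeout", 60000),
        )

    def to_dict(self) -> dict:
        return asdict(self)

    def clone(self, **kwargs) -> "RunConfigSketch":
        # Same three steps as the diff: dump, overlay the updates, rebuild.
        config_dict = self.to_dict()
        config_dict.update(kwargs)
        return RunConfigSketch.from_kwargs(config_dict)


base = RunConfigSketch(page_timeout=30000)
streaming = base.clone(stream=True)
```

With this pattern, a field that is missing from either `to_dict()` or `from_kwargs()` is silently reset to its default on clone, which is presumably why the hunks above add each new key (`stream`, `shared_data`, `proxy_config`, and so on) to both.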


@@ -2,28 +2,28 @@ import asyncio
import base64 import base64
import time import time
from abc import ABC, abstractmethod from abc import ABC, abstractmethod
from typing import Callable, Dict, Any, List, Optional, Awaitable, Union from typing import Callable, Dict, Any, List, Optional, Union
import os, sys, shutil import os
import tempfile, subprocess import sys
from playwright.async_api import async_playwright, Page, Browser, Error, BrowserContext import shutil
import tempfile
import subprocess
from playwright.async_api import Page, Error, BrowserContext
from playwright.async_api import TimeoutError as PlaywrightTimeoutError from playwright.async_api import TimeoutError as PlaywrightTimeoutError
from io import BytesIO from io import BytesIO
from PIL import Image, ImageDraw, ImageFont from PIL import Image, ImageDraw, ImageFont
from pathlib import Path
from playwright.async_api import ProxySettings
from pydantic import BaseModel
import hashlib import hashlib
import json
import uuid import uuid
from .js_snippet import load_js_script from .js_snippet import load_js_script
from .models import AsyncCrawlResponse from .models import AsyncCrawlResponse
from .utils import get_error_context
from .user_agent_generator import UserAgentGenerator from .user_agent_generator import UserAgentGenerator
from .config import SCREENSHOT_HEIGHT_TRESHOLD, DOWNLOAD_PAGE_TIMEOUT from .config import SCREENSHOT_HEIGHT_TRESHOLD, DOWNLOAD_PAGE_TIMEOUT
from .async_configs import BrowserConfig, CrawlerRunConfig from .async_configs import BrowserConfig, CrawlerRunConfig
from .async_logger import AsyncLogger from .async_logger import AsyncLogger
from playwright_stealth import StealthConfig, stealth_async from playwright_stealth import StealthConfig
from .ssl_certificate import SSLCertificate from .ssl_certificate import SSLCertificate
from .utils import get_home_folder, get_chromium_path
from .user_agent_generator import ValidUAGenerator, OnlineUAGenerator
stealth_config = StealthConfig( stealth_config = StealthConfig(
webdriver=True, webdriver=True,
@@ -94,6 +94,7 @@ class ManagedBrowser:
temp_dir: str temp_dir: str
debugging_port: int debugging_port: int
host: str host: str
def __init__( def __init__(
self, self,
browser_type: str = "chromium", browser_type: str = "chromium",
@@ -102,6 +103,7 @@ class ManagedBrowser:
logger=None, logger=None,
host: str = "localhost", host: str = "localhost",
debugging_port: int = 9222, debugging_port: int = 9222,
cdp_url: Optional[str] = None,
): ):
""" """
Initialize the ManagedBrowser instance. Initialize the ManagedBrowser instance.
@@ -116,6 +118,7 @@ class ManagedBrowser:
logger (logging.Logger): Logger instance for logging messages. Default: None. logger (logging.Logger): Logger instance for logging messages. Default: None.
host (str): Host for debugging the browser. Default: "localhost". host (str): Host for debugging the browser. Default: "localhost".
debugging_port (int): Port for debugging the browser. Default: 9222. debugging_port (int): Port for debugging the browser. Default: 9222.
cdp_url (str or None): CDP URL to connect to the browser. Default: None.
""" """
self.browser_type = browser_type self.browser_type = browser_type
self.user_data_dir = user_data_dir self.user_data_dir = user_data_dir
@@ -126,12 +129,20 @@ class ManagedBrowser:
self.host = host self.host = host
self.logger = logger self.logger = logger
self.shutting_down = False self.shutting_down = False
self.cdp_url = cdp_url
async def start(self) -> str: async def start(self) -> str:
""" """
Starts the browser process and returns the CDP endpoint URL. Starts the browser process or returns CDP endpoint URL.
If user_data_dir is not provided, creates a temporary directory. If cdp_url is provided, returns it directly.
If user_data_dir is not provided for local browser, creates a temporary directory.
Returns:
str: CDP endpoint URL
""" """
# If CDP URL provided, just return it
if self.cdp_url:
return self.cdp_url
# Create temp dir if needed # Create temp dir if needed
if not self.user_data_dir: if not self.user_data_dir:
@@ -139,8 +150,8 @@ class ManagedBrowser:
self.user_data_dir = self.temp_dir self.user_data_dir = self.temp_dir
# Get browser path and args based on OS and browser type # Get browser path and args based on OS and browser type
browser_path = self._get_browser_path() # browser_path = self._get_browser_path()
args = self._get_browser_args() args = await self._get_browser_args()
# Start browser process # Start browser process
try: try:
@@ -201,7 +212,7 @@ class ManagedBrowser:
params={"error": str(e)}, params={"error": str(e)},
) )
def _get_browser_path(self) -> str: def _get_browser_path_WIP(self) -> str:
"""Returns the browser executable path based on OS and browser type""" """Returns the browser executable path based on OS and browser type"""
if sys.platform == "darwin": # macOS if sys.platform == "darwin": # macOS
paths = { paths = {
@@ -224,9 +235,13 @@ class ManagedBrowser:
return paths.get(self.browser_type) return paths.get(self.browser_type)
def _get_browser_args(self) -> List[str]: async def _get_browser_path(self) -> str:
browser_path = await get_chromium_path(self.browser_type)
return browser_path
async def _get_browser_args(self) -> List[str]:
"""Returns browser-specific command line arguments""" """Returns browser-specific command line arguments"""
base_args = [self._get_browser_path()] base_args = [await self._get_browser_path()]
if self.browser_type == "chromium": if self.browser_type == "chromium":
args = [ args = [
@@ -300,6 +315,7 @@ class BrowserManager:
sessions (dict): Dictionary to store session information sessions (dict): Dictionary to store session information
session_ttl (int): Session timeout in seconds session_ttl (int): Session timeout in seconds
""" """
def __init__(self, browser_config: BrowserConfig, logger=None): def __init__(self, browser_config: BrowserConfig, logger=None):
""" """
Initialize the BrowserManager with a browser configuration. Initialize the BrowserManager with a browser configuration.
@@ -321,6 +337,10 @@ class BrowserManager:
self.sessions = {} self.sessions = {}
self.session_ttl = 1800 # 30 minutes self.session_ttl = 1800 # 30 minutes
# Keep track of contexts by a "config signature," so each unique config reuses a single context
self.contexts_by_config = {}
self._contexts_lock = asyncio.Lock()
# Initialize ManagedBrowser if needed # Initialize ManagedBrowser if needed
if self.config.use_managed_browser: if self.config.use_managed_browser:
self.managed_browser = ManagedBrowser( self.managed_browser = ManagedBrowser(
@@ -456,7 +476,7 @@ class BrowserManager:
async def setup_context( async def setup_context(
self, self,
context: BrowserContext, context: BrowserContext,
crawlerRunConfig: CrawlerRunConfig, crawlerRunConfig: CrawlerRunConfig = None,
is_default=False, is_default=False,
): ):
""" """
@@ -501,9 +521,9 @@ class BrowserManager:
context.set_default_navigation_timeout(DOWNLOAD_PAGE_TIMEOUT) context.set_default_navigation_timeout(DOWNLOAD_PAGE_TIMEOUT)
if self.config.downloads_path: if self.config.downloads_path:
context._impl_obj._options["accept_downloads"] = True context._impl_obj._options["accept_downloads"] = True
context._impl_obj._options["downloads_path"] = ( context._impl_obj._options[
self.config.downloads_path "downloads_path"
) ] = self.config.downloads_path
# Handle user agent and browser hints # Handle user agent and browser hints
if self.config.user_agent: if self.config.user_agent:
@@ -516,18 +536,27 @@ class BrowserManager:
# Add default cookie # Add default cookie
await context.add_cookies( await context.add_cookies(
[{"name": "cookiesEnabled", "value": "true", "url": crawlerRunConfig.url}] [
{
"name": "cookiesEnabled",
"value": "true",
"url": crawlerRunConfig.url
if crawlerRunConfig
else "https://crawl4ai.com/",
}
]
) )
# Handle navigator overrides # Handle navigator overrides
if ( if crawlerRunConfig:
crawlerRunConfig.override_navigator if (
or crawlerRunConfig.simulate_user crawlerRunConfig.override_navigator
or crawlerRunConfig.magic or crawlerRunConfig.simulate_user
): or crawlerRunConfig.magic
await context.add_init_script(load_js_script("navigator_overrider")) ):
await context.add_init_script(load_js_script("navigator_overrider"))
async def create_browser_context(self): async def create_browser_context(self, crawlerRunConfig: CrawlerRunConfig = None):
""" """
Creates and returns a new browser context with configured settings. Creates and returns a new browser context with configured settings.
Applies text-only mode settings if text_mode is enabled in config. Applies text-only mode settings if text_mode is enabled in config.
@@ -545,20 +574,57 @@ class BrowserManager:
blocked_extensions = [ blocked_extensions = [
# Images # Images
'jpg', 'jpeg', 'png', 'gif', 'webp', 'svg', 'ico', 'bmp', 'tiff', 'psd', "jpg",
"jpeg",
"png",
"gif",
"webp",
"svg",
"ico",
"bmp",
"tiff",
"psd",
# Fonts # Fonts
'woff', 'woff2', 'ttf', 'otf', 'eot', "woff",
"woff2",
"ttf",
"otf",
"eot",
# Styles # Styles
# 'css', 'less', 'scss', 'sass', # 'css', 'less', 'scss', 'sass',
# Media # Media
'mp4', 'webm', 'ogg', 'avi', 'mov', 'wmv', 'flv', 'm4v', "mp4",
'mp3', 'wav', 'aac', 'm4a', 'opus', 'flac', "webm",
"ogg",
"avi",
"mov",
"wmv",
"flv",
"m4v",
"mp3",
"wav",
"aac",
"m4a",
"opus",
"flac",
# Documents # Documents
'pdf', 'doc', 'docx', 'xls', 'xlsx', 'ppt', 'pptx', "pdf",
"doc",
"docx",
"xls",
"xlsx",
"ppt",
"pptx",
# Archives # Archives
'zip', 'rar', '7z', 'tar', 'gz', "zip",
"rar",
"7z",
"tar",
"gz",
# Scripts and data # Scripts and data
'xml', 'swf', 'wasm' "xml",
"swf",
"wasm",
] ]
# Common context settings # Common context settings
@@ -573,6 +639,19 @@ class BrowserManager:
"java_script_enabled": self.config.java_script_enabled, "java_script_enabled": self.config.java_script_enabled,
} }
if crawlerRunConfig:
# If crawlerRunConfig.proxy_config is set, add it to the context settings
if crawlerRunConfig.proxy_config:
proxy_settings = {
"server": crawlerRunConfig.proxy_config.get("server"),
}
if crawlerRunConfig.proxy_config.get("username"):
proxy_settings.update({
"username": crawlerRunConfig.proxy_config.get("username"),
"password": crawlerRunConfig.proxy_config.get("password"),
})
context_settings["proxy"] = proxy_settings
if self.config.text_mode: if self.config.text_mode:
text_mode_settings = { text_mode_settings = {
"has_touch": False, "has_touch": False,
@@ -591,7 +670,38 @@ class BrowserManager:
await context.route(f"**/*.{ext}", lambda route: route.abort()) await context.route(f"**/*.{ext}", lambda route: route.abort())
return context return context
# async def get_page(self, session_id: Optional[str], user_agent: str): def _make_config_signature(self, crawlerRunConfig: CrawlerRunConfig) -> str:
"""
Converts the crawlerRunConfig into a dict, excludes ephemeral fields,
then returns a hash of the sorted JSON. This yields a stable signature
that identifies configurations requiring a unique browser context.
"""
import json, hashlib
config_dict = crawlerRunConfig.__dict__.copy()
# Exclude items that do not affect browser-level setup.
# Expand or adjust as needed, e.g. chunking_strategy is purely for data extraction, not for browser config.
ephemeral_keys = [
"session_id",
"js_code",
"scraping_strategy",
"extraction_strategy",
"chunking_strategy",
"cache_mode",
"content_filter",
"semaphore_count",
"url"
]
for key in ephemeral_keys:
if key in config_dict:
del config_dict[key]
# Convert to canonical JSON string
signature_json = json.dumps(config_dict, sort_keys=True, default=str)
# Hash the JSON so we get a compact, unique string
signature_hash = hashlib.sha256(signature_json.encode("utf-8")).hexdigest()
return signature_hash
async def get_page(self, crawlerRunConfig: CrawlerRunConfig): async def get_page(self, crawlerRunConfig: CrawlerRunConfig):
""" """
Get a page for the given session ID, creating a new one if needed. Get a page for the given session ID, creating a new one if needed.
@@ -600,24 +710,38 @@ class BrowserManager:
crawlerRunConfig (CrawlerRunConfig): Configuration object containing all browser settings crawlerRunConfig (CrawlerRunConfig): Configuration object containing all browser settings
Returns: Returns:
Page: The page object for the given session ID. (page, context): The Page and its BrowserContext
BrowserContext: The browser context for the given session ID.
""" """
self._cleanup_expired_sessions() self._cleanup_expired_sessions()
# If a session_id is provided and we already have it, reuse that page + context
if crawlerRunConfig.session_id and crawlerRunConfig.session_id in self.sessions: if crawlerRunConfig.session_id and crawlerRunConfig.session_id in self.sessions:
context, page, _ = self.sessions[crawlerRunConfig.session_id] context, page, _ = self.sessions[crawlerRunConfig.session_id]
# Update last-used timestamp
self.sessions[crawlerRunConfig.session_id] = (context, page, time.time()) self.sessions[crawlerRunConfig.session_id] = (context, page, time.time())
return page, context return page, context
# If using a managed browser, just grab the shared default_context
if self.config.use_managed_browser: if self.config.use_managed_browser:
context = self.default_context context = self.default_context
page = await context.new_page() page = await context.new_page()
else: else:
context = await self.create_browser_context() # Otherwise, check if we have an existing context for this config
await self.setup_context(context, crawlerRunConfig) config_signature = self._make_config_signature(crawlerRunConfig)
async with self._contexts_lock:
if config_signature in self.contexts_by_config:
context = self.contexts_by_config[config_signature]
else:
# Create and setup a new context
context = await self.create_browser_context(crawlerRunConfig)
await self.setup_context(context, crawlerRunConfig)
self.contexts_by_config[config_signature] = context
# Create a new page from the chosen context
page = await context.new_page() page = await context.new_page()
# If a session_id is specified, store this session so we can reuse later
if crawlerRunConfig.session_id: if crawlerRunConfig.session_id:
self.sessions[crawlerRunConfig.session_id] = (context, page, time.time()) self.sessions[crawlerRunConfig.session_id] = (context, page, time.time())
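
The two hunks above combine a config signature (a hash of the sorted JSON with ephemeral keys removed) with a lock-guarded dictionary so that each unique configuration reuses one browser context. A minimal sketch of that idea, assuming plain dicts in place of `CrawlerRunConfig`, a shortened ephemeral-key list, and a hypothetical `ContextCache` class whose "contexts" are placeholder objects rather than Playwright contexts:

```python
import asyncio
import hashlib
import json

# Assumed subset of the keys that do not affect browser-level setup.
EPHEMERAL_KEYS = {"session_id", "js_code", "cache_mode", "url"}


def make_signature(config: dict) -> str:
    """Hash the config with ephemeral keys removed and the remaining keys sorted."""
    stable = {k: v for k, v in config.items() if k not in EPHEMERAL_KEYS}
    payload = json.dumps(stable, sort_keys=True, default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class ContextCache:
    """Hypothetical cache holding one context object per unique signature."""

    def __init__(self):
        self._contexts = {}
        self._lock = asyncio.Lock()
        self.created = 0

    async def get(self, config: dict) -> object:
        signature = make_signature(config)
        # The lock keeps two coroutines from both creating a context
        # for the same signature.
        async with self._lock:
            if signature not in self._contexts:
                await asyncio.sleep(0)  # stands in for real async context creation
                self._contexts[signature] = object()
                self.created += 1
            return self._contexts[signature]


async def demo():
    cache = ContextCache()
    a = {"url": "https://example.com/a", "proxy_config": None, "magic": False}
    b = {"url": "https://example.com/b", "proxy_config": None, "magic": False}
    c = {"url": "https://example.com/c", "proxy_config": None, "magic": True}
    contexts = await asyncio.gather(cache.get(a), cache.get(b), cache.get(c))
    return cache.created, contexts


created, contexts = asyncio.run(demo())
```

Here `a` and `b` differ only in `url`, so they share a context, while `c` gets its own. One caveat of `default=str`: an object without a custom `__str__` or `__repr__` is serialized with its memory address, so two equal but distinct strategy objects would produce different signatures.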
@@ -657,6 +781,18 @@ class BrowserManager:
for session_id in session_ids: for session_id in session_ids:
await self.kill_session(session_id) await self.kill_session(session_id)
# Now close all contexts we created. This reclaims memory from ephemeral contexts.
for ctx in self.contexts_by_config.values():
try:
await ctx.close()
except Exception as e:
self.logger.error(
message="Error closing context: {error}",
tag="ERROR",
params={"error": str(e)}
)
self.contexts_by_config.clear()
if self.browser: if self.browser:
await self.browser.close() await self.browser.close()
self.browser = None self.browser = None
@@ -676,12 +812,12 @@ class AsyncCrawlerStrategy(ABC):
Abstract base class for crawler strategies. Abstract base class for crawler strategies.
Subclasses must implement the crawl method. Subclasses must implement the crawl method.
""" """
@abstractmethod @abstractmethod
async def crawl(self, url: str, **kwargs) -> AsyncCrawlResponse: async def crawl(self, url: str, **kwargs) -> AsyncCrawlResponse:
pass # 4 + 3 pass # 4 + 3
class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy): class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
""" """
Crawler strategy using Playwright. Crawler strategy using Playwright.
@@ -710,6 +846,7 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
Run the crawler for a single URL. Run the crawler for a single URL.
""" """
def __init__( def __init__(
self, browser_config: BrowserConfig = None, logger: AsyncLogger = None, **kwargs self, browser_config: BrowserConfig = None, logger: AsyncLogger = None, **kwargs
): ):
@@ -921,7 +1058,9 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
"or explicitly prefixed with 'js:' or 'css:'." "or explicitly prefixed with 'js:' or 'css:'."
) )
async def csp_compliant_wait( self, page: Page, user_wait_function: str, timeout: float = 30000 ): async def csp_compliant_wait(
self, page: Page, user_wait_function: str, timeout: float = 30000
):
""" """
Wait for a condition in a CSP-compliant way. Wait for a condition in a CSP-compliant way.
@@ -1049,7 +1188,9 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
page, context = await self.browser_manager.get_page(session_id, user_agent) page, context = await self.browser_manager.get_page(session_id, user_agent)
return session_id return session_id
async def crawl( self, url: str, config: CrawlerRunConfig, **kwargs ) -> AsyncCrawlResponse: async def crawl(
self, url: str, config: CrawlerRunConfig, **kwargs
) -> AsyncCrawlResponse:
""" """
Crawls a given URL or processes raw HTML/local file content based on the URL prefix. Crawls a given URL or processes raw HTML/local file content based on the URL prefix.
@@ -1108,7 +1249,9 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
"URL must start with 'http://', 'https://', 'file://', or 'raw:'" "URL must start with 'http://', 'https://', 'file://', or 'raw:'"
) )
async def _crawl_web( self, url: str, config: CrawlerRunConfig ) -> AsyncCrawlResponse: async def _crawl_web(
self, url: str, config: CrawlerRunConfig
) -> AsyncCrawlResponse:
""" """
Internal method to crawl web URLs with the specified configuration. Internal method to crawl web URLs with the specified configuration.
@@ -1122,15 +1265,18 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
config.url = url config.url = url
response_headers = {} response_headers = {}
status_code = None status_code = None
redirected_url = url
# Reset downloaded files list for new crawl # Reset downloaded files list for new crawl
self._downloaded_files = [] self._downloaded_files = []
# Handle user agent with magic mode # Handle user agent with magic mode
user_agent = self.browser_config.user_agent user_agent_to_override = config.user_agent
if config.magic and self.browser_config.user_agent_mode != "random": if user_agent_to_override:
self.browser_config.user_agent = UserAgentGenerator().generate( self.browser_config.user_agent = user_agent_to_override
**(self.browser_config.user_agent_generator_config or {}) elif config.magic or config.user_agent_mode == "random":
self.browser_config.user_agent = ValidUAGenerator().generate(
**(config.user_agent_generator_config or {})
) )
# Get page for session # Get page for session
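
The new user-agent logic in this hunk sets a precedence: an explicit `config.user_agent` wins, otherwise a generated one is used when `magic` is set or `user_agent_mode` is `"random"`, otherwise the browser-level value stays. A minimal sketch of that ordering, assuming a hypothetical `resolve_user_agent` helper and a small hard-coded pool in place of `ValidUAGenerator`:

```python
import random
from typing import Optional

# Hypothetical pool; the diff calls ValidUAGenerator().generate(...) instead.
SAMPLE_USER_AGENTS = [
    "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) ExampleBrowser/2.0",
]


def resolve_user_agent(
    current: Optional[str],
    explicit: Optional[str],
    magic: bool,
    mode: Optional[str],
) -> Optional[str]:
    """Return the user agent to use for one crawl."""
    if explicit:
        # 1. A per-run override always wins.
        return explicit
    if magic or mode == "random":
        # 2. Otherwise generate one when magic or random mode is requested.
        return random.choice(SAMPLE_USER_AGENTS)
    # 3. Otherwise keep the browser-level value unchanged.
    return current


chosen = resolve_user_agent("base-agent", "custom-agent", magic=True, mode="random")
```

Unlike the diff, which assigns the result to `self.browser_config.user_agent`, this sketch returns the value without mutating shared state.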
@@ -1146,7 +1292,7 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
await context.add_init_script(load_js_script("navigator_overrider")) await context.add_init_script(load_js_script("navigator_overrider"))
# Call hook after page creation # Call hook after page creation
await self.execute_hook("on_page_context_created", page, context=context) await self.execute_hook("on_page_context_created", page, context=context, config=config)
# Set up console logging if requested # Set up console logging if requested
if config.log_console: if config.log_console:
@@ -1187,24 +1333,29 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
# Handle page navigation and content loading # Handle page navigation and content loading
if not config.js_only: if not config.js_only:
await self.execute_hook("before_goto", page, context=context, url=url) await self.execute_hook("before_goto", page, context=context, url=url, config=config)
try: try:
# Generate a unique nonce for this request # Generate a unique nonce for this request
nonce = hashlib.sha256(os.urandom(32)).hexdigest() nonce = hashlib.sha256(os.urandom(32)).hexdigest()
# Add CSP headers to the request # Add CSP headers to the request
await page.set_extra_http_headers({ await page.set_extra_http_headers(
'Content-Security-Policy': f"default-src 'self'; script-src 'self' 'nonce-{nonce}' 'strict-dynamic'" {
}) "Content-Security-Policy": f"default-src 'self'; script-src 'self' 'nonce-{nonce}' 'strict-dynamic'"
}
)
response = await page.goto( response = await page.goto(
url, wait_until=config.wait_until, timeout=config.page_timeout url, wait_until=config.wait_until, timeout=config.page_timeout
) )
redirected_url = page.url
except Error as e: except Error as e:
raise RuntimeError(f"Failed on navigating ACS-GOTO:\n{str(e)}") raise RuntimeError(f"Failed on navigating ACS-GOTO:\n{str(e)}")
await self.execute_hook("after_goto", page, context=context, url=url, response=response) await self.execute_hook(
"after_goto", page, context=context, url=url, response=response, config=config
)
if response is None: if response is None:
status_code = 200 status_code = 200
@@ -1233,14 +1384,14 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
style.opacity !== '0'; style.opacity !== '0';
return isVisible; return isVisible;
}""", }""",
timeout=30000 timeout=30000,
) )
if not is_visible and not config.ignore_body_visibility: if not is_visible and not config.ignore_body_visibility:
visibility_info = await self.check_visibility(page) visibility_info = await self.check_visibility(page)
raise Error(f"Body element is hidden: {visibility_info}") raise Error(f"Body element is hidden: {visibility_info}")
except Error as e: except Error:
visibility_info = await self.check_visibility(page) visibility_info = await self.check_visibility(page)
if self.config.verbose: if self.config.verbose:
@@ -1253,7 +1404,6 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
if not config.ignore_body_visibility: if not config.ignore_body_visibility:
raise Error(f"Body element is hidden: {visibility_info}") raise Error(f"Body element is hidden: {visibility_info}")
# try: # try:
# await page.wait_for_selector("body", state="attached", timeout=30000) # await page.wait_for_selector("body", state="attached", timeout=30000)
@@ -1307,7 +1457,7 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
images_loaded = await self.csp_compliant_wait( images_loaded = await self.csp_compliant_wait(
page, page,
"() => Array.from(document.getElementsByTagName('img')).every(img => img.complete)", "() => Array.from(document.getElementsByTagName('img')).every(img => img.complete)",
timeout=1000 timeout=1000,
) )
if not images_loaded and self.logger: if not images_loaded and self.logger:
@@ -1320,8 +1470,8 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
if not self.browser_config.text_mode and config.adjust_viewport_to_content: if not self.browser_config.text_mode and config.adjust_viewport_to_content:
try: try:
dimensions = await self.get_page_dimensions(page) dimensions = await self.get_page_dimensions(page)
page_height = dimensions['height'] page_height = dimensions["height"]
page_width = dimensions['width'] page_width = dimensions["width"]
# page_width = await page.evaluate( # page_width = await page.evaluate(
# "document.documentElement.scrollWidth" # "document.documentElement.scrollWidth"
# ) # )
@@ -1368,15 +1518,17 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
if config.js_code: if config.js_code:
# execution_result = await self.execute_user_script(page, config.js_code) # execution_result = await self.execute_user_script(page, config.js_code)
execution_result = await self.robust_execute_user_script(page, config.js_code) execution_result = await self.robust_execute_user_script(
page, config.js_code
)
if not execution_result["success"]: if not execution_result["success"]:
self.logger.warning( self.logger.warning(
message="User script execution had issues: {error}", message="User script execution had issues: {error}",
tag="JS_EXEC", tag="JS_EXEC",
params={"error": execution_result.get("error")} params={"error": execution_result.get("error")},
) )
await self.execute_hook("on_execution_started", page, context=context) await self.execute_hook("on_execution_started", page, context=context, config=config)
# Handle user simulation # Handle user simulation
if config.simulate_user or config.magic: if config.simulate_user or config.magic:
@@ -1386,6 +1538,10 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
await page.keyboard.press("ArrowDown") await page.keyboard.press("ArrowDown")
# Handle wait_for condition # Handle wait_for condition
# Todo: Decide how to handle this
if not config.wait_for and config.css_selector and False:
config.wait_for = f"css:{config.css_selector}"
if config.wait_for: if config.wait_for:
try: try:
await self.smart_wait( await self.smart_wait(
@@ -1415,7 +1571,7 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
page = await self.process_iframes(page) page = await self.process_iframes(page)
# Pre-content retrieval hooks and delay # Pre-content retrieval hooks and delay
await self.execute_hook("before_retrieve_html", page, context=context) await self.execute_hook("before_retrieve_html", page, context=context, config=config)
if config.delay_before_return_html: if config.delay_before_return_html:
await asyncio.sleep(config.delay_before_return_html) await asyncio.sleep(config.delay_before_return_html)
@@ -1425,7 +1581,9 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
# Get final HTML content # Get final HTML content
html = await page.content() html = await page.content()
await self.execute_hook("before_return_html", page = page, html = html, context=context) await self.execute_hook(
"before_return_html", page=page, html=html, context=context, config=config
)
# Handle PDF and screenshot generation # Handle PDF and screenshot generation
start_export_time = time.perf_counter() start_export_time = time.perf_counter()
@@ -1471,6 +1629,7 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
downloaded_files=( downloaded_files=(
self._downloaded_files if self._downloaded_files else None self._downloaded_files if self._downloaded_files else None
), ),
redirected_url=redirected_url,
) )
except Exception as e: except Exception as e:
@@ -1511,7 +1670,7 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
# total_height = await page.evaluate("document.documentElement.scrollHeight") # total_height = await page.evaluate("document.documentElement.scrollHeight")
dimensions = await self.get_page_dimensions(page) dimensions = await self.get_page_dimensions(page)
total_height = dimensions['height'] total_height = dimensions["height"]
while current_position < total_height: while current_position < total_height:
current_position = min(current_position + viewport_height, total_height) current_position = min(current_position + viewport_height, total_height)
@@ -1521,7 +1680,7 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
# new_height = await page.evaluate("document.documentElement.scrollHeight") # new_height = await page.evaluate("document.documentElement.scrollHeight")
dimensions = await self.get_page_dimensions(page) dimensions = await self.get_page_dimensions(page)
new_height = dimensions['height'] new_height = dimensions["height"]
if new_height > total_height: if new_height > total_height:
total_height = new_height total_height = new_height
@@ -1598,7 +1757,8 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
remove_overlays_js = load_js_script("remove_overlay_elements") remove_overlays_js = load_js_script("remove_overlay_elements")
try: try:
await page.evaluate(f""" await page.evaluate(
f"""
(() => {{ (() => {{
try {{ try {{
{remove_overlays_js} {remove_overlays_js}
@@ -1611,7 +1771,8 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
}}; }};
}} }}
}})() }})()
""") """
)
await page.wait_for_timeout(500) # Wait for any animations to complete await page.wait_for_timeout(500) # Wait for any animations to complete
except Exception as e: except Exception as e:
self.logger.warning( self.logger.warning(
@@ -1707,8 +1868,8 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
try: try:
# Get page height # Get page height
dimensions = await self.get_page_dimensions(page) dimensions = await self.get_page_dimensions(page)
page_width = dimensions['width'] page_width = dimensions["width"]
page_height = dimensions['height'] page_height = dimensions["height"]
# page_height = await page.evaluate("document.documentElement.scrollHeight") # page_height = await page.evaluate("document.documentElement.scrollHeight")
# page_width = await page.evaluate("document.documentElement.scrollWidth") # page_width = await page.evaluate("document.documentElement.scrollWidth")
@@ -1826,7 +1987,9 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
tag="WARNING", tag="WARNING",
) )
async def robust_execute_user_script(self, page: Page, js_code: Union[str, List[str]]) -> Dict[str, Any]: async def robust_execute_user_script(
self, page: Page, js_code: Union[str, List[str]]
) -> Dict[str, Any]:
""" """
Executes user-provided JavaScript code with proper error handling and context, Executes user-provided JavaScript code with proper error handling and context,
supporting both synchronous and async user code, plus navigations. supporting both synchronous and async user code, plus navigations.
@@ -1846,7 +2009,7 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
Dict[str, Any]: The results of the execution Dict[str, Any]: The results of the execution
""" """
try: try:
await page.wait_for_load_state('domcontentloaded') await page.wait_for_load_state("domcontentloaded")
if isinstance(js_code, str): if isinstance(js_code, str):
scripts = [js_code] scripts = [js_code]
@@ -1861,7 +2024,8 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
# then wait for the new page to load before continuing # then wait for the new page to load before continuing
result = None result = None
try: try:
result = await page.evaluate(f""" result = await page.evaluate(
f"""
(async () => {{ (async () => {{
try {{ try {{
{script} {script}
@@ -1870,51 +2034,56 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
return {{ success: false, error: err.toString(), stack: err.stack }}; return {{ success: false, error: err.toString(), stack: err.stack }};
}} }}
}})(); }})();
""") """
)
except Error as e: except Error as e:
# If it's due to navigation destroying the context, handle gracefully # If it's due to navigation destroying the context, handle gracefully
if "Execution context was destroyed" in str(e): if "Execution context was destroyed" in str(e):
self.logger.info("Navigation triggered by script, waiting for load state", tag="JS_EXEC") self.logger.info(
"Navigation triggered by script, waiting for load state",
tag="JS_EXEC",
)
try: try:
await page.wait_for_load_state('load', timeout=30000) await page.wait_for_load_state("load", timeout=30000)
except Error as nav_err: except Error as nav_err:
self.logger.warning( self.logger.warning(
message="Navigation wait failed: {error}", message="Navigation wait failed: {error}",
tag="JS_EXEC", tag="JS_EXEC",
params={"error": str(nav_err)} params={"error": str(nav_err)},
) )
try: try:
await page.wait_for_load_state('networkidle', timeout=30000) await page.wait_for_load_state(
"networkidle", timeout=30000
)
except Error as nav_err: except Error as nav_err:
self.logger.warning( self.logger.warning(
message="Network idle wait failed: {error}", message="Network idle wait failed: {error}",
tag="JS_EXEC", tag="JS_EXEC",
params={"error": str(nav_err)} params={"error": str(nav_err)},
) )
# Return partial success, or adapt as you see fit # Return partial success, or adapt as you see fit
result = { result = {
"success": True, "success": True,
"info": "Navigation triggered, ignoring context destroyed error" "info": "Navigation triggered, ignoring context destroyed error",
} }
else: else:
# It's some other error, log and continue # It's some other error, log and continue
self.logger.error( self.logger.error(
message="Playwright execution error: {error}", message="Playwright execution error: {error}",
tag="JS_EXEC", tag="JS_EXEC",
params={"error": str(e)} params={"error": str(e)},
) )
result = {"success": False, "error": str(e)} result = {"success": False, "error": str(e)}
# If we made it this far with no repeated error, do post-load waits # If we made it this far with no repeated error, do post-load waits
t1 = time.time() t1 = time.time()
try: try:
await page.wait_for_load_state('domcontentloaded', timeout=5000) await page.wait_for_load_state("domcontentloaded", timeout=5000)
print("DOM content loaded after script execution in", time.time() - t1)
except Error as e: except Error as e:
self.logger.warning( self.logger.warning(
message="DOM content load timeout: {error}", message="DOM content load timeout: {error}",
tag="JS_EXEC", tag="JS_EXEC",
params={"error": str(e)} params={"error": str(e)},
) )
# t1 = time.time() # t1 = time.time()
@@ -1935,7 +2104,7 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
self.logger.error( self.logger.error(
message="Script chunk failed: {error}", message="Script chunk failed: {error}",
tag="JS_EXEC", tag="JS_EXEC",
params={"error": str(e)} params={"error": str(e)},
) )
results.append({"success": False, "error": str(e)}) results.append({"success": False, "error": str(e)})
@@ -1945,11 +2114,13 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
self.logger.error( self.logger.error(
message="Script execution failed: {error}", message="Script execution failed: {error}",
tag="JS_EXEC", tag="JS_EXEC",
params={"error": str(e)} params={"error": str(e)},
) )
return {"success": False, "error": str(e)} return {"success": False, "error": str(e)}
async def execute_user_script(self, page: Page, js_code: Union[str, List[str]]) -> Dict[str, Any]: async def execute_user_script(
self, page: Page, js_code: Union[str, List[str]]
) -> Dict[str, Any]:
""" """
Executes user-provided JavaScript code with proper error handling and context. Executes user-provided JavaScript code with proper error handling and context.
@@ -1962,7 +2133,7 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
""" """
try: try:
# Ensure the page is ready for script execution # Ensure the page is ready for script execution
await page.wait_for_load_state('domcontentloaded') await page.wait_for_load_state("domcontentloaded")
# Handle single script or multiple scripts # Handle single script or multiple scripts
if isinstance(js_code, str): if isinstance(js_code, str):
@@ -1974,7 +2145,8 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
for script in scripts: for script in scripts:
try: try:
# Execute the script and wait for network idle # Execute the script and wait for network idle
result = await page.evaluate(f""" result = await page.evaluate(
f"""
(() => {{ (() => {{
return new Promise((resolve) => {{ return new Promise((resolve) => {{
try {{ try {{
@@ -2007,16 +2179,16 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
}} }}
}}); }});
}})() }})()
""") """
)
# Wait for network idle after script execution # Wait for network idle after script execution
t1 = time.time() t1 = time.time()
await page.wait_for_load_state('domcontentloaded', timeout=5000) await page.wait_for_load_state("domcontentloaded", timeout=5000)
print("DOM content loaded after script execution in", time.time() - t1)
t1 = time.time() t1 = time.time()
await page.wait_for_load_state('networkidle', timeout=5000) await page.wait_for_load_state("networkidle", timeout=5000)
print("Network idle after script execution in", time.time() - t1)
results.append(result if result else {"success": True}) results.append(result if result else {"success": True})
@@ -2025,7 +2197,7 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
self.logger.error( self.logger.error(
message="Playwright execution error: {error}", message="Playwright execution error: {error}",
tag="JS_EXEC", tag="JS_EXEC",
params={"error": str(e)} params={"error": str(e)},
) )
results.append({"success": False, "error": str(e)}) results.append({"success": False, "error": str(e)})
@@ -2035,7 +2207,7 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
self.logger.error( self.logger.error(
message="Script execution failed: {error}", message="Script execution failed: {error}",
tag="JS_EXEC", tag="JS_EXEC",
params={"error": str(e)} params={"error": str(e)},
) )
return {"success": False, "error": str(e)} return {"success": False, "error": str(e)}
@@ -2043,7 +2215,7 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
self.logger.error( self.logger.error(
message="Script execution failed: {error}", message="Script execution failed: {error}",
tag="JS_EXEC", tag="JS_EXEC",
params={"error": str(e)} params={"error": str(e)},
) )
return {"success": False, "error": str(e)} return {"success": False, "error": str(e)}
@@ -2057,7 +2229,8 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
Returns: Returns:
Boolean indicating visibility Boolean indicating visibility
""" """
return await page.evaluate(""" return await page.evaluate(
"""
() => { () => {
const element = document.body; const element = document.body;
if (!element) return false; if (!element) return false;
@@ -2067,7 +2240,8 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
style.opacity !== '0'; style.opacity !== '0';
return isVisible; return isVisible;
} }
""") """
)
async def safe_scroll(self, page: Page, x: int, y: int, delay: float = 0.1): async def safe_scroll(self, page: Page, x: int, y: int, delay: float = 0.1):
""" """
@@ -2079,7 +2253,7 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
y: Vertical scroll position y: Vertical scroll position
""" """
result = await self.csp_scroll_to(page, x, y) result = await self.csp_scroll_to(page, x, y)
if result['success']: if result["success"]:
await page.wait_for_timeout(delay * 1000) await page.wait_for_timeout(delay * 1000)
return result return result
@@ -2126,11 +2300,11 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
}}""" }}"""
) )
if not result['success']: if not result["success"]:
self.logger.warning( self.logger.warning(
message="Scroll operation failed: {error}", message="Scroll operation failed: {error}",
tag="SCROLL", tag="SCROLL",
params={"error": result.get('error')} params={"error": result.get("error")},
) )
return result return result
@@ -2139,12 +2313,9 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
self.logger.error( self.logger.error(
message="Failed to execute scroll: {error}", message="Failed to execute scroll: {error}",
tag="SCROLL", tag="SCROLL",
params={"error": str(e)} params={"error": str(e)},
) )
return { return {"success": False, "error": str(e)}
"success": False,
"error": str(e)
}
async def get_page_dimensions(self, page: Page): async def get_page_dimensions(self, page: Page):
""" """
@@ -2156,12 +2327,14 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
Returns: Returns:
Dict containing width and height of the page Dict containing width and height of the page
""" """
return await page.evaluate(""" return await page.evaluate(
"""
() => { () => {
const {scrollWidth, scrollHeight} = document.documentElement; const {scrollWidth, scrollHeight} = document.documentElement;
return {width: scrollWidth, height: scrollHeight}; return {width: scrollWidth, height: scrollHeight};
} }
""") """
)
async def page_need_scroll(self, page: Page) -> bool: async def page_need_scroll(self, page: Page) -> bool:
""" """
@@ -2174,18 +2347,20 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
bool: True if page needs scrolling bool: True if page needs scrolling
""" """
try: try:
need_scroll = await page.evaluate(""" need_scroll = await page.evaluate(
"""
() => { () => {
const scrollHeight = document.documentElement.scrollHeight; const scrollHeight = document.documentElement.scrollHeight;
const viewportHeight = window.innerHeight; const viewportHeight = window.innerHeight;
return scrollHeight > viewportHeight; return scrollHeight > viewportHeight;
} }
""") """
)
return need_scroll return need_scroll
except Exception as e: except Exception as e:
self.logger.warning( self.logger.warning(
message="Failed to check scroll need: {error}. Defaulting to True for safety.", message="Failed to check scroll need: {error}. Defaulting to True for safety.",
tag="SCROLL", tag="SCROLL",
params={"error": str(e)} params={"error": str(e)},
) )
return True # Default to scrolling if check fails return True # Default to scrolling if check fails
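Note on the script wrapper used in the hunks above: the user script is interpolated into an f-string, so every literal JavaScript brace has to be doubled. The following is a minimal sketch of that wrapping step only, not the crawler's actual implementation. The helper name `wrap_user_script` is hypothetical, and the `return {{ success: true }}` and `catch (err)` lines are assumptions filling the part of the template that the diff elides.

```python
def wrap_user_script(script: str) -> str:
    """Wrap a user script in an async IIFE that reports success or failure.

    Hypothetical helper for illustration; the success branch and the
    catch clause are assumed, since the diff elides those lines.
    """
    # Inside an f-string, "{{" and "}}" produce literal braces;
    # only {script} is interpolated.
    return f"""
(async () => {{
    try {{
        {script}
        return {{ success: true }};
    }} catch (err) {{
        return {{ success: false, error: err.toString(), stack: err.stack }};
    }}
}})();
"""


wrapped = wrap_user_script("document.title = 'x';")
print(wrapped)
```

The resulting string is what would be handed to `page.evaluate`; because the wrapper catches errors in the page, a failing script comes back as a result dict rather than a Python exception.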


@@ -1,27 +1,30 @@
import os, sys import os
from pathlib import Path from pathlib import Path
import aiosqlite import aiosqlite
import asyncio import asyncio
from typing import Optional, Tuple, Dict from typing import Optional, Dict
from contextlib import asynccontextmanager from contextlib import asynccontextmanager
import logging import logging
import json # Added for serialization/deserialization import json # Added for serialization/deserialization
from .utils import ensure_content_dirs, generate_content_hash from .utils import ensure_content_dirs, generate_content_hash
from .models import CrawlResult, MarkdownGenerationResult from .models import CrawlResult, MarkdownGenerationResult
import xxhash
import aiofiles import aiofiles
from .config import NEED_MIGRATION
from .version_manager import VersionManager from .version_manager import VersionManager
from .async_logger import AsyncLogger from .async_logger import AsyncLogger
from .utils import get_error_context, create_box_message from .utils import get_error_context, create_box_message
# Set up logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
base_directory = DB_PATH = os.path.join(os.getenv("CRAWL4_AI_BASE_DIRECTORY", Path.home()), ".crawl4ai") # Set up logging
# logging.basicConfig(level=logging.INFO)
# logger = logging.getLogger(__name__)
# logger.setLevel(logging.INFO)
base_directory = DB_PATH = os.path.join(
os.getenv("CRAWL4_AI_BASE_DIRECTORY", Path.home()), ".crawl4ai"
)
os.makedirs(DB_PATH, exist_ok=True) os.makedirs(DB_PATH, exist_ok=True)
DB_PATH = os.path.join(base_directory, "crawl4ai.db") DB_PATH = os.path.join(base_directory, "crawl4ai.db")
class AsyncDatabaseManager: class AsyncDatabaseManager:
def __init__(self, pool_size: int = 10, max_retries: int = 3): def __init__(self, pool_size: int = 10, max_retries: int = 3):
self.db_path = DB_PATH self.db_path = DB_PATH
@@ -37,10 +40,9 @@ class AsyncDatabaseManager:
self.logger = AsyncLogger( self.logger = AsyncLogger(
log_file=os.path.join(base_directory, ".crawl4ai", "crawler_db.log"), log_file=os.path.join(base_directory, ".crawl4ai", "crawler_db.log"),
verbose=False, verbose=False,
tag_width=10 tag_width=10,
) )
async def initialize(self): async def initialize(self):
"""Initialize the database and connection pool""" """Initialize the database and connection pool"""
try: try:
@@ -67,28 +69,32 @@ class AsyncDatabaseManager:
if needs_update: if needs_update:
self.logger.info("New version detected, running updates", tag="INIT") self.logger.info("New version detected, running updates", tag="INIT")
await self.update_db_schema() await self.update_db_schema()
from .migrations import run_migration # Import here to avoid circular imports from .migrations import (
run_migration,
) # Import here to avoid circular imports
await run_migration() await run_migration()
self.version_manager.update_version() # Update stored version after successful migration self.version_manager.update_version() # Update stored version after successful migration
self.logger.success("Version update completed successfully", tag="COMPLETE") self.logger.success(
"Version update completed successfully", tag="COMPLETE"
)
else: else:
self.logger.success("Database initialization completed successfully", tag="COMPLETE") self.logger.success(
"Database initialization completed successfully", tag="COMPLETE"
)
except Exception as e: except Exception as e:
self.logger.error( self.logger.error(
message="Database initialization error: {error}", message="Database initialization error: {error}",
tag="ERROR", tag="ERROR",
params={"error": str(e)} params={"error": str(e)},
) )
self.logger.info( self.logger.info(
message="Database will be initialized on first use", message="Database will be initialized on first use", tag="INIT"
tag="INIT"
) )
raise raise
async def cleanup(self): async def cleanup(self):
"""Cleanup connections when shutting down""" """Cleanup connections when shutting down"""
async with self.pool_lock: async with self.pool_lock:
@@ -107,6 +113,7 @@ class AsyncDatabaseManager:
self._initialized = True self._initialized = True
except Exception as e: except Exception as e:
import sys import sys
error_context = get_error_context(sys.exc_info()) error_context = get_error_context(sys.exc_info())
self.logger.error( self.logger.error(
message="Database initialization failed:\n{error}\n\nContext:\n{context}\n\nTraceback:\n{traceback}", message="Database initialization failed:\n{error}\n\nContext:\n{context}\n\nTraceback:\n{traceback}",
@@ -115,8 +122,8 @@ class AsyncDatabaseManager:
params={ params={
"error": str(e), "error": str(e),
"context": error_context["code_context"], "context": error_context["code_context"],
"traceback": error_context["full_traceback"] "traceback": error_context["full_traceback"],
} },
) )
raise raise
@@ -127,29 +134,40 @@ class AsyncDatabaseManager:
async with self.pool_lock: async with self.pool_lock:
if task_id not in self.connection_pool: if task_id not in self.connection_pool:
try: try:
conn = await aiosqlite.connect( conn = await aiosqlite.connect(self.db_path, timeout=30.0)
self.db_path, await conn.execute("PRAGMA journal_mode = WAL")
timeout=30.0 await conn.execute("PRAGMA busy_timeout = 5000")
)
await conn.execute('PRAGMA journal_mode = WAL')
await conn.execute('PRAGMA busy_timeout = 5000')
# Verify database structure # Verify database structure
async with conn.execute("PRAGMA table_info(crawled_data)") as cursor: async with conn.execute(
"PRAGMA table_info(crawled_data)"
) as cursor:
columns = await cursor.fetchall() columns = await cursor.fetchall()
column_names = [col[1] for col in columns] column_names = [col[1] for col in columns]
expected_columns = { expected_columns = {
'url', 'html', 'cleaned_html', 'markdown', 'extracted_content', "url",
'success', 'media', 'links', 'metadata', 'screenshot', "html",
'response_headers', 'downloaded_files' "cleaned_html",
"markdown",
"extracted_content",
"success",
"media",
"links",
"metadata",
"screenshot",
"response_headers",
"downloaded_files",
} }
missing_columns = expected_columns - set(column_names) missing_columns = expected_columns - set(column_names)
if missing_columns: if missing_columns:
raise ValueError(f"Database missing columns: {missing_columns}") raise ValueError(
f"Database missing columns: {missing_columns}"
)
self.connection_pool[task_id] = conn self.connection_pool[task_id] = conn
except Exception as e: except Exception as e:
import sys import sys
error_context = get_error_context(sys.exc_info()) error_context = get_error_context(sys.exc_info())
error_message = ( error_message = (
f"Unexpected error in db get_connection at line {error_context['line_no']} " f"Unexpected error in db get_connection at line {error_context['line_no']} "
@@ -158,7 +176,7 @@ class AsyncDatabaseManager:
f"Code context:\n{error_context['code_context']}" f"Code context:\n{error_context['code_context']}"
) )
self.logger.error( self.logger.error(
message=create_box_message(error_message, type= "error"), message=create_box_message(error_message, type="error"),
) )
raise raise
@@ -167,6 +185,7 @@ class AsyncDatabaseManager:
except Exception as e: except Exception as e:
import sys import sys
error_context = get_error_context(sys.exc_info()) error_context = get_error_context(sys.exc_info())
error_message = ( error_message = (
f"Unexpected error in db get_connection at line {error_context['line_no']} " f"Unexpected error in db get_connection at line {error_context['line_no']} "
@@ -175,7 +194,7 @@ class AsyncDatabaseManager:
f"Code context:\n{error_context['code_context']}" f"Code context:\n{error_context['code_context']}"
) )
self.logger.error( self.logger.error(
message=create_box_message(error_message, type= "error"), message=create_box_message(error_message, type="error"),
) )
raise raise
finally: finally:
@@ -185,7 +204,6 @@ class AsyncDatabaseManager:
del self.connection_pool[task_id] del self.connection_pool[task_id]
self.connection_semaphore.release() self.connection_semaphore.release()
async def execute_with_retry(self, operation, *args): async def execute_with_retry(self, operation, *args):
"""Execute database operations with retry logic""" """Execute database operations with retry logic"""
for attempt in range(self.max_retries): for attempt in range(self.max_retries):
@@ -200,10 +218,7 @@ class AsyncDatabaseManager:
message="Operation failed after {retries} attempts: {error}", message="Operation failed after {retries} attempts: {error}",
tag="ERROR", tag="ERROR",
force_verbose=True, force_verbose=True,
params={ params={"retries": self.max_retries, "error": str(e)},
"retries": self.max_retries,
"error": str(e)
}
) )
raise raise
await asyncio.sleep(1 * (attempt + 1)) # Linear backoff await asyncio.sleep(1 * (attempt + 1)) # Linear backoff
@@ -211,7 +226,8 @@ class AsyncDatabaseManager:
async def ainit_db(self): async def ainit_db(self):
"""Initialize database schema""" """Initialize database schema"""
async with aiosqlite.connect(self.db_path, timeout=30.0) as db: async with aiosqlite.connect(self.db_path, timeout=30.0) as db:
await db.execute(''' await db.execute(
"""
CREATE TABLE IF NOT EXISTS crawled_data ( CREATE TABLE IF NOT EXISTS crawled_data (
url TEXT PRIMARY KEY, url TEXT PRIMARY KEY,
html TEXT, html TEXT,
@@ -226,11 +242,10 @@ class AsyncDatabaseManager:
response_headers TEXT DEFAULT "{}", response_headers TEXT DEFAULT "{}",
downloaded_files TEXT DEFAULT "{}" -- New column added downloaded_files TEXT DEFAULT "{}" -- New column added
) )
''') """
)
await db.commit() await db.commit()
async def update_db_schema(self): async def update_db_schema(self):
"""Update database schema if needed""" """Update database schema if needed"""
async with aiosqlite.connect(self.db_path, timeout=30.0) as db: async with aiosqlite.connect(self.db_path, timeout=30.0) as db:
@@ -239,7 +254,14 @@ class AsyncDatabaseManager:
column_names = [column[1] for column in columns] column_names = [column[1] for column in columns]
# List of new columns to add # List of new columns to add
new_columns = ['media', 'links', 'metadata', 'screenshot', 'response_headers', 'downloaded_files'] new_columns = [
"media",
"links",
"metadata",
"screenshot",
"response_headers",
"downloaded_files",
]
for column in new_columns: for column in new_columns:
if column not in column_names: if column not in column_names:
@@ -248,22 +270,26 @@ class AsyncDatabaseManager:
async def aalter_db_add_column(self, new_column: str, db): async def aalter_db_add_column(self, new_column: str, db):
"""Add new column to the database""" """Add new column to the database"""
if new_column == 'response_headers': if new_column == "response_headers":
await db.execute(f'ALTER TABLE crawled_data ADD COLUMN {new_column} TEXT DEFAULT "{{}}"') await db.execute(
f'ALTER TABLE crawled_data ADD COLUMN {new_column} TEXT DEFAULT "{{}}"'
)
else: else:
await db.execute(f'ALTER TABLE crawled_data ADD COLUMN {new_column} TEXT DEFAULT ""') await db.execute(
f'ALTER TABLE crawled_data ADD COLUMN {new_column} TEXT DEFAULT ""'
)
self.logger.info( self.logger.info(
message="Added column '{column}' to the database", message="Added column '{column}' to the database",
tag="INIT", tag="INIT",
params={"column": new_column} params={"column": new_column},
) )
async def aget_cached_url(self, url: str) -> Optional[CrawlResult]: async def aget_cached_url(self, url: str) -> Optional[CrawlResult]:
"""Retrieve cached URL data as CrawlResult""" """Retrieve cached URL data as CrawlResult"""
async def _get(db): async def _get(db):
async with db.execute( async with db.execute(
'SELECT * FROM crawled_data WHERE url = ?', (url,) "SELECT * FROM crawled_data WHERE url = ?", (url,)
) as cursor: ) as cursor:
row = await cursor.fetchone() row = await cursor.fetchone()
if not row: if not row:
@@ -276,42 +302,58 @@ class AsyncDatabaseManager:
# Load content from files using stored hashes # Load content from files using stored hashes
content_fields = { content_fields = {
'html': row_dict['html'], "html": row_dict["html"],
'cleaned_html': row_dict['cleaned_html'], "cleaned_html": row_dict["cleaned_html"],
'markdown': row_dict['markdown'], "markdown": row_dict["markdown"],
'extracted_content': row_dict['extracted_content'], "extracted_content": row_dict["extracted_content"],
'screenshot': row_dict['screenshot'], "screenshot": row_dict["screenshot"],
'screenshots': row_dict['screenshot'], "screenshots": row_dict["screenshot"],
} }
for field, hash_value in content_fields.items(): for field, hash_value in content_fields.items():
if hash_value: if hash_value:
content = await self._load_content( content = await self._load_content(
hash_value, hash_value,
field.split('_')[0] # Get content type from field name field.split("_")[0], # Get content type from field name
) )
row_dict[field] = content or "" row_dict[field] = content or ""
else: else:
row_dict[field] = "" row_dict[field] = ""
# Parse JSON fields # Parse JSON fields
json_fields = ['media', 'links', 'metadata', 'response_headers', 'markdown'] json_fields = [
"media",
"links",
"metadata",
"response_headers",
"markdown",
]
for field in json_fields: for field in json_fields:
try: try:
row_dict[field] = json.loads(row_dict[field]) if row_dict[field] else {} row_dict[field] = (
json.loads(row_dict[field]) if row_dict[field] else {}
)
except json.JSONDecodeError: except json.JSONDecodeError:
row_dict[field] = {} # Legacy rows may store markdown as a plain string rather than JSON; keep that string unchanged
if field == "markdown" and isinstance(row_dict[field], str):
row_dict[field] = row_dict[field]
else:
row_dict[field] = {}
if isinstance(row_dict['markdown'], Dict): if isinstance(row_dict["markdown"], Dict):
row_dict['markdown_v2'] = row_dict['markdown'] row_dict["markdown_v2"] = row_dict["markdown"]
if row_dict['markdown'].get('raw_markdown'): if row_dict["markdown"].get("raw_markdown"):
row_dict['markdown'] = row_dict['markdown']['raw_markdown'] row_dict["markdown"] = row_dict["markdown"]["raw_markdown"]
# Parse downloaded_files # Parse downloaded_files
try: try:
row_dict['downloaded_files'] = json.loads(row_dict['downloaded_files']) if row_dict['downloaded_files'] else [] row_dict["downloaded_files"] = (
json.loads(row_dict["downloaded_files"])
if row_dict["downloaded_files"]
else []
)
except json.JSONDecodeError: except json.JSONDecodeError:
row_dict['downloaded_files'] = [] row_dict["downloaded_files"] = []
# Remove any fields not in CrawlResult model # Remove any fields not in CrawlResult model
valid_fields = CrawlResult.__annotations__.keys() valid_fields = CrawlResult.__annotations__.keys()
@@ -326,7 +368,7 @@ class AsyncDatabaseManager:
message="Error retrieving cached URL: {error}", message="Error retrieving cached URL: {error}",
tag="ERROR", tag="ERROR",
force_verbose=True, force_verbose=True,
params={"error": str(e)} params={"error": str(e)},
) )
return None return None
@@ -334,37 +376,52 @@ class AsyncDatabaseManager:
"""Cache CrawlResult data""" """Cache CrawlResult data"""
# Store content files and get hashes # Store content files and get hashes
content_map = { content_map = {
'html': (result.html, 'html'), "html": (result.html, "html"),
'cleaned_html': (result.cleaned_html or "", 'cleaned'), "cleaned_html": (result.cleaned_html or "", "cleaned"),
'markdown': None, "markdown": None,
'extracted_content': (result.extracted_content or "", 'extracted'), "extracted_content": (result.extracted_content or "", "extracted"),
'screenshot': (result.screenshot or "", 'screenshots') "screenshot": (result.screenshot or "", "screenshots"),
} }
try: try:
if isinstance(result.markdown, MarkdownGenerationResult): if isinstance(result.markdown, MarkdownGenerationResult):
content_map['markdown'] = (result.markdown.model_dump_json(), 'markdown') content_map["markdown"] = (
elif hasattr(result, 'markdown_v2'): result.markdown.model_dump_json(),
content_map['markdown'] = (result.markdown_v2.model_dump_json(), 'markdown') "markdown",
)
elif hasattr(result, "markdown_v2"):
content_map["markdown"] = (
result.markdown_v2.model_dump_json(),
"markdown",
)
elif isinstance(result.markdown, str): elif isinstance(result.markdown, str):
markdown_result = MarkdownGenerationResult(raw_markdown=result.markdown) markdown_result = MarkdownGenerationResult(raw_markdown=result.markdown)
content_map['markdown'] = (markdown_result.model_dump_json(), 'markdown') content_map["markdown"] = (
markdown_result.model_dump_json(),
"markdown",
)
else: else:
content_map['markdown'] = (MarkdownGenerationResult().model_dump_json(), 'markdown') content_map["markdown"] = (
MarkdownGenerationResult().model_dump_json(),
"markdown",
)
except Exception as e: except Exception as e:
self.logger.warning( self.logger.warning(
message=f"Error processing markdown content: {str(e)}", message=f"Error processing markdown content: {str(e)}", tag="WARNING"
tag="WARNING"
) )
# Fallback to empty markdown result # Fallback to empty markdown result
content_map['markdown'] = (MarkdownGenerationResult().model_dump_json(), 'markdown') content_map["markdown"] = (
MarkdownGenerationResult().model_dump_json(),
"markdown",
)
content_hashes = {} content_hashes = {}
for field, (content, content_type) in content_map.items(): for field, (content, content_type) in content_map.items():
content_hashes[field] = await self._store_content(content, content_type) content_hashes[field] = await self._store_content(content, content_type)
async def _cache(db): async def _cache(db):
await db.execute(''' await db.execute(
"""
INSERT INTO crawled_data ( INSERT INTO crawled_data (
url, html, cleaned_html, markdown, url, html, cleaned_html, markdown,
extracted_content, success, media, links, metadata, extracted_content, success, media, links, metadata,
@@ -383,20 +440,22 @@ class AsyncDatabaseManager:
screenshot = excluded.screenshot, screenshot = excluded.screenshot,
response_headers = excluded.response_headers, response_headers = excluded.response_headers,
downloaded_files = excluded.downloaded_files downloaded_files = excluded.downloaded_files
''', ( """,
result.url, (
content_hashes['html'], result.url,
content_hashes['cleaned_html'], content_hashes["html"],
content_hashes['markdown'], content_hashes["cleaned_html"],
content_hashes['extracted_content'], content_hashes["markdown"],
result.success, content_hashes["extracted_content"],
json.dumps(result.media), result.success,
json.dumps(result.links), json.dumps(result.media),
json.dumps(result.metadata or {}), json.dumps(result.links),
content_hashes['screenshot'], json.dumps(result.metadata or {}),
json.dumps(result.response_headers or {}), content_hashes["screenshot"],
json.dumps(result.downloaded_files or []) json.dumps(result.response_headers or {}),
)) json.dumps(result.downloaded_files or []),
),
)
try: try:
await self.execute_with_retry(_cache) await self.execute_with_retry(_cache)
@@ -405,14 +464,14 @@ class AsyncDatabaseManager:
message="Error caching URL: {error}", message="Error caching URL: {error}",
tag="ERROR", tag="ERROR",
force_verbose=True, force_verbose=True,
params={"error": str(e)} params={"error": str(e)},
) )
async def aget_total_count(self) -> int: async def aget_total_count(self) -> int:
"""Get total number of cached URLs""" """Get total number of cached URLs"""
async def _count(db): async def _count(db):
async with db.execute('SELECT COUNT(*) FROM crawled_data') as cursor: async with db.execute("SELECT COUNT(*) FROM crawled_data") as cursor:
result = await cursor.fetchone() result = await cursor.fetchone()
return result[0] if result else 0 return result[0] if result else 0
@@ -423,14 +482,15 @@ class AsyncDatabaseManager:
message="Error getting total count: {error}", message="Error getting total count: {error}",
tag="ERROR", tag="ERROR",
force_verbose=True, force_verbose=True,
params={"error": str(e)} params={"error": str(e)},
) )
return 0 return 0
async def aclear_db(self): async def aclear_db(self):
"""Clear all data from the database""" """Clear all data from the database"""
async def _clear(db): async def _clear(db):
await db.execute('DELETE FROM crawled_data') await db.execute("DELETE FROM crawled_data")
try: try:
await self.execute_with_retry(_clear) await self.execute_with_retry(_clear)
@@ -439,13 +499,14 @@ class AsyncDatabaseManager:
message="Error clearing database: {error}", message="Error clearing database: {error}",
tag="ERROR", tag="ERROR",
force_verbose=True, force_verbose=True,
params={"error": str(e)} params={"error": str(e)},
) )
async def aflush_db(self): async def aflush_db(self):
"""Drop the entire table""" """Drop the entire table"""
async def _flush(db): async def _flush(db):
await db.execute('DROP TABLE IF EXISTS crawled_data') await db.execute("DROP TABLE IF EXISTS crawled_data")
try: try:
await self.execute_with_retry(_flush) await self.execute_with_retry(_flush)
@@ -454,10 +515,9 @@ class AsyncDatabaseManager:
message="Error flushing database: {error}", message="Error flushing database: {error}",
tag="ERROR", tag="ERROR",
force_verbose=True, force_verbose=True,
params={"error": str(e)} params={"error": str(e)},
) )
async def _store_content(self, content: str, content_type: str) -> str: async def _store_content(self, content: str, content_type: str) -> str:
"""Store content in filesystem and return hash""" """Store content in filesystem and return hash"""
if not content: if not content:
@@ -468,28 +528,31 @@ class AsyncDatabaseManager:
# Only write if file doesn't exist # Only write if file doesn't exist
if not os.path.exists(file_path): if not os.path.exists(file_path):
async with aiofiles.open(file_path, 'w', encoding='utf-8') as f: async with aiofiles.open(file_path, "w", encoding="utf-8") as f:
await f.write(content) await f.write(content)
return content_hash return content_hash
async def _load_content(self, content_hash: str, content_type: str) -> Optional[str]: async def _load_content(
self, content_hash: str, content_type: str
) -> Optional[str]:
"""Load content from filesystem by hash""" """Load content from filesystem by hash"""
if not content_hash: if not content_hash:
return None return None
file_path = os.path.join(self.content_paths[content_type], content_hash) file_path = os.path.join(self.content_paths[content_type], content_hash)
try: try:
async with aiofiles.open(file_path, 'r', encoding='utf-8') as f: async with aiofiles.open(file_path, "r", encoding="utf-8") as f:
return await f.read() return await f.read()
except: except:
self.logger.error( self.logger.error(
message="Failed to load content: {file_path}", message="Failed to load content: {file_path}",
tag="ERROR", tag="ERROR",
force_verbose=True, force_verbose=True,
params={"file_path": file_path} params={"file_path": file_path},
) )
return None return None
# Create a singleton instance # Create a singleton instance
async_db_manager = AsyncDatabaseManager() async_db_manager = AsyncDatabaseManager()
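
The `_store_content` / `_load_content` pair in the hunks above keeps page content on disk under a file named after its hash and writes only when that file does not already exist. A minimal synchronous sketch of that idea follows. It is not the project's implementation: the function names, the use of SHA-256, and the empty-string return for empty content are assumptions made for illustration, since the diff does not show how `content_hash` is computed.

```python
import hashlib
import os
import tempfile
from typing import Optional


def store_content(base_dir: str, content: str) -> str:
    """Write content to a file named after its hash and return the hash."""
    if not content:
        # Assumption: empty content is not stored and yields an empty hash.
        return ""
    # Assumption: SHA-256 over the UTF-8 bytes; the diff does not show the hash used.
    content_hash = hashlib.sha256(content.encode("utf-8")).hexdigest()
    file_path = os.path.join(base_dir, content_hash)
    # Only write if the file doesn't exist, so identical content is stored once.
    if not os.path.exists(file_path):
        with open(file_path, "w", encoding="utf-8") as f:
            f.write(content)
    return content_hash


def load_content(base_dir: str, content_hash: str) -> Optional[str]:
    """Read content back by hash; return None when it is missing or unreadable."""
    if not content_hash:
        return None
    file_path = os.path.join(base_dir, content_hash)
    try:
        with open(file_path, "r", encoding="utf-8") as f:
            return f.read()
    except OSError:
        return None
```

Catching `OSError` here, rather than using a bare `except:`, keeps unrelated errors such as `KeyboardInterrupt` from being swallowed while still covering missing or unreadable files.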


@@ -0,0 +1,647 @@
from typing import Dict, Optional, List, Tuple
from .async_configs import CrawlerRunConfig
from .models import (
CrawlResult,
CrawlerTaskResult,
CrawlStatus,
DisplayMode,
CrawlStats,
DomainState,
)
from rich.live import Live
from rich.table import Table
from rich.console import Console
from rich import box
from datetime import datetime, timedelta
from collections.abc import AsyncGenerator
import time
import psutil
import asyncio
import uuid
from urllib.parse import urlparse
import random
from abc import ABC, abstractmethod
class RateLimiter:
def __init__(
self,
base_delay: Tuple[float, float] = (1.0, 3.0),
max_delay: float = 60.0,
max_retries: int = 3,
rate_limit_codes: List[int] = None,
):
self.base_delay = base_delay
self.max_delay = max_delay
self.max_retries = max_retries
self.rate_limit_codes = rate_limit_codes or [429, 503]
self.domains: Dict[str, DomainState] = {}
def get_domain(self, url: str) -> str:
return urlparse(url).netloc
async def wait_if_needed(self, url: str) -> None:
domain = self.get_domain(url)
state = self.domains.get(domain)
if not state:
self.domains[domain] = DomainState()
state = self.domains[domain]
now = time.time()
if state.last_request_time:
wait_time = max(0, state.current_delay - (now - state.last_request_time))
if wait_time > 0:
await asyncio.sleep(wait_time)
# Random delay within base range if no current delay
if state.current_delay == 0:
state.current_delay = random.uniform(*self.base_delay)
state.last_request_time = time.time()
def update_delay(self, url: str, status_code: int) -> bool:
domain = self.get_domain(url)
state = self.domains[domain]
if status_code in self.rate_limit_codes:
state.fail_count += 1
if state.fail_count > self.max_retries:
return False
# Exponential backoff with random jitter
state.current_delay = min(
state.current_delay * 2 * random.uniform(0.75, 1.25), self.max_delay
)
else:
# Gradually reduce delay on success
state.current_delay = max(
random.uniform(*self.base_delay), state.current_delay * 0.75
)
state.fail_count = 0
return True
class CrawlerMonitor:
def __init__(
self,
max_visible_rows: int = 15,
display_mode: DisplayMode = DisplayMode.DETAILED,
):
self.console = Console()
self.max_visible_rows = max_visible_rows
self.display_mode = display_mode
self.stats: Dict[str, CrawlStats] = {}
self.process = psutil.Process()
self.start_time = datetime.now()
self.live = Live(self._create_table(), refresh_per_second=2)
def start(self):
self.live.start()
def stop(self):
self.live.stop()
def add_task(self, task_id: str, url: str):
self.stats[task_id] = CrawlStats(
task_id=task_id, url=url, status=CrawlStatus.QUEUED
)
self.live.update(self._create_table())
def update_task(self, task_id: str, **kwargs):
if task_id in self.stats:
for key, value in kwargs.items():
setattr(self.stats[task_id], key, value)
self.live.update(self._create_table())
def _create_aggregated_table(self) -> Table:
"""Creates a compact table showing only aggregated statistics"""
table = Table(
box=box.ROUNDED,
title="Crawler Status Overview",
title_style="bold magenta",
header_style="bold blue",
show_lines=True,
)
# Calculate statistics
total_tasks = len(self.stats)
queued = sum(
1 for stat in self.stats.values() if stat.status == CrawlStatus.QUEUED
)
in_progress = sum(
1 for stat in self.stats.values() if stat.status == CrawlStatus.IN_PROGRESS
)
completed = sum(
1 for stat in self.stats.values() if stat.status == CrawlStatus.COMPLETED
)
failed = sum(
1 for stat in self.stats.values() if stat.status == CrawlStatus.FAILED
)
# Memory statistics
current_memory = self.process.memory_info().rss / (1024 * 1024)
total_task_memory = sum(stat.memory_usage for stat in self.stats.values())
peak_memory = max(
(stat.peak_memory for stat in self.stats.values()), default=0.0
)
# Duration
duration = datetime.now() - self.start_time
# Create status row
table.add_column("Status", style="bold cyan")
table.add_column("Count", justify="right")
table.add_column("Percentage", justify="right")
table.add_row("Total Tasks", str(total_tasks), "100%")
table.add_row(
"[yellow]In Queue[/yellow]",
str(queued),
f"{(queued/total_tasks*100):.1f}%" if total_tasks > 0 else "0%",
)
table.add_row(
"[blue]In Progress[/blue]",
str(in_progress),
f"{(in_progress/total_tasks*100):.1f}%" if total_tasks > 0 else "0%",
)
table.add_row(
"[green]Completed[/green]",
str(completed),
f"{(completed/total_tasks*100):.1f}%" if total_tasks > 0 else "0%",
)
table.add_row(
"[red]Failed[/red]",
str(failed),
f"{(failed/total_tasks*100):.1f}%" if total_tasks > 0 else "0%",
)
# Add memory information
table.add_section()
table.add_row(
"[magenta]Current Memory[/magenta]", f"{current_memory:.1f} MB", ""
)
table.add_row(
"[magenta]Total Task Memory[/magenta]", f"{total_task_memory:.1f} MB", ""
)
table.add_row(
"[magenta]Peak Task Memory[/magenta]", f"{peak_memory:.1f} MB", ""
)
table.add_row(
"[yellow]Runtime[/yellow]",
str(timedelta(seconds=int(duration.total_seconds()))),
"",
)
return table
def _create_detailed_table(self) -> Table:
table = Table(
box=box.ROUNDED,
title="Crawler Performance Monitor",
title_style="bold magenta",
header_style="bold blue",
)
# Add columns
table.add_column("Task ID", style="cyan", no_wrap=True)
table.add_column("URL", style="cyan", no_wrap=True)
table.add_column("Status", style="bold")
table.add_column("Memory (MB)", justify="right")
table.add_column("Peak (MB)", justify="right")
table.add_column("Duration", justify="right")
table.add_column("Info", style="italic")
# Add summary row
total_memory = sum(stat.memory_usage for stat in self.stats.values())
active_count = sum(
1 for stat in self.stats.values() if stat.status == CrawlStatus.IN_PROGRESS
)
completed_count = sum(
1 for stat in self.stats.values() if stat.status == CrawlStatus.COMPLETED
)
failed_count = sum(
1 for stat in self.stats.values() if stat.status == CrawlStatus.FAILED
)
table.add_row(
"[bold yellow]SUMMARY",
f"Total: {len(self.stats)}",
f"Active: {active_count}",
f"{total_memory:.1f}",
f"{self.process.memory_info().rss / (1024 * 1024):.1f}",
str(
timedelta(
seconds=int((datetime.now() - self.start_time).total_seconds())
)
),
f"✓{completed_count} ✗{failed_count}",
style="bold",
)
table.add_section()
# Add rows for each task
visible_stats = sorted(
self.stats.values(),
key=lambda x: (
x.status != CrawlStatus.IN_PROGRESS,
x.status != CrawlStatus.QUEUED,
x.end_time or datetime.max,
),
)[: self.max_visible_rows]
for stat in visible_stats:
status_style = {
CrawlStatus.QUEUED: "white",
CrawlStatus.IN_PROGRESS: "yellow",
CrawlStatus.COMPLETED: "green",
CrawlStatus.FAILED: "red",
}[stat.status]
table.add_row(
stat.task_id[:8], # Show first 8 chars of task ID
stat.url[:40] + "..." if len(stat.url) > 40 else stat.url,
f"[{status_style}]{stat.status.value}[/{status_style}]",
f"{stat.memory_usage:.1f}",
f"{stat.peak_memory:.1f}",
stat.duration,
stat.error_message[:40] if stat.error_message else "",
)
return table
def _create_table(self) -> Table:
"""Creates the appropriate table based on display mode"""
if self.display_mode == DisplayMode.AGGREGATED:
return self._create_aggregated_table()
return self._create_detailed_table()
class BaseDispatcher(ABC):
def __init__(
self,
rate_limiter: Optional[RateLimiter] = None,
monitor: Optional[CrawlerMonitor] = None,
):
self.crawler = None
self._domain_last_hit: Dict[str, float] = {}
self.concurrent_sessions = 0
self.rate_limiter = rate_limiter
self.monitor = monitor
@abstractmethod
async def crawl_url(
self,
url: str,
config: CrawlerRunConfig,
task_id: str,
monitor: Optional[CrawlerMonitor] = None,
) -> CrawlerTaskResult:
pass
@abstractmethod
async def run_urls(
self,
urls: List[str],
crawler: "AsyncWebCrawler", # noqa: F821
config: CrawlerRunConfig,
monitor: Optional[CrawlerMonitor] = None,
) -> List[CrawlerTaskResult]:
pass
class MemoryAdaptiveDispatcher(BaseDispatcher):
def __init__(
self,
memory_threshold_percent: float = 90.0,
check_interval: float = 1.0,
max_session_permit: int = 20,
memory_wait_timeout: float = 300.0, # 5 minutes default timeout
rate_limiter: Optional[RateLimiter] = None,
monitor: Optional[CrawlerMonitor] = None,
):
super().__init__(rate_limiter, monitor)
self.memory_threshold_percent = memory_threshold_percent
self.check_interval = check_interval
self.max_session_permit = max_session_permit
self.memory_wait_timeout = memory_wait_timeout
self.result_queue = asyncio.Queue() # Queue for storing results
async def crawl_url(
self,
url: str,
config: CrawlerRunConfig,
task_id: str,
) -> CrawlerTaskResult:
start_time = datetime.now()
error_message = ""
memory_usage = peak_memory = 0.0
try:
if self.monitor:
self.monitor.update_task(
task_id, status=CrawlStatus.IN_PROGRESS, start_time=start_time
)
self.concurrent_sessions += 1
if self.rate_limiter:
await self.rate_limiter.wait_if_needed(url)
process = psutil.Process()
start_memory = process.memory_info().rss / (1024 * 1024)
result = await self.crawler.arun(url, config=config, session_id=task_id)
end_memory = process.memory_info().rss / (1024 * 1024)
memory_usage = peak_memory = end_memory - start_memory
if self.rate_limiter and result.status_code:
if not self.rate_limiter.update_delay(url, result.status_code):
error_message = f"Rate limit retry count exceeded for domain {urlparse(url).netloc}"
if self.monitor:
self.monitor.update_task(task_id, status=CrawlStatus.FAILED)
result = CrawlerTaskResult(
task_id=task_id,
url=url,
result=result,
memory_usage=memory_usage,
peak_memory=peak_memory,
start_time=start_time,
end_time=datetime.now(),
error_message=error_message,
)
await self.result_queue.put(result)
return result
if not result.success:
error_message = result.error_message
if self.monitor:
self.monitor.update_task(task_id, status=CrawlStatus.FAILED)
elif self.monitor:
self.monitor.update_task(task_id, status=CrawlStatus.COMPLETED)
except Exception as e:
error_message = str(e)
if self.monitor:
self.monitor.update_task(task_id, status=CrawlStatus.FAILED)
result = CrawlResult(
url=url, html="", metadata={}, success=False, error_message=str(e)
)
finally:
end_time = datetime.now()
if self.monitor:
self.monitor.update_task(
task_id,
end_time=end_time,
memory_usage=memory_usage,
peak_memory=peak_memory,
error_message=error_message,
)
self.concurrent_sessions -= 1
return CrawlerTaskResult(
task_id=task_id,
url=url,
result=result,
memory_usage=memory_usage,
peak_memory=peak_memory,
start_time=start_time,
end_time=end_time,
error_message=error_message,
)
async def run_urls(
self,
urls: List[str],
crawler: "AsyncWebCrawler", # noqa: F821
config: CrawlerRunConfig,
) -> List[CrawlerTaskResult]:
self.crawler = crawler
if self.monitor:
self.monitor.start()
try:
pending_tasks = []
active_tasks = []
task_queue = []
for url in urls:
task_id = str(uuid.uuid4())
if self.monitor:
self.monitor.add_task(task_id, url)
task_queue.append((url, task_id))
while task_queue or active_tasks:
wait_start_time = time.time()
while len(active_tasks) < self.max_session_permit and task_queue:
if psutil.virtual_memory().percent >= self.memory_threshold_percent:
# Check if we've exceeded the timeout
if time.time() - wait_start_time > self.memory_wait_timeout:
raise MemoryError(
f"Memory usage above threshold ({self.memory_threshold_percent}%) for more than {self.memory_wait_timeout} seconds"
)
await asyncio.sleep(self.check_interval)
continue
url, task_id = task_queue.pop(0)
task = asyncio.create_task(self.crawl_url(url, config, task_id))
active_tasks.append(task)
if not active_tasks:
await asyncio.sleep(self.check_interval)
continue
done, pending = await asyncio.wait(
active_tasks, return_when=asyncio.FIRST_COMPLETED
)
pending_tasks.extend(done)
active_tasks = list(pending)
return await asyncio.gather(*pending_tasks)
finally:
if self.monitor:
self.monitor.stop()
async def run_urls_stream(
self,
urls: List[str],
crawler: "AsyncWebCrawler",
config: CrawlerRunConfig,
) -> AsyncGenerator[CrawlerTaskResult, None]:
self.crawler = crawler
if self.monitor:
self.monitor.start()
try:
active_tasks = []
task_queue = []
completed_count = 0
total_urls = len(urls)
# Initialize task queue
for url in urls:
task_id = str(uuid.uuid4())
if self.monitor:
self.monitor.add_task(task_id, url)
task_queue.append((url, task_id))
while completed_count < total_urls:
# Start new tasks if memory permits
while len(active_tasks) < self.max_session_permit and task_queue:
if psutil.virtual_memory().percent >= self.memory_threshold_percent:
await asyncio.sleep(self.check_interval)
continue
url, task_id = task_queue.pop(0)
task = asyncio.create_task(self.crawl_url(url, config, task_id))
active_tasks.append(task)
if not active_tasks and not task_queue:
break
# Wait for any task to complete and yield results
if active_tasks:
done, pending = await asyncio.wait(
active_tasks,
timeout=0.1,
return_when=asyncio.FIRST_COMPLETED
)
for completed_task in done:
result = await completed_task
completed_count += 1
yield result
active_tasks = list(pending)
else:
await asyncio.sleep(self.check_interval)
finally:
if self.monitor:
self.monitor.stop()
class SemaphoreDispatcher(BaseDispatcher):
def __init__(
self,
semaphore_count: int = 5,
max_session_permit: int = 20,
rate_limiter: Optional[RateLimiter] = None,
monitor: Optional[CrawlerMonitor] = None,
):
super().__init__(rate_limiter, monitor)
self.semaphore_count = semaphore_count
self.max_session_permit = max_session_permit
async def crawl_url(
self,
url: str,
config: CrawlerRunConfig,
task_id: str,
semaphore: asyncio.Semaphore = None,
) -> CrawlerTaskResult:
start_time = datetime.now()
error_message = ""
memory_usage = peak_memory = 0.0
try:
if self.monitor:
self.monitor.update_task(
task_id, status=CrawlStatus.IN_PROGRESS, start_time=start_time
)
if self.rate_limiter:
await self.rate_limiter.wait_if_needed(url)
async with semaphore:
process = psutil.Process()
start_memory = process.memory_info().rss / (1024 * 1024)
result = await self.crawler.arun(url, config=config, session_id=task_id)
end_memory = process.memory_info().rss / (1024 * 1024)
memory_usage = peak_memory = end_memory - start_memory
if self.rate_limiter and result.status_code:
if not self.rate_limiter.update_delay(url, result.status_code):
error_message = f"Rate limit retry count exceeded for domain {urlparse(url).netloc}"
if self.monitor:
self.monitor.update_task(task_id, status=CrawlStatus.FAILED)
return CrawlerTaskResult(
task_id=task_id,
url=url,
result=result,
memory_usage=memory_usage,
peak_memory=peak_memory,
start_time=start_time,
end_time=datetime.now(),
error_message=error_message,
)
if not result.success:
error_message = result.error_message
if self.monitor:
self.monitor.update_task(task_id, status=CrawlStatus.FAILED)
elif self.monitor:
self.monitor.update_task(task_id, status=CrawlStatus.COMPLETED)
except Exception as e:
error_message = str(e)
if self.monitor:
self.monitor.update_task(task_id, status=CrawlStatus.FAILED)
result = CrawlResult(
url=url, html="", metadata={}, success=False, error_message=str(e)
)
finally:
end_time = datetime.now()
if self.monitor:
self.monitor.update_task(
task_id,
end_time=end_time,
memory_usage=memory_usage,
peak_memory=peak_memory,
error_message=error_message,
)
return CrawlerTaskResult(
task_id=task_id,
url=url,
result=result,
memory_usage=memory_usage,
peak_memory=peak_memory,
start_time=start_time,
end_time=end_time,
error_message=error_message,
)
async def run_urls(
self,
crawler: "AsyncWebCrawler", # noqa: F821
urls: List[str],
config: CrawlerRunConfig,
) -> List[CrawlerTaskResult]:
self.crawler = crawler
if self.monitor:
self.monitor.start()
try:
semaphore = asyncio.Semaphore(self.semaphore_count)
tasks = []
for url in urls:
task_id = str(uuid.uuid4())
if self.monitor:
self.monitor.add_task(task_id, url)
task = asyncio.create_task(
self.crawl_url(url, config, task_id, semaphore)
)
tasks.append(task)
return await asyncio.gather(*tasks, return_exceptions=True)
finally:
if self.monitor:
self.monitor.stop()


@@ -0,0 +1,588 @@
from typing import Dict, Optional, List, Tuple
from .async_configs import CrawlerRunConfig
from .models import (
CrawlResult,
CrawlerTaskResult,
CrawlStatus,
DisplayMode,
CrawlStats,
DomainState,
)
from rich.live import Live
from rich.table import Table
from rich.console import Console
from rich import box
from datetime import datetime, timedelta
import time
import psutil
import asyncio
import uuid
from urllib.parse import urlparse
import random
from abc import ABC, abstractmethod
class RateLimiter:
def __init__(
self,
base_delay: Tuple[float, float] = (1.0, 3.0),
max_delay: float = 60.0,
max_retries: int = 3,
rate_limit_codes: List[int] = None,
):
self.base_delay = base_delay
self.max_delay = max_delay
self.max_retries = max_retries
self.rate_limit_codes = rate_limit_codes or [429, 503]
self.domains: Dict[str, DomainState] = {}
def get_domain(self, url: str) -> str:
return urlparse(url).netloc
async def wait_if_needed(self, url: str) -> None:
domain = self.get_domain(url)
state = self.domains.get(domain)
if not state:
self.domains[domain] = DomainState()
state = self.domains[domain]
now = time.time()
if state.last_request_time:
wait_time = max(0, state.current_delay - (now - state.last_request_time))
if wait_time > 0:
await asyncio.sleep(wait_time)
# Random delay within base range if no current delay
if state.current_delay == 0:
state.current_delay = random.uniform(*self.base_delay)
state.last_request_time = time.time()
def update_delay(self, url: str, status_code: int) -> bool:
domain = self.get_domain(url)
state = self.domains[domain]
if status_code in self.rate_limit_codes:
state.fail_count += 1
if state.fail_count > self.max_retries:
return False
# Exponential backoff with random jitter
state.current_delay = min(
state.current_delay * 2 * random.uniform(0.75, 1.25), self.max_delay
)
else:
# Gradually reduce delay on success
state.current_delay = max(
random.uniform(*self.base_delay), state.current_delay * 0.75
)
state.fail_count = 0
return True
class CrawlerMonitor:
def __init__(
self,
max_visible_rows: int = 15,
display_mode: DisplayMode = DisplayMode.DETAILED,
):
self.console = Console()
self.max_visible_rows = max_visible_rows
self.display_mode = display_mode
self.stats: Dict[str, CrawlStats] = {}
self.process = psutil.Process()
self.start_time = datetime.now()
self.live = Live(self._create_table(), refresh_per_second=2)
def start(self):
self.live.start()
def stop(self):
self.live.stop()
def add_task(self, task_id: str, url: str):
self.stats[task_id] = CrawlStats(
task_id=task_id, url=url, status=CrawlStatus.QUEUED
)
self.live.update(self._create_table())
def update_task(self, task_id: str, **kwargs):
if task_id in self.stats:
for key, value in kwargs.items():
setattr(self.stats[task_id], key, value)
self.live.update(self._create_table())
def _create_aggregated_table(self) -> Table:
"""Creates a compact table showing only aggregated statistics"""
table = Table(
box=box.ROUNDED,
title="Crawler Status Overview",
title_style="bold magenta",
header_style="bold blue",
show_lines=True,
)
# Calculate statistics
total_tasks = len(self.stats)
queued = sum(
1 for stat in self.stats.values() if stat.status == CrawlStatus.QUEUED
)
in_progress = sum(
1 for stat in self.stats.values() if stat.status == CrawlStatus.IN_PROGRESS
)
completed = sum(
1 for stat in self.stats.values() if stat.status == CrawlStatus.COMPLETED
)
failed = sum(
1 for stat in self.stats.values() if stat.status == CrawlStatus.FAILED
)
# Memory statistics
current_memory = self.process.memory_info().rss / (1024 * 1024)
total_task_memory = sum(stat.memory_usage for stat in self.stats.values())
peak_memory = max(
(stat.peak_memory for stat in self.stats.values()), default=0.0
)
# Duration
duration = datetime.now() - self.start_time
# Create status row
table.add_column("Status", style="bold cyan")
table.add_column("Count", justify="right")
table.add_column("Percentage", justify="right")
table.add_row("Total Tasks", str(total_tasks), "100%")
table.add_row(
"[yellow]In Queue[/yellow]",
str(queued),
f"{(queued/total_tasks*100):.1f}%" if total_tasks > 0 else "0%",
)
table.add_row(
"[blue]In Progress[/blue]",
str(in_progress),
f"{(in_progress/total_tasks*100):.1f}%" if total_tasks > 0 else "0%",
)
table.add_row(
"[green]Completed[/green]",
str(completed),
f"{(completed/total_tasks*100):.1f}%" if total_tasks > 0 else "0%",
)
table.add_row(
"[red]Failed[/red]",
str(failed),
f"{(failed/total_tasks*100):.1f}%" if total_tasks > 0 else "0%",
)
# Add memory information
table.add_section()
table.add_row(
"[magenta]Current Memory[/magenta]", f"{current_memory:.1f} MB", ""
)
table.add_row(
"[magenta]Total Task Memory[/magenta]", f"{total_task_memory:.1f} MB", ""
)
table.add_row(
"[magenta]Peak Task Memory[/magenta]", f"{peak_memory:.1f} MB", ""
)
table.add_row(
"[yellow]Runtime[/yellow]",
str(timedelta(seconds=int(duration.total_seconds()))),
"",
)
return table
def _create_detailed_table(self) -> Table:
table = Table(
box=box.ROUNDED,
title="Crawler Performance Monitor",
title_style="bold magenta",
header_style="bold blue",
)
# Add columns
table.add_column("Task ID", style="cyan", no_wrap=True)
table.add_column("URL", style="cyan", no_wrap=True)
table.add_column("Status", style="bold")
table.add_column("Memory (MB)", justify="right")
table.add_column("Peak (MB)", justify="right")
table.add_column("Duration", justify="right")
table.add_column("Info", style="italic")
# Add summary row
total_memory = sum(stat.memory_usage for stat in self.stats.values())
active_count = sum(
1 for stat in self.stats.values() if stat.status == CrawlStatus.IN_PROGRESS
)
completed_count = sum(
1 for stat in self.stats.values() if stat.status == CrawlStatus.COMPLETED
)
failed_count = sum(
1 for stat in self.stats.values() if stat.status == CrawlStatus.FAILED
)
table.add_row(
"[bold yellow]SUMMARY",
f"Total: {len(self.stats)}",
f"Active: {active_count}",
f"{total_memory:.1f}",
f"{self.process.memory_info().rss / (1024 * 1024):.1f}",
str(
timedelta(
seconds=int((datetime.now() - self.start_time).total_seconds())
)
),
f"✓{completed_count} ✗{failed_count}",
style="bold",
)
table.add_section()
# Add rows for each task
visible_stats = sorted(
self.stats.values(),
key=lambda x: (
x.status != CrawlStatus.IN_PROGRESS,
x.status != CrawlStatus.QUEUED,
x.end_time or datetime.max,
),
)[: self.max_visible_rows]
for stat in visible_stats:
status_style = {
CrawlStatus.QUEUED: "white",
CrawlStatus.IN_PROGRESS: "yellow",
CrawlStatus.COMPLETED: "green",
CrawlStatus.FAILED: "red",
}[stat.status]
table.add_row(
stat.task_id[:8], # Show first 8 chars of task ID
stat.url[:40] + "..." if len(stat.url) > 40 else stat.url,
f"[{status_style}]{stat.status.value}[/{status_style}]",
f"{stat.memory_usage:.1f}",
f"{stat.peak_memory:.1f}",
stat.duration,
stat.error_message[:40] if stat.error_message else "",
)
return table
def _create_table(self) -> Table:
"""Creates the appropriate table based on display mode"""
if self.display_mode == DisplayMode.AGGREGATED:
return self._create_aggregated_table()
return self._create_detailed_table()
class BaseDispatcher(ABC):
def __init__(
self,
rate_limiter: Optional[RateLimiter] = None,
monitor: Optional[CrawlerMonitor] = None,
):
self.crawler = None
self._domain_last_hit: Dict[str, float] = {}
self.concurrent_sessions = 0
self.rate_limiter = rate_limiter
self.monitor = monitor
@abstractmethod
async def crawl_url(
self,
url: str,
config: CrawlerRunConfig,
task_id: str,
monitor: Optional[CrawlerMonitor] = None,
) -> CrawlerTaskResult:
pass
@abstractmethod
async def run_urls(
self,
urls: List[str],
crawler: "AsyncWebCrawler", # noqa: F821
config: CrawlerRunConfig,
monitor: Optional[CrawlerMonitor] = None,
) -> List[CrawlerTaskResult]:
pass
class MemoryAdaptiveDispatcher(BaseDispatcher):
def __init__(
self,
memory_threshold_percent: float = 90.0,
check_interval: float = 1.0,
max_session_permit: int = 20,
memory_wait_timeout: float = 300.0, # 5 minutes default timeout
rate_limiter: Optional[RateLimiter] = None,
monitor: Optional[CrawlerMonitor] = None,
):
super().__init__(rate_limiter, monitor)
self.memory_threshold_percent = memory_threshold_percent
self.check_interval = check_interval
self.max_session_permit = max_session_permit
self.memory_wait_timeout = memory_wait_timeout
async def crawl_url(
self,
url: str,
config: CrawlerRunConfig,
task_id: str,
) -> CrawlerTaskResult:
start_time = datetime.now()
error_message = ""
memory_usage = peak_memory = 0.0
try:
if self.monitor:
self.monitor.update_task(
task_id, status=CrawlStatus.IN_PROGRESS, start_time=start_time
)
self.concurrent_sessions += 1
if self.rate_limiter:
await self.rate_limiter.wait_if_needed(url)
process = psutil.Process()
start_memory = process.memory_info().rss / (1024 * 1024)
result = await self.crawler.arun(url, config=config, session_id=task_id)
end_memory = process.memory_info().rss / (1024 * 1024)
memory_usage = peak_memory = end_memory - start_memory
if self.rate_limiter and result.status_code:
if not self.rate_limiter.update_delay(url, result.status_code):
error_message = f"Rate limit retry count exceeded for domain {urlparse(url).netloc}"
if self.monitor:
self.monitor.update_task(task_id, status=CrawlStatus.FAILED)
return CrawlerTaskResult(
task_id=task_id,
url=url,
result=result,
memory_usage=memory_usage,
peak_memory=peak_memory,
start_time=start_time,
end_time=datetime.now(),
error_message=error_message,
)
if not result.success:
error_message = result.error_message
if self.monitor:
self.monitor.update_task(task_id, status=CrawlStatus.FAILED)
elif self.monitor:
self.monitor.update_task(task_id, status=CrawlStatus.COMPLETED)
except Exception as e:
error_message = str(e)
if self.monitor:
self.monitor.update_task(task_id, status=CrawlStatus.FAILED)
result = CrawlResult(
url=url, html="", metadata={}, success=False, error_message=str(e)
)
finally:
end_time = datetime.now()
if self.monitor:
self.monitor.update_task(
task_id,
end_time=end_time,
memory_usage=memory_usage,
peak_memory=peak_memory,
error_message=error_message,
)
self.concurrent_sessions -= 1
return CrawlerTaskResult(
task_id=task_id,
url=url,
result=result,
memory_usage=memory_usage,
peak_memory=peak_memory,
start_time=start_time,
end_time=end_time,
error_message=error_message,
)
async def run_urls(
self,
urls: List[str],
crawler: "AsyncWebCrawler", # noqa: F821
config: CrawlerRunConfig,
) -> List[CrawlerTaskResult]:
self.crawler = crawler
if self.monitor:
self.monitor.start()
try:
pending_tasks = []
active_tasks = []
task_queue = []
for url in urls:
task_id = str(uuid.uuid4())
if self.monitor:
self.monitor.add_task(task_id, url)
task_queue.append((url, task_id))
while task_queue or active_tasks:
wait_start_time = time.time()
while len(active_tasks) < self.max_session_permit and task_queue:
if psutil.virtual_memory().percent >= self.memory_threshold_percent:
# Check if we've exceeded the timeout
if time.time() - wait_start_time > self.memory_wait_timeout:
raise MemoryError(
f"Memory usage above threshold ({self.memory_threshold_percent}%) for more than {self.memory_wait_timeout} seconds"
)
await asyncio.sleep(self.check_interval)
continue
url, task_id = task_queue.pop(0)
task = asyncio.create_task(self.crawl_url(url, config, task_id))
active_tasks.append(task)
if not active_tasks:
await asyncio.sleep(self.check_interval)
continue
done, pending = await asyncio.wait(
active_tasks, return_when=asyncio.FIRST_COMPLETED
)
pending_tasks.extend(done)
active_tasks = list(pending)
return await asyncio.gather(*pending_tasks)
finally:
if self.monitor:
self.monitor.stop()
class SemaphoreDispatcher(BaseDispatcher):
def __init__(
self,
semaphore_count: int = 5,
max_session_permit: int = 20,
rate_limiter: Optional[RateLimiter] = None,
monitor: Optional[CrawlerMonitor] = None,
):
super().__init__(rate_limiter, monitor)
self.semaphore_count = semaphore_count
self.max_session_permit = max_session_permit
async def crawl_url(
self,
url: str,
config: CrawlerRunConfig,
task_id: str,
semaphore: asyncio.Semaphore = None,
) -> CrawlerTaskResult:
start_time = datetime.now()
error_message = ""
memory_usage = peak_memory = 0.0
try:
if self.monitor:
self.monitor.update_task(
task_id, status=CrawlStatus.IN_PROGRESS, start_time=start_time
)
if self.rate_limiter:
await self.rate_limiter.wait_if_needed(url)
async with semaphore:
process = psutil.Process()
start_memory = process.memory_info().rss / (1024 * 1024)
result = await self.crawler.arun(url, config=config, session_id=task_id)
end_memory = process.memory_info().rss / (1024 * 1024)
memory_usage = peak_memory = end_memory - start_memory
if self.rate_limiter and result.status_code:
if not self.rate_limiter.update_delay(url, result.status_code):
error_message = f"Rate limit retry count exceeded for domain {urlparse(url).netloc}"
if self.monitor:
self.monitor.update_task(task_id, status=CrawlStatus.FAILED)
return CrawlerTaskResult(
task_id=task_id,
url=url,
result=result,
memory_usage=memory_usage,
peak_memory=peak_memory,
start_time=start_time,
end_time=datetime.now(),
error_message=error_message,
)
if not result.success:
error_message = result.error_message
if self.monitor:
self.monitor.update_task(task_id, status=CrawlStatus.FAILED)
elif self.monitor:
self.monitor.update_task(task_id, status=CrawlStatus.COMPLETED)
except Exception as e:
error_message = str(e)
if self.monitor:
self.monitor.update_task(task_id, status=CrawlStatus.FAILED)
result = CrawlResult(
url=url, html="", metadata={}, success=False, error_message=str(e)
)
finally:
end_time = datetime.now()
if self.monitor:
self.monitor.update_task(
task_id,
end_time=end_time,
memory_usage=memory_usage,
peak_memory=peak_memory,
error_message=error_message,
)
return CrawlerTaskResult(
task_id=task_id,
url=url,
result=result,
memory_usage=memory_usage,
peak_memory=peak_memory,
start_time=start_time,
end_time=end_time,
error_message=error_message,
)
async def run_urls(
self,
crawler: "AsyncWebCrawler", # noqa: F821
urls: List[str],
config: CrawlerRunConfig,
) -> List[CrawlerTaskResult]:
self.crawler = crawler
if self.monitor:
self.monitor.start()
try:
semaphore = asyncio.Semaphore(self.semaphore_count)
tasks = []
for url in urls:
task_id = str(uuid.uuid4())
if self.monitor:
self.monitor.add_task(task_id, url)
task = asyncio.create_task(
self.crawl_url(url, config, task_id, semaphore)
)
tasks.append(task)
return await asyncio.gather(*tasks, return_exceptions=True)
finally:
if self.monitor:
self.monitor.stop()
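
The `RateLimiter.update_delay` logic in the file above doubles a domain's delay (with jitter, capped at `max_delay`) on a rate-limit status code, gives up once `max_retries` is exceeded, and decays the delay on success. The sketch below isolates that arithmetic so it can be exercised without a crawler. `DomainBackoff` and `next_delay` are hypothetical names; the defaults mirror those shown in the diff.

```python
import random
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass
class DomainBackoff:
    """Hypothetical stand-in for the per-domain state kept by RateLimiter."""

    current_delay: float = 0.0
    fail_count: int = 0


def next_delay(
    state: DomainBackoff,
    status_code: int,
    base_delay: Tuple[float, float] = (1.0, 3.0),
    max_delay: float = 60.0,
    max_retries: int = 3,
    rate_limit_codes: Sequence[int] = (429, 503),
) -> bool:
    """Update the delay in place; return False when retries are exhausted."""
    if status_code in rate_limit_codes:
        state.fail_count += 1
        if state.fail_count > max_retries:
            return False
        # Exponential backoff with +/-25% jitter, capped at max_delay.
        jitter = random.uniform(0.75, 1.25)
        state.current_delay = min(state.current_delay * 2 * jitter, max_delay)
    else:
        # On success, shrink the delay by 25% but never below a base-range draw.
        state.current_delay = max(
            random.uniform(*base_delay), state.current_delay * 0.75
        )
        state.fail_count = 0
    return True
```

Note that the doubling step leaves a delay of `0` at `0`, so this relies on the delay having been seeded first, as `wait_if_needed` does in the diff by drawing from `base_delay`.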


@@ -1,10 +1,10 @@
from enum import Enum from enum import Enum
from typing import Optional, Dict, Any, Union from typing import Optional, Dict, Any
from colorama import Fore, Back, Style, init from colorama import Fore, Style, init
import time
import os import os
from datetime import datetime from datetime import datetime
class LogLevel(Enum): class LogLevel(Enum):
DEBUG = 1 DEBUG = 1
INFO = 2 INFO = 2
@@ -12,6 +12,7 @@ class LogLevel(Enum):
WARNING = 4 WARNING = 4
ERROR = 5 ERROR = 5
class AsyncLogger: class AsyncLogger:
""" """
Asynchronous logger with support for colored console output and file logging. Asynchronous logger with support for colored console output and file logging.
@@ -19,16 +20,16 @@ class AsyncLogger:
""" """
DEFAULT_ICONS = { DEFAULT_ICONS = {
'INIT': '', "INIT": "",
'READY': '', "READY": "",
'FETCH': '', "FETCH": "",
'SCRAPE': '', "SCRAPE": "",
'EXTRACT': '', "EXTRACT": "",
'COMPLETE': '', "COMPLETE": "",
'ERROR': '×', "ERROR": "×",
'DEBUG': '', "DEBUG": "",
'INFO': '', "INFO": "",
'WARNING': '', "WARNING": "",
} }
DEFAULT_COLORS = { DEFAULT_COLORS = {
@@ -46,7 +47,7 @@ class AsyncLogger:
tag_width: int = 10, tag_width: int = 10,
icons: Optional[Dict[str, str]] = None, icons: Optional[Dict[str, str]] = None,
colors: Optional[Dict[LogLevel, str]] = None, colors: Optional[Dict[LogLevel, str]] = None,
verbose: bool = True verbose: bool = True,
): ):
""" """
Initialize the logger. Initialize the logger.
@@ -77,18 +78,20 @@ class AsyncLogger:
def _get_icon(self, tag: str) -> str: def _get_icon(self, tag: str) -> str:
"""Get the icon for a tag, defaulting to info icon if not found.""" """Get the icon for a tag, defaulting to info icon if not found."""
return self.icons.get(tag, self.icons['INFO']) return self.icons.get(tag, self.icons["INFO"])
def _write_to_file(self, message: str): def _write_to_file(self, message: str):
"""Write a message to the log file if configured.""" """Write a message to the log file if configured."""
if self.log_file: if self.log_file:
timestamp = datetime.now().strftime('%Y-%m-%d %H:%M:%S.%f')[:-3] timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S.%f")[:-3]
with open(self.log_file, 'a', encoding='utf-8') as f: with open(self.log_file, "a", encoding="utf-8") as f:
# Strip ANSI color codes for file output # Strip ANSI color codes for file output
clean_message = message.replace(Fore.RESET, '').replace(Style.RESET_ALL, '') clean_message = message.replace(Fore.RESET, "").replace(
Style.RESET_ALL, ""
)
for color in vars(Fore).values(): for color in vars(Fore).values():
if isinstance(color, str): if isinstance(color, str):
clean_message = clean_message.replace(color, '') clean_message = clean_message.replace(color, "")
f.write(f"[{timestamp}] {clean_message}\n") f.write(f"[{timestamp}] {clean_message}\n")
def _log( def _log(
@@ -99,7 +102,7 @@ class AsyncLogger:
params: Optional[Dict[str, Any]] = None, params: Optional[Dict[str, Any]] = None,
colors: Optional[Dict[str, str]] = None, colors: Optional[Dict[str, str]] = None,
base_color: Optional[str] = None, base_color: Optional[str] = None,
**kwargs **kwargs,
): ):
""" """
Core logging method that handles message formatting and output. Core logging method that handles message formatting and output.
@@ -128,12 +131,13 @@ class AsyncLogger:
if key in params: if key in params:
value_str = str(params[key]) value_str = str(params[key])
formatted_message = formatted_message.replace( formatted_message = formatted_message.replace(
value_str, value_str, f"{color}{value_str}{Style.RESET_ALL}"
f"{color}{value_str}{Style.RESET_ALL}"
) )
except KeyError as e: except KeyError as e:
formatted_message = f"LOGGING ERROR: Missing parameter {e} in message template" formatted_message = (
f"LOGGING ERROR: Missing parameter {e} in message template"
)
level = LogLevel.ERROR level = LogLevel.ERROR
else: else:
formatted_message = message formatted_message = message
@@ -175,7 +179,7 @@ class AsyncLogger:
success: bool, success: bool,
timing: float, timing: float,
tag: str = "FETCH", tag: str = "FETCH",
url_length: int = 50 url_length: int = 50,
): ):
""" """
Convenience method for logging URL fetch status. Convenience method for logging URL fetch status.
@@ -195,20 +199,16 @@ class AsyncLogger:
"url": url, "url": url,
"url_length": url_length, "url_length": url_length,
"status": success, "status": success,
"timing": timing "timing": timing,
}, },
colors={ colors={
"status": Fore.GREEN if success else Fore.RED, "status": Fore.GREEN if success else Fore.RED,
"timing": Fore.YELLOW "timing": Fore.YELLOW,
} },
) )
def error_status( def error_status(
self, self, url: str, error: str, tag: str = "ERROR", url_length: int = 50
url: str,
error: str,
tag: str = "ERROR",
url_length: int = 50
): ):
""" """
Convenience method for logging error status. Convenience method for logging error status.
@@ -223,9 +223,5 @@ class AsyncLogger:
level=LogLevel.ERROR, level=LogLevel.ERROR,
message="{url:.{url_length}}... | Error: {error}", message="{url:.{url_length}}... | Error: {error}",
tag=tag, tag=tag,
params={ params={"url": url, "url_length": url_length, "error": error},
"url": url,
"url_length": url_length,
"error": error
}
) )
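
Review note: error_status relies on a nested replacement field, "{url:.{url_length}}", to truncate the URL inside the message template. The sketch below isolates that behaviour with plain str.format; format_error_line is a hypothetical helper name, and the assumption is that the logger ultimately formats the template with the params dict as keyword arguments.

```python
def format_error_line(url: str, error: str, url_length: int = 50) -> str:
    # The inner {url_length} is substituted first, producing a
    # precision spec such as ".19" that truncates the string.
    template = "{url:.{url_length}}... | Error: {error}"
    return template.format(url=url, url_length=url_length, error=error)


if __name__ == "__main__":
    print(format_error_line("https://example.com/a/very/long/path", "timeout", 19))
```

Note that the "..." suffix is appended even when the URL is shorter than url_length, since the template adds it unconditionally.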

File diff suppressed because it is too large

View File

@@ -12,6 +12,7 @@ class CacheMode(Enum):
- WRITE_ONLY: Only write to cache, don't read - WRITE_ONLY: Only write to cache, don't read
- BYPASS: Bypass cache for this operation - BYPASS: Bypass cache for this operation
""" """
ENABLED = "enabled" ENABLED = "enabled"
DISABLED = "disabled" DISABLED = "disabled"
READ_ONLY = "read_only" READ_ONLY = "read_only"
@@ -36,6 +37,7 @@ class CacheContext:
is_raw_html (bool): True if the URL is raw HTML, False otherwise. is_raw_html (bool): True if the URL is raw HTML, False otherwise.
_url_display (str): The display name for the URL (web, local file, or raw HTML). _url_display (str): The display name for the URL (web, local file, or raw HTML).
""" """
def __init__(self, url: str, cache_mode: CacheMode, always_bypass: bool = False): def __init__(self, url: str, cache_mode: CacheMode, always_bypass: bool = False):
""" """
Initializes the CacheContext with the provided URL and cache mode. Initializes the CacheContext with the provided URL and cache mode.
@@ -48,8 +50,8 @@ class CacheContext:
self.url = url self.url = url
self.cache_mode = cache_mode self.cache_mode = cache_mode
self.always_bypass = always_bypass self.always_bypass = always_bypass
self.is_cacheable = url.startswith(('http://', 'https://', 'file://')) self.is_cacheable = url.startswith(("http://", "https://", "file://"))
self.is_web_url = url.startswith(('http://', 'https://')) self.is_web_url = url.startswith(("http://", "https://"))
self.is_local_file = url.startswith("file://") self.is_local_file = url.startswith("file://")
self.is_raw_html = url.startswith("raw:") self.is_raw_html = url.startswith("raw:")
self._url_display = url if not self.is_raw_html else "Raw HTML" self._url_display = url if not self.is_raw_html else "Raw HTML"
@@ -94,7 +96,7 @@ def _legacy_to_cache_mode(
disable_cache: bool = False, disable_cache: bool = False,
bypass_cache: bool = False, bypass_cache: bool = False,
no_cache_read: bool = False, no_cache_read: bool = False,
no_cache_write: bool = False no_cache_write: bool = False,
) -> CacheMode: ) -> CacheMode:
""" """
Converts legacy cache parameters to the new CacheMode enum. Converts legacy cache parameters to the new CacheMode enum.
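
Review note: CacheContext derives its flags purely from URL prefixes. The sketch below restates those checks as standalone functions; classify_url and is_cacheable are hypothetical names, and the "unknown" label is an assumption for inputs that match none of the prefixes shown in the diff.

```python
def classify_url(url: str) -> str:
    # str.startswith accepts a tuple, so both schemes are one check.
    if url.startswith(("http://", "https://")):
        return "web"
    if url.startswith("file://"):
        return "local_file"
    if url.startswith("raw:"):
        return "raw_html"
    return "unknown"  # assumption: no such category exists in the source


def is_cacheable(url: str) -> bool:
    # Mirrors the is_cacheable expression: raw HTML input is not cacheable.
    return url.startswith(("http://", "https://", "file://"))


if __name__ == "__main__":
    for u in ("https://example.com", "file:///tmp/page.html", "raw:<p>hi</p>"):
        print(u, classify_url(u), is_cacheable(u))
```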

View File

@@ -3,7 +3,7 @@ import re
from collections import Counter from collections import Counter
import string import string
from .model_loader import load_nltk_punkt from .model_loader import load_nltk_punkt
from .utils import *
# Define the abstract base class for chunking strategies # Define the abstract base class for chunking strategies
class ChunkingStrategy(ABC): class ChunkingStrategy(ABC):
@@ -24,19 +24,23 @@ class ChunkingStrategy(ABC):
""" """
pass pass
# Create an identity chunking strategy f(x) = [x] # Create an identity chunking strategy f(x) = [x]
class IdentityChunking(ChunkingStrategy): class IdentityChunking(ChunkingStrategy):
""" """
Chunking strategy that returns the input text as a single chunk. Chunking strategy that returns the input text as a single chunk.
""" """
def chunk(self, text: str) -> list: def chunk(self, text: str) -> list:
return [text] return [text]
# Regex-based chunking # Regex-based chunking
class RegexChunking(ChunkingStrategy): class RegexChunking(ChunkingStrategy):
""" """
Chunking strategy that splits text based on regular expression patterns. Chunking strategy that splits text based on regular expression patterns.
""" """
def __init__(self, patterns=None, **kwargs): def __init__(self, patterns=None, **kwargs):
""" """
Initialize the RegexChunking object. Initialize the RegexChunking object.
@@ -45,7 +49,7 @@ class RegexChunking(ChunkingStrategy):
patterns (list): A list of regular expression patterns to split text. patterns (list): A list of regular expression patterns to split text.
""" """
if patterns is None: if patterns is None:
patterns = [r'\n\n'] # Default split pattern patterns = [r"\n\n"] # Default split pattern
self.patterns = patterns self.patterns = patterns
def chunk(self, text: str) -> list: def chunk(self, text: str) -> list:
@@ -57,18 +61,19 @@ class RegexChunking(ChunkingStrategy):
paragraphs = new_paragraphs paragraphs = new_paragraphs
return paragraphs return paragraphs
# NLP-based sentence chunking # NLP-based sentence chunking
class NlpSentenceChunking(ChunkingStrategy): class NlpSentenceChunking(ChunkingStrategy):
""" """
Chunking strategy that splits text into sentences using NLTK's sentence tokenizer. Chunking strategy that splits text into sentences using NLTK's sentence tokenizer.
""" """
def __init__(self, **kwargs): def __init__(self, **kwargs):
""" """
Initialize the NlpSentenceChunking object. Initialize the NlpSentenceChunking object.
""" """
load_nltk_punkt() load_nltk_punkt()
def chunk(self, text: str) -> list: def chunk(self, text: str) -> list:
# Improved regex for sentence splitting # Improved regex for sentence splitting
# sentence_endings = re.compile( # sentence_endings = re.compile(
@@ -77,11 +82,13 @@ class NlpSentenceChunking(ChunkingStrategy):
# sentences = sentence_endings.split(text) # sentences = sentence_endings.split(text)
# sens = [sent.strip() for sent in sentences if sent] # sens = [sent.strip() for sent in sentences if sent]
from nltk.tokenize import sent_tokenize from nltk.tokenize import sent_tokenize
sentences = sent_tokenize(text) sentences = sent_tokenize(text)
sens = [sent.strip() for sent in sentences] sens = [sent.strip() for sent in sentences]
return list(set(sens)) return list(set(sens))
# Topic-based segmentation using TextTiling # Topic-based segmentation using TextTiling
class TopicSegmentationChunking(ChunkingStrategy): class TopicSegmentationChunking(ChunkingStrategy):
""" """
@@ -100,6 +107,7 @@ class TopicSegmentationChunking(ChunkingStrategy):
num_keywords (int): The number of keywords to extract for each topic segment. num_keywords (int): The number of keywords to extract for each topic segment.
""" """
import nltk as nl import nltk as nl
self.tokenizer = nl.tokenize.TextTilingTokenizer() self.tokenizer = nl.tokenize.TextTilingTokenizer()
self.num_keywords = num_keywords self.num_keywords = num_keywords
@@ -111,8 +119,14 @@ class TopicSegmentationChunking(ChunkingStrategy):
def extract_keywords(self, text: str) -> list: def extract_keywords(self, text: str) -> list:
# Tokenize and remove stopwords and punctuation # Tokenize and remove stopwords and punctuation
import nltk as nl import nltk as nl
tokens = nl.tokenize.word_tokenize(text) tokens = nl.tokenize.word_tokenize(text)
tokens = [token.lower() for token in tokens if token not in nl.corpus.stopwords.words('english') and token not in string.punctuation] tokens = [
token.lower()
for token in tokens
if token not in nl.corpus.stopwords.words("english")
and token not in string.punctuation
]
# Calculate frequency distribution # Calculate frequency distribution
freq_dist = Counter(tokens) freq_dist = Counter(tokens)
@@ -123,9 +137,12 @@ class TopicSegmentationChunking(ChunkingStrategy):
# Segment the text into topics # Segment the text into topics
segments = self.chunk(text) segments = self.chunk(text)
# Extract keywords for each topic segment # Extract keywords for each topic segment
segments_with_topics = [(segment, self.extract_keywords(segment)) for segment in segments] segments_with_topics = [
(segment, self.extract_keywords(segment)) for segment in segments
]
return segments_with_topics return segments_with_topics
# Fixed-length word chunks # Fixed-length word chunks
class FixedLengthWordChunking(ChunkingStrategy): class FixedLengthWordChunking(ChunkingStrategy):
""" """
@@ -136,6 +153,7 @@ class FixedLengthWordChunking(ChunkingStrategy):
2. Create chunks of fixed length 2. Create chunks of fixed length
3. Return the list of chunks 3. Return the list of chunks
""" """
def __init__(self, chunk_size=100, **kwargs): def __init__(self, chunk_size=100, **kwargs):
""" """
Initialize the fixed-length word chunking strategy with the given chunk size. Initialize the fixed-length word chunking strategy with the given chunk size.
@@ -147,7 +165,11 @@ class FixedLengthWordChunking(ChunkingStrategy):
def chunk(self, text: str) -> list: def chunk(self, text: str) -> list:
words = text.split() words = text.split()
return [' '.join(words[i:i + self.chunk_size]) for i in range(0, len(words), self.chunk_size)] return [
" ".join(words[i : i + self.chunk_size])
for i in range(0, len(words), self.chunk_size)
]
# Sliding window chunking # Sliding window chunking
class SlidingWindowChunking(ChunkingStrategy): class SlidingWindowChunking(ChunkingStrategy):
@@ -159,6 +181,7 @@ class SlidingWindowChunking(ChunkingStrategy):
2. Create chunks of fixed length 2. Create chunks of fixed length
3. Return the list of chunks 3. Return the list of chunks
""" """
def __init__(self, window_size=100, step=50, **kwargs): def __init__(self, window_size=100, step=50, **kwargs):
""" """
Initialize the sliding window chunking strategy with the given window size and Initialize the sliding window chunking strategy with the given window size and
@@ -179,15 +202,16 @@ class SlidingWindowChunking(ChunkingStrategy):
return [text] return [text]
for i in range(0, len(words) - self.window_size + 1, self.step): for i in range(0, len(words) - self.window_size + 1, self.step):
chunk = ' '.join(words[i:i + self.window_size]) chunk = " ".join(words[i : i + self.window_size])
chunks.append(chunk) chunks.append(chunk)
# Handle the last chunk if it doesn't align perfectly # Handle the last chunk if it doesn't align perfectly
if i + self.window_size < len(words): if i + self.window_size < len(words):
chunks.append(' '.join(words[-self.window_size:])) chunks.append(" ".join(words[-self.window_size :]))
return chunks return chunks
class OverlappingWindowChunking(ChunkingStrategy): class OverlappingWindowChunking(ChunkingStrategy):
""" """
Chunking strategy that splits text into overlapping word chunks. Chunking strategy that splits text into overlapping word chunks.
@@ -198,6 +222,7 @@ class OverlappingWindowChunking(ChunkingStrategy):
3. Slide the window by the overlap size 3. Slide the window by the overlap size
4. Return the list of chunks 4. Return the list of chunks
""" """
def __init__(self, window_size=1000, overlap=100, **kwargs): def __init__(self, window_size=1000, overlap=100, **kwargs):
""" """
Initialize the overlapping window chunking strategy with the given window size and Initialize the overlapping window chunking strategy with the given window size and
@@ -220,7 +245,7 @@ class OverlappingWindowChunking(ChunkingStrategy):
start = 0 start = 0
while start < len(words): while start < len(words):
end = start + self.window_size end = start + self.window_size
chunk = ' '.join(words[start:end]) chunk = " ".join(words[start:end])
chunks.append(chunk) chunks.append(chunk)
if end >= len(words): if end >= len(words):
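
Review note: the SlidingWindowChunking hunk above steps a fixed-size window over the word list and then appends one final window when the last step does not reach the end of the text. The sketch below is a standalone restatement for illustration only; sliding_window_chunks is a hypothetical name, and the short-input guard (returning the whole text when it fits in one window) is an assumption based on the visible "return [text]" line.

```python
from typing import List


def sliding_window_chunks(text: str, window_size: int = 100, step: int = 50) -> List[str]:
    words = text.split()
    # Assumed guard: text that fits in a single window is returned as-is.
    if len(words) <= window_size:
        return [text]

    chunks = []
    last_start = 0
    for start in range(0, len(words) - window_size + 1, step):
        chunks.append(" ".join(words[start : start + window_size]))
        last_start = start

    # If the final window stopped short of the end, add one last
    # window aligned to the end so no trailing words are dropped.
    if last_start + window_size < len(words):
        chunks.append(" ".join(words[-window_size:]))
    return chunks


if __name__ == "__main__":
    print(sliding_window_chunks("a b c d e f g h", window_size=3, step=2))
```

The tail window overlaps the previous one by more than the usual amount, which is the trade-off for keeping every chunk at exactly window_size words.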

View File

@@ -8,14 +8,21 @@ from .async_logger import AsyncLogger
logger = AsyncLogger(verbose=True) logger = AsyncLogger(verbose=True)
docs_manager = DocsManager(logger) docs_manager = DocsManager(logger)
def print_table(headers: List[str], rows: List[List[str]], padding: int = 2): def print_table(headers: List[str], rows: List[List[str]], padding: int = 2):
"""Print formatted table with headers and rows""" """Print formatted table with headers and rows"""
widths = [max(len(str(cell)) for cell in col) for col in zip(headers, *rows)] widths = [max(len(str(cell)) for cell in col) for col in zip(headers, *rows)]
border = '+' + '+'.join('-' * (w + 2 * padding) for w in widths) + '+' border = "+" + "+".join("-" * (w + 2 * padding) for w in widths) + "+"
def format_row(row): def format_row(row):
return '|' + '|'.join(f"{' ' * padding}{str(cell):<{w}}{' ' * padding}" return (
for cell, w in zip(row, widths)) + '|' "|"
+ "|".join(
f"{' ' * padding}{str(cell):<{w}}{' ' * padding}"
for cell, w in zip(row, widths)
)
+ "|"
)
click.echo(border) click.echo(border)
click.echo(format_row(headers)) click.echo(format_row(headers))
@@ -24,19 +31,24 @@ def print_table(headers: List[str], rows: List[List[str]], padding: int = 2):
click.echo(format_row(row)) click.echo(format_row(row))
click.echo(border) click.echo(border)
@click.group() @click.group()
def cli(): def cli():
"""Crawl4AI Command Line Interface""" """Crawl4AI Command Line Interface"""
pass pass
@cli.group() @cli.group()
def docs(): def docs():
"""Documentation operations""" """Documentation operations"""
pass pass
@docs.command() @docs.command()
@click.argument('sections', nargs=-1) @click.argument("sections", nargs=-1)
@click.option('--mode', type=click.Choice(['extended', 'condensed']), default='extended') @click.option(
"--mode", type=click.Choice(["extended", "condensed"]), default="extended"
)
def combine(sections: tuple, mode: str): def combine(sections: tuple, mode: str):
"""Combine documentation sections""" """Combine documentation sections"""
try: try:
@@ -46,16 +58,17 @@ def combine(sections: tuple, mode: str):
logger.error(str(e), tag="ERROR") logger.error(str(e), tag="ERROR")
sys.exit(1) sys.exit(1)
@docs.command() @docs.command()
@click.argument('query') @click.argument("query")
@click.option('--top-k', '-k', default=5) @click.option("--top-k", "-k", default=5)
@click.option('--build-index', is_flag=True, help='Build index if missing') @click.option("--build-index", is_flag=True, help="Build index if missing")
def search(query: str, top_k: int, build_index: bool): def search(query: str, top_k: int, build_index: bool):
"""Search documentation""" """Search documentation"""
try: try:
result = docs_manager.search(query, top_k) result = docs_manager.search(query, top_k)
if result == "No search index available. Call build_search_index() first.": if result == "No search index available. Call build_search_index() first.":
if build_index or click.confirm('No search index found. Build it now?'): if build_index or click.confirm("No search index found. Build it now?"):
asyncio.run(docs_manager.llm_text.generate_index_files()) asyncio.run(docs_manager.llm_text.generate_index_files())
result = docs_manager.search(query, top_k) result = docs_manager.search(query, top_k)
click.echo(result) click.echo(result)
@@ -63,6 +76,7 @@ def search(query: str, top_k: int, build_index: bool):
click.echo(f"Error: {str(e)}", err=True) click.echo(f"Error: {str(e)}", err=True)
sys.exit(1) sys.exit(1)
@docs.command() @docs.command()
def update(): def update():
"""Update docs from GitHub""" """Update docs from GitHub"""
@@ -73,22 +87,25 @@ def update():
click.echo(f"Error: {str(e)}", err=True) click.echo(f"Error: {str(e)}", err=True)
sys.exit(1) sys.exit(1)
@docs.command() @docs.command()
@click.option('--force-facts', is_flag=True, help='Force regenerate fact files') @click.option("--force-facts", is_flag=True, help="Force regenerate fact files")
@click.option('--clear-cache', is_flag=True, help='Clear BM25 cache') @click.option("--clear-cache", is_flag=True, help="Clear BM25 cache")
def index(force_facts: bool, clear_cache: bool): def index(force_facts: bool, clear_cache: bool):
"""Build or rebuild search indexes""" """Build or rebuild search indexes"""
try: try:
asyncio.run(docs_manager.ensure_docs_exist()) asyncio.run(docs_manager.ensure_docs_exist())
asyncio.run(docs_manager.llm_text.generate_index_files( asyncio.run(
force_generate_facts=force_facts, docs_manager.llm_text.generate_index_files(
clear_bm25_cache=clear_cache force_generate_facts=force_facts, clear_bm25_cache=clear_cache
)) )
)
click.echo("Search indexes built successfully") click.echo("Search indexes built successfully")
except Exception as e: except Exception as e:
click.echo(f"Error: {str(e)}", err=True) click.echo(f"Error: {str(e)}", err=True)
sys.exit(1) sys.exit(1)
# Add docs list command # Add docs list command
@docs.command() @docs.command()
def list(): def list():
@@ -101,5 +118,6 @@ def list():
click.echo(f"Error: {str(e)}", err=True) click.echo(f"Error: {str(e)}", err=True)
sys.exit(1) sys.exit(1)
if __name__ == '__main__':
if __name__ == "__main__":
cli() cli()
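
Review note: print_table sizes each column by zipping the headers with the rows and taking the widest cell per column. The sketch below reproduces that layout logic without the click dependency so it can be checked in isolation; render_table is a hypothetical name, and returning the lines instead of echoing them is a change made only for this illustration.

```python
from typing import List


def render_table(headers: List[str], rows: List[List[str]], padding: int = 2) -> List[str]:
    # zip(headers, *rows) transposes the table into columns.
    widths = [max(len(str(cell)) for cell in col) for col in zip(headers, *rows)]
    border = "+" + "+".join("-" * (w + 2 * padding) for w in widths) + "+"
    pad = " " * padding

    def format_row(row: List[str]) -> str:
        # Left-align each cell to its column width, padded on both sides.
        cells = (f"{pad}{str(cell):<{w}}{pad}" for cell, w in zip(row, widths))
        return "|" + "|".join(cells) + "|"

    lines = [border, format_row(headers), border]
    lines.extend(format_row(row) for row in rows)
    lines.append(border)
    return lines


if __name__ == "__main__":
    print("\n".join(render_table(["Name", "N"], [["docs", "12"]], padding=1)))
```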

View File

@@ -8,7 +8,7 @@ DEFAULT_PROVIDER = "openai/gpt-4o-mini"
MODEL_REPO_BRANCH = "new-release-0.0.2" MODEL_REPO_BRANCH = "new-release-0.0.2"
# Provider-model dictionary, ONLY used when the extraction strategy is LLMExtractionStrategy # Provider-model dictionary, ONLY used when the extraction strategy is LLMExtractionStrategy
PROVIDER_MODELS = { PROVIDER_MODELS = {
"ollama/llama3": "no-token-needed", # Any model from Ollama no need for API token "ollama/llama3": "no-token-needed", # Any model from Ollama no need for API token
"groq/llama3-70b-8192": os.getenv("GROQ_API_KEY"), "groq/llama3-70b-8192": os.getenv("GROQ_API_KEY"),
"groq/llama3-8b-8192": os.getenv("GROQ_API_KEY"), "groq/llama3-8b-8192": os.getenv("GROQ_API_KEY"),
"openai/gpt-4o-mini": os.getenv("OPENAI_API_KEY"), "openai/gpt-4o-mini": os.getenv("OPENAI_API_KEY"),
@@ -22,7 +22,7 @@ PROVIDER_MODELS = {
} }
# Chunk token threshold # Chunk token threshold
CHUNK_TOKEN_THRESHOLD = 2 ** 11 # 2048 tokens CHUNK_TOKEN_THRESHOLD = 2**11 # 2048 tokens
OVERLAP_RATE = 0.1 OVERLAP_RATE = 0.1
WORD_TOKEN_RATE = 1.3 WORD_TOKEN_RATE = 1.3
@@ -30,19 +30,41 @@ WORD_TOKEN_RATE = 1.3
MIN_WORD_THRESHOLD = 1 MIN_WORD_THRESHOLD = 1
IMAGE_DESCRIPTION_MIN_WORD_THRESHOLD = 1 IMAGE_DESCRIPTION_MIN_WORD_THRESHOLD = 1
IMPORTANT_ATTRS = ['src', 'href', 'alt', 'title', 'width', 'height'] IMPORTANT_ATTRS = ["src", "href", "alt", "title", "width", "height"]
ONLY_TEXT_ELIGIBLE_TAGS = ['b', 'i', 'u', 'span', 'del', 'ins', 'sub', 'sup', 'strong', 'em', 'code', 'kbd', 'var', 's', 'q', 'abbr', 'cite', 'dfn', 'time', 'small', 'mark'] ONLY_TEXT_ELIGIBLE_TAGS = [
"b",
"i",
"u",
"span",
"del",
"ins",
"sub",
"sup",
"strong",
"em",
"code",
"kbd",
"var",
"s",
"q",
"abbr",
"cite",
"dfn",
"time",
"small",
"mark",
]
SOCIAL_MEDIA_DOMAINS = [ SOCIAL_MEDIA_DOMAINS = [
'facebook.com', "facebook.com",
'twitter.com', "twitter.com",
'x.com', "x.com",
'linkedin.com', "linkedin.com",
'instagram.com', "instagram.com",
'pinterest.com', "pinterest.com",
'tiktok.com', "tiktok.com",
'snapchat.com', "snapchat.com",
'reddit.com', "reddit.com",
] ]
# Threshold for the Image extraction - Range is 1 to 6 # Threshold for the Image extraction - Range is 1 to 6
# Images are scored based on point based system, to filter based on usefulness. Points are assigned # Images are scored based on point based system, to filter based on usefulness. Points are assigned
@@ -60,5 +82,6 @@ NEED_MIGRATION = True
URL_LOG_SHORTEN_LENGTH = 30 URL_LOG_SHORTEN_LENGTH = 30
SHOW_DEPRECATION_WARNINGS = True SHOW_DEPRECATION_WARNINGS = True
SCREENSHOT_HEIGHT_TRESHOLD = 10000 SCREENSHOT_HEIGHT_TRESHOLD = 10000
PAGE_TIMEOUT=60000 PAGE_TIMEOUT = 60000
DOWNLOAD_PAGE_TIMEOUT=60000 DOWNLOAD_PAGE_TIMEOUT = 60000
DEEP_CRAWL_BATCH_SIZE = 5

View File

@@ -1,44 +1,95 @@
import re import re
import time
from bs4 import BeautifulSoup, Tag from bs4 import BeautifulSoup, Tag
from typing import List, Tuple, Dict from typing import List, Tuple, Dict, Optional
from rank_bm25 import BM25Okapi from rank_bm25 import BM25Okapi
from time import perf_counter
from collections import deque from collections import deque
from bs4 import BeautifulSoup, NavigableString, Tag, Comment from bs4 import NavigableString, Comment
from .utils import clean_tokens from .utils import clean_tokens, perform_completion_with_backoff, escape_json_string, sanitize_html, get_home_folder, extract_xml_data
from abc import ABC, abstractmethod from abc import ABC, abstractmethod
import math import math
from snowballstemmer import stemmer from snowballstemmer import stemmer
from .config import DEFAULT_PROVIDER, OVERLAP_RATE, WORD_TOKEN_RATE
from .models import TokenUsage
from .prompts import PROMPT_FILTER_CONTENT
import os
import json
import hashlib
from pathlib import Path
from concurrent.futures import ThreadPoolExecutor, as_completed
from .async_logger import AsyncLogger, LogLevel
from colorama import Fore, Style, init
class RelevantContentFilter(ABC): class RelevantContentFilter(ABC):
"""Abstract base class for content filtering strategies""" """Abstract base class for content filtering strategies"""
def __init__(self, user_query: str = None): def __init__(self, user_query: str = None):
self.user_query = user_query self.user_query = user_query
self.included_tags = { self.included_tags = {
# Primary structure # Primary structure
'article', 'main', 'section', 'div', "article",
"main",
"section",
"div",
# List structures # List structures
'ul', 'ol', 'li', 'dl', 'dt', 'dd', "ul",
"ol",
"li",
"dl",
"dt",
"dd",
# Text content # Text content
'p', 'span', 'blockquote', 'pre', 'code', "p",
"span",
"blockquote",
"pre",
"code",
# Headers # Headers
'h1', 'h2', 'h3', 'h4', 'h5', 'h6', "h1",
"h2",
"h3",
"h4",
"h5",
"h6",
# Tables # Tables
'table', 'thead', 'tbody', 'tr', 'td', 'th', "table",
"thead",
"tbody",
"tr",
"td",
"th",
# Other semantic elements # Other semantic elements
'figure', 'figcaption', 'details', 'summary', "figure",
"figcaption",
"details",
"summary",
# Text formatting # Text formatting
'em', 'strong', 'b', 'i', 'mark', 'small', "em",
"strong",
"b",
"i",
"mark",
"small",
# Rich content # Rich content
'time', 'address', 'cite', 'q' "time",
"address",
"cite",
"q",
} }
self.excluded_tags = { self.excluded_tags = {
'nav', 'footer', 'header', 'aside', 'script', "nav",
'style', 'form', 'iframe', 'noscript' "footer",
"header",
"aside",
"script",
"style",
"form",
"iframe",
"noscript",
} }
self.header_tags = {'h1', 'h2', 'h3', 'h4', 'h5', 'h6'} self.header_tags = {"h1", "h2", "h3", "h4", "h5", "h6"}
self.negative_patterns = re.compile( self.negative_patterns = re.compile(
r'nav|footer|header|sidebar|ads|comment|promo|advert|social|share', r"nav|footer|header|sidebar|ads|comment|promo|advert|social|share", re.I
re.I
) )
self.min_word_count = 2 self.min_word_count = 2
@@ -62,28 +113,30 @@ class RelevantContentFilter(ABC):
except Exception: except Exception:
pass pass
if soup.find('h1'): if soup.find("h1"):
query_parts.append(soup.find('h1').get_text()) query_parts.append(soup.find("h1").get_text())
# Meta tags # Meta tags
temp = "" temp = ""
for meta_name in ['keywords', 'description']: for meta_name in ["keywords", "description"]:
meta = soup.find('meta', attrs={'name': meta_name}) meta = soup.find("meta", attrs={"name": meta_name})
if meta and meta.get('content'): if meta and meta.get("content"):
query_parts.append(meta['content']) query_parts.append(meta["content"])
temp += meta['content'] temp += meta["content"]
# If still empty, grab first significant paragraph # If still empty, grab first significant paragraph
if not temp: if not temp:
# Find the first <p> tag whose text contains more than 150 characters # Find the first <p> tag whose text contains more than 150 characters
for p in body.find_all('p'): for p in body.find_all("p"):
if len(p.get_text()) > 150: if len(p.get_text()) > 150:
query_parts.append(p.get_text()[:150]) query_parts.append(p.get_text()[:150])
break break
return ' '.join(filter(None, query_parts)) return " ".join(filter(None, query_parts))
def extract_text_chunks(self, body: Tag, min_word_threshold: int = None) -> List[Tuple[str, str]]: def extract_text_chunks(
self, body: Tag, min_word_threshold: int = None
) -> List[Tuple[str, str]]:
""" """
Extracts text chunks from a BeautifulSoup body element while preserving order. Extracts text chunks from a BeautifulSoup body element while preserving order.
Returns list of tuples (text, tag_name) for classification. Returns list of tuples (text, tag_name) for classification.
@@ -96,14 +149,42 @@ class RelevantContentFilter(ABC):
""" """
# Tags to ignore - inline elements that shouldn't break text flow # Tags to ignore - inline elements that shouldn't break text flow
INLINE_TAGS = { INLINE_TAGS = {
'a', 'abbr', 'acronym', 'b', 'bdo', 'big', 'br', 'button', 'cite', 'code', "a",
'dfn', 'em', 'i', 'img', 'input', 'kbd', 'label', 'map', 'object', 'q', "abbr",
'samp', 'script', 'select', 'small', 'span', 'strong', 'sub', 'sup', "acronym",
'textarea', 'time', 'tt', 'var' "b",
"bdo",
"big",
"br",
"button",
"cite",
"code",
"dfn",
"em",
"i",
"img",
"input",
"kbd",
"label",
"map",
"object",
"q",
"samp",
"script",
"select",
"small",
"span",
"strong",
"sub",
"sup",
"textarea",
"time",
"tt",
"var",
} }
# Tags that typically contain meaningful headers # Tags that typically contain meaningful headers
HEADER_TAGS = {'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'header'} HEADER_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6", "header"}
chunks = [] chunks = []
current_text = [] current_text = []
@@ -111,9 +192,8 @@ class RelevantContentFilter(ABC):
def should_break_chunk(tag: Tag) -> bool: def should_break_chunk(tag: Tag) -> bool:
"""Determine if a tag should cause a break in the current text chunk""" """Determine if a tag should cause a break in the current text chunk"""
return ( return tag.name not in INLINE_TAGS and not (
tag.name not in INLINE_TAGS tag.name == "p" and len(current_text) == 0
and not (tag.name == 'p' and len(current_text) == 0)
) )
# Use deque for efficient push/pop operations # Use deque for efficient push/pop operations
@@ -125,9 +205,11 @@ class RelevantContentFilter(ABC):
if visited: if visited:
# End of block element - flush accumulated text # End of block element - flush accumulated text
if current_text and should_break_chunk(element): if current_text and should_break_chunk(element):
text = ' '.join(''.join(current_text).split()) text = " ".join("".join(current_text).split())
if text: if text:
tag_type = 'header' if element.name in HEADER_TAGS else 'content' tag_type = (
"header" if element.name in HEADER_TAGS else "content"
)
chunks.append((chunk_index, text, tag_type, element)) chunks.append((chunk_index, text, tag_type, element))
chunk_index += 1 chunk_index += 1
current_text = [] current_text = []
@@ -153,18 +235,23 @@ class RelevantContentFilter(ABC):
# Handle any remaining text # Handle any remaining text
if current_text: if current_text:
text = ' '.join(''.join(current_text).split()) text = " ".join("".join(current_text).split())
if text: if text:
chunks.append((chunk_index, text, 'content', body)) chunks.append((chunk_index, text, "content", body))
if min_word_threshold: if min_word_threshold:
chunks = [chunk for chunk in chunks if len(chunk[1].split()) >= min_word_threshold] chunks = [
chunk for chunk in chunks if len(chunk[1].split()) >= min_word_threshold
]
return chunks return chunks
def _deprecated_extract_text_chunks(self, soup: BeautifulSoup) -> List[Tuple[int, str, Tag]]: def _deprecated_extract_text_chunks(
self, soup: BeautifulSoup
) -> List[Tuple[int, str, Tag]]:
"""Common method for extracting text chunks""" """Common method for extracting text chunks"""
_text_cache = {} _text_cache = {}
def fast_text(element: Tag) -> str: def fast_text(element: Tag) -> str:
elem_id = id(element) elem_id = id(element)
if elem_id in _text_cache: if elem_id in _text_cache:
@@ -175,7 +262,7 @@ class RelevantContentFilter(ABC):
text = content.strip() text = content.strip()
if text: if text:
texts.append(text) texts.append(text)
result = ' '.join(texts) result = " ".join(texts)
_text_cache[elem_id] = result _text_cache[elem_id] = result
return result return result
@@ -210,10 +297,9 @@ class RelevantContentFilter(ABC):
"""Common method for exclusion logic""" """Common method for exclusion logic"""
if tag.name in self.excluded_tags: if tag.name in self.excluded_tags:
return True return True
class_id = ' '.join(filter(None, [ class_id = " ".join(
' '.join(tag.get('class', [])), filter(None, [" ".join(tag.get("class", [])), tag.get("id", "")])
tag.get('id', '') )
]))
return bool(self.negative_patterns.search(class_id)) return bool(self.negative_patterns.search(class_id))
def clean_element(self, tag: Tag) -> str: def clean_element(self, tag: Tag) -> str:
@@ -221,8 +307,16 @@ class RelevantContentFilter(ABC):
if not tag or not isinstance(tag, Tag): if not tag or not isinstance(tag, Tag):
return "" return ""
unwanted_tags = {'script', 'style', 'aside', 'form', 'iframe', 'noscript'} unwanted_tags = {"script", "style", "aside", "form", "iframe", "noscript"}
unwanted_attrs = {'style', 'onclick', 'onmouseover', 'align', 'bgcolor', 'class', 'id'} unwanted_attrs = {
"style",
"onclick",
"onmouseover",
"align",
"bgcolor",
"class",
"id",
}
# Use string builder pattern for better performance # Use string builder pattern for better performance
builder = [] builder = []
@@ -237,25 +331,25 @@ class RelevantContentFilter(ABC):
return return
# Start tag # Start tag
builder.append(f'<{elem.name}') builder.append(f"<{elem.name}")
# Add cleaned attributes # Add cleaned attributes
attrs = {k: v for k, v in elem.attrs.items() if k not in unwanted_attrs} attrs = {k: v for k, v in elem.attrs.items() if k not in unwanted_attrs}
for key, value in attrs.items(): for key, value in attrs.items():
builder.append(f' {key}="{value}"') builder.append(f' {key}="{value}"')
builder.append('>') builder.append(">")
# Process children # Process children
for child in elem.children: for child in elem.children:
render_tag(child) render_tag(child)
# Close tag # Close tag
builder.append(f'</{elem.name}>') builder.append(f"</{elem.name}>")
try: try:
render_tag(tag) render_tag(tag)
return ''.join(builder) return "".join(builder)
except Exception: except Exception:
return str(tag) # Fallback to original if anything fails return str(tag) # Fallback to original if anything fails
@@ -280,7 +374,13 @@ class BM25ContentFilter(RelevantContentFilter):
Methods: Methods:
filter_content(self, html: str, min_word_threshold: int = None) filter_content(self, html: str, min_word_threshold: int = None)
""" """
def __init__(self, user_query: str = None, bm25_threshold: float = 1.0, language: str = 'english'):
def __init__(
self,
user_query: str = None,
bm25_threshold: float = 1.0,
language: str = "english",
):
""" """
Initializes the BM25ContentFilter class, if not provided, falls back to page metadata. Initializes the BM25ContentFilter class, if not provided, falls back to page metadata.
@@ -295,17 +395,17 @@ class BM25ContentFilter(RelevantContentFilter):
super().__init__(user_query=user_query) super().__init__(user_query=user_query)
self.bm25_threshold = bm25_threshold self.bm25_threshold = bm25_threshold
self.priority_tags = { self.priority_tags = {
'h1': 5.0, "h1": 5.0,
'h2': 4.0, "h2": 4.0,
'h3': 3.0, "h3": 3.0,
'title': 4.0, "title": 4.0,
'strong': 2.0, "strong": 2.0,
'b': 1.5, "b": 1.5,
'em': 1.5, "em": 1.5,
'blockquote': 2.0, "blockquote": 2.0,
'code': 2.0, "code": 2.0,
'pre': 1.5, "pre": 1.5,
'th': 1.5, # Table headers "th": 1.5, # Table headers
} }
self.stemmer = stemmer(language) self.stemmer = stemmer(language)
@@ -327,13 +427,13 @@ class BM25ContentFilter(RelevantContentFilter):
if not html or not isinstance(html, str): if not html or not isinstance(html, str):
return [] return []
soup = BeautifulSoup(html, 'lxml') soup = BeautifulSoup(html, "lxml")
# Check if body is present # Check if body is present
if not soup.body: if not soup.body:
# Wrap in body tag if missing # Wrap in body tag if missing
soup = BeautifulSoup(f'<body>{html}</body>', 'lxml') soup = BeautifulSoup(f"<body>{html}</body>", "lxml")
body = soup.find('body') body = soup.find("body")
query = self.extract_page_query(soup, body) query = self.extract_page_query(soup, body)
@@ -354,9 +454,13 @@ class BM25ContentFilter(RelevantContentFilter):
# for _, chunk, _, _ in candidates] # for _, chunk, _, _ in candidates]
# tokenized_query = [ps.stem(word) for word in query.lower().split()] # tokenized_query = [ps.stem(word) for word in query.lower().split()]
tokenized_corpus = [[self.stemmer.stemWord(word) for word in chunk.lower().split()] tokenized_corpus = [
for _, chunk, _, _ in candidates] [self.stemmer.stemWord(word) for word in chunk.lower().split()]
tokenized_query = [self.stemmer.stemWord(word) for word in query.lower().split()] for _, chunk, _, _ in candidates
]
tokenized_query = [
self.stemmer.stemWord(word) for word in query.lower().split()
]
# tokenized_corpus = [[self.stemmer.stemWord(word) for word in tokenize_text(chunk.lower())] # tokenized_corpus = [[self.stemmer.stemWord(word) for word in tokenize_text(chunk.lower())]
# for _, chunk, _, _ in candidates] # for _, chunk, _, _ in candidates]
@@ -378,7 +482,8 @@ class BM25ContentFilter(RelevantContentFilter):
# Filter candidates by threshold # Filter candidates by threshold
selected_candidates = [ selected_candidates = [
(index, chunk, tag) for adjusted_score, index, chunk, tag in adjusted_candidates (index, chunk, tag)
for adjusted_score, index, chunk, tag in adjusted_candidates
if adjusted_score >= self.bm25_threshold if adjusted_score >= self.bm25_threshold
] ]
@@ -411,8 +516,14 @@ class PruningContentFilter(RelevantContentFilter):
Methods: Methods:
filter_content(self, html: str, min_word_threshold: int = None): filter_content(self, html: str, min_word_threshold: int = None):
""" """
def __init__(self, user_query: str = None, min_word_threshold: int = None,
threshold_type: str = 'fixed', threshold: float = 0.48): def __init__(
self,
user_query: str = None,
min_word_threshold: int = None,
threshold_type: str = "fixed",
threshold: float = 0.48,
):
""" """
Initializes the PruningContentFilter class, if not provided, falls back to page metadata. Initializes the PruningContentFilter class, if not provided, falls back to page metadata.
@@ -432,49 +543,49 @@ class PruningContentFilter(RelevantContentFilter):
# Add tag importance for dynamic threshold # Add tag importance for dynamic threshold
self.tag_importance = { self.tag_importance = {
'article': 1.5, "article": 1.5,
'main': 1.4, "main": 1.4,
'section': 1.3, "section": 1.3,
'p': 1.2, "p": 1.2,
'h1': 1.4, "h1": 1.4,
'h2': 1.3, "h2": 1.3,
'h3': 1.2, "h3": 1.2,
'div': 0.7, "div": 0.7,
'span': 0.6 "span": 0.6,
} }
# Metric configuration # Metric configuration
self.metric_config = { self.metric_config = {
'text_density': True, "text_density": True,
'link_density': True, "link_density": True,
'tag_weight': True, "tag_weight": True,
'class_id_weight': True, "class_id_weight": True,
'text_length': True, "text_length": True,
} }
self.metric_weights = { self.metric_weights = {
'text_density': 0.4, "text_density": 0.4,
'link_density': 0.2, "link_density": 0.2,
'tag_weight': 0.2, "tag_weight": 0.2,
'class_id_weight': 0.1, "class_id_weight": 0.1,
'text_length': 0.1, "text_length": 0.1,
} }
self.tag_weights = { self.tag_weights = {
'div': 0.5, "div": 0.5,
'p': 1.0, "p": 1.0,
'article': 1.5, "article": 1.5,
'section': 1.0, "section": 1.0,
'span': 0.3, "span": 0.3,
'li': 0.5, "li": 0.5,
'ul': 0.5, "ul": 0.5,
'ol': 0.5, "ol": 0.5,
'h1': 1.2, "h1": 1.2,
'h2': 1.1, "h2": 1.1,
'h3': 1.0, "h3": 1.0,
'h4': 0.9, "h4": 0.9,
'h5': 0.8, "h5": 0.8,
'h6': 0.7, "h6": 0.7,
} }
def filter_content(self, html: str, min_word_threshold: int = None) -> List[str]: def filter_content(self, html: str, min_word_threshold: int = None) -> List[str]:
@@ -495,22 +606,22 @@ class PruningContentFilter(RelevantContentFilter):
if not html or not isinstance(html, str): if not html or not isinstance(html, str):
return [] return []
soup = BeautifulSoup(html, 'lxml') soup = BeautifulSoup(html, "lxml")
if not soup.body: if not soup.body:
soup = BeautifulSoup(f'<body>{html}</body>', 'lxml') soup = BeautifulSoup(f"<body>{html}</body>", "lxml")
# Remove comments and unwanted tags # Remove comments and unwanted tags
self._remove_comments(soup) self._remove_comments(soup)
self._remove_unwanted_tags(soup) self._remove_unwanted_tags(soup)
# Prune tree starting from body # Prune tree starting from body
body = soup.find('body') body = soup.find("body")
self._prune_tree(body) self._prune_tree(body)
# Extract remaining content as list of HTML strings # Extract remaining content as list of HTML strings
content_blocks = [] content_blocks = []
for element in body.children: for element in body.children:
if isinstance(element, str) or not hasattr(element, 'name'): if isinstance(element, str) or not hasattr(element, "name"):
continue continue
if len(element.get_text(strip=True)) > 0: if len(element.get_text(strip=True)) > 0:
content_blocks.append(str(element)) content_blocks.append(str(element))
@@ -535,24 +646,28 @@ class PruningContentFilter(RelevantContentFilter):
Args: Args:
node (Tag): The node from which the pruning starts. node (Tag): The node from which the pruning starts.
""" """
if not node or not hasattr(node, 'name') or node.name is None: if not node or not hasattr(node, "name") or node.name is None:
return return
text_len = len(node.get_text(strip=True)) text_len = len(node.get_text(strip=True))
tag_len = len(node.encode_contents().decode('utf-8')) tag_len = len(node.encode_contents().decode("utf-8"))
link_text_len = sum(len(s.strip()) for s in (a.string for a in node.find_all('a', recursive=False)) if s) link_text_len = sum(
len(s.strip())
for s in (a.string for a in node.find_all("a", recursive=False))
if s
)
metrics = { metrics = {
'node': node, "node": node,
'tag_name': node.name, "tag_name": node.name,
'text_len': text_len, "text_len": text_len,
'tag_len': tag_len, "tag_len": tag_len,
'link_text_len': link_text_len "link_text_len": link_text_len,
} }
score = self._compute_composite_score(metrics, text_len, tag_len, link_text_len) score = self._compute_composite_score(metrics, text_len, tag_len, link_text_len)
if self.threshold_type == 'fixed': if self.threshold_type == "fixed":
should_remove = score < self.threshold should_remove = score < self.threshold
else: # dynamic else: # dynamic
tag_importance = self.tag_importance.get(node.name, 0.7) tag_importance = self.tag_importance.get(node.name, 0.7)
@@ -572,7 +687,7 @@ class PruningContentFilter(RelevantContentFilter):
if should_remove: if should_remove:
node.decompose() node.decompose()
else: else:
children = [child for child in node.children if hasattr(child, 'name')] children = [child for child in node.children if hasattr(child, "name")]
for child in children: for child in children:
self._prune_tree(child) self._prune_tree(child)
@@ -580,48 +695,305 @@ class PruningContentFilter(RelevantContentFilter):
"""Computes the composite score""" """Computes the composite score"""
if self.min_word_threshold: if self.min_word_threshold:
# Get raw text from metrics node - avoid extra processing # Get raw text from metrics node - avoid extra processing
text = metrics['node'].get_text(strip=True) text = metrics["node"].get_text(strip=True)
word_count = text.count(' ') + 1 word_count = text.count(" ") + 1
if word_count < self.min_word_threshold: if word_count < self.min_word_threshold:
return -1.0 # Guaranteed removal return -1.0 # Guaranteed removal
score = 0.0 score = 0.0
total_weight = 0.0 total_weight = 0.0
if self.metric_config['text_density']: if self.metric_config["text_density"]:
density = text_len / tag_len if tag_len > 0 else 0 density = text_len / tag_len if tag_len > 0 else 0
score += self.metric_weights['text_density'] * density score += self.metric_weights["text_density"] * density
total_weight += self.metric_weights['text_density'] total_weight += self.metric_weights["text_density"]
if self.metric_config['link_density']: if self.metric_config["link_density"]:
density = 1 - (link_text_len / text_len if text_len > 0 else 0) density = 1 - (link_text_len / text_len if text_len > 0 else 0)
score += self.metric_weights['link_density'] * density score += self.metric_weights["link_density"] * density
total_weight += self.metric_weights['link_density'] total_weight += self.metric_weights["link_density"]
if self.metric_config['tag_weight']: if self.metric_config["tag_weight"]:
tag_score = self.tag_weights.get(metrics['tag_name'], 0.5) tag_score = self.tag_weights.get(metrics["tag_name"], 0.5)
score += self.metric_weights['tag_weight'] * tag_score score += self.metric_weights["tag_weight"] * tag_score
total_weight += self.metric_weights['tag_weight'] total_weight += self.metric_weights["tag_weight"]
if self.metric_config['class_id_weight']: if self.metric_config["class_id_weight"]:
class_score = self._compute_class_id_weight(metrics['node']) class_score = self._compute_class_id_weight(metrics["node"])
score += self.metric_weights['class_id_weight'] * max(0, class_score) score += self.metric_weights["class_id_weight"] * max(0, class_score)
total_weight += self.metric_weights['class_id_weight'] total_weight += self.metric_weights["class_id_weight"]
if self.metric_config['text_length']: if self.metric_config["text_length"]:
score += self.metric_weights['text_length'] * math.log(text_len + 1) score += self.metric_weights["text_length"] * math.log(text_len + 1)
total_weight += self.metric_weights['text_length'] total_weight += self.metric_weights["text_length"]
return score / total_weight if total_weight > 0 else 0 return score / total_weight if total_weight > 0 else 0
def _compute_class_id_weight(self, node): def _compute_class_id_weight(self, node):
"""Computes the class ID weight""" """Computes the class ID weight"""
class_id_score = 0 class_id_score = 0
if 'class' in node.attrs: if "class" in node.attrs:
classes = ' '.join(node['class']) classes = " ".join(node["class"])
if self.negative_patterns.match(classes): if self.negative_patterns.match(classes):
class_id_score -= 0.5 class_id_score -= 0.5
if 'id' in node.attrs: if "id" in node.attrs:
element_id = node['id'] element_id = node["id"]
if self.negative_patterns.match(element_id): if self.negative_patterns.match(element_id):
class_id_score -= 0.5 class_id_score -= 0.5
return class_id_score return class_id_score
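The weighted scoring in `_compute_composite_score` above can be reproduced outside the class for quick experiments. This is a simplified standalone sketch, not the library's code: `composite_score` and its parameters are hypothetical names, the weights are copied from `metric_weights` in this diff, and all five metrics are assumed to be enabled.

```python
import math

# Weights copied from PruningContentFilter.metric_weights in the diff above.
METRIC_WEIGHTS = {
    "text_density": 0.4,
    "link_density": 0.2,
    "tag_weight": 0.2,
    "class_id_weight": 0.1,
    "text_length": 0.1,
}


def composite_score(text_len, tag_len, link_text_len, tag_score=0.5, class_score=0.0):
    """Hypothetical helper: weighted average of the five pruning metrics."""
    parts = {
        # Share of the node's markup that is visible text.
        "text_density": text_len / tag_len if tag_len > 0 else 0.0,
        # 1.0 when no text sits inside links, 0.0 when all of it does.
        "link_density": 1 - (link_text_len / text_len if text_len > 0 else 0.0),
        "tag_weight": tag_score,
        # Negative class/id penalties are clamped to zero, as in the diff.
        "class_id_weight": max(0.0, class_score),
        "text_length": math.log(text_len + 1),
    }
    total_weight = sum(METRIC_WEIGHTS.values())
    score = sum(METRIC_WEIGHTS[name] * value for name, value in parts.items())
    return score / total_weight if total_weight > 0 else 0.0
```

With `threshold_type == "fixed"`, a node would be pruned when this value falls below `threshold` (0.48 by default in the constructor shown earlier).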
class LLMContentFilter(RelevantContentFilter):
"""Content filtering using LLMs to generate relevant markdown."""
def __init__(
self,
provider: str = DEFAULT_PROVIDER,
api_token: Optional[str] = None,
instruction: str = None,
chunk_token_threshold: int = int(1e9),
overlap_rate: float = OVERLAP_RATE,
word_token_rate: float = WORD_TOKEN_RATE,
base_url: Optional[str] = None,
api_base: Optional[str] = None,
extra_args: Dict = None,
verbose: bool = False,
logger: Optional[AsyncLogger] = None,
):
super().__init__(None)
self.provider = provider
self.api_token = (
api_token
or PROVIDER_MODELS.get(provider, "no-token")
or os.getenv("OPENAI_API_KEY")
)
self.instruction = instruction
self.chunk_token_threshold = chunk_token_threshold
self.overlap_rate = overlap_rate
self.word_token_rate = word_token_rate
self.base_url = base_url
self.api_base = api_base or base_url
self.extra_args = extra_args or {}
self.verbose = verbose
# Setup logger with custom styling for LLM operations
if logger:
self.logger = logger
elif verbose:
self.logger = AsyncLogger(
verbose=True,
icons={
**AsyncLogger.DEFAULT_ICONS,
"LLM": "★", # Star for LLM operations
"CHUNK": "◈", # Diamond for chunks
"CACHE": "⚡", # Lightning for cache operations
},
colors={
**AsyncLogger.DEFAULT_COLORS,
LogLevel.INFO: Fore.MAGENTA + Style.DIM, # Dimmed purple for LLM ops
}
)
else:
self.logger = None
self.usages = []
self.total_usage = TokenUsage()
def _get_cache_key(self, html: str, instruction: str) -> str:
"""Generate a unique cache key based on HTML and instruction"""
content = f"{html}{instruction}"
return hashlib.md5(content.encode()).hexdigest()
def _merge_chunks(self, text: str) -> List[str]:
"""Split text into chunks with overlap"""
# Calculate tokens and sections
total_tokens = len(text.split()) * self.word_token_rate
num_sections = max(1, math.floor(total_tokens / self.chunk_token_threshold))
adjusted_chunk_threshold = total_tokens / num_sections
# Split into words
words = text.split()
chunks = []
current_chunk = []
current_token_count = 0
for word in words:
word_tokens = self.word_token_rate  # one word counts as word_token_rate tokens, matching total_tokens above
if current_token_count + word_tokens <= adjusted_chunk_threshold:
current_chunk.append(word)
current_token_count += word_tokens
else:
# Add overlap unless this is the first chunk
if chunks and self.overlap_rate > 0:
overlap_size = int(len(current_chunk) * self.overlap_rate)
if overlap_size > 0: current_chunk.extend(current_chunk[-overlap_size:])  # a [-0:] slice would duplicate the whole chunk
chunks.append(" ".join(current_chunk))
current_chunk = [word]
current_token_count = word_tokens
if current_chunk:
chunks.append(" ".join(current_chunk))
return chunks
def filter_content(self, html: str, ignore_cache: bool = False) -> List[str]:
if not html or not isinstance(html, str):
return []
if self.logger:
self.logger.info(
"Starting LLM content filtering process",
tag="LLM",
params={"provider": self.provider},
colors={"provider": Fore.CYAN}
)
# Cache handling
cache_dir = Path(get_home_folder()) / "llm_cache" / "content_filter"
cache_dir.mkdir(parents=True, exist_ok=True)
cache_key = self._get_cache_key(html, self.instruction or "")
cache_file = cache_dir / f"{cache_key}.json"
if not ignore_cache and cache_file.exists():
if self.logger:
self.logger.info("Found cached result", tag="CACHE")
try:
with cache_file.open('r') as f:
cached_data = json.load(f)
usage = TokenUsage(**cached_data['usage'])
self.usages.append(usage)
self.total_usage.completion_tokens += usage.completion_tokens
self.total_usage.prompt_tokens += usage.prompt_tokens
self.total_usage.total_tokens += usage.total_tokens
return cached_data['blocks']
except Exception as e:
if self.logger:
self.logger.error(f"Cache read error: {str(e)}", tag="CACHE")
# Split into chunks
html_chunks = self._merge_chunks(html)
if self.logger:
self.logger.info(
"Split content into {chunk_count} chunks",
tag="CHUNK",
params={"chunk_count": len(html_chunks)},
colors={"chunk_count": Fore.YELLOW}
)
extracted_content = []
start_time = time.time()
# Process chunks in parallel
with ThreadPoolExecutor(max_workers=4) as executor:
futures = []
for i, chunk in enumerate(html_chunks):
if self.logger:
self.logger.debug(
"Processing chunk {chunk_num}/{total_chunks}",
tag="CHUNK",
params={
"chunk_num": i + 1,
"total_chunks": len(html_chunks)
}
)
prompt_variables = {
"HTML": escape_json_string(sanitize_html(chunk)),
"REQUEST": self.instruction or "Convert this HTML into clean, relevant markdown, removing any noise or irrelevant content."
}
prompt = PROMPT_FILTER_CONTENT
for var, value in prompt_variables.items():
prompt = prompt.replace("{" + var + "}", value)
future = executor.submit(
perform_completion_with_backoff,
self.provider,
prompt,
self.api_token,
base_url=self.api_base,
extra_args=self.extra_args
)
futures.append((i, future))
# Collect results in order
ordered_results = []
for i, future in sorted(futures):
try:
response = future.result()
# Track usage
usage = TokenUsage(
completion_tokens=response.usage.completion_tokens,
prompt_tokens=response.usage.prompt_tokens,
total_tokens=response.usage.total_tokens,
completion_tokens_details=response.usage.completion_tokens_details.__dict__
if response.usage.completion_tokens_details else {},
prompt_tokens_details=response.usage.prompt_tokens_details.__dict__
if response.usage.prompt_tokens_details else {},
)
self.usages.append(usage)
self.total_usage.completion_tokens += usage.completion_tokens
self.total_usage.prompt_tokens += usage.prompt_tokens
self.total_usage.total_tokens += usage.total_tokens
blocks = extract_xml_data(["content"], response.choices[0].message.content)["content"]
if blocks:
ordered_results.append(blocks)
if self.logger:
self.logger.success(
"Successfully processed chunk {chunk_num}",
tag="CHUNK",
params={"chunk_num": i + 1}
)
except Exception as e:
if self.logger:
self.logger.error(
"Error processing chunk {chunk_num}: {error}",
tag="CHUNK",
params={
"chunk_num": i + 1,
"error": str(e)
}
)
end_time = time.time()
if self.logger:
self.logger.success(
"Completed processing in {time:.2f}s",
tag="LLM",
params={"time": end_time - start_time},
colors={"time": Fore.YELLOW}
)
result = ordered_results if ordered_results else []
# Cache the final result
cache_data = {
'blocks': result,
'usage': self.total_usage.__dict__
}
with cache_file.open('w') as f:
json.dump(cache_data, f)
if self.logger:
self.logger.info("Cached results for future use", tag="CACHE")
return result
def show_usage(self) -> None:
"""Print usage statistics"""
print("\n=== Token Usage Summary ===")
print(f"{'Type':<15} {'Count':>12}")
print("-" * 30)
print(f"{'Completion':<15} {self.total_usage.completion_tokens:>12,}")
print(f"{'Prompt':<15} {self.total_usage.prompt_tokens:>12,}")
print(f"{'Total':<15} {self.total_usage.total_tokens:>12,}")
if self.usages:
print("\n=== Usage History ===")
print(f"{'Request #':<10} {'Completion':>12} {'Prompt':>12} {'Total':>12}")
print("-" * 48)
for i, usage in enumerate(self.usages, 1):
print(
f"{i:<10} {usage.completion_tokens:>12,} "
f"{usage.prompt_tokens:>12,} {usage.total_tokens:>12,}"
)
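The `_merge_chunks` docstring above describes splitting text into chunks with overlap. A minimal standalone sketch of that idea follows, assuming "overlap" means repeating the tail of one chunk at the head of the next. The name `split_with_overlap` and its parameters are hypothetical, and the sketch counts plain words rather than estimated tokens, so it is not a drop-in replacement for the method in this diff.

```python
from typing import List


def split_with_overlap(text: str, chunk_words: int, overlap_rate: float = 0.1) -> List[str]:
    """Hypothetical helper: split text into word chunks that share an overlap."""
    if chunk_words <= 0:
        raise ValueError("chunk_words must be positive")
    words = text.split()
    if not words:
        return []
    # Words repeated from the end of one chunk at the start of the next,
    # clamped so that each step still moves forward by at least one word.
    overlap = max(0, min(int(chunk_words * overlap_rate), chunk_words - 1))
    step = chunk_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_words]))
        if start + chunk_words >= len(words):
            break  # this chunk already reaches the end of the text
    return chunks
```

Clamping the overlap below `chunk_words` keeps the loop advancing even for an `overlap_rate` of 1.0 or more.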

File diff suppressed because it is too large

@@ -15,32 +15,30 @@ import logging, time
import base64 import base64
from PIL import Image, ImageDraw, ImageFont from PIL import Image, ImageDraw, ImageFont
from io import BytesIO from io import BytesIO
from typing import List, Callable from typing import Callable
import requests import requests
import os import os
from pathlib import Path from pathlib import Path
from .utils import * from .utils import *
logger = logging.getLogger('selenium.webdriver.remote.remote_connection') logger = logging.getLogger("selenium.webdriver.remote.remote_connection")
logger.setLevel(logging.WARNING) logger.setLevel(logging.WARNING)
logger_driver = logging.getLogger('selenium.webdriver.common.service') logger_driver = logging.getLogger("selenium.webdriver.common.service")
logger_driver.setLevel(logging.WARNING) logger_driver.setLevel(logging.WARNING)
urllib3_logger = logging.getLogger('urllib3.connectionpool') urllib3_logger = logging.getLogger("urllib3.connectionpool")
urllib3_logger.setLevel(logging.WARNING) urllib3_logger.setLevel(logging.WARNING)
# Disable http.client logging # Disable http.client logging
http_client_logger = logging.getLogger('http.client') http_client_logger = logging.getLogger("http.client")
http_client_logger.setLevel(logging.WARNING) http_client_logger.setLevel(logging.WARNING)
# Disable driver_finder and service logging # Disable driver_finder and service logging
driver_finder_logger = logging.getLogger('selenium.webdriver.common.driver_finder') driver_finder_logger = logging.getLogger("selenium.webdriver.common.driver_finder")
driver_finder_logger.setLevel(logging.WARNING) driver_finder_logger.setLevel(logging.WARNING)
class CrawlerStrategy(ABC): class CrawlerStrategy(ABC):
@abstractmethod @abstractmethod
def crawl(self, url: str, **kwargs) -> str: def crawl(self, url: str, **kwargs) -> str:
@@ -58,8 +56,9 @@ class CrawlerStrategy(ABC):
def set_hook(self, hook_type: str, hook: Callable): def set_hook(self, hook_type: str, hook: Callable):
pass pass
class CloudCrawlerStrategy(CrawlerStrategy): class CloudCrawlerStrategy(CrawlerStrategy):
def __init__(self, use_cached_html = False): def __init__(self, use_cached_html=False):
super().__init__() super().__init__()
self.use_cached_html = use_cached_html self.use_cached_html = use_cached_html
@@ -76,6 +75,7 @@ class CloudCrawlerStrategy(CrawlerStrategy):
html = response["results"][0]["html"] html = response["results"][0]["html"]
return sanitize_input_encode(html) return sanitize_input_encode(html)
class LocalSeleniumCrawlerStrategy(CrawlerStrategy): class LocalSeleniumCrawlerStrategy(CrawlerStrategy):
def __init__(self, use_cached_html=False, js_code=None, **kwargs): def __init__(self, use_cached_html=False, js_code=None, **kwargs):
super().__init__() super().__init__()
@@ -87,9 +87,14 @@ class LocalSeleniumCrawlerStrategy(CrawlerStrategy):
if kwargs.get("user_agent"): if kwargs.get("user_agent"):
self.options.add_argument("--user-agent=" + kwargs.get("user_agent")) self.options.add_argument("--user-agent=" + kwargs.get("user_agent"))
else: else:
user_agent = kwargs.get("user_agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36") user_agent = kwargs.get(
"user_agent",
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
)
self.options.add_argument(f"--user-agent={user_agent}") self.options.add_argument(f"--user-agent={user_agent}")
self.options.add_argument("user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36") self.options.add_argument(
"user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
)
self.options.headless = kwargs.get("headless", True) self.options.headless = kwargs.get("headless", True)
if self.options.headless: if self.options.headless:
@@ -123,11 +128,11 @@ class LocalSeleniumCrawlerStrategy(CrawlerStrategy):
# Hooks # Hooks
self.hooks = { self.hooks = {
'on_driver_created': None, "on_driver_created": None,
'on_user_agent_updated': None, "on_user_agent_updated": None,
'before_get_url': None, "before_get_url": None,
'after_get_url': None, "after_get_url": None,
'before_return_html': None "before_return_html": None,
} }
# chromedriver_autoinstaller.install() # chromedriver_autoinstaller.install()
@@ -138,7 +143,6 @@ class LocalSeleniumCrawlerStrategy(CrawlerStrategy):
# chromedriver_path = chromedriver_autoinstaller.utils.download_chromedriver() # chromedriver_path = chromedriver_autoinstaller.utils.download_chromedriver()
# self.service = Service(chromedriver_autoinstaller.install()) # self.service = Service(chromedriver_autoinstaller.install())
# chromedriver_path = ChromeDriverManager().install() # chromedriver_path = ChromeDriverManager().install()
# self.service = Service(chromedriver_path) # self.service = Service(chromedriver_path)
# self.service.log_path = "NUL" # self.service.log_path = "NUL"
@@ -148,14 +152,12 @@ class LocalSeleniumCrawlerStrategy(CrawlerStrategy):
self.service = Service() self.service = Service()
self.driver = webdriver.Chrome(options=self.options) self.driver = webdriver.Chrome(options=self.options)
self.driver = self.execute_hook('on_driver_created', self.driver) self.driver = self.execute_hook("on_driver_created", self.driver)
if kwargs.get("cookies"): if kwargs.get("cookies"):
for cookie in kwargs.get("cookies"): for cookie in kwargs.get("cookies"):
self.driver.add_cookie(cookie) self.driver.add_cookie(cookie)
def set_hook(self, hook_type: str, hook: Callable): def set_hook(self, hook_type: str, hook: Callable):
if hook_type in self.hooks: if hook_type in self.hooks:
self.hooks[hook_type] = hook self.hooks[hook_type] = hook
@@ -170,7 +172,9 @@ class LocalSeleniumCrawlerStrategy(CrawlerStrategy):
if isinstance(result, webdriver.Chrome): if isinstance(result, webdriver.Chrome):
return result return result
else: else:
raise TypeError(f"Hook {hook_type} must return an instance of webdriver.Chrome or None.") raise TypeError(
f"Hook {hook_type} must return an instance of webdriver.Chrome or None."
)
# If the hook returns None or there is no hook, return self.driver # If the hook returns None or there is no hook, return self.driver
return self.driver return self.driver
@@ -178,15 +182,15 @@ class LocalSeleniumCrawlerStrategy(CrawlerStrategy):
self.options.add_argument(f"user-agent={user_agent}") self.options.add_argument(f"user-agent={user_agent}")
self.driver.quit() self.driver.quit()
self.driver = webdriver.Chrome(service=self.service, options=self.options) self.driver = webdriver.Chrome(service=self.service, options=self.options)
self.driver = self.execute_hook('on_user_agent_updated', self.driver) self.driver = self.execute_hook("on_user_agent_updated", self.driver)
def set_custom_headers(self, headers: dict): def set_custom_headers(self, headers: dict):
# Enable Network domain for sending headers # Enable Network domain for sending headers
self.driver.execute_cdp_cmd('Network.enable', {}) self.driver.execute_cdp_cmd("Network.enable", {})
# Set extra HTTP headers # Set extra HTTP headers
self.driver.execute_cdp_cmd('Network.setExtraHTTPHeaders', {'headers': headers}) self.driver.execute_cdp_cmd("Network.setExtraHTTPHeaders", {"headers": headers})
def _ensure_page_load(self, max_checks=6, check_interval=0.01): def _ensure_page_load(self, max_checks=6, check_interval=0.01):
initial_length = len(self.driver.page_source) initial_length = len(self.driver.page_source)
for ix in range(max_checks): for ix in range(max_checks):
@@ -202,36 +206,53 @@ class LocalSeleniumCrawlerStrategy(CrawlerStrategy):
def crawl(self, url: str, **kwargs) -> str: def crawl(self, url: str, **kwargs) -> str:
# Create md5 hash of the URL # Create md5 hash of the URL
import hashlib import hashlib
url_hash = hashlib.md5(url.encode()).hexdigest() url_hash = hashlib.md5(url.encode()).hexdigest()
if self.use_cached_html: if self.use_cached_html:
cache_file_path = os.path.join(os.getenv("CRAWL4_AI_BASE_DIRECTORY", Path.home()), ".crawl4ai", "cache", url_hash) cache_file_path = os.path.join(
os.getenv("CRAWL4_AI_BASE_DIRECTORY", Path.home()),
".crawl4ai",
"cache",
url_hash,
)
if os.path.exists(cache_file_path): if os.path.exists(cache_file_path):
with open(cache_file_path, "r") as f: with open(cache_file_path, "r") as f:
return sanitize_input_encode(f.read()) return sanitize_input_encode(f.read())
try: try:
self.driver = self.execute_hook('before_get_url', self.driver) self.driver = self.execute_hook("before_get_url", self.driver)
if self.verbose: if self.verbose:
print(f"[LOG] 🕸️ Crawling {url} using LocalSeleniumCrawlerStrategy...") print(f"[LOG] 🕸️ Crawling {url} using LocalSeleniumCrawlerStrategy...")
self.driver.get(url) #<html><head></head><body></body></html> self.driver.get(url) # <html><head></head><body></body></html>
WebDriverWait(self.driver, 20).until( WebDriverWait(self.driver, 20).until(
lambda d: d.execute_script('return document.readyState') == 'complete' lambda d: d.execute_script("return document.readyState") == "complete"
) )
WebDriverWait(self.driver, 10).until( WebDriverWait(self.driver, 10).until(
EC.presence_of_all_elements_located((By.TAG_NAME, "body")) EC.presence_of_all_elements_located((By.TAG_NAME, "body"))
) )
self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight);") self.driver.execute_script(
"window.scrollTo(0, document.body.scrollHeight);"
)
self.driver = self.execute_hook('after_get_url', self.driver) self.driver = self.execute_hook("after_get_url", self.driver)
html = sanitize_input_encode(self._ensure_page_load()) # self.driver.page_source html = sanitize_input_encode(
can_not_be_done_headless = False # Look at my creativity for naming variables self._ensure_page_load()
) # self.driver.page_source
can_not_be_done_headless = (
False # Look at my creativity for naming variables
)
# TODO: Very ugly approach, but promise to change it! # TODO: Very ugly approach, but promise to change it!
if kwargs.get('bypass_headless', False) or html == "<html><head></head><body></body></html>": if (
print("[LOG] 🙌 Page could not be loaded in headless mode. Trying non-headless mode...") kwargs.get("bypass_headless", False)
or html == "<html><head></head><body></body></html>"
):
print(
"[LOG] 🙌 Page could not be loaded in headless mode. Trying non-headless mode..."
)
can_not_be_done_headless = True can_not_be_done_headless = True
options = Options() options = Options()
options.headless = False options.headless = False
@@ -239,7 +260,7 @@ class LocalSeleniumCrawlerStrategy(CrawlerStrategy):
options.add_argument("--window-size=5,5") options.add_argument("--window-size=5,5")
driver = webdriver.Chrome(service=self.service, options=options) driver = webdriver.Chrome(service=self.service, options=options)
driver.get(url) driver.get(url)
self.driver = self.execute_hook('after_get_url', driver) self.driver = self.execute_hook("after_get_url", driver)
html = sanitize_input_encode(driver.page_source) html = sanitize_input_encode(driver.page_source)
driver.quit() driver.quit()
@@ -249,17 +270,21 @@ class LocalSeleniumCrawlerStrategy(CrawlerStrategy):
 self.driver.execute_script(self.js_code)
 # Optionally, wait for some condition after executing the JS code
 WebDriverWait(self.driver, 10).until(
-lambda driver: driver.execute_script("return document.readyState") == "complete"
+lambda driver: driver.execute_script("return document.readyState")
+== "complete"
 )
 elif self.js_code and type(self.js_code) == list:
 for js in self.js_code:
 self.driver.execute_script(js)
 WebDriverWait(self.driver, 10).until(
-lambda driver: driver.execute_script("return document.readyState") == "complete"
+lambda driver: driver.execute_script(
+"return document.readyState"
+)
+== "complete"
 )
 # Optionally, wait for some condition after executing the JS code : Contributed by (https://github.com/jonymusky)
-wait_for = kwargs.get('wait_for', False)
+wait_for = kwargs.get("wait_for", False)
 if wait_for:
 if callable(wait_for):
 print("[LOG] 🔄 Waiting for condition...")
@@ -272,10 +297,15 @@ class LocalSeleniumCrawlerStrategy(CrawlerStrategy):
 if not can_not_be_done_headless:
 html = sanitize_input_encode(self.driver.page_source)
-self.driver = self.execute_hook('before_return_html', self.driver, html)
+self.driver = self.execute_hook("before_return_html", self.driver, html)
 # Store in cache
-cache_file_path = os.path.join(os.getenv("CRAWL4_AI_BASE_DIRECTORY", Path.home()), ".crawl4ai", "cache", url_hash)
+cache_file_path = os.path.join(
+os.getenv("CRAWL4_AI_BASE_DIRECTORY", Path.home()),
+".crawl4ai",
+"cache",
+url_hash,
+)
 with open(cache_file_path, "w", encoding="utf-8") as f:
 f.write(html)
@@ -284,16 +314,16 @@ class LocalSeleniumCrawlerStrategy(CrawlerStrategy):
 return html
 except InvalidArgumentException as e:
-if not hasattr(e, 'msg'):
+if not hasattr(e, "msg"):
 e.msg = sanitize_input_encode(str(e))
 raise InvalidArgumentException(f"Failed to crawl {url}: {e.msg}")
 except WebDriverException as e:
 # If e does not have a msg attribute, create it and set it to str(e)
-if not hasattr(e, 'msg'):
+if not hasattr(e, "msg"):
 e.msg = sanitize_input_encode(str(e))
 raise WebDriverException(f"Failed to crawl {url}: {e.msg}")
 except Exception as e:
-if not hasattr(e, 'msg'):
+if not hasattr(e, "msg"):
 e.msg = sanitize_input_encode(str(e))
 raise Exception(f"Failed to crawl {url}: {e.msg}")
@@ -301,7 +331,9 @@ class LocalSeleniumCrawlerStrategy(CrawlerStrategy):
 try:
 # Get the dimensions of the page
 total_width = self.driver.execute_script("return document.body.scrollWidth")
-total_height = self.driver.execute_script("return document.body.scrollHeight")
+total_height = self.driver.execute_script(
+"return document.body.scrollHeight"
+)
 # Set the window size to the dimensions of the page
 self.driver.set_window_size(total_width, total_height)
@@ -313,23 +345,25 @@ class LocalSeleniumCrawlerStrategy(CrawlerStrategy):
 image = Image.open(BytesIO(screenshot))
 # Convert image to RGB mode (this will handle both RGB and RGBA images)
-rgb_image = image.convert('RGB')
+rgb_image = image.convert("RGB")
 # Convert to JPEG and compress
 buffered = BytesIO()
 rgb_image.save(buffered, format="JPEG", quality=85)
-img_base64 = base64.b64encode(buffered.getvalue()).decode('utf-8')
+img_base64 = base64.b64encode(buffered.getvalue()).decode("utf-8")
 if self.verbose:
-print(f"[LOG] 📸 Screenshot taken and converted to base64")
+print("[LOG] 📸 Screenshot taken and converted to base64")
 return img_base64
 except Exception as e:
-error_message = sanitize_input_encode(f"Failed to take screenshot: {str(e)}")
+error_message = sanitize_input_encode(
+f"Failed to take screenshot: {str(e)}"
+)
 print(error_message)
 # Generate an image with black background
-img = Image.new('RGB', (800, 600), color='black')
+img = Image.new("RGB", (800, 600), color="black")
 draw = ImageDraw.Draw(img)
 # Load a font
@@ -352,7 +386,7 @@ class LocalSeleniumCrawlerStrategy(CrawlerStrategy):
 # Convert to base64
 buffered = BytesIO()
 img.save(buffered, format="JPEG")
-img_base64 = base64.b64encode(buffered.getvalue()).decode('utf-8')
+img_base64 = base64.b64encode(buffered.getvalue()).decode("utf-8")
 return img_base64

@@ -7,11 +7,13 @@ DB_PATH = os.path.join(os.getenv("CRAWL4_AI_BASE_DIRECTORY", Path.home()), ".cra
 os.makedirs(DB_PATH, exist_ok=True)
 DB_PATH = os.path.join(DB_PATH, "crawl4ai.db")
 def init_db():
 global DB_PATH
 conn = sqlite3.connect(DB_PATH)
 cursor = conn.cursor()
-cursor.execute('''
+cursor.execute(
+"""
 CREATE TABLE IF NOT EXISTS crawled_data (
 url TEXT PRIMARY KEY,
 html TEXT,
@@ -24,31 +26,42 @@ def init_db():
 metadata TEXT DEFAULT "{}",
 screenshot TEXT DEFAULT ""
 )
-''')
+"""
+)
 conn.commit()
 conn.close()
 def alter_db_add_screenshot(new_column: str = "media"):
 check_db_path()
 try:
 conn = sqlite3.connect(DB_PATH)
 cursor = conn.cursor()
-cursor.execute(f'ALTER TABLE crawled_data ADD COLUMN {new_column} TEXT DEFAULT ""')
+cursor.execute(
+f'ALTER TABLE crawled_data ADD COLUMN {new_column} TEXT DEFAULT ""'
+)
 conn.commit()
 conn.close()
 except Exception as e:
 print(f"Error altering database to add screenshot column: {e}")
 def check_db_path():
 if not DB_PATH:
 raise ValueError("Database path is not set or is empty.")
-def get_cached_url(url: str) -> Optional[Tuple[str, str, str, str, str, str, str, bool, str]]:
+def get_cached_url(
+url: str,
+) -> Optional[Tuple[str, str, str, str, str, str, str, bool, str]]:
 check_db_path()
 try:
 conn = sqlite3.connect(DB_PATH)
 cursor = conn.cursor()
-cursor.execute('SELECT url, html, cleaned_html, markdown, extracted_content, success, media, links, metadata, screenshot FROM crawled_data WHERE url = ?', (url,))
+cursor.execute(
+"SELECT url, html, cleaned_html, markdown, extracted_content, success, media, links, metadata, screenshot FROM crawled_data WHERE url = ?",
+(url,),
+)
 result = cursor.fetchone()
 conn.close()
 return result
@@ -56,12 +69,25 @@ def get_cached_url(url: str) -> Optional[Tuple[str, str, str, str, str, str, str
 print(f"Error retrieving cached URL: {e}")
 return None
-def cache_url(url: str, html: str, cleaned_html: str, markdown: str, extracted_content: str, success: bool, media : str = "{}", links : str = "{}", metadata : str = "{}", screenshot: str = ""):
+def cache_url(
+url: str,
+html: str,
+cleaned_html: str,
+markdown: str,
+extracted_content: str,
+success: bool,
+media: str = "{}",
+links: str = "{}",
+metadata: str = "{}",
+screenshot: str = "",
+):
 check_db_path()
 try:
 conn = sqlite3.connect(DB_PATH)
 cursor = conn.cursor()
-cursor.execute('''
+cursor.execute(
+"""
 INSERT INTO crawled_data (url, html, cleaned_html, markdown, extracted_content, success, media, links, metadata, screenshot)
 VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
 ON CONFLICT(url) DO UPDATE SET
@@ -74,18 +100,32 @@ def cache_url(url: str, html: str, cleaned_html: str, markdown: str, extracted_c
 links = excluded.links,
 metadata = excluded.metadata,
 screenshot = excluded.screenshot
-''', (url, html, cleaned_html, markdown, extracted_content, success, media, links, metadata, screenshot))
+""",
+(
+url,
+html,
+cleaned_html,
+markdown,
+extracted_content,
+success,
+media,
+links,
+metadata,
+screenshot,
+),
+)
 conn.commit()
 conn.close()
 except Exception as e:
 print(f"Error caching URL: {e}")
 def get_total_count() -> int:
 check_db_path()
 try:
 conn = sqlite3.connect(DB_PATH)
 cursor = conn.cursor()
-cursor.execute('SELECT COUNT(*) FROM crawled_data')
+cursor.execute("SELECT COUNT(*) FROM crawled_data")
 result = cursor.fetchone()
 conn.close()
 return result[0]
@@ -93,43 +133,48 @@ def get_total_count() -> int:
 print(f"Error getting total count: {e}")
 return 0
 def clear_db():
 check_db_path()
 try:
 conn = sqlite3.connect(DB_PATH)
 cursor = conn.cursor()
-cursor.execute('DELETE FROM crawled_data')
+cursor.execute("DELETE FROM crawled_data")
 conn.commit()
 conn.close()
 except Exception as e:
 print(f"Error clearing database: {e}")
 def flush_db():
 check_db_path()
 try:
 conn = sqlite3.connect(DB_PATH)
 cursor = conn.cursor()
-cursor.execute('DROP TABLE crawled_data')
+cursor.execute("DROP TABLE crawled_data")
 conn.commit()
 conn.close()
 except Exception as e:
 print(f"Error flushing database: {e}")
 def update_existing_records(new_column: str = "media", default_value: str = "{}"):
 check_db_path()
 try:
 conn = sqlite3.connect(DB_PATH)
 cursor = conn.cursor()
-cursor.execute(f'UPDATE crawled_data SET {new_column} = "{default_value}" WHERE screenshot IS NULL')
+cursor.execute(
+f'UPDATE crawled_data SET {new_column} = "{default_value}" WHERE screenshot IS NULL'
+)
 conn.commit()
 conn.close()
 except Exception as e:
 print(f"Error updating existing records: {e}")
 if __name__ == "__main__":
 # Delete the existing database file
 if os.path.exists(DB_PATH):
 os.remove(DB_PATH)
 init_db()
 # alter_db_add_screenshot("COL_NAME")
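
Review note: the `ON CONFLICT(url) DO UPDATE` clause in `cache_url` makes the write an upsert, so caching the same URL twice overwrites the row rather than raising an integrity error. A minimal sketch of that behaviour, assuming an in-memory database and a reduced three-column table (both are simplifications for illustration, not the schema in this file):

```python
import sqlite3

# In-memory database and a trimmed-down table, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE crawled_data (url TEXT PRIMARY KEY, html TEXT, success BOOLEAN)"
)


def cache_url(conn: sqlite3.Connection, url: str, html: str, success: bool) -> None:
    """Insert a row, or overwrite the existing row that has the same url."""
    conn.execute(
        """
        INSERT INTO crawled_data (url, html, success)
        VALUES (?, ?, ?)
        ON CONFLICT(url) DO UPDATE SET
            html = excluded.html,
            success = excluded.success
        """,
        (url, html, success),
    )
    conn.commit()


cache_url(conn, "https://example.com", "<p>v1</p>", True)
cache_url(conn, "https://example.com", "<p>v2</p>", True)  # same key: updates in place
rows = conn.execute("SELECT url, html FROM crawled_data").fetchall()
```

The values are passed as bound parameters (`?`), which is the same approach `cache_url` and `get_cached_url` take in the diff.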

@@ -0,0 +1,29 @@
from .bfs_deep_crawl_strategy import BFSDeepCrawlStrategy
from .filters import (
URLFilter,
FilterChain,
URLPatternFilter,
ContentTypeFilter,
DomainFilter,
)
from .scorers import (
KeywordRelevanceScorer,
PathDepthScorer,
FreshnessScorer,
CompositeScorer,
)
from .deep_crawl_strategty import DeepCrawlStrategy
__all__ = [
"BFSDeepCrawlStrategy",
"FilterChain",
"URLFilter",
"URLPatternFilter",
"ContentTypeFilter",
"DomainFilter",
"KeywordRelevanceScorer",
"PathDepthScorer",
"FreshnessScorer",
"CompositeScorer",
"DeepCrawlStrategy",
]

@@ -0,0 +1,193 @@
from typing import AsyncGenerator, Optional, Dict, Set, List
from datetime import datetime
import asyncio
import logging
from urllib.parse import urlparse
from ..models import CrawlResult, TraversalStats
from .filters import FilterChain
from .scorers import URLScorer
from .deep_crawl_strategty import DeepCrawlStrategy
from ..config import DEEP_CRAWL_BATCH_SIZE
class BFSDeepCrawlStrategy(DeepCrawlStrategy):
"""Best-First Search traversal strategy with filtering and scoring."""
def __init__(
self,
max_depth: int,
filter_chain: FilterChain,
url_scorer: URLScorer,
process_external_links: bool = False,
logger: Optional[logging.Logger] = None,
):
self.max_depth = max_depth
self.filter_chain = filter_chain
self.url_scorer = url_scorer
self.logger = logger or logging.getLogger(__name__)
# Crawl control
self.stats = TraversalStats(start_time=datetime.now())
self._cancel_event = asyncio.Event()
self.process_external_links = process_external_links
async def can_process_url(self, url: str, depth: int) -> bool:
"""Check if URL can be processed based on filters
This is our gatekeeper method that determines if a URL should be processed. It:
- Validates URL format using a robust built-in method
- Applies custom filters from the filter chain
- Updates statistics for blocked URLs
- Returns False early if any check fails
"""
try:
result = urlparse(url)
if not all([result.scheme, result.netloc]):
raise ValueError("Invalid URL")
if result.scheme not in ("http", "https"):
raise ValueError("URL must be HTTP or HTTPS")
if not result.netloc or "." not in result.netloc:
raise ValueError("Invalid domain")
except Exception as e:
self.logger.warning(f"Invalid URL: {url}. Error: {str(e)}")
return False
# Apply the filter chain if it's not start page
if depth != 0 and not self.filter_chain.apply(url):
return False
return True
async def _process_links(
self,
result: CrawlResult,
source_url: str,
queue: asyncio.PriorityQueue,
visited: Set[str],
depths: Dict[str, int],
) -> None:
"""Process extracted links from crawl result.
This is our link processor that:
Checks depth limits
Handles both internal and external links
Checks if URL is visited already
Checks if URL can be processed - validates URL, applies Filters with can_process_url
Scores URLs for priority
Updates depth tracking dictionary
Adds valid URLs to the queue
Updates maximum depth statistics
"""
next_depth = depths[source_url] + 1
# If depth limit reached, exit without processing links
if next_depth > self.max_depth:
return
# Copy the list so that adding external links does not mutate result.links["internal"]
links_to_process = list(result.links["internal"])
if self.process_external_links:
links_to_process += result.links["external"]
for link in links_to_process:
url = link["href"]
if url in visited:
continue
if not await self.can_process_url(url, next_depth):
self.stats.urls_skipped += 1
continue
score = self.url_scorer.score(url) if self.url_scorer else 0
await queue.put((score, next_depth, url, source_url))
visited.add(url)  # Mark as seen so the same URL is not queued twice
depths[url] = next_depth
self.stats.total_depth_reached = max(
self.stats.total_depth_reached, next_depth
)
async def arun(
self,
start_url: str,
crawler: "AsyncWebCrawler",
crawler_run_config: Optional["CrawlerRunConfig"] = None,
) -> AsyncGenerator[CrawlResult, None]:
"""Implement BFS traversal strategy"""
# Initialize traversal state
"""
queue: A priority queue where items are tuples of (score, depth, url, parent_url)
Score: Determines traversal priority (lower = higher priority)
Depth: Current distance from start_url
URL: The actual URL to crawl
Parent URL: The page on which the URL was found (None for start_url)
visited: Keeps track of URLs we've already seen to avoid cycles
depths: Maps URLs to their depths from the start URL
active_crawls: Tracks currently running crawl tasks
"""
queue = asyncio.PriorityQueue()
await queue.put((0, 0, start_url, None))
visited: Set[str] = {start_url}  # The start URL is already queued
depths = {start_url: 0}
active_crawls = {} # Track URLs currently being processed with depth and score
active_crawls_lock = (
asyncio.Lock()
) # Create the lock within the same event loop
try:
while (
not queue.empty() or active_crawls
) and not self._cancel_event.is_set():
"""
This sets up our main control loop which:
- Continues while there are URLs to process (not queue.empty())
- Or while there are active crawls still running (arun_many)
- Can be interrupted via cancellation (not self._cancel_event.is_set())
"""
# Collect batch of URLs into active_crawls to process
async with active_crawls_lock:
while (
len(active_crawls) < DEEP_CRAWL_BATCH_SIZE and not queue.empty()
):
score, depth, url, parent_url = await queue.get()
active_crawls[url] = {
"depth": depth,
"score": score,
"parent_url": parent_url,
}
self.stats.current_depth = depth
if not active_crawls:
# If no active crawls exist, wait a bit and continue
await asyncio.sleep(0.1)
continue
# Process batch
try:
# Important: clear deep_crawl_strategy on the cloned config so that crawling the children does not recursively start another deep crawl.
if crawler_run_config:
crawler_run_config = crawler_run_config.clone(
deep_crawl_strategy=None, stream=True
)
async for result in await crawler.arun_many(
urls=list(active_crawls.keys()),
config=crawler_run_config
):
async with active_crawls_lock:
crawl_info = active_crawls.pop(result.url, None)
if crawl_info and result.success:
await self._process_links(
result, result.url, queue, visited, depths
)
result.depth = crawl_info["depth"]
result.score = crawl_info["score"]
result.parent_url = crawl_info["parent_url"]
yield result
else:
self.logger.warning(
f"Failed to crawl {result.url}: {result.error_message}"
)
except Exception as e:
self.logger.error(f"Batch processing error: {e}")
# Continue processing other batches
continue
except Exception as e:
self.logger.error(f"Error in crawl process: {e}")
raise
finally:
self.stats.end_time = datetime.now()
async def shutdown(self):
"""Clean up resources and stop crawling"""
self._cancel_event.set()
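
Review note: the traversal in `arun` can be hard to follow with the batching and locking mixed in. The core loop (pop the lowest score, yield it, enqueue unseen children one level deeper, stop at `max_depth`) can be sketched as below. This is a simplified illustration, not the strategy itself: `LINKS` is a hypothetical in-memory link graph standing in for real crawl results, every URL gets score 0, and there is no batching or cancellation.

```python
import asyncio
from typing import AsyncGenerator, Dict, List, Optional, Tuple

# Hypothetical in-memory "site": each URL maps to the links found on that page.
LINKS: Dict[str, List[str]] = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": ["d"],
    "d": [],
}


async def traverse(
    start: str, max_depth: int
) -> AsyncGenerator[Tuple[str, int, Optional[str]], None]:
    """Yield (url, depth, parent) in priority order, never deeper than max_depth."""
    queue: asyncio.PriorityQueue = asyncio.PriorityQueue()
    # Queue items are (score, depth, url, parent); "" stands for "no parent"
    # so that tuples always stay comparable.
    await queue.put((0, 0, start, ""))
    seen = {start}
    while not queue.empty():
        _score, depth, url, parent = await queue.get()
        yield url, depth, parent or None
        if depth + 1 > max_depth:
            continue  # children would exceed the depth limit
        for child in LINKS[url]:
            if child in seen:
                continue  # already queued or crawled
            seen.add(child)
            await queue.put((0, depth + 1, child, url))


async def collect(max_depth: int) -> List[Tuple[str, int, Optional[str]]]:
    return [item async for item in traverse("a", max_depth)]


results = asyncio.run(collect(max_depth=1))
```

With `max_depth=1` the page `d` is never reached, because it is only linked from pages at depth 1.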

@@ -0,0 +1,30 @@
from abc import ABC, abstractmethod
from typing import AsyncGenerator, Optional
from ..models import CrawlResult
class DeepCrawlStrategy(ABC):
@abstractmethod
async def arun(
self,
url: str,
crawler: "AsyncWebCrawler",
crawler_run_config: Optional["CrawlerRunConfig"] = None,
) -> AsyncGenerator[CrawlResult, None]:
"""Traverse the given URL using the specified crawler.
Args:
url (str): The starting URL for the traversal.
crawler (AsyncWebCrawler): The crawler instance to use for traversal.
crawler_run_config (CrawlerRunConfig, optional): The configuration for the crawler.
Returns:
AsyncGenerator[CrawlResult, None]: An async generator yielding crawl results.
"""
pass
@abstractmethod
async def shutdown(self):
"""Clean up resources used by the strategy"""
pass
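
Review note: the interface above is consumed with `async for`, and `shutdown` is the only way for a caller to stop a traversal early. A minimal sketch of that contract, assuming a toy `Strategy` base and an `EchoStrategy` subclass (both hypothetical, defined here only so the example is self-contained):

```python
import asyncio
from abc import ABC, abstractmethod
from typing import AsyncGenerator, List


class Strategy(ABC):
    """Toy stand-in for the abstract strategy: one generator method, one stop method."""

    @abstractmethod
    def arun(self, url: str) -> AsyncGenerator[str, None]:
        ...

    @abstractmethod
    async def shutdown(self) -> None:
        ...


class EchoStrategy(Strategy):
    """Yields a few made-up child URLs until shutdown() is called."""

    def __init__(self) -> None:
        self.stopped = False

    async def arun(self, url: str) -> AsyncGenerator[str, None]:
        for suffix in ("", "/a", "/b"):
            if self.stopped:
                return  # honour the stop request between results
            yield url + suffix

    async def shutdown(self) -> None:
        self.stopped = True


async def main() -> List[str]:
    strategy = EchoStrategy()
    pages: List[str] = []
    async for page in strategy.arun("https://example.com"):
        pages.append(page)
        if len(pages) == 2:
            await strategy.shutdown()  # stop after two results
    return pages


pages = asyncio.run(main())
```

Because the concrete `arun` contains `yield`, calling it returns an async generator directly, so the caller iterates it without awaiting the call first.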

@@ -0,0 +1,868 @@
from abc import ABC, abstractmethod
from typing import List, Pattern, Set, Union, FrozenSet
import re, time
from urllib.parse import urlparse
from array import array
import logging
from functools import lru_cache
import fnmatch
from dataclasses import dataclass
from typing import ClassVar
import weakref
import mimetypes
@dataclass
class FilterStats:
# PERF: Using dataclass creates overhead with __init__ and property access
# PERF: Could use __slots__ to reduce memory footprint
# PERF: Consider using array.array('I') for atomic increments
total_urls: int = 0
rejected_urls: int = 0
passed_urls: int = 0
class URLFilter(ABC):
# PERF: Logger creation is expensive, consider lazy initialization
# PERF: stats object creation adds overhead for each filter instance
def __init__(self, name: str = None):
self.name = name or self.__class__.__name__
self.stats = FilterStats()
self.logger = logging.getLogger(f"urlfilter.{self.name}")
@abstractmethod
def apply(self, url: str) -> bool:
pass
def _update_stats(self, passed: bool):
# PERF: Already optimized but could use bitwise operations
# PERF: Consider removing stats entirely in production/fast mode
self.stats.total_urls += 1
self.stats.passed_urls += passed
self.stats.rejected_urls += not passed
class FilterChain:
# PERF: List traversal for each URL is expensive
# PERF: Could use array.array instead of list for filters
# PERF: Consider adding fast path for single filter case
def __init__(self, filters: List[URLFilter] = None):
self.filters = filters or []
self.stats = FilterStats()
self.logger = logging.getLogger("urlfilter.chain")
def apply(self, url: str) -> bool:
# PERF: Logging on every rejection is expensive
# PERF: Could reorder filters by rejection rate
# PERF: Consider batch processing mode
self.stats.total_urls += 1
for filter_ in self.filters:
if not filter_.apply(url):
self.stats.rejected_urls += 1
self.logger.debug(f"URL {url} rejected by {filter_.name}")
return False
self.stats.passed_urls += 1
return True
class URLPatternFilter(URLFilter):
# PERF: Converting glob to regex is expensive
# PERF: Multiple regex compilation is slow
# PERF: List of patterns causes multiple regex evaluations
def __init__(
self,
patterns: Union[str, Pattern, List[Union[str, Pattern]]],
use_glob: bool = True,
):
super().__init__()
self.patterns = [patterns] if isinstance(patterns, (str, Pattern)) else patterns
self.use_glob = use_glob
self._compiled_patterns = []
# PERF: This could be consolidated into a single regex with OR conditions
# PERF: glob_to_regex creates complex patterns, could be simplified
for pattern in self.patterns:
if isinstance(pattern, str) and use_glob:
self._compiled_patterns.append(self._glob_to_regex(pattern))
else:
self._compiled_patterns.append(
re.compile(pattern) if isinstance(pattern, str) else pattern
)
def _glob_to_regex(self, pattern: str) -> Pattern:
# PERF: fnmatch.translate creates overly complex patterns
# PERF: Could cache common translations
return re.compile(fnmatch.translate(pattern))
def apply(self, url: str) -> bool:
# PERF: any() with generator is slower than direct loop with early return
# PERF: searching entire string is slower than anchored match
matches = any(pattern.search(url) for pattern in self._compiled_patterns)
self._update_stats(matches)
return matches
class ContentTypeFilter(URLFilter):
# PERF: mimetypes guessing is extremely slow
# PERF: URL parsing on every check is expensive
# PERF: No caching of results for similar extensions
def __init__(
self, allowed_types: Union[str, List[str]], check_extension: bool = True
):
super().__init__()
self.allowed_types = (
[allowed_types] if isinstance(allowed_types, str) else allowed_types
)
self.check_extension = check_extension
self._normalize_types()
def _normalize_types(self):
"""Normalize content type strings"""
self.allowed_types = [t.lower() for t in self.allowed_types]
def _check_extension(self, url: str) -> bool:
# PERF: urlparse is called on every check
# PERF: multiple string splits are expensive
# PERF: mimetypes.guess_type is very slow
ext = (
urlparse(url).path.split(".")[-1].lower()
if "." in urlparse(url).path
else ""
)
if not ext:
return True
# PERF: guess_type is main bottleneck
guessed_type = mimetypes.guess_type(url)[0]
return any(
allowed in (guessed_type or "").lower() for allowed in self.allowed_types
)
def apply(self, url: str) -> bool:
"""Check if URL's content type is allowed"""
result = True
if self.check_extension:
result = self._check_extension(url)
self._update_stats(result)
return result
class DomainFilter(URLFilter):
# PERF: Set lookups are fast but string normalizations on init are not
# PERF: Creating two sets doubles memory usage
def __init__(
self,
allowed_domains: Union[str, List[str]] = None,
blocked_domains: Union[str, List[str]] = None,
):
super().__init__()
# PERF: Normalizing domains on every init is wasteful
# PERF: Could use frozenset for immutable lists
self.allowed_domains = (
set(self._normalize_domains(allowed_domains)) if allowed_domains else None
)
self.blocked_domains = (
set(self._normalize_domains(blocked_domains)) if blocked_domains else set()
)
def _normalize_domains(self, domains: Union[str, List[str]]) -> List[str]:
# PERF: strip() and lower() create new strings for each domain
# PERF: List comprehension creates intermediate list
if isinstance(domains, str):
domains = [domains]
return [d.lower().strip() for d in domains]
def _extract_domain(self, url: str) -> str:
# PERF: urlparse is called for every URL check
# PERF: lower() creates new string every time
# PERF: Could cache recent results
return urlparse(url).netloc.lower()
def apply(self, url: str) -> bool:
# PERF: Two separate set lookups in worst case
# PERF: Domain extraction happens before knowing if we have any filters
domain = self._extract_domain(url)
if domain in self.blocked_domains:
self._update_stats(False)
return False
if self.allowed_domains is not None and domain not in self.allowed_domains:
self._update_stats(False)
return False
self._update_stats(True)
return True
# Example usage:
def create_common_filter_chain() -> FilterChain:
"""Create a commonly used filter chain"""
return FilterChain(
[
URLPatternFilter(
[
"*.html",
"*.htm", # HTML files
"*/article/*",
"*/blog/*", # Common content paths
]
),
ContentTypeFilter(["text/html", "application/xhtml+xml"]),
DomainFilter(blocked_domains=["ads.*", "analytics.*"]),
]
)
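
Review note: two details of the chain above are easy to miss. `fnmatch.translate` produces a regex anchored at the end of the string, so `*.html` only matches URLs that end in `.html`, and `DomainFilter.apply` does exact set membership, so entries such as `ads.*` are treated as literal strings rather than wildcards. The sketch below mirrors that logic with the standard library only; the helper names `matches_pattern`, `domain_allowed` and `chain` are hypothetical.

```python
import fnmatch
import re
from urllib.parse import urlparse

# Glob patterns are translated to regexes once, as URLPatternFilter does.
PATTERNS = [re.compile(fnmatch.translate(p)) for p in ("*.html", "*/blog/*")]

# Same entries as in create_common_filter_chain.
BLOCKED = {"ads.*", "analytics.*"}


def matches_pattern(url: str) -> bool:
    # The translated regex is anchored at the end, so search() must consume
    # the URL up to its last character.
    return any(p.search(url) for p in PATTERNS)


def domain_allowed(url: str) -> bool:
    # Exact set membership, mirroring DomainFilter.apply.
    return urlparse(url).netloc.lower() not in BLOCKED


def chain(url: str) -> bool:
    # A URL passes only if every filter accepts it.
    return matches_pattern(url) and domain_allowed(url)
```

Under these assumptions `https://ads.example.com/x.html` passes the domain check, because its host is not literally equal to `ads.*`. If wildcard blocking is intended, the domain entries would need pattern matching instead of a set lookup.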
####################################################################################
# Uncledoe: Optimized Version
####################################################################################
# Use __slots__ and array for maximum memory/speed efficiency
class FastFilterStats:
__slots__ = ("_counters",)
def __init__(self):
# Compact array of unsigned ints; note that += on an element is not atomic across threads
self._counters = array("I", [0, 0, 0]) # total, passed, rejected
@property
def total_urls(self):
return self._counters[0]
@property
def passed_urls(self):
return self._counters[1]
@property
def rejected_urls(self):
return self._counters[2]
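
Review note: the counter layout relies on `bool` being a subclass of `int`, so adding `passed` or `not passed` increments by 1 or 0. A small sketch of that update pattern, assuming the same index order as `FastFilterStats` (total, passed, rejected); the module-level `counters` and `update` names are hypothetical.

```python
from array import array

# Index layout assumed to match FastFilterStats: total, passed, rejected.
counters = array("I", [0, 0, 0])


def update(passed: bool) -> None:
    counters[0] += 1           # total
    counters[1] += passed      # True adds 1, False adds 0
    counters[2] += not passed  # the inverse


for outcome in (True, False, True):
    update(outcome)
```

Each `+=` here is still a separate read and write at the Python level, so the array mainly saves memory; it should not be relied on for thread safety.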
class FastURLFilter(ABC):
"""Optimized base filter class"""
__slots__ = ("name", "stats", "_logger_ref")
def __init__(self, name: str = None):
self.name = name or self.__class__.__name__
self.stats = FastFilterStats()
# Lazy logger initialization using weakref
self._logger_ref = None
@property
def logger(self):
if self._logger_ref is None or self._logger_ref() is None:
logger = logging.getLogger(f"urlfilter.{self.name}")
self._logger_ref = weakref.ref(logger)
return self._logger_ref()
@abstractmethod
def apply(self, url: str) -> bool:
pass
def _update_stats(self, passed: bool):
# Use direct array index for speed
self.stats._counters[0] += 1 # total
self.stats._counters[1] += passed # passed
self.stats._counters[2] += not passed # rejected
class FastFilterChain:
"""Optimized filter chain"""
__slots__ = ("filters", "stats", "_logger_ref")
def __init__(self, filters: List[FastURLFilter] = None):
self.filters = tuple(filters or []) # Immutable tuple for speed
self.stats = FastFilterStats()
self._logger_ref = None
@property
def logger(self):
if self._logger_ref is None or self._logger_ref() is None:
logger = logging.getLogger("urlfilter.chain")
self._logger_ref = weakref.ref(logger)
return self._logger_ref()
def add_filter(self, filter_: FastURLFilter) -> "FastFilterChain":
"""Add a filter to the chain"""
self.filters = self.filters + (filter_,)  # filters is a tuple, which has no append()
return self # Enable method chaining
def apply(self, url: str) -> bool:
"""Optimized apply with minimal operations"""
self.stats._counters[0] += 1 # total
# Direct tuple iteration is faster than list
for f in self.filters:
if not f.apply(url):
self.stats._counters[2] += 1 # rejected
return False
self.stats._counters[1] += 1 # passed
return True
class FastURLPatternFilter(FastURLFilter):
"""Pattern filter balancing speed and completeness"""
__slots__ = ('_simple_suffixes', '_simple_prefixes', '_domain_patterns', '_path_patterns')
PATTERN_TYPES = {
'SUFFIX': 1, # *.html
'PREFIX': 2, # /foo/*
'DOMAIN': 3, # *.example.com
'PATH': 4 , # Everything else
'REGEX': 5
}
def __init__(self, patterns: Union[str, Pattern, List[Union[str, Pattern]]], use_glob: bool = True):
super().__init__()
patterns = [patterns] if isinstance(patterns, (str, Pattern)) else patterns
self._simple_suffixes = set()
self._simple_prefixes = set()
self._domain_patterns = []
self._path_patterns = []
for pattern in patterns:
pattern_type = self._categorize_pattern(pattern)
self._add_pattern(pattern, pattern_type)
def _categorize_pattern(self, pattern: str) -> int:
"""Categorize pattern for specialized handling"""
if not isinstance(pattern, str):
return self.PATTERN_TYPES['PATH']
# Check if it's a regex pattern
if pattern.startswith('^') or pattern.endswith('$') or '\\d' in pattern:
return self.PATTERN_TYPES['REGEX']
if pattern.count('*') == 1:
if pattern.startswith('*.'):
return self.PATTERN_TYPES['SUFFIX']
if pattern.endswith('/*'):
return self.PATTERN_TYPES['PREFIX']
if '://' in pattern and pattern.startswith('*.'):
return self.PATTERN_TYPES['DOMAIN']
return self.PATTERN_TYPES['PATH']
def _add_pattern(self, pattern: str, pattern_type: int):
"""Add pattern to appropriate matcher"""
if pattern_type == self.PATTERN_TYPES['REGEX']:
# For regex patterns, compile directly without glob translation
if isinstance(pattern, str) and (pattern.startswith('^') or pattern.endswith('$') or '\\d' in pattern):
self._path_patterns.append(re.compile(pattern))
return
elif pattern_type == self.PATTERN_TYPES['SUFFIX']:
self._simple_suffixes.add(pattern[2:])
elif pattern_type == self.PATTERN_TYPES['PREFIX']:
self._simple_prefixes.add(pattern[:-2])
elif pattern_type == self.PATTERN_TYPES['DOMAIN']:
self._domain_patterns.append(
re.compile(pattern.replace('*.', r'[^/]+\.'))
)
else:
if isinstance(pattern, str):
# Handle complex glob patterns
if '**' in pattern:
pattern = pattern.replace('**', '.*')
if '{' in pattern:
# Convert {a,b} to (a|b)
pattern = re.sub(r'\{([^}]+)\}',
lambda m: f'({"|".join(m.group(1).split(","))})',
pattern)
pattern = fnmatch.translate(pattern)
self._path_patterns.append(
pattern if isinstance(pattern, Pattern) else re.compile(pattern)
)
@lru_cache(maxsize=10000)
def apply(self, url: str) -> bool:
"""Hierarchical pattern matching"""
# Quick suffix check (*.html)
if self._simple_suffixes:
path = url.split('?')[0]
if path.split('/')[-1].split('.')[-1] in self._simple_suffixes:
self._update_stats(True)
return True
# Domain check
if self._domain_patterns:
for pattern in self._domain_patterns:
if pattern.match(url):
self._update_stats(True)
return True
# Prefix check (/foo/*)
if self._simple_prefixes:
path = url.split('?')[0]
if any(path.startswith(p) for p in self._simple_prefixes):
self._update_stats(True)
return True
# Complex patterns
if self._path_patterns:
if any(p.search(url) for p in self._path_patterns):
self._update_stats(True)
return True
self._update_stats(False)
return False
class FastContentTypeFilter(FastURLFilter):
"""Optimized content type filter using fast lookups"""
__slots__ = ("allowed_types", "_ext_map", "_check_extension")
# Fast extension to mime type mapping
_MIME_MAP = {
# Text Formats
"txt": "text/plain",
"html": "text/html",
"htm": "text/html",
"xhtml": "application/xhtml+xml",
"css": "text/css",
"csv": "text/csv",
"ics": "text/calendar",
"js": "application/javascript",
# Images
"bmp": "image/bmp",
"gif": "image/gif",
"jpeg": "image/jpeg",
"jpg": "image/jpeg",
"png": "image/png",
"svg": "image/svg+xml",
"tiff": "image/tiff",
"ico": "image/x-icon",
"webp": "image/webp",
# Audio
"mp3": "audio/mpeg",
"wav": "audio/wav",
"ogg": "audio/ogg",
"m4a": "audio/mp4",
"aac": "audio/aac",
# Video
"mp4": "video/mp4",
"mpeg": "video/mpeg",
"webm": "video/webm",
"avi": "video/x-msvideo",
"mov": "video/quicktime",
"flv": "video/x-flv",
"wmv": "video/x-ms-wmv",
"mkv": "video/x-matroska",
# Applications
"json": "application/json",
"xml": "application/xml",
"pdf": "application/pdf",
"zip": "application/zip",
"gz": "application/gzip",
"tar": "application/x-tar",
"rar": "application/vnd.rar",
"7z": "application/x-7z-compressed",
"exe": "application/vnd.microsoft.portable-executable",
"msi": "application/x-msdownload",
# Fonts
"woff": "font/woff",
"woff2": "font/woff2",
"ttf": "font/ttf",
"otf": "font/otf",
# Microsoft Office
"doc": "application/msword",
"dot": "application/msword",
"docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
"xlsx": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
"xls": "application/vnd.ms-excel",
"ppt": "application/vnd.ms-powerpoint",
"pptx": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
# OpenDocument Formats
"odt": "application/vnd.oasis.opendocument.text",
"ods": "application/vnd.oasis.opendocument.spreadsheet",
"odp": "application/vnd.oasis.opendocument.presentation",
# Archives
"tar.gz": "application/gzip",
"tgz": "application/gzip",
"bz2": "application/x-bzip2",
# Others
"rtf": "application/rtf",
"apk": "application/vnd.android.package-archive",
"epub": "application/epub+zip",
"jar": "application/java-archive",
"swf": "application/x-shockwave-flash",
"midi": "audio/midi",
"mid": "audio/midi",
"ps": "application/postscript",
"ai": "application/postscript",
"eps": "application/postscript",
# Custom or less common
"bin": "application/octet-stream",
"dmg": "application/x-apple-diskimage",
"iso": "application/x-iso9660-image",
"deb": "application/x-debian-package",
"rpm": "application/x-rpm",
"sqlite": "application/vnd.sqlite3",
# Placeholder
"unknown": "application/octet-stream", # Fallback for unknown file types
}
@staticmethod
@lru_cache(maxsize=1000)
def _extract_extension(path: str) -> str:
"""Fast extension extraction with caching"""
if "." not in path:
return ""
return path.rpartition(".")[-1].lower()
def __init__(
self, allowed_types: Union[str, List[str]], check_extension: bool = True
):
super().__init__()
# Normalize and store as frozenset for fast lookup
self.allowed_types = frozenset(
t.lower()
for t in (
allowed_types if isinstance(allowed_types, list) else [allowed_types]
)
)
self._check_extension = check_extension
# Pre-compute extension map for allowed types
self._ext_map = frozenset(
ext
for ext, mime in self._MIME_MAP.items()
if any(allowed in mime for allowed in self.allowed_types)
)
@lru_cache(maxsize=1000)
def _check_url_cached(self, url: str) -> bool:
"""Cached URL checking"""
if not self._check_extension:
return True
path = url.split("?")[0] # Fast path split
ext = self._extract_extension(path)
if not ext:
return True
return ext in self._ext_map
def apply(self, url: str) -> bool:
"""Fast extension check with caching"""
result = self._check_url_cached(url)
self._update_stats(result)
return result
class FastDomainFilter(FastURLFilter):
"""Optimized domain filter with fast lookups and caching"""
__slots__ = ("_allowed_domains", "_blocked_domains", "_domain_cache")
# Regex for fast domain extraction
_DOMAIN_REGEX = re.compile(r"://([^/]+)")
def __init__(
self,
allowed_domains: Union[str, List[str]] = None,
blocked_domains: Union[str, List[str]] = None,
):
super().__init__()
# Convert inputs to frozensets for immutable, fast lookups
self._allowed_domains = (
frozenset(self._normalize_domains(allowed_domains))
if allowed_domains
else None
)
self._blocked_domains = (
frozenset(self._normalize_domains(blocked_domains))
if blocked_domains
else frozenset()
)
@staticmethod
def _normalize_domains(domains: Union[str, List[str]]) -> Set[str]:
"""Fast domain normalization"""
if isinstance(domains, str):
return {domains.lower()}
return {d.lower() for d in domains}
@staticmethod
@lru_cache(maxsize=10000)
def _extract_domain(url: str) -> str:
"""Ultra-fast domain extraction with regex and caching"""
match = FastDomainFilter._DOMAIN_REGEX.search(url)
return match.group(1).lower() if match else ""
def apply(self, url: str) -> bool:
"""Optimized domain checking with early returns"""
# Skip processing if no filters
if not self._blocked_domains and self._allowed_domains is None:
self._update_stats(True)
return True
domain = self._extract_domain(url)
# Early return for blocked domains
if domain in self._blocked_domains:
self._update_stats(False)
return False
# If no allowed domains specified, accept all non-blocked
if self._allowed_domains is None:
self._update_stats(True)
return True
# Final allowed domains check
result = domain in self._allowed_domains
self._update_stats(result)
return result
def create_fast_filter_chain() -> FastFilterChain:
"""Create an optimized filter chain with filters ordered by rejection rate"""
return FastFilterChain(
[
# Domain filter first (fastest rejection)
FastDomainFilter(blocked_domains=["ads.*", "analytics.*"]),
# Content filter second (medium speed)
FastContentTypeFilter(["text/html", "application/xhtml+xml"]),
# Pattern filter last (most expensive)
FastURLPatternFilter(
[
"*.html",
"*.htm",
"*/article/*",
"*/blog/*",
]
),
]
)
def run_performance_test():
import time
import random
from itertools import cycle
# Generate test URLs
base_urls = [
"https://example.com/article/123",
"https://blog.example.com/post/456",
"https://ads.example.com/tracking",
"https://example.com/about.html",
"https://analytics.example.com/script.js",
"https://example.com/products.php",
"https://subdomain.example.com/blog/post-123",
"https://example.com/path/file.pdf",
]
# Create more varied test data
test_urls = []
for base in base_urls:
# Add original
test_urls.append(base)
# Add variations
parts = base.split("/")
for i in range(10):
parts[-1] = f"page_{i}.html"
test_urls.append("/".join(parts))
# Multiply to get enough test data
test_urls = test_urls * 10000 # 8 base URLs x 11 variants x 10000 = 880k URLs
def benchmark(name: str, func, *args, warmup=True):
if warmup:
# Warmup run
func(*args)
# Actual timing
start = time.perf_counter_ns()
result = func(*args)
elapsed = (time.perf_counter_ns() - start) / 1_000_000 # Convert to ms
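# Report throughput for the URLs actually processed (result holds one entry per URL),
# so the 1000-URL subset runs are not credited with len(test_urls)
print(
f"{name:<30} {elapsed:>8.3f} ms ({len(result)/elapsed*1000:,.0f} URLs/sec)"
)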
return result
print("\nBenchmarking original vs optimized implementations...")
print("-" * 70)
# Original implementation
pattern_filter = URLPatternFilter(["*.html", "*/article/*"])
content_filter = ContentTypeFilter(["text/html"])
domain_filter = DomainFilter(blocked_domains=["ads.*", "analytics.*"])
chain = FilterChain([pattern_filter, content_filter, domain_filter])
# Optimized implementation
fast_pattern_filter = FastURLPatternFilter(["*.html", "*/article/*"])
fast_content_filter = FastContentTypeFilter(["text/html"])
fast_domain_filter = FastDomainFilter(blocked_domains=["ads.*", "analytics.*"])
fast_chain = FastFilterChain(
[fast_domain_filter, fast_content_filter, fast_pattern_filter]
)
# Test individual filters
print("\nSingle filter performance (first 1000 URLs):")
test_subset = test_urls[:1000]
print("\nPattern Filters:")
benchmark(
"Original Pattern Filter",
lambda: [pattern_filter.apply(url) for url in test_subset],
)
benchmark(
"Optimized Pattern Filter",
lambda: [fast_pattern_filter.apply(url) for url in test_subset],
)
print("\nContent Filters:")
benchmark(
"Original Content Filter",
lambda: [content_filter.apply(url) for url in test_subset],
)
benchmark(
"Optimized Content Filter",
lambda: [fast_content_filter.apply(url) for url in test_subset],
)
print("\nDomain Filters:")
benchmark(
"Original Domain Filter",
lambda: [domain_filter.apply(url) for url in test_subset],
)
benchmark(
"Optimized Domain Filter",
lambda: [fast_domain_filter.apply(url) for url in test_subset],
)
print("\nFull Chain Performance (all URLs):")
# Test chain
benchmark("Original Chain", lambda: [chain.apply(url) for url in test_urls])
benchmark("Optimized Chain", lambda: [fast_chain.apply(url) for url in test_urls])
# Memory usage
import sys
print("\nMemory Usage per Filter:")
print(f"Original Pattern Filter: {sys.getsizeof(pattern_filter):,} bytes")
print(f"Optimized Pattern Filter: {sys.getsizeof(fast_pattern_filter):,} bytes")
print(f"Original Content Filter: {sys.getsizeof(content_filter):,} bytes")
print(f"Optimized Content Filter: {sys.getsizeof(fast_content_filter):,} bytes")
print(f"Original Domain Filter: {sys.getsizeof(domain_filter):,} bytes")
print(f"Optimized Domain Filter: {sys.getsizeof(fast_domain_filter):,} bytes")
def test_pattern_filter():
import time
from itertools import chain
# Test cases as list of tuples instead of dict for multiple patterns
test_cases = [
# Simple suffix patterns (*.html)
("*.html", {
"https://example.com/page.html": True,
"https://example.com/path/doc.html": True,
"https://example.com/page.htm": False,
"https://example.com/page.html?param=1": True,
}),
# Path prefix patterns (/foo/*)
("*/article/*", {
"https://example.com/article/123": True,
"https://example.com/blog/article/456": True,
"https://example.com/articles/789": False,
"https://example.com/article": False,
}),
# Complex patterns
("blog-*-[0-9]", {
"https://example.com/blog-post-1": True,
"https://example.com/blog-test-9": True,
"https://example.com/blog-post": False,
"https://example.com/blog-post-x": False,
}),
# Multiple patterns case
(["*.pdf", "*/download/*"], {
"https://example.com/doc.pdf": True,
"https://example.com/download/file.txt": True,
"https://example.com/path/download/doc": True,
"https://example.com/uploads/file.txt": False,
}),
# Edge cases
("*", {
"https://example.com": True,
"": True,
"http://test.com/path": True,
}),
# Complex regex
(r"^https?://.*\.example\.com/\d+", {
"https://sub.example.com/123": True,
"http://test.example.com/456": True,
"https://example.com/789": False,
"https://sub.example.com/abc": False,
})
]
def run_accuracy_test():
print("\nAccuracy Tests:")
print("-" * 50)
all_passed = True
for patterns, test_urls in test_cases:
filter_obj = FastURLPatternFilter(patterns)
for url, expected in test_urls.items():
result = filter_obj.apply(url)
if result != expected:
print(f"❌ Failed: Pattern '{patterns}' with URL '{url}'")
print(f" Expected: {expected}, Got: {result}")
all_passed = False
else:
print(f"✅ Passed: Pattern '{patterns}' with URL '{url}'")
return all_passed
def run_speed_test():
print("\nSpeed Tests:")
print("-" * 50)
# Create a large set of test URLs
all_urls = list(chain.from_iterable(urls.keys() for _, urls in test_cases))
test_urls = all_urls * 10000 # 100K+ URLs
# Test both implementations
original = URLPatternFilter(["*.html", "*/article/*", "blog-*"])
optimized = FastURLPatternFilter(["*.html", "*/article/*", "blog-*"])
def benchmark(name, filter_obj):
start = time.perf_counter()
for url in test_urls:
filter_obj.apply(url)
elapsed = time.perf_counter() - start
urls_per_sec = len(test_urls) / elapsed
print(f"{name:<20} {elapsed:.3f}s ({urls_per_sec:,.0f} URLs/sec)")
benchmark("Original Filter:", original)
benchmark("Optimized Filter:", optimized)
# Run tests
print("Running Pattern Filter Tests...")
accuracy_passed = run_accuracy_test()
if accuracy_passed:
print("\n✨ All accuracy tests passed!")
run_speed_test()
else:
print("\n❌ Some accuracy tests failed!")
if __name__ == "__main__":
run_performance_test()
# test_pattern_filter()
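
Review note on the filters in this file, offered as a sketch rather than as part of the commit. `_extract_extension` takes whatever follows the last dot in the query-stripped URL, so for `https://example.com/article/123` it appears to return `com/article/123`. `FastDomainFilter.apply` tests plain set membership, so a pattern such as `ads.*` would only match a host literally named `ads.*`. The helpers below, `extension_from_url` and `host_is_blocked`, are hypothetical names that do not exist in crawl4ai; they assume the standard-library `urllib.parse.urlsplit`, `posixpath.splitext` and `fnmatch.fnmatchcase` are acceptable on this path.

```python
import fnmatch
import posixpath
from typing import Iterable
from urllib.parse import urlsplit


def extension_from_url(url: str) -> str:
    """Return the lowercased extension of the last path segment, or ""."""
    # urlsplit separates host, path and query, so dots in the host are ignored
    path = urlsplit(url).path
    filename = posixpath.basename(path)
    _, ext = posixpath.splitext(filename)
    return ext[1:].lower()


def host_is_blocked(url: str, blocked_patterns: Iterable[str]) -> bool:
    """Glob-match the host name against patterns such as "ads.*"."""
    host = (urlsplit(url).hostname or "").lower()
    return any(fnmatch.fnmatchcase(host, p.lower()) for p in blocked_patterns)
```

With these helpers, `https://example.com/article/123` has no extension and `https://ads.example.com/tracking` matches `ads.*`. Whether glob matching is fast enough for the hot path is something the benchmark in this file would have to confirm.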

File diff suppressed because it is too large

@@ -4,6 +4,7 @@ from pathlib import Path
 from crawl4ai.async_logger import AsyncLogger
 from crawl4ai.llmtxt import AsyncLLMTextManager

+
 class DocsManager:
     def __init__(self, logger=None):
         self.docs_dir = Path.home() / ".crawl4ai" / "docs"
@@ -21,7 +22,10 @@ class DocsManager:
         """Copy from local docs or download from GitHub"""
         try:
             # Try local first
-            if self.local_docs.exists() and (any(self.local_docs.glob("*.md")) or any(self.local_docs.glob("*.tokens"))):
+            if self.local_docs.exists() and (
+                any(self.local_docs.glob("*.md"))
+                or any(self.local_docs.glob("*.tokens"))
+            ):
                 # Empty the local docs directory
                 for file_path in self.docs_dir.glob("*.md"):
                     file_path.unlink()
@@ -36,14 +40,14 @@ class DocsManager:
             # Fallback to GitHub
             response = requests.get(
                 "https://api.github.com/repos/unclecode/crawl4ai/contents/docs/llm.txt",
-                headers={'Accept': 'application/vnd.github.v3+json'}
+                headers={"Accept": "application/vnd.github.v3+json"},
             )
             response.raise_for_status()
             for item in response.json():
-                if item['type'] == 'file' and item['name'].endswith('.md'):
-                    content = requests.get(item['download_url']).text
-                    with open(self.docs_dir / item['name'], 'w', encoding='utf-8') as f:
+                if item["type"] == "file" and item["name"].endswith(".md"):
+                    content = requests.get(item["download_url"]).text
+                    with open(self.docs_dir / item["name"], "w", encoding="utf-8") as f:
                         f.write(content)
             return True
@@ -57,7 +61,11 @@ class DocsManager:
         # Remove [0-9]+_ prefix
         names = [name.split("_", 1)[1] if name[0].isdigit() else name for name in names]
         # Exclude those end with .xs.md and .q.md
-        names = [name for name in names if not name.endswith(".xs") and not name.endswith(".q")]
+        names = [
+            name
+            for name in names
+            if not name.endswith(".xs") and not name.endswith(".q")
+        ]
         return names

     def generate(self, sections, mode="extended"):
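
The prefix stripping and suffix filtering in the last hunk can be tried in isolation. This is a minimal sketch: `clean_names` is a hypothetical helper, not a function in the repository, and the sample names are made up on the assumption that the method works on file stems such as `1_intro` or `2_config.xs`.

```python
from typing import List


def clean_names(names: List[str]) -> List[str]:
    """Strip a leading "<digits>_" prefix, then drop .xs and .q variants."""
    # Same expression as the hunk: split once on "_" when the name starts with a digit
    stripped = [n.split("_", 1)[1] if n[0].isdigit() else n for n in names]
    # Keep only the base documents, not their condensed (.xs) or question (.q) siblings
    return [n for n in stripped if not n.endswith(".xs") and not n.endswith(".q")]


print(clean_names(["1_intro", "2_config.xs", "3_config.q", "faq"]))  # ['intro', 'faq']
```

In this form, a name that starts with a digit but contains no underscore would raise `IndexError`, so the expression relies on the numbered files always following the `<digits>_<name>` convention.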

File diff suppressed because it is too large

File diff suppressed because it is too large

@@ -903,7 +903,13 @@ class HTML2Text(html.parser.HTMLParser):
                 self.empty_link = False
         if not self.code and not self.pre and not entity_char:
-            data = escape_md_section(data, snob=self.escape_snob, escape_dot=self.escape_dot, escape_plus=self.escape_plus, escape_dash=self.escape_dash)
+            data = escape_md_section(
+                data,
+                snob=self.escape_snob,
+                escape_dot=self.escape_dot,
+                escape_plus=self.escape_plus,
+                escape_dash=self.escape_dash,
+            )
         self.preceding_data = data
         self.o(data, puredata=True)
@@ -1006,6 +1012,7 @@ class HTML2Text(html.parser.HTMLParser):
                         newlines += 1
         return result

+
 def html2text(html: str, baseurl: str = "", bodywidth: Optional[int] = None) -> str:
     if bodywidth is None:
         bodywidth = config.BODY_WIDTH
@@ -1013,6 +1020,7 @@ def html2text(html: str, baseurl: str = "", bodywidth: Optional[int] = None) ->
     return h.handle(html)

+
 class CustomHTML2Text(HTML2Text):
     def __init__(self, *args, handle_code_in_pre=False, **kwargs):
         super().__init__(*args, **kwargs)
@@ -1041,9 +1049,9 @@ class CustomHTML2Text(HTML2Text):
     def update_params(self, **kwargs):
         """Update parameters and set preserved tags."""
         for key, value in kwargs.items():
-            if key == 'preserve_tags':
+            if key == "preserve_tags":
                 self.preserve_tags = set(value)
-            elif key == 'handle_code_in_pre':
+            elif key == "handle_code_in_pre":
                 self.handle_code_in_pre = value
             else:
                 setattr(self, key, value)
@@ -1056,17 +1064,19 @@ class CustomHTML2Text(HTML2Text):
                     self.current_preserved_tag = tag
                     self.preserved_content = []
                 # Format opening tag with attributes
-                attr_str = ''.join(f' {k}="{v}"' for k, v in attrs.items() if v is not None)
-                self.preserved_content.append(f'<{tag}{attr_str}>')
+                attr_str = "".join(
+                    f' {k}="{v}"' for k, v in attrs.items() if v is not None
+                )
+                self.preserved_content.append(f"<{tag}{attr_str}>")
                 self.preserve_depth += 1
                 return
             else:
                 self.preserve_depth -= 1
                 if self.preserve_depth == 0:
-                    self.preserved_content.append(f'</{tag}>')
+                    self.preserved_content.append(f"</{tag}>")
                     # Output the preserved HTML block with proper spacing
-                    preserved_html = ''.join(self.preserved_content)
-                    self.o('\n' + preserved_html + '\n')
+                    preserved_html = "".join(self.preserved_content)
+                    self.o("\n" + preserved_html + "\n")
                     self.current_preserved_tag = None
                 return
@@ -1074,29 +1084,31 @@ class CustomHTML2Text(HTML2Text):
         if self.preserve_depth > 0:
             if start:
                 # Format nested tags with attributes
-                attr_str = ''.join(f' {k}="{v}"' for k, v in attrs.items() if v is not None)
-                self.preserved_content.append(f'<{tag}{attr_str}>')
+                attr_str = "".join(
+                    f' {k}="{v}"' for k, v in attrs.items() if v is not None
+                )
+                self.preserved_content.append(f"<{tag}{attr_str}>")
             else:
-                self.preserved_content.append(f'</{tag}>')
+                self.preserved_content.append(f"</{tag}>")
             return
         # Handle pre tags
-        if tag == 'pre':
+        if tag == "pre":
             if start:
-                self.o('```\n') # Markdown code block start
+                self.o("```\n") # Markdown code block start
                 self.inside_pre = True
             else:
-                self.o('\n```\n') # Markdown code block end
+                self.o("\n```\n") # Markdown code block end
                 self.inside_pre = False
-        elif tag == 'code':
+        elif tag == "code":
             if self.inside_pre and not self.handle_code_in_pre:
                 # Ignore code tags inside pre blocks if handle_code_in_pre is False
                 return
             if start:
-                self.o('`') # Markdown inline code start
+                self.o("`") # Markdown inline code start
                 self.inside_code = True
             else:
-                self.o('`') # Markdown inline code end
+                self.o("`") # Markdown inline code end
                 self.inside_code = False
         else:
             super().handle_tag(tag, attrs, start)
@@ -1113,13 +1125,12 @@ class CustomHTML2Text(HTML2Text):
             return
         if self.inside_code:
             # Inline code: no newlines allowed
-            self.o(data.replace('\n', ' '))
+            self.o(data.replace("\n", " "))
             return
         # Default behavior for other tags
         super().handle_data(data, entity_char)

-
     # # Handle pre tags
     # if tag == 'pre':
     # if start:

@@ -1,2 +1,3 @@
 class OutCallback:
-    def __call__(self, s: str) -> None: ...
+    def __call__(self, s: str) -> None:
+        ...

@@ -210,7 +210,7 @@ def escape_md_section(
     snob: bool = False,
     escape_dot: bool = True,
     escape_plus: bool = True,
-    escape_dash: bool = True
+    escape_dash: bool = True,
 ) -> str:
     """
     Escapes markdown-sensitive characters across whole document sections.
@@ -233,6 +233,7 @@ def escape_md_section(
     return text

+
 def reformat_table(lines: List[str], right_margin: int) -> List[str]:
     """
     Given the lines of a table

@@ -6,6 +6,7 @@ from .async_logger import AsyncLogger, LogLevel
 # Initialize logger
 logger = AsyncLogger(log_level=LogLevel.DEBUG, verbose=True)

+
 def post_install():
     """Run all post-installation tasks"""
     logger.info("Running post-installation setup...", tag="INIT")
@@ -13,18 +14,36 @@ def post_install():
     run_migration()
     logger.success("Post-installation setup completed!", tag="COMPLETE")

+
 def install_playwright():
     logger.info("Installing Playwright browsers...", tag="INIT")
     try:
         # subprocess.check_call([sys.executable, "-m", "playwright", "install", "--with-deps", "--force", "chrome"])
-        subprocess.check_call([sys.executable, "-m", "playwright", "install", "--with-deps", "--force", "chromium"])
-        logger.success("Playwright installation completed successfully.", tag="COMPLETE")
-    except subprocess.CalledProcessError as e:
+        subprocess.check_call(
+            [
+                sys.executable,
+                "-m",
+                "playwright",
+                "install",
+                "--with-deps",
+                "--force",
+                "chromium",
+            ]
+        )
+        logger.success(
+            "Playwright installation completed successfully.", tag="COMPLETE"
+        )
+    except subprocess.CalledProcessError:
         # logger.error(f"Error during Playwright installation: {e}", tag="ERROR")
-        logger.warning(f"Please run '{sys.executable} -m playwright install --with-deps' manually after the installation.")
-    except Exception as e:
+        logger.warning(
+            f"Please run '{sys.executable} -m playwright install --with-deps' manually after the installation."
+        )
+    except Exception:
         # logger.error(f"Unexpected error during Playwright installation: {e}", tag="ERROR")
-        logger.warning(f"Please run '{sys.executable} -m playwright install --with-deps' manually after the installation.")
+        logger.warning(
+            f"Please run '{sys.executable} -m playwright install --with-deps' manually after the installation."
+        )

+
 def run_migration():
     """Initialize database during installation"""
@@ -33,18 +52,26 @@ def run_migration():
         from crawl4ai.async_database import async_db_manager

         asyncio.run(async_db_manager.initialize())
-        logger.success("Database initialization completed successfully.", tag="COMPLETE")
+        logger.success(
+            "Database initialization completed successfully.", tag="COMPLETE"
+        )
     except ImportError:
         logger.warning("Database module not found. Will initialize on first use.")
     except Exception as e:
         logger.warning(f"Database initialization failed: {e}")
         logger.warning("Database will be initialized on first use")

+
 async def run_doctor():
     """Test if Crawl4AI is working properly"""
     logger.info("Running Crawl4AI health check...", tag="INIT")
     try:
-        from .async_webcrawler import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
+        from .async_webcrawler import (
+            AsyncWebCrawler,
+            BrowserConfig,
+            CrawlerRunConfig,
+            CacheMode,
+        )

         browser_config = BrowserConfig(
             headless=True,
@@ -52,7 +79,7 @@ async def run_doctor():
             ignore_https_errors=True,
             light_mode=True,
             viewport_width=1280,
-            viewport_height=720
+            viewport_height=720,
         )

         run_config = CrawlerRunConfig(
@@ -62,10 +89,7 @@ async def run_doctor():
         async with AsyncWebCrawler(config=browser_config) as crawler:
             logger.info("Testing crawling capabilities...", tag="TEST")
-            result = await crawler.arun(
-                url="https://crawl4ai.com",
-                config=run_config
-            )
+            result = await crawler.arun(url="https://crawl4ai.com", config=run_config)

             if result and result.markdown:
                 logger.success("✅ Crawling test passed!", tag="COMPLETE")
@@ -77,7 +101,9 @@ async def run_doctor():
         logger.error(f"❌ Test failed: {e}", tag="ERROR")
         return False

+
 def doctor():
     """Entry point for the doctor command"""
     import asyncio
+
     return asyncio.run(run_doctor())

@@ -1,15 +1,18 @@
-import os, sys
+import os

+
 # Create a function get name of a js script, then load from the CURRENT folder of this script and return its content as string, make sure its error free
 def load_js_script(script_name):
     # Get the path of the current script
     current_script_path = os.path.dirname(os.path.realpath(__file__))
     # Get the path of the script to load
-    script_path = os.path.join(current_script_path, script_name + '.js')
+    script_path = os.path.join(current_script_path, script_name + ".js")
     # Check if the script exists
     if not os.path.exists(script_path):
-        raise ValueError(f"Script {script_name} not found in the folder {current_script_path}")
+        raise ValueError(
+            f"Script {script_name} not found in the folder {current_script_path}"
+        )
     # Load the content of the script
-    with open(script_path, 'r') as f:
+    with open(script_path, "r") as f:
         script_content = f.read()
     return script_content

@@ -11,16 +11,16 @@ from rank_bm25 import BM25Okapi
 from nltk.tokenize import word_tokenize
 from nltk.corpus import stopwords
 from nltk.stem import WordNetLemmatizer
-from litellm import completion, batch_completion
+from litellm import batch_completion
 from .async_logger import AsyncLogger
 import litellm
 import pickle
 import hashlib # <--- ADDED for file-hash
-from fnmatch import fnmatch
 import glob

 litellm.set_verbose = False

+
 def _compute_file_hash(file_path: Path) -> str:
     """Compute MD5 hash for the file's entire content."""
     hash_md5 = hashlib.md5()
@@ -29,13 +29,14 @@ def _compute_file_hash(file_path: Path) -> str:
             hash_md5.update(chunk)
     return hash_md5.hexdigest()

+
 class AsyncLLMTextManager:
     def __init__(
         self,
         docs_dir: Path,
         logger: Optional[AsyncLogger] = None,
         max_concurrent_calls: int = 5,
-        batch_size: int = 3
+        batch_size: int = 3,
     ) -> None:
         self.docs_dir = docs_dir
         self.logger = logger
@@ -51,7 +52,7 @@ class AsyncLLMTextManager:
         contents = []
         for file_path in doc_batch:
             try:
-                with open(file_path, 'r', encoding='utf-8') as f:
+                with open(file_path, "r", encoding="utf-8") as f:
                     contents.append(f.read())
             except Exception as e:
                 self.logger.error(f"Error reading {file_path}: {str(e)}")
@@ -77,43 +78,53 @@ Wrap your response in <index>...</index> tags.
         # Prepare messages for batch processing
         messages_list = [
             [
-                {"role": "user", "content": f"{prompt}\n\nGenerate index for this documentation:\n\n{content}"}
+                {
+                    "role": "user",
+                    "content": f"{prompt}\n\nGenerate index for this documentation:\n\n{content}",
+                }
             ]
-            for content in contents if content
+            for content in contents
+            if content
         ]
         try:
             responses = batch_completion(
                 model="anthropic/claude-3-5-sonnet-latest",
                 messages=messages_list,
-                logger_fn=None
+                logger_fn=None,
             )
             # Process responses and save index files
             for response, file_path in zip(responses, doc_batch):
                 try:
                     index_content_match = re.search(
-                        r'<index>(.*?)</index>',
+                        r"<index>(.*?)</index>",
                         response.choices[0].message.content,
-                        re.DOTALL
+                        re.DOTALL,
                     )
                     if not index_content_match:
-                        self.logger.warning(f"No <index>...</index> content found for {file_path}")
+                        self.logger.warning(
+                            f"No <index>...</index> content found for {file_path}"
+                        )
                         continue
                     index_content = re.sub(
                         r"\n\s*\n", "\n", index_content_match.group(1)
                     ).strip()
                     if index_content:
-                        index_file = file_path.with_suffix('.q.md')
-                        with open(index_file, 'w', encoding='utf-8') as f:
+                        index_file = file_path.with_suffix(".q.md")
+                        with open(index_file, "w", encoding="utf-8") as f:
                             f.write(index_content)
                         self.logger.info(f"Created index file: {index_file}")
                     else:
-                        self.logger.warning(f"No index content found in response for {file_path}")
+                        self.logger.warning(
+                            f"No index content found in response for {file_path}"
+                        )
                 except Exception as e:
-                    self.logger.error(f"Error processing response for {file_path}: {str(e)}")
+                    self.logger.error(
+                        f"Error processing response for {file_path}: {str(e)}"
+                    )
         except Exception as e:
             self.logger.error(f"Error in batch completion: {str(e)}")
@@ -171,7 +182,12 @@ Wrap your response in <index>...</index> tags.
         lemmatizer = WordNetLemmatizer()
         stop_words = set(stopwords.words("english")) - {
-            "how", "what", "when", "where", "why", "which",
+            "how",
+            "what",
+            "when",
+            "where",
+            "why",
+            "which",
         }
         tokens = []
@@ -222,7 +238,9 @@ Wrap your response in <index>...</index> tags.
         self.logger.info("Checking which .q.md files need (re)indexing...")
         # Gather all .q.md files
-        q_files = [self.docs_dir / f for f in os.listdir(self.docs_dir) if f.endswith(".q.md")]
+        q_files = [
+            self.docs_dir / f for f in os.listdir(self.docs_dir) if f.endswith(".q.md")
+        ]
         # We'll store known (unchanged) facts in these lists
         existing_facts: List[str] = []
@@ -243,7 +261,9 @@ Wrap your response in <index>...</index> tags.
             # Otherwise, load the existing cache and compare hash
             cache = self._load_or_create_token_cache(qf)
             # If the .q.tokens was out of date (i.e. changed hash), we reindex
-            if len(cache["facts"]) == 0 or cache.get("content_hash") != _compute_file_hash(qf):
+            if len(cache["facts"]) == 0 or cache.get(
+                "content_hash"
+            ) != _compute_file_hash(qf):
                 needSet.append(qf)
             else:
                 # File is unchanged → retrieve cached token data
@@ -255,20 +275,29 @@ Wrap your response in <index>...</index> tags.
if not needSet and not clear_cache: if not needSet and not clear_cache:
# If no file needs reindexing, try loading existing index # If no file needs reindexing, try loading existing index
if self.maybe_load_bm25_index(clear_cache=False): if self.maybe_load_bm25_index(clear_cache=False):
self.logger.info("No new/changed .q.md files found. Using existing BM25 index.") self.logger.info(
"No new/changed .q.md files found. Using existing BM25 index."
)
return return
else: else:
# If there's no existing index, we must build a fresh index from the old caches # If there's no existing index, we must build a fresh index from the old caches
self.logger.info("No existing BM25 index found. Building from cached facts.") self.logger.info(
"No existing BM25 index found. Building from cached facts."
)
if existing_facts: if existing_facts:
self.logger.info(f"Building BM25 index with {len(existing_facts)} cached facts.") self.logger.info(
f"Building BM25 index with {len(existing_facts)} cached facts."
)
self.bm25_index = BM25Okapi(existing_tokens) self.bm25_index = BM25Okapi(existing_tokens)
self.tokenized_facts = existing_facts self.tokenized_facts = existing_facts
with open(self.bm25_index_file, "wb") as f: with open(self.bm25_index_file, "wb") as f:
pickle.dump({ pickle.dump(
"bm25_index": self.bm25_index, {
"tokenized_facts": self.tokenized_facts "bm25_index": self.bm25_index,
}, f) "tokenized_facts": self.tokenized_facts,
},
f,
)
else: else:
self.logger.warning("No facts found at all. Index remains empty.") self.logger.warning("No facts found at all. Index remains empty.")
return return
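Review note: the hunks above decide whether a `.q.md` file needs (re)indexing by comparing a cached `content_hash` with a freshly computed one, and by treating an empty `facts` list as stale. A minimal sketch of that check follows. The helper names `compute_file_hash` and `needs_reindex` and the use of SHA-256 are assumptions for illustration; the hashing used by the project's own `_compute_file_hash` is not visible in this diff.

```python
import hashlib
import tempfile
from pathlib import Path


def compute_file_hash(path: Path) -> str:
    """Hash the file bytes (SHA-256 here; the project's choice may differ)."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def needs_reindex(path: Path, cache: dict) -> bool:
    """True when the cache holds no facts or was built from different content."""
    no_facts = len(cache.get("facts", [])) == 0
    stale = cache.get("content_hash") != compute_file_hash(path)
    return no_facts or stale


with tempfile.TemporaryDirectory() as tmp:
    qf = Path(tmp) / "1_intro.q.md"
    qf.write_text("fact one\n", encoding="utf-8")
    cache = {"facts": ["fact one"], "content_hash": compute_file_hash(qf)}
    unchanged = needs_reindex(qf, cache)  # cache matches the file

    qf.write_text("fact one\nfact two\n", encoding="utf-8")
    changed = needs_reindex(qf, cache)  # file edited after caching
    empty_cache = needs_reindex(qf, {"facts": []})  # nothing cached yet
```

Only files for which the check returns true need to be tokenized again; the rest can reuse their cached tokens when the BM25 index is rebuilt.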
@@ -311,7 +340,9 @@ Wrap your response in <index>...</index> tags.
self._save_token_cache(file, fresh_cache) self._save_token_cache(file, fresh_cache)
mem_usage = process.memory_info().rss / 1024 / 1024 mem_usage = process.memory_info().rss / 1024 / 1024
self.logger.debug(f"Memory usage after {file.name}: {mem_usage:.2f}MB") self.logger.debug(
f"Memory usage after {file.name}: {mem_usage:.2f}MB"
)
except Exception as e: except Exception as e:
self.logger.error(f"Error processing {file}: {str(e)}") self.logger.error(f"Error processing {file}: {str(e)}")
@@ -328,21 +359,28 @@ Wrap your response in <index>...</index> tags.
all_tokens = existing_tokens + new_tokens all_tokens = existing_tokens + new_tokens
# 3) Build BM25 index from combined facts # 3) Build BM25 index from combined facts
self.logger.info(f"Building BM25 index with {len(all_facts)} total facts (old + new).") self.logger.info(
f"Building BM25 index with {len(all_facts)} total facts (old + new)."
)
self.bm25_index = BM25Okapi(all_tokens) self.bm25_index = BM25Okapi(all_tokens)
self.tokenized_facts = all_facts self.tokenized_facts = all_facts
# 4) Save the updated BM25 index to disk # 4) Save the updated BM25 index to disk
with open(self.bm25_index_file, "wb") as f: with open(self.bm25_index_file, "wb") as f:
pickle.dump({ pickle.dump(
"bm25_index": self.bm25_index, {
"tokenized_facts": self.tokenized_facts "bm25_index": self.bm25_index,
}, f) "tokenized_facts": self.tokenized_facts,
},
f,
)
final_mem = process.memory_info().rss / 1024 / 1024 final_mem = process.memory_info().rss / 1024 / 1024
self.logger.info(f"Search index updated. Final memory usage: {final_mem:.2f}MB") self.logger.info(f"Search index updated. Final memory usage: {final_mem:.2f}MB")
async def generate_index_files(self, force_generate_facts: bool = False, clear_bm25_cache: bool = False) -> None: async def generate_index_files(
self, force_generate_facts: bool = False, clear_bm25_cache: bool = False
) -> None:
""" """
Generate index files for all documents in parallel batches Generate index files for all documents in parallel batches
@@ -353,15 +391,17 @@ Wrap your response in <index>...</index> tags.
self.logger.info("Starting index generation for documentation files.") self.logger.info("Starting index generation for documentation files.")
md_files = [ md_files = [
self.docs_dir / f for f in os.listdir(self.docs_dir) self.docs_dir / f
if f.endswith('.md') and not any(f.endswith(x) for x in ['.q.md', '.xs.md']) for f in os.listdir(self.docs_dir)
if f.endswith(".md") and not any(f.endswith(x) for x in [".q.md", ".xs.md"])
] ]
# Filter out files that already have .q files unless force=True # Filter out files that already have .q files unless force=True
if not force_generate_facts: if not force_generate_facts:
md_files = [ md_files = [
f for f in md_files f
if not (self.docs_dir / f.name.replace('.md', '.q.md')).exists() for f in md_files
if not (self.docs_dir / f.name.replace(".md", ".q.md")).exists()
] ]
if not md_files: if not md_files:
@@ -369,8 +409,10 @@ Wrap your response in <index>...</index> tags.
else: else:
# Process documents in batches # Process documents in batches
for i in range(0, len(md_files), self.batch_size): for i in range(0, len(md_files), self.batch_size):
batch = md_files[i:i + self.batch_size] batch = md_files[i : i + self.batch_size]
self.logger.info(f"Processing batch {i//self.batch_size + 1}/{(len(md_files)//self.batch_size) + 1}") self.logger.info(
f"Processing batch {i//self.batch_size + 1}/{(len(md_files)//self.batch_size) + 1}"
)
await self._process_document_batch(batch) await self._process_document_batch(batch)
self.logger.info("Index generation complete, building/updating search index.") self.logger.info("Index generation complete, building/updating search index.")
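Review note: the batch total in the progress log above is computed as `(len(md_files)//self.batch_size) + 1`, which reports one batch too many whenever the file count is an exact multiple of the batch size (6 files with a batch size of 3 logs "1/3" and "2/3"). A minimal sketch of a ceiling-division alternative, assuming a hypothetical helper `batch_progress` that is not part of the codebase:

```python
def batch_progress(n_items: int, batch_size: int):
    """Yield (batch_number, total_batches, start, stop) for each batch."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    total = -(-n_items // batch_size)  # ceiling division without floats
    for start in range(0, n_items, batch_size):
        stop = min(start + batch_size, n_items)
        yield start // batch_size + 1, total, start, stop


for number, total, start, stop in batch_progress(6, 3):
    print(f"Processing batch {number}/{total} (items {start}..{stop - 1})")
```

The slice bounds are the same ones the loop already uses; only the reported total changes.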
@@ -378,21 +420,31 @@ Wrap your response in <index>...</index> tags.
def generate(self, sections: List[str], mode: str = "extended") -> str: def generate(self, sections: List[str], mode: str = "extended") -> str:
# Get all markdown files # Get all markdown files
all_files = glob.glob(str(self.docs_dir / "[0-9]*.md")) + \ all_files = glob.glob(str(self.docs_dir / "[0-9]*.md")) + glob.glob(
glob.glob(str(self.docs_dir / "[0-9]*.xs.md")) str(self.docs_dir / "[0-9]*.xs.md")
)
# Extract base names without extensions # Extract base names without extensions
base_docs = {Path(f).name.split('.')[0] for f in all_files base_docs = {
if not Path(f).name.endswith('.q.md')} Path(f).name.split(".")[0]
for f in all_files
if not Path(f).name.endswith(".q.md")
}
# Filter by sections if provided # Filter by sections if provided
if sections: if sections:
base_docs = {doc for doc in base_docs base_docs = {
if any(section.lower() in doc.lower() for section in sections)} doc
for doc in base_docs
if any(section.lower() in doc.lower() for section in sections)
}
# Get file paths based on mode # Get file paths based on mode
files = [] files = []
for doc in sorted(base_docs, key=lambda x: int(x.split('_')[0]) if x.split('_')[0].isdigit() else 999999): for doc in sorted(
base_docs,
key=lambda x: int(x.split("_")[0]) if x.split("_")[0].isdigit() else 999999,
):
if mode == "condensed": if mode == "condensed":
xs_file = self.docs_dir / f"{doc}.xs.md" xs_file = self.docs_dir / f"{doc}.xs.md"
regular_file = self.docs_dir / f"{doc}.md" regular_file = self.docs_dir / f"{doc}.md"
@@ -404,7 +456,7 @@ Wrap your response in <index>...</index> tags.
content = [] content = []
for file in files: for file in files:
try: try:
with open(file, 'r', encoding='utf-8') as f: with open(file, "r", encoding="utf-8") as f:
fname = Path(file).name fname = Path(file).name
content.append(f"{'#'*20}\n# {fname}\n{'#'*20}\n\n{f.read()}") content.append(f"{'#'*20}\n# {fname}\n{'#'*20}\n\n{f.read()}")
except Exception as e: except Exception as e:
@@ -443,15 +495,9 @@ Wrap your response in <index>...</index> tags.
for file, _ in ranked_files: for file, _ in ranked_files:
main_doc = str(file).replace(".q.md", ".md") main_doc = str(file).replace(".q.md", ".md")
if os.path.exists(self.docs_dir / main_doc): if os.path.exists(self.docs_dir / main_doc):
with open(self.docs_dir / main_doc, "r", encoding='utf-8') as f: with open(self.docs_dir / main_doc, "r", encoding="utf-8") as f:
only_file_name = main_doc.split("/")[-1] only_file_name = main_doc.split("/")[-1]
content = [ content = ["#" * 20, f"# {only_file_name}", "#" * 20, "", f.read()]
"#" * 20,
f"# {only_file_name}",
"#" * 20,
"",
f.read()
]
results.append("\n".join(content)) results.append("\n".join(content))
return "\n\n---\n\n".join(results) return "\n\n---\n\n".join(results)
@@ -482,7 +528,9 @@ Wrap your response in <index>...</index> tags.
if len(components) == 3: if len(components) == 3:
code_ref = components[2].strip() code_ref = components[2].strip()
code_tokens = self.preprocess_text(code_ref) code_tokens = self.preprocess_text(code_ref)
code_match_score = len(set(query_tokens) & set(code_tokens)) / len(query_tokens) code_match_score = len(set(query_tokens) & set(code_tokens)) / len(
query_tokens
)
file_data[file_path]["total_score"] += score file_data[file_path]["total_score"] += score
file_data[file_path]["match_count"] += 1 file_data[file_path]["match_count"] += 1


@@ -2,41 +2,51 @@ from abc import ABC, abstractmethod
from typing import Optional, Dict, Any, Tuple from typing import Optional, Dict, Any, Tuple
from .models import MarkdownGenerationResult from .models import MarkdownGenerationResult
from .html2text import CustomHTML2Text from .html2text import CustomHTML2Text
from .content_filter_strategy import RelevantContentFilter, BM25ContentFilter from .content_filter_strategy import RelevantContentFilter
import re import re
from urllib.parse import urljoin from urllib.parse import urljoin
# Pre-compile the regex pattern # Pre-compile the regex pattern
LINK_PATTERN = re.compile(r'!?\[([^\]]+)\]\(([^)]+?)(?:\s+"([^"]*)")?\)') LINK_PATTERN = re.compile(r'!?\[([^\]]+)\]\(([^)]+?)(?:\s+"([^"]*)")?\)')
def fast_urljoin(base: str, url: str) -> str: def fast_urljoin(base: str, url: str) -> str:
"""Fast URL joining for common cases.""" """Fast URL joining for common cases."""
if url.startswith(('http://', 'https://', 'mailto:', '//')): if url.startswith(("http://", "https://", "mailto:", "//")):
return url return url
if url.startswith('/'): if url.startswith("/"):
# Handle absolute paths # Handle absolute paths
if base.endswith('/'): if base.endswith("/"):
return base[:-1] + url return base[:-1] + url
return base + url return base + url
return urljoin(base, url) return urljoin(base, url)
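Review note: for links that start with `/`, `fast_urljoin` appends the path to `base` as-is. That matches `urllib.parse.urljoin` only when `base` is a bare origin. Whether callers always pass an origin as `base_url` is not visible in this diff, so the sketch below only illustrates the difference; the function body is copied from the hunk above and the example URLs are made up.

```python
from urllib.parse import urljoin


def fast_urljoin(base: str, url: str) -> str:
    """Copy of the fast path shown in the diff, reproduced for comparison."""
    if url.startswith(("http://", "https://", "mailto:", "//")):
        return url
    if url.startswith("/"):
        if base.endswith("/"):
            return base[:-1] + url
        return base + url
    return urljoin(base, url)


origin = "https://example.com"
page = "https://example.com/docs/page.html"

same_on_origin = fast_urljoin(origin, "/about") == urljoin(origin, "/about")
fast_on_page = fast_urljoin(page, "/about")  # path is appended to the page URL
standard_on_page = urljoin(page, "/about")  # path is resolved against the origin
```

If page URLs can reach this function, falling through to `urljoin` for the leading-slash case would give the standard result at the cost of the shortcut.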
class MarkdownGenerationStrategy(ABC): class MarkdownGenerationStrategy(ABC):
"""Abstract base class for markdown generation strategies.""" """Abstract base class for markdown generation strategies."""
def __init__(self, content_filter: Optional[RelevantContentFilter] = None, options: Optional[Dict[str, Any]] = None):
def __init__(
self,
content_filter: Optional[RelevantContentFilter] = None,
options: Optional[Dict[str, Any]] = None,
):
self.content_filter = content_filter self.content_filter = content_filter
self.options = options or {} self.options = options or {}
@abstractmethod @abstractmethod
def generate_markdown(self, def generate_markdown(
cleaned_html: str, self,
base_url: str = "", cleaned_html: str,
html2text_options: Optional[Dict[str, Any]] = None, base_url: str = "",
content_filter: Optional[RelevantContentFilter] = None, html2text_options: Optional[Dict[str, Any]] = None,
citations: bool = True, content_filter: Optional[RelevantContentFilter] = None,
**kwargs) -> MarkdownGenerationResult: citations: bool = True,
**kwargs,
) -> MarkdownGenerationResult:
"""Generate markdown from cleaned HTML.""" """Generate markdown from cleaned HTML."""
pass pass
class DefaultMarkdownGenerator(MarkdownGenerationStrategy): class DefaultMarkdownGenerator(MarkdownGenerationStrategy):
""" """
Default implementation of markdown generation strategy. Default implementation of markdown generation strategy.
@@ -54,10 +64,17 @@ class DefaultMarkdownGenerator(MarkdownGenerationStrategy):
Returns: Returns:
MarkdownGenerationResult: Result containing raw markdown, fit markdown, fit HTML, and references markdown. MarkdownGenerationResult: Result containing raw markdown, fit markdown, fit HTML, and references markdown.
""" """
def __init__(self, content_filter: Optional[RelevantContentFilter] = None, options: Optional[Dict[str, Any]] = None):
def __init__(
self,
content_filter: Optional[RelevantContentFilter] = None,
options: Optional[Dict[str, Any]] = None,
):
super().__init__(content_filter, options) super().__init__(content_filter, options)
def convert_links_to_citations(self, markdown: str, base_url: str = "") -> Tuple[str, str]: def convert_links_to_citations(
self, markdown: str, base_url: str = ""
) -> Tuple[str, str]:
""" """
Convert links in markdown to citations. Convert links in markdown to citations.
@@ -83,28 +100,34 @@ class DefaultMarkdownGenerator(MarkdownGenerationStrategy):
counter = 1 counter = 1
for match in LINK_PATTERN.finditer(markdown): for match in LINK_PATTERN.finditer(markdown):
parts.append(markdown[last_end:match.start()]) parts.append(markdown[last_end : match.start()])
text, url, title = match.groups() text, url, title = match.groups()
# Use cached URL if available, otherwise compute and cache # Use cached URL if available, otherwise compute and cache
if base_url and not url.startswith(('http://', 'https://', 'mailto:')): if base_url and not url.startswith(("http://", "https://", "mailto:")):
if url not in url_cache: if url not in url_cache:
url_cache[url] = fast_urljoin(base_url, url) url_cache[url] = fast_urljoin(base_url, url)
url = url_cache[url] url = url_cache[url]
if url not in link_map: if url not in link_map:
desc = [] desc = []
if title: desc.append(title) if title:
if text and text != title: desc.append(text) desc.append(title)
if text and text != title:
desc.append(text)
link_map[url] = (counter, ": " + " - ".join(desc) if desc else "") link_map[url] = (counter, ": " + " - ".join(desc) if desc else "")
counter += 1 counter += 1
num = link_map[url][0] num = link_map[url][0]
parts.append(f"{text}⟨{num}⟩" if not match.group(0).startswith('!') else f"![{text}⟨{num}⟩]") parts.append(
f"{text}⟨{num}⟩"
if not match.group(0).startswith("!")
else f"![{text}⟨{num}⟩]"
)
last_end = match.end() last_end = match.end()
parts.append(markdown[last_end:]) parts.append(markdown[last_end:])
converted_text = ''.join(parts) converted_text = "".join(parts)
# Pre-build reference strings # Pre-build reference strings
references = ["\n\n## References\n\n"] references = ["\n\n## References\n\n"]
@@ -113,16 +136,18 @@ class DefaultMarkdownGenerator(MarkdownGenerationStrategy):
for url, (num, desc) in sorted(link_map.items(), key=lambda x: x[1][0]) for url, (num, desc) in sorted(link_map.items(), key=lambda x: x[1][0])
) )
return converted_text, ''.join(references) return converted_text, "".join(references)
def generate_markdown(self, def generate_markdown(
cleaned_html: str, self,
base_url: str = "", cleaned_html: str,
html2text_options: Optional[Dict[str, Any]] = None, base_url: str = "",
options: Optional[Dict[str, Any]] = None, html2text_options: Optional[Dict[str, Any]] = None,
content_filter: Optional[RelevantContentFilter] = None, options: Optional[Dict[str, Any]] = None,
citations: bool = True, content_filter: Optional[RelevantContentFilter] = None,
**kwargs) -> MarkdownGenerationResult: citations: bool = True,
**kwargs,
) -> MarkdownGenerationResult:
""" """
Generate markdown with citations from cleaned HTML. Generate markdown with citations from cleaned HTML.
@@ -147,14 +172,14 @@ class DefaultMarkdownGenerator(MarkdownGenerationStrategy):
# Initialize HTML2Text with default options for better conversion # Initialize HTML2Text with default options for better conversion
h = CustomHTML2Text(baseurl=base_url) h = CustomHTML2Text(baseurl=base_url)
default_options = { default_options = {
'body_width': 0, # Disable text wrapping "body_width": 0, # Disable text wrapping
'ignore_emphasis': False, "ignore_emphasis": False,
'ignore_links': False, "ignore_links": False,
'ignore_images': False, "ignore_images": False,
'protect_links': True, "protect_links": True,
'single_line_break': True, "single_line_break": True,
'mark_code': True, "mark_code": True,
'escape_snob': False "escape_snob": False,
} }
# Update with custom options if provided # Update with custom options if provided
@@ -179,16 +204,17 @@ class DefaultMarkdownGenerator(MarkdownGenerationStrategy):
except Exception as e: except Exception as e:
raw_markdown = f"Error converting HTML to markdown: {str(e)}" raw_markdown = f"Error converting HTML to markdown: {str(e)}"
raw_markdown = raw_markdown.replace(' ```', '```') raw_markdown = raw_markdown.replace(" ```", "```")
# Convert links to citations # Convert links to citations
markdown_with_citations: str = raw_markdown markdown_with_citations: str = raw_markdown
references_markdown: str = "" references_markdown: str = ""
if citations: if citations:
try: try:
markdown_with_citations, references_markdown = self.convert_links_to_citations( (
raw_markdown, base_url markdown_with_citations,
) references_markdown,
) = self.convert_links_to_citations(raw_markdown, base_url)
except Exception as e: except Exception as e:
markdown_with_citations = raw_markdown markdown_with_citations = raw_markdown
references_markdown = f"Error generating citations: {str(e)}" references_markdown = f"Error generating citations: {str(e)}"
@@ -200,7 +226,9 @@ class DefaultMarkdownGenerator(MarkdownGenerationStrategy):
try: try:
content_filter = content_filter or self.content_filter content_filter = content_filter or self.content_filter
filtered_html = content_filter.filter_content(cleaned_html) filtered_html = content_filter.filter_content(cleaned_html)
filtered_html = '\n'.join('<div>{}</div>'.format(s) for s in filtered_html) filtered_html = "\n".join(
"<div>{}</div>".format(s) for s in filtered_html
)
fit_markdown = h.handle(filtered_html) fit_markdown = h.handle(filtered_html)
except Exception as e: except Exception as e:
fit_markdown = f"Error generating fit markdown: {str(e)}" fit_markdown = f"Error generating fit markdown: {str(e)}"
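Review note: `convert_links_to_citations` walks the markdown with `LINK_PATTERN`, assigns each distinct URL a number on first sight, and emits a reference list at the end. The following is a simplified sketch of that flow, not the method itself: it skips base-URL resolution, titles, and the image case, and the function name `to_citations` and the `⟨n⟩` marker style are assumptions for illustration.

```python
import re

# Same pattern as LINK_PATTERN in the diff: optional "!", then [text](url "title")
LINK_PATTERN = re.compile(r'!?\[([^\]]+)\]\(([^)]+?)(?:\s+"([^"]*)")?\)')


def to_citations(markdown: str):
    """Replace inline links with numbered markers and build a reference list."""
    link_numbers = {}  # url -> citation number; the first occurrence wins
    parts, last_end = [], 0
    for match in LINK_PATTERN.finditer(markdown):
        parts.append(markdown[last_end:match.start()])
        text, url, _title = match.groups()
        num = link_numbers.setdefault(url, len(link_numbers) + 1)
        parts.append(f"{text}⟨{num}⟩")
        last_end = match.end()
    parts.append(markdown[last_end:])
    references = "".join(f"⟨{n}⟩ {u}\n" for u, n in link_numbers.items())
    return "".join(parts), references


body, refs = to_citations(
    'See [docs](https://a.example/x) and [home](https://b.example) '
    'or [again](https://a.example/x "T").'
)
```

A URL that appears twice reuses its first number, so the reference list stays as short as the set of distinct targets.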


@@ -1,13 +1,11 @@
import os import os
import asyncio import asyncio
import logging
from pathlib import Path from pathlib import Path
import aiosqlite import aiosqlite
from typing import Optional from typing import Optional
import xxhash import xxhash
import aiofiles import aiofiles
import shutil import shutil
import time
from datetime import datetime from datetime import datetime
from .async_logger import AsyncLogger, LogLevel from .async_logger import AsyncLogger, LogLevel
@@ -17,6 +15,7 @@ logger = AsyncLogger(log_level=LogLevel.DEBUG, verbose=True)
# logging.basicConfig(level=logging.INFO) # logging.basicConfig(level=logging.INFO)
# logger = logging.getLogger(__name__) # logger = logging.getLogger(__name__)
class DatabaseMigration: class DatabaseMigration:
def __init__(self, db_path: str): def __init__(self, db_path: str):
self.db_path = db_path self.db_path = db_path
@@ -24,11 +23,11 @@ class DatabaseMigration:
def _ensure_content_dirs(self, base_path: str) -> dict: def _ensure_content_dirs(self, base_path: str) -> dict:
dirs = { dirs = {
'html': 'html_content', "html": "html_content",
'cleaned': 'cleaned_html', "cleaned": "cleaned_html",
'markdown': 'markdown_content', "markdown": "markdown_content",
'extracted': 'extracted_content', "extracted": "extracted_content",
'screenshots': 'screenshots' "screenshots": "screenshots",
} }
content_paths = {} content_paths = {}
for key, dirname in dirs.items(): for key, dirname in dirs.items():
@@ -52,7 +51,7 @@ class DatabaseMigration:
file_path = os.path.join(self.content_paths[content_type], content_hash) file_path = os.path.join(self.content_paths[content_type], content_hash)
if not os.path.exists(file_path): if not os.path.exists(file_path):
async with aiofiles.open(file_path, 'w', encoding='utf-8') as f: async with aiofiles.open(file_path, "w", encoding="utf-8") as f:
await f.write(content) await f.write(content)
return content_hash return content_hash
@@ -66,24 +65,36 @@ class DatabaseMigration:
async with aiosqlite.connect(self.db_path) as db: async with aiosqlite.connect(self.db_path) as db:
# Get all rows # Get all rows
async with db.execute( async with db.execute(
'''SELECT url, html, cleaned_html, markdown, """SELECT url, html, cleaned_html, markdown,
extracted_content, screenshot FROM crawled_data''' extracted_content, screenshot FROM crawled_data"""
) as cursor: ) as cursor:
rows = await cursor.fetchall() rows = await cursor.fetchall()
migrated_count = 0 migrated_count = 0
for row in rows: for row in rows:
url, html, cleaned_html, markdown, extracted_content, screenshot = row (
url,
html,
cleaned_html,
markdown,
extracted_content,
screenshot,
) = row
# Store content in files and get hashes # Store content in files and get hashes
html_hash = await self._store_content(html, 'html') html_hash = await self._store_content(html, "html")
cleaned_hash = await self._store_content(cleaned_html, 'cleaned') cleaned_hash = await self._store_content(cleaned_html, "cleaned")
markdown_hash = await self._store_content(markdown, 'markdown') markdown_hash = await self._store_content(markdown, "markdown")
extracted_hash = await self._store_content(extracted_content, 'extracted') extracted_hash = await self._store_content(
screenshot_hash = await self._store_content(screenshot, 'screenshots') extracted_content, "extracted"
)
screenshot_hash = await self._store_content(
screenshot, "screenshots"
)
# Update database with hashes # Update database with hashes
await db.execute(''' await db.execute(
"""
UPDATE crawled_data UPDATE crawled_data
SET html = ?, SET html = ?,
cleaned_html = ?, cleaned_html = ?,
@@ -91,26 +102,37 @@ class DatabaseMigration:
extracted_content = ?, extracted_content = ?,
screenshot = ? screenshot = ?
WHERE url = ? WHERE url = ?
''', (html_hash, cleaned_hash, markdown_hash, """,
extracted_hash, screenshot_hash, url)) (
html_hash,
cleaned_hash,
markdown_hash,
extracted_hash,
screenshot_hash,
url,
),
)
migrated_count += 1 migrated_count += 1
if migrated_count % 100 == 0: if migrated_count % 100 == 0:
logger.info(f"Migrated {migrated_count} records...", tag="INIT") logger.info(f"Migrated {migrated_count} records...", tag="INIT")
await db.commit() await db.commit()
logger.success(f"Migration completed. {migrated_count} records processed.", tag="COMPLETE") logger.success(
f"Migration completed. {migrated_count} records processed.",
tag="COMPLETE",
)
except Exception as e: except Exception as e:
# logger.error(f"Migration failed: {e}") # logger.error(f"Migration failed: {e}")
logger.error( logger.error(
message="Migration failed: {error}", message="Migration failed: {error}",
tag="ERROR", tag="ERROR",
params={"error": str(e)} params={"error": str(e)},
) )
raise e raise e
async def backup_database(db_path: str) -> str: async def backup_database(db_path: str) -> str:
"""Create backup of existing database""" """Create backup of existing database"""
if not os.path.exists(db_path): if not os.path.exists(db_path):
@@ -118,7 +140,7 @@ async def backup_database(db_path: str) -> str:
return None return None
# Create backup with timestamp # Create backup with timestamp
timestamp = datetime.now().strftime('%Y%m%d_%H%M%S') timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
backup_path = f"{db_path}.backup_{timestamp}" backup_path = f"{db_path}.backup_{timestamp}"
try: try:
@@ -132,12 +154,11 @@ async def backup_database(db_path: str) -> str:
except Exception as e: except Exception as e:
# logger.error(f"Backup failed: {e}") # logger.error(f"Backup failed: {e}")
logger.error( logger.error(
message="Migration failed: {error}", message="Migration failed: {error}", tag="ERROR", params={"error": str(e)}
tag="ERROR", )
params={"error": str(e)}
)
raise e raise e
async def run_migration(db_path: Optional[str] = None): async def run_migration(db_path: Optional[str] = None):
"""Run database migration""" """Run database migration"""
if db_path is None: if db_path is None:
@@ -155,14 +176,19 @@ async def run_migration(db_path: Optional[str] = None):
migration = DatabaseMigration(db_path) migration = DatabaseMigration(db_path)
await migration.migrate_database() await migration.migrate_database()
def main(): def main():
"""CLI entry point for migration""" """CLI entry point for migration"""
import argparse import argparse
parser = argparse.ArgumentParser(description='Migrate Crawl4AI database to file-based storage')
parser.add_argument('--db-path', help='Custom database path') parser = argparse.ArgumentParser(
description="Migrate Crawl4AI database to file-based storage"
)
parser.add_argument("--db-path", help="Custom database path")
args = parser.parse_args() args = parser.parse_args()
asyncio.run(run_migration(args.db_path)) asyncio.run(run_migration(args.db_path))
if __name__ == "__main__": if __name__ == "__main__":
main() main()
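Review note: the migration above moves each content column into a file named after the content's hash and stores only the hash in the database, so identical content is written once. A synchronous sketch of that idea follows. The function name `store_content`, the SHA-256 digest (standing in for the `xxhash` import used by the module), and the empty-content branch are assumptions; the real `_store_content` is only partly visible in this diff.

```python
import hashlib
import os
import tempfile


def store_content(content: str, directory: str) -> str:
    """Write content to a file named after its hash and return the hash."""
    if not content:
        return ""  # assumption: nothing is stored for empty content
    # SHA-256 stands in for the xxhash digest used by the migration module
    content_hash = hashlib.sha256(content.encode("utf-8")).hexdigest()
    file_path = os.path.join(directory, content_hash)
    if not os.path.exists(file_path):  # identical content is written only once
        with open(file_path, "w", encoding="utf-8") as f:
            f.write(content)
    return content_hash


html_dir = tempfile.mkdtemp()
first = store_content("<html>a</html>", html_dir)
second = store_content("<html>a</html>", html_dir)  # reuses the existing file
```

Because the file name depends only on the content, repeating the migration over the same rows does not duplicate files.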


@@ -2,75 +2,86 @@ from functools import lru_cache
from pathlib import Path from pathlib import Path
import subprocess, os import subprocess, os
import shutil import shutil
import tarfile
from .model_loader import * from .model_loader import *
import argparse import argparse
import urllib.request
from crawl4ai.config import MODEL_REPO_BRANCH from crawl4ai.config import MODEL_REPO_BRANCH
__location__ = os.path.realpath(os.path.join(os.getcwd(), os.path.dirname(__file__))) __location__ = os.path.realpath(os.path.join(os.getcwd(), os.path.dirname(__file__)))
@lru_cache() @lru_cache()
def get_available_memory(device): def get_available_memory(device):
import torch import torch
if device.type == 'cuda':
if device.type == "cuda":
return torch.cuda.get_device_properties(device).total_memory return torch.cuda.get_device_properties(device).total_memory
elif device.type == 'mps': elif device.type == "mps":
return 48 * 1024 ** 3 # Assuming 8GB for MPS, as a conservative estimate return 48 * 1024**3 # Assuming 8GB for MPS, as a conservative estimate
else: else:
return 0 return 0
@lru_cache() @lru_cache()
def calculate_batch_size(device): def calculate_batch_size(device):
available_memory = get_available_memory(device) available_memory = get_available_memory(device)
if device.type == 'cpu': if device.type == "cpu":
return 16 return 16
elif device.type in ['cuda', 'mps']: elif device.type in ["cuda", "mps"]:
# Adjust these thresholds based on your model size and available memory # Adjust these thresholds based on your model size and available memory
if available_memory >= 31 * 1024 ** 3: # > 32GB if available_memory >= 31 * 1024**3: # > 32GB
return 256 return 256
elif available_memory >= 15 * 1024 ** 3: # > 16GB to 32GB elif available_memory >= 15 * 1024**3: # > 16GB to 32GB
return 128 return 128
elif available_memory >= 8 * 1024 ** 3: # 8GB to 16GB elif available_memory >= 8 * 1024**3: # 8GB to 16GB
return 64 return 64
else: else:
return 32 return 32
else: else:
return 16 # Default batch size return 16 # Default batch size
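Review note: `calculate_batch_size` maps reported device memory to a batch size through fixed thresholds, and `get_available_memory` returns `48 * 1024**3` for `mps` although its comment mentions 8GB, so an MPS device always lands in the top tier. The sketch below restates the same thresholds as a pure function to make that visible. The name `batch_size_for` and the `GIB` constant are assumptions for illustration, and no `torch` import is needed.

```python
GIB = 1024**3


def batch_size_for(device_type: str, available_bytes: int) -> int:
    """Mirror the thresholds in the diff without touching torch."""
    if device_type not in ("cuda", "mps"):
        return 16  # cpu and unknown devices
    if available_bytes >= 31 * GIB:
        return 256
    if available_bytes >= 15 * GIB:
        return 128
    if available_bytes >= 8 * GIB:
        return 64
    return 32


mps_batch = batch_size_for("mps", 48 * GIB)  # the value hard-coded for mps
```

Separating the thresholds from device detection also makes the tiers straightforward to unit-test.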
@lru_cache() @lru_cache()
def get_device(): def get_device():
import torch import torch
if torch.cuda.is_available(): if torch.cuda.is_available():
device = torch.device('cuda') device = torch.device("cuda")
elif torch.backends.mps.is_available(): elif torch.backends.mps.is_available():
device = torch.device('mps') device = torch.device("mps")
else: else:
device = torch.device('cpu') device = torch.device("cpu")
return device return device
def set_model_device(model): def set_model_device(model):
device = get_device() device = get_device()
model.to(device) model.to(device)
return model, device return model, device
@lru_cache() @lru_cache()
def get_home_folder(): def get_home_folder():
home_folder = os.path.join(os.getenv("CRAWL4_AI_BASE_DIRECTORY", Path.home()), ".crawl4ai") home_folder = os.path.join(
os.getenv("CRAWL4_AI_BASE_DIRECTORY", Path.home()), ".crawl4ai"
)
os.makedirs(home_folder, exist_ok=True) os.makedirs(home_folder, exist_ok=True)
os.makedirs(f"{home_folder}/cache", exist_ok=True) os.makedirs(f"{home_folder}/cache", exist_ok=True)
os.makedirs(f"{home_folder}/models", exist_ok=True) os.makedirs(f"{home_folder}/models", exist_ok=True)
return home_folder return home_folder
@lru_cache() @lru_cache()
def load_bert_base_uncased(): def load_bert_base_uncased():
from transformers import BertTokenizer, BertModel, AutoTokenizer, AutoModel from transformers import BertTokenizer, BertModel
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', resume_download=None)
model = BertModel.from_pretrained('bert-base-uncased', resume_download=None) tokenizer = BertTokenizer.from_pretrained("bert-base-uncased", resume_download=None)
model = BertModel.from_pretrained("bert-base-uncased", resume_download=None)
model.eval() model.eval()
model, device = set_model_device(model) model, device = set_model_device(model)
return tokenizer, model return tokenizer, model
@lru_cache() @lru_cache()
def load_HF_embedding_model(model_name="BAAI/bge-small-en-v1.5") -> tuple: def load_HF_embedding_model(model_name="BAAI/bge-small-en-v1.5") -> tuple:
"""Load the Hugging Face model for embedding. """Load the Hugging Face model for embedding.
@@ -81,30 +92,35 @@ def load_HF_embedding_model(model_name="BAAI/bge-small-en-v1.5") -> tuple:
Returns: Returns:
tuple: The tokenizer and model. tuple: The tokenizer and model.
""" """
from transformers import BertTokenizer, BertModel, AutoTokenizer, AutoModel from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained(model_name, resume_download=None) tokenizer = AutoTokenizer.from_pretrained(model_name, resume_download=None)
model = AutoModel.from_pretrained(model_name, resume_download=None) model = AutoModel.from_pretrained(model_name, resume_download=None)
model.eval() model.eval()
model, device = set_model_device(model) model, device = set_model_device(model)
return tokenizer, model return tokenizer, model
@lru_cache() @lru_cache()
def load_text_classifier(): def load_text_classifier():
from transformers import AutoTokenizer, AutoModelForSequenceClassification from transformers import AutoTokenizer, AutoModelForSequenceClassification
from transformers import pipeline from transformers import pipeline
import torch
tokenizer = AutoTokenizer.from_pretrained("dstefa/roberta-base_topic_classification_nyt_news") tokenizer = AutoTokenizer.from_pretrained(
model = AutoModelForSequenceClassification.from_pretrained("dstefa/roberta-base_topic_classification_nyt_news") "dstefa/roberta-base_topic_classification_nyt_news"
)
model = AutoModelForSequenceClassification.from_pretrained(
"dstefa/roberta-base_topic_classification_nyt_news"
)
model.eval() model.eval()
model, device = set_model_device(model) model, device = set_model_device(model)
pipe = pipeline("text-classification", model=model, tokenizer=tokenizer) pipe = pipeline("text-classification", model=model, tokenizer=tokenizer)
return pipe return pipe
@lru_cache() @lru_cache()
def load_text_multilabel_classifier(): def load_text_multilabel_classifier():
from transformers import AutoModelForSequenceClassification, AutoTokenizer from transformers import AutoModelForSequenceClassification, AutoTokenizer
import numpy as np
from scipy.special import expit from scipy.special import expit
import torch import torch
@@ -117,17 +133,26 @@ def load_text_multilabel_classifier():
# device = torch.device("cpu") # device = torch.device("cpu")
# # return load_spacy_model(), torch.device("cpu") # # return load_spacy_model(), torch.device("cpu")
MODEL = "cardiffnlp/tweet-topic-21-multi" MODEL = "cardiffnlp/tweet-topic-21-multi"
tokenizer = AutoTokenizer.from_pretrained(MODEL, resume_download=None) tokenizer = AutoTokenizer.from_pretrained(MODEL, resume_download=None)
- model = AutoModelForSequenceClassification.from_pretrained(MODEL, resume_download=None)
+ model = AutoModelForSequenceClassification.from_pretrained(
+ MODEL, resume_download=None
+ )
model.eval() model.eval()
model, device = set_model_device(model) model, device = set_model_device(model)
class_mapping = model.config.id2label class_mapping = model.config.id2label
def _classifier(texts, threshold=0.5, max_length=64): def _classifier(texts, threshold=0.5, max_length=64):
- tokens = tokenizer(texts, return_tensors='pt', padding=True, truncation=True, max_length=max_length)
- tokens = {key: val.to(device) for key, val in tokens.items()} # Move tokens to the selected device
+ tokens = tokenizer(
+ texts,
+ return_tensors="pt",
+ padding=True,
+ truncation=True,
+ max_length=max_length,
+ )
+ tokens = {
+ key: val.to(device) for key, val in tokens.items()
+ } # Move tokens to the selected device
with torch.no_grad(): with torch.no_grad():
output = model(**tokens) output = model(**tokens)
@@ -138,25 +163,31 @@ def load_text_multilabel_classifier():
batch_labels = [] batch_labels = []
for prediction in predictions: for prediction in predictions:
- labels = [class_mapping[i] for i, value in enumerate(prediction) if value == 1]
+ labels = [
+ class_mapping[i] for i, value in enumerate(prediction) if value == 1
+ ]
batch_labels.append(labels) batch_labels.append(labels)
return batch_labels return batch_labels
return _classifier, device return _classifier, device
@lru_cache() @lru_cache()
def load_nltk_punkt(): def load_nltk_punkt():
import nltk import nltk
try: try:
nltk.data.find('tokenizers/punkt') nltk.data.find("tokenizers/punkt")
except LookupError: except LookupError:
nltk.download('punkt') nltk.download("punkt")
return nltk.data.find('tokenizers/punkt') return nltk.data.find("tokenizers/punkt")
@lru_cache() @lru_cache()
def load_spacy_model(): def load_spacy_model():
import spacy import spacy
name = "models/reuters" name = "models/reuters"
home_folder = get_home_folder() home_folder = get_home_folder()
model_folder = Path(home_folder) / name model_folder = Path(home_folder) / name
@@ -176,7 +207,9 @@ def load_spacy_model():
if model_folder.exists(): if model_folder.exists():
shutil.rmtree(model_folder) shutil.rmtree(model_folder)
except PermissionError: except PermissionError:
- print("[WARNING] Unable to remove existing folders. Please manually delete the following folders and try again:")
+ print(
+ "[WARNING] Unable to remove existing folders. Please manually delete the following folders and try again:"
+ )
print(f"- {repo_folder}") print(f"- {repo_folder}")
print(f"- {model_folder}") print(f"- {model_folder}")
return None return None
@@ -187,7 +220,7 @@ def load_spacy_model():
["git", "clone", "-b", branch, repo_url, str(repo_folder)], ["git", "clone", "-b", branch, repo_url, str(repo_folder)],
stdout=subprocess.DEVNULL, stdout=subprocess.DEVNULL,
stderr=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
check=True check=True,
) )
# Create the models directory if it doesn't exist # Create the models directory if it doesn't exist
@@ -215,6 +248,7 @@ def load_spacy_model():
print(f"Error loading spacy model: {e}") print(f"Error loading spacy model: {e}")
return None return None
def download_all_models(remove_existing=False): def download_all_models(remove_existing=False):
"""Download all models required for Crawl4AI.""" """Download all models required for Crawl4AI."""
if remove_existing: if remove_existing:
@@ -243,14 +277,20 @@ def download_all_models(remove_existing=False):
load_nltk_punkt() load_nltk_punkt()
print("[LOG] ✅ All models downloaded successfully.") print("[LOG] ✅ All models downloaded successfully.")
def main(): def main():
print("[LOG] Welcome to the Crawl4AI Model Downloader!") print("[LOG] Welcome to the Crawl4AI Model Downloader!")
print("[LOG] This script will download all the models required for Crawl4AI.") print("[LOG] This script will download all the models required for Crawl4AI.")
parser = argparse.ArgumentParser(description="Crawl4AI Model Downloader") parser = argparse.ArgumentParser(description="Crawl4AI Model Downloader")
- parser.add_argument('--remove-existing', action='store_true', help="Remove existing models before downloading")
+ parser.add_argument(
+ "--remove-existing",
+ action="store_true",
+ help="Remove existing models before downloading",
+ )
args = parser.parse_args() args = parser.parse_args()
download_all_models(remove_existing=args.remove_existing) download_all_models(remove_existing=args.remove_existing)
if __name__ == "__main__": if __name__ == "__main__":
main() main()
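
The `_classifier` closure in `load_text_multilabel_classifier` accepts a `threshold` and keeps every label whose prediction equals 1, but the hunk that turns model logits into those 0/1 predictions is elided from this diff. The following is only a minimal sketch of that step, assuming a per-label sigmoid followed by a `>= threshold` cut (the `scipy.special.expit` import points in that direction). The names `sigmoid` and `labels_for_batch` and the toy `class_mapping` are hypothetical, and plain `math` replaces scipy and torch so the snippet runs on its own.

```python
import math
from typing import Dict, List


def sigmoid(x: float) -> float:
    """Logistic function; stands in for scipy.special.expit in this sketch."""
    return 1.0 / (1.0 + math.exp(-x))


def labels_for_batch(
    logits: List[List[float]],
    class_mapping: Dict[int, str],
    threshold: float = 0.5,
) -> List[List[str]]:
    """For each row of logits, return the labels whose probability reaches the threshold."""
    batch_labels = []
    for row in logits:
        # Each score is binarised independently (multi-label, not softmax).
        prediction = [1 if sigmoid(v) >= threshold else 0 for v in row]
        labels = [class_mapping[i] for i, value in enumerate(prediction) if value == 1]
        batch_labels.append(labels)
    return batch_labels


# Hypothetical three-class mapping, for illustration only.
class_mapping = {0: "sports", 1: "science", 2: "travel"}
print(labels_for_batch([[2.0, -1.0, 0.3], [-3.0, -2.0, -0.1]], class_mapping))
```

Because each label is thresholded on its own, a text can receive several labels or none at all, which is why the function returns a list of lists.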


@@ -1,8 +1,71 @@
from __future__ import annotations
from pydantic import BaseModel, HttpUrl from pydantic import BaseModel, HttpUrl
from typing import List, Dict, Optional, Callable, Awaitable, Union, Any from typing import List, Dict, Optional, Callable, Awaitable, Union, Any
from enum import Enum
from dataclasses import dataclass from dataclasses import dataclass
from .ssl_certificate import SSLCertificate from .ssl_certificate import SSLCertificate
from datetime import datetime
from datetime import timedelta
from math import inf
###############################
# Dispatcher Models
###############################
@dataclass
class DomainState:
last_request_time: float = 0
current_delay: float = 0
fail_count: int = 0
@dataclass
class CrawlerTaskResult:
task_id: str
url: str
result: "CrawlResult"
memory_usage: float
peak_memory: float
start_time: datetime
end_time: datetime
error_message: str = ""
class CrawlStatus(Enum):
QUEUED = "QUEUED"
IN_PROGRESS = "IN_PROGRESS"
COMPLETED = "COMPLETED"
FAILED = "FAILED"
@dataclass
class CrawlStats:
task_id: str
url: str
status: CrawlStatus
start_time: Optional[datetime] = None
end_time: Optional[datetime] = None
memory_usage: float = 0.0
peak_memory: float = 0.0
error_message: str = ""
@property
def duration(self) -> str:
if not self.start_time:
return "0:00"
end = self.end_time or datetime.now()
duration = end - self.start_time
return str(timedelta(seconds=int(duration.total_seconds())))
class DisplayMode(Enum):
DETAILED = "DETAILED"
AGGREGATED = "AGGREGATED"
###############################
# Crawler Models
###############################
@dataclass @dataclass
class TokenUsage: class TokenUsage:
completion_tokens: int = 0 completion_tokens: int = 0
@@ -16,6 +79,7 @@ class UrlModel(BaseModel):
url: HttpUrl url: HttpUrl
forced: bool = False forced: bool = False
class MarkdownGenerationResult(BaseModel): class MarkdownGenerationResult(BaseModel):
raw_markdown: str raw_markdown: str
markdown_with_citations: str markdown_with_citations: str
@@ -23,6 +87,28 @@ class MarkdownGenerationResult(BaseModel):
fit_markdown: Optional[str] = None fit_markdown: Optional[str] = None
fit_html: Optional[str] = None fit_html: Optional[str] = None
class DispatchResult(BaseModel):
task_id: str
memory_usage: float
peak_memory: float
start_time: datetime
end_time: datetime
error_message: str = ""
@dataclass
class TraversalStats:
"""Statistics for the traversal process"""
start_time: datetime
urls_processed: int = 0
urls_failed: int = 0
urls_skipped: int = 0
total_depth_reached: int = 0
current_depth: int = 0
class CrawlResult(BaseModel): class CrawlResult(BaseModel):
url: str url: str
html: str html: str
@@ -32,7 +118,7 @@ class CrawlResult(BaseModel):
links: Dict[str, List[Dict]] = {} links: Dict[str, List[Dict]] = {}
downloaded_files: Optional[List[str]] = None downloaded_files: Optional[List[str]] = None
screenshot: Optional[str] = None screenshot: Optional[str] = None
pdf : Optional[bytes] = None pdf: Optional[bytes] = None
markdown: Optional[Union[str, MarkdownGenerationResult]] = None markdown: Optional[Union[str, MarkdownGenerationResult]] = None
markdown_v2: Optional[MarkdownGenerationResult] = None markdown_v2: Optional[MarkdownGenerationResult] = None
fit_markdown: Optional[str] = None fit_markdown: Optional[str] = None
@@ -44,6 +130,13 @@ class CrawlResult(BaseModel):
response_headers: Optional[dict] = None response_headers: Optional[dict] = None
status_code: Optional[int] = None status_code: Optional[int] = None
ssl_certificate: Optional[SSLCertificate] = None ssl_certificate: Optional[SSLCertificate] = None
dispatch_result: Optional[DispatchResult] = None
redirected_url: Optional[str] = None
# Attributes for position
depth: Optional[int] = None
score: Optional[float] = -inf
parent_url: Optional[str] = None
class Config: class Config:
arbitrary_types_allowed = True arbitrary_types_allowed = True
@@ -56,6 +149,51 @@ class AsyncCrawlResponse(BaseModel):
get_delayed_content: Optional[Callable[[Optional[float]], Awaitable[str]]] = None get_delayed_content: Optional[Callable[[Optional[float]], Awaitable[str]]] = None
downloaded_files: Optional[List[str]] = None downloaded_files: Optional[List[str]] = None
ssl_certificate: Optional[SSLCertificate] = None ssl_certificate: Optional[SSLCertificate] = None
redirected_url: Optional[str] = None
class Config: class Config:
arbitrary_types_allowed = True arbitrary_types_allowed = True
###############################
# Scraping Models
###############################
class MediaItem(BaseModel):
src: Optional[str] = ""
alt: Optional[str] = ""
desc: Optional[str] = ""
score: Optional[int] = 0
type: str = "image"
group_id: Optional[int] = 0
format: Optional[str] = None
width: Optional[int] = None
class Link(BaseModel):
href: Optional[str] = ""
text: Optional[str] = ""
title: Optional[str] = ""
base_domain: Optional[str] = ""
class Media(BaseModel):
images: List[MediaItem] = []
videos: List[MediaItem] = (
[]
) # Using MediaItem model for now, can be extended with Video model if needed
audios: List[MediaItem] = (
[]
) # Using MediaItem model for now, can be extended with Audio model if needed
class Links(BaseModel):
internal: List[Link] = []
external: List[Link] = []
class ScrapingResult(BaseModel):
cleaned_html: str
success: bool
media: Media = Media()
links: Links = Links()
metadata: Dict[str, Any] = {}
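
The new `depth`, `score`, and `parent_url` fields on `CrawlResult` exist so that the crawl tree can be rebuilt once deep crawling has finished. A minimal sketch of that reconstruction follows, assuming results arrive as a flat list in arbitrary order. `PageResult`, `build_tree`, and `render` are hypothetical names; `PageResult` is a stand-in that carries only the positional fields, not the real model, and the sketch assumes no page lists itself as its own ancestor.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class PageResult:
    """Hypothetical stand-in exposing only the positional fields of CrawlResult."""

    url: str
    depth: Optional[int] = None
    parent_url: Optional[str] = None


def build_tree(results: List[PageResult]) -> Dict[Optional[str], List[str]]:
    """Map each parent URL to its child URLs; root pages are stored under the key None."""
    children: Dict[Optional[str], List[str]] = defaultdict(list)
    # Sorting by depth keeps shallower pages first; the sort is stable within a depth.
    for result in sorted(results, key=lambda r: r.depth or 0):
        children[result.parent_url].append(result.url)
    return dict(children)


def render(tree: Dict[Optional[str], List[str]], parent: Optional[str] = None, indent: int = 0) -> List[str]:
    """Flatten the tree into indented lines, depth-first."""
    lines: List[str] = []
    for url in tree.get(parent, []):
        lines.append("  " * indent + url)
        lines.extend(render(tree, url, indent + 1))
    return lines


results = [
    PageResult("https://example.com/docs/a", depth=2, parent_url="https://example.com/docs"),
    PageResult("https://example.com", depth=0),
    PageResult("https://example.com/docs", depth=1, parent_url="https://example.com"),
    PageResult("https://example.com/blog", depth=1, parent_url="https://example.com"),
]
print("\n".join(render(build_tree(results))))
```

Keying on `parent_url` rather than nesting results inside each other keeps every result flat, which matches the removal of `child_urls` from `CrawlResult` in this change set.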


@@ -202,3 +202,808 @@ Avoid Common Mistakes:
Result Result
Output the final list of JSON objects, wrapped in <blocks>...</blocks> XML tags. Make sure to close the tag properly.""" Output the final list of JSON objects, wrapped in <blocks>...</blocks> XML tags. Make sure to close the tag properly."""
PROMPT_FILTER_CONTENT = """Your task is to filter and convert HTML content into clean, focused markdown that's optimized for use with LLMs and information retrieval systems.
INPUT HTML:
<|HTML_CONTENT_START|>
{HTML}
<|HTML_CONTENT_END|>
SPECIFIC INSTRUCTION:
<|USER_INSTRUCTION_START|>
{REQUEST}
<|USER_INSTRUCTION_END|>
TASK DETAILS:
1. Content Selection
- DO: Keep essential information, main content, key details
- DO: Preserve hierarchical structure using markdown headers
- DO: Keep code blocks, tables, key lists
- DON'T: Include navigation menus, ads, footers, cookie notices
- DON'T: Keep social media widgets, sidebars, related content
2. Content Transformation
- DO: Use proper markdown syntax (#, ##, **, `, etc)
- DO: Convert tables to markdown tables
- DO: Preserve code formatting with ```language blocks
- DO: Maintain link texts but remove tracking parameters
- DON'T: Include HTML tags in output
- DON'T: Keep class names, ids, or other HTML attributes
3. Content Organization
- DO: Maintain logical flow of information
- DO: Group related content under appropriate headers
- DO: Use consistent header levels
- DON'T: Fragment related content
- DON'T: Duplicate information
Example Input:
<div class="main-content"><h1>Setup Guide</h1><p>Follow these steps...</p></div>
<div class="sidebar">Related articles...</div>
Example Output:
# Setup Guide
Follow these steps...
IMPORTANT: If specific instruction is provided above, prioritize those requirements over these general guidelines.
OUTPUT FORMAT:
Wrap your response in <content> tags. Use proper markdown throughout.
<content>
[Your markdown content here]
</content>
Begin filtering now."""
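
`PROMPT_FILTER_CONTENT` has two placeholders, `{HTML}` and `{REQUEST}`, and asks the model to wrap its answer in `<content>` tags. The sketch below shows one way a caller might fill the template and read the answer back. It is an illustration under assumptions, not the library's own code: `TEMPLATE` is a shortened stand-in for the real prompt, `fill_prompt` and `extract_content` are hypothetical helpers, and `reply` is a hand-written response rather than real model output.

```python
import re
from typing import Optional

# Shortened stand-in for PROMPT_FILTER_CONTENT; only the placeholders and tag format matter here.
TEMPLATE = """INPUT HTML:
<|HTML_CONTENT_START|>
{HTML}
<|HTML_CONTENT_END|>
SPECIFIC INSTRUCTION:
<|USER_INSTRUCTION_START|>
{REQUEST}
<|USER_INSTRUCTION_END|>
Wrap your response in <content> tags."""


def fill_prompt(template: str, html: str, request: str) -> str:
    """Substitute both placeholders in a single pass, so inserted text is never rescanned."""
    values = {"HTML": html, "REQUEST": request}
    return re.sub(r"\{(HTML|REQUEST)\}", lambda m: values[m.group(1)], template)


def extract_content(response: str) -> Optional[str]:
    """Return the text inside the first <content>...</content> pair, or None if it is missing."""
    match = re.search(r"<content>(.*?)</content>", response, flags=re.DOTALL)
    return match.group(1).strip() if match else None


prompt = fill_prompt(TEMPLATE, "<div><h1>Setup Guide</h1></div>", "Keep only the headings")
reply = "<content>\n# Setup Guide\nFollow these steps...\n</content>"
print(extract_content(reply))
```

A single-pass substitution is used instead of `str.format` because scraped HTML routinely contains braces (inline CSS, JSON, templates) that `format` would try to interpret.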
JSON_SCHEMA_BUILDER = """
# HTML Schema Generation Instructions
You are a specialized model designed to analyze HTML patterns and generate extraction schemas. Your primary job is to create structured JSON schemas that can be used to extract data from HTML in a consistent and reliable way. When presented with HTML content, you must analyze its structure and generate a schema that captures all relevant data points.
## Your Core Responsibilities:
1. Analyze HTML structure to identify repeating patterns and important data points
2. Generate valid JSON schemas following the specified format
3. Create appropriate selectors that will work reliably for data extraction
4. Name fields meaningfully based on their content and purpose
5. Handle both specific user requests and autonomous pattern detection
## Available Schema Types You Can Generate:
<schema_types>
1. Basic Single-Level Schema
- Use for simple, flat data structures
- Example: Product cards, user profiles
- Direct field extractions
2. Nested Object Schema
- Use for hierarchical data
- Example: Articles with author details
- Contains objects within objects
3. List Schema
- Use for repeating elements
- Example: Comment sections, product lists
- Handles arrays of similar items
4. Complex Nested Lists
- Use for multi-level data
- Example: Categories with subcategories
- Multiple levels of nesting
5. Transformation Schema
- Use for data requiring processing
- Supports regex and text transformations
- Special attribute handling
</schema_types>
<schema_structure>
Your output must always be a JSON object with this structure:
{
"name": "Descriptive name of the pattern",
"baseSelector": "CSS selector for the repeating element",
"fields": [
{
"name": "field_name",
"selector": "CSS selector",
"type": "text|attribute|nested|list|regex",
"attribute": "attribute_name", // Optional
"transform": "transformation_type", // Optional
"pattern": "regex_pattern", // Optional
"fields": [] // For nested/list types
}
]
}
</schema_structure>
<type_definitions>
Available field types:
- text: Direct text extraction
- attribute: HTML attribute extraction
- nested: Object containing other fields
- list: Array of similar items
- regex: Pattern-based extraction
</type_definitions>
<behavior_rules>
1. When given a specific query:
- Focus on extracting requested data points
- Use most specific selectors possible
- Include all fields mentioned in the query
2. When no query is provided:
- Identify main content areas
- Extract all meaningful data points
- Use semantic structure to determine importance
- Include prices, dates, titles, and other common data types
3. Always:
- Use reliable CSS selectors
- Handle dynamic class names appropriately
- Create descriptive field names
- Follow consistent naming conventions
</behavior_rules>
<examples>
1. Basic Product Card Example:
<html>
<div class="product-card" data-cat-id="electronics" data-subcat-id="laptops">
<h2 class="product-title">Gaming Laptop</h2>
<span class="price">$999.99</span>
<img src="laptop.jpg" alt="Gaming Laptop">
</div>
</html>
Generated Schema:
{
"name": "Product Cards",
"baseSelector": ".product-card",
"baseFields": [
{"name": "data_cat_id", "type": "attribute", "attribute": "data-cat-id"},
{"name": "data_subcat_id", "type": "attribute", "attribute": "data-subcat-id"}
],
"fields": [
{
"name": "title",
"selector": ".product-title",
"type": "text"
},
{
"name": "price",
"selector": ".price",
"type": "text"
},
{
"name": "image_url",
"selector": "img",
"type": "attribute",
"attribute": "src"
}
]
}
2. Article with Author Details Example:
<html>
<article>
<h1>The Future of AI</h1>
<div class="author-info">
<span class="author-name">Dr. Smith</span>
<img src="author.jpg" alt="Dr. Smith">
</div>
</article>
</html>
Generated Schema:
{
"name": "Article Details",
"baseSelector": "article",
"fields": [
{
"name": "title",
"selector": "h1",
"type": "text"
},
{
"name": "author",
"type": "nested",
"selector": ".author-info",
"fields": [
{
"name": "name",
"selector": ".author-name",
"type": "text"
},
{
"name": "avatar",
"selector": "img",
"type": "attribute",
"attribute": "src"
}
]
}
]
}
3. Comments Section Example:
<html>
<div class="comments-container">
<div class="comment" data-user-id="123">
<div class="user-name">John123</div>
<p class="comment-text">Great article!</p>
</div>
<div class="comment" data-user-id="456">
<div class="user-name">Alice456</div>
<p class="comment-text">Thanks for sharing.</p>
</div>
</div>
</html>
Generated Schema:
{
"name": "Comment Section",
"baseSelector": ".comments-container",
"baseFields": [
{"name": "data_user_id", "type": "attribute", "attribute": "data-user-id"}
],
"fields": [
{
"name": "comments",
"type": "list",
"selector": ".comment",
"fields": [
{
"name": "user",
"selector": ".user-name",
"type": "text"
},
{
"name": "content",
"selector": ".comment-text",
"type": "text"
}
]
}
]
}
4. E-commerce Categories Example:
<html>
<div class="category-section" data-category="electronics">
<h2>Electronics</h2>
<div class="subcategory">
<h3>Laptops</h3>
<div class="product">
<span class="product-name">MacBook Pro</span>
<span class="price">$1299</span>
</div>
<div class="product">
<span class="product-name">Dell XPS</span>
<span class="price">$999</span>
</div>
</div>
</div>
</html>
Generated Schema:
{
"name": "E-commerce Categories",
"baseSelector": ".category-section",
"baseFields": [
{"name": "data_category", "type": "attribute", "attribute": "data-category"}
],
"fields": [
{
"name": "category_name",
"selector": "h2",
"type": "text"
},
{
"name": "subcategories",
"type": "nested_list",
"selector": ".subcategory",
"fields": [
{
"name": "name",
"selector": "h3",
"type": "text"
},
{
"name": "products",
"type": "list",
"selector": ".product",
"fields": [
{
"name": "name",
"selector": ".product-name",
"type": "text"
},
{
"name": "price",
"selector": ".price",
"type": "text"
}
]
}
]
}
]
}
5. Job Listings with Transformations Example:
<html>
<div class="job-post">
<h3 class="job-title">Senior Developer</h3>
<span class="salary-text">Salary: $120,000/year</span>
<span class="location"> New York, NY </span>
</div>
</html>
Generated Schema:
{
"name": "Job Listings",
"baseSelector": ".job-post",
"fields": [
{
"name": "title",
"selector": ".job-title",
"type": "text",
"transform": "uppercase"
},
{
"name": "salary",
"selector": ".salary-text",
"type": "regex",
"pattern": "\\$([\\d,]+)"
},
{
"name": "location",
"selector": ".location",
"type": "text",
"transform": "strip"
}
]
}
6. Skyscanner Place Card Example:
<html>
<div class="PlaceCard_descriptionContainer__M2NjN" data-testid="description-container">
<div class="PlaceCard_nameContainer__ZjZmY" tabindex="0" role="link">
<div class="PlaceCard_nameContent__ODUwZ">
<span class="BpkText_bpk-text__MjhhY BpkText_bpk-text--heading-4__Y2FlY">Doha</span>
</div>
<span class="BpkText_bpk-text__MjhhY BpkText_bpk-text--heading-4__Y2FlY PlaceCard_subName__NTVkY">Qatar</span>
</div>
<span class="PlaceCard_advertLabel__YTM0N">Sunny days and the warmest welcome awaits</span>
<a class="BpkLink_bpk-link__MmQwY PlaceCard_descriptionLink__NzYwN" href="/flights/del/doha/" data-testid="flights-link">
<div class="PriceDescription_container__NjEzM">
<span class="BpkText_bpk-text--heading-5__MTRjZ">₹17,559</span>
</div>
</a>
</div>
</html>
Generated Schema:
{
"name": "Skyscanner Place Cards",
"baseSelector": "div[class^='PlaceCard_descriptionContainer__']",
"baseFields": [
{"name": "data_testid", "type": "attribute", "attribute": "data-testid"}
],
"fields": [
{
"name": "city_name",
"selector": "div[class^='PlaceCard_nameContent__'] span[class*='BpkText_bpk-text--heading-4__']",
"type": "text"
},
{
"name": "country_name",
"selector": "span[class*='PlaceCard_subName__']",
"type": "text"
},
{
"name": "description",
"selector": "span[class*='PlaceCard_advertLabel__']",
"type": "text"
},
{
"name": "flight_price",
"selector": "a[data-testid='flights-link'] span[class*='BpkText_bpk-text--heading-5__']",
"type": "text"
},
{
"name": "flight_url",
"selector": "a[data-testid='flights-link']",
"type": "attribute",
"attribute": "href"
}
]
}
</examples>
<output_requirements>
Your output must:
1. Be valid JSON only
2. Include no explanatory text
3. Follow the exact schema structure provided
4. Use appropriate field types
5. Include all required fields
6. Use valid CSS selectors
</output_requirements>
"""
JSON_SCHEMA_BUILDER_XPATH = """
# HTML Schema Generation Instructions
You are a specialized model designed to analyze HTML patterns and generate extraction schemas. Your primary job is to create structured JSON schemas that can be used to extract data from HTML in a consistent and reliable way. When presented with HTML content, you must analyze its structure and generate a schema that captures all relevant data points.
## Your Core Responsibilities:
1. Analyze HTML structure to identify repeating patterns and important data points
2. Generate valid JSON schemas following the specified format
3. Create appropriate XPath selectors that will work reliably for data extraction
4. Name fields meaningfully based on their content and purpose
5. Handle both specific user requests and autonomous pattern detection
## Available Schema Types You Can Generate:
<schema_types>
1. Basic Single-Level Schema
- Use for simple, flat data structures
- Example: Product cards, user profiles
- Direct field extractions
2. Nested Object Schema
- Use for hierarchical data
- Example: Articles with author details
- Contains objects within objects
3. List Schema
- Use for repeating elements
- Example: Comment sections, product lists
- Handles arrays of similar items
4. Complex Nested Lists
- Use for multi-level data
- Example: Categories with subcategories
- Multiple levels of nesting
5. Transformation Schema
- Use for data requiring processing
- Supports regex and text transformations
- Special attribute handling
</schema_types>
<schema_structure>
Your output must always be a JSON object with this structure:
{
"name": "Descriptive name of the pattern",
"baseSelector": "XPath selector for the repeating element",
"fields": [
{
"name": "field_name",
"selector": "XPath selector",
"type": "text|attribute|nested|list|regex",
"attribute": "attribute_name", // Optional
"transform": "transformation_type", // Optional
"pattern": "regex_pattern", // Optional
"fields": [] // For nested/list types
}
]
}
</schema_structure>
<type_definitions>
Available field types:
- text: Direct text extraction
- attribute: HTML attribute extraction
- nested: Object containing other fields
- list: Array of similar items
- regex: Pattern-based extraction
</type_definitions>
<behavior_rules>
1. When given a specific query:
- Focus on extracting requested data points
- Use most specific selectors possible
- Include all fields mentioned in the query
2. When no query is provided:
- Identify main content areas
- Extract all meaningful data points
- Use semantic structure to determine importance
- Include prices, dates, titles, and other common data types
3. Always:
- Use reliable XPath selectors
- Handle dynamic element IDs appropriately
- Create descriptive field names
- Follow consistent naming conventions
</behavior_rules>
<examples>
1. Basic Product Card Example:
<html>
<div class="product-card" data-cat-id="electronics" data-subcat-id="laptops">
<h2 class="product-title">Gaming Laptop</h2>
<span class="price">$999.99</span>
<img src="laptop.jpg" alt="Gaming Laptop">
</div>
</html>
Generated Schema:
{
"name": "Product Cards",
"baseSelector": "//div[@class='product-card']",
"baseFields": [
{"name": "data_cat_id", "type": "attribute", "attribute": "data-cat-id"},
{"name": "data_subcat_id", "type": "attribute", "attribute": "data-subcat-id"}
],
"fields": [
{
"name": "title",
"selector": ".//h2[@class='product-title']",
"type": "text"
},
{
"name": "price",
"selector": ".//span[@class='price']",
"type": "text"
},
{
"name": "image_url",
"selector": ".//img",
"type": "attribute",
"attribute": "src"
}
]
}
2. Article with Author Details Example:
<html>
<article>
<h1>The Future of AI</h1>
<div class="author-info">
<span class="author-name">Dr. Smith</span>
<img src="author.jpg" alt="Dr. Smith">
</div>
</article>
</html>
Generated Schema:
{
"name": "Article Details",
"baseSelector": "//article",
"fields": [
{
"name": "title",
"selector": ".//h1",
"type": "text"
},
{
"name": "author",
"type": "nested",
"selector": ".//div[@class='author-info']",
"fields": [
{
"name": "name",
"selector": ".//span[@class='author-name']",
"type": "text"
},
{
"name": "avatar",
"selector": ".//img",
"type": "attribute",
"attribute": "src"
}
]
}
]
}
3. Comments Section Example:
<html>
<div class="comments-container">
<div class="comment" data-user-id="123">
<div class="user-name">John123</div>
<p class="comment-text">Great article!</p>
</div>
<div class="comment" data-user-id="456">
<div class="user-name">Alice456</div>
<p class="comment-text">Thanks for sharing.</p>
</div>
</div>
</html>
Generated Schema:
{
"name": "Comment Section",
"baseSelector": "//div[@class='comments-container']",
"fields": [
{
"name": "comments",
"type": "list",
"selector": ".//div[@class='comment']",
"baseFields": [
{"name": "data_user_id", "type": "attribute", "attribute": "data-user-id"}
],
"fields": [
{
"name": "user",
"selector": ".//div[@class='user-name']",
"type": "text"
},
{
"name": "content",
"selector": ".//p[@class='comment-text']",
"type": "text"
}
]
}
]
}
4. E-commerce Categories Example:
<html>
<div class="category-section" data-category="electronics">
<h2>Electronics</h2>
<div class="subcategory">
<h3>Laptops</h3>
<div class="product">
<span class="product-name">MacBook Pro</span>
<span class="price">$1299</span>
</div>
<div class="product">
<span class="product-name">Dell XPS</span>
<span class="price">$999</span>
</div>
</div>
</div>
</html>
Generated Schema:
{
"name": "E-commerce Categories",
"baseSelector": "//div[@class='category-section']",
"baseFields": [
{"name": "data_category", "type": "attribute", "attribute": "data-category"}
],
"fields": [
{
"name": "category_name",
"selector": ".//h2",
"type": "text"
},
{
"name": "subcategories",
"type": "nested_list",
"selector": ".//div[@class='subcategory']",
"fields": [
{
"name": "name",
"selector": ".//h3",
"type": "text"
},
{
"name": "products",
"type": "list",
"selector": ".//div[@class='product']",
"fields": [
{
"name": "name",
"selector": ".//span[@class='product-name']",
"type": "text"
},
{
"name": "price",
"selector": ".//span[@class='price']",
"type": "text"
}
]
}
]
}
]
}
5. Job Listings with Transformations Example:
<html>
<div class="job-post">
<h3 class="job-title">Senior Developer</h3>
<span class="salary-text">Salary: $120,000/year</span>
<span class="location"> New York, NY </span>
</div>
</html>
Generated Schema:
{
"name": "Job Listings",
"baseSelector": "//div[@class='job-post']",
"fields": [
{
"name": "title",
"selector": ".//h3[@class='job-title']",
"type": "text",
"transform": "uppercase"
},
{
"name": "salary",
"selector": ".//span[@class='salary-text']",
"type": "regex",
"pattern": "\\$([\\d,]+)"
},
{
"name": "location",
"selector": ".//span[@class='location']",
"type": "text",
"transform": "strip"
}
]
}
6. Skyscanner Place Card Example:
<html>
<div class="PlaceCard_descriptionContainer__M2NjN" data-testid="description-container">
<div class="PlaceCard_nameContainer__ZjZmY" tabindex="0" role="link">
<div class="PlaceCard_nameContent__ODUwZ">
<span class="BpkText_bpk-text__MjhhY BpkText_bpk-text--heading-4__Y2FlY">Doha</span>
</div>
<span class="BpkText_bpk-text__MjhhY BpkText_bpk-text--heading-4__Y2FlY PlaceCard_subName__NTVkY">Qatar</span>
</div>
<span class="PlaceCard_advertLabel__YTM0N">Sunny days and the warmest welcome awaits</span>
<a class="BpkLink_bpk-link__MmQwY PlaceCard_descriptionLink__NzYwN" href="/flights/del/doha/" data-testid="flights-link">
<div class="PriceDescription_container__NjEzM">
<span class="BpkText_bpk-text--heading-5__MTRjZ">₹17,559</span>
</div>
</a>
</div>
</html>
Generated Schema:
{
"name": "Skyscanner Place Cards",
"baseSelector": "//div[contains(@class, 'PlaceCard_descriptionContainer__')]",
"baseFields": [
{"name": "data_testid", "type": "attribute", "attribute": "data-testid"}
],
"fields": [
{
"name": "city_name",
"selector": ".//div[contains(@class, 'PlaceCard_nameContent__')]//span[contains(@class, 'BpkText_bpk-text--heading-4__')]",
"type": "text"
},
{
"name": "country_name",
"selector": ".//span[contains(@class, 'PlaceCard_subName__')]",
"type": "text"
},
{
"name": "description",
"selector": ".//span[contains(@class, 'PlaceCard_advertLabel__')]",
"type": "text"
},
{
"name": "flight_price",
"selector": ".//a[@data-testid='flights-link']//span[contains(@class, 'BpkText_bpk-text--heading-5__')]",
"type": "text"
},
{
"name": "flight_url",
"selector": ".//a[@data-testid='flights-link']",
"type": "attribute",
"attribute": "href"
}
]
}
</examples>
<output_requirements>
Your output must:
1. Be valid JSON only
2. Include no explanatory text
3. Follow the exact schema structure provided
4. Use appropriate field types
5. Include all required fields
6. Use valid XPath selectors
</output_requirements>
"""

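The XPath schemas in `JSON_SCHEMA_BUILDER_XPATH` are meant to be consumed by an extractor that walks `baseSelector`, then resolves each field relative to the matched node. The sketch below applies the "Job Listings" schema from the prompt to its sample HTML using only the standard library. It is a hedged illustration, not Crawl4AI's extraction strategy: `extract` and `extract_field` are hypothetical helpers that cover just the `text`, `attribute`, and `regex` types plus the `uppercase` and `strip` transforms. `xml.etree.ElementTree` supports only a subset of XPath (no `contains()`, for example), so the Skyscanner schema would need a full XPath engine.

```python
import re
import xml.etree.ElementTree as ET
from typing import Any, Dict, List

HTML = """
<html>
  <div class="job-post">
    <h3 class="job-title">Senior Developer</h3>
    <span class="salary-text">Salary: $120,000/year</span>
    <span class="location"> New York, NY </span>
  </div>
</html>
"""

SCHEMA = {
    "name": "Job Listings",
    "baseSelector": "//div[@class='job-post']",
    "fields": [
        {"name": "title", "selector": ".//h3[@class='job-title']", "type": "text", "transform": "uppercase"},
        {"name": "salary", "selector": ".//span[@class='salary-text']", "type": "regex", "pattern": r"\$([\d,]+)"},
        {"name": "location", "selector": ".//span[@class='location']", "type": "text", "transform": "strip"},
    ],
}


def extract_field(node: ET.Element, field: Dict[str, Any]) -> Any:
    """Resolve one field relative to `node`; fields without a selector read from the node itself."""
    target = node.find(field["selector"]) if "selector" in field else node
    if target is None:
        return None
    if field["type"] == "attribute":
        return target.get(field["attribute"])
    text = "".join(target.itertext())
    if field["type"] == "regex":
        match = re.search(field["pattern"], text)
        return match.group(1) if match else None
    # Remaining case: plain text, with the two transforms used in the prompt examples.
    transform = field.get("transform")
    if transform == "uppercase":
        return text.upper()
    if transform == "strip":
        return text.strip()
    return text


def extract(html: str, schema: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return one dict per element matched by the schema's baseSelector."""
    root = ET.fromstring(html.strip())
    base = schema["baseSelector"]
    if base.startswith("/"):
        base = "." + base  # ElementTree accepts only relative paths on an element
    rows = []
    for node in root.findall(base):
        row = {f["name"]: extract_field(node, f) for f in schema.get("baseFields", [])}
        row.update({f["name"]: extract_field(node, f) for f in schema["fields"]})
        rows.append(row)
    return rows


print(extract(HTML, SCHEMA))
```

The relative `.//` prefix on every field selector is what scopes a lookup to the current base element; an absolute `//` inside a field would search the whole document and return the same value for every row.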

@@ -26,11 +26,12 @@ class SSLCertificate:
export_as_json() -> Dict[str, Any]: Export the certificate as JSON format. export_as_json() -> Dict[str, Any]: Export the certificate as JSON format.
export_as_text() -> str: Export the certificate as text format. export_as_text() -> str: Export the certificate as text format.
""" """
def __init__(self, cert_info: Dict[str, Any]): def __init__(self, cert_info: Dict[str, Any]):
self._cert_info = self._decode_cert_data(cert_info) self._cert_info = self._decode_cert_data(cert_info)
@staticmethod @staticmethod
def from_url(url: str, timeout: int = 10) -> Optional['SSLCertificate']: def from_url(url: str, timeout: int = 10) -> Optional["SSLCertificate"]:
""" """
Create SSLCertificate instance from a URL. Create SSLCertificate instance from a URL.
@@ -43,14 +44,16 @@ class SSLCertificate:
""" """
try: try:
hostname = urlparse(url).netloc hostname = urlparse(url).netloc
if ':' in hostname: if ":" in hostname:
hostname = hostname.split(':')[0] hostname = hostname.split(":")[0]
context = ssl.create_default_context() context = ssl.create_default_context()
with socket.create_connection((hostname, 443), timeout=timeout) as sock: with socket.create_connection((hostname, 443), timeout=timeout) as sock:
with context.wrap_socket(sock, server_hostname=hostname) as ssock: with context.wrap_socket(sock, server_hostname=hostname) as ssock:
cert_binary = ssock.getpeercert(binary_form=True) cert_binary = ssock.getpeercert(binary_form=True)
- x509 = OpenSSL.crypto.load_certificate(OpenSSL.crypto.FILETYPE_ASN1, cert_binary)
+ x509 = OpenSSL.crypto.load_certificate(
+ OpenSSL.crypto.FILETYPE_ASN1, cert_binary
+ )
cert_info = { cert_info = {
"subject": dict(x509.get_subject().get_components()), "subject": dict(x509.get_subject().get_components()),
@@ -61,32 +64,33 @@ class SSLCertificate:
"not_after": x509.get_notAfter(), "not_after": x509.get_notAfter(),
"fingerprint": x509.digest("sha256").hex(), "fingerprint": x509.digest("sha256").hex(),
"signature_algorithm": x509.get_signature_algorithm(), "signature_algorithm": x509.get_signature_algorithm(),
"raw_cert": base64.b64encode(cert_binary) "raw_cert": base64.b64encode(cert_binary),
} }
# Add extensions # Add extensions
extensions = [] extensions = []
for i in range(x509.get_extension_count()): for i in range(x509.get_extension_count()):
ext = x509.get_extension(i) ext = x509.get_extension(i)
- extensions.append({
- "name": ext.get_short_name(),
- "value": str(ext)
- })
+ extensions.append(
+ {"name": ext.get_short_name(), "value": str(ext)}
+ )
cert_info["extensions"] = extensions cert_info["extensions"] = extensions
return SSLCertificate(cert_info) return SSLCertificate(cert_info)
except Exception as e: except Exception:
return None return None
@staticmethod @staticmethod
def _decode_cert_data(data: Any) -> Any: def _decode_cert_data(data: Any) -> Any:
"""Helper method to decode bytes in certificate data.""" """Helper method to decode bytes in certificate data."""
if isinstance(data, bytes): if isinstance(data, bytes):
return data.decode('utf-8') return data.decode("utf-8")
elif isinstance(data, dict): elif isinstance(data, dict):
return { return {
(k.decode('utf-8') if isinstance(k, bytes) else k): SSLCertificate._decode_cert_data(v) (
k.decode("utf-8") if isinstance(k, bytes) else k
): SSLCertificate._decode_cert_data(v)
for k, v in data.items() for k, v in data.items()
} }
elif isinstance(data, list): elif isinstance(data, list):
@@ -105,7 +109,7 @@ class SSLCertificate:
""" """
json_str = json.dumps(self._cert_info, indent=2, ensure_ascii=False) json_str = json.dumps(self._cert_info, indent=2, ensure_ascii=False)
if filepath: if filepath:
Path(filepath).write_text(json_str, encoding='utf-8') Path(filepath).write_text(json_str, encoding="utf-8")
return None return None
return json_str return json_str
@@ -122,18 +126,17 @@ class SSLCertificate:
try: try:
x509 = OpenSSL.crypto.load_certificate( x509 = OpenSSL.crypto.load_certificate(
OpenSSL.crypto.FILETYPE_ASN1, OpenSSL.crypto.FILETYPE_ASN1,
base64.b64decode(self._cert_info['raw_cert']) base64.b64decode(self._cert_info["raw_cert"]),
) )
pem_data = OpenSSL.crypto.dump_certificate( pem_data = OpenSSL.crypto.dump_certificate(
OpenSSL.crypto.FILETYPE_PEM, OpenSSL.crypto.FILETYPE_PEM, x509
x509 ).decode("utf-8")
).decode('utf-8')
if filepath: if filepath:
Path(filepath).write_text(pem_data, encoding='utf-8') Path(filepath).write_text(pem_data, encoding="utf-8")
return None return None
return pem_data return pem_data
except Exception as e: except Exception:
return None return None
def to_der(self, filepath: Optional[str] = None) -> Optional[bytes]: def to_der(self, filepath: Optional[str] = None) -> Optional[bytes]:
@@ -147,7 +150,7 @@ class SSLCertificate:
Optional[bytes]: DER bytes if successful, None otherwise. Optional[bytes]: DER bytes if successful, None otherwise.
""" """
try: try:
der_data = base64.b64decode(self._cert_info['raw_cert']) der_data = base64.b64decode(self._cert_info["raw_cert"])
if filepath: if filepath:
Path(filepath).write_bytes(der_data) Path(filepath).write_bytes(der_data)
return None return None
@@ -158,24 +161,24 @@ class SSLCertificate:
@property @property
def issuer(self) -> Dict[str, str]: def issuer(self) -> Dict[str, str]:
"""Get certificate issuer information.""" """Get certificate issuer information."""
return self._cert_info.get('issuer', {}) return self._cert_info.get("issuer", {})
@property @property
def subject(self) -> Dict[str, str]: def subject(self) -> Dict[str, str]:
"""Get certificate subject information.""" """Get certificate subject information."""
return self._cert_info.get('subject', {}) return self._cert_info.get("subject", {})
@property @property
def valid_from(self) -> str: def valid_from(self) -> str:
"""Get certificate validity start date.""" """Get certificate validity start date."""
return self._cert_info.get('not_before', '') return self._cert_info.get("not_before", "")
@property @property
def valid_until(self) -> str: def valid_until(self) -> str:
"""Get certificate validity end date.""" """Get certificate validity end date."""
return self._cert_info.get('not_after', '') return self._cert_info.get("not_after", "")
@property @property
def fingerprint(self) -> str: def fingerprint(self) -> str:
"""Get certificate fingerprint.""" """Get certificate fingerprint."""
return self._cert_info.get('fingerprint', '') return self._cert_info.get("fingerprint", "")


@@ -2,8 +2,146 @@ import random
from typing import Optional, Literal, List, Dict, Tuple from typing import Optional, Literal, List, Dict, Tuple
import re import re
from abc import ABC, abstractmethod
import random
from fake_useragent import UserAgent
import requests
from lxml import html
import json
from typing import Optional, List, Union, Dict
class UserAgentGenerator: class UAGen(ABC):
@abstractmethod
def generate(self,
browsers: Optional[List[str]] = None,
os: Optional[Union[str, List[str]]] = None,
min_version: float = 0.0,
platforms: Optional[Union[str, List[str]]] = None,
pct_threshold: Optional[float] = None,
fallback: str = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 Chrome/116.0.0.0 Safari/537.36") -> Union[str, Dict]:
pass
@staticmethod
def generate_client_hints( user_agent: str) -> str:
"""Generate Sec-CH-UA header value based on user agent string"""
def _parse_user_agent(user_agent: str) -> Dict[str, str]:
"""Parse a user agent string to extract browser and version information"""
browsers = {
"chrome": r"Chrome/(\d+)",
"edge": r"Edg/(\d+)",
"safari": r"Version/(\d+)",
"firefox": r"Firefox/(\d+)",
}
result = {}
for browser, pattern in browsers.items():
match = re.search(pattern, user_agent)
if match:
result[browser] = match.group(1)
return result
browsers = _parse_user_agent(user_agent)
# Client hints components
hints = []
# Handle different browser combinations
if "chrome" in browsers:
hints.append(f'"Chromium";v="{browsers["chrome"]}"')
hints.append('"Not_A Brand";v="8"')
if "edge" in browsers:
hints.append(f'"Microsoft Edge";v="{browsers["edge"]}"')
else:
hints.append(f'"Google Chrome";v="{browsers["chrome"]}"')
elif "firefox" in browsers:
# Firefox doesn't typically send Sec-CH-UA
return '""'
elif "safari" in browsers:
# Safari's format for client hints
hints.append(f'"Safari";v="{browsers["safari"]}"')
hints.append('"Not_A Brand";v="8"')
return ", ".join(hints)
class ValidUAGenerator(UAGen):
def __init__(self):
self.ua = UserAgent()
def generate(self,
browsers: Optional[List[str]] = None,
os: Optional[Union[str, List[str]]] = None,
min_version: float = 0.0,
platforms: Optional[Union[str, List[str]]] = None,
pct_threshold: Optional[float] = None,
fallback: str = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 Chrome/116.0.0.0 Safari/537.36") -> str:
self.ua = UserAgent(
browsers=browsers or ['Chrome', 'Firefox', 'Edge'],
os=os or ['Windows', 'Mac OS X'],
min_version=min_version,
platforms=platforms or ['desktop'],
fallback=fallback
)
return self.ua.random
class OnlineUAGenerator(UAGen):
def __init__(self):
self.agents = []
self._fetch_agents()
def _fetch_agents(self):
try:
response = requests.get(
'https://www.useragents.me/',
timeout=5,
headers={'Accept': 'text/html,application/xhtml+xml'}
)
response.raise_for_status()
tree = html.fromstring(response.content)
json_text = tree.cssselect('#most-common-desktop-useragents-json-csv > div:nth-child(1) > textarea')[0].text
self.agents = json.loads(json_text)
except Exception as e:
print(f"Error fetching agents: {e}")
def generate(self,
browsers: Optional[List[str]] = None,
os: Optional[Union[str, List[str]]] = None,
min_version: float = 0.0,
platforms: Optional[Union[str, List[str]]] = None,
pct_threshold: Optional[float] = None,
fallback: str = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 Chrome/116.0.0.0 Safari/537.36") -> Dict:
if not self.agents:
self._fetch_agents()
filtered_agents = self.agents
if pct_threshold:
filtered_agents = [a for a in filtered_agents if a['pct'] >= pct_threshold]
if browsers:
filtered_agents = [a for a in filtered_agents
if any(b.lower() in a['ua'].lower() for b in browsers)]
if os:
os_list = [os] if isinstance(os, str) else os
filtered_agents = [a for a in filtered_agents
if any(o.lower() in a['ua'].lower() for o in os_list)]
if platforms:
platform_list = [platforms] if isinstance(platforms, str) else platforms
filtered_agents = [a for a in filtered_agents
if any(p.lower() in a['ua'].lower() for p in platform_list)]
return filtered_agents[0] if filtered_agents else {'ua': fallback, 'pct': 0}
class UserAgentGenerator():
""" """
Generate random user agents with specified constraints. Generate random user agents with specified constraints.
@@ -32,6 +170,7 @@ class UserAgentGenerator:
android_version: Optional[str] = None android_version: Optional[str] = None
): Generates a random user agent string based on the specified parameters. ): Generates a random user agent string based on the specified parameters.
""" """
def __init__(self): def __init__(self):
# Previous platform definitions remain the same... # Previous platform definitions remain the same...
self.desktop_platforms = { self.desktop_platforms = {
@@ -47,7 +186,7 @@ class UserAgentGenerator:
"generic": "(X11; Linux x86_64)", "generic": "(X11; Linux x86_64)",
"ubuntu": "(X11; Ubuntu; Linux x86_64)", "ubuntu": "(X11; Ubuntu; Linux x86_64)",
"chrome_os": "(X11; CrOS x86_64 14541.0.0)", "chrome_os": "(X11; CrOS x86_64 14541.0.0)",
} },
} }
self.mobile_platforms = { self.mobile_platforms = {
@@ -60,26 +199,14 @@ class UserAgentGenerator:
"ios": { "ios": {
"iphone": "(iPhone; CPU iPhone OS 16_5 like Mac OS X)", "iphone": "(iPhone; CPU iPhone OS 16_5 like Mac OS X)",
"ipad": "(iPad; CPU OS 16_5 like Mac OS X)", "ipad": "(iPad; CPU OS 16_5 like Mac OS X)",
} },
} }
# Browser Combinations # Browser Combinations
self.browser_combinations = { self.browser_combinations = {
1: [ 1: [["chrome"], ["firefox"], ["safari"], ["edge"]],
["chrome"], 2: [["gecko", "firefox"], ["chrome", "safari"], ["webkit", "safari"]],
["firefox"], 3: [["chrome", "safari", "edge"], ["webkit", "chrome", "safari"]],
["safari"],
["edge"]
],
2: [
["gecko", "firefox"],
["chrome", "safari"],
["webkit", "safari"]
],
3: [
["chrome", "safari", "edge"],
["webkit", "chrome", "safari"]
]
} }
# Rendering Engines with versions # Rendering Engines with versions
@@ -90,7 +217,7 @@ class UserAgentGenerator:
"Gecko/20100101", "Gecko/20100101",
"Gecko/20100101", # Firefox usually uses this constant version "Gecko/20100101", # Firefox usually uses this constant version
"Gecko/2010010", "Gecko/2010010",
] ],
} }
# Browser Versions # Browser Versions
@@ -170,12 +297,14 @@ class UserAgentGenerator:
return browser_stack return browser_stack
def generate(self, def generate(
device_type: Optional[Literal['desktop', 'mobile']] = None, self,
os_type: Optional[str] = None, device_type: Optional[Literal["desktop", "mobile"]] = None,
device_brand: Optional[str] = None, os_type: Optional[str] = None,
browser_type: Optional[Literal['chrome', 'edge', 'safari', 'firefox']] = None, device_brand: Optional[str] = None,
num_browsers: int = 3) -> str: browser_type: Optional[Literal["chrome", "edge", "safari", "firefox"]] = None,
num_browsers: int = 3,
) -> str:
""" """
Generate a random user agent with specified constraints. Generate a random user agent with specified constraints.
@@ -196,9 +325,15 @@ class UserAgentGenerator:
browser_stack = self.get_browser_stack(num_browsers) browser_stack = self.get_browser_stack(num_browsers)
# Add appropriate legacy token based on browser stack # Add appropriate legacy token based on browser stack
if "Firefox" in str(browser_stack): if "Firefox" in str(browser_stack) or browser_type == "firefox":
components.append(random.choice(self.rendering_engines["gecko"])) components.append(random.choice(self.rendering_engines["gecko"]))
elif "Chrome" in str(browser_stack) or "Safari" in str(browser_stack): elif "Chrome" in str(browser_stack) or "Safari" in str(browser_stack) or browser_type == "chrome":
components.append(self.rendering_engines["chrome_webkit"])
components.append("(KHTML, like Gecko)")
elif "Edge" in str(browser_stack) or browser_type == "edge":
components.append(self.rendering_engines["safari_webkit"])
components.append("(KHTML, like Gecko)")
elif "Safari" in str(browser_stack) or browser_type == "safari":
components.append(self.rendering_engines["chrome_webkit"]) components.append(self.rendering_engines["chrome_webkit"])
components.append("(KHTML, like Gecko)") components.append("(KHTML, like Gecko)")
@@ -215,9 +350,13 @@ class UserAgentGenerator:
def get_random_platform(self, device_type, os_type, device_brand): def get_random_platform(self, device_type, os_type, device_brand):
"""Helper method to get random platform based on constraints""" """Helper method to get random platform based on constraints"""
platforms = self.desktop_platforms if device_type == 'desktop' else \ platforms = (
self.mobile_platforms if device_type == 'mobile' else \ self.desktop_platforms
{**self.desktop_platforms, **self.mobile_platforms} if device_type == "desktop"
else self.mobile_platforms
if device_type == "mobile"
else {**self.desktop_platforms, **self.mobile_platforms}
)
if os_type: if os_type:
for platform_group in [self.desktop_platforms, self.mobile_platforms]: for platform_group in [self.desktop_platforms, self.mobile_platforms]:
@@ -233,10 +372,10 @@ class UserAgentGenerator:
def parse_user_agent(self, user_agent: str) -> Dict[str, str]: def parse_user_agent(self, user_agent: str) -> Dict[str, str]:
"""Parse a user agent string to extract browser and version information""" """Parse a user agent string to extract browser and version information"""
browsers = { browsers = {
'chrome': r'Chrome/(\d+)', "chrome": r"Chrome/(\d+)",
'edge': r'Edg/(\d+)', "edge": r"Edg/(\d+)",
'safari': r'Version/(\d+)', "safari": r"Version/(\d+)",
'firefox': r'Firefox/(\d+)' "firefox": r"Firefox/(\d+)",
} }
result = {} result = {}
@@ -255,51 +394,36 @@ class UserAgentGenerator:
hints = [] hints = []
# Handle different browser combinations # Handle different browser combinations
if 'chrome' in browsers: if "chrome" in browsers:
hints.append(f'"Chromium";v="{browsers["chrome"]}"') hints.append(f'"Chromium";v="{browsers["chrome"]}"')
hints.append('"Not_A Brand";v="8"') hints.append('"Not_A Brand";v="8"')
if 'edge' in browsers: if "edge" in browsers:
hints.append(f'"Microsoft Edge";v="{browsers["edge"]}"') hints.append(f'"Microsoft Edge";v="{browsers["edge"]}"')
else: else:
hints.append(f'"Google Chrome";v="{browsers["chrome"]}"') hints.append(f'"Google Chrome";v="{browsers["chrome"]}"')
elif 'firefox' in browsers: elif "firefox" in browsers:
# Firefox doesn't typically send Sec-CH-UA # Firefox doesn't typically send Sec-CH-UA
return '""' return '""'
elif 'safari' in browsers: elif "safari" in browsers:
# Safari's format for client hints # Safari's format for client hints
hints.append(f'"Safari";v="{browsers["safari"]}"') hints.append(f'"Safari";v="{browsers["safari"]}"')
hints.append('"Not_A Brand";v="8"') hints.append('"Not_A Brand";v="8"')
return ', '.join(hints) return ", ".join(hints)
# Example usage: # Example usage:
if __name__ == "__main__": if __name__ == "__main__":
generator = UserAgentGenerator()
print(generator.generate())
print("\nSingle browser (Chrome):") # Usage example:
print(generator.generate(num_browsers=1, browser_type='chrome')) generator = ValidUAGenerator()
ua = generator.generate()
print(ua)
print("\nTwo browsers (Gecko/Firefox):") generator = OnlineUAGenerator()
print(generator.generate(num_browsers=2)) ua = generator.generate()
print(ua)
print("\nThree browsers (Chrome/Safari/Edge):")
print(generator.generate(num_browsers=3))
print("\nFirefox on Linux:")
print(generator.generate(
device_type='desktop',
os_type='linux',
browser_type='firefox',
num_browsers=2
))
print("\nChrome/Safari/Edge on Windows:")
print(generator.generate(
device_type='desktop',
os_type='windows',
num_browsers=3
))

File diff suppressed because it is too large


@@ -1,9 +1,9 @@
# version_manager.py # version_manager.py
import os
from pathlib import Path from pathlib import Path
from packaging import version from packaging import version
from . import __version__ from . import __version__
class VersionManager: class VersionManager:
def __init__(self): def __init__(self):
self.home_dir = Path.home() / ".crawl4ai" self.home_dir = Path.home() / ".crawl4ai"
@@ -27,4 +27,3 @@ class VersionManager:
installed = self.get_installed_version() installed = self.get_installed_version()
current = version.parse(__version__.__version__) current = version.parse(__version__.__version__)
return installed is None or installed < current return installed is None or installed < current


@@ -1,9 +1,10 @@
import os, time import os, time
os.environ["TOKENIZERS_PARALLELISM"] = "false" os.environ["TOKENIZERS_PARALLELISM"] = "false"
from pathlib import Path from pathlib import Path
from .models import UrlModel, CrawlResult from .models import UrlModel, CrawlResult
from .database import init_db, get_cached_url, cache_url, DB_PATH, flush_db from .database import init_db, get_cached_url, cache_url
from .utils import * from .utils import *
from .chunking_strategy import * from .chunking_strategy import *
from .extraction_strategy import * from .extraction_strategy import *
@@ -14,14 +15,27 @@ from .content_scraping_strategy import WebScrapingStrategy
from .config import * from .config import *
import warnings import warnings
import json import json
warnings.filterwarnings("ignore", message='Field "model_name" has conflict with protected namespace "model_".')
warnings.filterwarnings(
"ignore",
message='Field "model_name" has conflict with protected namespace "model_".',
)
class WebCrawler: class WebCrawler:
def __init__(self, crawler_strategy: CrawlerStrategy = None, always_by_pass_cache: bool = False, verbose: bool = False): def __init__(
self.crawler_strategy = crawler_strategy or LocalSeleniumCrawlerStrategy(verbose=verbose) self,
crawler_strategy: CrawlerStrategy = None,
always_by_pass_cache: bool = False,
verbose: bool = False,
):
self.crawler_strategy = crawler_strategy or LocalSeleniumCrawlerStrategy(
verbose=verbose
)
self.always_by_pass_cache = always_by_pass_cache self.always_by_pass_cache = always_by_pass_cache
self.crawl4ai_folder = os.path.join(os.getenv("CRAWL4_AI_BASE_DIRECTORY", Path.home()), ".crawl4ai") self.crawl4ai_folder = os.path.join(
os.getenv("CRAWL4_AI_BASE_DIRECTORY", Path.home()), ".crawl4ai"
)
os.makedirs(self.crawl4ai_folder, exist_ok=True) os.makedirs(self.crawl4ai_folder, exist_ok=True)
os.makedirs(f"{self.crawl4ai_folder}/cache", exist_ok=True) os.makedirs(f"{self.crawl4ai_folder}/cache", exist_ok=True)
init_db() init_db()
@@ -30,11 +44,11 @@ class WebCrawler:
def warmup(self): def warmup(self):
print("[LOG] 🌤️ Warming up the WebCrawler") print("[LOG] 🌤️ Warming up the WebCrawler")
self.run( self.run(
url='https://google.com/', url="https://google.com/",
word_count_threshold=5, word_count_threshold=5,
extraction_strategy=NoExtractionStrategy(), extraction_strategy=NoExtractionStrategy(),
bypass_cache=False, bypass_cache=False,
verbose=False verbose=False,
) )
self.ready = True self.ready = True
print("[LOG] 🌞 WebCrawler is ready to crawl") print("[LOG] 🌞 WebCrawler is ready to crawl")
@@ -80,6 +94,7 @@ class WebCrawler:
**kwargs, **kwargs,
) -> List[CrawlResult]: ) -> List[CrawlResult]:
extraction_strategy = extraction_strategy or NoExtractionStrategy() extraction_strategy = extraction_strategy or NoExtractionStrategy()
def fetch_page_wrapper(url_model, *args, **kwargs): def fetch_page_wrapper(url_model, *args, **kwargs):
return self.fetch_page(url_model, *args, **kwargs) return self.fetch_page(url_model, *args, **kwargs)
@@ -104,150 +119,176 @@ class WebCrawler:
return results return results
def run( def run(
self, self,
url: str, url: str,
word_count_threshold=MIN_WORD_THRESHOLD, word_count_threshold=MIN_WORD_THRESHOLD,
extraction_strategy: ExtractionStrategy = None, extraction_strategy: ExtractionStrategy = None,
chunking_strategy: ChunkingStrategy = RegexChunking(), chunking_strategy: ChunkingStrategy = RegexChunking(),
bypass_cache: bool = False, bypass_cache: bool = False,
css_selector: str = None, css_selector: str = None,
screenshot: bool = False, screenshot: bool = False,
user_agent: str = None, user_agent: str = None,
verbose=True, verbose=True,
**kwargs, **kwargs,
) -> CrawlResult: ) -> CrawlResult:
try: try:
extraction_strategy = extraction_strategy or NoExtractionStrategy() extraction_strategy = extraction_strategy or NoExtractionStrategy()
extraction_strategy.verbose = verbose extraction_strategy.verbose = verbose
if not isinstance(extraction_strategy, ExtractionStrategy): if not isinstance(extraction_strategy, ExtractionStrategy):
raise ValueError("Unsupported extraction strategy") raise ValueError("Unsupported extraction strategy")
if not isinstance(chunking_strategy, ChunkingStrategy): if not isinstance(chunking_strategy, ChunkingStrategy):
raise ValueError("Unsupported chunking strategy") raise ValueError("Unsupported chunking strategy")
word_count_threshold = max(word_count_threshold, MIN_WORD_THRESHOLD) word_count_threshold = max(word_count_threshold, MIN_WORD_THRESHOLD)
cached = None cached = None
screenshot_data = None screenshot_data = None
extracted_content = None extracted_content = None
if not bypass_cache and not self.always_by_pass_cache: if not bypass_cache and not self.always_by_pass_cache:
cached = get_cached_url(url) cached = get_cached_url(url)
if kwargs.get("warmup", True) and not self.ready: if kwargs.get("warmup", True) and not self.ready:
return None return None
if cached: if cached:
html = sanitize_input_encode(cached[1]) html = sanitize_input_encode(cached[1])
extracted_content = sanitize_input_encode(cached[4]) extracted_content = sanitize_input_encode(cached[4])
if screenshot: if screenshot:
screenshot_data = cached[9] screenshot_data = cached[9]
if not screenshot_data: if not screenshot_data:
cached = None cached = None
if not cached or not html: if not cached or not html:
if user_agent: if user_agent:
self.crawler_strategy.update_user_agent(user_agent) self.crawler_strategy.update_user_agent(user_agent)
t1 = time.time() t1 = time.time()
html = sanitize_input_encode(self.crawler_strategy.crawl(url, **kwargs)) html = sanitize_input_encode(self.crawler_strategy.crawl(url, **kwargs))
t2 = time.time() t2 = time.time()
if verbose: if verbose:
print(f"[LOG] 🚀 Crawling done for {url}, success: {bool(html)}, time taken: {t2 - t1:.2f} seconds") print(
if screenshot: f"[LOG] 🚀 Crawling done for {url}, success: {bool(html)}, time taken: {t2 - t1:.2f} seconds"
screenshot_data = self.crawler_strategy.take_screenshot() )
if screenshot:
screenshot_data = self.crawler_strategy.take_screenshot()
crawl_result = self.process_html(
crawl_result = self.process_html(url, html, extracted_content, word_count_threshold, extraction_strategy, chunking_strategy, css_selector, screenshot_data, verbose, bool(cached), **kwargs) url,
crawl_result.success = bool(html) html,
return crawl_result extracted_content,
except Exception as e: word_count_threshold,
if not hasattr(e, "msg"): extraction_strategy,
e.msg = str(e) chunking_strategy,
print(f"[ERROR] 🚫 Failed to crawl {url}, error: {e.msg}") css_selector,
return CrawlResult(url=url, html="", success=False, error_message=e.msg) screenshot_data,
verbose,
bool(cached),
**kwargs,
)
crawl_result.success = bool(html)
return crawl_result
except Exception as e:
if not hasattr(e, "msg"):
e.msg = str(e)
print(f"[ERROR] 🚫 Failed to crawl {url}, error: {e.msg}")
return CrawlResult(url=url, html="", success=False, error_message=e.msg)
def process_html( def process_html(
self, self,
url: str, url: str,
html: str, html: str,
extracted_content: str, extracted_content: str,
word_count_threshold: int, word_count_threshold: int,
extraction_strategy: ExtractionStrategy, extraction_strategy: ExtractionStrategy,
chunking_strategy: ChunkingStrategy, chunking_strategy: ChunkingStrategy,
css_selector: str, css_selector: str,
screenshot: bool, screenshot: bool,
verbose: bool, verbose: bool,
is_cached: bool, is_cached: bool,
**kwargs, **kwargs,
) -> CrawlResult: ) -> CrawlResult:
t = time.time() t = time.time()
# Extract content from HTML # Extract content from HTML
try: try:
t1 = time.time() t1 = time.time()
scrapping_strategy = WebScrapingStrategy() scrapping_strategy = WebScrapingStrategy()
extra_params = {k: v for k, v in kwargs.items() if k not in ["only_text", "image_description_min_word_threshold"]} extra_params = {
result = scrapping_strategy.scrap( k: v
url, for k, v in kwargs.items()
html, if k not in ["only_text", "image_description_min_word_threshold"]
word_count_threshold=word_count_threshold, }
css_selector=css_selector, result = scrapping_strategy.scrap(
only_text=kwargs.get("only_text", False), url,
image_description_min_word_threshold=kwargs.get( html,
"image_description_min_word_threshold", IMAGE_DESCRIPTION_MIN_WORD_THRESHOLD word_count_threshold=word_count_threshold,
), css_selector=css_selector,
**extra_params, only_text=kwargs.get("only_text", False),
) image_description_min_word_threshold=kwargs.get(
"image_description_min_word_threshold",
# result = get_content_of_website_optimized(url, html, word_count_threshold, css_selector=css_selector, only_text=kwargs.get("only_text", False)) IMAGE_DESCRIPTION_MIN_WORD_THRESHOLD,
if verbose: ),
print(f"[LOG] 🚀 Content extracted for {url}, success: True, time taken: {time.time() - t1:.2f} seconds") **extra_params,
if result is None:
raise ValueError(f"Failed to extract content from the website: {url}")
except InvalidCSSSelectorError as e:
raise ValueError(str(e))
cleaned_html = sanitize_input_encode(result.get("cleaned_html", ""))
markdown = sanitize_input_encode(result.get("markdown", ""))
media = result.get("media", [])
links = result.get("links", [])
metadata = result.get("metadata", {})
if extracted_content is None:
if verbose:
print(f"[LOG] 🔥 Extracting semantic blocks for {url}, Strategy: {extraction_strategy.name}")
sections = chunking_strategy.chunk(markdown)
extracted_content = extraction_strategy.run(url, sections)
extracted_content = json.dumps(extracted_content, indent=4, default=str, ensure_ascii=False)
if verbose:
print(f"[LOG] 🚀 Extraction done for {url}, time taken: {time.time() - t:.2f} seconds.")
screenshot = None if not screenshot else screenshot
if not is_cached:
cache_url(
url,
html,
cleaned_html,
markdown,
extracted_content,
True,
json.dumps(media),
json.dumps(links),
json.dumps(metadata),
screenshot=screenshot,
)
return CrawlResult(
url=url,
html=html,
cleaned_html=format_html(cleaned_html),
markdown=markdown,
media=media,
links=links,
metadata=metadata,
screenshot=screenshot,
extracted_content=extracted_content,
success=True,
error_message="",
) )
# result = get_content_of_website_optimized(url, html, word_count_threshold, css_selector=css_selector, only_text=kwargs.get("only_text", False))
if verbose:
print(
f"[LOG] 🚀 Content extracted for {url}, success: True, time taken: {time.time() - t1:.2f} seconds"
)
if result is None:
raise ValueError(f"Failed to extract content from the website: {url}")
except InvalidCSSSelectorError as e:
raise ValueError(str(e))
cleaned_html = sanitize_input_encode(result.get("cleaned_html", ""))
markdown = sanitize_input_encode(result.get("markdown", ""))
media = result.get("media", [])
links = result.get("links", [])
metadata = result.get("metadata", {})
if extracted_content is None:
if verbose:
print(
f"[LOG] 🔥 Extracting semantic blocks for {url}, Strategy: {extraction_strategy.name}"
)
sections = chunking_strategy.chunk(markdown)
extracted_content = extraction_strategy.run(url, sections)
extracted_content = json.dumps(
extracted_content, indent=4, default=str, ensure_ascii=False
)
if verbose:
print(
f"[LOG] 🚀 Extraction done for {url}, time taken: {time.time() - t:.2f} seconds."
)
screenshot = None if not screenshot else screenshot
if not is_cached:
cache_url(
url,
html,
cleaned_html,
markdown,
extracted_content,
True,
json.dumps(media),
json.dumps(links),
json.dumps(metadata),
screenshot=screenshot,
)
return CrawlResult(
url=url,
html=html,
cleaned_html=format_html(cleaned_html),
markdown=markdown,
media=media,
links=links,
metadata=metadata,
screenshot=screenshot,
extracted_content=extracted_content,
success=True,
error_message="",
)


@@ -0,0 +1,244 @@
# BFS Scraper Strategy: Smart Web Traversal
The BFS (Breadth-First Search) Scraper Strategy traverses a website level by level: every URL at the current depth is processed before the crawl moves one level deeper. Along the way it respects robots.txt, applies rate limits and politeness delays, and caps the number of concurrent requests.
```mermaid
flowchart TB
Start([Start]) --> Init[Initialize BFS Strategy]
Init --> InitStats[Initialize CrawlStats]
InitStats --> InitQueue[Initialize Priority Queue]
InitQueue --> AddStart[Add Start URL to Queue]
AddStart --> CheckState{Queue Non-Empty or\nTasks Pending?}
CheckState -->|No| Cleanup[Cleanup & Stats]
Cleanup --> End([End])
CheckState -->|Yes| CheckCancel{Cancel\nRequested?}
CheckCancel -->|Yes| Cleanup
CheckCancel -->|No| CheckConcurrent{Under Max\nConcurrent?}
CheckConcurrent -->|No| WaitComplete[Wait for Task Completion]
WaitComplete --> YieldResult[Yield Result]
YieldResult --> CheckState
CheckConcurrent -->|Yes| GetNextURL[Get Next URL from Queue]
GetNextURL --> ValidateURL{Already\nVisited?}
ValidateURL -->|Yes| CheckState
ValidateURL -->|No| ProcessURL[Process URL]
subgraph URL_Processing [URL Processing]
ProcessURL --> CheckValid{URL Valid?}
CheckValid -->|No| UpdateStats[Update Skip Stats]
CheckValid -->|Yes| CheckRobots{Allowed by\nrobots.txt?}
CheckRobots -->|No| UpdateRobotStats[Update Robot Stats]
CheckRobots -->|Yes| ApplyDelay[Apply Politeness Delay]
ApplyDelay --> FetchContent[Fetch Content with Rate Limit]
FetchContent --> CheckError{Error?}
CheckError -->|Yes| Retry{Retry\nNeeded?}
Retry -->|Yes| FetchContent
Retry -->|No| UpdateFailStats[Update Fail Stats]
CheckError -->|No| ExtractLinks[Extract & Process Links]
ExtractLinks --> ScoreURLs[Score New URLs]
ScoreURLs --> AddToQueue[Add to Priority Queue]
end
ProcessURL --> CreateTask{Parallel\nProcessing?}
CreateTask -->|Yes| AddTask[Add to Pending Tasks]
CreateTask -->|No| DirectProcess[Process Directly]
AddTask --> CheckState
DirectProcess --> YieldResult
UpdateStats --> CheckState
UpdateRobotStats --> CheckState
UpdateFailStats --> CheckState
classDef process fill:#90caf9,stroke:#000,stroke-width:2px;
classDef decision fill:#fff59d,stroke:#000,stroke-width:2px;
classDef error fill:#ef9a9a,stroke:#000,stroke-width:2px;
classDef stats fill:#a5d6a7,stroke:#000,stroke-width:2px;
class Start,End stats;
class CheckState,CheckCancel,CheckConcurrent,ValidateURL,CheckValid,CheckRobots,CheckError,Retry,CreateTask decision;
class UpdateStats,UpdateRobotStats,UpdateFailStats,InitStats,Cleanup stats;
class ProcessURL,FetchContent,ExtractLinks,ScoreURLs process;
```
## How It Works
The BFS strategy crawls a website by:
1. Starting from a root URL
2. Processing all URLs at the current depth
3. Moving to URLs at the next depth level
4. Continuing until maximum depth is reached
Because each level is finished before the next one begins, pages closest to the root URL are always crawled first, and the maximum depth bounds how far the crawl can spread.
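
The level-by-level traversal described above can be sketched with a plain FIFO queue. This is a minimal illustration over an in-memory link graph, not the actual `BFSScraperStrategy` implementation; the names `bfs_crawl_order` and `SITE` are hypothetical and no network requests are made.

```python
from collections import deque
from typing import Dict, List, Tuple


def bfs_crawl_order(
    links: Dict[str, List[str]], start_url: str, max_depth: int
) -> List[Tuple[str, int]]:
    """Return (url, depth) pairs in the order a breadth-first crawl visits them."""
    visited = {start_url}
    queue = deque([(start_url, 0)])
    order: List[Tuple[str, int]] = []
    while queue:
        url, depth = queue.popleft()
        order.append((url, depth))
        if depth >= max_depth:
            continue  # pages at the depth limit are visited but not expanded
        for child in links.get(url, []):
            if child not in visited:  # skip URLs that were already queued
                visited.add(child)
                queue.append((child, depth + 1))
    return order


# Hypothetical site structure, used only for illustration.
SITE = {
    "/": ["/docs", "/blog"],
    "/docs": ["/docs/install", "/"],
    "/blog": ["/blog/post-1"],
    "/docs/install": ["/docs/install/linux"],
}

for url, depth in bfs_crawl_order(SITE, "/", max_depth=2):
    print(depth, url)
```

With `max_depth=2`, `/docs/install/linux` is never visited because it sits at depth 3.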
## Key Features
### 1. Smart URL Processing
```python
strategy = BFSScraperStrategy(
max_depth=2,
filter_chain=my_filters,
url_scorer=my_scorer,
max_concurrent=5
)
```
- Controls crawl depth
- Filters unwanted URLs
- Scores URLs for priority
- Manages concurrent requests
### 2. Polite Crawling
The strategy automatically implements web crawling best practices:
- Respects robots.txt
- Implements rate limiting
- Adds politeness delays
- Manages concurrent requests
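
As a rough illustration of the first three points, the sketch below checks a URL against robots.txt rules with the standard library's `urllib.robotparser` and computes how long to wait before the next request. The helper names (`build_robot_rules`, `PolitenessGate`) and the `demo-bot` user agent are assumptions made for this example; the strategy's internal implementation may differ.

```python
from typing import Optional
from urllib.robotparser import RobotFileParser


def build_robot_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


class PolitenessGate:
    """Tracks the last request time and reports the remaining delay."""

    def __init__(self, min_delay: float) -> None:
        self.min_delay = min_delay
        self._last_request: Optional[float] = None

    def wait_time(self, now: float) -> float:
        """Seconds to wait at time `now` before the next request is allowed."""
        if self._last_request is None:
            return 0.0
        return max(0.0, self.min_delay - (now - self._last_request))

    def mark(self, now: float) -> None:
        """Record that a request was sent at time `now`."""
        self._last_request = now


rules = build_robot_rules("User-agent: *\nDisallow: /private/\n")
print(rules.can_fetch("demo-bot", "https://example.com/private/page"))
print(rules.can_fetch("demo-bot", "https://example.com/docs"))
```

In a real crawl, `now` would come from `time.monotonic()` and the caller would sleep for `wait_time(now)` seconds before fetching.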
### 3. Link Processing Control
```python
strategy = BFSScraperStrategy(
...,
process_external_links=False # Only process internal links
)
```
- Control whether to follow external links
- Default: internal links only
- Enable external links when needed
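
One plausible way to tell internal from external links is to compare hostnames after resolving relative URLs. The sketch below assumes that "internal" means "same host as the page being crawled" (so subdomains count as external); the function names are hypothetical and the strategy may apply a different rule.

```python
from typing import List
from urllib.parse import urljoin, urlparse


def is_internal(link: str, base_url: str) -> bool:
    """True if `link` resolves to the same host as `base_url`."""
    absolute = urljoin(base_url, link)
    return urlparse(absolute).netloc.lower() == urlparse(base_url).netloc.lower()


def select_links(
    links: List[str], base_url: str, process_external_links: bool = False
) -> List[str]:
    """Resolve links against the page URL and drop external ones unless enabled."""
    absolute = [urljoin(base_url, link) for link in links]
    if process_external_links:
        return absolute
    return [url for url in absolute if is_internal(url, base_url)]


found = ["/about", "https://other.org/x", "page2"]
print(select_links(found, "https://example.com/docs/"))
print(select_links(found, "https://example.com/docs/", process_external_links=True))
```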
## Configuration Options
| Parameter | Description | Default |
|-----------|-------------|---------|
| max_depth | Maximum crawl depth | Required |
| filter_chain | URL filtering rules | Required |
| url_scorer | URL priority scoring | Required |
| max_concurrent | Max parallel requests | 5 |
| min_crawl_delay | Seconds between requests | 1 |
| process_external_links | Follow external links | False |
## Best Practices
1. **Set Appropriate Depth**
- Start with smaller depths (2-3)
- Increase based on needs
- Consider site structure
2. **Configure Filters**
- Use URL patterns
- Filter by content type
- Avoid unwanted sections
3. **Tune Performance**
- Adjust max_concurrent
- Set appropriate delays
- Monitor resource usage
4. **Handle External Links**
- Keep process_external_links=False for focused crawls
- Enable only when needed
- Consider additional filtering
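
To make the filter advice concrete, here is a small stand-alone sketch of a filter chain built from plain functions. It is not the library's `FilterChain`; `pattern_filter`, `exclude_extensions`, and `apply_chain` are hypothetical helpers, and the extension check is only a cheap stand-in for real content-type filtering, which needs the response headers.

```python
from fnmatch import fnmatchcase
from typing import Callable, Iterable, List

UrlFilter = Callable[[str], bool]


def pattern_filter(pattern: str) -> UrlFilter:
    """Accept URLs that match a shell-style wildcard pattern."""
    return lambda url: fnmatchcase(url, pattern)


def exclude_extensions(extensions: Iterable[str]) -> UrlFilter:
    """Reject URLs whose path (query string ignored) ends with a blocked extension."""
    blocked = tuple(ext.lower() for ext in extensions)
    return lambda url: not url.lower().split("?", 1)[0].endswith(blocked)


def apply_chain(url: str, filters: List[UrlFilter]) -> bool:
    """A URL passes only if every filter in the chain accepts it."""
    return all(check(url) for check in filters)


chain = [pattern_filter("*.example.com/*"), exclude_extensions([".pdf", ".zip"])]
for candidate in [
    "https://docs.example.com/guide",
    "https://docs.example.com/manual.pdf",
    "https://example.com/guide",
]:
    print(candidate, apply_chain(candidate, chain))
```

Note that the pattern `*.example.com/*` requires a dot before `example.com`, so the bare domain `https://example.com/guide` is rejected. Wildcard patterns are worth testing against a handful of real URLs before starting a long crawl.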
## Example Usage
```python
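# Import locations for AsyncWebScraper, URLPatternFilter and ContentTypeFilter
# are assumed to sit next to the names imported in the original example;
# adjust them to match your installed version.
from crawl4ai.scraper import AsyncWebScraper, BFSScraperStrategy
from crawl4ai.scraper.filters import FilterChain, URLPatternFilter, ContentTypeFilter
from crawl4ai.scraper.scorers import BasicURLScorer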
# Configure strategy
strategy = BFSScraperStrategy(
max_depth=3,
filter_chain=FilterChain([
URLPatternFilter("*.example.com/*"),
ContentTypeFilter(["text/html"])
]),
url_scorer=BasicURLScorer(),
max_concurrent=5,
min_crawl_delay=1,
process_external_links=False
)
# Use with AsyncWebScraper
scraper = AsyncWebScraper(crawler, strategy)
results = await scraper.ascrape("https://example.com")
```
## Common Use Cases
### 1. Site Mapping
```python
strategy = BFSScraperStrategy(
max_depth=5,
filter_chain=site_filter,
url_scorer=depth_scorer,
process_external_links=False
)
```
Perfect for creating complete site maps or understanding site structure.
### 2. Content Aggregation
```python
strategy = BFSScraperStrategy(
max_depth=2,
filter_chain=content_filter,
url_scorer=relevance_scorer,
max_concurrent=3
)
```
Ideal for collecting specific types of content (articles, products, etc.).
### 3. Link Analysis
```python
strategy = BFSScraperStrategy(
max_depth=1,
filter_chain=link_filter,
url_scorer=link_scorer,
process_external_links=True
)
```
Useful for analyzing both internal and external link structures.
## Advanced Features
### Progress Monitoring
```python
async for result in scraper.ascrape(url):
    print(f"Current depth: {strategy.stats.current_depth}")
    print(f"Processed URLs: {strategy.stats.urls_processed}")
```
### Custom URL Scoring
```python
class CustomScorer(URLScorer):
    def score(self, url: str) -> float:
        # Lower scores = higher priority
        return score_based_on_criteria(url)
```
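To make the "lower scores = higher priority" convention from the comment above concrete, here is a small stdlib sketch using `heapq`, which always pops the smallest item first. The helpers `segment_count` and `order_by_priority` are hypothetical names for this illustration and are not part of the library:

```python
import heapq
from typing import Callable, List
from urllib.parse import urlparse


def segment_count(url: str) -> float:
    """Example score: number of non-empty path segments (shallower = lower)."""
    return float(len([s for s in urlparse(url).path.split("/") if s]))


def order_by_priority(urls: List[str], score: Callable[[str], float]) -> List[str]:
    """Return urls ordered so that the lowest score comes out first."""
    # The index breaks ties, so equal scores keep their discovery order.
    heap = [(score(url), index, url) for index, url in enumerate(urls)]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[2] for _ in range(len(heap))]
```

With `segment_count` as the scorer, shallow pages such as the site root are visited before deeply nested ones.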
## Troubleshooting
1. **Slow Crawling**
- Increase max_concurrent
- Adjust min_crawl_delay
- Check network conditions
2. **Missing Content**
- Verify max_depth
- Check filter settings
- Review URL patterns
3. **High Resource Usage**
- Reduce max_concurrent
- Increase crawl delay
- Add more specific filters


@@ -0,0 +1,260 @@
from crawl4ai.async_configs import CrawlerRunConfig, BrowserConfig
from crawl4ai.content_scraping_strategy import LXMLWebScrapingStrategy
from crawl4ai.deep_crawl import (
BFSDeepCrawlStrategy,
FilterChain,
URLPatternFilter,
ContentTypeFilter,
DomainFilter,
KeywordRelevanceScorer,
PathDepthScorer,
FreshnessScorer,
CompositeScorer,
)
from crawl4ai.async_webcrawler import AsyncWebCrawler
import re
import time
import logging
browser_config = BrowserConfig(headless=True, viewport_width=800, viewport_height=600)
async def basic_example():
    """
    Basic example: Deep crawl a blog site for articles
    - Crawls only HTML pages
    - Stays within the blog section
    - Collects all results at once
    """
    # Create a simple filter chain
    filter_chain = FilterChain(
        [
            # Only crawl pages within the blog section
            URLPatternFilter("*/basic/*"),
            # Only process HTML pages
            ContentTypeFilter(["text/html"]),
        ]
    )
    # Initialize the strategy with basic configuration
    bfs_strategy = BFSDeepCrawlStrategy(
        max_depth=2,  # Only go 2 levels deep
        filter_chain=filter_chain,
        url_scorer=None,  # Use default scoring
        process_external_links=True,
    )
    # Create the crawler
    async with AsyncWebCrawler(
        config=browser_config,
    ) as crawler:
        # Start scraping
        try:
            results = await crawler.arun(
                "https://crawl4ai.com/mkdocs",
                CrawlerRunConfig(deep_crawl_strategy=bfs_strategy),
            )
            # Process results
            print(f"Crawled {len(results)} pages:")
            for result in results:
                print(f"- {result.url}: {len(result.html)} bytes")
        except Exception as e:
            print(f"Error during scraping: {e}")
async def advanced_example():
    """
    Advanced example: Intelligent news site crawling
    - Uses all filter types
    - Implements sophisticated scoring
    - Streams results
    - Includes monitoring and logging
    """
    # Set up logging
    logging.basicConfig(level=logging.INFO)
    logger = logging.getLogger("advanced_deep_crawler")
    # Create sophisticated filter chain
    filter_chain = FilterChain(
        [
            # Domain control
            DomainFilter(
                allowed_domains=["techcrunch.com"],
                blocked_domains=["login.techcrunch.com", "legal.yahoo.com"],
            ),
            # URL patterns
            URLPatternFilter(
                [
                    "*/article/*",
                    "*/news/*",
                    "*/blog/*",
                    re.compile(r"\d{4}/\d{2}/.*"),  # Date-based URLs
                ]
            ),
            # Content types
            ContentTypeFilter(["text/html", "application/xhtml+xml"]),
        ]
    )
    # Create composite scorer
    scorer = CompositeScorer(
        [
            # Prioritize by keywords
            KeywordRelevanceScorer(
                keywords=["news", "breaking", "update", "latest"], weight=1.0
            ),
            # Prefer optimal URL structure
            PathDepthScorer(optimal_depth=3, weight=0.7),
            # Prioritize fresh content
            FreshnessScorer(weight=0.9),
        ]
    )
    # Initialize strategy with advanced configuration
    bfs_strategy = BFSDeepCrawlStrategy(
        max_depth=2, filter_chain=filter_chain, url_scorer=scorer
    )
    # Create crawler
    async with AsyncWebCrawler(
        config=browser_config,
    ) as crawler:
        # Track statistics
        stats = {"processed": 0, "errors": 0, "total_size": 0}
        try:
            # Use streaming mode
            results = []
            result_generator = await crawler.arun(
                "https://techcrunch.com",
                config=CrawlerRunConfig(deep_crawl_strategy=bfs_strategy, stream=True),
            )
            async for result in result_generator:
                stats["processed"] += 1
                if result.success:
                    stats["total_size"] += len(result.html)
                    logger.info(
                        f"Processed at depth: {result.depth} with score: {result.score:.3f} : \n {result.url}"
                    )
                    results.append(result)
                else:
                    stats["errors"] += 1
                    logger.error(
                        f"Failed to process {result.url}: {result.error_message}"
                    )
                # Log progress regularly
                if stats["processed"] % 10 == 0:
                    logger.info(f"Progress: {stats['processed']} URLs processed")
        except Exception as e:
            logger.error(f"Scraping error: {e}")
        finally:
            # Print final statistics
            logger.info("Scraping completed:")
            logger.info(f"- URLs processed: {stats['processed']}")
            logger.info(f"- Errors: {stats['errors']}")
            logger.info(f"- Total content size: {stats['total_size'] / 1024:.2f} KB")
            # Print filter statistics
            for filter_ in filter_chain.filters:
                logger.info(f"{filter_.name} stats:")
                logger.info(f"- Passed: {filter_.stats.passed_urls}")
                logger.info(f"- Rejected: {filter_.stats.rejected_urls}")
            # Print scorer statistics
            logger.info("Scoring statistics:")
            logger.info(f"- Average score: {scorer.stats.average_score:.2f}")
            logger.info(
                f"- Score range: {scorer.stats.min_score:.2f} - {scorer.stats.max_score:.2f}"
            )
async def basic_example_many_urls():
    filter_chain = FilterChain(
        [
            URLPatternFilter("*/basic/*"),
            ContentTypeFilter(["text/html"]),
        ]
    )
    # Initialize the strategy with basic configuration
    bfs_strategy = BFSDeepCrawlStrategy(
        max_depth=2,  # Only go 2 levels deep
        filter_chain=filter_chain,
        url_scorer=None,  # Use default scoring
        process_external_links=False,
    )
    # Create the crawler
    async with AsyncWebCrawler(
        config=browser_config,
    ) as crawler:
        # Start scraping
        try:
            results = await crawler.arun_many(
                urls=["https://crawl4ai.com/mkdocs", "https://aravindkarnam.com"],
                config=CrawlerRunConfig(deep_crawl_strategy=bfs_strategy),
            )
            # Process results: one list of page results per start URL,
            # so count pages across all of them rather than len(results)
            total_pages = sum(len(url_result) for url_result in results)
            print(f"Crawled {total_pages} pages:")
            for url_result in results:
                for result in url_result:
                    print(f"- {result.url}: {len(result.html)} bytes")
        except Exception as e:
            print(f"Error during scraping: {e}")
async def basic_example_many_urls_stream():
    filter_chain = FilterChain(
        [
            URLPatternFilter("*/basic/*"),
            ContentTypeFilter(["text/html"]),
        ]
    )
    # Initialize the strategy with basic configuration
    bfs_strategy = BFSDeepCrawlStrategy(
        max_depth=2,  # Only go 2 levels deep
        filter_chain=filter_chain,
        url_scorer=None,  # Use default scoring
        process_external_links=False,
    )
    # Create the crawler
    async with AsyncWebCrawler(
        config=browser_config,
    ) as crawler:
        # Start scraping
        try:
            async for result in await crawler.arun_many(
                urls=["https://crawl4ai.com/mkdocs", "https://aravindkarnam.com"],
                config=CrawlerRunConfig(deep_crawl_strategy=bfs_strategy, stream=True),
            ):
                # Process results
                print(f"- {result.url}: {len(result.html)} bytes")
        except Exception as e:
            print(f"Error during scraping: {e}")
if __name__ == "__main__":
    import asyncio

    # Run basic example (time is already imported at module level)
    start_time = time.perf_counter()
    print("Running basic deep crawl example...")
    asyncio.run(basic_example())
    end_time = time.perf_counter()
    print(f"Basic deep crawl example completed in {end_time - start_time:.2f} seconds")
    # Run advanced example
    print("\nRunning advanced deep crawl example...")
    asyncio.run(advanced_example())
    print("\nRunning basic deep crawl example with arun_many...")
    asyncio.run(basic_example_many_urls())
    print("\nRunning basic deep crawl example with arun_many streaming enabled...")
    asyncio.run(basic_example_many_urls_stream())


@@ -0,0 +1,342 @@
# URL Filters and Scorers
The crawl4ai library provides powerful URL filtering and scoring capabilities that help you control and prioritize your web crawling. This guide explains how to use these features effectively.
```mermaid
flowchart TB
Start([URL Input]) --> Chain[Filter Chain]
subgraph Chain Process
Chain --> Pattern{URL Pattern\nFilter}
Pattern -->|Match| Content{Content Type\nFilter}
Pattern -->|No Match| Reject1[Reject URL]
Content -->|Allowed| Domain{Domain\nFilter}
Content -->|Not Allowed| Reject2[Reject URL]
Domain -->|Allowed| Accept[Accept URL]
Domain -->|Blocked| Reject3[Reject URL]
end
subgraph Statistics
Pattern --> UpdatePattern[Update Pattern Stats]
Content --> UpdateContent[Update Content Stats]
Domain --> UpdateDomain[Update Domain Stats]
Accept --> UpdateChain[Update Chain Stats]
Reject1 --> UpdateChain
Reject2 --> UpdateChain
Reject3 --> UpdateChain
end
Accept --> End([End])
Reject1 --> End
Reject2 --> End
Reject3 --> End
classDef process fill:#90caf9,stroke:#000,stroke-width:2px;
classDef decision fill:#fff59d,stroke:#000,stroke-width:2px;
classDef reject fill:#ef9a9a,stroke:#000,stroke-width:2px;
classDef accept fill:#a5d6a7,stroke:#000,stroke-width:2px;
class Start,End accept;
class Pattern,Content,Domain decision;
class Reject1,Reject2,Reject3 reject;
class Chain,UpdatePattern,UpdateContent,UpdateDomain,UpdateChain process;
```
## URL Filters
URL filters help you control which URLs are crawled. Multiple filters can be chained together to create sophisticated filtering rules.
### Available Filters
1. **URL Pattern Filter**
```python
pattern_filter = URLPatternFilter([
"*.example.com/*", # Glob pattern
"*/article/*", # Path pattern
re.compile(r"blog-\d+") # Regex pattern
])
```
- Supports glob patterns and regex
- Multiple patterns per filter
- Pattern pre-compilation for performance
2. **Content Type Filter**
```python
content_filter = ContentTypeFilter([
"text/html",
"application/pdf"
], check_extension=True)
```
- Filter by MIME types
- Extension checking
- Support for multiple content types
3. **Domain Filter**
```python
domain_filter = DomainFilter(
allowed_domains=["example.com", "blog.example.com"],
blocked_domains=["ads.example.com"]
)
```
- Allow/block specific domains
- Subdomain support
- Efficient domain matching
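As a rough illustration of allow/block lists with subdomain support, the sketch below matches a host against a domain by exact match or dot-suffix. The function names are hypothetical, and the rule that a blocked domain overrides an allowed parent domain is an assumption of this sketch rather than documented library behavior:

```python
from typing import Iterable, Optional
from urllib.parse import urlparse


def domain_matches(host: str, domain: str) -> bool:
    """True if host is domain itself or one of its subdomains."""
    host, domain = host.lower(), domain.lower()
    return host == domain or host.endswith("." + domain)


def domain_allowed(
    url: str,
    allowed_domains: Optional[Iterable[str]] = None,
    blocked_domains: Optional[Iterable[str]] = None,
) -> bool:
    """Apply the block list first, then the allow list (None = allow all)."""
    host = urlparse(url).hostname or ""
    if any(domain_matches(host, d) for d in blocked_domains or []):
        return False
    if allowed_domains is None:
        return True
    return any(domain_matches(host, d) for d in allowed_domains)
```

Matching on a dot-prefixed suffix avoids the classic pitfall of a plain `endswith("example.com")`, which would also accept `notexample.com`.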
### Creating Filter Chains
```python
# Create and configure a filter chain
filter_chain = FilterChain([
URLPatternFilter(["*.example.com/*"]),
ContentTypeFilter(["text/html"]),
DomainFilter(blocked_domains=["ads.*"])
])
# Add more filters
filter_chain.add_filter(
URLPatternFilter(["*/article/*"])
)
```
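To show what chaining means in practice, here is a stdlib-only sketch of a pattern filter and a chain that stops at the first rejection. `PatternFilter`, `SimpleFilterChain`, and `FilterStats` are stand-ins written for this example; they mimic the shape of the API above but are not the library's classes, and the glob semantics (whole-URL match via `fnmatch.translate`) are an assumption:

```python
import fnmatch
import re
from dataclasses import dataclass
from typing import Iterable, Union


@dataclass
class FilterStats:
    passed: int = 0
    rejected: int = 0


class PatternFilter:
    """Accept a URL if any glob (whole-URL match) or regex (search) matches."""

    def __init__(self, patterns: Iterable[Union[str, re.Pattern]]):
        # Pre-compile once: globs are translated to anchored regexes.
        self._matchers = [
            p.search if isinstance(p, re.Pattern)
            else re.compile(fnmatch.translate(p)).match
            for p in patterns
        ]
        self.stats = FilterStats()

    def apply(self, url: str) -> bool:
        accepted = any(matcher(url) for matcher in self._matchers)
        if accepted:
            self.stats.passed += 1
        else:
            self.stats.rejected += 1
        return accepted


class SimpleFilterChain:
    """A URL passes only if every filter accepts it, checked in order."""

    def __init__(self, filters=()):
        self.filters = list(filters)

    def add_filter(self, url_filter) -> "SimpleFilterChain":
        self.filters.append(url_filter)
        return self

    def apply(self, url: str) -> bool:
        # all() short-circuits, so later filters never see a rejected URL.
        return all(f.apply(url) for f in self.filters)
```

Since the chain short-circuits, putting the cheapest or most selective filter first reduces the work done by the filters behind it, and per-filter statistics only count the URLs that actually reached that filter.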
```mermaid
flowchart TB
Start([URL Input]) --> Composite[Composite Scorer]
subgraph Scoring Process
Composite --> Keywords[Keyword Relevance]
Composite --> Path[Path Depth]
Composite --> Content[Content Type]
Composite --> Fresh[Freshness]
Composite --> Domain[Domain Authority]
Keywords --> KeywordScore[Calculate Score]
Path --> PathScore[Calculate Score]
Content --> ContentScore[Calculate Score]
Fresh --> FreshScore[Calculate Score]
Domain --> DomainScore[Calculate Score]
KeywordScore --> Weight1[Apply Weight]
PathScore --> Weight2[Apply Weight]
ContentScore --> Weight3[Apply Weight]
FreshScore --> Weight4[Apply Weight]
DomainScore --> Weight5[Apply Weight]
end
Weight1 --> Combine[Combine Scores]
Weight2 --> Combine
Weight3 --> Combine
Weight4 --> Combine
Weight5 --> Combine
Combine --> Normalize{Normalize?}
Normalize -->|Yes| NormalizeScore[Normalize Combined Score]
Normalize -->|No| FinalScore[Final Score]
NormalizeScore --> FinalScore
FinalScore --> Stats[Update Statistics]
Stats --> End([End])
classDef process fill:#90caf9,stroke:#000,stroke-width:2px;
classDef scorer fill:#fff59d,stroke:#000,stroke-width:2px;
classDef calc fill:#a5d6a7,stroke:#000,stroke-width:2px;
classDef decision fill:#ef9a9a,stroke:#000,stroke-width:2px;
class Start,End calc;
class Keywords,Path,Content,Fresh,Domain scorer;
class KeywordScore,PathScore,ContentScore,FreshScore,DomainScore process;
class Normalize decision;
```
## URL Scorers
URL scorers help prioritize which URLs to crawl first. Higher scores indicate higher priority.
### Available Scorers
1. **Keyword Relevance Scorer**
```python
keyword_scorer = KeywordRelevanceScorer(
keywords=["python", "programming"],
weight=1.0,
case_sensitive=False
)
```
- Score based on keyword matches
- Case sensitivity options
- Weighted scoring
2. **Path Depth Scorer**
```python
path_scorer = PathDepthScorer(
optimal_depth=3, # Preferred URL depth
weight=0.7
)
```
- Score based on URL path depth
- Configurable optimal depth
- Diminishing returns for deeper paths
3. **Content Type Scorer**
```python
content_scorer = ContentTypeScorer({
r'\.html$': 1.0,
r'\.pdf$': 0.8,
r'\.xml$': 0.6
})
```
- Score based on file types
- Configurable type weights
- Pattern matching support
4. **Freshness Scorer**
```python
freshness_scorer = FreshnessScorer(weight=0.9)
```
- Score based on date indicators in URLs
- Multiple date format support
- Recency weighting
5. **Domain Authority Scorer**
```python
authority_scorer = DomainAuthorityScorer({
"python.org": 1.0,
"github.com": 0.9,
"medium.com": 0.7
})
```
- Score based on domain importance
- Configurable domain weights
- Default weight for unknown domains
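The ideas behind path-depth and freshness scoring can be sketched without the library. The formulas below (inverse distance from the optimal depth, inverse age in years, a neutral default when no date is found) are illustrative assumptions chosen for this example, not the scorers' actual formulas, and the function names are hypothetical:

```python
import re
from urllib.parse import urlparse

# Matches date-based paths such as /2025/01/
_DATE_IN_PATH = re.compile(r"/((?:19|20)\d{2})/(\d{2})/")


def path_depth_score(url: str, optimal_depth: int = 3) -> float:
    """1.0 at the optimal depth, falling off as the path gets shallower or deeper."""
    depth = len([segment for segment in urlparse(url).path.split("/") if segment])
    return 1.0 / (1.0 + abs(depth - optimal_depth))


def freshness_score(url: str, current_year: int, default: float = 0.5) -> float:
    """1.0 for URLs dated in the current year, lower for older ones."""
    match = _DATE_IN_PATH.search(url)
    if match is None:
        return default  # no date indicator in the URL
    age_in_years = max(0, current_year - int(match.group(1)))
    return 1.0 / (1.0 + age_in_years)
```

Passing `current_year` explicitly keeps the function deterministic, which makes scores reproducible across runs and easy to test.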
### Combining Scorers
```python
# Create a composite scorer
composite_scorer = CompositeScorer([
KeywordRelevanceScorer(["python"], weight=1.0),
PathDepthScorer(optimal_depth=2, weight=0.7),
FreshnessScorer(weight=0.8)
], normalize=True)
```
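Conceptually, a composite scorer is a weighted sum of its parts. The sketch below assumes that normalization divides by the total weight so that the result stays in the range of the individual scores; how `normalize=True` is actually implemented in the library is not stated here, so treat this as one plausible reading (`composite_score` is a hypothetical helper):

```python
from typing import Callable, List, Tuple

ScoreFn = Callable[[str], float]


def composite_score(
    url: str, scorers: List[Tuple[ScoreFn, float]], normalize: bool = True
) -> float:
    """Combine (score_fn, weight) pairs into a single weighted score."""
    total = sum(weight * score_fn(url) for score_fn, weight in scorers)
    if not normalize:
        return total
    weight_sum = sum(weight for _, weight in scorers)
    return total / weight_sum if weight_sum else 0.0
```

For example, two scorers returning 1.0 (weight 1.0) and 0.5 (weight 0.5) give a raw total of 1.25 and a normalized score of 1.25 / 1.5.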
## Best Practices
### Filter Configuration
1. **Start Restrictive**
```python
# Begin with strict filters
filter_chain = FilterChain([
DomainFilter(allowed_domains=["example.com"]),
ContentTypeFilter(["text/html"])
])
```
2. **Layer Filters**
```python
# Add more specific filters
filter_chain.add_filter(
URLPatternFilter(["*/article/*", "*/blog/*"])
)
```
3. **Monitor Filter Statistics**
```python
# Check filter performance
for filter_ in filter_chain.filters:  # trailing underscore avoids shadowing the built-in filter()
    print(f"{filter_.name}: {filter_.stats.rejected_urls} rejected")
```
### Scorer Configuration
1. **Balance Weights**
```python
# Balanced scoring configuration
scorer = create_balanced_scorer()
```
2. **Customize for Content**
```python
# News site configuration
news_scorer = CompositeScorer([
KeywordRelevanceScorer(["news", "article"], weight=1.0),
FreshnessScorer(weight=1.0),
PathDepthScorer(optimal_depth=2, weight=0.5)
])
```
3. **Monitor Scoring Statistics**
```python
# Check scoring distribution
print(f"Average score: {scorer.stats.average_score}")
print(f"Score range: {scorer.stats.min_score} - {scorer.stats.max_score}")
```
## Common Use Cases
### Blog Crawling
```python
blog_config = {
'filters': FilterChain([
URLPatternFilter(["*/blog/*", "*/post/*"]),
ContentTypeFilter(["text/html"])
]),
'scorer': CompositeScorer([
FreshnessScorer(weight=1.0),
KeywordRelevanceScorer(["blog", "article"], weight=0.8)
])
}
```
### Documentation Sites
```python
docs_config = {
'filters': FilterChain([
URLPatternFilter(["*/docs/*", "*/guide/*"]),
ContentTypeFilter(["text/html", "application/pdf"])
]),
'scorer': CompositeScorer([
PathDepthScorer(optimal_depth=3, weight=1.0),
KeywordRelevanceScorer(["guide", "tutorial"], weight=0.9)
])
}
```
### E-commerce Sites
```python
ecommerce_config = {
'filters': FilterChain([
URLPatternFilter(["*/product/*", "*/category/*"]),
DomainFilter(blocked_domains=["ads.*", "tracker.*"])
]),
'scorer': CompositeScorer([
PathDepthScorer(optimal_depth=2, weight=1.0),
ContentTypeScorer({
r'/product/': 1.0,
r'/category/': 0.8
})
])
}
```
## Advanced Topics
### Custom Filters
```python
class CustomFilter(URLFilter):
    def apply(self, url: str) -> bool:
        # Your custom filtering logic
        return True
```
### Custom Scorers
```python
class CustomScorer(URLScorer):
    def _calculate_score(self, url: str) -> float:
        # Your custom scoring logic
        return 1.0
```
For more examples, check our [example repository](https://github.com/example/crawl4ai/examples).


@@ -0,0 +1,206 @@
# Scraper Examples Guide
This guide provides two complete examples of using the crawl4ai scraper: a basic implementation for simple use cases and an advanced implementation showcasing all features.
## Basic Example
The basic example demonstrates a simple blog scraping scenario:
```python
from crawl4ai.scraper import AsyncWebScraper, BFSScraperStrategy, FilterChain
# Create simple filter chain
filter_chain = FilterChain([
URLPatternFilter("*/blog/*"),
ContentTypeFilter(["text/html"])
])
# Initialize strategy
strategy = BFSScraperStrategy(
max_depth=2,
filter_chain=filter_chain,
url_scorer=None,
max_concurrent=3
)
# Create and run scraper
crawler = AsyncWebCrawler()
scraper = AsyncWebScraper(crawler, strategy)
result = await scraper.ascrape("https://example.com/blog/")
```
### Features Demonstrated
- Basic URL filtering
- Simple content type filtering
- Depth control
- Concurrent request limiting
- Result collection
## Advanced Example
The advanced example shows a sophisticated news site scraping setup with all features enabled:
```python
# Create comprehensive filter chain
filter_chain = FilterChain([
DomainFilter(
allowed_domains=["example.com"],
blocked_domains=["ads.example.com"]
),
URLPatternFilter([
"*/article/*",
re.compile(r"\d{4}/\d{2}/.*")
]),
ContentTypeFilter(["text/html"])
])
# Create intelligent scorer
scorer = CompositeScorer([
KeywordRelevanceScorer(
keywords=["news", "breaking"],
weight=1.0
),
PathDepthScorer(optimal_depth=3, weight=0.7),
FreshnessScorer(weight=0.9)
])
# Initialize advanced strategy
strategy = BFSScraperStrategy(
max_depth=4,
filter_chain=filter_chain,
url_scorer=scorer,
max_concurrent=5
)
```
### Features Demonstrated
1. **Advanced Filtering**
- Domain filtering
- Pattern matching
- Content type control
2. **Intelligent Scoring**
- Keyword relevance
- Path optimization
- Freshness priority
3. **Monitoring**
- Progress tracking
- Error handling
- Statistics collection
4. **Resource Management**
- Concurrent processing
- Rate limiting
- Cleanup handling
## Running the Examples
```bash
# Basic usage
python basic_scraper_example.py
# Advanced usage with logging
PYTHONPATH=. python advanced_scraper_example.py
```
## Example Output
### Basic Example
```
Crawled 15 pages:
- https://example.com/blog/post1: 24560 bytes
- https://example.com/blog/post2: 18920 bytes
...
```
### Advanced Example
```
INFO: Starting crawl of https://example.com/news/
INFO: Processed: https://example.com/news/breaking/story1
DEBUG: KeywordScorer: 0.85
DEBUG: FreshnessScorer: 0.95
INFO: Progress: 10 URLs processed
...
INFO: Scraping completed:
INFO: - URLs processed: 50
INFO: - Errors: 2
INFO: - Total content size: 1240.50 KB
```
## Customization
### Adding Custom Filters
```python
class CustomFilter(URLFilter):
    def apply(self, url: str) -> bool:
        # Your custom filtering logic
        return True

filter_chain.add_filter(CustomFilter())
```
### Custom Scoring Logic
```python
class CustomScorer(URLScorer):
    def _calculate_score(self, url: str) -> float:
        # Your custom scoring logic
        return 1.0

scorer = CompositeScorer([
    CustomScorer(weight=1.0),
    ...
])
```
## Best Practices
1. **Start Simple**
- Begin with basic filtering
- Add features incrementally
- Test thoroughly at each step
2. **Monitor Performance**
- Watch memory usage
- Track processing times
- Adjust concurrency as needed
3. **Handle Errors**
- Implement proper error handling
- Log important events
- Track error statistics
4. **Optimize Resources**
- Set appropriate delays
- Limit concurrent requests
- Use streaming for large crawls
## Troubleshooting
Common issues and solutions:
1. **Too Many Requests**
```python
strategy = BFSScraperStrategy(
max_concurrent=3, # Reduce concurrent requests
min_crawl_delay=2 # Increase delay between requests
)
```
2. **Memory Issues**
```python
# Use streaming mode for large crawls
async for result in scraper.ascrape(url, stream=True):
    process_result(result)
```
3. **Missing Content**
```python
# Check your filter chain
filter_chain = FilterChain([
URLPatternFilter("*"), # Broaden patterns
ContentTypeFilter(["*"]) # Accept all content
])
```
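Why streaming helps with memory can be seen in a small, library-independent sketch. Here `crawl_stream` is a hypothetical async generator standing in for a streaming crawl; each page is processed and discarded as it arrives, so only a running total is kept instead of a list of every result:

```python
import asyncio
from typing import AsyncIterator, Dict, List


async def crawl_stream(urls: List[str]) -> AsyncIterator[Dict[str, object]]:
    """Yield one fake page result at a time (stand-in for a streaming crawl)."""
    for url in urls:
        await asyncio.sleep(0)  # placeholder for network I/O
        yield {"url": url, "size": len(url)}


async def total_size_streaming(urls: List[str]) -> int:
    """Aggregate while iterating; no list of results is ever built."""
    total = 0
    async for page in crawl_stream(urls):
        total += int(page["size"])
    return total
```

Calling `asyncio.run(total_size_streaming(urls))` drives the generator to completion; peak memory stays proportional to a single result rather than to the size of the whole crawl.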
For more examples and use cases, visit our [GitHub repository](https://github.com/example/crawl4ai/examples).


@@ -9,12 +9,10 @@ from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig
import json import json
async def extract_amazon_products(): async def extract_amazon_products():
# Initialize browser config # Initialize browser config
browser_config = BrowserConfig( browser_config = BrowserConfig(browser_type="chromium", headless=True)
browser_type="chromium",
headless=True
)
# Initialize crawler config with JSON CSS extraction strategy # Initialize crawler config with JSON CSS extraction strategy
crawler_config = CrawlerRunConfig( crawler_config = CrawlerRunConfig(
@@ -27,57 +25,53 @@ async def extract_amazon_products():
"name": "asin", "name": "asin",
"selector": "", "selector": "",
"type": "attribute", "type": "attribute",
"attribute": "data-asin" "attribute": "data-asin",
},
{
"name": "title",
"selector": "h2 a span",
"type": "text"
}, },
{"name": "title", "selector": "h2 a span", "type": "text"},
{ {
"name": "url", "name": "url",
"selector": "h2 a", "selector": "h2 a",
"type": "attribute", "type": "attribute",
"attribute": "href" "attribute": "href",
}, },
{ {
"name": "image", "name": "image",
"selector": ".s-image", "selector": ".s-image",
"type": "attribute", "type": "attribute",
"attribute": "src" "attribute": "src",
}, },
{ {
"name": "rating", "name": "rating",
"selector": ".a-icon-star-small .a-icon-alt", "selector": ".a-icon-star-small .a-icon-alt",
"type": "text" "type": "text",
}, },
{ {
"name": "reviews_count", "name": "reviews_count",
"selector": "[data-csa-c-func-deps='aui-da-a-popover'] ~ span span", "selector": "[data-csa-c-func-deps='aui-da-a-popover'] ~ span span",
"type": "text" "type": "text",
}, },
{ {
"name": "price", "name": "price",
"selector": ".a-price .a-offscreen", "selector": ".a-price .a-offscreen",
"type": "text" "type": "text",
}, },
{ {
"name": "original_price", "name": "original_price",
"selector": ".a-price.a-text-price .a-offscreen", "selector": ".a-price.a-text-price .a-offscreen",
"type": "text" "type": "text",
}, },
{ {
"name": "sponsored", "name": "sponsored",
"selector": ".puis-sponsored-label-text", "selector": ".puis-sponsored-label-text",
"type": "exists" "type": "exists",
}, },
{ {
"name": "delivery_info", "name": "delivery_info",
"selector": "[data-cy='delivery-recipe'] .a-color-base", "selector": "[data-cy='delivery-recipe'] .a-color-base",
"type": "text", "type": "text",
"multiple": True "multiple": True,
} },
] ],
} }
) )
) )
@@ -105,10 +99,12 @@ async def extract_amazon_products():
print(f"Rating: {product.get('rating')}") print(f"Rating: {product.get('rating')}")
print(f"Reviews: {product.get('reviews_count')}") print(f"Reviews: {product.get('reviews_count')}")
print(f"Sponsored: {'Yes' if product.get('sponsored') else 'No'}") print(f"Sponsored: {'Yes' if product.get('sponsored') else 'No'}")
if product.get('delivery_info'): if product.get("delivery_info"):
print(f"Delivery: {' '.join(product['delivery_info'])}") print(f"Delivery: {' '.join(product['delivery_info'])}")
print("-" * 80) print("-" * 80)
if __name__ == "__main__": if __name__ == "__main__":
import asyncio import asyncio
asyncio.run(extract_amazon_products()) asyncio.run(extract_amazon_products())


@@ -10,6 +10,7 @@ from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig
import json import json
from playwright.async_api import Page, BrowserContext from playwright.async_api import Page, BrowserContext
async def extract_amazon_products(): async def extract_amazon_products():
# Initialize browser config # Initialize browser config
browser_config = BrowserConfig( browser_config = BrowserConfig(
@@ -20,7 +21,6 @@ async def extract_amazon_products():
# Initialize crawler config with JSON CSS extraction strategy nav-search-submit-button # Initialize crawler config with JSON CSS extraction strategy nav-search-submit-button
crawler_config = CrawlerRunConfig( crawler_config = CrawlerRunConfig(
cache_mode=CacheMode.BYPASS, cache_mode=CacheMode.BYPASS,
extraction_strategy=JsonCssExtractionStrategy( extraction_strategy=JsonCssExtractionStrategy(
schema={ schema={
"name": "Amazon Product Search Results", "name": "Amazon Product Search Results",
@@ -30,82 +30,86 @@ async def extract_amazon_products():
"name": "asin", "name": "asin",
"selector": "", "selector": "",
"type": "attribute", "type": "attribute",
"attribute": "data-asin" "attribute": "data-asin",
},
{
"name": "title",
"selector": "h2 a span",
"type": "text"
}, },
{"name": "title", "selector": "h2 a span", "type": "text"},
{ {
"name": "url", "name": "url",
"selector": "h2 a", "selector": "h2 a",
"type": "attribute", "type": "attribute",
"attribute": "href" "attribute": "href",
}, },
{ {
"name": "image", "name": "image",
"selector": ".s-image", "selector": ".s-image",
"type": "attribute", "type": "attribute",
"attribute": "src" "attribute": "src",
}, },
{ {
"name": "rating", "name": "rating",
"selector": ".a-icon-star-small .a-icon-alt", "selector": ".a-icon-star-small .a-icon-alt",
"type": "text" "type": "text",
}, },
{ {
"name": "reviews_count", "name": "reviews_count",
"selector": "[data-csa-c-func-deps='aui-da-a-popover'] ~ span span", "selector": "[data-csa-c-func-deps='aui-da-a-popover'] ~ span span",
"type": "text" "type": "text",
}, },
{ {
"name": "price", "name": "price",
"selector": ".a-price .a-offscreen", "selector": ".a-price .a-offscreen",
"type": "text" "type": "text",
}, },
{ {
"name": "original_price", "name": "original_price",
"selector": ".a-price.a-text-price .a-offscreen", "selector": ".a-price.a-text-price .a-offscreen",
"type": "text" "type": "text",
}, },
{ {
"name": "sponsored", "name": "sponsored",
"selector": ".puis-sponsored-label-text", "selector": ".puis-sponsored-label-text",
"type": "exists" "type": "exists",
}, },
{ {
"name": "delivery_info", "name": "delivery_info",
"selector": "[data-cy='delivery-recipe'] .a-color-base", "selector": "[data-cy='delivery-recipe'] .a-color-base",
"type": "text", "type": "text",
"multiple": True "multiple": True,
} },
] ],
} }
) ),
) )
url = "https://www.amazon.com/" url = "https://www.amazon.com/"
async def after_goto(page: Page, context: BrowserContext, url: str, response: dict, **kwargs): async def after_goto(
page: Page, context: BrowserContext, url: str, response: dict, **kwargs
):
"""Hook called after navigating to each URL""" """Hook called after navigating to each URL"""
print(f"[HOOK] after_goto - Successfully loaded: {url}") print(f"[HOOK] after_goto - Successfully loaded: {url}")
try: try:
# Wait for search box to be available # Wait for search box to be available
search_box = await page.wait_for_selector('#twotabsearchtextbox', timeout=1000) search_box = await page.wait_for_selector(
"#twotabsearchtextbox", timeout=1000
)
# Type the search query # Type the search query
await search_box.fill('Samsung Galaxy Tab') await search_box.fill("Samsung Galaxy Tab")
# Get the search button and prepare for navigation # Get the search button and prepare for navigation
search_button = await page.wait_for_selector('#nav-search-submit-button', timeout=1000) search_button = await page.wait_for_selector(
"#nav-search-submit-button", timeout=1000
)
# Click with navigation waiting # Click with navigation waiting
await search_button.click() await search_button.click()
# Wait for search results to load # Wait for search results to load
await page.wait_for_selector('[data-component-type="s-search-result"]', timeout=10000) await page.wait_for_selector(
'[data-component-type="s-search-result"]', timeout=10000
)
print("[HOOK] Search completed and results loaded!") print("[HOOK] Search completed and results loaded!")
except Exception as e: except Exception as e:
@@ -115,7 +119,6 @@ async def extract_amazon_products():
# Use context manager for proper resource handling # Use context manager for proper resource handling
async with AsyncWebCrawler(config=browser_config) as crawler: async with AsyncWebCrawler(config=browser_config) as crawler:
crawler.crawler_strategy.set_hook("after_goto", after_goto) crawler.crawler_strategy.set_hook("after_goto", after_goto)
# Extract the data # Extract the data
@@ -136,10 +139,12 @@ async def extract_amazon_products():
print(f"Rating: {product.get('rating')}") print(f"Rating: {product.get('rating')}")
print(f"Reviews: {product.get('reviews_count')}") print(f"Reviews: {product.get('reviews_count')}")
print(f"Sponsored: {'Yes' if product.get('sponsored') else 'No'}") print(f"Sponsored: {'Yes' if product.get('sponsored') else 'No'}")
if product.get('delivery_info'): if product.get("delivery_info"):
print(f"Delivery: {' '.join(product['delivery_info'])}") print(f"Delivery: {' '.join(product['delivery_info'])}")
print("-" * 80) print("-" * 80)
if __name__ == "__main__": if __name__ == "__main__":
import asyncio import asyncio
asyncio.run(extract_amazon_products()) asyncio.run(extract_amazon_products())


@@ -8,7 +8,7 @@ from crawl4ai import AsyncWebCrawler, CacheMode
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig
import json import json
from playwright.async_api import Page, BrowserContext
async def extract_amazon_products(): async def extract_amazon_products():
# Initialize browser config # Initialize browser config
@@ -30,7 +30,7 @@ async def extract_amazon_products():
""" """
crawler_config = CrawlerRunConfig( crawler_config = CrawlerRunConfig(
cache_mode=CacheMode.BYPASS, cache_mode=CacheMode.BYPASS,
js_code = js_code_to_search, js_code=js_code_to_search,
wait_for='css:[data-component-type="s-search-result"]', wait_for='css:[data-component-type="s-search-result"]',
extraction_strategy=JsonCssExtractionStrategy( extraction_strategy=JsonCssExtractionStrategy(
schema={ schema={
@@ -41,65 +41,60 @@ async def extract_amazon_products():
"name": "asin", "name": "asin",
"selector": "", "selector": "",
"type": "attribute", "type": "attribute",
"attribute": "data-asin" "attribute": "data-asin",
},
{
"name": "title",
"selector": "h2 a span",
"type": "text"
}, },
{"name": "title", "selector": "h2 a span", "type": "text"},
{ {
"name": "url", "name": "url",
"selector": "h2 a", "selector": "h2 a",
"type": "attribute", "type": "attribute",
"attribute": "href" "attribute": "href",
}, },
{ {
"name": "image", "name": "image",
"selector": ".s-image", "selector": ".s-image",
"type": "attribute", "type": "attribute",
"attribute": "src" "attribute": "src",
}, },
{ {
"name": "rating", "name": "rating",
"selector": ".a-icon-star-small .a-icon-alt", "selector": ".a-icon-star-small .a-icon-alt",
"type": "text" "type": "text",
}, },
{ {
"name": "reviews_count", "name": "reviews_count",
"selector": "[data-csa-c-func-deps='aui-da-a-popover'] ~ span span", "selector": "[data-csa-c-func-deps='aui-da-a-popover'] ~ span span",
"type": "text" "type": "text",
}, },
{ {
"name": "price", "name": "price",
"selector": ".a-price .a-offscreen", "selector": ".a-price .a-offscreen",
"type": "text" "type": "text",
}, },
{ {
"name": "original_price", "name": "original_price",
"selector": ".a-price.a-text-price .a-offscreen", "selector": ".a-price.a-text-price .a-offscreen",
"type": "text" "type": "text",
}, },
{ {
"name": "sponsored", "name": "sponsored",
"selector": ".puis-sponsored-label-text", "selector": ".puis-sponsored-label-text",
"type": "exists" "type": "exists",
}, },
{ {
"name": "delivery_info", "name": "delivery_info",
"selector": "[data-cy='delivery-recipe'] .a-color-base", "selector": "[data-cy='delivery-recipe'] .a-color-base",
"type": "text", "type": "text",
"multiple": True "multiple": True,
} },
] ],
} }
) ),
) )
# Example search URL (you should replace with your actual Amazon URL) # Example search URL (you should replace with your actual Amazon URL)
url = "https://www.amazon.com/" url = "https://www.amazon.com/"
# Use context manager for proper resource handling # Use context manager for proper resource handling
async with AsyncWebCrawler(config=browser_config) as crawler: async with AsyncWebCrawler(config=browser_config) as crawler:
# Extract the data # Extract the data
@@ -120,10 +115,12 @@ async def extract_amazon_products():
print(f"Rating: {product.get('rating')}") print(f"Rating: {product.get('rating')}")
print(f"Reviews: {product.get('reviews_count')}") print(f"Reviews: {product.get('reviews_count')}")
print(f"Sponsored: {'Yes' if product.get('sponsored') else 'No'}") print(f"Sponsored: {'Yes' if product.get('sponsored') else 'No'}")
if product.get('delivery_info'): if product.get("delivery_info"):
print(f"Delivery: {' '.join(product['delivery_info'])}") print(f"Delivery: {' '.join(product['delivery_info'])}")
print("-" * 80) print("-" * 80)
if __name__ == "__main__": if __name__ == "__main__":
import asyncio import asyncio
asyncio.run(extract_amazon_products()) asyncio.run(extract_amazon_products())


@@ -1,12 +1,16 @@
# File: async_webcrawler_multiple_urls_example.py # File: async_webcrawler_multiple_urls_example.py
import os, sys import os, sys
# append 2 parent directories to sys.path to import crawl4ai # append 2 parent directories to sys.path to import crawl4ai
parent_dir = os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__)))) parent_dir = os.path.dirname(
os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
)
sys.path.append(parent_dir) sys.path.append(parent_dir)
import asyncio import asyncio
from crawl4ai import AsyncWebCrawler from crawl4ai import AsyncWebCrawler
async def main(): async def main():
# Initialize the AsyncWebCrawler # Initialize the AsyncWebCrawler
async with AsyncWebCrawler(verbose=True) as crawler: async with AsyncWebCrawler(verbose=True) as crawler:
@@ -16,7 +20,7 @@ async def main():
"https://python.org", "https://python.org",
"https://github.com", "https://github.com",
"https://stackoverflow.com", "https://stackoverflow.com",
"https://news.ycombinator.com" "https://news.ycombinator.com",
] ]
# Set up crawling parameters # Set up crawling parameters
@@ -27,7 +31,7 @@ async def main():
urls=urls, urls=urls,
word_count_threshold=word_count_threshold, word_count_threshold=word_count_threshold,
bypass_cache=True, bypass_cache=True,
verbose=True verbose=True,
) )
# Process the results # Process the results
@@ -36,7 +40,9 @@ async def main():
print(f"Successfully crawled: {result.url}") print(f"Successfully crawled: {result.url}")
print(f"Title: {result.metadata.get('title', 'N/A')}") print(f"Title: {result.metadata.get('title', 'N/A')}")
print(f"Word count: {len(result.markdown.split())}") print(f"Word count: {len(result.markdown.split())}")
print(f"Number of links: {len(result.links.get('internal', [])) + len(result.links.get('external', []))}") print(
f"Number of links: {len(result.links.get('internal', [])) + len(result.links.get('external', []))}"
)
print(f"Number of images: {len(result.media.get('images', []))}") print(f"Number of images: {len(result.media.get('images', []))}")
print("---") print("---")
else: else:
@@ -44,5 +50,6 @@ async def main():
print(f"Error: {result.error_message}") print(f"Error: {result.error_message}")
print("---") print("---")
if __name__ == "__main__": if __name__ == "__main__":
asyncio.run(main()) asyncio.run(main())


@@ -6,10 +6,8 @@ This example demonstrates optimal browser usage patterns in Crawl4AI:
""" """
import asyncio import asyncio
import os
from typing import List from typing import List
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig
from crawl4ai.content_filter_strategy import PruningContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator


@@ -1,31 +1,32 @@
import os, time import os, time
# append the path to the root of the project # append the path to the root of the project
import sys import sys
import asyncio import asyncio
sys.path.append(os.path.join(os.path.dirname(__file__), '..', '..'))
sys.path.append(os.path.join(os.path.dirname(__file__), "..", ".."))
from firecrawl import FirecrawlApp from firecrawl import FirecrawlApp
from crawl4ai import AsyncWebCrawler from crawl4ai import AsyncWebCrawler
__data__ = os.path.join(os.path.dirname(__file__), '..', '..') + '/.data'
__data__ = os.path.join(os.path.dirname(__file__), "..", "..") + "/.data"
async def compare(): async def compare():
app = FirecrawlApp(api_key=os.environ['FIRECRAWL_API_KEY']) app = FirecrawlApp(api_key=os.environ["FIRECRAWL_API_KEY"])
# Test Firecrawl with a simple crawl # Test Firecrawl with a simple crawl
start = time.time() start = time.time()
scrape_status = app.scrape_url( scrape_status = app.scrape_url(
'https://www.nbcnews.com/business', "https://www.nbcnews.com/business", params={"formats": ["markdown", "html"]}
params={'formats': ['markdown', 'html']}
) )
end = time.time() end = time.time()
print(f"Time taken: {end - start} seconds") print(f"Time taken: {end - start} seconds")
print(len(scrape_status['markdown'])) print(len(scrape_status["markdown"]))
# save the markdown content with provider name # save the markdown content with provider name
with open(f"{__data__}/firecrawl_simple.md", "w") as f: with open(f"{__data__}/firecrawl_simple.md", "w") as f:
f.write(scrape_status['markdown']) f.write(scrape_status["markdown"])
# Count how many "cldnry.s-nbcnews.com" are in the markdown # Count how many "cldnry.s-nbcnews.com" are in the markdown
print(scrape_status['markdown'].count("cldnry.s-nbcnews.com")) print(scrape_status["markdown"].count("cldnry.s-nbcnews.com"))
async with AsyncWebCrawler() as crawler: async with AsyncWebCrawler() as crawler:
start = time.time() start = time.time()
@@ -34,7 +35,7 @@ async def compare():
# js_code=["const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More')); loadMoreButton && loadMoreButton.click();"], # js_code=["const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More')); loadMoreButton && loadMoreButton.click();"],
word_count_threshold=0, word_count_threshold=0,
bypass_cache=True, bypass_cache=True,
verbose=False verbose=False,
) )
end = time.time() end = time.time()
print(f"Time taken: {end - start} seconds") print(f"Time taken: {end - start} seconds")
@@ -48,10 +49,12 @@ async def compare():
start = time.time() start = time.time()
result = await crawler.arun( result = await crawler.arun(
url="https://www.nbcnews.com/business", url="https://www.nbcnews.com/business",
js_code=["const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More')); loadMoreButton && loadMoreButton.click();"], js_code=[
"const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More')); loadMoreButton && loadMoreButton.click();"
],
word_count_threshold=0, word_count_threshold=0,
bypass_cache=True, bypass_cache=True,
verbose=False verbose=False,
) )
end = time.time() end = time.time()
print(f"Time taken: {end - start} seconds") print(f"Time taken: {end - start} seconds")
@@ -62,6 +65,6 @@ async def compare():
# count how many "cldnry.s-nbcnews.com" are in the markdown # count how many "cldnry.s-nbcnews.com" are in the markdown
print(result.markdown.count("cldnry.s-nbcnews.com")) print(result.markdown.count("cldnry.s-nbcnews.com"))
if __name__ == "__main__": if __name__ == "__main__":
asyncio.run(compare()) asyncio.run(compare())


@@ -0,0 +1,136 @@
import asyncio
import time
from rich import print
from rich.table import Table
from crawl4ai import (
AsyncWebCrawler,
BrowserConfig,
CrawlerRunConfig,
MemoryAdaptiveDispatcher,
SemaphoreDispatcher,
RateLimiter,
CrawlerMonitor,
DisplayMode,
CacheMode,
LXMLWebScrapingStrategy,
)
async def memory_adaptive(urls, browser_config, run_config):
"""Memory adaptive crawler with monitoring"""
start = time.perf_counter()
async with AsyncWebCrawler(config=browser_config) as crawler:
dispatcher = MemoryAdaptiveDispatcher(
memory_threshold_percent=70.0,
max_session_permit=10,
monitor=CrawlerMonitor(
max_visible_rows=15, display_mode=DisplayMode.DETAILED
),
)
results = await crawler.arun_many(
urls, config=run_config, dispatcher=dispatcher
)
duration = time.perf_counter() - start
return len(results), duration
async def memory_adaptive_with_rate_limit(urls, browser_config, run_config):
"""Memory adaptive crawler with rate limiting"""
start = time.perf_counter()
async with AsyncWebCrawler(config=browser_config) as crawler:
dispatcher = MemoryAdaptiveDispatcher(
memory_threshold_percent=70.0,
max_session_permit=10,
rate_limiter=RateLimiter(
base_delay=(1.0, 2.0), max_delay=30.0, max_retries=2
),
monitor=CrawlerMonitor(
max_visible_rows=15, display_mode=DisplayMode.DETAILED
),
)
results = await crawler.arun_many(
urls, config=run_config, dispatcher=dispatcher
)
duration = time.perf_counter() - start
return len(results), duration
async def semaphore(urls, browser_config, run_config):
"""Basic semaphore crawler"""
start = time.perf_counter()
async with AsyncWebCrawler(config=browser_config) as crawler:
dispatcher = SemaphoreDispatcher(
semaphore_count=5,
monitor=CrawlerMonitor(
max_visible_rows=15, display_mode=DisplayMode.DETAILED
),
)
results = await crawler.arun_many(
urls, config=run_config, dispatcher=dispatcher
)
duration = time.perf_counter() - start
return len(results), duration
async def semaphore_with_rate_limit(urls, browser_config, run_config):
"""Semaphore crawler with rate limiting"""
start = time.perf_counter()
async with AsyncWebCrawler(config=browser_config) as crawler:
dispatcher = SemaphoreDispatcher(
semaphore_count=5,
rate_limiter=RateLimiter(
base_delay=(1.0, 2.0), max_delay=30.0, max_retries=2
),
monitor=CrawlerMonitor(
max_visible_rows=15, display_mode=DisplayMode.DETAILED
),
)
results = await crawler.arun_many(
urls, config=run_config, dispatcher=dispatcher
)
duration = time.perf_counter() - start
return len(results), duration
def create_performance_table(results):
"""Creates a rich table showing performance results"""
table = Table(title="Crawler Strategy Performance Comparison")
table.add_column("Strategy", style="cyan")
table.add_column("URLs Crawled", justify="right", style="green")
table.add_column("Time (seconds)", justify="right", style="yellow")
table.add_column("URLs/second", justify="right", style="magenta")
sorted_results = sorted(results.items(), key=lambda x: x[1][1])
for strategy, (urls_crawled, duration) in sorted_results:
urls_per_second = urls_crawled / duration
table.add_row(
strategy, str(urls_crawled), f"{duration:.2f}", f"{urls_per_second:.2f}"
)
return table
async def main():
urls = [f"https://example.com/page{i}" for i in range(1, 40)]
browser_config = BrowserConfig(headless=True, verbose=False)
run_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS, scraping_strategy=LXMLWebScrapingStrategy())
results = {
"Memory Adaptive": await memory_adaptive(urls, browser_config, run_config),
# "Memory Adaptive + Rate Limit": await memory_adaptive_with_rate_limit(
# urls, browser_config, run_config
# ),
# "Semaphore": await semaphore(urls, browser_config, run_config),
# "Semaphore + Rate Limit": await semaphore_with_rate_limit(
# urls, browser_config, run_config
# ),
}
table = create_performance_table(results)
print("\nPerformance Summary:")
print(table)
if __name__ == "__main__":
asyncio.run(main())
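
Reviewer note: the benchmark above times each dispatcher through `arun_many` and derives URLs/second from the count and duration. As a rough illustration of the idea behind a semaphore-style dispatcher, here is a minimal standard-library sketch. It does not use crawl4ai and is not how `SemaphoreDispatcher` is implemented; `fake_fetch`, `crawl_with_semaphore`, and `urls_per_second` are hypothetical names, and the network call is replaced with a short sleep.

```python
import asyncio
import time


async def fake_fetch(url: str, delay: float = 0.01) -> str:
    """Hypothetical stand-in for a crawl; sleeps instead of using the network."""
    await asyncio.sleep(delay)
    return url


async def crawl_with_semaphore(urls, limit: int = 5):
    """Run fake_fetch over urls with at most `limit` tasks in flight."""
    semaphore = asyncio.Semaphore(limit)
    in_flight = 0
    peak = 0

    async def worker(url):
        nonlocal in_flight, peak
        async with semaphore:
            # Track how many workers hold the semaphore at once.
            in_flight += 1
            peak = max(peak, in_flight)
            try:
                return await fake_fetch(url)
            finally:
                in_flight -= 1

    start = time.perf_counter()
    # gather returns results in the same order as the input URLs.
    results = await asyncio.gather(*(worker(u) for u in urls))
    duration = time.perf_counter() - start
    return results, duration, peak


def urls_per_second(count: int, duration: float) -> float:
    """Throughput, returning 0.0 instead of dividing by a zero duration."""
    return count / duration if duration > 0 else 0.0
```

The zero-duration guard in `urls_per_second` is an addition for the sketch; the benchmark's `create_performance_table` divides directly, which is fine as long as every run takes measurable time.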


@@ -6,15 +6,24 @@ import base64
import os import os
from typing import Dict, Any from typing import Dict, Any
class Crawl4AiTester: class Crawl4AiTester:
def __init__(self, base_url: str = "http://localhost:11235", api_token: str = None): def __init__(self, base_url: str = "http://localhost:11235", api_token: str = None):
self.base_url = base_url self.base_url = base_url
self.api_token = api_token or os.getenv('CRAWL4AI_API_TOKEN') or "test_api_code" # Check environment variable as fallback self.api_token = (
self.headers = {'Authorization': f'Bearer {self.api_token}'} if self.api_token else {} api_token or os.getenv("CRAWL4AI_API_TOKEN") or "test_api_code"
) # Check environment variable as fallback
self.headers = (
{"Authorization": f"Bearer {self.api_token}"} if self.api_token else {}
)
def submit_and_wait(self, request_data: Dict[str, Any], timeout: int = 300) -> Dict[str, Any]: def submit_and_wait(
self, request_data: Dict[str, Any], timeout: int = 300
) -> Dict[str, Any]:
# Submit crawl job # Submit crawl job
response = requests.post(f"{self.base_url}/crawl", json=request_data, headers=self.headers) response = requests.post(
f"{self.base_url}/crawl", json=request_data, headers=self.headers
)
if response.status_code == 403: if response.status_code == 403:
raise Exception("API token is invalid or missing") raise Exception("API token is invalid or missing")
task_id = response.json()["task_id"] task_id = response.json()["task_id"]
@@ -24,9 +33,13 @@ class Crawl4AiTester:
start_time = time.time() start_time = time.time()
while True: while True:
if time.time() - start_time > timeout: if time.time() - start_time > timeout:
raise TimeoutError(f"Task {task_id} did not complete within {timeout} seconds") raise TimeoutError(
f"Task {task_id} did not complete within {timeout} seconds"
)
result = requests.get(f"{self.base_url}/task/{task_id}", headers=self.headers) result = requests.get(
f"{self.base_url}/task/{task_id}", headers=self.headers
)
status = result.json() status = result.json()
if status["status"] == "failed": if status["status"] == "failed":
@@ -39,7 +52,12 @@ class Crawl4AiTester:
time.sleep(2) time.sleep(2)
def submit_sync(self, request_data: Dict[str, Any]) -> Dict[str, Any]: def submit_sync(self, request_data: Dict[str, Any]) -> Dict[str, Any]:
response = requests.post(f"{self.base_url}/crawl_sync", json=request_data, headers=self.headers, timeout=60) response = requests.post(
f"{self.base_url}/crawl_sync",
json=request_data,
headers=self.headers,
timeout=60,
)
if response.status_code == 408: if response.status_code == 408:
raise TimeoutError("Task did not complete within server timeout") raise TimeoutError("Task did not complete within server timeout")
response.raise_for_status() response.raise_for_status()
@@ -48,16 +66,15 @@ class Crawl4AiTester:
def crawl_direct(self, request_data: Dict[str, Any]) -> Dict[str, Any]: def crawl_direct(self, request_data: Dict[str, Any]) -> Dict[str, Any]:
"""Directly crawl without using task queue""" """Directly crawl without using task queue"""
response = requests.post( response = requests.post(
f"{self.base_url}/crawl_direct", f"{self.base_url}/crawl_direct", json=request_data, headers=self.headers
json=request_data,
headers=self.headers
) )
response.raise_for_status() response.raise_for_status()
return response.json() return response.json()
def test_docker_deployment(version="basic"): def test_docker_deployment(version="basic"):
tester = Crawl4AiTester( tester = Crawl4AiTester(
base_url="http://localhost:11235" , base_url="http://localhost:11235",
# base_url="https://api.crawl4ai.com" # just for example # base_url="https://api.crawl4ai.com" # just for example
# api_token="test" # just for example # api_token="test" # just for example
) )
@@ -70,7 +87,7 @@ def test_docker_deployment(version="basic"):
health = requests.get(f"{tester.base_url}/health", timeout=10) health = requests.get(f"{tester.base_url}/health", timeout=10)
print("Health check:", health.json()) print("Health check:", health.json())
break break
except requests.exceptions.RequestException as e: except requests.exceptions.RequestException:
if i == max_retries - 1: if i == max_retries - 1:
print(f"Failed to connect after {max_retries} attempts") print(f"Failed to connect after {max_retries} attempts")
sys.exit(1) sys.exit(1)
@@ -99,7 +116,7 @@ def test_basic_crawl(tester: Crawl4AiTester):
request = { request = {
"urls": "https://www.nbcnews.com/business", "urls": "https://www.nbcnews.com/business",
"priority": 10, "priority": 10,
"session_id": "test" "session_id": "test",
} }
result = tester.submit_and_wait(request) result = tester.submit_and_wait(request)
@@ -107,19 +124,21 @@ def test_basic_crawl(tester: Crawl4AiTester):
assert result["result"]["success"] assert result["result"]["success"]
assert len(result["result"]["markdown"]) > 0 assert len(result["result"]["markdown"]) > 0
def test_basic_crawl_sync(tester: Crawl4AiTester): def test_basic_crawl_sync(tester: Crawl4AiTester):
print("\n=== Testing Basic Crawl (Sync) ===") print("\n=== Testing Basic Crawl (Sync) ===")
request = { request = {
"urls": "https://www.nbcnews.com/business", "urls": "https://www.nbcnews.com/business",
"priority": 10, "priority": 10,
"session_id": "test" "session_id": "test",
} }
result = tester.submit_sync(request) result = tester.submit_sync(request)
print(f"Basic crawl result length: {len(result['result']['markdown'])}") print(f"Basic crawl result length: {len(result['result']['markdown'])}")
assert result['status'] == 'completed' assert result["status"] == "completed"
assert result['result']['success'] assert result["result"]["success"]
assert len(result['result']['markdown']) > 0 assert len(result["result"]["markdown"]) > 0
def test_basic_crawl_direct(tester: Crawl4AiTester): def test_basic_crawl_direct(tester: Crawl4AiTester):
print("\n=== Testing Basic Crawl (Direct) ===") print("\n=== Testing Basic Crawl (Direct) ===")
@@ -127,13 +146,14 @@ def test_basic_crawl_direct(tester: Crawl4AiTester):
"urls": "https://www.nbcnews.com/business", "urls": "https://www.nbcnews.com/business",
"priority": 10, "priority": 10,
# "session_id": "test" # "session_id": "test"
"cache_mode": "bypass" # or "enabled", "disabled", "read_only", "write_only" "cache_mode": "bypass", # or "enabled", "disabled", "read_only", "write_only"
} }
result = tester.crawl_direct(request) result = tester.crawl_direct(request)
print(f"Basic crawl result length: {len(result['result']['markdown'])}") print(f"Basic crawl result length: {len(result['result']['markdown'])}")
assert result['result']['success'] assert result["result"]["success"]
assert len(result['result']['markdown']) > 0 assert len(result["result"]["markdown"]) > 0
def test_js_execution(tester: Crawl4AiTester): def test_js_execution(tester: Crawl4AiTester):
print("\n=== Testing JS Execution ===") print("\n=== Testing JS Execution ===")
@@ -144,32 +164,29 @@ def test_js_execution(tester: Crawl4AiTester):
"const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More')); loadMoreButton && loadMoreButton.click();" "const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More')); loadMoreButton && loadMoreButton.click();"
], ],
"wait_for": "article.tease-card:nth-child(10)", "wait_for": "article.tease-card:nth-child(10)",
"crawler_params": { "crawler_params": {"headless": True},
"headless": True
}
} }
result = tester.submit_and_wait(request) result = tester.submit_and_wait(request)
print(f"JS execution result length: {len(result['result']['markdown'])}") print(f"JS execution result length: {len(result['result']['markdown'])}")
assert result["result"]["success"] assert result["result"]["success"]
def test_css_selector(tester: Crawl4AiTester): def test_css_selector(tester: Crawl4AiTester):
print("\n=== Testing CSS Selector ===") print("\n=== Testing CSS Selector ===")
request = { request = {
"urls": "https://www.nbcnews.com/business", "urls": "https://www.nbcnews.com/business",
"priority": 7, "priority": 7,
"css_selector": ".wide-tease-item__description", "css_selector": ".wide-tease-item__description",
"crawler_params": { "crawler_params": {"headless": True},
"headless": True "extra": {"word_count_threshold": 10},
},
"extra": {"word_count_threshold": 10}
} }
result = tester.submit_and_wait(request) result = tester.submit_and_wait(request)
print(f"CSS selector result length: {len(result['result']['markdown'])}") print(f"CSS selector result length: {len(result['result']['markdown'])}")
assert result["result"]["success"] assert result["result"]["success"]
def test_structured_extraction(tester: Crawl4AiTester): def test_structured_extraction(tester: Crawl4AiTester):
print("\n=== Testing Structured Extraction ===") print("\n=== Testing Structured Extraction ===")
schema = { schema = {
@@ -190,19 +207,14 @@ def test_structured_extraction(tester: Crawl4AiTester):
"name": "price", "name": "price",
"selector": "td:nth-child(2)", "selector": "td:nth-child(2)",
"type": "text", "type": "text",
} },
], ],
} }
request = { request = {
"urls": "https://www.coinbase.com/explore", "urls": "https://www.coinbase.com/explore",
"priority": 9, "priority": 9,
"extraction_config": { "extraction_config": {"type": "json_css", "params": {"schema": schema}},
"type": "json_css",
"params": {
"schema": schema
}
}
} }
result = tester.submit_and_wait(request) result = tester.submit_and_wait(request)
@@ -212,6 +224,7 @@ def test_structured_extraction(tester: Crawl4AiTester):
assert result["result"]["success"] assert result["result"]["success"]
assert len(extracted) > 0 assert len(extracted) > 0
def test_llm_extraction(tester: Crawl4AiTester): def test_llm_extraction(tester: Crawl4AiTester):
print("\n=== Testing LLM Extraction ===") print("\n=== Testing LLM Extraction ===")
schema = { schema = {
@@ -219,18 +232,18 @@ def test_llm_extraction(tester: Crawl4AiTester):
"properties": { "properties": {
"model_name": { "model_name": {
"type": "string", "type": "string",
"description": "Name of the OpenAI model." "description": "Name of the OpenAI model.",
}, },
"input_fee": { "input_fee": {
"type": "string", "type": "string",
"description": "Fee for input token for the OpenAI model." "description": "Fee for input token for the OpenAI model.",
}, },
"output_fee": { "output_fee": {
"type": "string", "type": "string",
"description": "Fee for output token for the OpenAI model." "description": "Fee for output token for the OpenAI model.",
} },
}, },
"required": ["model_name", "input_fee", "output_fee"] "required": ["model_name", "input_fee", "output_fee"],
} }
request = { request = {
@@ -243,10 +256,10 @@ def test_llm_extraction(tester: Crawl4AiTester):
"api_token": os.getenv("OPENAI_API_KEY"), "api_token": os.getenv("OPENAI_API_KEY"),
"schema": schema, "schema": schema,
"extraction_type": "schema", "extraction_type": "schema",
"instruction": """From the crawled content, extract all mentioned model names along with their fees for input and output tokens.""" "instruction": """From the crawled content, extract all mentioned model names along with their fees for input and output tokens.""",
} },
}, },
"crawler_params": {"word_count_threshold": 1} "crawler_params": {"word_count_threshold": 1},
} }
try: try:
@@ -258,6 +271,7 @@ def test_llm_extraction(tester: Crawl4AiTester):
except Exception as e: except Exception as e:
print(f"LLM extraction test failed (might be due to missing API key): {str(e)}") print(f"LLM extraction test failed (might be due to missing API key): {str(e)}")
def test_llm_with_ollama(tester: Crawl4AiTester): def test_llm_with_ollama(tester: Crawl4AiTester):
print("\n=== Testing LLM with Ollama ===") print("\n=== Testing LLM with Ollama ===")
schema = { schema = {
@@ -265,18 +279,18 @@ def test_llm_with_ollama(tester: Crawl4AiTester):
"properties": { "properties": {
"article_title": { "article_title": {
"type": "string", "type": "string",
"description": "The main title of the news article" "description": "The main title of the news article",
}, },
"summary": { "summary": {
"type": "string", "type": "string",
"description": "A brief summary of the article content" "description": "A brief summary of the article content",
}, },
"main_topics": { "main_topics": {
"type": "array", "type": "array",
"items": {"type": "string"}, "items": {"type": "string"},
"description": "Main topics or themes discussed in the article" "description": "Main topics or themes discussed in the article",
} },
} },
} }
request = { request = {
@@ -288,11 +302,11 @@ def test_llm_with_ollama(tester: Crawl4AiTester):
"provider": "ollama/llama2", "provider": "ollama/llama2",
"schema": schema, "schema": schema,
"extraction_type": "schema", "extraction_type": "schema",
"instruction": "Extract the main article information including title, summary, and main topics." "instruction": "Extract the main article information including title, summary, and main topics.",
} },
}, },
"extra": {"word_count_threshold": 1}, "extra": {"word_count_threshold": 1},
"crawler_params": {"verbose": True} "crawler_params": {"verbose": True},
} }
try: try:
@@ -303,6 +317,7 @@ def test_llm_with_ollama(tester: Crawl4AiTester):
except Exception as e: except Exception as e:
print(f"Ollama extraction test failed: {str(e)}") print(f"Ollama extraction test failed: {str(e)}")
def test_cosine_extraction(tester: Crawl4AiTester): def test_cosine_extraction(tester: Crawl4AiTester):
print("\n=== Testing Cosine Extraction ===") print("\n=== Testing Cosine Extraction ===")
request = { request = {
@@ -314,9 +329,9 @@ def test_cosine_extraction(tester: Crawl4AiTester):
"semantic_filter": "business finance economy", "semantic_filter": "business finance economy",
"word_count_threshold": 10, "word_count_threshold": 10,
"max_dist": 0.2, "max_dist": 0.2,
"top_k": 3 "top_k": 3,
} },
} },
} }
try: try:
@@ -328,15 +343,14 @@ def test_cosine_extraction(tester: Crawl4AiTester):
except Exception as e: except Exception as e:
print(f"Cosine extraction test failed: {str(e)}") print(f"Cosine extraction test failed: {str(e)}")
def test_screenshot(tester: Crawl4AiTester): def test_screenshot(tester: Crawl4AiTester):
print("\n=== Testing Screenshot ===") print("\n=== Testing Screenshot ===")
request = { request = {
"urls": "https://www.nbcnews.com/business", "urls": "https://www.nbcnews.com/business",
"priority": 5, "priority": 5,
"screenshot": True, "screenshot": True,
"crawler_params": { "crawler_params": {"headless": True},
"headless": True
}
} }
result = tester.submit_and_wait(request) result = tester.submit_and_wait(request)
@@ -351,6 +365,7 @@ def test_screenshot(tester: Crawl4AiTester):
assert result["result"]["success"] assert result["result"]["success"]
if __name__ == "__main__": if __name__ == "__main__":
version = sys.argv[1] if len(sys.argv) > 1 else "basic" version = sys.argv[1] if len(sys.argv) > 1 else "basic"
# version = "full" # version = "full"
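
Reviewer note: `submit_and_wait` in this file submits a job and then polls `/task/{task_id}` until the status settles or the timeout elapses. The polling pattern can be sketched on its own, without `requests` or a running server. This is an illustrative outline under assumptions, not the tester's actual code: `wait_for_task` and its `fetch_status`, `clock`, and `sleep` parameters are hypothetical, and the status strings `"completed"` and `"failed"` are taken from the assertions and checks visible in the diff.

```python
import time
from typing import Any, Callable, Dict


def wait_for_task(
    fetch_status: Callable[[], Dict[str, Any]],
    timeout: float = 300.0,
    interval: float = 2.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Poll fetch_status until the task completes, fails, or times out."""
    deadline = clock() + timeout
    while True:
        status = fetch_status()
        if status.get("status") == "completed":
            return status
        if status.get("status") == "failed":
            raise RuntimeError(status.get("error", "task failed"))
        # Check the deadline after each poll so at least one poll always runs.
        if clock() >= deadline:
            raise TimeoutError(f"task did not complete within {timeout} seconds")
        sleep(interval)
```

Passing `clock` and `sleep` as parameters is a design choice for the sketch: it lets the loop be exercised with canned statuses and no real waiting.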


@@ -9,18 +9,17 @@ This example shows how to:
import asyncio import asyncio
import os import os
from typing import Dict, Any
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from crawl4ai.extraction_strategy import ( from crawl4ai.extraction_strategy import (
LLMExtractionStrategy, LLMExtractionStrategy,
JsonCssExtractionStrategy, JsonCssExtractionStrategy,
JsonXPathExtractionStrategy JsonXPathExtractionStrategy,
) )
from crawl4ai.chunking_strategy import RegexChunking, IdentityChunking
from crawl4ai.content_filter_strategy import PruningContentFilter from crawl4ai.content_filter_strategy import PruningContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
async def run_extraction(crawler: AsyncWebCrawler, url: str, strategy, name: str): async def run_extraction(crawler: AsyncWebCrawler, url: str, strategy, name: str):
"""Helper function to run extraction with proper configuration""" """Helper function to run extraction with proper configuration"""
try: try:
@@ -30,7 +29,7 @@ async def run_extraction(crawler: AsyncWebCrawler, url: str, strategy, name: str
extraction_strategy=strategy, extraction_strategy=strategy,
markdown_generator=DefaultMarkdownGenerator( markdown_generator=DefaultMarkdownGenerator(
content_filter=PruningContentFilter() # For fit_markdown support content_filter=PruningContentFilter() # For fit_markdown support
) ),
) )
# Run the crawler # Run the crawler
@@ -40,22 +39,22 @@ async def run_extraction(crawler: AsyncWebCrawler, url: str, strategy, name: str
print(f"\n=== {name} Results ===") print(f"\n=== {name} Results ===")
print(f"Extracted Content: {result.extracted_content}") print(f"Extracted Content: {result.extracted_content}")
print(f"Raw Markdown Length: {len(result.markdown_v2.raw_markdown)}") print(f"Raw Markdown Length: {len(result.markdown_v2.raw_markdown)}")
print(f"Citations Markdown Length: {len(result.markdown_v2.markdown_with_citations)}") print(
f"Citations Markdown Length: {len(result.markdown_v2.markdown_with_citations)}"
)
else: else:
print(f"Error in {name}: Crawl failed") print(f"Error in {name}: Crawl failed")
except Exception as e: except Exception as e:
print(f"Error in {name}: {str(e)}") print(f"Error in {name}: {str(e)}")
async def main(): async def main():
# Example URL (replace with actual URL) # Example URL (replace with actual URL)
url = "https://example.com/product-page" url = "https://example.com/product-page"
# Configure browser settings # Configure browser settings
browser_config = BrowserConfig( browser_config = BrowserConfig(headless=True, verbose=True)
headless=True,
verbose=True
)
# Initialize extraction strategies # Initialize extraction strategies
@@ -63,21 +62,21 @@ async def main():
markdown_strategy = LLMExtractionStrategy( markdown_strategy = LLMExtractionStrategy(
provider="openai/gpt-4o-mini", provider="openai/gpt-4o-mini",
api_token=os.getenv("OPENAI_API_KEY"), api_token=os.getenv("OPENAI_API_KEY"),
instruction="Extract product information including name, price, and description" instruction="Extract product information including name, price, and description",
) )
html_strategy = LLMExtractionStrategy( html_strategy = LLMExtractionStrategy(
input_format="html", input_format="html",
provider="openai/gpt-4o-mini", provider="openai/gpt-4o-mini",
api_token=os.getenv("OPENAI_API_KEY"), api_token=os.getenv("OPENAI_API_KEY"),
instruction="Extract product information from HTML including structured data" instruction="Extract product information from HTML including structured data",
) )
fit_markdown_strategy = LLMExtractionStrategy( fit_markdown_strategy = LLMExtractionStrategy(
input_format="fit_markdown", input_format="fit_markdown",
provider="openai/gpt-4o-mini", provider="openai/gpt-4o-mini",
api_token=os.getenv("OPENAI_API_KEY"), api_token=os.getenv("OPENAI_API_KEY"),
instruction="Extract product information from cleaned markdown" instruction="Extract product information from cleaned markdown",
) )
# 2. JSON CSS Extraction (automatically uses HTML input) # 2. JSON CSS Extraction (automatically uses HTML input)
@@ -86,8 +85,8 @@ async def main():
"fields": [ "fields": [
{"name": "title", "selector": "h1.product-title", "type": "text"}, {"name": "title", "selector": "h1.product-title", "type": "text"},
{"name": "price", "selector": ".price", "type": "text"}, {"name": "price", "selector": ".price", "type": "text"},
{"name": "description", "selector": ".description", "type": "text"} {"name": "description", "selector": ".description", "type": "text"},
] ],
} }
css_strategy = JsonCssExtractionStrategy(schema=css_schema) css_strategy = JsonCssExtractionStrategy(schema=css_schema)
@@ -95,10 +94,22 @@ async def main():
xpath_schema = { xpath_schema = {
"baseSelector": "//div[@class='product']", "baseSelector": "//div[@class='product']",
"fields": [ "fields": [
{"name": "title", "selector": ".//h1[@class='product-title']/text()", "type": "text"}, {
{"name": "price", "selector": ".//span[@class='price']/text()", "type": "text"}, "name": "title",
{"name": "description", "selector": ".//div[@class='description']/text()", "type": "text"} "selector": ".//h1[@class='product-title']/text()",
] "type": "text",
},
{
"name": "price",
"selector": ".//span[@class='price']/text()",
"type": "text",
},
{
"name": "description",
"selector": ".//div[@class='description']/text()",
"type": "text",
},
],
} }
xpath_strategy = JsonXPathExtractionStrategy(schema=xpath_schema) xpath_strategy = JsonXPathExtractionStrategy(schema=xpath_schema)
@@ -111,5 +122,6 @@ async def main():
await run_extraction(crawler, url, css_strategy, "CSS Extraction") await run_extraction(crawler, url, css_strategy, "CSS Extraction")
await run_extraction(crawler, url, xpath_strategy, "XPath Extraction") await run_extraction(crawler, url, xpath_strategy, "XPath Extraction")
if __name__ == "__main__": if __name__ == "__main__":
asyncio.run(main()) asyncio.run(main())
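
Reviewer note: the XPath schema in this file pairs a `baseSelector` with per-field selectors that are evaluated relative to each matched element. The sketch below illustrates that idea with the standard library only. It does not use crawl4ai; the `extract` helper and the sample markup are hypothetical, and the field selectors drop the trailing `/text()` because the XPath subset in `xml.etree.ElementTree` does not support it.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample markup. Real pages usually need an HTML parser.
SAMPLE = """
<html><body>
  <div class="product">
    <h1 class="product-title">Widget</h1>
    <span class="price">$9.99</span>
    <div class="description">A small widget.</div>
  </div>
  <div class="product">
    <h1 class="product-title">Gadget</h1>
    <span class="price">$19.99</span>
    <div class="description">A larger gadget.</div>
  </div>
</body></html>
"""

# Same shape as the schema in the diff, adapted to ElementTree's XPath subset.
SCHEMA = {
    "baseSelector": ".//div[@class='product']",
    "fields": [
        {"name": "title", "selector": ".//h1[@class='product-title']", "type": "text"},
        {"name": "price", "selector": ".//span[@class='price']", "type": "text"},
        {"name": "description", "selector": ".//div[@class='description']", "type": "text"},
    ],
}


def extract(markup, schema):
    """Return one dict per element matched by baseSelector (illustrative helper)."""
    root = ET.fromstring(markup.strip())
    rows = []
    for item in root.findall(schema["baseSelector"]):
        row = {}
        for field in schema["fields"]:
            # Field selectors are resolved relative to the matched item.
            node = item.find(field["selector"])
            row[field["name"]] = (node.text or "").strip() if node is not None else None
        rows.append(row)
    return rows


products = extract(SAMPLE, SCHEMA)
```

Each product element yields one record, and a field whose selector matches nothing is reported as `None` instead of raising.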

@@ -1,20 +1,23 @@
import asyncio import asyncio
from crawl4ai import * from crawl4ai import *
async def main(): async def main():
browser_config = BrowserConfig(headless=True, verbose=True) browser_config = BrowserConfig(headless=True, verbose=True)
async with AsyncWebCrawler(config=browser_config) as crawler: async with AsyncWebCrawler(config=browser_config) as crawler:
crawler_config = CrawlerRunConfig( crawler_config = CrawlerRunConfig(
cache_mode=CacheMode.BYPASS, cache_mode=CacheMode.BYPASS,
markdown_generator=DefaultMarkdownGenerator( markdown_generator=DefaultMarkdownGenerator(
content_filter=PruningContentFilter(threshold=0.48, threshold_type="fixed", min_word_threshold=0) content_filter=PruningContentFilter(
) threshold=0.48, threshold_type="fixed", min_word_threshold=0
)
),
) )
result = await crawler.arun( result = await crawler.arun(
url="https://www.helloworld.org", url="https://www.helloworld.org", config=crawler_config
config=crawler_config
) )
print(result.markdown_v2.raw_markdown[:500]) print(result.markdown_v2.raw_markdown[:500])
if __name__ == "__main__": if __name__ == "__main__":
asyncio.run(main()) asyncio.run(main())

@@ -1,19 +1,18 @@
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from playwright.async_api import Page, BrowserContext from playwright.async_api import Page, BrowserContext
async def main(): async def main():
print("🔗 Hooks Example: Demonstrating different hook use cases") print("🔗 Hooks Example: Demonstrating different hook use cases")
# Configure browser settings # Configure browser settings
browser_config = BrowserConfig( browser_config = BrowserConfig(headless=True)
headless=True
)
# Configure crawler settings # Configure crawler settings
crawler_run_config = CrawlerRunConfig( crawler_run_config = CrawlerRunConfig(
js_code="window.scrollTo(0, document.body.scrollHeight);", js_code="window.scrollTo(0, document.body.scrollHeight);",
wait_for="body", wait_for="body",
cache_mode=CacheMode.BYPASS cache_mode=CacheMode.BYPASS,
) )
# Create crawler instance # Create crawler instance
@@ -30,16 +29,22 @@ async def main():
"""Hook called after a new page and context are created""" """Hook called after a new page and context are created"""
print("[HOOK] on_page_context_created - New page created!") print("[HOOK] on_page_context_created - New page created!")
# Example: Add a session cookie and set the viewport size # Example: Add a session cookie and set the viewport size
await context.add_cookies([{ await context.add_cookies(
'name': 'session_id', [
'value': 'example_session', {
'domain': '.example.com', "name": "session_id",
'path': '/' "value": "example_session",
}]) "domain": ".example.com",
await page.set_viewport_size({"width": 1920, "height": 1080}) "path": "/",
}
]
)
await page.set_viewport_size({"width": 1080, "height": 800})
return page return page
async def on_user_agent_updated(page: Page, context: BrowserContext, user_agent: str, **kwargs): async def on_user_agent_updated(
page: Page, context: BrowserContext, user_agent: str, **kwargs
):
"""Hook called when the user agent is updated""" """Hook called when the user agent is updated"""
print(f"[HOOK] on_user_agent_updated - New user agent: {user_agent}") print(f"[HOOK] on_user_agent_updated - New user agent: {user_agent}")
return page return page
@@ -53,17 +58,17 @@ async def main():
"""Hook called before navigating to each URL""" """Hook called before navigating to each URL"""
print(f"[HOOK] before_goto - About to visit: {url}") print(f"[HOOK] before_goto - About to visit: {url}")
# Example: Add custom headers for the request # Example: Add custom headers for the request
await page.set_extra_http_headers({ await page.set_extra_http_headers({"Custom-Header": "my-value"})
"Custom-Header": "my-value"
})
return page return page
async def after_goto(page: Page, context: BrowserContext, url: str, response: dict, **kwargs): async def after_goto(
page: Page, context: BrowserContext, url: str, response: dict, **kwargs
):
"""Hook called after navigating to each URL""" """Hook called after navigating to each URL"""
print(f"[HOOK] after_goto - Successfully loaded: {url}") print(f"[HOOK] after_goto - Successfully loaded: {url}")
# Example: Wait for a specific element to be loaded # Example: Wait for a specific element to be loaded
try: try:
await page.wait_for_selector('.content', timeout=1000) await page.wait_for_selector(".content", timeout=1000)
print("Content element found!") print("Content element found!")
except: except:
print("Content element not found, continuing anyway") print("Content element not found, continuing anyway")
@@ -76,7 +81,9 @@ async def main():
await page.evaluate("window.scrollTo(0, document.body.scrollHeight);") await page.evaluate("window.scrollTo(0, document.body.scrollHeight);")
return page return page
async def before_return_html(page: Page, context: BrowserContext, html:str, **kwargs): async def before_return_html(
page: Page, context: BrowserContext, html: str, **kwargs
):
"""Hook called before returning the HTML content""" """Hook called before returning the HTML content"""
print(f"[HOOK] before_return_html - Got HTML content (length: {len(html)})") print(f"[HOOK] before_return_html - Got HTML content (length: {len(html)})")
# Example: You could modify the HTML content here if needed # Example: You could modify the HTML content here if needed
@@ -84,7 +91,9 @@ async def main():
# Set all the hooks # Set all the hooks
crawler.crawler_strategy.set_hook("on_browser_created", on_browser_created) crawler.crawler_strategy.set_hook("on_browser_created", on_browser_created)
crawler.crawler_strategy.set_hook("on_page_context_created", on_page_context_created) crawler.crawler_strategy.set_hook(
"on_page_context_created", on_page_context_created
)
crawler.crawler_strategy.set_hook("on_user_agent_updated", on_user_agent_updated) crawler.crawler_strategy.set_hook("on_user_agent_updated", on_user_agent_updated)
crawler.crawler_strategy.set_hook("on_execution_started", on_execution_started) crawler.crawler_strategy.set_hook("on_execution_started", on_execution_started)
crawler.crawler_strategy.set_hook("before_goto", before_goto) crawler.crawler_strategy.set_hook("before_goto", before_goto)
@@ -95,13 +104,15 @@ async def main():
await crawler.start() await crawler.start()
# Example usage: crawl a simple website # Example usage: crawl a simple website
url = 'https://example.com' url = "https://example.com"
result = await crawler.arun(url, config=crawler_run_config) result = await crawler.arun(url, config=crawler_run_config)
print(f"\nCrawled URL: {result.url}") print(f"\nCrawled URL: {result.url}")
print(f"HTML length: {len(result.html)}") print(f"HTML length: {len(result.html)}")
await crawler.close() await crawler.close()
if __name__ == "__main__": if __name__ == "__main__":
import asyncio import asyncio
asyncio.run(main()) asyncio.run(main())
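
Reviewer note: the hooks in this file are async callables registered by name and awaited at fixed points of a crawl. The sketch below shows one way such a registry could work. It is not crawl4ai's implementation; `HookRegistry` and `run_hook` are hypothetical names, and plain strings stand in for the Playwright `page` and `context` objects.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, Optional

Hook = Callable[..., Awaitable[Any]]


class HookRegistry:
    """Hypothetical stand-in for an object that accepts named async hooks."""

    def __init__(self, *names: str) -> None:
        self._hooks: Dict[str, Optional[Hook]] = {name: None for name in names}

    def set_hook(self, name: str, hook: Hook) -> None:
        # Reject unknown names early so a typo does not silently do nothing.
        if name not in self._hooks:
            raise ValueError(f"Unknown hook: {name}")
        self._hooks[name] = hook

    async def run_hook(self, name: str, *args: Any, **kwargs: Any) -> Any:
        hook = self._hooks.get(name)
        if hook is None:
            # No hook registered: pass the first argument (the page) through.
            return args[0] if args else None
        return await hook(*args, **kwargs)


calls = []


async def before_goto(page, context, url, **kwargs):
    """Record the URL, then hand the page back as the hooks in the diff do."""
    calls.append(("before_goto", url))
    return page


async def demo():
    registry = HookRegistry("before_goto", "after_goto")
    registry.set_hook("before_goto", before_goto)
    page = await registry.run_hook(
        "before_goto", "page", "context", url="https://example.com"
    )
    untouched = await registry.run_hook(
        "after_goto", "page", "context", url="https://example.com"
    )
    return page, untouched


result = asyncio.run(demo())
```

Accepting `**kwargs` in every hook, as the diff does, lets the caller add arguments later without breaking existing hooks.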

@@ -1,6 +1,7 @@
import asyncio import asyncio
from crawl4ai import AsyncWebCrawler, AsyncPlaywrightCrawlerStrategy from crawl4ai import AsyncWebCrawler, AsyncPlaywrightCrawlerStrategy
async def main(): async def main():
# Example 1: Setting language when creating the crawler # Example 1: Setting language when creating the crawler
crawler1 = AsyncWebCrawler( crawler1 = AsyncWebCrawler(
@@ -9,11 +10,15 @@ async def main():
) )
) )
result1 = await crawler1.arun("https://www.example.com") result1 = await crawler1.arun("https://www.example.com")
print("Example 1 result:", result1.extracted_content[:100]) # Print first 100 characters print(
"Example 1 result:", result1.extracted_content[:100]
) # Print first 100 characters
# Example 2: Setting language before crawling # Example 2: Setting language before crawling
crawler2 = AsyncWebCrawler() crawler2 = AsyncWebCrawler()
crawler2.crawler_strategy.headers["Accept-Language"] = "es-ES,es;q=0.9,en-US;q=0.8,en;q=0.7" crawler2.crawler_strategy.headers[
"Accept-Language"
] = "es-ES,es;q=0.9,en-US;q=0.8,en;q=0.7"
result2 = await crawler2.arun("https://www.example.com") result2 = await crawler2.arun("https://www.example.com")
print("Example 2 result:", result2.extracted_content[:100]) print("Example 2 result:", result2.extracted_content[:100])
@@ -21,7 +26,7 @@ async def main():
crawler3 = AsyncWebCrawler() crawler3 = AsyncWebCrawler()
result3 = await crawler3.arun( result3 = await crawler3.arun(
"https://www.example.com", "https://www.example.com",
headers={"Accept-Language": "de-DE,de;q=0.9,en-US;q=0.8,en;q=0.7"} headers={"Accept-Language": "de-DE,de;q=0.9,en-US;q=0.8,en;q=0.7"},
) )
print("Example 3 result:", result3.extracted_content[:100]) print("Example 3 result:", result3.extracted_content[:100])
@@ -33,13 +38,13 @@ async def main():
] ]
crawler4 = AsyncWebCrawler() crawler4 = AsyncWebCrawler()
results = await asyncio.gather(*[ results = await asyncio.gather(
crawler4.arun(url, headers={"Accept-Language": lang}) *[crawler4.arun(url, headers={"Accept-Language": lang}) for url, lang in urls]
for url, lang in urls )
])
for url, result in zip([u for u, _ in urls], results): for url, result in zip([u for u, _ in urls], results):
print(f"Result for {url}:", result.extracted_content[:100]) print(f"Result for {url}:", result.extracted_content[:100])
if __name__ == "__main__": if __name__ == "__main__":
asyncio.run(main()) asyncio.run(main())
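
Reviewer note: Example 4 in this file relies on `asyncio.gather` returning results in the same order as the awaitables it was given, which is what makes the later `zip` with the URL list safe. A minimal sketch of that pattern follows. `fake_fetch` is a hypothetical stand-in for `crawler.arun`, and the target list is made up for illustration.

```python
import asyncio


async def fake_fetch(url, headers):
    """Hypothetical stand-in for a crawl call; echoes the language it was given."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return f"{url} [{headers['Accept-Language']}]"


async def crawl_all(targets):
    # gather() preserves input order, so results line up with targets.
    results = await asyncio.gather(
        *[fake_fetch(url, headers={"Accept-Language": lang}) for url, lang in targets]
    )
    return dict(zip((url for url, _ in targets), results))


targets = [
    ("https://www.example.com", "fr-FR,fr;q=0.9"),
    ("https://www.example.org", "es-ES,es;q=0.9"),
    ("https://www.example.net", "de-DE,de;q=0.9"),
]
paired = asyncio.run(crawl_all(targets))
```

Building a dict keyed by URL avoids the second pass over `urls` that the original `zip([u for u, _ in urls], results)` needs, at the cost of assuming the URLs are unique.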

@@ -3,32 +3,37 @@ from crawl4ai.crawler_strategy import *
import asyncio import asyncio
from pydantic import BaseModel, Field from pydantic import BaseModel, Field
url = r'https://openai.com/api/pricing/' url = r"https://openai.com/api/pricing/"
class OpenAIModelFee(BaseModel): class OpenAIModelFee(BaseModel):
model_name: str = Field(..., description="Name of the OpenAI model.") model_name: str = Field(..., description="Name of the OpenAI model.")
input_fee: str = Field(..., description="Fee for input token for the OpenAI model.") input_fee: str = Field(..., description="Fee for input token for the OpenAI model.")
output_fee: str = Field(..., description="Fee for output token for the OpenAI model.") output_fee: str = Field(
..., description="Fee for output token for the OpenAI model."
)
from crawl4ai import AsyncWebCrawler from crawl4ai import AsyncWebCrawler
async def main(): async def main():
# Use AsyncWebCrawler # Use AsyncWebCrawler
async with AsyncWebCrawler() as crawler: async with AsyncWebCrawler() as crawler:
result = await crawler.arun( result = await crawler.arun(
url=url, url=url,
word_count_threshold=1, word_count_threshold=1,
extraction_strategy= LLMExtractionStrategy( extraction_strategy=LLMExtractionStrategy(
# provider= "openai/gpt-4o", api_token = os.getenv('OPENAI_API_KEY'), # provider= "openai/gpt-4o", api_token = os.getenv('OPENAI_API_KEY'),
provider= "groq/llama-3.1-70b-versatile", api_token = os.getenv('GROQ_API_KEY'), provider="groq/llama-3.1-70b-versatile",
api_token=os.getenv("GROQ_API_KEY"),
schema=OpenAIModelFee.model_json_schema(), schema=OpenAIModelFee.model_json_schema(),
extraction_type="schema", extraction_type="schema",
instruction="From the crawled content, extract all mentioned model names along with their " \ instruction="From the crawled content, extract all mentioned model names along with their "
"fees for input and output tokens. Make sure not to miss anything in the entire content. " \ "fees for input and output tokens. Make sure not to miss anything in the entire content. "
'One extracted model JSON format should look like this: ' \ "One extracted model JSON format should look like this: "
'{ "model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", "output_fee": "US$30.00 / 1M tokens" }' '{ "model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", "output_fee": "US$30.00 / 1M tokens" }',
), ),
) )
print("Success:", result.success) print("Success:", result.success)
model_fees = json.loads(result.extracted_content) model_fees = json.loads(result.extracted_content)
@@ -37,4 +42,5 @@ async def main():
with open(".data/data.json", "w", encoding="utf-8") as f: with open(".data/data.json", "w", encoding="utf-8") as f:
f.write(result.extracted_content) f.write(result.extracted_content)
asyncio.run(main()) asyncio.run(main())

@@ -0,0 +1,87 @@
import os
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from crawl4ai.content_filter_strategy import LLMContentFilter
async def test_llm_filter():
# Create an HTML source that needs intelligent filtering
url = "https://docs.python.org/3/tutorial/classes.html"
browser_config = BrowserConfig(
headless=True,
verbose=True
)
# run_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
run_config = CrawlerRunConfig(cache_mode=CacheMode.ENABLED)
async with AsyncWebCrawler(config=browser_config) as crawler:
# First get the raw HTML
result = await crawler.arun(url, config=run_config)
html = result.cleaned_html
# Initialize LLM filter with focused instruction
filter = LLMContentFilter(
provider="openai/gpt-4o",
api_token=os.getenv('OPENAI_API_KEY'),
instruction="""
Focus on extracting the core educational content about Python classes.
Include:
- Key concepts and their explanations
- Important code examples
- Essential technical details
Exclude:
- Navigation elements
- Sidebars
- Footer content
- Version information
- Any non-essential UI elements
Format the output as clean markdown with proper code blocks and headers.
""",
verbose=True
)
filter = LLMContentFilter(
provider="openai/gpt-4o",
api_token=os.getenv('OPENAI_API_KEY'),
chunk_token_threshold=2 ** 12 * 2, # 4096 * 2 = 8192
instruction="""
Extract the main educational content while preserving its original wording and substance completely. Your task is to:
1. Maintain the exact language and terminology used in the main content
2. Keep all technical explanations, examples, and educational content intact
3. Preserve the original flow and structure of the core content
4. Remove only clearly irrelevant elements like:
- Navigation menus
- Advertisement sections
- Cookie notices
- Footers with site information
- Sidebars with external links
- Any UI elements that don't contribute to learning
The goal is to create a clean markdown version that reads exactly like the original article,
keeping all valuable content but free from distracting elements. Imagine you're creating
a perfect reading experience where nothing valuable is lost, but all noise is removed.
""",
verbose=True
)
# Apply filtering
filtered_content = filter.filter_content(html, ignore_cache = True)
# Show results
print("\nFiltered Content Length:", len(filtered_content))
print("\nFirst 500 chars of filtered content:")
if filtered_content:
print(filtered_content[0][:500])
# Save the markdown version to disk
with open("filtered_content.md", "w", encoding="utf-8") as f:
f.write("\n".join(filtered_content))
# Show token usage
filter.show_usage()
if __name__ == "__main__":
asyncio.run(test_llm_filter())

@@ -8,12 +8,12 @@ import asyncio
import time import time
import json import json
import re import re
from typing import Dict, List from typing import Dict
from bs4 import BeautifulSoup from bs4 import BeautifulSoup
from pydantic import BaseModel, Field from pydantic import BaseModel, Field
from crawl4ai import AsyncWebCrawler, CacheMode, BrowserConfig, CrawlerRunConfig from crawl4ai import AsyncWebCrawler, CacheMode, BrowserConfig, CrawlerRunConfig
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
from crawl4ai.content_filter_strategy import BM25ContentFilter, PruningContentFilter from crawl4ai.content_filter_strategy import PruningContentFilter
from crawl4ai.extraction_strategy import ( from crawl4ai.extraction_strategy import (
JsonCssExtractionStrategy, JsonCssExtractionStrategy,
LLMExtractionStrategy, LLMExtractionStrategy,
@@ -62,6 +62,7 @@ async def clean_content():
print(f"Full Markdown Length: {full_markdown_length}") print(f"Full Markdown Length: {full_markdown_length}")
print(f"Fit Markdown Length: {fit_markdown_length}") print(f"Fit Markdown Length: {fit_markdown_length}")
async def link_analysis(): async def link_analysis():
crawler_config = CrawlerRunConfig( crawler_config = CrawlerRunConfig(
cache_mode=CacheMode.ENABLED, cache_mode=CacheMode.ENABLED,
@@ -76,9 +77,10 @@ async def link_analysis():
print(f"Found {len(result.links['internal'])} internal links") print(f"Found {len(result.links['internal'])} internal links")
print(f"Found {len(result.links['external'])} external links") print(f"Found {len(result.links['external'])} external links")
for link in result.links['internal'][:5]: for link in result.links["internal"][:5]:
print(f"Href: {link['href']}\nText: {link['text']}\n") print(f"Href: {link['href']}\nText: {link['text']}\n")
# JavaScript Execution Example # JavaScript Execution Example
async def simple_example_with_running_js_code(): async def simple_example_with_running_js_code():
print("\n--- Executing JavaScript and Using CSS Selectors ---") print("\n--- Executing JavaScript and Using CSS Selectors ---")
@@ -112,25 +114,29 @@ async def simple_example_with_css_selector():
) )
print(result.markdown[:500]) print(result.markdown[:500])
async def media_handling(): async def media_handling():
crawler_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS, exclude_external_images=True, screenshot=True) crawler_config = CrawlerRunConfig(
cache_mode=CacheMode.BYPASS, exclude_external_images=True, screenshot=True
)
async with AsyncWebCrawler() as crawler: async with AsyncWebCrawler() as crawler:
result = await crawler.arun( result = await crawler.arun(
url="https://www.nbcnews.com/business", url="https://www.nbcnews.com/business", config=crawler_config
config=crawler_config
) )
for img in result.media['images'][:5]: for img in result.media["images"][:5]:
print(f"Image URL: {img['src']}, Alt: {img['alt']}, Score: {img['score']}") print(f"Image URL: {img['src']}, Alt: {img['alt']}, Score: {img['score']}")
async def custom_hook_workflow(verbose=True): async def custom_hook_workflow(verbose=True):
async with AsyncWebCrawler() as crawler: async with AsyncWebCrawler() as crawler:
# Set a 'before_goto' hook to run custom code just before navigation # Set a 'before_goto' hook to run custom code just before navigation
crawler.crawler_strategy.set_hook("before_goto", lambda page, context: print("[Hook] Preparing to navigate...")) crawler.crawler_strategy.set_hook(
"before_goto",
lambda page, context: print("[Hook] Preparing to navigate..."),
)
# Perform the crawl operation # Perform the crawl operation
result = await crawler.arun( result = await crawler.arun(url="https://crawl4ai.com")
url="https://crawl4ai.com"
)
print(result.markdown_v2.raw_markdown[:500].replace("\n", " -- ")) print(result.markdown_v2.raw_markdown[:500].replace("\n", " -- "))
@@ -225,7 +231,7 @@ async def extract_structured_data_using_css_extractor():
print("\n--- Using JsonCssExtractionStrategy for Fast Structured Output ---") print("\n--- Using JsonCssExtractionStrategy for Fast Structured Output ---")
schema = { schema = {
"name": "KidoCode Courses", "name": "KidoCode Courses",
"baseSelector": "section.charge-methodology .w-tab-content > div", "baseSelector": "section.charge-methodology .framework-collection-item.w-dyn-item",
"fields": [ "fields": [
{ {
"name": "section_title", "name": "section_title",
@@ -273,6 +279,7 @@ async def extract_structured_data_using_css_extractor():
cache_mode=CacheMode.BYPASS, cache_mode=CacheMode.BYPASS,
extraction_strategy=JsonCssExtractionStrategy(schema), extraction_strategy=JsonCssExtractionStrategy(schema),
js_code=[js_click_tabs], js_code=[js_click_tabs],
delay_before_return_html=1
) )
async with AsyncWebCrawler(config=browser_config) as crawler: async with AsyncWebCrawler(config=browser_config) as crawler:
@@ -412,21 +419,22 @@ async def cosine_similarity_extraction():
cache_mode=CacheMode.BYPASS, cache_mode=CacheMode.BYPASS,
extraction_strategy=CosineStrategy( extraction_strategy=CosineStrategy(
word_count_threshold=10, word_count_threshold=10,
max_dist=0.2, # Maximum distance between two words max_dist=0.2, # Maximum distance between two words
linkage_method="ward", # Linkage method for hierarchical clustering (ward, complete, average, single) linkage_method="ward", # Linkage method for hierarchical clustering (ward, complete, average, single)
top_k=3, # Number of top keywords to extract top_k=3, # Number of top keywords to extract
sim_threshold=0.3, # Similarity threshold for clustering sim_threshold=0.3, # Similarity threshold for clustering
semantic_filter="McDonald's economic impact, American consumer trends", # Keywords to filter the content semantically using embeddings semantic_filter="McDonald's economic impact, American consumer trends", # Keywords to filter the content semantically using embeddings
verbose=True verbose=True,
), ),
) )
async with AsyncWebCrawler() as crawler: async with AsyncWebCrawler() as crawler:
result = await crawler.arun( result = await crawler.arun(
url="https://www.nbcnews.com/business/consumer/how-mcdonalds-e-coli-crisis-inflation-politics-reflect-american-story-rcna177156", url="https://www.nbcnews.com/business/consumer/how-mcdonalds-e-coli-crisis-inflation-politics-reflect-american-story-rcna177156",
config=crawl_config config=crawl_config,
) )
print(json.loads(result.extracted_content)[:5]) print(json.loads(result.extracted_content)[:5])
# Browser Comparison # Browser Comparison
async def crawl_custom_browser_type(): async def crawl_custom_browser_type():
print("\n--- Browser Comparison ---") print("\n--- Browser Comparison ---")
@@ -484,18 +492,16 @@ async def crawl_with_user_simulation():
result = await crawler.arun(url="YOUR-URL-HERE", config=crawler_config) result = await crawler.arun(url="YOUR-URL-HERE", config=crawler_config)
print(result.markdown) print(result.markdown)
async def ssl_certification(): async def ssl_certification():
# Configure crawler to fetch SSL certificate # Configure crawler to fetch SSL certificate
config = CrawlerRunConfig( config = CrawlerRunConfig(
fetch_ssl_certificate=True, fetch_ssl_certificate=True,
cache_mode=CacheMode.BYPASS # Bypass cache to always get fresh certificates cache_mode=CacheMode.BYPASS, # Bypass cache to always get fresh certificates
) )
async with AsyncWebCrawler() as crawler: async with AsyncWebCrawler() as crawler:
result = await crawler.arun( result = await crawler.arun(url="https://example.com", config=config)
url='https://example.com',
config=config
)
if result.success and result.ssl_certificate: if result.success and result.ssl_certificate:
cert = result.ssl_certificate cert = result.ssl_certificate
@@ -511,12 +517,17 @@ async def ssl_certification():
print("\nCertificate exported to:") print("\nCertificate exported to:")
print(f"- JSON: {os.path.join(tmp_dir, 'certificate.json')}") print(f"- JSON: {os.path.join(tmp_dir, 'certificate.json')}")
pem_data = cert.to_pem(os.path.join(tmp_dir, "certificate.pem")) # For web servers pem_data = cert.to_pem(
os.path.join(tmp_dir, "certificate.pem")
) # For web servers
print(f"- PEM: {os.path.join(tmp_dir, 'certificate.pem')}") print(f"- PEM: {os.path.join(tmp_dir, 'certificate.pem')}")
der_data = cert.to_der(os.path.join(tmp_dir, "certificate.der")) # For Java apps der_data = cert.to_der(
os.path.join(tmp_dir, "certificate.der")
) # For Java apps
print(f"- DER: {os.path.join(tmp_dir, 'certificate.der')}") print(f"- DER: {os.path.join(tmp_dir, 'certificate.der')}")
# Speed Comparison # Speed Comparison
async def speed_comparison(): async def speed_comparison():
print("\n--- Speed Comparison ---") print("\n--- Speed Comparison ---")
@@ -581,29 +592,26 @@ async def speed_comparison():
# Main execution # Main execution
async def main(): async def main():
# Basic examples # Basic examples
# await simple_crawl() await simple_crawl()
# await simple_example_with_running_js_code() await simple_example_with_running_js_code()
# await simple_example_with_css_selector() await simple_example_with_css_selector()
# Advanced examples # Advanced examples
# await extract_structured_data_using_css_extractor() await extract_structured_data_using_css_extractor()
await extract_structured_data_using_llm( await extract_structured_data_using_llm(
"openai/gpt-4o", os.getenv("OPENAI_API_KEY") "openai/gpt-4o", os.getenv("OPENAI_API_KEY")
) )
# await crawl_dynamic_content_pages_method_1() await crawl_dynamic_content_pages_method_1()
# await crawl_dynamic_content_pages_method_2() await crawl_dynamic_content_pages_method_2()
# Browser comparisons # Browser comparisons
# await crawl_custom_browser_type() await crawl_custom_browser_type()
# Performance testing
# await speed_comparison()
# Screenshot example # Screenshot example
# await capture_and_save_screenshot( await capture_and_save_screenshot(
# "https://www.example.com", "https://www.example.com",
# os.path.join(__location__, "tmp/example_screenshot.jpg") os.path.join(__location__, "tmp/example_screenshot.jpg")
# ) )
if __name__ == "__main__": if __name__ == "__main__":

@@ -1,6 +1,10 @@
import os, sys import os, sys
# append parent directory to system path # append parent directory to system path
sys.path.append(os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))); os.environ['FIRECRAWL_API_KEY'] = "fc-84b370ccfad44beabc686b38f1769692"; sys.path.append(
os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
)
os.environ["FIRECRAWL_API_KEY"] = "fc-84b370ccfad44beabc686b38f1769692"
import asyncio import asyncio
# import nest_asyncio # import nest_asyncio
@@ -15,7 +19,7 @@ from bs4 import BeautifulSoup
from pydantic import BaseModel, Field from pydantic import BaseModel, Field
from crawl4ai import AsyncWebCrawler, CacheMode from crawl4ai import AsyncWebCrawler, CacheMode
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
from crawl4ai.content_filter_strategy import BM25ContentFilter, PruningContentFilter from crawl4ai.content_filter_strategy import PruningContentFilter
from crawl4ai.extraction_strategy import ( from crawl4ai.extraction_strategy import (
JsonCssExtractionStrategy, JsonCssExtractionStrategy,
LLMExtractionStrategy, LLMExtractionStrategy,
@@ -32,9 +36,12 @@ print("Website: https://crawl4ai.com")
async def simple_crawl(): async def simple_crawl():
print("\n--- Basic Usage ---") print("\n--- Basic Usage ---")
async with AsyncWebCrawler(verbose=True) as crawler: async with AsyncWebCrawler(verbose=True) as crawler:
result = await crawler.arun(url="https://www.nbcnews.com/business", cache_mode= CacheMode.BYPASS) result = await crawler.arun(
url="https://www.nbcnews.com/business", cache_mode=CacheMode.BYPASS
)
print(result.markdown[:500]) # Print first 500 characters print(result.markdown[:500]) # Print first 500 characters
async def simple_example_with_running_js_code(): async def simple_example_with_running_js_code():
print("\n--- Executing JavaScript and Using CSS Selectors ---") print("\n--- Executing JavaScript and Using CSS Selectors ---")
# New code to handle the wait_for parameter # New code to handle the wait_for parameter
@@ -57,6 +64,7 @@ async def simple_example_with_running_js_code():
) )
print(result.markdown[:500]) # Print first 500 characters print(result.markdown[:500]) # Print first 500 characters
async def simple_example_with_css_selector(): async def simple_example_with_css_selector():
print("\n--- Using CSS Selectors ---") print("\n--- Using CSS Selectors ---")
async with AsyncWebCrawler(verbose=True) as crawler: async with AsyncWebCrawler(verbose=True) as crawler:
@@ -67,26 +75,27 @@ async def simple_example_with_css_selector():
) )
print(result.markdown[:500]) # Print first 500 characters print(result.markdown[:500]) # Print first 500 characters
async def use_proxy(): async def use_proxy():
print("\n--- Using a Proxy ---") print("\n--- Using a Proxy ---")
print( print(
"Note: Replace 'http://your-proxy-url:port' with a working proxy to run this example." "Note: Replace 'http://your-proxy-url:port' with a working proxy to run this example."
) )
# Uncomment and modify the following lines to use a proxy # Uncomment and modify the following lines to use a proxy
async with AsyncWebCrawler(verbose=True, proxy="http://your-proxy-url:port") as crawler: async with AsyncWebCrawler(
verbose=True, proxy="http://your-proxy-url:port"
) as crawler:
result = await crawler.arun( result = await crawler.arun(
url="https://www.nbcnews.com/business", url="https://www.nbcnews.com/business", cache_mode=CacheMode.BYPASS
cache_mode= CacheMode.BYPASS
) )
if result.success: if result.success:
print(result.markdown[:500]) # Print first 500 characters print(result.markdown[:500]) # Print first 500 characters
async def capture_and_save_screenshot(url: str, output_path: str): async def capture_and_save_screenshot(url: str, output_path: str):
async with AsyncWebCrawler(verbose=True) as crawler: async with AsyncWebCrawler(verbose=True) as crawler:
result = await crawler.arun( result = await crawler.arun(
url=url, url=url, screenshot=True, cache_mode=CacheMode.BYPASS
screenshot=True,
cache_mode= CacheMode.BYPASS
) )
if result.success and result.screenshot: if result.success and result.screenshot:
@@ -96,13 +105,14 @@ async def capture_and_save_screenshot(url: str, output_path: str):
screenshot_data = base64.b64decode(result.screenshot) screenshot_data = base64.b64decode(result.screenshot)
# Save the screenshot as a JPEG file # Save the screenshot as a JPEG file
with open(output_path, 'wb') as f: with open(output_path, "wb") as f:
f.write(screenshot_data) f.write(screenshot_data)
print(f"Screenshot saved successfully to {output_path}") print(f"Screenshot saved successfully to {output_path}")
else: else:
print("Failed to capture screenshot") print("Failed to capture screenshot")
class OpenAIModelFee(BaseModel): class OpenAIModelFee(BaseModel):
model_name: str = Field(..., description="Name of the OpenAI model.") model_name: str = Field(..., description="Name of the OpenAI model.")
input_fee: str = Field(..., description="Fee for input token for the OpenAI model.") input_fee: str = Field(..., description="Fee for input token for the OpenAI model.")
@@ -110,7 +120,10 @@ class OpenAIModelFee(BaseModel):
..., description="Fee for output token for the OpenAI model." ..., description="Fee for output token for the OpenAI model."
) )
async def extract_structured_data_using_llm(provider: str, api_token: str = None, extra_headers: Dict[str, str] = None):
async def extract_structured_data_using_llm(
provider: str, api_token: str = None, extra_headers: Dict[str, str] = None
):
print(f"\n--- Extracting Structured Data with {provider} ---") print(f"\n--- Extracting Structured Data with {provider} ---")
if api_token is None and provider != "ollama": if api_token is None and provider != "ollama":
@@ -118,7 +131,7 @@ async def extract_structured_data_using_llm(provider: str, api_token: str = None
return return
# extra_args = {} # extra_args = {}
extra_args={ extra_args = {
"temperature": 0, "temperature": 0,
"top_p": 0.9, "top_p": 0.9,
"max_tokens": 2000, "max_tokens": 2000,
@@ -139,52 +152,49 @@ async def extract_structured_data_using_llm(provider: str, api_token: str = None
instruction="""From the crawled content, extract all mentioned model names along with their fees for input and output tokens. instruction="""From the crawled content, extract all mentioned model names along with their fees for input and output tokens.
Do not miss any models in the entire content. One extracted model JSON format should look like this: Do not miss any models in the entire content. One extracted model JSON format should look like this:
{"model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", "output_fee": "US$30.00 / 1M tokens"}.""", {"model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", "output_fee": "US$30.00 / 1M tokens"}.""",
extra_args=extra_args extra_args=extra_args,
), ),
cache_mode=CacheMode.BYPASS, cache_mode=CacheMode.BYPASS,
) )
print(result.extracted_content) print(result.extracted_content)
async def extract_structured_data_using_css_extractor(): async def extract_structured_data_using_css_extractor():
print("\n--- Using JsonCssExtractionStrategy for Fast Structured Output ---") print("\n--- Using JsonCssExtractionStrategy for Fast Structured Output ---")
schema = { schema = {
"name": "KidoCode Courses", "name": "KidoCode Courses",
"baseSelector": "section.charge-methodology .w-tab-content > div", "baseSelector": "section.charge-methodology .w-tab-content > div",
"fields": [ "fields": [
{ {
"name": "section_title", "name": "section_title",
"selector": "h3.heading-50", "selector": "h3.heading-50",
"type": "text", "type": "text",
}, },
{ {
"name": "section_description", "name": "section_description",
"selector": ".charge-content", "selector": ".charge-content",
"type": "text", "type": "text",
}, },
{ {
"name": "course_name", "name": "course_name",
"selector": ".text-block-93", "selector": ".text-block-93",
"type": "text", "type": "text",
}, },
{ {
"name": "course_description", "name": "course_description",
"selector": ".course-content-text", "selector": ".course-content-text",
"type": "text", "type": "text",
}, },
{ {
"name": "course_icon", "name": "course_icon",
"selector": ".image-92", "selector": ".image-92",
"type": "attribute", "type": "attribute",
"attribute": "src" "attribute": "src",
} },
] ],
} }
async with AsyncWebCrawler(
headless=True,
verbose=True
) as crawler:
async with AsyncWebCrawler(headless=True, verbose=True) as crawler:
# Create the JavaScript that handles clicking multiple times # Create the JavaScript that handles clicking multiple times
js_click_tabs = """ js_click_tabs = """
(async () => { (async () => {
@@ -204,13 +214,14 @@ async def extract_structured_data_using_css_extractor():
url="https://www.kidocode.com/degrees/technology", url="https://www.kidocode.com/degrees/technology",
extraction_strategy=JsonCssExtractionStrategy(schema, verbose=True), extraction_strategy=JsonCssExtractionStrategy(schema, verbose=True),
js_code=[js_click_tabs], js_code=[js_click_tabs],
cache_mode=CacheMode.BYPASS cache_mode=CacheMode.BYPASS,
) )
companies = json.loads(result.extracted_content) companies = json.loads(result.extracted_content)
print(f"Successfully extracted {len(companies)} companies") print(f"Successfully extracted {len(companies)} companies")
print(json.dumps(companies[0], indent=2)) print(json.dumps(companies[0], indent=2))
# Advanced Session-Based Crawling with Dynamic Content 🔄 # Advanced Session-Based Crawling with Dynamic Content 🔄
async def crawl_dynamic_content_pages_method_1(): async def crawl_dynamic_content_pages_method_1():
print("\n--- Advanced Multi-Page Crawling with JavaScript Execution ---") print("\n--- Advanced Multi-Page Crawling with JavaScript Execution ---")
@@ -267,6 +278,7 @@ async def crawl_dynamic_content_pages_method_1():
await crawler.crawler_strategy.kill_session(session_id) await crawler.crawler_strategy.kill_session(session_id)
print(f"Successfully crawled {len(all_commits)} commits across 3 pages") print(f"Successfully crawled {len(all_commits)} commits across 3 pages")
async def crawl_dynamic_content_pages_method_2(): async def crawl_dynamic_content_pages_method_2():
print("\n--- Advanced Multi-Page Crawling with JavaScript Execution ---") print("\n--- Advanced Multi-Page Crawling with JavaScript Execution ---")
@@ -334,8 +346,11 @@ async def crawl_dynamic_content_pages_method_2():
await crawler.crawler_strategy.kill_session(session_id) await crawler.crawler_strategy.kill_session(session_id)
print(f"Successfully crawled {len(all_commits)} commits across 3 pages") print(f"Successfully crawled {len(all_commits)} commits across 3 pages")
async def crawl_dynamic_content_pages_method_3(): async def crawl_dynamic_content_pages_method_3():
print("\n--- Advanced Multi-Page Crawling with JavaScript Execution using `wait_for` ---") print(
"\n--- Advanced Multi-Page Crawling with JavaScript Execution using `wait_for` ---"
)
async with AsyncWebCrawler(verbose=True) as crawler: async with AsyncWebCrawler(verbose=True) as crawler:
url = "https://github.com/microsoft/TypeScript/commits/main" url = "https://github.com/microsoft/TypeScript/commits/main"
@@ -395,41 +410,54 @@ async def crawl_dynamic_content_pages_method_3():
await crawler.crawler_strategy.kill_session(session_id) await crawler.crawler_strategy.kill_session(session_id)
print(f"Successfully crawled {len(all_commits)} commits across 3 pages") print(f"Successfully crawled {len(all_commits)} commits across 3 pages")
async def crawl_custom_browser_type(): async def crawl_custom_browser_type():
# Use Firefox # Use Firefox
start = time.time() start = time.time()
async with AsyncWebCrawler(browser_type="firefox", verbose=True, headless = True) as crawler: async with AsyncWebCrawler(
result = await crawler.arun(url="https://www.example.com", cache_mode= CacheMode.BYPASS) browser_type="firefox", verbose=True, headless=True
) as crawler:
result = await crawler.arun(
url="https://www.example.com", cache_mode=CacheMode.BYPASS
)
print(result.markdown[:500]) print(result.markdown[:500])
print("Time taken: ", time.time() - start) print("Time taken: ", time.time() - start)
# Use WebKit # Use WebKit
start = time.time() start = time.time()
async with AsyncWebCrawler(browser_type="webkit", verbose=True, headless = True) as crawler: async with AsyncWebCrawler(
result = await crawler.arun(url="https://www.example.com", cache_mode= CacheMode.BYPASS) browser_type="webkit", verbose=True, headless=True
) as crawler:
result = await crawler.arun(
url="https://www.example.com", cache_mode=CacheMode.BYPASS
)
print(result.markdown[:500]) print(result.markdown[:500])
print("Time taken: ", time.time() - start) print("Time taken: ", time.time() - start)
# Use Chromium (default) # Use Chromium (default)
start = time.time() start = time.time()
async with AsyncWebCrawler(verbose=True, headless = True) as crawler: async with AsyncWebCrawler(verbose=True, headless=True) as crawler:
result = await crawler.arun(url="https://www.example.com", cache_mode= CacheMode.BYPASS) result = await crawler.arun(
url="https://www.example.com", cache_mode=CacheMode.BYPASS
)
print(result.markdown[:500]) print(result.markdown[:500])
print("Time taken: ", time.time() - start) print("Time taken: ", time.time() - start)
async def crawl_with_user_simultion(): async def crawl_with_user_simultion():
async with AsyncWebCrawler(verbose=True, headless=True) as crawler: async with AsyncWebCrawler(verbose=True, headless=True) as crawler:
url = "YOUR-URL-HERE" url = "YOUR-URL-HERE"
result = await crawler.arun( result = await crawler.arun(
url=url, url=url,
cache_mode=CacheMode.BYPASS, cache_mode=CacheMode.BYPASS,
magic = True, # Automatically detects and removes overlays, popups, and other elements that block content magic=True, # Automatically detects and removes overlays, popups, and other elements that block content
# simulate_user = True,# Causes a series of random mouse movements and clicks to simulate user interaction # simulate_user = True,# Causes a series of random mouse movements and clicks to simulate user interaction
# override_navigator = True # Overrides the navigator object to make it look like a real user # override_navigator = True # Overrides the navigator object to make it look like a real user
) )
print(result.markdown) print(result.markdown)
async def speed_comparison(): async def speed_comparison():
# print("\n--- Speed Comparison ---") # print("\n--- Speed Comparison ---")
# print("Firecrawl (simulated):") # print("Firecrawl (simulated):")
@@ -439,11 +467,11 @@ async def speed_comparison():
# print() # print()
# Simulated Firecrawl performance # Simulated Firecrawl performance
from firecrawl import FirecrawlApp from firecrawl import FirecrawlApp
app = FirecrawlApp(api_key=os.environ['FIRECRAWL_API_KEY'])
app = FirecrawlApp(api_key=os.environ["FIRECRAWL_API_KEY"])
start = time.time() start = time.time()
scrape_status = app.scrape_url( scrape_status = app.scrape_url(
'https://www.nbcnews.com/business', "https://www.nbcnews.com/business", params={"formats": ["markdown", "html"]}
params={'formats': ['markdown', 'html']}
) )
end = time.time() end = time.time()
print("Firecrawl:") print("Firecrawl:")
@@ -474,7 +502,9 @@ async def speed_comparison():
url="https://www.nbcnews.com/business", url="https://www.nbcnews.com/business",
word_count_threshold=0, word_count_threshold=0,
markdown_generator=DefaultMarkdownGenerator( markdown_generator=DefaultMarkdownGenerator(
content_filter = PruningContentFilter(threshold=0.48, threshold_type="fixed", min_word_threshold=0) content_filter=PruningContentFilter(
threshold=0.48, threshold_type="fixed", min_word_threshold=0
)
# content_filter=BM25ContentFilter(user_query=None, bm25_threshold=1.0) # content_filter=BM25ContentFilter(user_query=None, bm25_threshold=1.0)
), ),
cache_mode=CacheMode.BYPASS, cache_mode=CacheMode.BYPASS,
@@ -498,7 +528,9 @@ async def speed_comparison():
word_count_threshold=0, word_count_threshold=0,
cache_mode=CacheMode.BYPASS, cache_mode=CacheMode.BYPASS,
markdown_generator=DefaultMarkdownGenerator( markdown_generator=DefaultMarkdownGenerator(
content_filter = PruningContentFilter(threshold=0.48, threshold_type="fixed", min_word_threshold=0) content_filter=PruningContentFilter(
threshold=0.48, threshold_type="fixed", min_word_threshold=0
)
# content_filter=BM25ContentFilter(user_query=None, bm25_threshold=1.0) # content_filter=BM25ContentFilter(user_query=None, bm25_threshold=1.0)
), ),
verbose=False, verbose=False,
@@ -520,6 +552,7 @@ async def speed_comparison():
print("If you run these tests in an environment with better network conditions,") print("If you run these tests in an environment with better network conditions,")
print("you may observe an even more significant speed advantage for Crawl4AI.") print("you may observe an even more significant speed advantage for Crawl4AI.")
async def generate_knowledge_graph(): async def generate_knowledge_graph():
class Entity(BaseModel): class Entity(BaseModel):
name: str name: str
@@ -536,11 +569,11 @@ async def generate_knowledge_graph():
relationships: List[Relationship] relationships: List[Relationship]
extraction_strategy = LLMExtractionStrategy( extraction_strategy = LLMExtractionStrategy(
provider='openai/gpt-4o-mini', # Or any other provider, including Ollama and open source models provider="openai/gpt-4o-mini", # Or any other provider, including Ollama and open source models
api_token=os.getenv('OPENAI_API_KEY'), # In case of Ollama just pass "no-token" api_token=os.getenv("OPENAI_API_KEY"), # In case of Ollama just pass "no-token"
schema=KnowledgeGraph.model_json_schema(), schema=KnowledgeGraph.model_json_schema(),
extraction_type="schema", extraction_type="schema",
instruction="""Extract entities and relationships from the given text.""" instruction="""Extract entities and relationships from the given text.""",
) )
async with AsyncWebCrawler() as crawler: async with AsyncWebCrawler() as crawler:
url = "https://paulgraham.com/love.html" url = "https://paulgraham.com/love.html"
@@ -554,27 +587,22 @@ async def generate_knowledge_graph():
with open(os.path.join(__location__, "kb.json"), "w") as f: with open(os.path.join(__location__, "kb.json"), "w") as f:
f.write(result.extracted_content) f.write(result.extracted_content)
async def fit_markdown_remove_overlay():
async def fit_markdown_remove_overlay():
async with AsyncWebCrawler( async with AsyncWebCrawler(
headless=True, # Set to False to see what is happening headless=True, # Set to False to see what is happening
verbose=True, verbose=True,
user_agent_mode="random", user_agent_mode="random",
user_agent_generator_config={ user_agent_generator_config={"device_type": "mobile", "os_type": "android"},
"device_type": "mobile",
"os_type": "android"
},
) as crawler: ) as crawler:
result = await crawler.arun( result = await crawler.arun(
url='https://www.kidocode.com/degrees/technology', url="https://www.kidocode.com/degrees/technology",
cache_mode=CacheMode.BYPASS, cache_mode=CacheMode.BYPASS,
markdown_generator=DefaultMarkdownGenerator( markdown_generator=DefaultMarkdownGenerator(
content_filter=PruningContentFilter( content_filter=PruningContentFilter(
threshold=0.48, threshold_type="fixed", min_word_threshold=0 threshold=0.48, threshold_type="fixed", min_word_threshold=0
), ),
options={ options={"ignore_links": True},
"ignore_links": True
}
), ),
# markdown_generator=DefaultMarkdownGenerator( # markdown_generator=DefaultMarkdownGenerator(
# content_filter=BM25ContentFilter(user_query="", bm25_threshold=1.0), # content_filter=BM25ContentFilter(user_query="", bm25_threshold=1.0),
@@ -593,13 +621,20 @@ async def fit_markdown_remove_overlay():
with open(os.path.join(__location__, "output/cleaned_html.html"), "w") as f: with open(os.path.join(__location__, "output/cleaned_html.html"), "w") as f:
f.write(result.cleaned_html) f.write(result.cleaned_html)
with open(os.path.join(__location__, "output/output_raw_markdown.md"), "w") as f: with open(
os.path.join(__location__, "output/output_raw_markdown.md"), "w"
) as f:
f.write(result.markdown_v2.raw_markdown) f.write(result.markdown_v2.raw_markdown)
with open(os.path.join(__location__, "output/output_markdown_with_citations.md"), "w") as f: with open(
os.path.join(__location__, "output/output_markdown_with_citations.md"),
"w",
) as f:
f.write(result.markdown_v2.markdown_with_citations) f.write(result.markdown_v2.markdown_with_citations)
with open(os.path.join(__location__, "output/output_fit_markdown.md"), "w") as f: with open(
os.path.join(__location__, "output/output_fit_markdown.md"), "w"
) as f:
f.write(result.markdown_v2.fit_markdown) f.write(result.markdown_v2.fit_markdown)
print("Done") print("Done")


@@ -10,15 +10,17 @@ from functools import lru_cache
console = Console() console = Console()
@lru_cache() @lru_cache()
def create_crawler(): def create_crawler():
crawler = WebCrawler(verbose=True) crawler = WebCrawler(verbose=True)
crawler.warmup() crawler.warmup()
return crawler return crawler
def print_result(result): def print_result(result):
# Print each key on one line with just the first 20 characters of its value, followed by three dots # Print each key on one line with just the first 20 characters of its value, followed by three dots
console.print(f"\t[bold]Result:[/bold]") console.print("\t[bold]Result:[/bold]")
for key, value in result.model_dump().items(): for key, value in result.model_dump().items():
if isinstance(value, str) and value: if isinstance(value, str) and value:
console.print(f"\t{key}: [green]{value[:20]}...[/green]") console.print(f"\t{key}: [green]{value[:20]}...[/green]")
@@ -33,18 +35,27 @@ def cprint(message, press_any_key=False):
console.print("Press any key to continue...", style="") console.print("Press any key to continue...", style="")
input() input()
def basic_usage(crawler): def basic_usage(crawler):
cprint("🛠️ [bold cyan]Basic Usage: Simply provide a URL and let Crawl4ai do the magic![/bold cyan]") cprint(
result = crawler.run(url="https://www.nbcnews.com/business", only_text = True) "🛠️ [bold cyan]Basic Usage: Simply provide a URL and let Crawl4ai do the magic![/bold cyan]"
)
result = crawler.run(url="https://www.nbcnews.com/business", only_text=True)
cprint("[LOG] 📦 [bold yellow]Basic crawl result:[/bold yellow]") cprint("[LOG] 📦 [bold yellow]Basic crawl result:[/bold yellow]")
print_result(result) print_result(result)
def basic_usage_some_params(crawler): def basic_usage_some_params(crawler):
cprint("🛠️ [bold cyan]Basic Usage: Simply provide a URL and let Crawl4ai do the magic![/bold cyan]") cprint(
result = crawler.run(url="https://www.nbcnews.com/business", word_count_threshold=1, only_text = True) "🛠️ [bold cyan]Basic Usage: Simply provide a URL and let Crawl4ai do the magic![/bold cyan]"
)
result = crawler.run(
url="https://www.nbcnews.com/business", word_count_threshold=1, only_text=True
)
cprint("[LOG] 📦 [bold yellow]Basic crawl result:[/bold yellow]") cprint("[LOG] 📦 [bold yellow]Basic crawl result:[/bold yellow]")
print_result(result) print_result(result)
def screenshot_usage(crawler): def screenshot_usage(crawler):
cprint("\n📸 [bold cyan]Let's take a screenshot of the page![/bold cyan]") cprint("\n📸 [bold cyan]Let's take a screenshot of the page![/bold cyan]")
result = crawler.run(url="https://www.nbcnews.com/business", screenshot=True) result = crawler.run(url="https://www.nbcnews.com/business", screenshot=True)
@@ -55,16 +66,23 @@ def screenshot_usage(crawler):
cprint("Screenshot saved to 'screenshot.png'!") cprint("Screenshot saved to 'screenshot.png'!")
print_result(result) print_result(result)
def understanding_parameters(crawler): def understanding_parameters(crawler):
cprint("\n🧠 [bold cyan]Understanding 'bypass_cache' and 'include_raw_html' parameters:[/bold cyan]") cprint(
cprint("By default, Crawl4ai caches the results of your crawls. This means that subsequent crawls of the same URL will be much faster! Let's see this in action.") "\n🧠 [bold cyan]Understanding 'bypass_cache' and 'include_raw_html' parameters:[/bold cyan]"
)
cprint(
"By default, Crawl4ai caches the results of your crawls. This means that subsequent crawls of the same URL will be much faster! Let's see this in action."
)
# First crawl (populates the cache) # First crawl (populates the cache)
cprint("1️⃣ First crawl (caches the result):", True) cprint("1️⃣ First crawl (caches the result):", True)
start_time = time.time() start_time = time.time()
result = crawler.run(url="https://www.nbcnews.com/business") result = crawler.run(url="https://www.nbcnews.com/business")
end_time = time.time() end_time = time.time()
cprint(f"[LOG] 📦 [bold yellow]First crawl took {end_time - start_time} seconds and result (from cache):[/bold yellow]") cprint(
f"[LOG] 📦 [bold yellow]First crawl took {end_time - start_time} seconds and result (from cache):[/bold yellow]"
)
print_result(result) print_result(result)
# Force to crawl again # Force to crawl again
@@ -72,132 +90,194 @@ def understanding_parameters(crawler):
start_time = time.time() start_time = time.time()
result = crawler.run(url="https://www.nbcnews.com/business", bypass_cache=True) result = crawler.run(url="https://www.nbcnews.com/business", bypass_cache=True)
end_time = time.time() end_time = time.time()
cprint(f"[LOG] 📦 [bold yellow]Second crawl took {end_time - start_time} seconds and result (forced to crawl):[/bold yellow]") cprint(
f"[LOG] 📦 [bold yellow]Second crawl took {end_time - start_time} seconds and result (forced to crawl):[/bold yellow]"
)
print_result(result) print_result(result)
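The caching behaviour that `understanding_parameters` demonstrates can be illustrated with a small stand-alone sketch. This is not Crawl4AI's cache implementation. `CachedFetcher` and its `fetch_count` counter are hypothetical names, and only the `bypass_cache` flag mirrors the parameter used above.

```python
from typing import Callable, Dict


class CachedFetcher:
    """Toy URL fetcher that remembers results unless told to bypass the cache."""

    def __init__(self, fetch: Callable[[str], str]):
        self._fetch = fetch
        self._cache: Dict[str, str] = {}
        self.fetch_count = 0  # how many times the underlying fetch actually ran

    def run(self, url: str, bypass_cache: bool = False) -> str:
        # Serve from the cache only when allowed and when the URL was seen before.
        if not bypass_cache and url in self._cache:
            return self._cache[url]
        self.fetch_count += 1
        result = self._fetch(url)
        self._cache[url] = result  # a forced crawl also refreshes the cache
        return result


# The lambda stands in for a real network fetch (assumption for illustration).
fetcher = CachedFetcher(lambda url: f"<html>{url}</html>")
first = fetcher.run("https://example.com")    # miss: fetches and stores
second = fetcher.run("https://example.com")   # hit: served from the cache
forced = fetcher.run("https://example.com", bypass_cache=True)  # fetches again
```

Only the first and the forced call reach the underlying fetch, which is why the second timed crawl in the example is expected to be faster only when the cache is not bypassed.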
def add_chunking_strategy(crawler): def add_chunking_strategy(crawler):
# Adding a chunking strategy: RegexChunking # Adding a chunking strategy: RegexChunking
cprint("\n🧩 [bold cyan]Let's add a chunking strategy: RegexChunking![/bold cyan]", True) cprint(
cprint("RegexChunking is a simple chunking strategy that splits the text based on a given regex pattern. Let's see it in action!") "\n🧩 [bold cyan]Let's add a chunking strategy: RegexChunking![/bold cyan]",
True,
)
cprint(
"RegexChunking is a simple chunking strategy that splits the text based on a given regex pattern. Let's see it in action!"
)
result = crawler.run( result = crawler.run(
url="https://www.nbcnews.com/business", url="https://www.nbcnews.com/business",
chunking_strategy=RegexChunking(patterns=["\n\n"]) chunking_strategy=RegexChunking(patterns=["\n\n"]),
) )
cprint("[LOG] 📦 [bold yellow]RegexChunking result:[/bold yellow]") cprint("[LOG] 📦 [bold yellow]RegexChunking result:[/bold yellow]")
print_result(result) print_result(result)
# Adding another chunking strategy: NlpSentenceChunking # Adding another chunking strategy: NlpSentenceChunking
cprint("\n🔍 [bold cyan]Time to explore another chunking strategy: NlpSentenceChunking![/bold cyan]", True) cprint(
cprint("NlpSentenceChunking uses NLP techniques to split the text into sentences. Let's see how it performs!") "\n🔍 [bold cyan]Time to explore another chunking strategy: NlpSentenceChunking![/bold cyan]",
True,
)
cprint(
"NlpSentenceChunking uses NLP techniques to split the text into sentences. Let's see how it performs!"
)
result = crawler.run( result = crawler.run(
url="https://www.nbcnews.com/business", url="https://www.nbcnews.com/business", chunking_strategy=NlpSentenceChunking()
chunking_strategy=NlpSentenceChunking()
) )
cprint("[LOG] 📦 [bold yellow]NlpSentenceChunking result:[/bold yellow]") cprint("[LOG] 📦 [bold yellow]NlpSentenceChunking result:[/bold yellow]")
print_result(result) print_result(result)
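To make the idea behind regex-based chunking concrete, here is a minimal sketch that splits text on a list of patterns using only the standard library. It is an illustration of the technique, not the library's `RegexChunking` class. The function name `regex_chunk` and the sample text are assumptions.

```python
import re
from typing import List


def regex_chunk(text: str, patterns: List[str]) -> List[str]:
    """Split text on each pattern in turn, dropping empty chunks."""
    chunks = [text]
    for pattern in patterns:
        next_chunks: List[str] = []
        for chunk in chunks:
            next_chunks.extend(re.split(pattern, chunk))
        chunks = next_chunks
    # Strip whitespace and discard chunks that are empty after stripping.
    return [c.strip() for c in chunks if c.strip()]


sample = "First paragraph.\n\nSecond paragraph.\n\n\n\nThird."
chunks = regex_chunk(sample, ["\n\n"])
```

With the `"\n\n"` pattern used in the example above, each blank-line-separated paragraph becomes one chunk.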
def add_extraction_strategy(crawler): def add_extraction_strategy(crawler):
# Adding an extraction strategy: CosineStrategy # Adding an extraction strategy: CosineStrategy
cprint("\n🧠 [bold cyan]Let's get smarter with an extraction strategy: CosineStrategy![/bold cyan]", True) cprint(
cprint("CosineStrategy uses cosine similarity to extract semantically similar blocks of text. Let's see it in action!") "\n🧠 [bold cyan]Let's get smarter with an extraction strategy: CosineStrategy![/bold cyan]",
True,
)
cprint(
"CosineStrategy uses cosine similarity to extract semantically similar blocks of text. Let's see it in action!"
)
result = crawler.run( result = crawler.run(
url="https://www.nbcnews.com/business", url="https://www.nbcnews.com/business",
extraction_strategy=CosineStrategy(word_count_threshold=10, max_dist=0.2, linkage_method="ward", top_k=3, sim_threshold = 0.3, verbose=True) extraction_strategy=CosineStrategy(
word_count_threshold=10,
max_dist=0.2,
linkage_method="ward",
top_k=3,
sim_threshold=0.3,
verbose=True,
),
) )
cprint("[LOG] 📦 [bold yellow]CosineStrategy result:[/bold yellow]") cprint("[LOG] 📦 [bold yellow]CosineStrategy result:[/bold yellow]")
print_result(result) print_result(result)
# Using semantic_filter with CosineStrategy # Using semantic_filter with CosineStrategy
cprint("You can pass other parameters like 'semantic_filter' to the CosineStrategy to extract semantically similar blocks of text. Let's see it in action!") cprint(
"You can pass other parameters like 'semantic_filter' to the CosineStrategy to extract semantically similar blocks of text. Let's see it in action!"
)
result = crawler.run( result = crawler.run(
url="https://www.nbcnews.com/business", url="https://www.nbcnews.com/business",
extraction_strategy=CosineStrategy( extraction_strategy=CosineStrategy(
semantic_filter="inflation rent prices", semantic_filter="inflation rent prices",
) ),
)
cprint(
"[LOG] 📦 [bold yellow]CosineStrategy result with semantic filter:[/bold yellow]"
) )
cprint("[LOG] 📦 [bold yellow]CosineStrategy result with semantic filter:[/bold yellow]")
print_result(result) print_result(result)
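Cosine similarity itself is simple to compute. The sketch below scores text blocks against a query using plain word counts. This bag-of-words approach is a stand-in chosen for brevity and is an assumption: `CosineStrategy` may well compare learned embeddings instead. The function name and sample blocks are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity between two texts, using word-count vectors."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    # Dot product over the words the two texts share.
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(v * v for v in va.values()))
    norm_b = math.sqrt(sum(v * v for v in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


query = "inflation rent prices"
blocks = [
    "rent prices rise as inflation persists",
    "local team wins the championship game",
]
scores = [cosine_similarity(query, block) for block in blocks]
```

A filter such as `semantic_filter="inflation rent prices"` would then keep the blocks whose score clears a chosen threshold. Here only the first block shares any words with the query.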
def add_llm_extraction_strategy(crawler): def add_llm_extraction_strategy(crawler):
# Adding an LLM extraction strategy without instructions # Adding an LLM extraction strategy without instructions
cprint("\n🤖 [bold cyan]Time to bring in the big guns: LLMExtractionStrategy without instructions![/bold cyan]", True) cprint(
cprint("LLMExtractionStrategy uses a large language model to extract relevant information from the web page. Let's see it in action!") "\n🤖 [bold cyan]Time to bring in the big guns: LLMExtractionStrategy without instructions![/bold cyan]",
True,
)
cprint(
"LLMExtractionStrategy uses a large language model to extract relevant information from the web page. Let's see it in action!"
)
result = crawler.run( result = crawler.run(
url="https://www.nbcnews.com/business", url="https://www.nbcnews.com/business",
extraction_strategy=LLMExtractionStrategy(provider="openai/gpt-4o", api_token=os.getenv('OPENAI_API_KEY')) extraction_strategy=LLMExtractionStrategy(
provider="openai/gpt-4o", api_token=os.getenv("OPENAI_API_KEY")
),
)
cprint(
"[LOG] 📦 [bold yellow]LLMExtractionStrategy (no instructions) result:[/bold yellow]"
) )
cprint("[LOG] 📦 [bold yellow]LLMExtractionStrategy (no instructions) result:[/bold yellow]")
print_result(result) print_result(result)
# Adding an LLM extraction strategy with instructions # Adding an LLM extraction strategy with instructions
cprint("\n📜 [bold cyan]Let's make it even more interesting: LLMExtractionStrategy with instructions![/bold cyan]", True) cprint(
cprint("Let's say we are only interested in financial news. Let's see how LLMExtractionStrategy performs with instructions!") "\n📜 [bold cyan]Let's make it even more interesting: LLMExtractionStrategy with instructions![/bold cyan]",
True,
)
cprint(
"Let's say we are only interested in financial news. Let's see how LLMExtractionStrategy performs with instructions!"
)
result = crawler.run( result = crawler.run(
url="https://www.nbcnews.com/business", url="https://www.nbcnews.com/business",
extraction_strategy=LLMExtractionStrategy( extraction_strategy=LLMExtractionStrategy(
provider="openai/gpt-4o", provider="openai/gpt-4o",
api_token=os.getenv('OPENAI_API_KEY'), api_token=os.getenv("OPENAI_API_KEY"),
instruction="I am interested in only financial news" instruction="I am interested in only financial news",
) ),
)
cprint(
"[LOG] 📦 [bold yellow]LLMExtractionStrategy (with instructions) result:[/bold yellow]"
) )
cprint("[LOG] 📦 [bold yellow]LLMExtractionStrategy (with instructions) result:[/bold yellow]")
print_result(result) print_result(result)
result = crawler.run( result = crawler.run(
url="https://www.nbcnews.com/business", url="https://www.nbcnews.com/business",
extraction_strategy=LLMExtractionStrategy( extraction_strategy=LLMExtractionStrategy(
provider="openai/gpt-4o", provider="openai/gpt-4o",
api_token=os.getenv('OPENAI_API_KEY'), api_token=os.getenv("OPENAI_API_KEY"),
instruction="Extract only content related to technology" instruction="Extract only content related to technology",
) ),
)
cprint(
"[LOG] 📦 [bold yellow]LLMExtractionStrategy (with technology instruction) result:[/bold yellow]"
) )
cprint("[LOG] 📦 [bold yellow]LLMExtractionStrategy (with technology instruction) result:[/bold yellow]")
print_result(result) print_result(result)
def targeted_extraction(crawler): def targeted_extraction(crawler):
# Using a CSS selector to extract only H2 tags # Using a CSS selector to extract only H2 tags
cprint("\n🎯 [bold cyan]Targeted extraction: Let's use a CSS selector to extract only H2 tags![/bold cyan]", True) cprint(
result = crawler.run( "\n🎯 [bold cyan]Targeted extraction: Let's use a CSS selector to extract only H2 tags![/bold cyan]",
url="https://www.nbcnews.com/business", True,
css_selector="h2"
) )
result = crawler.run(url="https://www.nbcnews.com/business", css_selector="h2")
cprint("[LOG] 📦 [bold yellow]CSS Selector (H2 tags) result:[/bold yellow]") cprint("[LOG] 📦 [bold yellow]CSS Selector (H2 tags) result:[/bold yellow]")
print_result(result) print_result(result)
def interactive_extraction(crawler): def interactive_extraction(crawler):
# Passing JavaScript code to interact with the page # Passing JavaScript code to interact with the page
cprint("\n🖱️ [bold cyan]Let's get interactive: Passing JavaScript code to click 'Load More' button![/bold cyan]", True) cprint(
cprint("In this example we try to click the 'Load More' button on the page using JavaScript code.") "\n🖱️ [bold cyan]Let's get interactive: Passing JavaScript code to click 'Load More' button![/bold cyan]",
True,
)
cprint(
"In this example we try to click the 'Load More' button on the page using JavaScript code."
)
js_code = """ js_code = """
const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More')); const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More'));
loadMoreButton && loadMoreButton.click(); loadMoreButton && loadMoreButton.click();
""" """
# crawler_strategy = LocalSeleniumCrawlerStrategy(js_code=js_code) # crawler_strategy = LocalSeleniumCrawlerStrategy(js_code=js_code)
# crawler = WebCrawler(crawler_strategy=crawler_strategy, always_by_pass_cache=True) # crawler = WebCrawler(crawler_strategy=crawler_strategy, always_by_pass_cache=True)
result = crawler.run( result = crawler.run(url="https://www.nbcnews.com/business", js=js_code)
url="https://www.nbcnews.com/business", cprint(
js = js_code "[LOG] 📦 [bold yellow]JavaScript Code (Load More button) result:[/bold yellow]"
) )
cprint("[LOG] 📦 [bold yellow]JavaScript Code (Load More button) result:[/bold yellow]")
print_result(result) print_result(result)
def multiple_scrip(crawler): def multiple_scrip(crawler):
# Passing JavaScript code to interact with the page # Passing JavaScript code to interact with the page
cprint("\n🖱️ [bold cyan]Let's get interactive: Passing JavaScript code to click 'Load More' button![/bold cyan]", True) cprint(
cprint("In this example we try to click the 'Load More' button on the page using JavaScript code.") "\n🖱️ [bold cyan]Let's get interactive: Passing JavaScript code to click 'Load More' button![/bold cyan]",
js_code = [""" True,
)
cprint(
"In this example we try to click the 'Load More' button on the page using JavaScript code."
)
js_code = [
"""
const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More')); const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More'));
loadMoreButton && loadMoreButton.click(); loadMoreButton && loadMoreButton.click();
"""] * 2 """
] * 2
# crawler_strategy = LocalSeleniumCrawlerStrategy(js_code=js_code) # crawler_strategy = LocalSeleniumCrawlerStrategy(js_code=js_code)
# crawler = WebCrawler(crawler_strategy=crawler_strategy, always_by_pass_cache=True) # crawler = WebCrawler(crawler_strategy=crawler_strategy, always_by_pass_cache=True)
result = crawler.run( result = crawler.run(url="https://www.nbcnews.com/business", js=js_code)
url="https://www.nbcnews.com/business", cprint(
js = js_code "[LOG] 📦 [bold yellow]JavaScript Code (Load More button) result:[/bold yellow]"
) )
cprint("[LOG] 📦 [bold yellow]JavaScript Code (Load More button) result:[/bold yellow]")
print_result(result) print_result(result)
def using_crawler_hooks(crawler): def using_crawler_hooks(crawler):
# Example usage of the hooks for authentication and setting a cookie # Example usage of the hooks for authentication and setting a cookie
def on_driver_created(driver): def on_driver_created(driver):
@@ -206,33 +286,34 @@ def using_crawler_hooks(crawler):
driver.maximize_window() driver.maximize_window()
# Example customization: logging in to a hypothetical website # Example customization: logging in to a hypothetical website
driver.get('https://example.com/login') driver.get("https://example.com/login")
from selenium.webdriver.support.ui import WebDriverWait from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC from selenium.webdriver.support import expected_conditions as EC
WebDriverWait(driver, 10).until( WebDriverWait(driver, 10).until(
EC.presence_of_element_located((By.NAME, 'username')) EC.presence_of_element_located((By.NAME, "username"))
) )
driver.find_element(By.NAME, 'username').send_keys('testuser') driver.find_element(By.NAME, "username").send_keys("testuser")
driver.find_element(By.NAME, 'password').send_keys('password123') driver.find_element(By.NAME, "password").send_keys("password123")
driver.find_element(By.NAME, 'login').click() driver.find_element(By.NAME, "login").click()
WebDriverWait(driver, 10).until( WebDriverWait(driver, 10).until(
EC.presence_of_element_located((By.ID, 'welcome')) EC.presence_of_element_located((By.ID, "welcome"))
) )
# Add a custom cookie # Add a custom cookie
driver.add_cookie({'name': 'test_cookie', 'value': 'cookie_value'}) driver.add_cookie({"name": "test_cookie", "value": "cookie_value"})
return driver return driver
def before_get_url(driver): def before_get_url(driver):
print("[HOOK] before_get_url") print("[HOOK] before_get_url")
# Example customization: add a custom header # Example customization: add a custom header
# Enable Network domain for sending headers # Enable Network domain for sending headers
driver.execute_cdp_cmd('Network.enable', {}) driver.execute_cdp_cmd("Network.enable", {})
# Add a custom header # Add a custom header
driver.execute_cdp_cmd('Network.setExtraHTTPHeaders', {'headers': {'X-Test-Header': 'test'}}) driver.execute_cdp_cmd(
"Network.setExtraHTTPHeaders", {"headers": {"X-Test-Header": "test"}}
)
return driver return driver
def after_get_url(driver): def after_get_url(driver):
@@ -247,20 +328,24 @@ def using_crawler_hooks(crawler):
print(len(html)) print(len(html))
return driver return driver
cprint("\n🔗 [bold cyan]Using Crawler Hooks: Let's see how we can customize the crawler using hooks![/bold cyan]", True) cprint(
"\n🔗 [bold cyan]Using Crawler Hooks: Let's see how we can customize the crawler using hooks![/bold cyan]",
True,
)
crawler_strategy = LocalSeleniumCrawlerStrategy(verbose=True) crawler_strategy = LocalSeleniumCrawlerStrategy(verbose=True)
crawler_strategy.set_hook('on_driver_created', on_driver_created) crawler_strategy.set_hook("on_driver_created", on_driver_created)
crawler_strategy.set_hook('before_get_url', before_get_url) crawler_strategy.set_hook("before_get_url", before_get_url)
crawler_strategy.set_hook('after_get_url', after_get_url) crawler_strategy.set_hook("after_get_url", after_get_url)
crawler_strategy.set_hook('before_return_html', before_return_html) crawler_strategy.set_hook("before_return_html", before_return_html)
crawler = WebCrawler(verbose=True, crawler_strategy=crawler_strategy) crawler = WebCrawler(verbose=True, crawler_strategy=crawler_strategy)
crawler.warmup() crawler.warmup()
result = crawler.run(url="https://example.com") result = crawler.run(url="https://example.com")
cprint("[LOG] 📦 [bold yellow]Crawler Hooks result:[/bold yellow]") cprint("[LOG] 📦 [bold yellow]Crawler Hooks result:[/bold yellow]")
print_result(result= result) print_result(result=result)
def using_crawler_hooks_dleay_example(crawler): def using_crawler_hooks_dleay_example(crawler):
def delay(driver): def delay(driver):
@@ -270,12 +355,14 @@ def using_crawler_hooks_dleay_example(crawler):
def create_crawler(): def create_crawler():
crawler_strategy = LocalSeleniumCrawlerStrategy(verbose=True) crawler_strategy = LocalSeleniumCrawlerStrategy(verbose=True)
crawler_strategy.set_hook('after_get_url', delay) crawler_strategy.set_hook("after_get_url", delay)
crawler = WebCrawler(verbose=True, crawler_strategy=crawler_strategy) crawler = WebCrawler(verbose=True, crawler_strategy=crawler_strategy)
crawler.warmup() crawler.warmup()
return crawler return crawler
cprint("\n🔗 [bold cyan]Using Crawler Hooks: Let's add a delay after fetching the url to make sure entire page is fetched.[/bold cyan]") cprint(
"\n🔗 [bold cyan]Using Crawler Hooks: Let's add a delay after fetching the url to make sure entire page is fetched.[/bold cyan]"
)
crawler = create_crawler() crawler = create_crawler()
result = crawler.run(url="https://google.com", bypass_cache=True) result = crawler.run(url="https://google.com", bypass_cache=True)
@@ -283,11 +370,16 @@ def using_crawler_hooks_dleay_example(crawler):
print_result(result) print_result(result)
def main(): def main():
cprint("🌟 [bold green]Welcome to the Crawl4ai Quickstart Guide! Let's dive into some web crawling fun! 🌐[/bold green]") cprint(
cprint("⛳️ [bold cyan]First Step: Create an instance of WebCrawler and call the `warmup()` function.[/bold cyan]") "🌟 [bold green]Welcome to the Crawl4ai Quickstart Guide! Let's dive into some web crawling fun! 🌐[/bold green]"
cprint("If this is the first time you're running Crawl4ai, this might take a few seconds to load required model files.") )
cprint(
"⛳️ [bold cyan]First Step: Create an instance of WebCrawler and call the `warmup()` function.[/bold cyan]"
)
cprint(
"If this is the first time you're running Crawl4ai, this might take a few seconds to load required model files."
)
crawler = create_crawler() crawler = create_crawler()
@@ -305,8 +397,10 @@ def main():
interactive_extraction(crawler) interactive_extraction(crawler)
multiple_scrip(crawler) multiple_scrip(crawler)
cprint("\n🎉 [bold green]Congratulations! You've made it through the Crawl4ai Quickstart Guide! Now go forth and crawl the web like a pro! 🕸️[/bold green]") cprint(
"\n🎉 [bold green]Congratulations! You've made it through the Crawl4ai Quickstart Guide! Now go forth and crawl the web like a pro! 🕸️[/bold green]"
)
if __name__ == "__main__": if __name__ == "__main__":
main() main()


@@ -702,7 +702,7 @@
"\n", "\n",
"Crawl4AI offers a fast, flexible, and powerful solution for web crawling and data extraction tasks. Its asynchronous architecture and advanced features make it suitable for a wide range of applications, from simple web scraping to complex, multi-page data extraction scenarios.\n", "Crawl4AI offers a fast, flexible, and powerful solution for web crawling and data extraction tasks. Its asynchronous architecture and advanced features make it suitable for a wide range of applications, from simple web scraping to complex, multi-page data extraction scenarios.\n",
"\n", "\n",
"For more information and advanced usage, please visit the [Crawl4AI documentation](https://crawl4ai.com/mkdocs/).\n", "For more information and advanced usage, please visit the [Crawl4AI documentation](https://docs.crawl4ai.com/).\n",
"\n", "\n",
"Happy crawling!" "Happy crawling!"
] ]


@@ -11,7 +11,9 @@ from groq import Groq
# Import threadpools to run the crawl_url function in a separate thread # Import threadpools to run the crawl_url function in a separate thread
from concurrent.futures import ThreadPoolExecutor from concurrent.futures import ThreadPoolExecutor
client = AsyncOpenAI(base_url="https://api.groq.com/openai/v1", api_key=os.getenv("GROQ_API_KEY")) client = AsyncOpenAI(
base_url="https://api.groq.com/openai/v1", api_key=os.getenv("GROQ_API_KEY")
)
# Instrument the OpenAI client # Instrument the OpenAI client
cl.instrument_openai() cl.instrument_openai()
@@ -25,32 +27,31 @@ settings = {
"presence_penalty": 0, "presence_penalty": 0,
} }
def extract_urls(text): def extract_urls(text):
url_pattern = re.compile(r'(https?://\S+)') url_pattern = re.compile(r"(https?://\S+)")
return url_pattern.findall(text) return url_pattern.findall(text)
def crawl_url(url): def crawl_url(url):
data = { data = {
"urls": [url], "urls": [url],
"include_raw_html": True, "include_raw_html": True,
"word_count_threshold": 10, "word_count_threshold": 10,
"extraction_strategy": "NoExtractionStrategy", "extraction_strategy": "NoExtractionStrategy",
"chunking_strategy": "RegexChunking" "chunking_strategy": "RegexChunking",
} }
response = requests.post("https://crawl4ai.com/crawl", json=data) response = requests.post("https://crawl4ai.com/crawl", json=data)
response_data = response.json() response_data = response.json()
response_data = response_data['results'][0] response_data = response_data["results"][0]
return response_data['markdown'] return response_data["markdown"]
@cl.on_chat_start @cl.on_chat_start
async def on_chat_start(): async def on_chat_start():
cl.user_session.set("session", { cl.user_session.set("session", {"history": [], "context": {}})
"history": [], await cl.Message(content="Welcome to the chat! How can I assist you today?").send()
"context": {}
})
await cl.Message(
content="Welcome to the chat! How can I assist you today?"
).send()
@cl.on_message @cl.on_message
async def on_message(message: cl.Message): async def on_message(message: cl.Message):
@@ -59,7 +60,6 @@ async def on_message(message: cl.Message):
# Extract URLs from the user's message # Extract URLs from the user's message
urls = extract_urls(message.content) urls = extract_urls(message.content)
futures = [] futures = []
with ThreadPoolExecutor() as executor: with ThreadPoolExecutor() as executor:
for url in urls: for url in urls:
@@ -69,16 +69,9 @@ async def on_message(message: cl.Message):
for url, result in zip(urls, results): for url, result in zip(urls, results):
ref_number = f"REF_{len(user_session['context']) + 1}" ref_number = f"REF_{len(user_session['context']) + 1}"
user_session["context"][ref_number] = { user_session["context"][ref_number] = {"url": url, "content": result}
"url": url,
"content": result
}
user_session["history"].append({"role": "user", "content": message.content})
user_session["history"].append({
"role": "user",
"content": message.content
})
# Create a system message that includes the context # Create a system message that includes the context
context_messages = [ context_messages = [
@@ -95,26 +88,17 @@ async def on_message(message: cl.Message):
"If not, there is no need to add a references section. " "If not, there is no need to add a references section. "
"At the end of your response, provide a reference section listing the URLs and their REF numbers only if sources from the appendices were used.\n\n" "At the end of your response, provide a reference section listing the URLs and their REF numbers only if sources from the appendices were used.\n\n"
"\n\n".join(context_messages) "\n\n".join(context_messages)
) ),
} }
else: else:
system_message = { system_message = {"role": "system", "content": "You are a helpful assistant."}
"role": "system",
"content": "You are a helpful assistant."
}
msg = cl.Message(content="") msg = cl.Message(content="")
await msg.send() await msg.send()
# Get response from the LLM # Get response from the LLM
stream = await client.chat.completions.create( stream = await client.chat.completions.create(
messages=[ messages=[system_message, *user_session["history"]], stream=True, **settings
system_message,
*user_session["history"]
],
stream=True,
**settings
) )
assistant_response = "" assistant_response = ""
@@ -124,10 +108,7 @@ async def on_message(message: cl.Message):
await msg.stream_token(token) await msg.stream_token(token)
# Add assistant message to the history # Add assistant message to the history
user_session["history"].append({ user_session["history"].append({"role": "assistant", "content": assistant_response})
"role": "assistant",
"content": assistant_response
})
await msg.update() await msg.update()
# Append the reference section to the assistant's response # Append the reference section to the assistant's response
@@ -154,6 +135,7 @@ async def on_audio_chunk(chunk: cl.AudioChunk):
pass pass
@cl.step(type="tool") @cl.step(type="tool")
async def speech_to_text(audio_file): async def speech_to_text(audio_file):
cli = Groq() cli = Groq()
@@ -179,17 +161,12 @@ async def on_audio_end(elements: list[ElementBased]):
end_time = time.time() end_time = time.time()
print(f"Transcription took {end_time - start_time} seconds") print(f"Transcription took {end_time - start_time} seconds")
user_msg = cl.Message( user_msg = cl.Message(author="You", type="user_message", content=transcription)
author="You",
type="user_message",
content=transcription
)
await user_msg.send() await user_msg.send()
await on_message(user_msg) await on_message(user_msg)
if __name__ == "__main__": if __name__ == "__main__":
from chainlit.cli import run_chainlit from chainlit.cli import run_chainlit
run_chainlit(__file__) run_chainlit(__file__)
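The `extract_urls` helper above matches `https?://\S+`, so a URL followed directly by a comma, period, or closing bracket keeps that character. A minimal sketch of one way to trim it, assuming trailing punctuation is never part of the intended URL (this also shortens URLs that legitimately end in `)`). The name `extract_urls_clean` and the character set are hypothetical, not part of this change set:

```python
import re

# Same pattern as extract_urls in the diff above.
_URL_RE = re.compile(r"https?://\S+")

# Assumption: these characters at the end of a match are sentence
# punctuation rather than part of the URL.
_TRAILING = ".,;:!?)]}'\""


def extract_urls_clean(text):
    """Return URLs found in text with trailing punctuation stripped."""
    return [match.rstrip(_TRAILING) for match in _URL_RE.findall(text)]


if __name__ == "__main__":
    print(extract_urls_clean("Read https://example.com/a, then (https://example.com/b)."))
```

Whether stripping is desirable depends on the sites users paste; for strict parsing, validating each candidate with `urllib.parse.urlsplit` may be a better fit.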


@@ -1,4 +1,3 @@
import requests, base64, os import requests, base64, os
data = { data = {
@@ -7,58 +6,49 @@ data = {
} }
response = requests.post("https://crawl4ai.com/crawl", json=data) response = requests.post("https://crawl4ai.com/crawl", json=data)
result = response.json()['results'][0] result = response.json()["results"][0]
print(result.keys()) print(result.keys())
# dict_keys(['url', 'html', 'success', 'cleaned_html', 'media', # dict_keys(['url', 'html', 'success', 'cleaned_html', 'media',
# 'links', 'screenshot', 'markdown', 'extracted_content', # 'links', 'screenshot', 'markdown', 'extracted_content',
# 'metadata', 'error_message']) # 'metadata', 'error_message'])
with open("screenshot.png", "wb") as f: with open("screenshot.png", "wb") as f:
f.write(base64.b64decode(result['screenshot'])) f.write(base64.b64decode(result["screenshot"]))
# Example of filtering the content using CSS selectors # Example of filtering the content using CSS selectors
data = { data = {
"urls": [ "urls": ["https://www.nbcnews.com/business"],
"https://www.nbcnews.com/business"
],
"css_selector": "article", "css_selector": "article",
"screenshot": True, "screenshot": True,
} }
# Example of executing a JS script on the page before extracting the content # Example of executing a JS script on the page before extracting the content
data = { data = {
"urls": [ "urls": ["https://www.nbcnews.com/business"],
"https://www.nbcnews.com/business"
],
"screenshot": True, "screenshot": True,
'js' : [""" "js": [
"""
const loadMoreButton = Array.from(document.querySelectorAll('button')). const loadMoreButton = Array.from(document.querySelectorAll('button')).
find(button => button.textContent.includes('Load More')); find(button => button.textContent.includes('Load More'));
loadMoreButton && loadMoreButton.click(); loadMoreButton && loadMoreButton.click();
"""] """
],
} }
# Example of using a custom extraction strategy # Example of using a custom extraction strategy
data = { data = {
"urls": [ "urls": ["https://www.nbcnews.com/business"],
"https://www.nbcnews.com/business"
],
"extraction_strategy": "CosineStrategy", "extraction_strategy": "CosineStrategy",
"extraction_strategy_args": { "extraction_strategy_args": {"semantic_filter": "inflation rent prices"},
"semantic_filter": "inflation rent prices"
},
} }
# Example of using LLM to extract content # Example of using LLM to extract content
data = { data = {
"urls": [ "urls": ["https://www.nbcnews.com/business"],
"https://www.nbcnews.com/business"
],
"extraction_strategy": "LLMExtractionStrategy", "extraction_strategy": "LLMExtractionStrategy",
"extraction_strategy_args": { "extraction_strategy_args": {
"provider": "groq/llama3-8b-8192", "provider": "groq/llama3-8b-8192",
"api_token": os.environ.get("GROQ_API_KEY"), "api_token": os.environ.get("GROQ_API_KEY"),
"instruction": """I am interested in only financial news, "instruction": """I am interested in only financial news,
and translate them in French.""" and translate them in French.""",
}, },
} }


@@ -0,0 +1,135 @@
import functools
import time
from collections import defaultdict
from crawl4ai.content_scraping_strategy import WebScrapingStrategy, LXMLWebScrapingStrategy
class TimingStats:
def __init__(self):
self.stats = defaultdict(lambda: defaultdict(lambda: {"calls": 0, "total_time": 0}))
def add(self, strategy_name, func_name, elapsed):
self.stats[strategy_name][func_name]["calls"] += 1
self.stats[strategy_name][func_name]["total_time"] += elapsed
def report(self):
for strategy_name, funcs in self.stats.items():
print(f"\n{strategy_name} Timing Breakdown:")
print("-" * 60)
print(f"{'Function':<30} {'Calls':<10} {'Total(s)':<10} {'Avg(ms)':<10}")
print("-" * 60)
for func, data in sorted(funcs.items(), key=lambda x: x[1]["total_time"], reverse=True):
avg_ms = (data["total_time"] / data["calls"]) * 1000
print(f"{func:<30} {data['calls']:<10} {data['total_time']:<10.3f} {avg_ms:<10.2f}")
timing_stats = TimingStats()
# Decorator factory: records each call's duration under strategy_name
def timing_decorator(strategy_name):
def decorator(func):
@functools.wraps(func)
def wrapper(*args, **kwargs):
start = time.time()
result = func(*args, **kwargs)
elapsed = time.time() - start
timing_stats.add(strategy_name, func.__name__, elapsed)
return result
return wrapper
return decorator
# Replace cls.method_name in place with a timed wrapper
def apply_decorators(cls, method_name, strategy_name):
try:
original_method = getattr(cls, method_name)
decorated_method = timing_decorator(strategy_name)(original_method)
setattr(cls, method_name, decorated_method)
except AttributeError:
print(f"Method {method_name} not found in class {cls.__name__}.")
# Apply to key methods
methods_to_profile = [
'_scrap',
# 'process_element',
'_process_element',
'process_image',
]
# Apply decorators to both strategies
for strategy, name in [(WebScrapingStrategy, "Original"), (LXMLWebScrapingStrategy, "LXML")]:
for method in methods_to_profile:
apply_decorators(strategy, method, name)
def generate_large_html(n_elements=1000):
html = ['<!DOCTYPE html><html><head></head><body>']
for i in range(n_elements):
html.append(f'''
<div class="article">
<h2>Heading {i}</h2>
<div>
<div>
<p>This is paragraph {i} with some content and a <a href="http://example.com/{i}">link</a></p>
</div>
</div>
<img src="image{i}.jpg" alt="Image {i}">
<ul>
<li>List item {i}.1</li>
<li>List item {i}.2</li>
</ul>
</div>
''')
html.append('</body></html>')
return ''.join(html)
def test_scraping():
# Initialize both scrapers
original_scraper = WebScrapingStrategy()
selected_scraper = LXMLWebScrapingStrategy()
# Generate test HTML
print("Generating HTML...")
html = generate_large_html(5000)
print(f"HTML Size: {len(html)/1024:.2f} KB")
# Time the scraping
print("\nStarting scrape...")
start_time = time.time()
kwargs = {
"url": "http://example.com",
"html": html,
"word_count_threshold": 5,
"keep_data_attributes": True
}
t1 = time.perf_counter()
result_selected = selected_scraper.scrap(**kwargs)
t2 = time.perf_counter()
result_original = original_scraper.scrap(**kwargs)
t3 = time.perf_counter()
elapsed = t3 - t1  # both values come from time.perf_counter(); start_time uses time.time()
print(f"\nScraping completed in {elapsed:.2f} seconds")
timing_stats.report()
# Print stats of LXML output
print("\nLXML Output:")
print(f"\nExtracted links: {len(result_selected.links.internal) + len(result_selected.links.external)}")
print(f"Extracted images: {len(result_selected.media.images)}")
print(f"Clean HTML size: {len(result_selected.cleaned_html)/1024:.2f} KB")
print(f"Scraping time: {t2 - t1:.2f} seconds")
# Print stats of original output
print("\nOriginal Output:")
print(f"\nExtracted links: {len(result_original.links.internal) + len(result_original.links.external)}")
print(f"Extracted images: {len(result_original.media.images)}")
print(f"Clean HTML size: {len(result_original.cleaned_html)/1024:.2f} KB")
print(f"Scraping time: {t3 - t2:.2f} seconds")
if __name__ == "__main__":
test_scraping()


@@ -5,22 +5,22 @@ import os
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
# Create tmp directory if it doesn't exist # Create tmp directory if it doesn't exist
parent_dir = os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__)))) parent_dir = os.path.dirname(
os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
)
tmp_dir = os.path.join(parent_dir, "tmp") tmp_dir = os.path.join(parent_dir, "tmp")
os.makedirs(tmp_dir, exist_ok=True) os.makedirs(tmp_dir, exist_ok=True)
async def main(): async def main():
# Configure crawler to fetch SSL certificate # Configure crawler to fetch SSL certificate
config = CrawlerRunConfig( config = CrawlerRunConfig(
fetch_ssl_certificate=True, fetch_ssl_certificate=True,
cache_mode=CacheMode.BYPASS # Bypass cache to always get fresh certificates cache_mode=CacheMode.BYPASS, # Bypass cache to always get fresh certificates
) )
async with AsyncWebCrawler() as crawler: async with AsyncWebCrawler() as crawler:
result = await crawler.arun( result = await crawler.arun(url="https://example.com", config=config)
url='https://example.com',
config=config
)
if result.success and result.ssl_certificate: if result.success and result.ssl_certificate:
cert = result.ssl_certificate cert = result.ssl_certificate
@@ -36,11 +36,16 @@ async def main():
print("\nCertificate exported to:") print("\nCertificate exported to:")
print(f"- JSON: {os.path.join(tmp_dir, 'certificate.json')}") print(f"- JSON: {os.path.join(tmp_dir, 'certificate.json')}")
pem_data = cert.to_pem(os.path.join(tmp_dir, "certificate.pem")) # For web servers pem_data = cert.to_pem(
os.path.join(tmp_dir, "certificate.pem")
) # For web servers
print(f"- PEM: {os.path.join(tmp_dir, 'certificate.pem')}") print(f"- PEM: {os.path.join(tmp_dir, 'certificate.pem')}")
der_data = cert.to_der(os.path.join(tmp_dir, "certificate.der")) # For Java apps der_data = cert.to_der(
os.path.join(tmp_dir, "certificate.der")
) # For Java apps
print(f"- DER: {os.path.join(tmp_dir, 'certificate.der')}") print(f"- DER: {os.path.join(tmp_dir, 'certificate.der')}")
if __name__ == "__main__": if __name__ == "__main__":
asyncio.run(main()) asyncio.run(main())
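The nested `os.path.dirname` calls that compute `parent_dir` above can also be written with `pathlib`. A small sketch, assuming the goal is simply "the directory N levels above this file". The helper name `ancestor_dir` is hypothetical:

```python
import os
from pathlib import Path


def ancestor_dir(path, levels):
    """Return the directory `levels` steps above `path`.

    levels=1 is the containing directory, matching a single
    os.path.dirname() call on the absolute path.
    """
    if levels < 1:
        raise ValueError("levels must be >= 1")
    return Path(os.path.abspath(path)).parents[levels - 1]
```

With this helper, the line in the diff would read `parent_dir = ancestor_dir(__file__, 3)`, and `tmp_dir = parent_dir / "tmp"` replaces the `os.path.join` call.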


@@ -1,39 +1,41 @@
import os import os
import time
import json import json
from crawl4ai.web_crawler import WebCrawler from crawl4ai.web_crawler import WebCrawler
from crawl4ai.chunking_strategy import * from crawl4ai.chunking_strategy import *
from crawl4ai.extraction_strategy import * from crawl4ai.extraction_strategy import *
from crawl4ai.crawler_strategy import * from crawl4ai.crawler_strategy import *
url = r'https://marketplace.visualstudio.com/items?itemName=Unclecode.groqopilot' url = r"https://marketplace.visualstudio.com/items?itemName=Unclecode.groqopilot"
crawler = WebCrawler() crawler = WebCrawler()
crawler.warmup() crawler.warmup()
from pydantic import BaseModel, Field from pydantic import BaseModel, Field
class PageSummary(BaseModel): class PageSummary(BaseModel):
title: str = Field(..., description="Title of the page.") title: str = Field(..., description="Title of the page.")
summary: str = Field(..., description="Summary of the page.") summary: str = Field(..., description="Summary of the page.")
brief_summary: str = Field(..., description="Brief summary of the page.") brief_summary: str = Field(..., description="Brief summary of the page.")
keywords: list = Field(..., description="Keywords assigned to the page.") keywords: list = Field(..., description="Keywords assigned to the page.")
result = crawler.run( result = crawler.run(
url=url, url=url,
word_count_threshold=1, word_count_threshold=1,
extraction_strategy= LLMExtractionStrategy( extraction_strategy=LLMExtractionStrategy(
provider= "openai/gpt-4o", api_token = os.getenv('OPENAI_API_KEY'), provider="openai/gpt-4o",
api_token=os.getenv("OPENAI_API_KEY"),
schema=PageSummary.model_json_schema(), schema=PageSummary.model_json_schema(),
extraction_type="schema", extraction_type="schema",
apply_chunking =False, apply_chunking=False,
instruction="From the crawled content, extract the following details: "\ instruction="From the crawled content, extract the following details: "
"1. Title of the page "\ "1. Title of the page "
"2. Summary of the page, which is a detailed summary "\ "2. Summary of the page, which is a detailed summary "
"3. Brief summary of the page, which is a paragraph text "\ "3. Brief summary of the page, which is a paragraph text "
"4. Keywords assigned to the page, which is a list of keywords. "\ "4. Keywords assigned to the page, which is a list of keywords. "
'The extracted JSON format should look like this: '\ "The extracted JSON format should look like this: "
'{ "title": "Page Title", "summary": "Detailed summary of the page.", "brief_summary": "Brief summary in a paragraph.", "keywords": ["keyword1", "keyword2", "keyword3"] }' '{ "title": "Page Title", "summary": "Detailed summary of the page.", "brief_summary": "Brief summary in a paragraph.", "keywords": ["keyword1", "keyword2", "keyword3"] }',
), ),
bypass_cache=True, bypass_cache=True,
) )


@@ -1,4 +1,5 @@
import os, sys import os, sys
# append the parent directory to the sys.path # append the parent directory to the sys.path
parent_dir = os.path.dirname(os.path.dirname(os.path.abspath(__file__))) parent_dir = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
sys.path.append(parent_dir) sys.path.append(parent_dir)
@@ -13,6 +14,7 @@ import json
from crawl4ai import AsyncWebCrawler, CacheMode from crawl4ai import AsyncWebCrawler, CacheMode
from crawl4ai.content_filter_strategy import BM25ContentFilter from crawl4ai.content_filter_strategy import BM25ContentFilter
# 1. File Download Processing Example # 1. File Download Processing Example
async def download_example(): async def download_example():
"""Example of downloading files from Python.org""" """Example of downloading files from Python.org"""
@@ -23,9 +25,7 @@ async def download_example():
print(f"Downloads will be saved to: {downloads_path}") print(f"Downloads will be saved to: {downloads_path}")
async with AsyncWebCrawler( async with AsyncWebCrawler(
accept_downloads=True, accept_downloads=True, downloads_path=downloads_path, verbose=True
downloads_path=downloads_path,
verbose=True
) as crawler: ) as crawler:
result = await crawler.arun( result = await crawler.arun(
url="https://www.python.org/downloads/", url="https://www.python.org/downloads/",
@@ -40,7 +40,7 @@ async def download_example():
} }
""", """,
delay_before_return_html=1, # Wait 1 second to ensure download starts delay_before_return_html=1, # Wait 1 second to ensure download starts
cache_mode=CacheMode.BYPASS cache_mode=CacheMode.BYPASS,
) )
if result.downloaded_files: if result.downloaded_files:
@@ -52,24 +52,25 @@ async def download_example():
else: else:
print("\nNo files were downloaded") print("\nNo files were downloaded")
# 2. Local File and Raw HTML Processing Example # 2. Local File and Raw HTML Processing Example
async def local_and_raw_html_example(): async def local_and_raw_html_example():
"""Example of processing local files and raw HTML""" """Example of processing local files and raw HTML"""
# Create a sample HTML file # Create a sample HTML file
sample_file = os.path.join(__data__, "sample.html") sample_file = os.path.join(__data__, "sample.html")
with open(sample_file, "w") as f: with open(sample_file, "w") as f:
f.write(""" f.write(
"""
<html><body> <html><body>
<h1>Test Content</h1> <h1>Test Content</h1>
<p>This is a test paragraph.</p> <p>This is a test paragraph.</p>
</body></html> </body></html>
""") """
)
async with AsyncWebCrawler(verbose=True) as crawler: async with AsyncWebCrawler(verbose=True) as crawler:
# Process local file # Process local file
local_result = await crawler.arun( local_result = await crawler.arun(url=f"file://{os.path.abspath(sample_file)}")
url=f"file://{os.path.abspath(sample_file)}"
)
# Process raw HTML # Process raw HTML
raw_html = """ raw_html = """
@@ -78,9 +79,7 @@ async def local_and_raw_html_example():
<p>This is a test of raw HTML processing.</p> <p>This is a test of raw HTML processing.</p>
</body></html> </body></html>
""" """
raw_result = await crawler.arun( raw_result = await crawler.arun(url=f"raw:{raw_html}")
url=f"raw:{raw_html}"
)
# Clean up # Clean up
os.remove(sample_file) os.remove(sample_file)
@@ -88,6 +87,7 @@ async def local_and_raw_html_example():
print("Local file content:", local_result.markdown) print("Local file content:", local_result.markdown)
print("\nRaw HTML content:", raw_result.markdown) print("\nRaw HTML content:", raw_result.markdown)
# 3. Enhanced Markdown Generation Example # 3. Enhanced Markdown Generation Example
async def markdown_generation_example(): async def markdown_generation_example():
"""Example of enhanced markdown generation with citations and LLM-friendly features""" """Example of enhanced markdown generation with citations and LLM-friendly features"""
@@ -102,27 +102,32 @@ async def markdown_generation_example():
url="https://en.wikipedia.org/wiki/Apple", url="https://en.wikipedia.org/wiki/Apple",
css_selector="main div#bodyContent", css_selector="main div#bodyContent",
content_filter=content_filter, content_filter=content_filter,
cache_mode=CacheMode.BYPASS cache_mode=CacheMode.BYPASS,
) )
from crawl4ai import AsyncWebCrawler
from crawl4ai.content_filter_strategy import BM25ContentFilter from crawl4ai.content_filter_strategy import BM25ContentFilter
result = await crawler.arun( result = await crawler.arun(
url="https://en.wikipedia.org/wiki/Apple", url="https://en.wikipedia.org/wiki/Apple",
css_selector="main div#bodyContent", css_selector="main div#bodyContent",
content_filter=BM25ContentFilter() content_filter=BM25ContentFilter(),
) )
print(result.markdown_v2.fit_markdown) print(result.markdown_v2.fit_markdown)
print("\nMarkdown Generation Results:") print("\nMarkdown Generation Results:")
print(f"1. Original markdown length: {len(result.markdown)}") print(f"1. Original markdown length: {len(result.markdown)}")
print(f"2. New markdown versions (markdown_v2):") print("2. New markdown versions (markdown_v2):")
print(f" - Raw markdown length: {len(result.markdown_v2.raw_markdown)}") print(f" - Raw markdown length: {len(result.markdown_v2.raw_markdown)}")
print(f" - Citations markdown length: {len(result.markdown_v2.markdown_with_citations)}") print(
print(f" - References section length: {len(result.markdown_v2.references_markdown)}") f" - Citations markdown length: {len(result.markdown_v2.markdown_with_citations)}"
)
print(
f" - References section length: {len(result.markdown_v2.references_markdown)}"
)
if result.markdown_v2.fit_markdown: if result.markdown_v2.fit_markdown:
print(f" - Filtered markdown length: {len(result.markdown_v2.fit_markdown)}") print(
f" - Filtered markdown length: {len(result.markdown_v2.fit_markdown)}"
)
# Save examples to files # Save examples to files
output_dir = os.path.join(__data__, "markdown_examples") output_dir = os.path.join(__data__, "markdown_examples")
@@ -148,7 +153,10 @@ async def markdown_generation_example():
print("\nSample of markdown with citations:") print("\nSample of markdown with citations:")
print(result.markdown_v2.markdown_with_citations[:500] + "...\n") print(result.markdown_v2.markdown_with_citations[:500] + "...\n")
print("Sample of references:") print("Sample of references:")
print('\n'.join(result.markdown_v2.references_markdown.split('\n')[:10]) + "...") print(
"\n".join(result.markdown_v2.references_markdown.split("\n")[:10]) + "..."
)
# 4. Browser Management Example # 4. Browser Management Example
async def browser_management_example(): async def browser_management_example():
@@ -163,31 +171,31 @@ async def browser_management_example():
use_managed_browser=True, use_managed_browser=True,
user_data_dir=user_data_dir, user_data_dir=user_data_dir,
headless=False, headless=False,
verbose=True verbose=True,
) as crawler: ) as crawler:
result = await crawler.arun( result = await crawler.arun(
url="https://crawl4ai.com", url="https://crawl4ai.com",
# session_id="persistent_session_1", # session_id="persistent_session_1",
cache_mode=CacheMode.BYPASS cache_mode=CacheMode.BYPASS,
) )
# Use GitHub as an example - it's a good test for browser management # Use GitHub as an example - it's a good test for browser management
# because it requires proper browser handling # because it requires proper browser handling
result = await crawler.arun( result = await crawler.arun(
url="https://github.com/trending", url="https://github.com/trending",
# session_id="persistent_session_1", # session_id="persistent_session_1",
cache_mode=CacheMode.BYPASS cache_mode=CacheMode.BYPASS,
) )
print("\nBrowser session result:", result.success) print("\nBrowser session result:", result.success)
if result.success: if result.success:
print("Page title:", result.metadata.get('title', 'No title found')) print("Page title:", result.metadata.get("title", "No title found"))
# 5. API Usage Example # 5. API Usage Example
async def api_example(): async def api_example():
"""Example of using the new API endpoints""" """Example of using the new API endpoints"""
api_token = os.getenv('CRAWL4AI_API_TOKEN') or "test_api_code" api_token = os.getenv("CRAWL4AI_API_TOKEN") or "test_api_code"
headers = {'Authorization': f'Bearer {api_token}'} headers = {"Authorization": f"Bearer {api_token}"}
async with aiohttp.ClientSession() as session: async with aiohttp.ClientSession() as session:
# Submit crawl job # Submit crawl job
crawl_request = { crawl_request = {
@@ -199,25 +207,17 @@ async def api_example():
"name": "Hacker News Articles", "name": "Hacker News Articles",
"baseSelector": ".athing", "baseSelector": ".athing",
"fields": [ "fields": [
{ {"name": "title", "selector": ".title a", "type": "text"},
"name": "title", {"name": "score", "selector": ".score", "type": "text"},
"selector": ".title a",
"type": "text"
},
{
"name": "score",
"selector": ".score",
"type": "text"
},
{ {
"name": "url", "name": "url",
"selector": ".title a", "selector": ".title a",
"type": "attribute", "type": "attribute",
"attribute": "href" "attribute": "href",
} },
] ],
} }
} },
}, },
"crawler_params": { "crawler_params": {
"headless": True, "headless": True,
@@ -229,9 +229,7 @@ async def api_example():
} }
async with session.post( async with session.post(
"http://localhost:11235/crawl", "http://localhost:11235/crawl", json=crawl_request, headers=headers
json=crawl_request,
headers=headers
) as response: ) as response:
task_data = await response.json() task_data = await response.json()
task_id = task_data["task_id"] task_id = task_data["task_id"]
@@ -239,8 +237,7 @@ async def api_example():
# Check task status # Check task status
while True: while True:
async with session.get( async with session.get(
f"http://localhost:11235/task/{task_id}", f"http://localhost:11235/task/{task_id}", headers=headers
headers=headers
) as status_response: ) as status_response:
result = await status_response.json() result = await status_response.json()
print(f"Task status: {result['status']}") print(f"Task status: {result['status']}")
@@ -248,12 +245,13 @@ async def api_example():
if result["status"] == "completed": if result["status"] == "completed":
print("Task completed!") print("Task completed!")
print("Results:") print("Results:")
news = json.loads(result["results"][0]['extracted_content']) news = json.loads(result["results"][0]["extracted_content"])
print(json.dumps(news[:4], indent=2)) print(json.dumps(news[:4], indent=2))
break break
else: else:
await asyncio.sleep(1) await asyncio.sleep(1)
# Main execution # Main execution
async def main(): async def main():
# print("Running Crawl4AI feature examples...") # print("Running Crawl4AI feature examples...")
@@ -273,5 +271,6 @@ async def main():
# print("\n5. Running API Example:") # print("\n5. Running API Example:")
await api_example() await api_example()
if __name__ == "__main__": if __name__ == "__main__":
asyncio.run(main()) asyncio.run(main())


@@ -10,15 +10,14 @@ import asyncio
import os import os
import json import json
import re import re
from typing import List, Optional, Dict, Any from typing import List
from pydantic import BaseModel, Field
from crawl4ai import ( from crawl4ai import (
AsyncWebCrawler, AsyncWebCrawler,
BrowserConfig, BrowserConfig,
CrawlerRunConfig, CrawlerRunConfig,
CacheMode, CacheMode,
LLMExtractionStrategy, LLMExtractionStrategy,
JsonCssExtractionStrategy JsonCssExtractionStrategy,
) )
from crawl4ai.content_filter_strategy import RelevantContentFilter from crawl4ai.content_filter_strategy import RelevantContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
@@ -52,6 +51,7 @@ SAMPLE_HTML = """
</div> </div>
""" """
async def demo_ssl_features(): async def demo_ssl_features():
""" """
Enhanced SSL & Security Features Demo Enhanced SSL & Security Features Demo
@@ -76,14 +76,11 @@ async def demo_ssl_features():
run_config = CrawlerRunConfig( run_config = CrawlerRunConfig(
cache_mode=CacheMode.BYPASS, cache_mode=CacheMode.BYPASS,
fetch_ssl_certificate=True # Enable SSL certificate fetching fetch_ssl_certificate=True, # Enable SSL certificate fetching
) )
async with AsyncWebCrawler(config=browser_config) as crawler: async with AsyncWebCrawler(config=browser_config) as crawler:
result = await crawler.arun( result = await crawler.arun(url="https://example.com", config=run_config)
url="https://example.com",
config=run_config
)
print(f"SSL Crawl Success: {result.success}") print(f"SSL Crawl Success: {result.success}")
result.ssl_certificate.to_json( result.ssl_certificate.to_json(
os.path.join(os.getcwd(), "ssl_certificate.json") os.path.join(os.getcwd(), "ssl_certificate.json")
@@ -91,6 +88,7 @@ async def demo_ssl_features():
if not result.success: if not result.success:
print(f"SSL Error: {result.error_message}") print(f"SSL Error: {result.error_message}")
async def demo_content_filtering(): async def demo_content_filtering():
""" """
Smart Content Filtering Demo Smart Content Filtering Demo
@@ -110,12 +108,14 @@ async def demo_content_filtering():
super().__init__() super().__init__()
# Add news-specific patterns # Add news-specific patterns
self.negative_patterns = re.compile( self.negative_patterns = re.compile(
r'nav|footer|header|sidebar|ads|comment|share|related|recommended|popular|trending', r"nav|footer|header|sidebar|ads|comment|share|related|recommended|popular|trending",
re.I re.I,
) )
self.min_word_count = 30 # Higher threshold for news content self.min_word_count = 30 # Higher threshold for news content
def filter_content(self, html: str, min_word_threshold: int = None) -> List[str]: def filter_content(
self, html: str, min_word_threshold: int = None
) -> List[str]:
""" """
Implements news-specific content filtering logic. Implements news-specific content filtering logic.
@@ -129,14 +129,16 @@ async def demo_content_filtering():
if not html or not isinstance(html, str): if not html or not isinstance(html, str):
return [] return []
soup = BeautifulSoup(html, 'lxml') soup = BeautifulSoup(html, "lxml")
if not soup.body: if not soup.body:
soup = BeautifulSoup(f'<body>{html}</body>', 'lxml') soup = BeautifulSoup(f"<body>{html}</body>", "lxml")
body = soup.find('body') body = soup.find("body")
# Extract chunks with metadata # Extract chunks with metadata
chunks = self.extract_text_chunks(body, min_word_threshold or self.min_word_count) chunks = self.extract_text_chunks(
body, min_word_threshold or self.min_word_count
)
# Filter chunks based on news-specific criteria # Filter chunks based on news-specific criteria
filtered_chunks = [] filtered_chunks = []
@@ -146,7 +148,7 @@ async def demo_content_filtering():
continue continue
# Headers are important in news articles # Headers are important in news articles
if tag_type == 'header': if tag_type == "header":
filtered_chunks.append(self.clean_element(element)) filtered_chunks.append(self.clean_element(element))
continue continue
@@ -154,7 +156,9 @@ async def demo_content_filtering():
text = element.get_text(strip=True) text = element.get_text(strip=True)
if len(text.split()) >= (min_word_threshold or self.min_word_count): if len(text.split()) >= (min_word_threshold or self.min_word_count):
# Calculate link density # Calculate link density
links_text = ' '.join(a.get_text(strip=True) for a in element.find_all('a')) links_text = " ".join(
a.get_text(strip=True) for a in element.find_all("a")
)
link_density = len(links_text) / len(text) if text else 1 link_density = len(links_text) / len(text) if text else 1
# Accept if link density is reasonable # Accept if link density is reasonable
@@ -164,23 +168,20 @@ async def demo_content_filtering():
return filtered_chunks return filtered_chunks
# Create markdown generator with custom filter # Create markdown generator with custom filter
markdown_gen = DefaultMarkdownGenerator( markdown_gen = DefaultMarkdownGenerator(content_filter=CustomNewsFilter())
content_filter=CustomNewsFilter()
)
run_config = CrawlerRunConfig( run_config = CrawlerRunConfig(
markdown_generator=markdown_gen, markdown_generator=markdown_gen, cache_mode=CacheMode.BYPASS
cache_mode=CacheMode.BYPASS
) )
async with AsyncWebCrawler() as crawler: async with AsyncWebCrawler() as crawler:
result = await crawler.arun( result = await crawler.arun(
url="https://news.ycombinator.com", url="https://news.ycombinator.com", config=run_config
config=run_config
) )
print("Filtered Content Sample:") print("Filtered Content Sample:")
print(result.markdown[:500]) # Show first 500 chars print(result.markdown[:500]) # Show first 500 chars
async def demo_json_extraction(): async def demo_json_extraction():
""" """
Improved JSON Extraction Demo Improved JSON Extraction Demo
@@ -206,7 +207,7 @@ async def demo_json_extraction():
"baseSelector": "div.article-list", "baseSelector": "div.article-list",
"baseFields": [ "baseFields": [
{"name": "list_id", "type": "attribute", "attribute": "data-list-id"}, {"name": "list_id", "type": "attribute", "attribute": "data-list-id"},
{"name": "category", "type": "attribute", "attribute": "data-category"} {"name": "category", "type": "attribute", "attribute": "data-category"},
], ],
"fields": [ "fields": [
{ {
@@ -214,8 +215,16 @@ async def demo_json_extraction():
"selector": "article.post", "selector": "article.post",
"type": "nested_list", "type": "nested_list",
"baseFields": [ "baseFields": [
{"name": "post_id", "type": "attribute", "attribute": "data-post-id"}, {
{"name": "author_id", "type": "attribute", "attribute": "data-author"} "name": "post_id",
"type": "attribute",
"attribute": "data-post-id",
},
{
"name": "author_id",
"type": "attribute",
"attribute": "data-author",
},
], ],
"fields": [ "fields": [
{ {
@@ -223,51 +232,59 @@ async def demo_json_extraction():
"selector": "h2.title a", "selector": "h2.title a",
"type": "text", "type": "text",
"baseFields": [ "baseFields": [
{"name": "url", "type": "attribute", "attribute": "href"} {
] "name": "url",
"type": "attribute",
"attribute": "href",
}
],
}, },
{ {
"name": "author", "name": "author",
"selector": "div.meta a.author", "selector": "div.meta a.author",
"type": "text", "type": "text",
"baseFields": [ "baseFields": [
{"name": "profile_url", "type": "attribute", "attribute": "href"} {
] "name": "profile_url",
}, "type": "attribute",
{ "attribute": "href",
"name": "date", }
"selector": "span.date", ],
"type": "text"
}, },
{"name": "date", "selector": "span.date", "type": "text"},
{ {
"name": "read_more", "name": "read_more",
"selector": "a.read-more", "selector": "a.read-more",
"type": "nested", "type": "nested",
"fields": [ "fields": [
{"name": "text", "type": "text"}, {"name": "text", "type": "text"},
{"name": "url", "type": "attribute", "attribute": "href"} {
] "name": "url",
} "type": "attribute",
] "attribute": "href",
},
],
},
],
} }
] ],
} }
) )
# Demonstrate extraction from raw HTML # Demonstrate extraction from raw HTML
run_config = CrawlerRunConfig( run_config = CrawlerRunConfig(
extraction_strategy=json_strategy, extraction_strategy=json_strategy, cache_mode=CacheMode.BYPASS
cache_mode=CacheMode.BYPASS
) )
async with AsyncWebCrawler() as crawler: async with AsyncWebCrawler() as crawler:
result = await crawler.arun( result = await crawler.arun(
url="raw:" + SAMPLE_HTML, # Use raw: prefix for raw HTML url="raw:" + SAMPLE_HTML, # Use raw: prefix for raw HTML
config=run_config config=run_config,
) )
print("Extracted Content:") print("Extracted Content:")
print(result.extracted_content) print(result.extracted_content)
async def demo_input_formats(): async def demo_input_formats():
""" """
Input Format Handling Demo Input Format Handling Demo
@@ -359,18 +376,30 @@ async def demo_input_formats():
# Define our schema using Pydantic # Define our schema using Pydantic
class JobRequirement(BaseModel): class JobRequirement(BaseModel):
category: str = Field(description="Category of the requirement (e.g., Technical, Soft Skills)") category: str = Field(
items: List[str] = Field(description="List of specific requirements in this category") description="Category of the requirement (e.g., Technical, Soft Skills)"
priority: str = Field(description="Priority level (Required/Preferred) based on the HTML class or context") )
items: List[str] = Field(
description="List of specific requirements in this category"
)
priority: str = Field(
description="Priority level (Required/Preferred) based on the HTML class or context"
)
class JobPosting(BaseModel): class JobPosting(BaseModel):
title: str = Field(description="Job title") title: str = Field(description="Job title")
department: str = Field(description="Department or team") department: str = Field(description="Department or team")
location: str = Field(description="Job location, including remote options") location: str = Field(description="Job location, including remote options")
salary_range: Optional[str] = Field(description="Salary range if specified") salary_range: Optional[str] = Field(description="Salary range if specified")
requirements: List[JobRequirement] = Field(description="Categorized job requirements") requirements: List[JobRequirement] = Field(
application_deadline: Optional[str] = Field(description="Application deadline if specified") description="Categorized job requirements"
contact_info: Optional[dict] = Field(description="Contact information from footer or contact section") )
application_deadline: Optional[str] = Field(
description="Application deadline if specified"
)
contact_info: Optional[dict] = Field(
description="Contact information from footer or contact section"
)
# First try with markdown (default) # First try with markdown (default)
markdown_strategy = LLMExtractionStrategy( markdown_strategy = LLMExtractionStrategy(
@@ -382,7 +411,7 @@ async def demo_input_formats():
Extract job posting details into structured data. Focus on the visible text content Extract job posting details into structured data. Focus on the visible text content
and organize requirements into categories. and organize requirements into categories.
""", """,
input_format="markdown" # default input_format="markdown", # default
) )
# Then with HTML for better structure understanding # Then with HTML for better structure understanding
@@ -400,34 +429,25 @@ async def demo_input_formats():
Use HTML attributes and classes to enhance extraction accuracy. Use HTML attributes and classes to enhance extraction accuracy.
""", """,
input_format="html" # explicitly use HTML input_format="html", # explicitly use HTML
) )
async with AsyncWebCrawler() as crawler: async with AsyncWebCrawler() as crawler:
# Try with markdown first # Try with markdown first
markdown_config = CrawlerRunConfig( markdown_config = CrawlerRunConfig(extraction_strategy=markdown_strategy)
extraction_strategy=markdown_strategy markdown_result = await crawler.arun(url=url, config=markdown_config)
)
markdown_result = await crawler.arun(
url=url,
config=markdown_config
)
print("\nMarkdown-based Extraction Result:") print("\nMarkdown-based Extraction Result:")
items = json.loads(markdown_result.extracted_content) items = json.loads(markdown_result.extracted_content)
print(json.dumps(items, indent=2)) print(json.dumps(items, indent=2))
# Then with HTML for better structure understanding # Then with HTML for better structure understanding
html_config = CrawlerRunConfig( html_config = CrawlerRunConfig(extraction_strategy=html_strategy)
extraction_strategy=html_strategy html_result = await crawler.arun(url=url, config=html_config)
)
html_result = await crawler.arun(
url=url,
config=html_config
)
print("\nHTML-based Extraction Result:") print("\nHTML-based Extraction Result:")
items = json.loads(html_result.extracted_content) items = json.loads(html_result.extracted_content)
print(json.dumps(items, indent=2)) print(json.dumps(items, indent=2))
# Main execution # Main execution
async def main(): async def main():
print("Crawl4AI v0.4.24 Feature Walkthrough") print("Crawl4AI v0.4.24 Feature Walkthrough")
@@ -439,5 +459,6 @@ async def main():
await demo_json_extraction() await demo_json_extraction()
# await demo_input_formats() # await demo_input_formats()
if __name__ == "__main__": if __name__ == "__main__":
asyncio.run(main()) asyncio.run(main())


@@ -0,0 +1,351 @@
"""
Crawl4ai v0.4.3b2 Features Demo
============================
This demonstration showcases three major categories of new features in Crawl4ai v0.4.3:

1. Efficiency & Speed:
   - Memory-efficient dispatcher strategies
   - New scraping algorithm
   - Streaming support for batch crawling

2. LLM Integration:
   - Automatic schema generation
   - LLM-powered content filtering
   - Smart markdown generation

3. Core Improvements:
   - Robots.txt compliance
   - Proxy rotation
   - Enhanced URL handling
   - Shared data among hooks
   - Add page routes

Each demo function can be run independently or as part of the full suite.
"""

import asyncio
import os
import json
import re
import random
from typing import Optional, Dict
from dotenv import load_dotenv

load_dotenv()

from crawl4ai import (
    AsyncWebCrawler,
    BrowserConfig,
    CrawlerRunConfig,
    CacheMode,
    DisplayMode,
    MemoryAdaptiveDispatcher,
    CrawlerMonitor,
    DefaultMarkdownGenerator,
    LXMLWebScrapingStrategy,
    JsonCssExtractionStrategy,
    LLMContentFilter,
)

async def demo_memory_dispatcher():
    """Demonstrates the new memory-efficient dispatcher system.

    Key Features:
    - Adaptive memory management
    - Real-time performance monitoring
    - Concurrent session control
    """
    print("\n=== Memory Dispatcher Demo ===")

    try:
        # Configuration
        browser_config = BrowserConfig(headless=True, verbose=False)
        crawler_config = CrawlerRunConfig(
            cache_mode=CacheMode.BYPASS,
            markdown_generator=DefaultMarkdownGenerator(),
        )

        # Test URLs
        urls = ["http://example.com", "http://example.org", "http://example.net"] * 3

        print("\n📈 Initializing crawler with memory monitoring...")
        async with AsyncWebCrawler(config=browser_config) as crawler:
            monitor = CrawlerMonitor(
                max_visible_rows=10,
                display_mode=DisplayMode.DETAILED,
            )

            dispatcher = MemoryAdaptiveDispatcher(
                memory_threshold_percent=80.0,
                check_interval=0.5,
                max_session_permit=5,
                monitor=monitor,
            )

            print("\n🚀 Starting batch crawl...")
            results = await crawler.arun_many(
                urls=urls,
                config=crawler_config,
                dispatcher=dispatcher,
            )
            print(f"\n✅ Completed {len(results)} URLs successfully")

    except Exception as e:
        print(f"\n❌ Error in memory dispatcher demo: {str(e)}")

async def demo_streaming_support():
    """
    2. Streaming Support Demo
    ======================
    Shows how to process URLs as they complete using streaming
    """
    print("\n=== 2. Streaming Support Demo ===")

    browser_config = BrowserConfig(headless=True, verbose=False)
    crawler_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS, stream=True)

    # Test URLs
    urls = ["http://example.com", "http://example.org", "http://example.net"] * 2

    async with AsyncWebCrawler(config=browser_config) as crawler:
        # Initialize dispatcher for streaming
        dispatcher = MemoryAdaptiveDispatcher(max_session_permit=3, check_interval=0.5)

        print("Starting streaming crawl...")
        async for result in await crawler.arun_many(
            urls=urls,
            config=crawler_config,
            dispatcher=dispatcher,
        ):
            # Process each result as it arrives
            print(
                f"Received result for {result.url} - Success: {result.success}"
            )
            if result.success:
                print(f"Content length: {len(result.markdown)}")

async def demo_content_scraping():
    """
    3. Content Scraping Strategy Demo
    ==============================
    Demonstrates the new LXMLWebScrapingStrategy for faster content scraping.
    """
    print("\n=== 3. Content Scraping Strategy Demo ===")

    crawler = AsyncWebCrawler()
    url = "https://example.com/article"

    # Configure with the new LXML strategy
    config = CrawlerRunConfig(
        scraping_strategy=LXMLWebScrapingStrategy(),
        verbose=True,
    )

    print("Scraping content with LXML strategy...")
    async with crawler:
        result = await crawler.arun(url, config=config)
        if result.success:
            print("Successfully scraped content using LXML strategy")

async def demo_llm_markdown():
    """
    4. LLM-Powered Markdown Generation Demo
    ===================================
    Shows how to use the new LLM-powered content filtering and markdown generation.
    """
    print("\n=== 4. LLM-Powered Markdown Generation Demo ===")

    crawler = AsyncWebCrawler()
    url = "https://docs.python.org/3/tutorial/classes.html"

    content_filter = LLMContentFilter(
        provider="openai/gpt-4o",
        api_token=os.getenv("OPENAI_API_KEY"),
        instruction="""
        Focus on extracting the core educational content about Python classes.
        Include:
        - Key concepts and their explanations
        - Important code examples
        - Essential technical details
        Exclude:
        - Navigation elements
        - Sidebars
        - Footer content
        - Version information
        - Any non-essential UI elements

        Format the output as clean markdown with proper code blocks and headers.
        """,
        verbose=True,
    )

    # Configure LLM-powered markdown generation
    config = CrawlerRunConfig(
        markdown_generator=DefaultMarkdownGenerator(
            content_filter=content_filter
        ),
        cache_mode=CacheMode.BYPASS,
        verbose=True,
    )

    print("Generating focused markdown with LLM...")
    async with crawler:
        result = await crawler.arun(url, config=config)
        if result.success and result.markdown_v2:
            print("Successfully generated LLM-filtered markdown")
            print("First 500 chars of filtered content:")
            print(result.markdown_v2.fit_markdown[:500])

async def demo_robots_compliance():
    """
    5. Robots.txt Compliance Demo
    ==========================
    Demonstrates the new robots.txt compliance feature with SQLite caching.
    """
    print("\n=== 5. Robots.txt Compliance Demo ===")

    crawler = AsyncWebCrawler()
    urls = ["https://example.com", "https://facebook.com", "https://twitter.com"]

    # Enable robots.txt checking
    config = CrawlerRunConfig(check_robots_txt=True, verbose=True)

    print("Crawling with robots.txt compliance...")
    async with crawler:
        results = await crawler.arun_many(urls, config=config)
        for result in results:
            if result.status_code == 403:
                print(f"Access blocked by robots.txt: {result.url}")
            elif result.success:
                print(f"Successfully crawled: {result.url}")

async def demo_json_schema_generation():
    """
    7. LLM-Powered Schema Generation Demo
    =================================
    Demonstrates automatic CSS and XPath schema generation using LLM models.
    """
    print("\n=== 7. LLM-Powered Schema Generation Demo ===")

    # Example HTML content for a job listing
    html_content = """
    <div class="job-listing">
        <h1 class="job-title">Senior Software Engineer</h1>
        <div class="job-details">
            <span class="location">San Francisco, CA</span>
            <span class="salary">$150,000 - $200,000</span>
            <div class="requirements">
                <h2>Requirements</h2>
                <ul>
                    <li>5+ years Python experience</li>
                    <li>Strong background in web crawling</li>
                </ul>
            </div>
        </div>
    </div>
    """

    print("Generating CSS selectors schema...")
    # Generate CSS selectors with a specific query
    css_schema = JsonCssExtractionStrategy.generate_schema(
        html_content,
        schema_type="CSS",
        query="Extract job title, location, and salary information",
        provider="openai/gpt-4o",  # or use other providers like "ollama"
    )
    print("\nGenerated CSS Schema:")
    print(css_schema)

    # Example of using the generated schema with crawler
    crawler = AsyncWebCrawler()
    url = "https://example.com/job-listing"

    # Create an extraction strategy with the generated schema
    extraction_strategy = JsonCssExtractionStrategy(schema=css_schema)
    config = CrawlerRunConfig(extraction_strategy=extraction_strategy, verbose=True)

    print("\nTesting generated schema with crawler...")
    async with crawler:
        result = await crawler.arun(url, config=config)
        if result.success:
            # extracted_content is a JSON string, so parse it before pretty-printing
            print(
                json.dumps(json.loads(result.extracted_content), indent=2)
                if result.extracted_content
                else None
            )
            print("Successfully used generated schema for crawling")

async def demo_proxy_rotation():
    """
    8. Proxy Rotation Demo
    ===================
    Demonstrates how to rotate proxies for each request using Crawl4ai.
    """
    print("\n=== 8. Proxy Rotation Demo ===")

    async def get_next_proxy() -> Optional[Dict]:
        """Pick a random proxy from the comma-separated PROXIES environment variable.

        Each entry is expected in the form ip:port:username:password.
        """
        try:
            proxies = os.getenv("PROXIES", "").split(",")
            ip, port, username, password = random.choice(proxies).split(":")
            return {
                "server": f"http://{ip}:{port}",
                "username": username,
                "password": password,
                "ip": ip,  # Store original IP for verification
            }
        except Exception as e:
            print(f"Error loading proxy: {e}")
            return None

    # Create 2 test requests to httpbin
    urls = ["https://httpbin.org/ip"] * 2

    browser_config = BrowserConfig(headless=True, verbose=False)
    run_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)

    async with AsyncWebCrawler(config=browser_config) as crawler:
        for url in urls:
            proxy = await get_next_proxy()
            if not proxy:
                print("No proxy available, skipping...")
                continue

            # Create new config with proxy
            current_config = run_config.clone(proxy_config=proxy, user_agent="")
            result = await crawler.arun(url=url, config=current_config)

            if result.success:
                ip_match = re.search(r"(?:[0-9]{1,3}\.){3}[0-9]{1,3}", result.html)
                # Guard against a missing match before comparing IPs
                response_ip = ip_match.group(0) if ip_match else None
                print(f"Proxy {proxy['ip']} -> Response IP: {response_ip or 'Not found'}")
                if response_ip == proxy["ip"]:
                    print(f"✅ Proxy working! IP matches: {proxy['ip']}")
                else:
                    print("❌ Proxy failed or IP mismatch!")
            else:
                print(f"Failed with proxy {proxy['ip']}")

async def main():
    """Run all feature demonstrations."""
    print("\n📊 Running Crawl4ai v0.4.3 Feature Demos\n")

    # Efficiency & Speed Demos
    print("\n🚀 EFFICIENCY & SPEED DEMOS")
    await demo_memory_dispatcher()
    await demo_streaming_support()
    await demo_content_scraping()

    # LLM Integration Demos
    print("\n🤖 LLM INTEGRATION DEMOS")
    await demo_json_schema_generation()
    await demo_llm_markdown()

    # Core Improvements
    print("\n🔧 CORE IMPROVEMENT DEMOS")
    await demo_robots_compliance()
    await demo_proxy_rotation()


if __name__ == "__main__":
    asyncio.run(main())


@@ -1,15 +1,17 @@
# Advanced Features (Proxy, PDF, Screenshot, SSL, Headers, & Storage State) # Overview of Some Important Advanced Features
(Proxy, PDF, Screenshot, SSL, Headers, & Storage State)
Crawl4AI offers multiple power-user features that go beyond simple crawling. This tutorial covers: Crawl4AI offers multiple power-user features that go beyond simple crawling. This tutorial covers:
1. **Proxy Usage** 1. **Proxy Usage**
2. **Capturing PDFs & Screenshots** 2. **Capturing PDFs & Screenshots**
3. **Handling SSL Certificates** 3. **Handling SSL Certificates**
4. **Custom Headers** 4. **Custom Headers**
5. **Session Persistence & Local Storage** 5. **Session Persistence & Local Storage**
6. **Robots.txt Compliance**
> **Prerequisites** > **Prerequisites**
> - You have a basic grasp of [AsyncWebCrawler Basics](./async-webcrawler-basics.md) > - You have a basic grasp of [AsyncWebCrawler Basics](../core/simple-crawling.md)
> - You know how to run or configure your Python environment with Playwright installed > - You know how to run or configure your Python environment with Playwright installed
--- ---
@@ -84,7 +86,7 @@ async def main():
# Save PDF # Save PDF
if result.pdf: if result.pdf:
with open("wikipedia_page.pdf", "wb") as f: with open("wikipedia_page.pdf", "wb") as f:
f.write(b64decode(result.pdf)) f.write(result.pdf)
print("[OK] PDF & screenshot captured.") print("[OK] PDF & screenshot captured.")
else: else:
@@ -186,7 +188,7 @@ if __name__ == "__main__":
**Notes** **Notes**
- Some sites may react differently to certain headers (e.g., `Accept-Language`). - Some sites may react differently to certain headers (e.g., `Accept-Language`).
- If you need advanced user-agent randomization or client hints, see [Identity-Based Crawling (Anti-Bot)](./identity-anti-bot.md) or use `UserAgentGenerator`. - If you need advanced user-agent randomization or client hints, see [Identity-Based Crawling (Anti-Bot)](./identity-based-crawling.md) or use `UserAgentGenerator`.
--- ---
@@ -246,7 +248,43 @@ You can sign in once, export the browser context, and reuse it later—without r
- **`await context.storage_state(path="my_storage.json")`**: Exports cookies, localStorage, etc. to a file. - **`await context.storage_state(path="my_storage.json")`**: Exports cookies, localStorage, etc. to a file.
- Provide `storage_state="my_storage.json"` on subsequent runs to skip the login step. - Provide `storage_state="my_storage.json"` on subsequent runs to skip the login step.
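
For orientation, the exported file is plain JSON. The sketch below assumes the Playwright storage-state layout (a `cookies` list plus an `origins` list whose entries carry `localStorage` items). The `summarize_storage_state` helper, the sample values, and the temporary file location are hypothetical and only illustrate how one might inspect such a file before reusing it.

```python
import json
import os
import tempfile
from typing import Dict


def summarize_storage_state(path: str) -> Dict[str, int]:
    """Count the cookies and localStorage items in a storage-state JSON file."""
    with open(path, "r", encoding="utf-8") as f:
        state = json.load(f)
    cookies = state.get("cookies", [])
    local_items = sum(
        len(origin.get("localStorage", [])) for origin in state.get("origins", [])
    )
    return {"cookies": len(cookies), "local_storage_items": local_items}


# Hypothetical sample state (assumed layout), written to a temp file for the demo
sample_state = {
    "cookies": [
        {"name": "session", "value": "abcd1234", "domain": "example.com", "path": "/"}
    ],
    "origins": [
        {
            "origin": "https://example.com",
            "localStorage": [
                {"name": "token", "value": "my_auth_token"},
                {"name": "theme", "value": "dark"},
            ],
        }
    ],
}

with tempfile.TemporaryDirectory() as tmp_dir:
    state_path = os.path.join(tmp_dir, "my_storage.json")
    with open(state_path, "w", encoding="utf-8") as f:
        json.dump(sample_state, f)
    summary = summarize_storage_state(state_path)

print(summary)
```

A quick check like this can confirm that the export actually captured a login before the file is passed as `storage_state` on a later run.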
**See**: [Detailed session management tutorial](./hooks-custom.md#using-storage_state) or [Explanations → Browser Context & Managed Browser](../../explanations/browser-management.md) for more advanced scenarios (like multi-step logins, or capturing after interactive pages). **See**: [Detailed session management tutorial](./session-management.md) or [Explanations → Browser Context & Managed Browser](./identity-based-crawling.md) for more advanced scenarios (like multi-step logins, or capturing after interactive pages).
---
## 6. Robots.txt Compliance
Crawl4AI supports respecting robots.txt rules with efficient caching:
```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    # Enable robots.txt checking in config
    config = CrawlerRunConfig(
        check_robots_txt=True  # Will check and respect robots.txt rules
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            "https://example.com",
            config=config
        )

        if not result.success and result.status_code == 403:
            print("Access denied by robots.txt")

if __name__ == "__main__":
    asyncio.run(main())
```
**Key Points**
- Robots.txt files are cached locally for efficiency
- Cache is stored in `~/.crawl4ai/robots/robots_cache.db`
- Cache has a default TTL of 7 days
- If robots.txt can't be fetched, crawling is allowed
- Returns 403 status code if URL is disallowed
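
To make the allow/disallow decision concrete, here is a minimal sketch built on Python's standard-library `urllib.robotparser`. It is not Crawl4AI's implementation and does no caching; the inline robots.txt body, the `MyCrawler` user agent, and the `is_allowed` helper are hypothetical and chosen only for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example needs no network access
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if `user_agent` may fetch `url` under the given rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


for url in ("https://example.com/", "https://example.com/private/data"):
    verdict = "allowed" if is_allowed(ROBOTS_TXT, "MyCrawler", url) else "disallowed"
    print(f"{url}: {verdict}")
```

A crawler that honors these rules would skip the second URL, which corresponds to the 403 result described above.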
--- ---
@@ -283,7 +321,10 @@ async def main():
# 3. Crawl # 3. Crawl
async with AsyncWebCrawler(config=browser_cfg) as crawler: async with AsyncWebCrawler(config=browser_cfg) as crawler:
result = await crawler.arun("https://secure.example.com/protected", config=crawler_cfg) result = await crawler.arun(
url="https://secure.example.com/protected",
config=crawler_cfg
)
if result.success: if result.success:
print("[OK] Crawled the secure page. Links found:", len(result.links.get("internal", []))) print("[OK] Crawled the secure page. Links found:", len(result.links.get("internal", [])))
@@ -317,13 +358,8 @@ You've now explored several **advanced** features:
- **SSL Certificate** retrieval & exporting - **SSL Certificate** retrieval & exporting
- **Custom Headers** for language or specialized requests - **Custom Headers** for language or specialized requests
- **Session Persistence** via storage state - **Session Persistence** via storage state
- **Robots.txt Compliance**
**Where to go next**:
- **[Hooks & Custom Code](./hooks-custom.md)**: For multi-step interactions (clicking “Load More,” performing logins, etc.)
- **[Identity-Based Crawling & Anti-Bot](./identity-anti-bot.md)**: If you need more sophisticated user simulation or stealth.
- **[Reference → BrowserConfig & CrawlerRunConfig](../../reference/configuration.md)**: Detailed param descriptions for everything you've seen here and more.
With these power tools, you can build robust scraping workflows that mimic real user behavior, handle secure sites, capture detailed snapshots, and manage sessions across multiple runs—streamlining your entire data collection pipeline. With these power tools, you can build robust scraping workflows that mimic real user behavior, handle secure sites, capture detailed snapshots, and manage sessions across multiple runs—streamlining your entire data collection pipeline.
**Last Updated**: 2024-XX-XX **Last Updated**: 2025-01-01

View File

@@ -1,136 +0,0 @@
# Content Processing
Crawl4AI provides powerful content processing capabilities that help you extract clean, relevant content from web pages. This guide covers content cleaning, media handling, link analysis, and metadata extraction.
## Media Processing
Crawl4AI provides comprehensive media extraction and analysis capabilities. It automatically detects and processes various types of media elements while maintaining their context and relevance.
### Image Processing
The library handles various image scenarios, including:
- Regular images
- Lazy-loaded images
- Background images
- Responsive images
- Image metadata and context
```python
from crawl4ai.async_configs import CrawlerRunConfig
config = CrawlerRunConfig()
result = await crawler.arun(url="https://example.com", config=config)
for image in result.media["images"]:
    # Each image includes rich metadata
    print(f"Source: {image['src']}")
    print(f"Alt text: {image['alt']}")
    print(f"Description: {image['desc']}")
    print(f"Context: {image['context']}")  # Surrounding text
    print(f"Relevance score: {image['score']}")  # 0-10 score
```
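The relevance score can be used to keep only the more useful images. Below is a minimal sketch, assuming each entry is a dict with a numeric `score` key as in the loop above; the helper name `top_images`, the threshold, and the sample data are made up for illustration.

```python
from typing import Any, Dict, List


def top_images(images: List[Dict[str, Any]], min_score: float = 5.0) -> List[Dict[str, Any]]:
    """Keep images scoring at least `min_score`, highest score first."""
    kept = [img for img in images if (img.get("score") or 0) >= min_score]
    return sorted(kept, key=lambda img: img.get("score") or 0, reverse=True)


# Sample data shaped like the entries printed above (values are made up).
sample = [
    {"src": "/logo.png", "alt": "Logo", "score": 2},
    {"src": "/hero.jpg", "alt": "Product photo", "score": 8},
    {"src": "/chart.png", "alt": "Sales chart", "score": 6},
]
print([img["src"] for img in top_images(sample)])
```

In a real crawl, `result.media["images"]` would take the place of `sample`.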
### Handling Lazy-Loaded Content
Crawl4AI already handles lazy loading for media elements. You can customize the wait time for lazy-loaded content with `CrawlerRunConfig`:
```python
config = CrawlerRunConfig(
wait_for="css:img[data-src]", # Wait for lazy images
delay_before_return_html=2.0 # Additional wait time
)
result = await crawler.arun(url="https://example.com", config=config)
```
### Video and Audio Content
The library extracts video and audio elements with their metadata:
```python
from crawl4ai.async_configs import CrawlerRunConfig
config = CrawlerRunConfig()
result = await crawler.arun(url="https://example.com", config=config)
# Process videos
for video in result.media["videos"]:
    print(f"Video source: {video['src']}")
    print(f"Type: {video['type']}")
    print(f"Duration: {video.get('duration')}")
    print(f"Thumbnail: {video.get('poster')}")
# Process audio
for audio in result.media["audios"]:
    print(f"Audio source: {audio['src']}")
    print(f"Type: {audio['type']}")
    print(f"Duration: {audio.get('duration')}")
```
## Link Analysis
Crawl4AI provides sophisticated link analysis capabilities, helping you understand the relationship between pages and identify important navigation patterns.
### Link Classification
The library automatically categorizes links into:
- Internal links (same domain)
- External links (different domains)
- Social media links
- Navigation links
- Content links
```python
from crawl4ai.async_configs import CrawlerRunConfig
config = CrawlerRunConfig()
result = await crawler.arun(url="https://example.com", config=config)
# Analyze internal links
for link in result.links["internal"]:
    print(f"Internal: {link['href']}")
    print(f"Link text: {link['text']}")
    print(f"Context: {link['context']}")  # Surrounding text
    print(f"Type: {link['type']}")  # nav, content, etc.
# Analyze external links
for link in result.links["external"]:
    print(f"External: {link['href']}")
    print(f"Domain: {link['domain']}")
    print(f"Type: {link['type']}")
```
### Smart Link Filtering
Control which links are included in the results with `CrawlerRunConfig`:
```python
config = CrawlerRunConfig(
exclude_external_links=True, # Remove external links
exclude_social_media_links=True, # Remove social media links
exclude_social_media_domains=[ # Custom social media domains
"facebook.com", "twitter.com", "instagram.com"
],
exclude_domains=["ads.example.com"] # Exclude specific domains
)
result = await crawler.arun(url="https://example.com", config=config)
```
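To make the effect of domain exclusion concrete, here is a small standalone sketch of the same idea applied to a list of link dicts. It is not how Crawl4AI matches domains internally; `filter_links` is a hypothetical helper, and treating subdomains as excluded is an assumption of this sketch.

```python
from typing import Dict, Iterable, List
from urllib.parse import urlparse


def filter_links(links: Iterable[Dict[str, str]], excluded_domains: Iterable[str]) -> List[Dict[str, str]]:
    """Drop links whose host equals, or is a subdomain of, an excluded domain."""
    excluded = {d.lower() for d in excluded_domains}

    def is_excluded(href: str) -> bool:
        host = (urlparse(href).hostname or "").lower()
        return any(host == d or host.endswith("." + d) for d in excluded)

    return [link for link in links if not is_excluded(link["href"])]


links = [
    {"href": "https://example.com/about"},
    {"href": "https://www.facebook.com/example"},
    {"href": "https://ads.example.com/banner"},
]
print(filter_links(links, ["facebook.com", "ads.example.com"]))
```

Matching on the parsed hostname rather than on a substring of the URL avoids false positives such as `notfacebook.com`.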
## Metadata Extraction
Crawl4AI automatically extracts and processes page metadata, providing valuable information about the content:
```python
from crawl4ai.async_configs import CrawlerRunConfig
config = CrawlerRunConfig()
result = await crawler.arun(url="https://example.com", config=config)
metadata = result.metadata
print(f"Title: {metadata['title']}")
print(f"Description: {metadata['description']}")
print(f"Keywords: {metadata['keywords']}")
print(f"Author: {metadata['author']}")
print(f"Published Date: {metadata['published_date']}")
print(f"Modified Date: {metadata['modified_date']}")
print(f"Language: {metadata['language']}")
```

View File

@@ -0,0 +1,12 @@
# Crawl Dispatcher
We're excited to announce a **Crawl Dispatcher** module that can handle **thousands** of crawling tasks simultaneously. By efficiently managing system resources (memory, CPU, network), this dispatcher ensures high-performance data extraction at scale. It also provides **real-time monitoring** of each crawler's status, memory usage, and overall progress.
Stay tuned—this feature is **coming soon** in an upcoming release of Crawl4AI! For the latest news, keep an eye on our changelogs and follow [@unclecode](https://twitter.com/unclecode) on X.
Below is a **sample** of how the dispatcher's performance monitor might look in action:
![Crawl Dispatcher Performance Monitor](../assets/images/dispatcher.png)
We can't wait to bring you this streamlined, **scalable** approach to multi-URL crawling. **Watch this space** for updates!

View File

@@ -17,18 +17,6 @@ async def main():
asyncio.run(main()) asyncio.run(main())
``` ```
Or, enable it for a specific crawl by using `CrawlerRunConfig`:
```python
from crawl4ai.async_configs import CrawlerRunConfig
async def main():
    async with AsyncWebCrawler() as crawler:
        config = CrawlerRunConfig(accept_downloads=True)
        result = await crawler.arun(url="https://example.com", config=config)
        # ...
```
## Specifying Download Location ## Specifying Download Location
Specify the download directory using the `downloads_path` attribute in the `BrowserConfig` object. If not provided, Crawl4AI defaults to creating a "downloads" directory inside the `.crawl4ai` folder in your home directory. Specify the download directory using the `downloads_path` attribute in the `BrowserConfig` object. If not provided, Crawl4AI defaults to creating a "downloads" directory inside the `.crawl4ai` folder in your home directory.
@@ -98,7 +86,8 @@ async def download_multiple_files(url: str, download_path: str):
const downloadLinks = document.querySelectorAll('a[download]'); const downloadLinks = document.querySelectorAll('a[download]');
for (const link of downloadLinks) { for (const link of downloadLinks) {
link.click(); link.click();
await new Promise(r => setTimeout(r, 2000)); // Delay between clicks // Delay between clicks
await new Promise(r => setTimeout(r, 2000));
} }
""", """,
wait_for=10 # Wait for all downloads to start wait_for=10 # Wait for all downloads to start

View File

@@ -1,121 +1,254 @@
# Hooks & Auth for AsyncWebCrawler # Hooks & Auth in AsyncWebCrawler
Crawl4AI's `AsyncWebCrawler` allows you to customize the behavior of the web crawler using hooks. Hooks are asynchronous functions called at specific points in the crawling process, allowing you to modify the crawler's behavior or perform additional actions. This updated documentation demonstrates how to use hooks, including the new `on_page_context_created` hook, and ensures compatibility with `BrowserConfig` and `CrawlerRunConfig`. Crawl4AI's **hooks** let you customize the crawler at specific points in the pipeline:
## Example: Using Crawler Hooks with AsyncWebCrawler 1. **`on_browser_created`**: After browser creation.
2. **`on_page_context_created`**: After a new context & page are created.
3. **`before_goto`**: Just before navigating to a page.
4. **`after_goto`**: Right after navigation completes.
5. **`on_user_agent_updated`**: Whenever the user agent changes.
6. **`on_execution_started`**: Once custom JavaScript execution begins.
7. **`before_retrieve_html`**: Just before the crawler retrieves final HTML.
8. **`before_return_html`**: Right before returning the HTML content.
In this example, we'll: **Important**: Avoid heavy tasks in `on_browser_created` since you don't yet have a page context. If you need to *log in*, do so in **`on_page_context_created`**.
1. Configure the browser and set up authentication when it's created. > note "Important Hook Usage Warning"
2. Apply custom routing and initial actions when the page context is created. **Avoid Misusing Hooks**: Do not manipulate page objects in the wrong hook or at the wrong time, as it can crash the pipeline or produce incorrect results. A common mistake is attempting to handle authentication prematurely—such as creating or closing pages in `on_browser_created`.
3. Add custom headers before navigating to the URL.
4. Log the current URL after navigation.
5. Perform actions after JavaScript execution.
6. Log the length of the HTML before returning it.
### Hook Definitions > **Use the Right Hook for Auth**: If you need to log in or set tokens, use `on_page_context_created`. This ensures you have a valid page/context to work with, without disrupting the main crawling flow.
> **Identity-Based Crawling**: For robust auth, consider identity-based crawling (or passing a session ID) to preserve state. Run your initial login steps in a separate, well-defined process, then feed that session to your main crawl—rather than shoehorning complex authentication into early hooks. Check out [Identity-Based Crawling](../advanced/identity-based-crawling.md) for more details.
> **Be Cautious**: Overwriting or removing elements in the wrong hook can compromise the final crawl. Keep hooks focused on smaller tasks (like route filters, custom headers), and let your main logic (crawling, data extraction) proceed normally.
Below is an example demonstration.
---
## Example: Using Hooks in AsyncWebCrawler
```python ```python
import asyncio import asyncio
from crawl4ai import AsyncWebCrawler import json
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from playwright.async_api import Page, Browser, BrowserContext from playwright.async_api import Page, BrowserContext
def log_routing(route):
# Example: block loading images
if route.request.resource_type == "image":
print(f"[HOOK] Blocking image request: {route.request.url}")
asyncio.create_task(route.abort())
else:
asyncio.create_task(route.continue_())
async def on_browser_created(browser: Browser, **kwargs):
print("[HOOK] on_browser_created")
# Example: Set browser viewport size and log in
context = await browser.new_context(viewport={"width": 1920, "height": 1080})
page = await context.new_page()
await page.goto("https://example.com/login")
await page.fill("input[name='username']", "testuser")
await page.fill("input[name='password']", "password123")
await page.click("button[type='submit']")
await page.wait_for_selector("#welcome")
await context.add_cookies([{"name": "auth_token", "value": "abc123", "url": "https://example.com"}])
await page.close()
await context.close()
async def on_page_context_created(context: BrowserContext, page: Page, **kwargs):
print("[HOOK] on_page_context_created")
await context.route("**", log_routing)
async def before_goto(page: Page, context: BrowserContext, **kwargs):
print("[HOOK] before_goto")
await page.set_extra_http_headers({"X-Test-Header": "test"})
async def after_goto(page: Page, context: BrowserContext, **kwargs):
print("[HOOK] after_goto")
print(f"Current URL: {page.url}")
async def on_execution_started(page: Page, context: BrowserContext, **kwargs):
print("[HOOK] on_execution_started")
await page.evaluate("console.log('Custom JS executed')")
async def before_return_html(page: Page, context: BrowserContext, html: str, **kwargs):
print("[HOOK] before_return_html")
print(f"HTML length: {len(html)}")
return page
```
### Using the Hooks with AsyncWebCrawler
```python
async def main(): async def main():
print("\n🔗 Using Crawler Hooks: Customize AsyncWebCrawler with hooks!") print("🔗 Hooks Example: Demonstrating recommended usage")
# Configure browser and crawler settings # 1) Configure the browser
browser_config = BrowserConfig( browser_config = BrowserConfig(
headless=True, headless=True,
viewport_width=1920, verbose=True
viewport_height=1080
) )
# 2) Configure the crawler run
crawler_run_config = CrawlerRunConfig( crawler_run_config = CrawlerRunConfig(
js_code="window.scrollTo(0, document.body.scrollHeight);", js_code="window.scrollTo(0, document.body.scrollHeight);",
wait_for="footer" wait_for="body",
cache_mode=CacheMode.BYPASS
) )
# Initialize crawler # 3) Create the crawler instance
async with AsyncWebCrawler(config=browser_config) as crawler: crawler = AsyncWebCrawler(config=browser_config)
crawler.crawler_strategy.set_hook("on_browser_created", on_browser_created)
crawler.crawler_strategy.set_hook("on_page_context_created", on_page_context_created)
crawler.crawler_strategy.set_hook("before_goto", before_goto)
crawler.crawler_strategy.set_hook("after_goto", after_goto)
crawler.crawler_strategy.set_hook("on_execution_started", on_execution_started)
crawler.crawler_strategy.set_hook("before_return_html", before_return_html)
# Run the crawler #
result = await crawler.arun(url="https://example.com", config=crawler_run_config) # Define Hook Functions
#
print("\n📦 Crawler Hooks Result:") async def on_browser_created(browser, **kwargs):
print(result) # Called once the browser instance is created (but no pages or contexts yet)
print("[HOOK] on_browser_created - Browser created successfully!")
# Typically, do minimal setup here if needed
return browser
asyncio.run(main()) async def on_page_context_created(page: Page, context: BrowserContext, **kwargs):
# Called right after a new page + context are created (ideal for auth or route config).
print("[HOOK] on_page_context_created - Setting up page & context.")
# Example 1: Route filtering (e.g., block images)
async def route_filter(route):
if route.request.resource_type == "image":
print(f"[HOOK] Blocking image request: {route.request.url}")
await route.abort()
else:
await route.continue_()
await context.route("**", route_filter)
# Example 2: (Optional) Simulate a login scenario
# (We do NOT create or close pages here, just do quick steps if needed)
# e.g., await page.goto("https://example.com/login")
# e.g., await page.fill("input[name='username']", "testuser")
# e.g., await page.fill("input[name='password']", "password123")
# e.g., await page.click("button[type='submit']")
# e.g., await page.wait_for_selector("#welcome")
# e.g., await context.add_cookies([...])
# Then continue
# Example 3: Adjust the viewport
await page.set_viewport_size({"width": 1080, "height": 600})
return page
async def before_goto(
page: Page, context: BrowserContext, url: str, **kwargs
):
# Called before navigating to each URL.
print(f"[HOOK] before_goto - About to navigate: {url}")
# e.g., inject custom headers
await page.set_extra_http_headers({
"Custom-Header": "my-value"
})
return page
async def after_goto(
page: Page, context: BrowserContext,
url: str, response, **kwargs
):
# Called after navigation completes.
print(f"[HOOK] after_goto - Successfully loaded: {url}")
# e.g., wait for a certain element if we want to verify
try:
await page.wait_for_selector('.content', timeout=1000)
print("[HOOK] Found .content element!")
except Exception:
print("[HOOK] .content not found, continuing anyway.")
return page
async def on_user_agent_updated(
page: Page, context: BrowserContext,
user_agent: str, **kwargs
):
# Called whenever the user agent updates.
print(f"[HOOK] on_user_agent_updated - New user agent: {user_agent}")
return page
async def on_execution_started(page: Page, context: BrowserContext, **kwargs):
# Called after custom JavaScript execution begins.
print("[HOOK] on_execution_started - JS code is running!")
return page
async def before_retrieve_html(page: Page, context: BrowserContext, **kwargs):
# Called before final HTML retrieval.
print("[HOOK] before_retrieve_html - We can do final actions")
# Example: Scroll again
await page.evaluate("window.scrollTo(0, document.body.scrollHeight);")
return page
async def before_return_html(
page: Page, context: BrowserContext, html: str, **kwargs
):
# Called just before returning the HTML in the result.
print(f"[HOOK] before_return_html - HTML length: {len(html)}")
return page
#
# Attach Hooks
#
crawler.crawler_strategy.set_hook("on_browser_created", on_browser_created)
crawler.crawler_strategy.set_hook(
"on_page_context_created", on_page_context_created
)
crawler.crawler_strategy.set_hook("before_goto", before_goto)
crawler.crawler_strategy.set_hook("after_goto", after_goto)
crawler.crawler_strategy.set_hook(
"on_user_agent_updated", on_user_agent_updated
)
crawler.crawler_strategy.set_hook(
"on_execution_started", on_execution_started
)
crawler.crawler_strategy.set_hook(
"before_retrieve_html", before_retrieve_html
)
crawler.crawler_strategy.set_hook(
"before_return_html", before_return_html
)
await crawler.start()
# 4) Run the crawler on an example page
url = "https://example.com"
result = await crawler.arun(url, config=crawler_run_config)
if result.success:
print("\nCrawled URL:", result.url)
print("HTML length:", len(result.html))
else:
print("Error:", result.error_message)
await crawler.close()
if __name__ == "__main__":
asyncio.run(main())
``` ```
### Explanation of Hooks ---
- **`on_browser_created`**: Called when the browser is created. Use this to configure the browser or handle authentication (e.g., logging in and setting cookies). ## Hook Lifecycle Summary
- **`on_page_context_created`**: Called when a new page context is created. Use this to apply routing, block resources, or inject custom logic before navigating to the URL.
- **`before_goto`**: Called before navigating to the URL. Use this to add custom headers or perform other pre-navigation actions.
- **`after_goto`**: Called after navigation. Use this to verify content or log the URL.
- **`on_execution_started`**: Called after executing custom JavaScript. Use this to perform additional actions.
- **`before_return_html`**: Called before returning the HTML content. Use this to log details or preprocess the content.
### Additional Customizations 1. **`on_browser_created`**:
- Browser is up, but **no** pages or contexts yet.
- Light setup only; don't try to open or close pages here (that belongs in `on_page_context_created`).
- **Resource Management**: Use `on_page_context_created` to block or modify requests (e.g., block images, fonts, or third-party scripts). 2. **`on_page_context_created`**:
- **Dynamic Headers**: Use `before_goto` to add or modify headers dynamically based on the URL. - Perfect for advanced **auth** or route blocking.
- **Authentication**: Use `on_browser_created` to handle login processes and set authentication cookies or tokens. - You have a **page** + **context** ready but haven't navigated to the target URL yet.
- **Content Analysis**: Use `before_return_html` to analyze or modify the extracted HTML content.
These hooks provide powerful customization options for tailoring the crawling process to your needs. 3. **`before_goto`**:
- Right before navigation. Typically used for setting **custom headers** or logging the target URL.
4. **`after_goto`**:
- After page navigation is done. Good place for verifying content or waiting on essential elements.
5. **`on_user_agent_updated`**:
- Whenever the user agent changes (for stealth or different UA modes).
6. **`on_execution_started`**:
- If you set `js_code` or run custom scripts, this runs once your JS is about to start.
7. **`before_retrieve_html`**:
- Just before the final HTML snapshot is taken. Often you do a final scroll or lazy-load triggers here.
8. **`before_return_html`**:
- The last hook before returning HTML to the `CrawlResult`. Good for logging HTML length or minor modifications.
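The per-page ordering in this summary can be pictured with a toy dispatcher. This is only a sketch, assuming the numbered list reflects the firing order; `ToyHookRunner` is a hypothetical stand-in and not Crawl4AI's crawler strategy, although it mirrors the `set_hook(name, fn)` call used in the example.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, List

# Per-page hook order as listed above. `on_browser_created` fires earlier,
# and `on_user_agent_updated` only fires when the user agent changes.
PAGE_HOOK_ORDER = [
    "on_page_context_created",
    "before_goto",
    "after_goto",
    "on_execution_started",
    "before_retrieve_html",
    "before_return_html",
]

Hook = Callable[..., Awaitable[Any]]


class ToyHookRunner:
    """Hypothetical stand-in for a crawler strategy; not Crawl4AI code."""

    def __init__(self) -> None:
        self.hooks: Dict[str, Hook] = {}

    def set_hook(self, name: str, hook: Hook) -> None:
        if name not in PAGE_HOOK_ORDER:
            raise ValueError(f"unknown hook: {name}")
        self.hooks[name] = hook

    async def run(self) -> List[str]:
        """Await the registered hooks in lifecycle order; return their names."""
        fired = []
        for name in PAGE_HOOK_ORDER:
            hook = self.hooks.get(name)
            if hook is not None:
                await hook()
                fired.append(name)
        return fired
```

Registration order does not matter in this sketch: hooks run in lifecycle order, and stages without a registered hook are skipped.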
---
## When to Handle Authentication
**Recommended**: Use **`on_page_context_created`** if you need to:
- Navigate to a login page or fill forms
- Set cookies or localStorage tokens
- Block resource routes to avoid ads
This ensures the newly created context is under your control **before** `arun()` navigates to the main URL.
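A minimal sketch of cookie-based auth in this hook is shown below. It relies on Playwright's `context.add_cookies`, which the commented-out login steps in the example also mention. The cookie name, value, and URL are placeholders, and `FakeContext` is a hypothetical test double that lets the hook be exercised without launching a browser.

```python
import asyncio
from typing import Any


async def on_page_context_created(page: Any, context: Any, **kwargs):
    """Attach an auth cookie before the crawler navigates to the target URL."""
    # Placeholder cookie: name, value, and URL are assumptions for illustration.
    await context.add_cookies([
        {"name": "auth_token", "value": "abc123", "url": "https://example.com"}
    ])
    return page


class FakeContext:
    """Test double that records cookies instead of talking to a browser."""

    def __init__(self) -> None:
        self.cookies = []

    async def add_cookies(self, cookies) -> None:
        self.cookies.extend(cookies)
```

In a real crawl the hook would be attached with `crawler.crawler_strategy.set_hook("on_page_context_created", on_page_context_created)`, and the token would come from a secret store rather than a literal.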
---
## Additional Considerations
- **Session Management**: If you want multiple `arun()` calls to reuse a single session, pass `session_id=` in your `CrawlerRunConfig`. Hooks remain the same.
- **Performance**: Hooks can slow down crawling if they do heavy tasks. Keep them concise.
- **Error Handling**: If a hook fails, the overall crawl might fail. Catch exceptions or handle them gracefully.
- **Concurrency**: If you run `arun_many()`, each URL triggers these hooks in parallel. Ensure your hooks are thread/async-safe.
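One way to follow the error-handling advice is to wrap hooks in a small guard. The sketch below is an assumption-laden illustration rather than a Crawl4AI feature: `safe_hook` is a hypothetical decorator, and it assumes the first positional argument is the object the hook should hand back (the page, or the browser for `on_browser_created`).

```python
import asyncio
import functools
import logging
from typing import Any, Awaitable, Callable

logger = logging.getLogger("hooks")


def safe_hook(hook: Callable[..., Awaitable[Any]]) -> Callable[..., Awaitable[Any]]:
    """Wrap an async hook so an exception is logged instead of propagating."""

    @functools.wraps(hook)
    async def wrapper(page: Any, *args: Any, **kwargs: Any) -> Any:
        try:
            return await hook(page, *args, **kwargs)
        except Exception:
            logger.exception("Hook %s failed; continuing", hook.__name__)
            return page  # fall back to the unmodified first argument

    return wrapper


@safe_hook
async def flaky_after_goto(page: Any, context: Any = None, **kwargs: Any) -> Any:
    # Simulates a hook whose check fails at runtime.
    raise RuntimeError("selector not found")
```

Swallowing errors is only appropriate for optional steps such as logging or extra waits; a failed login hook should probably still abort the crawl.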
---
## Conclusion
Hooks provide **fine-grained** control over:
- **Browser** creation (light tasks only)
- **Page** and **context** creation (auth, route blocking)
- **Navigation** phases
- **Final HTML** retrieval
Follow the recommended usage:
- **Login** or advanced tasks in `on_page_context_created`
- **Custom headers** or logs in `before_goto` / `after_goto`
- **Scrolling** or final checks in `before_retrieve_html` / `before_return_html`

View File

@@ -0,0 +1,180 @@
# Preserve Your Identity with Crawl4AI
Crawl4AI empowers you to navigate and interact with the web using your **authentic digital identity**, ensuring you're recognized as a human and not mistaken for a bot. This tutorial covers:
1. **Managed Browsers**: The recommended approach for persistent profiles and identity-based crawling.
2. **Magic Mode**: A simplified fallback solution for quick automation without persistent identity.
---
## 1. Managed Browsers: Your Digital Identity Solution
**Managed Browsers** let developers create and use **persistent browser profiles**. These profiles store local storage, cookies, and other session data, letting you browse as your **real self**—complete with logins, preferences, and cookies.
### Key Benefits
- **Authentic Browsing Experience**: Retain session data and browser fingerprints as though you're a normal user.
- **Effortless Configuration**: Once you log in or solve CAPTCHAs in your chosen data directory, you can re-run crawls without repeating those steps.
- **Empowered Data Access**: If you can see the data in your own browser, you can automate its retrieval with your genuine identity.
---
This section covers **creating a user-data directory** with **Playwright's Chromium** binary rather than a system-wide Chrome/Edge. It shows how to **locate** that binary and launch it with a `--user-data-dir` argument to set up your profile. You can then point `BrowserConfig.user_data_dir` to that folder for subsequent crawls.
---
### Creating a User Data Directory (Command-Line Approach via Playwright)
If you installed Crawl4AI (which installs Playwright under the hood), you already have a Playwright-managed Chromium on your system. Follow these steps to launch that **Chromium** from your command line, specifying a **custom** data directory:
1. **Find** the Playwright Chromium binary:
- On most systems, installed browsers go under a `~/.cache/ms-playwright/` folder or similar path.
- To see an overview of installed browsers, run:
```bash
python -m playwright install --dry-run
```
or
```bash
playwright install --dry-run
```
(depending on your environment). This shows where Playwright keeps Chromium.
- For instance, you might see a path like:
```
~/.cache/ms-playwright/chromium-1234/chrome-linux/chrome
```
on Linux, or a corresponding folder on macOS/Windows.
2. **Launch** the Playwright Chromium binary with a **custom** user-data directory:
```bash
# Linux example
~/.cache/ms-playwright/chromium-1234/chrome-linux/chrome \
--user-data-dir=/home/<you>/my_chrome_profile
```
```bash
# macOS example (Playwright's internal binary)
~/Library/Caches/ms-playwright/chromium-1234/chrome-mac/Chromium.app/Contents/MacOS/Chromium \
--user-data-dir=/Users/<you>/my_chrome_profile
```
```powershell
# Windows example (PowerShell; in cmd, drop the leading "&" and use ^ instead of ` for line continuation)
& "C:\Users\<you>\AppData\Local\ms-playwright\chromium-1234\chrome-win\chrome.exe" `
--user-data-dir="C:\Users\<you>\my_chrome_profile"
```
**Replace** the path with the actual subfolder indicated in your `ms-playwright` cache structure.
- This **opens** a fresh Chromium with your new or existing data folder.
- **Log into** any sites or configure your browser the way you want.
- **Close** when done—your profile data is saved in that folder.
3. **Use** that folder in **`BrowserConfig.user_data_dir`**:
```python
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig
browser_config = BrowserConfig(
headless=True,
use_managed_browser=True,
user_data_dir="/home/<you>/my_chrome_profile",
browser_type="chromium"
)
```
- Next time you run your code, it reuses that folder—**preserving** your session data, cookies, local storage, etc.
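Step 1 can also be scripted. The sketch below searches a Playwright cache folder for the Linux layout shown above (`chromium-<build>/chrome-linux/chrome`). The helper name `find_playwright_chromium` is hypothetical, the macOS and Windows layouts would need their own patterns, and picking the last match in lexicographic order is only a rough guess at "most recent".

```python
from pathlib import Path
from typing import Optional


def find_playwright_chromium(cache_root: Path) -> Optional[Path]:
    """Return a Chromium binary under a Playwright cache folder, if one exists."""
    # Linux layout from the example above: chromium-<build>/chrome-linux/chrome
    candidates = sorted(cache_root.glob("chromium-*/chrome-linux/chrome"))
    # Last match in lexicographic order; None when nothing is installed.
    return candidates[-1] if candidates else None


if __name__ == "__main__":
    print(find_playwright_chromium(Path.home() / ".cache" / "ms-playwright"))
```

If this prints `None`, the browsers may live elsewhere on your system; the `--dry-run` output from step 1 remains the authoritative source.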
---
## 3. Using Managed Browsers in Crawl4AI
Once you have a data directory with your session data, pass it to **`BrowserConfig`**:
```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig

async def main():
    # 1) Reference your persistent data directory
    browser_config = BrowserConfig(
        headless=True,  # 'True' for automated runs
        verbose=True,
        use_managed_browser=True,  # Enables persistent browser strategy
        browser_type="chromium",
        user_data_dir="/path/to/my-chrome-profile"
    )
    # 2) Standard crawl config
    crawl_config = CrawlerRunConfig(
        wait_for="css:.logged-in-content"
    )
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(url="https://example.com/private", config=crawl_config)
        if result.success:
            print("Successfully accessed private data with your identity!")
        else:
            print("Error:", result.error_message)

if __name__ == "__main__":
    asyncio.run(main())
```
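A mistyped `user_data_dir` is easy to overlook, because the browser would likely just start with an empty profile and the crawl would appear logged out. A small pre-flight check can catch this early. `resolve_profile_dir` below is a hypothetical helper, not part of Crawl4AI.

```python
from pathlib import Path


def resolve_profile_dir(raw_path: str) -> str:
    """Expand `~` and confirm the profile folder exists before crawling."""
    path = Path(raw_path).expanduser()
    if not path.is_dir():
        raise FileNotFoundError(
            f"Profile directory not found: {path}. "
            "Launch the browser with --user-data-dir first to create it."
        )
    return str(path)
```

The returned string can then be passed as `user_data_dir=` when building `BrowserConfig`.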
### Workflow
1. **Login** externally (via CLI or your normal Chrome with `--user-data-dir=...`).
2. **Close** that browser.
3. **Use** the same folder in `user_data_dir=` in Crawl4AI.
4. **Crawl**: The site sees your identity as if you're the same user who just logged in.
---
## 4. Magic Mode: Simplified Automation
If you **don't** need a persistent profile or identity-based approach, **Magic Mode** offers a quick way to simulate human-like browsing without storing long-term data.
```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com",
            config=CrawlerRunConfig(
                magic=True,  # Simplifies a lot of interaction
                remove_overlay_elements=True,
                page_timeout=60000
            )
        )

# `async with` must run inside a coroutine, so drive it with asyncio.run
asyncio.run(main())
```
**Magic Mode**:
- Simulates a user-like experience
- Randomizes user agent & navigator
- Randomizes interactions & timings
- Masks automation signals
- Attempts pop-up handling
**But** it's no substitute for **true** user-based sessions if you want a fully legitimate identity-based solution.
---
## 5. Comparing Managed Browsers vs. Magic Mode
| Feature | **Managed Browsers** | **Magic Mode** |
|----------------------------|---------------------------------------------------------------|-----------------------------------------------------|
| **Session Persistence** | Full localStorage/cookies retained in user_data_dir | No persistent data (fresh each run) |
| **Genuine Identity** | Real user profile with full rights & preferences | Emulated user-like patterns, but no actual identity |
| **Complex Sites** | Best for login-gated sites or heavy config | Simple tasks, minimal login or config needed |
| **Setup** | External creation of user_data_dir, then use in Crawl4AI | Single-line approach (`magic=True`) |
| **Reliability** | Extremely consistent (same data across runs) | Good for smaller tasks, can be less stable |
---
## 6. Summary
- **Create** your user-data directory by launching Chrome/Chromium externally with `--user-data-dir=/some/path`.
- **Log in** or configure sites as needed, then close the browser.
- **Reference** that folder in `BrowserConfig(user_data_dir="...")` + `use_managed_browser=True`.
- Enjoy **persistent** sessions that reflect your real identity.
- If you only need quick, ephemeral automation, **Magic Mode** might suffice.
**Recommended**: Always prefer a **Managed Browser** for robust, identity-based crawling and simpler interactions with complex sites. Use **Magic Mode** for quick tasks or prototypes where persistent data is unnecessary.
With these approaches, you preserve your **authentic** browsing environment, ensuring the site sees you exactly as a normal user—no repeated logins or wasted time.

View File

@@ -1,156 +0,0 @@
### Preserve Your Identity with Crawl4AI
Crawl4AI empowers you to navigate and interact with the web using your authentic digital identity, ensuring that you are recognized as a human and not mistaken for a bot. This document introduces Managed Browsers, the recommended approach for preserving your rights to access the web, and Magic Mode, a simplified solution for specific scenarios.
---
### Managed Browsers: Your Digital Identity Solution
**Managed Browsers** enable developers to create and use persistent browser profiles. These profiles store local storage, cookies, and other session-related data, allowing you to interact with websites as a recognized user. By leveraging your unique identity, Managed Browsers ensure that your experience reflects your rights as a human browsing the web.
#### Why Use Managed Browsers?
1. **Authentic Browsing Experience**: Managed Browsers retain session data and browser fingerprints, mirroring genuine user behavior.
2. **Effortless Configuration**: Once you interact with the site using the browser (e.g., solving a CAPTCHA), the session data is saved and reused, providing seamless access.
3. **Empowered Data Access**: By using your identity, Managed Browsers empower users to access data they can view on their own screens without artificial restrictions.
#### Steps to Use Managed Browsers
1. **Setup the Browser Configuration**:
```python
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
browser_config = BrowserConfig(
headless=False, # Set to False for initial setup to view browser actions
verbose=True,
user_agent_mode="random",
use_managed_browser=True, # Enables persistent browser sessions
browser_type="chromium",
user_data_dir="/path/to/user_profile_data" # Path to save session data
)
```
2. **Perform an Initial Run**:
- Run the crawler with `headless=False`.
- Manually interact with the site (e.g., solve CAPTCHA or log in).
- The browser session saves cookies, local storage, and other required data.
3. **Subsequent Runs**:
- Switch to `headless=True` for automation.
- The session data is reused, allowing seamless crawling.
#### Example: Extracting Data Using Managed Browsers
```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

async def main():
    # Define schema for structured data extraction
    schema = {
        "name": "Example Data",
        "baseSelector": "div.example",
        "fields": [
            {"name": "title", "selector": "h1", "type": "text"},
            {"name": "link", "selector": "a", "type": "attribute", "attribute": "href"}
        ]
    }
    # Configure crawler
    browser_config = BrowserConfig(
        headless=True,  # Automate subsequent runs
        verbose=True,
        use_managed_browser=True,
        user_data_dir="/path/to/user_profile_data"
    )
    crawl_config = CrawlerRunConfig(
        extraction_strategy=JsonCssExtractionStrategy(schema),
        wait_for="css:div.example"  # Wait for the targeted element to load
    )
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url="https://example.com",
            config=crawl_config
        )
        if result.success:
            print("Extracted Data:", result.extracted_content)

if __name__ == "__main__":
    asyncio.run(main())
```
### Benefits of Managed Browsers Over Other Methods
Managed Browsers eliminate the need for manual detection workarounds by enabling developers to work directly with their identity and user profile data. This approach ensures maximum compatibility with websites and simplifies the crawling process while preserving your right to access data freely.
---
### Magic Mode: Simplified Automation
While Managed Browsers are the preferred approach, **Magic Mode** provides an alternative for scenarios where persistent user profiles are unnecessary or infeasible. Magic Mode automates user-like behavior and simplifies configuration.
#### What Magic Mode Does:
- Simulates human browsing by randomizing interaction patterns and timing.
- Masks browser automation signals.
- Handles cookie popups and modals.
- Modifies navigator properties for enhanced compatibility.
#### Using Magic Mode
```python
async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(
        url="https://example.com",
        magic=True  # Enables all automation features
    )
```
Magic Mode is particularly useful for:
- Quick prototyping when a Managed Browser setup is not available.
- Basic sites requiring minimal interaction or configuration.
#### Example: Combining Magic Mode with Additional Options
```python
async def crawl_with_magic_mode(url: str):
    async with AsyncWebCrawler(headless=True) as crawler:
        result = await crawler.arun(
            url=url,
            magic=True,
            remove_overlay_elements=True,  # Remove popups/modals
            page_timeout=60000  # Increased timeout for complex pages
        )
        return result.markdown if result.success else None
```
### Magic Mode vs. Managed Browsers
While Magic Mode simplifies many tasks, it cannot match the reliability and authenticity of Managed Browsers. By using your identity and persistent profiles, Managed Browsers render Magic Mode largely unnecessary. However, Magic Mode remains a viable fallback for specific situations where user identity is not a factor.
---
### Key Comparison: Managed Browsers vs. Magic Mode
| Feature | **Managed Browsers** | **Magic Mode** |
|-------------------------|------------------------------------------|-------------------------------------|
| **Session Persistence** | Retains cookies and local storage. | No session retention. |
| **Human Interaction** | Uses real user profiles and data. | Simulates human-like patterns. |
| **Complex Sites** | Best suited for heavily configured sites.| Works well with simpler challenges.|
| **Setup Complexity** | Requires initial manual interaction. | Fully automated, one-line setup. |
#### Recommendation:
- Use **Managed Browsers** for reliable, session-based crawling and data extraction.
- Use **Magic Mode** for quick prototyping or when persistent profiles are not required.
---
### Conclusion
- **Use Managed Browsers** to preserve your digital identity and ensure reliable, identity-based crawling with persistent sessions. This approach works seamlessly for even the most complex websites.
- **Leverage Magic Mode** for quick automation or in scenarios where persistent user profiles are not needed.
By combining these approaches, Crawl4AI provides unparalleled flexibility and capability for your crawling needs.


@@ -0,0 +1,104 @@
## Handling Lazy-Loaded Images
Many websites now load images **lazily** as you scroll. If you need to ensure they appear in your final crawl (and in `result.media`), consider:
1. **`wait_for_images=True`**: Wait for images to fully load.
2. **`scan_full_page`**: Force the crawler to scroll the entire page, triggering lazy loads.
3. **`scroll_delay`**: Add small delays between scroll steps.
**Note**: If the site requires multiple “Load More” triggers or complex interactions, see the [Page Interaction docs](../core/page-interaction.md).
### Example: Ensuring Lazy Images Appear
```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, BrowserConfig
from crawl4ai.async_configs import CacheMode
async def main():
    config = CrawlerRunConfig(
        # Force the crawler to wait until images are fully loaded
        wait_for_images=True,
        # Option 1: If you want to automatically scroll the page to load images
        scan_full_page=True,  # Tells the crawler to try scrolling the entire page
        scroll_delay=0.5,     # Delay (seconds) between scroll steps
        # Option 2: If the site uses a 'Load More' or JS triggers for images,
        # you can also specify js_code or wait_for logic here.
        cache_mode=CacheMode.BYPASS,
        verbose=True
    )
    async with AsyncWebCrawler(config=BrowserConfig(headless=True)) as crawler:
        result = await crawler.arun("https://www.example.com/gallery", config=config)
        if result.success:
            images = result.media.get("images", [])
            print("Images found:", len(images))
            for i, img in enumerate(images[:5]):
                print(f"[Image {i}] URL: {img['src']}, Score: {img.get('score','N/A')}")
        else:
            print("Error:", result.error_message)
if __name__ == "__main__":
    asyncio.run(main())
```
**Explanation**:
- **`wait_for_images=True`**
    The crawler tries to ensure images have finished loading before finalizing the HTML.
- **`scan_full_page=True`**
    Tells the crawler to attempt scrolling from top to bottom. Each scroll step helps trigger lazy loading.
- **`scroll_delay=0.5`**
    Pause half a second between each scroll step. Helps the site load images before continuing.
**When to Use**:
- **Lazy-Loading**: If images appear only when the user scrolls into view, `scan_full_page` + `scroll_delay` helps the crawler see them.
- **Heavier Pages**: If a page is extremely long, be mindful that scanning the entire page can be slow. Adjust `scroll_delay` or the max scroll steps as needed.
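To get a feel for how `scroll_delay` affects crawl time on long pages, the sketch below estimates the time spent pausing between scroll steps. It assumes the crawler advances roughly one viewport height per step, which is an illustrative assumption rather than documented behavior; `estimate_scan_seconds` is a hypothetical helper, not part of Crawl4AI.

```python
import math

def estimate_scan_seconds(page_height_px: int, viewport_height_px: int, scroll_delay: float) -> float:
    """Rough estimate of the time spent pausing during a full-page scan.

    Assumes one viewport height per scroll step (an assumption, not the
    library's documented step size).
    """
    if viewport_height_px <= 0:
        raise ValueError("viewport_height_px must be positive")
    steps = math.ceil(page_height_px / viewport_height_px)
    return steps * scroll_delay

# A 30,000 px page in a 1,000 px viewport needs about 30 steps.
print(estimate_scan_seconds(30_000, 1_000, 0.5))
```

Under these assumptions, halving `scroll_delay` halves the pause time, at the risk of moving on before slower images have loaded.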
---
## Combining with Other Link & Media Filters
You can still combine **lazy-load** logic with the usual **exclude_external_images**, **exclude_domains**, or link filtration:
```python
config = CrawlerRunConfig(
    wait_for_images=True,
    scan_full_page=True,
    scroll_delay=0.5,
    # Filter out external images if you only want local ones
    exclude_external_images=True,
    # Exclude certain domains for links
    exclude_domains=["spammycdn.com"],
)
```
This approach ensures you see **all** images from the main domain while ignoring external ones, and the crawler physically scrolls the entire page so that lazy-loading triggers.
---
## Tips & Troubleshooting
1. **Long Pages**
    - Setting `scan_full_page=True` on extremely long or infinite-scroll pages can be resource-intensive.
    - Consider using [hooks](../core/page-interaction.md) or specialized logic to load specific sections or “Load More” triggers repeatedly.
2. **Mixed Image Behavior**
    - Some sites load images in batches as you scroll. If you're missing images, increase your `scroll_delay` or call multiple partial scrolls in a loop with JS code or hooks.
3. **Combining with Dynamic Wait**
    - If the site has a placeholder that only changes to a real image after a certain event, you might do `wait_for="css:img.loaded"` or a custom JS `wait_for`.
4. **Caching**
    - If `cache_mode` is enabled, repeated crawls might skip some network fetches. If you suspect caching is missing new images, set `cache_mode=CacheMode.BYPASS` for fresh fetches.
---
With **lazy-loading** support, **wait_for_images**, and **scan_full_page** settings, you can capture the entire gallery or feed of images you expect—even if the site only loads them as the user scrolls. Combine these with the standard media filtering and domain exclusion for a complete link & media handling strategy.


@@ -1,52 +0,0 @@
# Magic Mode & Anti-Bot Protection
Crawl4AI provides powerful anti-detection capabilities, with Magic Mode being the simplest and most comprehensive solution.
## Magic Mode
The easiest way to bypass anti-bot protections:
```python
async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(
        url="https://example.com",
        magic=True  # Enables all anti-detection features
    )
```
Magic Mode automatically:
- Masks browser automation signals
- Simulates human-like behavior
- Overrides navigator properties
- Handles cookie consent popups
- Manages browser fingerprinting
- Randomizes timing patterns
## Manual Anti-Bot Options
While Magic Mode is recommended, you can also configure individual anti-detection features:
```python
result = await crawler.arun(
    url="https://example.com",
    simulate_user=True,      # Simulate human behavior
    override_navigator=True  # Mask automation signals
)
```
Note: When `magic=True` is used, you don't need to set these individual options.
## Example: Handling Protected Sites
```python
async def crawl_protected_site(url: str):
    async with AsyncWebCrawler(headless=True) as crawler:
        result = await crawler.arun(
            url=url,
            magic=True,
            remove_overlay_elements=True,  # Remove popups/modals
            page_timeout=60000  # Increased timeout for protection checks
        )
        return result.markdown if result.success else None
```


@@ -1,188 +0,0 @@
# Creating Browser Instances, Contexts, and Pages
## 1 Introduction
### Overview of Browser Management in Crawl4AI
Crawl4AI's browser management system is designed to provide developers with advanced tools for handling complex web crawling tasks. By managing browser instances, contexts, and pages, Crawl4AI ensures optimal performance, anti-bot measures, and session persistence for high-volume, dynamic web crawling.
### Key Objectives
- **Anti-Bot Handling**:
    - Implements stealth techniques to evade detection mechanisms used by modern websites.
    - Simulates human-like behavior, such as mouse movements, scrolling, and key presses.
    - Supports integration with third-party services to bypass CAPTCHA challenges.
- **Persistent Sessions**:
    - Retains session data (cookies, local storage) for workflows requiring user authentication.
    - Allows seamless continuation of tasks across multiple runs without re-authentication.
- **Scalable Crawling**:
    - Optimized resource utilization for handling thousands of URLs concurrently.
    - Flexible configuration options to tailor crawling behavior to specific requirements.
---
## 2 Browser Creation Methods
### Standard Browser Creation
Standard browser creation initializes a browser instance with default or minimal configurations. It is suitable for tasks that do not require session persistence or heavy customization.
#### Features and Limitations
- **Features**:
    - Quick and straightforward setup for small-scale tasks.
    - Supports headless and headful modes.
- **Limitations**:
    - Lacks advanced customization options like session reuse.
    - May struggle with sites employing strict anti-bot measures.
#### Example Usage
```python
from crawl4ai import AsyncWebCrawler, BrowserConfig
browser_config = BrowserConfig(browser_type="chromium", headless=True)
async with AsyncWebCrawler(config=browser_config) as crawler:
    result = await crawler.arun("https://crawl4ai.com")
    print(result.markdown)
```
### Persistent Contexts
Persistent contexts create browser sessions with stored data, enabling workflows that require maintaining login states or other session-specific information.
#### Benefits of Using `user_data_dir`
- **Session Persistence**:
    - Stores cookies, local storage, and cache between crawling sessions.
    - Reduces overhead for repetitive logins or multi-step workflows.
- **Enhanced Performance**:
    - Leverages pre-loaded resources for faster page loading.
- **Flexibility**:
    - Adapts to complex workflows requiring user-specific configurations.
#### Example: Setting Up Persistent Contexts
```python
config = BrowserConfig(user_data_dir="/path/to/user/data")
async with AsyncWebCrawler(config=config) as crawler:
    result = await crawler.arun("https://crawl4ai.com")
    print(result.markdown)
```
### Managed Browser
The `ManagedBrowser` class offers a high-level abstraction for managing browser instances, emphasizing resource management, debugging capabilities, and anti-bot measures.
#### How It Works
- **Browser Process Management**:
    - Automates initialization and cleanup of browser processes.
    - Optimizes resource usage by pooling and reusing browser instances.
- **Debugging Support**:
    - Integrates with debugging tools like Chrome Developer Tools for real-time inspection.
- **Anti-Bot Measures**:
    - Implements stealth plugins to mimic real user behavior and bypass bot detection.
#### Features
- **Customizable Configurations**:
    - Supports advanced options such as viewport resizing, proxy settings, and header manipulation.
- **Debugging and Logging**:
    - Logs detailed browser interactions for debugging and performance analysis.
- **Scalability**:
    - Handles multiple browser instances concurrently, scaling dynamically based on workload.
#### Example: Using `ManagedBrowser`
```python
from crawl4ai import AsyncWebCrawler, BrowserConfig
config = BrowserConfig(headless=False, debug_port=9222)
async with AsyncWebCrawler(config=config) as crawler:
    result = await crawler.arun("https://crawl4ai.com")
    print(result.markdown)
```
---
## 3 Context and Page Management
### Creating and Configuring Browser Contexts
Browser contexts act as isolated environments within a single browser instance, enabling independent browsing sessions with their own cookies, cache, and storage.
#### Customizations
- **Headers and Cookies**:
    - Define custom headers to mimic specific devices or browsers.
    - Set cookies for authenticated sessions.
- **Session Reuse**:
    - Retain and reuse session data across multiple requests.
    - Example: Preserve login states for authenticated crawls.
#### Example: Context Initialization
```python
from crawl4ai import CrawlerRunConfig
config = CrawlerRunConfig(headers={"User-Agent": "Crawl4AI/1.0"})
async with AsyncWebCrawler() as crawler:
    result = await crawler.arun("https://crawl4ai.com", config=config)
    print(result.markdown)
```
### Creating Pages
Pages represent individual tabs or views within a browser context. They are responsible for rendering content, executing JavaScript, and handling user interactions.
#### Key Features
- **IFrame Handling**:
    - Extract content from embedded iframes.
    - Navigate and interact with nested content.
- **Viewport Customization**:
    - Adjust viewport size to match target device dimensions.
- **Lazy Loading**:
    - Ensure dynamic elements are fully loaded before extraction.
#### Example: Page Initialization
```python
config = CrawlerRunConfig(viewport_width=1920, viewport_height=1080)
async with AsyncWebCrawler() as crawler:
    result = await crawler.arun("https://crawl4ai.com", config=config)
    print(result.markdown)
```
---
## 4 Advanced Features and Best Practices
### Debugging and Logging
Remote debugging provides a powerful way to troubleshoot complex crawling workflows.
#### Example: Enabling Remote Debugging
```python
config = BrowserConfig(debug_port=9222)
async with AsyncWebCrawler(config=config) as crawler:
    result = await crawler.arun("https://crawl4ai.com")
```
### Anti-Bot Techniques
- **Human Behavior Simulation**:
    - Mimic real user actions, such as scrolling, clicking, and typing.
    - Example: Use JavaScript to simulate interactions.
- **Captcha Handling**:
    - Integrate with third-party services like 2Captcha or AntiCaptcha for automated solving.
#### Example: Simulating User Actions
```python
js_code = """
(async () => {
    document.querySelector('input[name="search"]').value = 'test';
    document.querySelector('button[type="submit"]').click();
})();
"""
config = CrawlerRunConfig(js_code=[js_code])
async with AsyncWebCrawler() as crawler:
    result = await crawler.arun("https://crawl4ai.com", config=config)
```
### Optimizations for Performance and Scalability
- **Persistent Contexts**:
    - Reuse browser contexts to minimize resource consumption.
- **Concurrent Crawls**:
    - Use `arun_many` with a controlled semaphore count for efficient batch processing.
#### Example: Scaling Crawls
```python
urls = ["https://example1.com", "https://example2.com"]
config = CrawlerRunConfig(semaphore_count=10)
async with AsyncWebCrawler() as crawler:
    results = await crawler.arun_many(urls, config=config)
    for result in results:
        print(result.url, result.markdown)
```


@@ -0,0 +1,429 @@
# Advanced Multi-URL Crawling with Dispatchers
> **Heads Up**: Crawl4AI supports advanced dispatchers for **parallel** or **throttled** crawling, providing dynamic rate limiting and memory usage checks. The built-in `arun_many()` function uses these dispatchers to handle concurrency efficiently.
## 1. Introduction
When crawling many URLs:
- **Basic**: Use `arun()` in a loop (simple but less efficient)
- **Better**: Use `arun_many()`, which efficiently handles multiple URLs with proper concurrency control
- **Best**: Customize dispatcher behavior for your specific needs (memory management, rate limits, etc.)
**Why Dispatchers?**
- **Adaptive**: Memory-based dispatchers can pause or slow down based on system resources
- **Rate-limiting**: Built-in rate limiting with exponential backoff for 429/503 responses
- **Real-time Monitoring**: Live dashboard of ongoing tasks, memory usage, and performance
- **Flexibility**: Choose between memory-adaptive or semaphore-based concurrency
---
## 2. Core Components
### 2.1 Rate Limiter
```python
from typing import List, Tuple
class RateLimiter:
    def __init__(
        self,
        # Random delay range between requests
        base_delay: Tuple[float, float] = (1.0, 3.0),
        # Maximum backoff delay
        max_delay: float = 60.0,
        # Retries before giving up
        max_retries: int = 3,
        # Status codes triggering backoff
        rate_limit_codes: List[int] = [429, 503]
    ): ...
```
#### RateLimiter Constructor Parameters
The **RateLimiter** is a utility that helps manage the pace of requests to avoid overloading servers or getting blocked due to rate limits. It operates internally to delay requests and handle retries but can be configured using its constructor parameters.
**Parameters of the `RateLimiter` constructor:**
1. **`base_delay`** (`Tuple[float, float]`, default: `(1.0, 3.0)`)
The range for a random delay (in seconds) between consecutive requests to the same domain.
- A random delay is chosen between `base_delay[0]` and `base_delay[1]` for each request.
- This prevents sending requests at a predictable frequency, reducing the chances of triggering rate limits.
**Example:**
If `base_delay = (2.0, 5.0)`, delays could be randomly chosen as `2.3s`, `4.1s`, etc.
---
2. **`max_delay`** (`float`, default: `60.0`)
The maximum allowable delay when rate-limiting errors occur.
- When servers return rate-limit responses (e.g., 429 or 503), the delay increases exponentially with jitter.
- The `max_delay` ensures the delay doesn't grow unreasonably high, capping it at this value.
**Example:**
For a `max_delay = 30.0`, even if backoff calculations suggest a delay of `45s`, it will cap at `30s`.
---
3. **`max_retries`** (`int`, default: `3`)
The maximum number of retries for a request if rate-limiting errors occur.
- After encountering a rate-limit response, the `RateLimiter` retries the request up to this number of times.
- If all retries fail, the request is marked as failed, and the process continues.
**Example:**
If `max_retries = 3`, the system retries a failed request three times before giving up.
---
4. **`rate_limit_codes`** (`List[int]`, default: `[429, 503]`)
A list of HTTP status codes that trigger the rate-limiting logic.
- These status codes indicate the server is overwhelmed or actively limiting requests.
- You can customize this list to include other codes based on specific server behavior.
**Example:**
If `rate_limit_codes = [429, 503, 504]`, the crawler will back off on these three error codes.
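
To make the interplay of `base_delay` and `max_delay` concrete, here is a minimal sketch of exponential backoff with jitter. The doubling factor and the jitter range are assumptions chosen for illustration; this is not the formula `RateLimiter` uses internally, and `backoff_delay` is a hypothetical helper.

```python
import random
from typing import Tuple

def backoff_delay(
    attempt: int,
    base_delay: Tuple[float, float],
    max_delay: float,
    rng: random.Random,
) -> float:
    """Delay before retry number `attempt` (0-based), capped at max_delay."""
    start = rng.uniform(*base_delay)   # random starting delay, like base_delay
    jitter = rng.uniform(0.75, 1.25)   # assumed jitter range
    return min(start * (2 ** attempt) * jitter, max_delay)

rng = random.Random(0)  # seeded only so repeated runs print the same values
for attempt in range(5):
    print(attempt, round(backoff_delay(attempt, (1.0, 3.0), 30.0, rng), 2))
```

However large the exponential term grows, the returned delay never exceeds `max_delay`, which mirrors the capping behavior described above.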
---
**How to Use the `RateLimiter`:**
Here's an example of initializing a `RateLimiter` in your project:
```python
from crawl4ai import RateLimiter
# Create a RateLimiter with custom settings
rate_limiter = RateLimiter(
    base_delay=(2.0, 4.0),       # Random delay between 2-4 seconds
    max_delay=30.0,              # Cap delay at 30 seconds
    max_retries=5,               # Retry up to 5 times on rate-limiting errors
    rate_limit_codes=[429, 503]  # Handle these HTTP status codes
)
# RateLimiter will handle delays and retries internally
# No additional setup is required for its operation
```
The `RateLimiter` integrates seamlessly with dispatchers like `MemoryAdaptiveDispatcher` and `SemaphoreDispatcher`, ensuring requests are paced correctly without user intervention. Its internal mechanisms manage delays and retries to avoid overwhelming servers while maximizing efficiency.
### 2.2 Crawler Monitor
The CrawlerMonitor provides real-time visibility into crawling operations:
```python
from crawl4ai import CrawlerMonitor, DisplayMode
monitor = CrawlerMonitor(
    # Maximum rows in live display
    max_visible_rows=15,
    # DETAILED or AGGREGATED view
    display_mode=DisplayMode.DETAILED
)
```
**Display Modes**:
1. **DETAILED**: Shows individual task status, memory usage, and timing
2. **AGGREGATED**: Displays summary statistics and overall progress
---
## 3. Available Dispatchers
### 3.1 MemoryAdaptiveDispatcher (Default)
Automatically manages concurrency based on system memory usage:
```python
from crawl4ai import RateLimiter, CrawlerMonitor, DisplayMode
from crawl4ai.async_dispatcher import MemoryAdaptiveDispatcher
dispatcher = MemoryAdaptiveDispatcher(
    memory_threshold_percent=90.0,  # Pause if memory exceeds this
    check_interval=1.0,             # How often to check memory
    max_session_permit=10,          # Maximum concurrent tasks
    rate_limiter=RateLimiter(       # Optional rate limiting
        base_delay=(1.0, 2.0),
        max_delay=30.0,
        max_retries=2
    ),
    monitor=CrawlerMonitor(         # Optional monitoring
        max_visible_rows=15,
        display_mode=DisplayMode.DETAILED
    )
)
```
**Constructor Parameters:**
1. **`memory_threshold_percent`** (`float`, default: `90.0`)
    Specifies the memory usage threshold (as a percentage). If system memory usage exceeds this value, the dispatcher pauses crawling to prevent system overload.
2. **`check_interval`** (`float`, default: `1.0`)
    The interval (in seconds) at which the dispatcher checks system memory usage.
3. **`max_session_permit`** (`int`, default: `10`)
    The maximum number of concurrent crawling tasks allowed. This ensures resource limits are respected while maintaining concurrency.
4. **`memory_wait_timeout`** (`float`, default: `300.0`)
    Optional timeout (in seconds). If memory usage exceeds `memory_threshold_percent` for longer than this duration, a `MemoryError` is raised.
5. **`rate_limiter`** (`RateLimiter`, default: `None`)
    Optional rate-limiting logic to avoid server-side blocking (e.g., for handling 429 or 503 errors). See **RateLimiter** for details.
6. **`monitor`** (`CrawlerMonitor`, default: `None`)
    Optional monitoring for real-time task tracking and performance insights. See **CrawlerMonitor** for details.
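
The way `memory_threshold_percent`, `check_interval`, and `memory_wait_timeout` interact can be sketched as a simple polling loop. This is an illustration under assumptions, not the dispatcher's actual implementation: `wait_for_memory` is a hypothetical helper, and the memory reading is injected as a callable so the example runs without extra packages.

```python
import asyncio
import time
from typing import Callable

async def wait_for_memory(
    get_percent: Callable[[], float],
    threshold: float = 90.0,
    check_interval: float = 1.0,
    timeout: float = 300.0,
) -> None:
    """Block until memory usage is at or below the threshold, or raise MemoryError."""
    start = time.monotonic()
    while get_percent() > threshold:
        if time.monotonic() - start >= timeout:
            raise MemoryError(f"memory stayed above {threshold}% for {timeout}s")
        await asyncio.sleep(check_interval)  # pause before the next check

# Simulated readings: two checks above the threshold, then recovery.
readings = iter([95.0, 92.0, 80.0])
asyncio.run(wait_for_memory(lambda: next(readings), check_interval=0.01))
print("memory recovered, crawling can resume")
```

In a real setup the callable would return a live reading from the operating system. Injecting it keeps the pause-and-timeout logic easy to test.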
---
### 3.2 SemaphoreDispatcher
Provides simple concurrency control with a fixed limit:
```python
from crawl4ai import RateLimiter, CrawlerMonitor, DisplayMode
from crawl4ai.async_dispatcher import SemaphoreDispatcher
dispatcher = SemaphoreDispatcher(
    max_session_permit=20,     # Maximum concurrent tasks
    rate_limiter=RateLimiter(  # Optional rate limiting
        base_delay=(0.5, 1.0),
        max_delay=10.0
    ),
    monitor=CrawlerMonitor(    # Optional monitoring
        max_visible_rows=15,
        display_mode=DisplayMode.DETAILED
    )
)
```
**Constructor Parameters:**
1. **`max_session_permit`** (`int`, default: `20`)
    The maximum number of concurrent crawling tasks allowed, irrespective of semaphore slots.
2. **`rate_limiter`** (`RateLimiter`, default: `None`)
    Optional rate-limiting logic to avoid overwhelming servers. See **RateLimiter** for details.
3. **`monitor`** (`CrawlerMonitor`, default: `None`)
    Optional monitoring for tracking task progress and resource usage. See **CrawlerMonitor** for details.
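
For readers unfamiliar with semaphore-based concurrency, the following sketch shows the general technique with `asyncio.Semaphore` from the standard library. It does not reproduce `SemaphoreDispatcher` internals; `run_limited` and the dummy job are hypothetical stand-ins for real crawl tasks.

```python
import asyncio

async def run_limited(n_tasks: int, limit: int) -> int:
    """Run n_tasks dummy jobs with at most `limit` in flight; return the observed peak."""
    semaphore = asyncio.Semaphore(limit)
    active = 0
    peak = 0

    async def job() -> None:
        nonlocal active, peak
        async with semaphore:          # blocks while `limit` jobs are already running
            active += 1
            peak = max(peak, active)
            await asyncio.sleep(0.01)  # stand-in for fetching a page
            active -= 1

    await asyncio.gather(*(job() for _ in range(n_tasks)))
    return peak

peak = asyncio.run(run_limited(n_tasks=20, limit=5))
print("peak concurrency:", peak)
```

The observed peak never exceeds the limit, regardless of how many tasks are queued, which is the property a fixed-concurrency dispatcher relies on.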
---
## 4. Usage Examples
### 4.1 Batch Processing (Default)
```python
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from crawl4ai import CrawlerMonitor, DisplayMode
from crawl4ai.async_dispatcher import MemoryAdaptiveDispatcher
async def crawl_batch(urls):
    browser_config = BrowserConfig(headless=True, verbose=False)
    run_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        stream=False  # Default: get all results at once
    )
    dispatcher = MemoryAdaptiveDispatcher(
        memory_threshold_percent=70.0,
        check_interval=1.0,
        max_session_permit=10,
        monitor=CrawlerMonitor(
            display_mode=DisplayMode.DETAILED
        )
    )
    async with AsyncWebCrawler(config=browser_config) as crawler:
        # Get all results at once
        results = await crawler.arun_many(
            urls=urls,
            config=run_config,
            dispatcher=dispatcher
        )
        # Process all results after completion
        for result in results:
            if result.success:
                await process_result(result)  # process_result: your own async handler
            else:
                print(f"Failed to crawl {result.url}: {result.error_message}")
```
**Review:**
- **Purpose:** Executes a batch crawl with all URLs processed together after crawling is complete.
- **Dispatcher:** Uses `MemoryAdaptiveDispatcher` to manage concurrency and system memory.
- **Stream:** Disabled (`stream=False`), so all results are collected at once for post-processing.
- **Best Use Case:** When you need to analyze results in bulk rather than individually during the crawl.
---
### 4.2 Streaming Mode
```python
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from crawl4ai import CrawlerMonitor, DisplayMode
from crawl4ai.async_dispatcher import MemoryAdaptiveDispatcher
async def crawl_streaming(urls):
    browser_config = BrowserConfig(headless=True, verbose=False)
    run_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        stream=True  # Enable streaming mode
    )
    dispatcher = MemoryAdaptiveDispatcher(
        memory_threshold_percent=70.0,
        check_interval=1.0,
        max_session_permit=10,
        monitor=CrawlerMonitor(
            display_mode=DisplayMode.DETAILED
        )
    )
    async with AsyncWebCrawler(config=browser_config) as crawler:
        # Process results as they become available
        async for result in await crawler.arun_many(
            urls=urls,
            config=run_config,
            dispatcher=dispatcher
        ):
            if result.success:
                # Process each result immediately
                await process_result(result)  # process_result: your own async handler
            else:
                print(f"Failed to crawl {result.url}: {result.error_message}")
```
**Review:**
- **Purpose:** Enables streaming to process results as soon as they're available.
- **Dispatcher:** Uses `MemoryAdaptiveDispatcher` for concurrency and memory management.
- **Stream:** Enabled (`stream=True`), allowing real-time processing during crawling.
- **Best Use Case:** When you need to act on results immediately, such as for real-time analytics or progressive data storage.
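
The difference between batch and streaming delivery can be illustrated with plain `asyncio`, independent of Crawl4AI. In this sketch, `fake_fetch`, `batch`, and `stream` are hypothetical names, and the sleep durations stand in for page load times.

```python
import asyncio
from typing import AsyncIterator, List, Tuple

Job = Tuple[str, float]  # (url, simulated load time in seconds)

async def fake_fetch(url: str, delay: float) -> str:
    await asyncio.sleep(delay)  # stand-in for crawling the page
    return url

async def batch(jobs: List[Job]) -> List[str]:
    """Batch style: nothing is returned until every job has finished."""
    return list(await asyncio.gather(*(fake_fetch(u, d) for u, d in jobs)))

async def stream(jobs: List[Job]) -> AsyncIterator[str]:
    """Streaming style: yield each result as soon as it completes."""
    tasks = [asyncio.create_task(fake_fetch(u, d)) for u, d in jobs]
    for finished in asyncio.as_completed(tasks):
        yield await finished

async def main() -> Tuple[List[str], List[str]]:
    jobs = [("slow-page", 0.2), ("fast-page", 0.01)]
    batched = await batch(jobs)                 # input order, all at once
    streamed = [r async for r in stream(jobs)]  # completion order
    return batched, streamed

batched, streamed = asyncio.run(main())
print("batch:", batched)
print("stream:", streamed)
```

The batch call keeps input order but delivers nothing until the slowest job ends, while the stream hands over the fast result first. That trade-off is the one between `stream=False` and `stream=True` described above.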
---
### 4.3 Semaphore-based Crawling
```python
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from crawl4ai import RateLimiter, CrawlerMonitor, DisplayMode
from crawl4ai.async_dispatcher import SemaphoreDispatcher
async def crawl_with_semaphore(urls):
    browser_config = BrowserConfig(headless=True, verbose=False)
    run_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
    dispatcher = SemaphoreDispatcher(
        semaphore_count=5,
        rate_limiter=RateLimiter(
            base_delay=(0.5, 1.0),
            max_delay=10.0
        ),
        monitor=CrawlerMonitor(
            max_visible_rows=15,
            display_mode=DisplayMode.DETAILED
        )
    )
    async with AsyncWebCrawler(config=browser_config) as crawler:
        results = await crawler.arun_many(
            urls,
            config=run_config,
            dispatcher=dispatcher
        )
        return results
```
**Review:**
- **Purpose:** Uses `SemaphoreDispatcher` to limit concurrency with a fixed number of slots.
- **Dispatcher:** Configured with a semaphore to control parallel crawling tasks.
- **Rate Limiter:** Prevents servers from being overwhelmed by pacing requests.
- **Best Use Case:** When you want precise control over the number of concurrent requests, independent of system memory.
---
### 4.4 Robots.txt Consideration
```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
async def main():
    urls = [
        "https://example1.com",
        "https://example2.com",
        "https://example3.com"
    ]
    config = CrawlerRunConfig(
        cache_mode=CacheMode.ENABLED,
        check_robots_txt=True,  # Will respect robots.txt for each URL
        semaphore_count=3       # Max concurrent requests
    )
    async with AsyncWebCrawler() as crawler:
        # stream is not enabled here, so arun_many() returns the full list of results
        results = await crawler.arun_many(urls, config=config)
        for result in results:
            if result.success:
                print(f"Successfully crawled {result.url}")
            elif result.status_code == 403 and "robots.txt" in result.error_message:
                print(f"Skipped {result.url} - blocked by robots.txt")
            else:
                print(f"Failed to crawl {result.url}: {result.error_message}")
if __name__ == "__main__":
    asyncio.run(main())
```
**Review:**
- **Purpose:** Ensures compliance with `robots.txt` rules for ethical and legal web crawling.
- **Configuration:** Set `check_robots_txt=True` to validate each URL against `robots.txt` before crawling.
- **Dispatcher:** Handles requests with concurrency limits (`semaphore_count=3`).
- **Best Use Case:** When crawling websites that strictly enforce robots.txt policies or for responsible crawling practices.
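
As background on what a `robots.txt` check involves, the sketch below uses Python's standard `urllib.robotparser` on an in-memory rule set. It illustrates the general mechanism only and makes no claim about how `check_robots_txt` is implemented; the user agent name `MyCrawler` and the rules are made up for the example.

```python
from urllib.robotparser import RobotFileParser

def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser

# Hypothetical rules: everything under /private/ is off limits to all agents.
rules = """User-agent: *
Disallow: /private/
"""

parser = build_parser(rules)
for url in ("https://example.com/private/page", "https://example.com/public/page"):
    verdict = "allowed" if parser.can_fetch("MyCrawler", url) else "blocked"
    print(url, "->", verdict)
```

A crawler that honors these rules skips blocked URLs before making any request, which is why such results surface as skipped rather than as network failures.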
---
## 5. Dispatch Results
Each crawl result includes dispatch information:
```python
from dataclasses import dataclass
from datetime import datetime
@dataclass
class DispatchResult:
    task_id: str
    memory_usage: float
    peak_memory: float
    start_time: datetime
    end_time: datetime
    error_message: str = ""
```
Access via `result.dispatch_result`:
```python
for result in results:
    if result.success:
        dr = result.dispatch_result
        print(f"URL: {result.url}")
        print(f"Memory: {dr.memory_usage:.1f}MB")
        print(f"Duration: {dr.end_time - dr.start_time}")
```
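Dispatch information is also handy for aggregate reporting. The sketch below redefines a local `DispatchResult` with the fields listed above so that it runs on its own; `summarize` is a hypothetical helper, and the sample values are invented for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List

@dataclass
class DispatchResult:  # local mirror of the fields listed above
    task_id: str
    memory_usage: float
    peak_memory: float
    start_time: datetime
    end_time: datetime
    error_message: str = ""

def summarize(results: List[DispatchResult]) -> Dict[str, object]:
    """Aggregate task count, total task time, and the highest peak memory."""
    total = sum((r.end_time - r.start_time for r in results), timedelta())
    return {
        "tasks": len(results),
        "total_duration": total,
        "max_peak_memory": max(r.peak_memory for r in results),
    }

t0 = datetime(2025, 1, 30, 12, 0, 0)
sample = [
    DispatchResult("a", 50.0, 80.0, t0, t0 + timedelta(seconds=2)),
    DispatchResult("b", 40.0, 120.0, t0, t0 + timedelta(seconds=3)),
]
summary = summarize(sample)
print(summary)
```

Note that `total_duration` sums per-task durations, so for concurrent tasks it can exceed wall-clock time.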
## 6. Summary
1. **Two Dispatcher Types**:
    - MemoryAdaptiveDispatcher (default): Dynamic concurrency based on memory
    - SemaphoreDispatcher: Fixed concurrency limit
2. **Optional Components**:
    - RateLimiter: Smart request pacing and backoff
    - CrawlerMonitor: Real-time progress visualization
3. **Key Benefits**:
    - Automatic memory management
    - Built-in rate limiting
    - Live progress monitoring
    - Flexible concurrency control
Choose the dispatcher that best fits your needs:
- **MemoryAdaptiveDispatcher**: For large crawls or limited resources
- **SemaphoreDispatcher**: For simple, fixed-concurrency scenarios


@@ -1,6 +1,4 @@
# Proxy & Security # Proxy
Configure proxy settings and enhance security features in Crawl4AI for reliable data extraction.
## Basic Proxy Setup ## Basic Proxy Setup
@@ -38,58 +36,33 @@ async with AsyncWebCrawler(config=browser_config) as crawler:
result = await crawler.arun(url="https://example.com") result = await crawler.arun(url="https://example.com")
``` ```
Here's the corrected documentation:
## Rotating Proxies ## Rotating Proxies
Example using a proxy rotation service and updating `BrowserConfig` dynamically: Example using a proxy rotation service dynamically:
```python ```python
from crawl4ai.async_configs import BrowserConfig from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig
async def get_next_proxy(): async def get_next_proxy():
# Your proxy rotation logic here # Your proxy rotation logic here
return {"server": "http://next.proxy.com:8080"} return {"server": "http://next.proxy.com:8080"}
browser_config = BrowserConfig() async def main():
async with AsyncWebCrawler(config=browser_config) as crawler: browser_config = BrowserConfig()
# Update proxy for each request run_config = CrawlerRunConfig()
for url in urls:
proxy = await get_next_proxy() async with AsyncWebCrawler(config=browser_config) as crawler:
browser_config.proxy_config = proxy # For each URL, create a new run config with different proxy
result = await crawler.arun(url=url, config=browser_config) for url in urls:
proxy = await get_next_proxy()
# Clone the config and update proxy - this creates a new browser context
current_config = run_config.clone(proxy_config=proxy)
result = await crawler.arun(url=url, config=current_config)
if __name__ == "__main__":
import asyncio
asyncio.run(main())
``` ```
## Custom Headers
Add security-related headers via `BrowserConfig`:
```python
from crawl4ai.async_configs import BrowserConfig
headers = {
"X-Forwarded-For": "203.0.113.195",
"Accept-Language": "en-US,en;q=0.9",
"Cache-Control": "no-cache",
"Pragma": "no-cache"
}
browser_config = BrowserConfig(headers=headers)
async with AsyncWebCrawler(config=browser_config) as crawler:
result = await crawler.arun(url="https://example.com")
```
## Combining with Magic Mode
For maximum protection, combine proxy with Magic Mode via `CrawlerRunConfig` and `BrowserConfig`:
```python
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig
browser_config = BrowserConfig(
proxy="http://proxy.example.com:8080",
headers={"Accept-Language": "en-US"}
)
crawler_config = CrawlerRunConfig(magic=True) # Enable all anti-detection features
async with AsyncWebCrawler(config=browser_config) as crawler:
result = await crawler.arun(url="https://example.com", config=crawler_config)
```


@@ -1,179 +0,0 @@
### Session-Based Crawling for Dynamic Content
In modern web applications, content is often loaded dynamically without changing the URL. Examples include "Load More" buttons, infinite scrolling, or paginated content that updates via JavaScript. Crawl4AI provides session-based crawling capabilities to handle such scenarios effectively.
This guide explores advanced techniques for crawling dynamic content using Crawl4AI's session management features.
---
## Understanding Session-Based Crawling
Session-based crawling allows you to reuse a persistent browser session across multiple actions. This means the same browser tab (or page object) is used throughout, enabling:
1. **Efficient handling of dynamic content** without reloading the page.
2. **JavaScript actions before and after crawling** (e.g., clicking buttons or scrolling).
3. **State maintenance** for authenticated sessions or multi-step workflows.
4. **Faster sequential crawling**, as it avoids reopening tabs or reallocating resources.
**Note:** Session-based crawling is ideal for sequential operations, not parallel tasks.
---
## Basic Concepts
Before diving into examples, here are some key concepts:
- **Session ID**: A unique identifier for a browsing session. Use the same `session_id` across multiple requests to maintain state.
- **BrowserConfig & CrawlerRunConfig**: These configuration objects control browser settings and crawling behavior.
- **JavaScript Execution**: Use `js_code` to perform actions like clicking buttons.
- **CSS Selectors**: Target specific elements for interaction or data extraction.
- **Extraction Strategy**: Define rules to extract structured data.
- **Wait Conditions**: Specify conditions to wait for before proceeding.
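
The concepts above combine into one recurring pattern: navigate once, then keep running JavaScript in the same tab. Below is a minimal sketch of that pattern. `page_kwargs` is a hypothetical helper (not part of Crawl4AI) that only assembles the per-step parameters named in this guide (`session_id`, `js_code`, `js_only`) as a plain dict, so it runs without a browser.

```python
from typing import Any, Dict, Optional


def page_kwargs(page: int, url: str, session_id: str, next_page_js: str) -> Dict[str, Any]:
    """Build keyword arguments for one step of a session-based crawl.

    Hypothetical helper for illustration only.
    """
    first_page = page == 0
    js_code: Optional[str] = None if first_page else next_page_js
    return {
        "url": url,
        "session_id": session_id,   # same ID on every step keeps the tab alive
        "js_code": js_code,         # nothing to click on the initial load
        "js_only": not first_page,  # later steps run JS without a new navigation
    }


steps = [
    page_kwargs(
        p,
        "https://example.com/dynamic-content",
        "demo_session",
        "document.querySelector('.load-more-button').click();",
    )
    for p in range(3)
]
```

Each dict could then be unpacked into `CrawlerRunConfig(**kwargs)`, assuming your Crawl4AI version accepts these fields as the examples in this guide do.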
---
## Example 1: Basic Session-Based Crawling
A simple example using session-based crawling:
```python
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig
from crawl4ai.cache_context import CacheMode

async def basic_session_crawl():
    async with AsyncWebCrawler() as crawler:
        session_id = "dynamic_content_session"
        url = "https://example.com/dynamic-content"

        for page in range(3):
            config = CrawlerRunConfig(
                url=url,
                session_id=session_id,
                # Click "Load More" on every pass except the initial load
                js_code="document.querySelector('.load-more-button').click();" if page > 0 else None,
                css_selector=".content-item",
                cache_mode=CacheMode.BYPASS
            )
            result = await crawler.arun(config=config)
            print(f"Page {page + 1}: Found {result.extracted_content.count('.content-item')} items")

        # Release the browser tab held by this session
        await crawler.crawler_strategy.kill_session(session_id)

asyncio.run(basic_session_crawl())
```
This example shows:
1. Reusing the same `session_id` across multiple requests.
2. Executing JavaScript to load more content dynamically.
3. Properly closing the session to free resources.
---
## Advanced Technique 1: Custom Execution Hooks
Use custom hooks to handle complex scenarios, such as waiting for content to load dynamically:
```python
async def advanced_session_crawl_with_hooks():
    first_commit = ""

    async def on_execution_started(page):
        nonlocal first_commit
        try:
            # Poll until the first commit title differs from the previous page's
            while True:
                await page.wait_for_selector("li.commit-item h4")
                commit = await page.query_selector("li.commit-item h4")
                # Await the evaluation first, then strip the resulting string
                commit = (await commit.evaluate("(element) => element.textContent")).strip()
                if commit and commit != first_commit:
                    first_commit = commit
                    break
                await asyncio.sleep(0.5)
        except Exception as e:
            print(f"Warning: New content didn't appear: {e}")

    async with AsyncWebCrawler() as crawler:
        session_id = "commit_session"
        url = "https://github.com/example/repo/commits/main"
        crawler.crawler_strategy.set_hook("on_execution_started", on_execution_started)
        js_next_page = """document.querySelector('a.pagination-next').click();"""

        for page in range(3):
            config = CrawlerRunConfig(
                url=url,
                session_id=session_id,
                js_code=js_next_page if page > 0 else None,
                css_selector="li.commit-item",
                js_only=page > 0,
                cache_mode=CacheMode.BYPASS
            )
            result = await crawler.arun(config=config)
            print(f"Page {page + 1}: Found {len(result.extracted_content)} commits")

        await crawler.crawler_strategy.kill_session(session_id)

asyncio.run(advanced_session_crawl_with_hooks())
```
This technique ensures new content loads before the next action.
---
## Advanced Technique 2: Integrated JavaScript Execution and Waiting
Combine JavaScript execution and waiting logic for concise handling of dynamic content:
```python
async def integrated_js_and_wait_crawl():
    async with AsyncWebCrawler() as crawler:
        session_id = "integrated_session"
        url = "https://github.com/example/repo/commits/main"

        js_next_page_and_wait = """
        (async () => {
            const getCurrentCommit = () => document.querySelector('li.commit-item h4').textContent.trim();
            const initialCommit = getCurrentCommit();
            document.querySelector('a.pagination-next').click();
            while (getCurrentCommit() === initialCommit) {
                await new Promise(resolve => setTimeout(resolve, 100));
            }
        })();
        """

        for page in range(3):
            config = CrawlerRunConfig(
                url=url,
                session_id=session_id,
                js_code=js_next_page_and_wait if page > 0 else None,
                css_selector="li.commit-item",
                js_only=page > 0,
                cache_mode=CacheMode.BYPASS
            )
            result = await crawler.arun(config=config)
            print(f"Page {page + 1}: Found {len(result.extracted_content)} commits")

        await crawler.crawler_strategy.kill_session(session_id)

asyncio.run(integrated_js_and_wait_crawl())
```
---
## Best Practices for Session-Based Crawling
1. **Unique Session IDs**: Assign descriptive and unique `session_id` values.
2. **Close Sessions**: Always clean up sessions with `kill_session` after use.
3. **Error Handling**: Anticipate and handle errors gracefully.
4. **Respect Websites**: Follow terms of service and robots.txt.
5. **Delays**: Add delays to avoid overwhelming servers.
6. **Optimize JavaScript**: Keep scripts concise for better performance.
7. **Monitor Resources**: Track memory and CPU usage for long sessions.
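
The "Close Sessions" and "Error Handling" practices can be combined so that cleanup runs even when a crawl step fails. The following is a minimal sketch, not part of Crawl4AI: `managed_session` is a hypothetical wrapper that assumes the strategy object exposes an async `kill_session(session_id)`, as `crawler.crawler_strategy` does in the examples above. `FakeStrategy` is a stand-in so the sketch runs without a browser.

```python
import asyncio
from contextlib import asynccontextmanager


@asynccontextmanager
async def managed_session(strategy, session_id: str):
    """Yield the session ID and always kill the session afterwards.

    Hypothetical helper: `strategy` is assumed to provide an async
    `kill_session(session_id)` method.
    """
    try:
        yield session_id
    finally:
        await strategy.kill_session(session_id)


class FakeStrategy:
    """Stand-in for a crawler strategy; records which sessions were killed."""

    def __init__(self):
        self.killed = []

    async def kill_session(self, session_id: str) -> None:
        self.killed.append(session_id)


async def demo() -> list:
    strategy = FakeStrategy()
    try:
        async with managed_session(strategy, "login_flow_session"):
            raise RuntimeError("simulated crawl failure")
    except RuntimeError:
        pass  # the error still propagates to the caller; cleanup has already run
    return strategy.killed


killed = asyncio.run(demo())
```

With a real crawler, the same wrapper would be used as `async with managed_session(crawler.crawler_strategy, session_id):` around the crawl loop.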
---
## Conclusion
Session-based crawling in Crawl4AI is a robust solution for handling dynamic content and multi-step workflows. By combining session management, JavaScript execution, and structured extraction strategies, you can effectively navigate and extract data from modern web applications. Always adhere to ethical web scraping practices and respect website policies.


@@ -1,4 +1,4 @@
### Session Management # Session Management
Session management in Crawl4AI is a powerful feature that allows you to maintain state across multiple requests, making it particularly suitable for handling complex multi-step crawling tasks. It enables you to reuse the same browser tab (or page object) across sequential actions and crawls, which is beneficial for: Session management in Crawl4AI is a powerful feature that allows you to maintain state across multiple requests, making it particularly suitable for handling complex multi-step crawling tasks. It enables you to reuse the same browser tab (or page object) across sequential actions and crawls, which is beneficial for:
@@ -20,8 +20,12 @@ async with AsyncWebCrawler() as crawler:
session_id = "my_session" session_id = "my_session"
# Define configurations # Define configurations
config1 = CrawlerRunConfig(url="https://example.com/page1", session_id=session_id) config1 = CrawlerRunConfig(
config2 = CrawlerRunConfig(url="https://example.com/page2", session_id=session_id) url="https://example.com/page1", session_id=session_id
)
config2 = CrawlerRunConfig(
url="https://example.com/page2", session_id=session_id
)
# First request # First request
result1 = await crawler.arun(config=config1) result1 = await crawler.arun(config=config1)
@@ -54,7 +58,9 @@ async def crawl_dynamic_content():
schema = { schema = {
"name": "Commit Extractor", "name": "Commit Extractor",
"baseSelector": "li.Box-sc-g0xbh4-0", "baseSelector": "li.Box-sc-g0xbh4-0",
"fields": [{"name": "title", "selector": "h4.markdown-title", "type": "text"}], "fields": [{
"name": "title", "selector": "h4.markdown-title", "type": "text"
}],
} }
extraction_strategy = JsonCssExtractionStrategy(schema) extraction_strategy = JsonCssExtractionStrategy(schema)
@@ -87,51 +93,146 @@ async def crawl_dynamic_content():
--- ---
#### Session Best Practices ## Example 1: Basic Session-Based Crawling
1. **Descriptive Session IDs**: A simple example using session-based crawling:
Use meaningful names for session IDs to organize workflows:
```python
session_id = "login_flow_session"
session_id = "product_catalog_session"
```
2. **Resource Management**: ```python
Always ensure sessions are cleaned up to free resources: import asyncio
```python from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig
try: from crawl4ai.cache_context import CacheMode
# Your crawling code here
pass
finally:
await crawler.crawler_strategy.kill_session(session_id)
```
3. **State Maintenance**: async def basic_session_crawl():
Reuse the session for subsequent actions within the same workflow: async with AsyncWebCrawler() as crawler:
```python session_id = "dynamic_content_session"
# Step 1: Login url = "https://example.com/dynamic-content"
login_config = CrawlerRunConfig(
url="https://example.com/login",
session_id=session_id,
js_code="document.querySelector('form').submit();"
)
await crawler.arun(config=login_config)
# Step 2: Verify login success for page in range(3):
dashboard_config = CrawlerRunConfig( config = CrawlerRunConfig(
url="https://example.com/dashboard", url=url,
session_id=session_id, session_id=session_id,
wait_for="css:.user-profile" # Wait for authenticated content js_code="document.querySelector('.load-more-button').click();" if page > 0 else None,
) css_selector=".content-item",
result = await crawler.arun(config=dashboard_config) cache_mode=CacheMode.BYPASS
``` )
result = await crawler.arun(config=config)
print(f"Page {page + 1}: Found {result.extracted_content.count('.content-item')} items")
await crawler.crawler_strategy.kill_session(session_id)
asyncio.run(basic_session_crawl())
```
This example shows:
1. Reusing the same `session_id` across multiple requests.
2. Executing JavaScript to load more content dynamically.
3. Properly closing the session to free resources.
---
## Advanced Technique 1: Custom Execution Hooks
> Warning: You might feel confused by the end of the next few examples 😅, so make sure you are comfortable with the order of the parts before you start this.
Use custom hooks to handle complex scenarios, such as waiting for content to load dynamically:
```python
async def advanced_session_crawl_with_hooks():
    first_commit = ""

    async def on_execution_started(page):
        nonlocal first_commit
        try:
            # Poll until the first commit title differs from the previous page's
            while True:
                await page.wait_for_selector("li.commit-item h4")
                commit = await page.query_selector("li.commit-item h4")
                # Await the evaluation first, then strip the resulting string
                commit = (await commit.evaluate("(element) => element.textContent")).strip()
                if commit and commit != first_commit:
                    first_commit = commit
                    break
                await asyncio.sleep(0.5)
        except Exception as e:
            print(f"Warning: New content didn't appear: {e}")

    async with AsyncWebCrawler() as crawler:
        session_id = "commit_session"
        url = "https://github.com/example/repo/commits/main"
        crawler.crawler_strategy.set_hook("on_execution_started", on_execution_started)
        js_next_page = """document.querySelector('a.pagination-next').click();"""

        for page in range(3):
            config = CrawlerRunConfig(
                url=url,
                session_id=session_id,
                js_code=js_next_page if page > 0 else None,
                css_selector="li.commit-item",
                js_only=page > 0,
                cache_mode=CacheMode.BYPASS
            )
            result = await crawler.arun(config=config)
            print(f"Page {page + 1}: Found {len(result.extracted_content)} commits")

        await crawler.crawler_strategy.kill_session(session_id)

asyncio.run(advanced_session_crawl_with_hooks())
```
This technique ensures new content loads before the next action.
---
## Advanced Technique 2: Integrated JavaScript Execution and Waiting
Combine JavaScript execution and waiting logic for concise handling of dynamic content:
```python
async def integrated_js_and_wait_crawl():
    async with AsyncWebCrawler() as crawler:
        session_id = "integrated_session"
        url = "https://github.com/example/repo/commits/main"

        js_next_page_and_wait = """
        (async () => {
            const getCurrentCommit = () => document.querySelector('li.commit-item h4').textContent.trim();
            const initialCommit = getCurrentCommit();
            document.querySelector('a.pagination-next').click();
            while (getCurrentCommit() === initialCommit) {
                await new Promise(resolve => setTimeout(resolve, 100));
            }
        })();
        """

        for page in range(3):
            config = CrawlerRunConfig(
                url=url,
                session_id=session_id,
                js_code=js_next_page_and_wait if page > 0 else None,
                css_selector="li.commit-item",
                js_only=page > 0,
                cache_mode=CacheMode.BYPASS
            )
            result = await crawler.arun(config=config)
            print(f"Page {page + 1}: Found {len(result.extracted_content)} commits")

        await crawler.crawler_strategy.kill_session(session_id)

asyncio.run(integrated_js_and_wait_crawl())
```
--- ---
#### Common Use Cases for Sessions #### Common Use Cases for Sessions
1. **Authentication Flows**: Login and interact with secured pages. 1. **Authentication Flows**: Login and interact with secured pages.
2. **Pagination Handling**: Navigate through multiple pages.
3. **Form Submissions**: Fill forms, submit, and process results. 2. **Pagination Handling**: Navigate through multiple pages.
4. **Multi-step Processes**: Complete workflows that span multiple actions.
5. **Dynamic Content Navigation**: Handle JavaScript-rendered or event-triggered content. 3. **Form Submissions**: Fill forms, submit, and process results.
4. **Multi-step Processes**: Complete workflows that span multiple actions.
5. **Dynamic Content Navigation**: Handle JavaScript-rendered or event-triggered content.


@@ -0,0 +1,179 @@
# `SSLCertificate` Reference
The **`SSLCertificate`** class encapsulates an SSL certificate's data and allows exporting it in various formats (PEM, DER, JSON, or text). It's used within **Crawl4AI** whenever you set **`fetch_ssl_certificate=True`** in your **`CrawlerRunConfig`**.
## 1. Overview
**Location**: `crawl4ai/ssl_certificate.py`
```python
class SSLCertificate:
    """
    Represents an SSL certificate with methods to export in various formats.

    Main Methods:
    - from_url(url, timeout=10)
    - from_file(file_path)
    - from_binary(binary_data)
    - to_json(filepath=None)
    - to_pem(filepath=None)
    - to_der(filepath=None)
    ...

    Common Properties:
    - issuer
    - subject
    - valid_from
    - valid_until
    - fingerprint
    """
```
### Typical Use Case
1. You **enable** certificate fetching in your crawl by:
```python
CrawlerRunConfig(fetch_ssl_certificate=True, ...)
```
2. After `arun()`, if `result.ssl_certificate` is present, it's an instance of **`SSLCertificate`**.
3. You can **read** basic properties (issuer, subject, validity) or **export** them in multiple formats.
---
## 2. Construction & Fetching
### 2.1 **`from_url(url, timeout=10)`**
Manually load an SSL certificate from a given URL (port 443). Typically used internally, but you can call it directly if you want:
```python
cert = SSLCertificate.from_url("https://example.com")
if cert:
    print("Fingerprint:", cert.fingerprint)
```
### 2.2 **`from_file(file_path)`**
Load from a file containing certificate data in ASN.1 or DER. Rarely needed unless you have local cert files:
```python
cert = SSLCertificate.from_file("/path/to/cert.der")
```
### 2.3 **`from_binary(binary_data)`**
Initialize from raw binary. E.g., if you captured it from a socket or another source:
```python
cert = SSLCertificate.from_binary(raw_bytes)
```
---
## 3. Common Properties
After obtaining a **`SSLCertificate`** instance (e.g. `result.ssl_certificate` from a crawl), you can read:
1. **`issuer`** *(dict)*
- E.g. `{"CN": "My Root CA", "O": "..."}`
2. **`subject`** *(dict)*
- E.g. `{"CN": "example.com", "O": "ExampleOrg"}`
3. **`valid_from`** *(str)*
- NotBefore date/time. Often in ASN.1/UTC format.
4. **`valid_until`** *(str)*
- NotAfter date/time.
5. **`fingerprint`** *(str)*
- The SHA-256 digest (lowercase hex).
- E.g. `"d14d2e..."`
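
Because `valid_from` and `valid_until` are strings, checking expiry needs a parsing step. The sketch below assumes the ASN.1 GeneralizedTime layout `YYYYMMDDHHMMSSZ` (for example `20250130174958Z`); that format is an assumption, so confirm it against the values your certificates actually return. `parse_asn1_time` and `days_remaining` are hypothetical helpers, not `SSLCertificate` methods.

```python
from datetime import datetime, timezone


def parse_asn1_time(value: str) -> datetime:
    """Parse a string such as '20250130174958Z' into an aware UTC datetime.

    Assumes the ASN.1 GeneralizedTime layout; other layouts raise ValueError.
    """
    return datetime.strptime(value, "%Y%m%d%H%M%SZ").replace(tzinfo=timezone.utc)


def days_remaining(valid_until: str, now: datetime) -> int:
    """Whole days from `now` until the certificate's NotAfter time."""
    return (parse_asn1_time(valid_until) - now).days


# Fixed reference time so the example is reproducible
now = datetime(2025, 1, 30, tzinfo=timezone.utc)
remaining = days_remaining("20250301000000Z", now)
parsed = parse_asn1_time("20250130174958Z")
```

In a crawl, `days_remaining(cert.valid_until, datetime.now(timezone.utc))` would give a simple expiry check under the same format assumption.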
---
## 4. Export Methods
Once you have a **`SSLCertificate`** object, you can **export** or **inspect** it:
### 4.1 **`to_json(filepath=None)` → `Optional[str]`**
- Returns a JSON string containing the parsed certificate fields.
- If `filepath` is provided, saves it to disk instead, returning `None`.
**Usage**:
```python
json_data = cert.to_json() # returns JSON string
cert.to_json("certificate.json") # writes file, returns None
```
### 4.2 **`to_pem(filepath=None)` → `Optional[str]`**
- Returns a PEM-encoded string (common for web servers).
- If `filepath` is provided, saves it to disk instead.
```python
pem_str = cert.to_pem() # in-memory PEM string
cert.to_pem("/path/to/cert.pem") # saved to file
```
### 4.3 **`to_der(filepath=None)` → `Optional[bytes]`**
- Returns the original DER (binary ASN.1) bytes.
- If `filepath` is specified, writes the bytes there instead.
```python
der_bytes = cert.to_der()
cert.to_der("certificate.der")
```
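
Once you have DER bytes, the standard library can reproduce two of the values described in this reference. This is a sketch under one assumption: that `fingerprint` is the SHA-256 digest of the DER encoding, which is the usual convention but is not confirmed here. The helper names are hypothetical, and the sample bytes are a placeholder rather than a real certificate.

```python
import hashlib
import ssl


def sha256_fingerprint(der_bytes: bytes) -> str:
    """Lowercase hex SHA-256 digest of DER-encoded certificate bytes."""
    return hashlib.sha256(der_bytes).hexdigest()


def der_to_pem(der_bytes: bytes) -> str:
    """Wrap DER bytes in PEM armor using the standard library."""
    return ssl.DER_cert_to_PEM_cert(der_bytes)


# Placeholder bytes, not a real certificate; the helpers only hash and re-encode.
sample_der = b"not-a-real-certificate"
fingerprint = sha256_fingerprint(sample_der)
pem_text = der_to_pem(sample_der)
round_trip = ssl.PEM_cert_to_DER_cert(pem_text)
```

With a real certificate you would pass `cert.to_der()` instead of `sample_der` and compare the result to `cert.fingerprint`.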
### 4.4 (Optional) **`export_as_text()`**
- If you see a method like `export_as_text()`, it typically returns an OpenSSL-style textual representation.
- Not always needed, but can help for debugging or manual inspection.
---
## 5. Example Usage in Crawl4AI
Below is a minimal example in which the crawler fetches a site's SSL certificate, reads its basic fields, and exports it in three formats:
```python
import asyncio
import os
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode

async def main():
    tmp_dir = "tmp"
    os.makedirs(tmp_dir, exist_ok=True)

    config = CrawlerRunConfig(
        fetch_ssl_certificate=True,
        cache_mode=CacheMode.BYPASS
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com", config=config)
        if result.success and result.ssl_certificate:
            cert = result.ssl_certificate
            # 1. Basic Info
            print("Issuer CN:", cert.issuer.get("CN", ""))
            print("Valid until:", cert.valid_until)
            print("Fingerprint:", cert.fingerprint)
            # 2. Export
            cert.to_json(os.path.join(tmp_dir, "certificate.json"))
            cert.to_pem(os.path.join(tmp_dir, "certificate.pem"))
            cert.to_der(os.path.join(tmp_dir, "certificate.der"))

if __name__ == "__main__":
    asyncio.run(main())
```
---
## 6. Notes & Best Practices
1. **Timeout**: `SSLCertificate.from_url` uses a default **10s** timeout for the socket connection, then wraps the socket with SSL.
2. **Binary Form**: The certificate is loaded in ASN.1 (DER) form, then re-parsed by `OpenSSL.crypto`.
3. **Validation**: This does **not** validate the certificate chain or trust store. It only fetches and parses.
4. **Integration**: Within Crawl4AI, you typically just set `fetch_ssl_certificate=True` in `CrawlerRunConfig`; the final result's `ssl_certificate` is built automatically.
5. **Export**: If you need to store or analyze a cert, `to_json` and `to_pem` are the most broadly usable formats.
---
### Summary
- **`SSLCertificate`** is a convenience class for capturing and exporting the **TLS certificate** from your crawled site(s).
- Common usage is in the **`CrawlResult.ssl_certificate`** field, accessible after setting `fetch_ssl_certificate=True`.
- Offers quick access to essential certificate details (`issuer`, `subject`, `fingerprint`) and is easy to export (PEM, DER, JSON) for further analysis or server usage.
Use it whenever you need **insight** into a site's certificate or require some form of cryptographic or compliance check.


@@ -1,244 +1,305 @@
# Complete Parameter Guide for arun() # `arun()` Parameter Guide (New Approach)
The following parameters can be passed to the `arun()` method. They are organized by their primary usage context and functionality. In Crawl4AI's **latest** configuration model, nearly all parameters that once went directly to `arun()` are now part of **`CrawlerRunConfig`**. When calling `arun()`, you provide:
## Core Parameters
```python ```python
await crawler.arun( await crawler.arun(
url="https://example.com", # Required: URL to crawl url="https://example.com",
verbose=True, # Enable detailed logging config=my_run_config
cache_mode=CacheMode.ENABLED, # Control cache behavior
warmup=True # Whether to run warmup check
) )
``` ```
## Cache Control Below is an organized look at the parameters that can go inside `CrawlerRunConfig`, divided by their functional areas. For **Browser** settings (e.g., `headless`, `browser_type`), see [BrowserConfig](./parameters.md).
---
## 1. Core Usage
```python ```python
from crawl4ai import CacheMode from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
await crawler.arun( async def main():
cache_mode=CacheMode.ENABLED, # Normal caching (read/write) run_config = CrawlerRunConfig(
# Other cache modes: verbose=True, # Detailed logging
# cache_mode=CacheMode.DISABLED # No caching at all cache_mode=CacheMode.ENABLED, # Use normal read/write cache
# cache_mode=CacheMode.READ_ONLY # Only read from cache check_robots_txt=True, # Respect robots.txt rules
# cache_mode=CacheMode.WRITE_ONLY # Only write to cache # ... other parameters
# cache_mode=CacheMode.BYPASS # Skip cache for this operation )
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(
url="https://example.com",
config=run_config
)
# Check if blocked by robots.txt
if not result.success and result.status_code == 403:
print(f"Error: {result.error_message}")
```
**Key Fields**:
- `verbose=True` logs each crawl step.
- `cache_mode` decides how to read/write the local crawl cache.
---
## 2. Cache Control
**`cache_mode`** (default: `CacheMode.ENABLED`)
Use a built-in enum from `CacheMode`:
- `ENABLED`: Normal caching—reads if available, writes if missing.
- `DISABLED`: No caching—always refetch pages.
- `READ_ONLY`: Reads from cache only; no new writes.
- `WRITE_ONLY`: Writes to cache but doesn't read existing data.
- `BYPASS`: Skips reading cache for this crawl (though it might still write if set up that way).
```python
run_config = CrawlerRunConfig(
cache_mode=CacheMode.BYPASS
) )
``` ```
## Content Processing Parameters **Additional flags**:
- `bypass_cache=True` acts like `CacheMode.BYPASS`.
- `disable_cache=True` acts like `CacheMode.DISABLED`.
- `no_cache_read=True` acts like `CacheMode.WRITE_ONLY`.
- `no_cache_write=True` acts like `CacheMode.READ_ONLY`.
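
The read/write behavior of the five modes can be summarized as a small truth table. The sketch below uses a local stand-in enum, `CacheModeSketch`, rather than the library's `CacheMode`, and it encodes `BYPASS` as "skip the read, still allow a write", which follows the hedged description above and may not match every setup.

```python
from enum import Enum


class CacheModeSketch(Enum):
    """Local stand-in for the documented modes; not the library's enum."""
    ENABLED = "enabled"
    DISABLED = "disabled"
    READ_ONLY = "read_only"
    WRITE_ONLY = "write_only"
    BYPASS = "bypass"


def should_read(mode: CacheModeSketch) -> bool:
    """True when a cached copy may be served for this crawl."""
    return mode in (CacheModeSketch.ENABLED, CacheModeSketch.READ_ONLY)


def should_write(mode: CacheModeSketch) -> bool:
    """True when the fresh result may be stored after this crawl."""
    # Assumption: BYPASS skips only the read, as the list above suggests.
    return mode in (
        CacheModeSketch.ENABLED,
        CacheModeSketch.WRITE_ONLY,
        CacheModeSketch.BYPASS,
    )


# Map each mode name to its (read, write) pair
table = {m.name: (should_read(m), should_write(m)) for m in CacheModeSketch}
```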
---
## 3. Content Processing & Selection
### 3.1 Text Processing
### Text Processing
```python ```python
await crawler.arun( run_config = CrawlerRunConfig(
word_count_threshold=10, # Minimum words per content block word_count_threshold=10, # Ignore text blocks <10 words
image_description_min_word_threshold=5, # Minimum words for image descriptions only_text=False, # If True, tries to remove non-text elements
only_text=False, # Extract only text content keep_data_attributes=False # Keep or discard data-* attributes
excluded_tags=['form', 'nav'], # HTML tags to exclude
keep_data_attributes=False, # Preserve data-* attributes
) )
``` ```
### Content Selection ### 3.2 Content Selection
```python ```python
await crawler.arun( run_config = CrawlerRunConfig(
css_selector=".main-content", # CSS selector for content extraction css_selector=".main-content", # Focus on .main-content region only
remove_forms=True, # Remove all form elements excluded_tags=["form", "nav"], # Remove entire tag blocks
remove_overlay_elements=True, # Remove popups/modals/overlays remove_forms=True, # Specifically strip <form> elements
remove_overlay_elements=True, # Attempt to remove modals/popups
) )
``` ```
### Link Handling ### 3.3 Link Handling
```python ```python
await crawler.arun( run_config = CrawlerRunConfig(
exclude_external_links=True, # Remove external links exclude_external_links=True, # Remove external links from final content
exclude_social_media_links=True, # Remove social media links exclude_social_media_links=True, # Remove links to known social sites
exclude_external_images=True, # Remove external images exclude_domains=["ads.example.com"], # Exclude links to these domains
exclude_domains=["ads.example.com"], # Specific domains to exclude exclude_social_media_domains=["facebook.com","twitter.com"], # Extend the default list
social_media_domains=[ # Additional social media domains
"facebook.com",
"twitter.com",
"instagram.com"
]
) )
``` ```
## Browser Control Parameters ### 3.4 Media Filtering
### Basic Browser Settings
```python ```python
await crawler.arun( run_config = CrawlerRunConfig(
headless=True, # Run browser in headless mode exclude_external_images=True # Strip images from other domains
browser_type="chromium", # Browser engine: "chromium", "firefox", "webkit"
page_timeout=60000, # Page load timeout in milliseconds
user_agent="custom-agent", # Custom user agent
) )
``` ```
### Navigation and Waiting ---
## 4. Page Navigation & Timing
### 4.1 Basic Browser Flow
```python ```python
await crawler.arun( run_config = CrawlerRunConfig(
wait_for="css:.dynamic-content", # Wait for element/condition wait_for="css:.dynamic-content", # Wait for .dynamic-content
delay_before_return_html=2.0, # Wait before returning HTML (seconds) delay_before_return_html=2.0, # Wait 2s before capturing final HTML
page_timeout=60000, # Navigation & script timeout (ms)
) )
``` ```
### JavaScript Execution **Key Fields**:
- `wait_for`:
- `"css:selector"` or
- `"js:() => boolean"`
e.g. `js:() => document.querySelectorAll('.item').length > 10`.
- `mean_delay` & `max_range`: define random delays for `arun_many()` calls.
- `semaphore_count`: concurrency limit when crawling multiple URLs.
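
To illustrate what a concurrency limit and randomized delays do together, here is a minimal standard-library sketch. It is not Crawl4AI's implementation: `crawl_all` and `fake_fetch` are hypothetical names, and the delay formula (`mean_delay` plus a random amount up to `max_range`) is an assumption about how the two fields combine.

```python
import asyncio
import random


async def crawl_all(urls, fetch, semaphore_count=2, mean_delay=0.01, max_range=0.01):
    """Run `fetch(url)` for every URL with bounded concurrency and jittered delays.

    Assumption: each task waits `mean_delay` plus a random amount up to
    `max_range`; the library's exact formula may differ.
    """
    semaphore = asyncio.Semaphore(semaphore_count)

    async def worker(url):
        async with semaphore:  # at most `semaphore_count` workers pass this point
            await asyncio.sleep(mean_delay + random.uniform(0, max_range))
            return await fetch(url)

    # gather() returns results in the same order as the input URLs
    return await asyncio.gather(*(worker(u) for u in urls))


stats = {"active": 0, "peak": 0}


async def fake_fetch(url):
    """Stand-in for a real crawl; records how many fetches overlap."""
    stats["active"] += 1
    stats["peak"] = max(stats["peak"], stats["active"])
    await asyncio.sleep(0.01)
    stats["active"] -= 1
    return url.upper()


urls = [f"https://example.com/{i}" for i in range(6)]
results = asyncio.run(crawl_all(urls, fake_fetch))
```

Lowering `semaphore_count` trades throughput for gentler load on the target site and on your own machine.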
### 4.2 JavaScript Execution
```python ```python
await crawler.arun( run_config = CrawlerRunConfig(
js_code=[ # JavaScript to execute (string or list) js_code=[
"window.scrollTo(0, document.body.scrollHeight);", "window.scrollTo(0, document.body.scrollHeight);",
"document.querySelector('.load-more').click();" "document.querySelector('.load-more')?.click();"
], ],
js_only=False, # Only execute JavaScript without reloading page js_only=False
) )
``` ```
### Anti-Bot Features - `js_code` can be a single string or a list of strings.
- `js_only=True` means “I'm continuing in the same session with new JS steps, no new full navigation.”
### 4.3 Anti-Bot
```python ```python
await crawler.arun( run_config = CrawlerRunConfig(
magic=True, # Enable all anti-detection features magic=True,
simulate_user=True, # Simulate human behavior simulate_user=True,
override_navigator=True # Override navigator properties override_navigator=True
)
```
- `magic=True` tries multiple stealth features.
- `simulate_user=True` mimics mouse movements or random delays.
- `override_navigator=True` fakes some navigator properties (like user agent checks).
---
## 5. Session Management
**`session_id`**:
```python
run_config = CrawlerRunConfig(
session_id="my_session123"
)
```
If re-used in subsequent `arun()` calls, the same tab/page context is continued (helpful for multi-step tasks or stateful browsing).
---
## 6. Screenshot, PDF & Media Options
```python
run_config = CrawlerRunConfig(
screenshot=True, # Grab a screenshot as base64
screenshot_wait_for=1.0, # Wait 1s before capturing
pdf=True, # Also produce a PDF
image_description_min_word_threshold=5, # If analyzing alt text
image_score_threshold=3, # Filter out low-score images
)
```
**Where they appear**:
- `result.screenshot` → Base64 screenshot string.
- `result.pdf` → Byte array with PDF data.
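
Since the screenshot arrives as a base64 string and the PDF as bytes, persisting them takes one decode step. The sketch below shows one way to write both to disk. `save_media` is a hypothetical helper, and the file names and the `.png` extension are assumptions, not something the library dictates.

```python
import base64
import tempfile
from pathlib import Path
from typing import List, Optional


def save_media(screenshot_b64: Optional[str], pdf_bytes: Optional[bytes], out_dir: str) -> List[Path]:
    """Write whichever artifacts are present and return the paths written."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    if screenshot_b64:
        path = out / "page.png"  # file name and image format are assumptions
        path.write_bytes(base64.b64decode(screenshot_b64))
        written.append(path)
    if pdf_bytes:
        path = out / "page.pdf"
        path.write_bytes(pdf_bytes)
        written.append(path)
    return written


# Demo with placeholder data instead of a real crawl result
with tempfile.TemporaryDirectory() as tmp:
    fake_png = base64.b64encode(b"fake image bytes").decode("ascii")
    paths = save_media(fake_png, b"%PDF-1.4 fake", tmp)
    names = sorted(p.name for p in paths)
    decoded = (Path(tmp) / "page.png").read_bytes()
```

After a crawl, the call would look like `save_media(result.screenshot, result.pdf, "output")`.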
---
## 7. Extraction Strategy
**For advanced data extraction** (CSS/LLM-based), set `extraction_strategy`:
```python
run_config = CrawlerRunConfig(
extraction_strategy=my_css_or_llm_strategy
) )
``` ```
### Session Management The extracted data will appear in `result.extracted_content`.
```python
await crawler.arun(
session_id="my_session", # Session identifier for persistent browsing
)
```
### Screenshot Options ---
```python
await crawler.arun( ## 8. Comprehensive Example
screenshot=True, # Take page screenshot
screenshot_wait_for=2.0, # Wait before screenshot (seconds) Below is a snippet combining many parameters:
)
```
### Proxy Configuration
```python ```python
await crawler.arun( import asyncio
proxy="http://proxy.example.com:8080", # Simple proxy URL from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
proxy_config={ # Advanced proxy settings from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
"server": "http://proxy.example.com:8080",
"username": "user", async def main():
"password": "pass" # Example schema
schema = {
"name": "Articles",
"baseSelector": "article.post",
"fields": [
{"name": "title", "selector": "h2", "type": "text"},
{"name": "link", "selector": "a", "type": "attribute", "attribute": "href"}
]
} }
)
```
## Content Extraction Parameters run_config = CrawlerRunConfig(
# Core
verbose=True,
cache_mode=CacheMode.ENABLED,
check_robots_txt=True, # Respect robots.txt rules
### Extraction Strategy # Content
```python word_count_threshold=10,
await crawler.arun( css_selector="main.content",
extraction_strategy=LLMExtractionStrategy( excluded_tags=["nav", "footer"],
provider="ollama/llama2", exclude_external_links=True,
schema=MySchema.schema(),
instruction="Extract specific data" # Page & JS
js_code="document.querySelector('.show-more')?.click();",
wait_for="css:.loaded-block",
page_timeout=30000,
# Extraction
extraction_strategy=JsonCssExtractionStrategy(schema),
# Session
session_id="persistent_session",
# Media
screenshot=True,
pdf=True,
# Anti-bot
simulate_user=True,
magic=True,
) )
)
async with AsyncWebCrawler() as crawler:
result = await crawler.arun("https://example.com/posts", config=run_config)
if result.success:
print("HTML length:", len(result.cleaned_html))
print("Extraction JSON:", result.extracted_content)
if result.screenshot:
print("Screenshot length:", len(result.screenshot))
if result.pdf:
print("PDF bytes length:", len(result.pdf))
else:
print("Error:", result.error_message)
if __name__ == "__main__":
asyncio.run(main())
``` ```
### Chunking Strategy **What we covered**:
```python 1. **Crawling** the main content region, ignoring external links.
await crawler.arun( 2. Running **JavaScript** to click “.show-more”.
chunking_strategy=RegexChunking( 3. **Waiting** for “.loaded-block” to appear.
patterns=[r'\n\n', r'\.\s+'] 4. Generating a **screenshot** & **PDF** of the final page.
) 5. Extracting repeated “article.post” elements with a **CSS-based** extraction strategy.
)
```
### HTML to Text Options ---
```python
await crawler.arun(
html2text={
"ignore_links": False,
"ignore_images": False,
"escape_dot": False,
"body_width": 0,
"protect_links": True,
"unicode_snob": True
}
)
```
## Debug Options ## 9. Best Practices
```python
await crawler.arun(
log_console=True, # Log browser console messages
)
```
## Parameter Interactions and Notes 1. **Use `BrowserConfig` for global browser** settings (headless, user agent).
2. **Use `CrawlerRunConfig`** to handle the **specific** crawl needs: content filtering, caching, JS, screenshot, extraction, etc.
3. Keep your **parameters consistent** across run configs, especially in a large codebase with multiple crawls.
4. **Limit** concurrency (`semaphore_count`) if the site or your system can't handle it.
5. For dynamic pages, set `js_code` or `scan_full_page` so you load all content.
---
## 10. Conclusion
All parameters that used to be direct arguments to `arun()` now belong in **`CrawlerRunConfig`**. This approach:
- Makes code **clearer** and **more maintainable**.
- Minimizes confusion about which arguments affect global vs. per-crawl behavior.
- Allows you to create **reusable** config objects for different pages or tasks.
For a **full** reference, check out the [CrawlerRunConfig Docs](./parameters.md).
Happy crawling with your **structured, flexible** config approach!

1. **Cache and Performance Setup**
```python
# Optimal caching for repeated crawls
await crawler.arun(
    cache_mode=CacheMode.ENABLED,
    word_count_threshold=10,
    process_iframes=False
)
```
2. **Dynamic Content Handling**
```python
# Handle lazy-loaded content
await crawler.arun(
    js_code="window.scrollTo(0, document.body.scrollHeight);",
    wait_for="css:.lazy-content",
    delay_before_return_html=2.0,
    cache_mode=CacheMode.WRITE_ONLY  # Cache results after dynamic load
)
```
3. **Content Extraction Pipeline**
```python
# Complete extraction setup
await crawler.arun(
    css_selector=".main-content",
    word_count_threshold=20,
    extraction_strategy=my_strategy,
    chunking_strategy=my_chunking,
    process_iframes=True,
    remove_overlay_elements=True,
    cache_mode=CacheMode.ENABLED
)
```
## Best Practices
1. **Performance Optimization**
```python
await crawler.arun(
    cache_mode=CacheMode.ENABLED,  # Use full caching
    word_count_threshold=10,       # Filter out noise
    process_iframes=False          # Skip iframes if not needed
)
```
2. **Reliable Scraping**
```python
await crawler.arun(
    magic=True,                      # Enable anti-detection
    delay_before_return_html=1.0,    # Wait for dynamic content
    page_timeout=60000,              # Longer timeout for slow pages
    cache_mode=CacheMode.WRITE_ONLY  # Cache results after successful crawl
)
```
3. **Clean Content**
```python
await crawler.arun(
    remove_overlay_elements=True,    # Remove popups
    excluded_tags=['nav', 'aside'],  # Remove unnecessary elements
    keep_data_attributes=False,      # Remove data attributes
    cache_mode=CacheMode.ENABLED     # Use cache for faster processing
)
```

124
docs/md_v2/api/arun_many.md Normal file

@@ -0,0 +1,124 @@
# `arun_many(...)` Reference
> **Note**: This function is very similar to [`arun()`](./arun.md) but focused on **concurrent** or **batch** crawling. If you're unfamiliar with `arun()` usage, please read that doc first, then review this one for the differences.
## Function Signature
```python
async def arun_many(
    urls: Union[List[str], List[Any]],
    config: Optional[CrawlerRunConfig] = None,
    dispatcher: Optional[BaseDispatcher] = None,
    ...
) -> Union[List[CrawlResult], AsyncGenerator[CrawlResult, None]]:
    """
    Crawl multiple URLs concurrently or in batches.

    :param urls: A list of URLs (or tasks) to crawl.
    :param config: (Optional) A default `CrawlerRunConfig` applying to each crawl.
    :param dispatcher: (Optional) A concurrency controller (e.g. MemoryAdaptiveDispatcher).
    ...
    :return: Either a list of `CrawlResult` objects, or an async generator if streaming is enabled.
    """
```
## Differences from `arun()`
1. **Multiple URLs**:
   - Instead of crawling a single URL, you pass a list of them (strings or tasks).
   - The function returns either a **list** of `CrawlResult` or an **async generator** if streaming is enabled.
2. **Concurrency & Dispatchers**:
   - **`dispatcher`** param allows advanced concurrency control.
   - If omitted, a default dispatcher (like `MemoryAdaptiveDispatcher`) is used internally.
   - Dispatchers handle concurrency, rate limiting, and memory-based adaptive throttling (see [Multi-URL Crawling](../advanced/multi-url-crawling.md)).
3. **Streaming Support**:
   - Enable streaming by setting `stream=True` in your `CrawlerRunConfig`.
   - When streaming, use `async for` to process results as they become available.
   - Ideal for processing large numbers of URLs without waiting for all to complete.
4. **Parallel Execution**:
   - `arun_many()` can run multiple requests concurrently under the hood.
   - Each `CrawlResult` might also include a **`dispatch_result`** with concurrency details (like memory usage, start/end times).
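
The batch versus streaming distinction above can be illustrated without the library. The following is a minimal sketch using only `asyncio`; `fake_crawl`, `run_batch`, and `run_stream` are hypothetical stand-ins rather than Crawl4AI functions, and the sketch makes no claim about how `arun_many()` is implemented internally.

```python
import asyncio
from typing import AsyncGenerator, Dict, List


# Hypothetical stand-in for one crawl: sleeps, then returns a tiny result dict.
async def fake_crawl(url: str, delay: float) -> Dict[str, object]:
    await asyncio.sleep(delay)
    return {"url": url, "success": True}


# Batch mode: wait for every task, then hand back one list (input order kept).
async def run_batch(jobs: Dict[str, float]) -> List[Dict[str, object]]:
    return list(await asyncio.gather(*(fake_crawl(u, d) for u, d in jobs.items())))


# Streaming mode: yield each result as soon as its task finishes.
async def run_stream(jobs: Dict[str, float]) -> AsyncGenerator[Dict[str, object], None]:
    tasks = [asyncio.create_task(fake_crawl(u, d)) for u, d in jobs.items()]
    for finished in asyncio.as_completed(tasks):
        yield await finished


async def demo() -> None:
    # Assumed delays, chosen so the slow URL is listed first.
    jobs = {"https://slow.example": 0.10, "https://fast.example": 0.01}
    batch = await run_batch(jobs)
    print("batch order: ", [r["url"] for r in batch])
    streamed = [r["url"] async for r in run_stream(jobs)]
    print("stream order:", streamed)


if __name__ == "__main__":
    asyncio.run(demo())
```

In this sketch the batch list follows the input order, while the stream follows completion order, which is why streaming suits large URL lists where early results should be processed right away.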
### Basic Example (Batch Mode)
```python
# Minimal usage: The default dispatcher will be used
results = await crawler.arun_many(
    urls=["https://site1.com", "https://site2.com"],
    config=CrawlerRunConfig(stream=False)  # Default behavior
)
for res in results:
    if res.success:
        print(res.url, "crawled OK!")
    else:
        print("Failed:", res.url, "-", res.error_message)
```
### Streaming Example
```python
config = CrawlerRunConfig(
    stream=True,  # Enable streaming mode
    cache_mode=CacheMode.BYPASS
)
# Process results as they complete
async for result in await crawler.arun_many(
    urls=["https://site1.com", "https://site2.com", "https://site3.com"],
    config=config
):
    if result.success:
        print(f"Just completed: {result.url}")
        # Process each result immediately
        process_result(result)
```
### With a Custom Dispatcher
```python
dispatcher = MemoryAdaptiveDispatcher(
    memory_threshold_percent=70.0,
    max_session_permit=10
)
results = await crawler.arun_many(
    urls=["https://site1.com", "https://site2.com", "https://site3.com"],
    config=my_run_config,
    dispatcher=dispatcher
)
```
**Key Points**:
- Each URL is processed by the same or separate sessions, depending on the dispatcher's strategy.
- `dispatch_result` in each `CrawlResult` (if using concurrency) can hold memory and timing info.
- If you need to handle authentication or session IDs, pass them in each individual task or within your run config.
### Return Value
Either a **list** of [`CrawlResult`](./crawl-result.md) objects, or an **async generator** if streaming is enabled. You can iterate to check `result.success` or read each item's `extracted_content`, `markdown`, or `dispatch_result`.
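
As a rough illustration of consuming a batch-mode return value, the sketch below partitions results by their `success` flag. `FakeResult` is a hypothetical stand-in that carries only the three fields named in this doc (`url`, `success`, `error_message`); a real `CrawlResult` has more attributes, and `split_results` is an illustrative helper, not part of the library.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple


@dataclass
class FakeResult:
    """Hypothetical stand-in exposing only the fields used below."""
    url: str
    success: bool
    error_message: Optional[str] = None


def split_results(results: Iterable[FakeResult]) -> Tuple[List[FakeResult], List[FakeResult]]:
    """Return (succeeded, failed), preserving the original order within each group."""
    ok: List[FakeResult] = []
    failed: List[FakeResult] = []
    for res in results:
        (ok if res.success else failed).append(res)
    return ok, failed


if __name__ == "__main__":
    sample = [
        FakeResult("https://site1.com", True),
        FakeResult("https://site2.com", False, "timeout"),
    ]
    ok, failed = split_results(sample)
    for res in failed:
        print("Failed:", res.url, "-", res.error_message)
    print("Succeeded:", [res.url for res in ok])
```

Because the helper relies only on attribute access, the same loop should work on any object that exposes `success`, `url`, and `error_message`.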
---
## Dispatcher Reference
- **`MemoryAdaptiveDispatcher`**: Dynamically manages concurrency based on system memory usage.
- **`SemaphoreDispatcher`**: Fixed concurrency limit, simpler but less adaptive.
For advanced usage or custom settings, see [Multi-URL Crawling with Dispatchers](../advanced/multi-url-crawling.md).
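
To make the idea of a fixed concurrency limit concrete, here is a minimal sketch built on `asyncio.Semaphore`. It is not the `SemaphoreDispatcher` implementation; `limited_crawl`, `crawl_with_limit`, and `ConcurrencyTracker` are hypothetical names, and the crawl itself is simulated with a short sleep.

```python
import asyncio
from typing import List, Tuple


class ConcurrencyTracker:
    """Records how many stand-in crawls overlap at any moment."""

    def __init__(self) -> None:
        self.active = 0
        self.peak = 0


async def limited_crawl(url: str, sem: asyncio.Semaphore, tracker: ConcurrencyTracker) -> str:
    # The semaphore admits at most `limit` coroutines into this block at once.
    async with sem:
        tracker.active += 1
        tracker.peak = max(tracker.peak, tracker.active)
        await asyncio.sleep(0.01)  # simulated network wait
        tracker.active -= 1
    return url


async def crawl_with_limit(urls: List[str], limit: int) -> Tuple[List[str], int]:
    """Run all stand-in crawls, returning the finished URLs and the peak overlap."""
    sem = asyncio.Semaphore(limit)
    tracker = ConcurrencyTracker()
    done = await asyncio.gather(*(limited_crawl(u, sem, tracker) for u in urls))
    return list(done), tracker.peak


if __name__ == "__main__":
    urls = [f"https://example.com/page{i}" for i in range(10)]
    finished, peak = asyncio.run(crawl_with_limit(urls, limit=3))
    print(f"{len(finished)} crawls finished, peak concurrency {peak}")
```

A memory-adaptive strategy would adjust the admission decision at runtime instead of using a constant limit, which is the trade-off the two bullets above describe.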
---
## Common Pitfalls
1. **Large Lists**: If you pass thousands of URLs, be mindful of memory or rate-limits. A dispatcher can help.
2. **Session Reuse**: If you need specialized logins or persistent contexts, ensure your dispatcher or tasks handle sessions accordingly.
3. **Error Handling**: Each `CrawlResult` might fail for different reasons—always check `result.success` or the `error_message` before proceeding.
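
For the large-list pitfall, one simple option is to feed URLs in fixed-size batches. The helper below is a hedged sketch: `chunked` is a hypothetical utility, not a Crawl4AI function, and the batch size of 3 is an arbitrary assumption for the demo.

```python
from typing import Iterator, List, Sequence


def chunked(urls: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `urls`, each holding at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(urls), size):
        yield list(urls[start:start + size])


if __name__ == "__main__":
    all_urls = [f"https://example.com/page{i}" for i in range(7)]
    for batch in chunked(all_urls, 3):
        # Each batch could be passed to one arun_many() call in turn.
        print(len(batch), batch[0])
```

Batching keeps the number of pending results bounded, at the cost of waiting for the slowest URL in each batch before the next one starts.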
---
## Conclusion
Use `arun_many()` when you want to **crawl multiple URLs** simultaneously or in controlled parallel tasks. If you need advanced concurrency features (like memory-based adaptive throttling or complex rate-limiting), provide a **dispatcher**. Each result is a standard `CrawlResult`, possibly augmented with concurrency stats (`dispatch_result`) for deeper inspection. For more details on concurrency logic and dispatchers, see the [Advanced Multi-URL Crawling](../advanced/multi-url-crawling.md) docs.


@@ -1,320 +1,331 @@
# AsyncWebCrawler # AsyncWebCrawler
The `AsyncWebCrawler` class is the main interface for web crawling operations. It provides asynchronous web crawling capabilities with extensive configuration options. The **`AsyncWebCrawler`** is the core class for asynchronous web crawling in Crawl4AI. You typically create it **once**, optionally customize it with a **`BrowserConfig`** (e.g., headless, user agent), then **run** multiple **`arun()`** calls with different **`CrawlerRunConfig`** objects.
## Constructor **Recommended usage**:
1. **Create** a `BrowserConfig` for global browser settings.
2. **Instantiate** `AsyncWebCrawler(config=browser_config)`.
3. **Use** the crawler in an async context manager (`async with`) or manage start/close manually.
4. **Call** `arun(url, config=crawler_run_config)` for each page you want.
---
## 1. Constructor Overview
```python ```python
AsyncWebCrawler( class AsyncWebCrawler:
# Browser Settings def __init__(
browser_type: str = "chromium", # Options: "chromium", "firefox", "webkit" self,
headless: bool = True, # Run browser in headless mode crawler_strategy: Optional[AsyncCrawlerStrategy] = None,
verbose: bool = False, # Enable verbose logging config: Optional[BrowserConfig] = None,
always_bypass_cache: bool = False, # deprecated
always_by_pass_cache: Optional[bool] = None, # also deprecated
base_directory: str = ...,
thread_safe: bool = False,
**kwargs,
):
"""
Create an AsyncWebCrawler instance.
# Cache Settings Args:
always_by_pass_cache: bool = False, # Always bypass cache crawler_strategy:
base_directory: str = str(os.getenv("CRAWL4_AI_BASE_DIRECTORY", Path.home())), # Base directory for cache (Advanced) Provide a custom crawler strategy if needed.
config:
A BrowserConfig object specifying how the browser is set up.
always_bypass_cache:
(Deprecated) Use CrawlerRunConfig.cache_mode instead.
base_directory:
Folder for storing caches/logs (if relevant).
thread_safe:
If True, attempts some concurrency safeguards. Usually False.
**kwargs:
Additional legacy or debugging parameters.
"""
)
# Network Settings ### Typical Initialization
proxy: str = None, # Simple proxy URL
proxy_config: Dict = None, # Advanced proxy configuration
# Browser Behavior ```python
sleep_on_close: bool = False, # Wait before closing browser from crawl4ai import AsyncWebCrawler, BrowserConfig
# Custom Settings browser_cfg = BrowserConfig(
user_agent: str = None, # Custom user agent browser_type="chromium",
headers: Dict[str, str] = {}, # Custom HTTP headers headless=True,
js_code: Union[str, List[str]] = None, # Default JavaScript to execute verbose=True
) )
crawler = AsyncWebCrawler(config=browser_cfg)
``` ```
### Parameters in Detail **Notes**:
- **Legacy** parameters like `always_bypass_cache` remain for backward compatibility, but prefer to set **caching** in `CrawlerRunConfig`.
#### Browser Settings ---
- **browser_type** (str, optional) ## 2. Lifecycle: Start/Close or Context Manager
- Default: `"chromium"`
- Options: `"chromium"`, `"firefox"`, `"webkit"`
- Controls which browser engine to use
```python
# Example: Using Firefox
crawler = AsyncWebCrawler(browser_type="firefox")
```
- **headless** (bool, optional) ### 2.1 Context Manager (Recommended)
- Default: `True`
- When `True`, browser runs without GUI
- Set to `False` for debugging
```python
# Visible browser for debugging
crawler = AsyncWebCrawler(headless=False)
```
- **verbose** (bool, optional) ```python
- Default: `False` async with AsyncWebCrawler(config=browser_cfg) as crawler:
- Enables detailed logging result = await crawler.arun("https://example.com")
```python # The crawler automatically starts/closes resources
# Enable detailed logging ```
crawler = AsyncWebCrawler(verbose=True)
```
#### Cache Settings When the `async with` block ends, the crawler cleans up (closes the browser, etc.).
- **always_by_pass_cache** (bool, optional) ### 2.2 Manual Start & Close
- Default: `False`
- When `True`, always fetches fresh content
```python
# Always fetch fresh content
crawler = AsyncWebCrawler(always_by_pass_cache=True)
```
- **base_directory** (str, optional) ```python
- Default: User's home directory crawler = AsyncWebCrawler(config=browser_cfg)
- Base path for cache storage await crawler.start()
```python
# Custom cache directory
crawler = AsyncWebCrawler(base_directory="/path/to/cache")
```
#### Network Settings result1 = await crawler.arun("https://example.com")
result2 = await crawler.arun("https://another.com")
- **proxy** (str, optional) await crawler.close()
- Simple proxy URL ```
```python
# Using simple proxy
crawler = AsyncWebCrawler(proxy="http://proxy.example.com:8080")
```
- **proxy_config** (Dict, optional) Use this style if you have a **long-running** application or need full control of the crawler's lifecycle.
- Advanced proxy configuration with authentication
```python
# Advanced proxy with auth
crawler = AsyncWebCrawler(proxy_config={
"server": "http://proxy.example.com:8080",
"username": "user",
"password": "pass"
})
```
#### Browser Behavior ---
- **sleep_on_close** (bool, optional) ## 3. Primary Method: `arun()`
- Default: `False`
- Adds delay before closing browser
```python
# Wait before closing
crawler = AsyncWebCrawler(sleep_on_close=True)
```
#### Custom Settings
- **user_agent** (str, optional)
- Custom user agent string
```python
# Custom user agent
crawler = AsyncWebCrawler(
user_agent="Mozilla/5.0 (Custom Agent) Chrome/90.0"
)
```
- **headers** (Dict[str, str], optional)
- Custom HTTP headers
```python
# Custom headers
crawler = AsyncWebCrawler(
headers={
"Accept-Language": "en-US",
"Custom-Header": "Value"
}
)
```
- **js_code** (Union[str, List[str]], optional)
- Default JavaScript to execute on each page
```python
# Default JavaScript
crawler = AsyncWebCrawler(
js_code=[
"window.scrollTo(0, document.body.scrollHeight);",
"document.querySelector('.load-more').click();"
]
)
```
## Methods
### arun()
The primary method for crawling web pages.
```python ```python
async def arun( async def arun(
# Required self,
url: str, # URL to crawl url: str,
config: Optional[CrawlerRunConfig] = None,
# Content Selection # Legacy parameters for backward compatibility...
css_selector: str = None, # CSS selector for content
word_count_threshold: int = 10, # Minimum words per block
# Cache Control
bypass_cache: bool = False, # Bypass cache for this request
# Session Management
session_id: str = None, # Session identifier
# Screenshot Options
screenshot: bool = False, # Take screenshot
screenshot_wait_for: float = None, # Wait before screenshot
# Content Processing
process_iframes: bool = False, # Process iframe content
remove_overlay_elements: bool = False, # Remove popups/modals
# Anti-Bot Settings
simulate_user: bool = False, # Simulate human behavior
override_navigator: bool = False, # Override navigator properties
magic: bool = False, # Enable all anti-detection
# Content Filtering
excluded_tags: List[str] = None, # HTML tags to exclude
exclude_external_links: bool = False, # Remove external links
exclude_social_media_links: bool = False, # Remove social media links
# JavaScript Handling
js_code: Union[str, List[str]] = None, # JavaScript to execute
wait_for: str = None, # Wait condition
# Page Loading
page_timeout: int = 60000, # Page load timeout (ms)
delay_before_return_html: float = None, # Wait before return
# Extraction
extraction_strategy: ExtractionStrategy = None # Extraction strategy
) -> CrawlResult: ) -> CrawlResult:
...
``` ```
### Usage Examples ### 3.1 New Approach
You pass a `CrawlerRunConfig` object that sets up everything about a crawl—content filtering, caching, session reuse, JS code, screenshots, etc.
#### Basic Crawling
```python ```python
async with AsyncWebCrawler() as crawler: import asyncio
result = await crawler.arun(url="https://example.com") from crawl4ai import CrawlerRunConfig, CacheMode
run_cfg = CrawlerRunConfig(
cache_mode=CacheMode.BYPASS,
css_selector="main.article",
word_count_threshold=10,
screenshot=True
)
async with AsyncWebCrawler(config=browser_cfg) as crawler:
result = await crawler.arun("https://example.com/news", config=run_cfg)
print("Crawled HTML length:", len(result.cleaned_html))
if result.screenshot:
print("Screenshot base64 length:", len(result.screenshot))
``` ```
#### Advanced Crawling ### 3.2 Legacy Parameters Still Accepted
For **backward** compatibility, `arun()` can still accept direct arguments like `css_selector=...`, `word_count_threshold=...`, etc., but we strongly advise migrating them into a **`CrawlerRunConfig`**.
---
## 4. Batch Processing: `arun_many()`
```python ```python
async with AsyncWebCrawler( async def arun_many(
browser_type="firefox", self,
verbose=True, urls: List[str],
headers={"Custom-Header": "Value"} config: Optional[CrawlerRunConfig] = None,
) as crawler: # Legacy parameters maintained for backwards compatibility...
result = await crawler.arun( ) -> List[CrawlResult]:
url="https://example.com", """
css_selector=".main-content", Process multiple URLs with intelligent rate limiting and resource monitoring.
word_count_threshold=20, """
process_iframes=True,
magic=True,
wait_for="css:.dynamic-content",
screenshot=True
)
``` ```
#### Session Management ### 4.1 Resource-Aware Crawling
The `arun_many()` method now uses an intelligent dispatcher that:
- Monitors system memory usage
- Implements adaptive rate limiting
- Provides detailed progress monitoring
- Manages concurrent crawls efficiently
### 4.2 Example Usage
```python ```python
async with AsyncWebCrawler() as crawler: from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, RateLimitConfig
# First request from crawl4ai.dispatcher import DisplayMode
result1 = await crawler.arun(
url="https://example.com/login", # Configure browser
session_id="my_session" browser_cfg = BrowserConfig(headless=True)
# Configure crawler with rate limiting
run_cfg = CrawlerRunConfig(
# Enable rate limiting
enable_rate_limiting=True,
rate_limit_config=RateLimitConfig(
base_delay=(1.0, 2.0), # Random delay between 1-2 seconds
max_delay=30.0, # Maximum delay after rate limit hits
max_retries=2, # Number of retries before giving up
rate_limit_codes=[429, 503] # Status codes that trigger rate limiting
),
# Resource monitoring
memory_threshold_percent=70.0, # Pause if memory exceeds this
check_interval=0.5, # How often to check resources
max_session_permit=3, # Maximum concurrent crawls
display_mode=DisplayMode.DETAILED.value # Show detailed progress
)
urls = [
"https://example.com/page1",
"https://example.com/page2",
"https://example.com/page3"
]
async with AsyncWebCrawler(config=browser_cfg) as crawler:
results = await crawler.arun_many(urls, config=run_cfg)
for result in results:
print(f"URL: {result.url}, Success: {result.success}")
```
### 4.3 Key Features
1. **Rate Limiting**
- Automatic delay between requests
- Exponential backoff on rate limit detection
- Domain-specific rate limiting
- Configurable retry strategy
2. **Resource Monitoring**
- Memory usage tracking
- Adaptive concurrency based on system load
- Automatic pausing when resources are constrained
3. **Progress Monitoring**
- Detailed or aggregated progress display
- Real-time status updates
- Memory usage statistics
4. **Error Handling**
- Graceful handling of rate limits
- Automatic retries with backoff
- Detailed error reporting
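
The "exponential backoff" and "configurable retry strategy" items can be sketched in a few lines. This is an illustrative calculation only: `backoff_delays` and `should_retry` are hypothetical helpers, the doubling rule is an assumption, and the library's actual delay formula (including any random jitter) may differ.

```python
from typing import List, Sequence


def backoff_delays(base_delay: float, max_delay: float, max_retries: int) -> List[float]:
    """Delay before each retry: doubles per attempt, capped at max_delay."""
    return [min(base_delay * (2 ** attempt), max_delay) for attempt in range(max_retries)]


def should_retry(status_code: int, attempt: int, max_retries: int,
                 rate_limit_codes: Sequence[int] = (429, 503)) -> bool:
    """Retry only on rate-limit status codes and while attempts remain."""
    return status_code in rate_limit_codes and attempt < max_retries


if __name__ == "__main__":
    print(backoff_delays(base_delay=1.0, max_delay=30.0, max_retries=6))
    print(should_retry(429, attempt=0, max_retries=2))
    print(should_retry(404, attempt=0, max_retries=2))
```

The cap keeps a long run of rate-limit responses from producing unbounded waits, which mirrors the role of `max_delay` in the configuration example.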
---
## 5. `CrawlResult` Output
Each `arun()` returns a **`CrawlResult`** containing:
- `url`: Final URL (if redirected).
- `html`: Original HTML.
- `cleaned_html`: Sanitized HTML.
- `markdown_v2` (or future `markdown`): Markdown outputs (raw, fit, etc.).
- `extracted_content`: If an extraction strategy was used (JSON for CSS/LLM strategies).
- `screenshot`, `pdf`: If screenshots/PDF requested.
- `media`, `links`: Information about discovered images/links.
- `success`, `error_message`: Status info.
For details, see [CrawlResult doc](./crawl-result.md).
---
## 6. Quick Example
Below is an example hooking it all together:
```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
import json
async def main():
# 1. Browser config
browser_cfg = BrowserConfig(
browser_type="firefox",
headless=False,
verbose=True
) )
# Subsequent request using same session # 2. Run config
result2 = await crawler.arun( schema = {
url="https://example.com/protected", "name": "Articles",
session_id="my_session" "baseSelector": "article.post",
"fields": [
{
"name": "title",
"selector": "h2",
"type": "text"
},
{
"name": "url",
"selector": "a",
"type": "attribute",
"attribute": "href"
}
]
}
run_cfg = CrawlerRunConfig(
cache_mode=CacheMode.BYPASS,
extraction_strategy=JsonCssExtractionStrategy(schema),
word_count_threshold=15,
remove_overlay_elements=True,
wait_for="css:.post" # Wait for posts to appear
) )
async with AsyncWebCrawler(config=browser_cfg) as crawler:
result = await crawler.arun(
url="https://example.com/blog",
config=run_cfg
)
if result.success:
print("Cleaned HTML length:", len(result.cleaned_html))
if result.extracted_content:
articles = json.loads(result.extracted_content)
print("Extracted articles:", articles[:2])
else:
print("Error:", result.error_message)
asyncio.run(main())
``` ```
## Context Manager **Explanation**:
- We define a **`BrowserConfig`** with Firefox, no headless, and `verbose=True`.
- We define a **`CrawlerRunConfig`** that **bypasses cache**, uses a **CSS** extraction schema, has a `word_count_threshold=15`, etc.
- We pass them to `AsyncWebCrawler(config=...)` and `arun(url=..., config=...)`.
AsyncWebCrawler implements the async context manager protocol: ---
```python ## 7. Best Practices & Migration Notes
async def __aenter__(self) -> 'AsyncWebCrawler':
# Initialize browser and resources
return self
async def __aexit__(self, *args): 1. **Use** `BrowserConfig` for **global** settings about the browser's environment.
# Cleanup resources 2. **Use** `CrawlerRunConfig` for **per-crawl** logic (caching, content filtering, extraction strategies, wait conditions).
pass 3. **Avoid** legacy parameters like `css_selector` or `word_count_threshold` directly in `arun()`. Instead:
```
Always use AsyncWebCrawler with async context manager: ```python
```python run_cfg = CrawlerRunConfig(css_selector=".main-content", word_count_threshold=20)
async with AsyncWebCrawler() as crawler: result = await crawler.arun(url="...", config=run_cfg)
# Your crawling code here ```
pass
```
## Best Practices 4. **Context Manager** usage is simplest unless you want a persistent crawler across many calls.
1. **Resource Management** ---
```python
# Always use context manager
async with AsyncWebCrawler() as crawler:
# Crawler will be properly cleaned up
pass
```
2. **Error Handling** ## 8. Summary
```python
try:
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(url="https://example.com")
if not result.success:
print(f"Crawl failed: {result.error_message}")
except Exception as e:
print(f"Error: {str(e)}")
```
3. **Performance Optimization** **AsyncWebCrawler** is your entry point to asynchronous crawling:
```python
# Enable caching for better performance
crawler = AsyncWebCrawler(
always_by_pass_cache=False,
verbose=True
)
```
4. **Anti-Detection** - **Constructor** accepts **`BrowserConfig`** (or defaults).
```python - **`arun(url, config=CrawlerRunConfig)`** is the main method for single-page crawls.
# Maximum stealth - **`arun_many(urls, config=CrawlerRunConfig)`** handles concurrency across multiple URLs.
crawler = AsyncWebCrawler( - For advanced lifecycle control, use `start()` and `close()` explicitly.
headless=True,
user_agent="Mozilla/5.0...",
headers={"Accept-Language": "en-US"}
)
result = await crawler.arun(
url="https://example.com",
magic=True,
simulate_user=True
)
```
## Note on Browser Types **Migration**:
- If you used `AsyncWebCrawler(browser_type="chromium", css_selector="...")`, move browser settings to `BrowserConfig(...)` and content/crawl logic to `CrawlerRunConfig(...)`.
Each browser type has its characteristics: This modular approach ensures your code is **clean**, **scalable**, and **easy to maintain**. For any advanced or rarely used parameters, see the [BrowserConfig docs](../api/parameters.md).
- **chromium**: Best overall compatibility
- **firefox**: Good for specific use cases
- **webkit**: Lighter weight, good for basic crawling
Choose based on your specific needs:
```python
# High compatibility
crawler = AsyncWebCrawler(browser_type="chromium")
# Memory efficient
crawler = AsyncWebCrawler(browser_type="webkit")
```


@@ -1,85 +0,0 @@
# CrawlerRunConfig Parameters Documentation
## Content Processing Parameters
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `word_count_threshold` | int | 200 | Minimum word count threshold before processing content |
| `extraction_strategy` | ExtractionStrategy | None | Strategy to extract structured data from crawled pages. When None, uses NoExtractionStrategy |
| `chunking_strategy` | ChunkingStrategy | RegexChunking() | Strategy to chunk content before extraction |
| `markdown_generator` | MarkdownGenerationStrategy | None | Strategy for generating markdown from extracted content |
| `content_filter` | RelevantContentFilter | None | Optional filter to prune irrelevant content |
| `only_text` | bool | False | If True, attempt to extract text-only content where applicable |
| `css_selector` | str | None | CSS selector to extract a specific portion of the page |
| `excluded_tags` | list[str] | [] | List of HTML tags to exclude from processing |
| `keep_data_attributes` | bool | False | If True, retain `data-*` attributes while removing unwanted attributes |
| `remove_forms` | bool | False | If True, remove all `<form>` elements from the HTML |
| `prettiify` | bool | False | If True, apply `fast_format_html` to produce prettified HTML output |
## Caching Parameters
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `cache_mode` | CacheMode | None | Defines how caching is handled. Defaults to CacheMode.ENABLED internally |
| `session_id` | str | None | Optional session ID to persist browser context and page instance |
| `bypass_cache` | bool | False | Legacy parameter, if True acts like CacheMode.BYPASS |
| `disable_cache` | bool | False | Legacy parameter, if True acts like CacheMode.DISABLED |
| `no_cache_read` | bool | False | Legacy parameter, if True acts like CacheMode.WRITE_ONLY |
| `no_cache_write` | bool | False | Legacy parameter, if True acts like CacheMode.READ_ONLY |
## Page Navigation and Timing Parameters
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `wait_until` | str | "domcontentloaded" | The condition to wait for when navigating |
| `page_timeout` | int | 60000 | Timeout in milliseconds for page operations like navigation |
| `wait_for` | str | None | CSS selector or JS condition to wait for before extracting content |
| `wait_for_images` | bool | True | If True, wait for images to load before extracting content |
| `delay_before_return_html` | float | 0.1 | Delay in seconds before retrieving final HTML |
| `mean_delay` | float | 0.1 | Mean base delay between requests when calling arun_many |
| `max_range` | float | 0.3 | Max random additional delay range for requests in arun_many |
| `semaphore_count` | int | 5 | Number of concurrent operations allowed |
## Page Interaction Parameters
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `js_code` | str or list[str] | None | JavaScript code/snippets to run on the page |
| `js_only` | bool | False | If True, indicates subsequent calls are JS-driven updates |
| `ignore_body_visibility` | bool | True | If True, ignore whether the body is visible before proceeding |
| `scan_full_page` | bool | False | If True, scroll through the entire page to load all content |
| `scroll_delay` | float | 0.2 | Delay in seconds between scroll steps if scan_full_page is True |
| `process_iframes` | bool | False | If True, attempts to process and inline iframe content |
| `remove_overlay_elements` | bool | False | If True, remove overlays/popups before extracting HTML |
| `simulate_user` | bool | False | If True, simulate user interactions for anti-bot measures |
| `override_navigator` | bool | False | If True, overrides navigator properties for more human-like behavior |
| `magic` | bool | False | If True, attempts automatic handling of overlays/popups |
| `adjust_viewport_to_content` | bool | False | If True, adjust viewport according to page content dimensions |
## Media Handling Parameters
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `screenshot` | bool | False | Whether to take a screenshot after crawling |
| `screenshot_wait_for` | float | None | Additional wait time before taking a screenshot |
| `screenshot_height_threshold` | int | 20000 | Threshold for page height to decide screenshot strategy |
| `pdf` | bool | False | Whether to generate a PDF of the page |
| `image_description_min_word_threshold` | int | 50 | Minimum words for image description extraction |
| `image_score_threshold` | int | 3 | Minimum score threshold for processing an image |
| `exclude_external_images` | bool | False | If True, exclude all external images from processing |
## Link and Domain Handling Parameters
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `exclude_social_media_domains` | list[str] | SOCIAL_MEDIA_DOMAINS | List of domains to exclude for social media links |
| `exclude_external_links` | bool | False | If True, exclude all external links from the results |
| `exclude_social_media_links` | bool | False | If True, exclude links pointing to social media domains |
| `exclude_domains` | list[str] | [] | List of specific domains to exclude from results |
## Debugging and Logging Parameters
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `verbose` | bool | True | Enable verbose logging |
| `log_console` | bool | False | If True, log console messages from the page |

Some files were not shown because too many files have changed in this diff