Compare commits

138 Commits

Author SHA1 Message Date
UncleCode
d97a075082 Delete a.md 2024-12-25 19:43:39 +08:00
Haopeng138
bacbeb3ed4 Fix #340 example llm_extraction (#358)
@Haopeng138 Thank you so much. They are still part of the library. I forgot to update them since I moved the asynchronous versions years ago. I really appreciate it. I have to say that I feel weak in the documentation. That's why I spent a lot of time on it last week. Now that you mention some of the things in the example folder, I realize I forgot about the example folder. I'll try to update it more. If you find anything else, please help and support. Thank you. I will add your name to the contributor list as well.
2024-12-24 19:56:07 +08:00
UncleCode
ed7bc1909c Bump version to 0.4.22 2024-12-15 19:49:38 +08:00
UncleCode
e9e5b5642d Fix js_snippet issue in 0.4.21
bump to 0.4.22
2024-12-15 19:49:30 +08:00
UncleCode
7524aa7b5e Feature: Add Markdown generation to CrawlerRunConfig
- Added markdown generator parameter to CrawlerRunConfig in `async_configs.py`.
- Implemented logic for Markdown generation in content scraping in `async_webcrawler.py`.
- Updated version number to 0.4.21 in `__version__.py`.
2024-12-13 21:51:38 +08:00
UncleCode
7af1d32ef6 Update README for version 0.4.2: Reflect new features and enhancements 2024-12-12 20:18:44 +08:00
UncleCode
399af801a1 Merge branch 'next' 2024-12-12 20:17:27 +08:00
UncleCode
4a72c5ea6e Add release notes and documentation for version 0.4.2: Configurable Crawlers, Session Management, and Enhanced Screenshot/PDF features 2024-12-12 20:15:50 +08:00
UncleCode
20d6f5fdf4 Merge branch 'main' of https://github.com/unclecode/crawl4ai 2024-12-12 19:58:01 +08:00
UncleCode
3d69715dba chore: Update .gitignore to include new files and directories 2024-12-12 19:57:59 +08:00
UncleCode
de1766d565 Bump version to 0.4.2 2024-12-12 19:35:30 +08:00
UncleCode
0982c639ae Enhance AsyncWebCrawler and related configurations
- Introduced new configuration classes: BrowserConfig and CrawlerRunConfig.
- Refactored AsyncWebCrawler to leverage the new configuration system for cleaner parameter management.
- Updated AsyncPlaywrightCrawlerStrategy for better flexibility and reduced legacy parameters.
- Improved error handling with detailed context extraction during exceptions.
- Enhanced overall maintainability and usability of the web crawler.
2024-12-12 19:35:09 +08:00
UncleCode
5188b7a6a0 Add full-page screenshot and PDF export features
- Introduced a new approach for capturing full-page screenshots by exporting them as PDFs first, enhancing reliability and performance.
- Added documentation for the feature in `docs/examples/full_page_screenshot_and_pdf_export.md`.
- Refactored `perform_completion_with_backoff` in `crawl4ai/utils.py` to include necessary extra parameters.
- Updated `quickstart_async.py` to utilize LLM extraction with refined arguments.
2024-12-10 20:59:31 +08:00
lvzhengri
759164831d Update async_webcrawler.py (#337)
add @asynccontextmanager
2024-12-10 20:56:52 +08:00
UncleCode
5431fa2d0c Add PDF & screenshot functionality, new tutorial
- Added support for exporting pages as PDFs.
- Enhanced screenshot functionality for long pages.
- Created a tutorial on dynamic content loading with 'Load More' buttons.
- Updated web crawler to handle PDF data in responses.
2024-12-10 20:10:39 +08:00
UncleCode
e130fd8db9 Implement new async crawler features and stability updates
- Introduced new async crawl strategy with session management.
- Added BrowserManager for improved browser management.
- Enhanced documentation, focusing on storage state and usage examples.
- Improved error handling and logging for sessions.
- Added JavaScript snippets for customizing navigator properties.
2024-12-10 17:55:29 +08:00
Mohammed
ded554d334 Fixed typo (#324) 2024-12-09 20:17:43 +08:00
UncleCode
2d31915f0a Enhance Async Crawler with storage state handling
- Updated Async Crawler to support storage state management.
- Added error handling for URL validation in Async Web Crawler.
- Modified README logo and improved .gitignore entries.
- Fixed issues in multiple files for better code robustness.
2024-12-09 20:04:59 +08:00
lu4nx
ba3e808802 fix: The extract method logs output only when self.verbose is set to True. (#314)
Co-authored-by: lu4nx <lu4nx@lx-pc>
2024-12-09 17:19:26 +08:00
Olavo Henrique Marques Peixoto
e3488da194 fixing README tap (#313) 2024-12-09 14:34:52 +08:00
UncleCode
740214e021 Merge branch 'next' 2024-12-08 20:06:36 +08:00
UncleCode
c51e901f68 feat: Enhance AsyncPlaywrightCrawlerStrategy with text-only and light modes, dynamic viewport adjustment, and session management
### New Features:
- **Text-Only Mode**: Added support for text-only crawling by disabling images, JavaScript, GPU, and other non-essential features.
- **Light Mode**: Optimized browser settings to reduce resource usage and improve efficiency during crawling.
- **Dynamic Viewport Adjustment**: Automatically adjusts viewport dimensions based on content size, ensuring accurate rendering and scaling.
- **Full Page Scanning**: Introduced a feature to scroll and capture dynamic content for pages with infinite scroll or lazy-loading elements.
- **Session Management**: Added `create_session` method for creating and managing browser sessions with unique IDs.

### Improvements:
- Unified viewport handling across contexts by dynamically setting dimensions using `self.viewport_width` and `self.viewport_height`.
- Enhanced logging and error handling for viewport adjustments, page scanning, and content evaluation.
- Reduced resource usage with additional browser flags for both `light_mode` and `text_only` configurations.
- Improved handling of cookies, headers, and proxies in session creation.

### Refactoring:
- Removed hardcoded viewport dimensions and replaced them with dynamic configurations.
- Cleaned up unused and commented-out code for better readability and maintainability.
- Introduced defaults for frequently used parameters like `delay_before_return_html`.

### Fixes:
- Resolved potential inconsistencies in viewport handling.
- Improved robustness of content loading and dynamic adjustments to avoid failures and timeouts.

### Docs Update:
- Updated schema usage in `quickstart_async.py` example:
  - Changed `OpenAIModelFee.schema()` to `OpenAIModelFee.model_json_schema()` for compatibility.
- Enhanced LLM extraction instruction documentation.

This commit introduces significant enhancements to improve efficiency, flexibility, and reliability of the crawler strategy.
2024-12-08 20:04:44 +08:00
UncleCode
8c611dcb4b Refactored web scraping components
- Enhanced the web scraping strategy with new methods for optimized media handling.
- Added new utility functions for better content processing.
- Refined existing features for improved accuracy and efficiency in scraping tasks.
- Introduced more robust filtering criteria for media elements.
2024-12-05 22:33:47 +08:00
UncleCode
a45b8b1eb1 Merge issues with 0.4.0 is over 2024-12-04 20:29:25 +08:00
UncleCode
56f82f3e7f Merge branch 'next' 2024-12-04 20:27:35 +08:00
UncleCode
486db3a771 Updated to version 0.4.0 with new features
- Enhanced error handling in async crawler.
- Added flexible options in Markdown generation.
- Updated user agent settings for improved reliability.
- Reflected changes in documentation and examples.
2024-12-04 20:26:39 +08:00
UncleCode
b02544bc0b docs: update README and blog for version 0.4.0 release, highlighting new features and improvements 2024-12-03 21:28:52 +08:00
UncleCode
e9639ad189 refactor: improve error handling in DataProcessor and optimize data parsing logic 2024-12-03 19:44:38 +08:00
UncleCode
95a4f74d2a fix: pass logger to WebScrapingStrategy and update score computation in PruningContentFilter 2024-12-02 20:37:28 +08:00
unclecode
293f299c08 Add PruningContentFilter with unit tests and update documentation
- Introduced the PruningContentFilter for better content relevance.
- Implemented comprehensive unit tests for verification of functionality.
- Enhanced existing BM25ContentFilter tests for edge case coverage.
- Updated documentation to include usage examples for new filter.
2024-12-01 19:17:33 +08:00
UncleCode
80d58ad24c bump version to 0.3.747 2024-11-30 22:00:15 +08:00
UncleCode
3e83893b3f Enhance User-Agent Handling
- Added a new UserAgentGenerator class for generating random User-Agents.
- Integrated User-Agent generation in AsyncPlaywrightCrawlerStrategy for randomization.
- Enhanced HTTP headers with generated Client Hints.
2024-11-30 18:13:12 +08:00
UncleCode
8c76a8c7dc docs: add contributor entry for dvschuyl regarding AsyncPlaywrightCrawlerStrategy issue 2024-11-29 21:14:49 +08:00
UncleCode
0780db55e1 fix: handle errors during image dimension updates in AsyncPlaywrightCrawlerStrategy 2024-11-29 21:12:19 +08:00
dvschuyl
1ed7c15118 🩹 Page-evaluate navigation destroyed error (#304)
Thanks for your contribution and such a nice approach. Now that I think of it, I guess I can make good use of this for some other part of the code. By the way, thank you so much; I will add your name to the new list of contributors.
2024-11-29 21:06:04 +08:00
UncleCode
569bdb6073 Merge branch 'next' 2024-11-29 20:54:28 +08:00
UncleCode
1def53b7fe docs: update Raspberry Pi section to indicate upcoming support 2024-11-29 20:53:43 +08:00
UncleCode
f9c98a377d Enhance Docker support and improve installation process
- Added new Docker commands for platform-specific builds.
- Updated README with comprehensive installation and setup instructions.
- Introduced `post_install` method in setup script for automation.
- Refined migration processes with enhanced error logging.
- Bumped version to 0.3.746 and updated dependencies.
2024-11-29 20:52:51 +08:00
UncleCode
93bf3e8a1f Refactor Dockerfile and clean up main.py
- Enhanced Dockerfile for platform-specific installations
  - Added ARG for TARGETPLATFORM and BUILDPLATFORM
  - Improved GPU support conditional on TARGETPLATFORM
- Removed static pages mounting in main.py
- Streamlined code structure to improve maintainability
2024-11-29 20:08:09 +08:00
UncleCode
d202f3539b Enhance installation and migration processes
- Added a post-installation setup script for initialization.
- Updated README with installation notes for Playwright setup.
- Enhanced migration logging for better error visibility.
- Added 'pydantic' to requirements.
- Bumped version to 0.3.746.
2024-11-29 18:48:44 +08:00
UncleCode
12e73d4898 refactor: remove legacy build hooks and setup files, migrate to setup.cfg and pyproject.toml 2024-11-29 16:01:19 +08:00
unclecode
449dd7cc0b Migrating from the classic setup.py to a pyproject.toml approach. 2024-11-29 14:45:04 +08:00
UncleCode
b0419edda6 Update README.md (#300) 2024-11-29 02:31:17 +08:00
UncleCode
c0e87abaee fix: update package versions in requirements.txt for compatibility 2024-11-28 21:43:08 +08:00
UncleCode
c8485776fe docs: update README to reflect latest version v0.3.745 2024-11-28 20:04:16 +08:00
UncleCode
aa3e2d0fe6 Merge branch 'main' of https://github.com/unclecode/crawl4ai 2024-11-28 20:03:43 +08:00
UncleCode
98c64f9d5f Merge branch 'next' 2024-11-28 20:03:11 +08:00
UncleCode
7d81c17cca fix: improve handling of CRAWL4_AI_BASE_DIRECTORY environment variable in setup.py 2024-11-28 20:02:39 +08:00
UncleCode
652d396a81 chore: update version to 0.3.745 2024-11-28 20:00:29 +08:00
UncleCode
1d83c493af Enhance setup process and update contributors list
- Acknowledge contributor paulokuong for fixing CRAWL4_AI_BASE_DIRECTORY issue
- Refine base directory handling in `setup.py`
- Clarify Playwright installation instructions and improve error handling
2024-11-28 19:58:40 +08:00
Paulo Kuong
cf35cbe59e CRAWL4_AI_BASE_DIRECTORY should be Path object instead of string (#298)
Thank you so much for your point. Yes, that's correct. I accept your pull request, and I add your name to a contribution list. Thank you again.
2024-11-28 19:46:36 +08:00
UncleCode
9221c08418 docs: fix link formatting for recent updates section in README 2024-11-28 19:33:36 +08:00
UncleCode
48d43c14b1 docs: fix link formatting for recent updates section in README 2024-11-28 19:33:02 +08:00
UncleCode
776efa74a4 docs: fix link formatting for recent updates section in README 2024-11-28 19:32:32 +08:00
UncleCode
b14e83f499 docs: fix link formatting for recent updates section in README 2024-11-28 19:31:09 +08:00
UncleCode
a9b6b65238 chore: update version to 0.3.744 and add publish.sh to .gitignore 2024-11-28 19:26:50 +08:00
UncleCode
a036b7f122 feat: implement create_box_message utility for formatted error messages and enhance error logging in AsyncWebCrawler 2024-11-28 19:24:07 +08:00
UncleCode
0bccf23db3 docs: update quickstart_async.py to enable example function calls for better demonstration 2024-11-28 18:19:42 +08:00
UncleCode
0cbd594512 Merge branch 'next' - Update README, and quickstart examples 2024-11-28 16:43:16 +08:00
UncleCode
efe93a5f57 docs: enhance README with development TODOs and refine mission statement for clarity 2024-11-28 16:41:11 +08:00
UncleCode
3fda66b85b docs: refine README content for clarity and conciseness, improving descriptions and formatting 2024-11-28 16:36:24 +08:00
UncleCode
ddfb6707b4 docs: update README to reflect new branding and improve section headings for clarity 2024-11-28 16:34:08 +08:00
UncleCode
a69f7a9531 fix: correct typo in function documentation for clarity and accuracy 2024-11-28 16:31:41 +08:00
UncleCode
d583aa43ca refactor: update cache handling in quickstart_async example to use CacheMode enum 2024-11-28 15:53:25 +08:00
UncleCode
3abb573142 docs: update README for version 0.3.743 with improved formatting and contributor acknowledgments 2024-11-28 13:07:59 +08:00
UncleCode
d556dada9f docs: update README to keep details open for extraction capabilities, browser integration, input/output flexibility, utility & debugging, security & accessibility, community & documentation, and cutting-edge features 2024-11-28 13:07:33 +08:00
UncleCode
ce7d49484f docs: update README for version 0.3.743 with new features, enhancements, and contributor acknowledgments 2024-11-28 13:06:46 +08:00
UncleCode
e4acd18429 docs: update README for version 0.3.743 with new features, enhancements, and contributor acknowledgments 2024-11-28 13:06:30 +08:00
UncleCode
c2d4784810 fix: resolve merge conflict in DefaultMarkdownGenerator affecting fit_markdown generation 2024-11-28 12:56:31 +08:00
UncleCode
76bea6c577 Merge branch 'main' into 0.3.743 2024-11-28 12:53:30 +08:00
UncleCode
3ff0b0b2c4 feat: update changelog for version 0.3.743 with new features, improvements, and contributor acknowledgments 2024-11-28 12:48:07 +08:00
UncleCode
a1c7dc17ce Merge branch 'next' of https://github.com/unclecode/crawl4ai into next 2024-11-28 12:45:57 +08:00
UncleCode
24723b2f10 Enhance features and documentation
- Updated version to 0.3.743
- Improved ManagedBrowser configuration with dynamic host/port
- Implemented fast HTML formatting in web crawler
- Enhanced markdown generation with a new generator class
- Improved sanitization and utility functions
- Added contributor details and pull request acknowledgments
- Updated documentation for clearer usage scenarios
- Adjusted tests to reflect class name changes
2024-11-28 12:45:05 +08:00
Hamza Farhan
f998e9e949 Fix: handled the cases where markdown_with_citations, references_markdown, and filtered_html might not be defined. (#293)
Thanks, dear Farhan, for the changes you made in the code. I accepted and merged them into the main branch. Also, I will add your name to our contributor list. Thank you so much.
2024-11-27 19:20:54 +08:00
zhounan
73661f7d1f docs: enhance development installation instructions (#286)
Thanks for your contribution. I'm merging your changes and I'll add your name to our contributor list. Thank you so much.
2024-11-27 15:04:20 +08:00
UncleCode
b5d4db07d1 Merge branch 'main' of https://github.com/unclecode/crawl4ai 2024-11-27 14:55:58 +08:00
UncleCode
c6a022132b docs: update CONTRIBUTORS.md to acknowledge aadityakanjolia4 for fixing 'CustomHTML2Text' bug 2024-11-27 14:55:56 +08:00
unclecode
195c0ccf8a chore: remove deprecated Docker Compose configurations for crawl4ai service 2024-11-24 19:40:27 +08:00
unclecode
b09a86c0c1 chore: remove deprecated Docker Compose configurations for crawl4ai service 2024-11-24 19:40:10 +08:00
unclecode
de43505ae4 feat: update version to 0.3.742 2024-11-24 19:36:30 +08:00
unclecode
d7c5b900b8 feat: add support for arm64 platform in Docker commands and update INSTALL_TYPE variable in docker-compose 2024-11-24 19:35:53 +08:00
unclecode
edad7b6a74 chore: remove Railway deployment configuration and related documentation 2024-11-24 18:48:39 +08:00
UncleCode
829a1f7992 feat: update version to 0.3.741 and enhance content filtering with heuristic strategy. Fixes the issue that occurs when the HTML passed to the BM25 content filter does not contain any HTML elements. 2024-11-23 19:45:41 +08:00
UncleCode
d729aa7d5e refactor: Add group ID for images extracted from srcset. 2024-11-23 18:00:32 +08:00
UncleCode
0d0cef3438 feat: add enhanced markdown generation example with citations and file output 2024-11-22 20:14:58 +08:00
UncleCode
d7a112fefe Merge branch 'main' of https://github.com/unclecode/crawl4ai 2024-11-22 19:56:56 +08:00
UncleCode
a5decaa7cf Merge branch '0.3.74' 2024-11-22 19:55:52 +08:00
UncleCode
8dea3f470f chore: update README to include new features and improvements for version 0.3.74 2024-11-22 18:50:12 +08:00
UncleCode
e02935dc5b chore: update README to reflect new features and improvements in version 0.3.74 2024-11-22 18:49:22 +08:00
UncleCode
24ad2fe2dd feat: enhance Markdown generation to include fit_html attribute 2024-11-22 18:47:17 +08:00
UncleCode
571dda6549 Update README 2024-11-22 18:27:43 +08:00
UncleCode
006bee4a5a feat: enhance image processing capabilities
- Enhanced image processing with srcset support and validation checks for better image selection.
2024-11-22 16:00:17 +08:00
UncleCode
dbb751c8f0 This commit introduces the MarkdownGenerationStrategy concept, which lets us add future strategies for generating better markdown. For now, raw markdown is generated as before. There is a new BM25-based algorithm for fitting markdown, plus the ability to refine markdown into citation form: links are extracted and replaced by citation reference numbers, and a references section at the very end lists all links with their descriptions. This format is better suited to large language models. When links are not needed, the markdown can be made significantly smaller, and the reference list can be attached to a large language model as a separate file. This commit contains the changes in this direction. 2024-11-21 18:21:43 +08:00
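The citation form described in dbb751c8f0 can be illustrated with a minimal sketch. This is not the library's generator: the function name `to_citations`, the `text[n]` marker style, and the `[n] url: description` reference layout are all assumptions made for illustration.

```python
import re

# Matches inline Markdown links such as [text](https://example.com), but not images.
LINK_RE = re.compile(r"(?<!!)\[([^\]]+)\]\(([^)\s]+)\)")


def to_citations(markdown):
    """Hypothetical helper: swap inline links for numbered citations.

    Returns (body, references), where references is a newline-separated list
    that could be appended to the body or shipped as a separate file.
    """
    numbers = {}  # URL -> citation number (a repeated URL reuses its number)
    labels = {}   # URL -> first link text seen, used as the description

    def replace(match):
        text, url = match.group(1), match.group(2)
        if url not in numbers:
            numbers[url] = len(numbers) + 1
            labels[url] = text
        return f"{text}[{numbers[url]}]"

    body = LINK_RE.sub(replace, markdown)
    references = "\n".join(
        f"[{n}] {url}: {labels[url]}" for url, n in numbers.items()
    )
    return body, references
```

Keeping the references separate from the body is what allows the body to shrink when the links are not needed downstream.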
程序员阿江(Relakkes)
3439f7886d fix: crawler strategy exception handling and fixes (#271) 2024-11-20 20:30:25 +08:00
Darwing Medina
d418a04602 Fix #260 prevent pass duplicated kwargs to scrapping_strategy (#269)
Thank you for the suggestions. It totally makes sense now. Change to pop operator.
2024-11-20 18:52:11 +08:00
UncleCode
7047422e48 Merge branch '0.3.74' of https://github.com/unclecode/crawl4ai into 0.3.74 2024-11-19 19:33:08 +08:00
UncleCode
2bdec1fa5a chore: add manage-collab.sh to .gitignore 2024-11-19 19:33:04 +08:00
UncleCode
b654c49e55 Update .gitignore to exclude additional scripts and files 2024-11-19 19:32:06 +08:00
UncleCode
f2cb7d506d Delete test3.txt 2024-11-19 19:12:14 +08:00
ntohidikplay
a6dad3fc6d test: trying to push to 0.3.74 2024-11-19 12:09:33 +01:00
UncleCode
fbcff85ecb Remove test files 2024-11-19 19:03:23 +08:00
UncleCode
788c67c29a Merge branch 'main' of https://github.com/unclecode/crawl4ai 2024-11-19 19:02:44 +08:00
UncleCode
2f19d38693 Update .gitignore to include .gitboss/ and todo_executor.md 2024-11-19 19:02:41 +08:00
ntohidikplay
3aae30ed2a test1: trying to push to main 2024-11-19 11:57:07 +01:00
UncleCode
73658c758a chore: update .gitignore to include manage-collab.sh 2024-11-19 16:10:43 +08:00
UncleCode
b6af94cbbb Merge remote-tracking branch 'origin/main' into 0.3.74 2024-11-18 21:15:04 +08:00
UncleCode
852729ff38 feat(docker): add Docker Compose configurations for local and hub deployment; enhance GPU support checks in Dockerfile
feat(requirements): update requirements.txt to include snowballstemmer
fix(version_manager): correct version parsing to use __version__.__version__
feat(main): introduce chunking strategy and content filter in CrawlRequest model
feat(content_filter): enhance BM25 algorithm with priority tag scoring for improved content relevance
feat(logger): implement new async logger engine replacing print statements throughout library
fix(database): resolve version-related deadlock and circular lock issues in database operations
docs(docker): expand Docker deployment documentation with usage instructions for Docker Compose
2024-11-18 21:00:06 +08:00
UncleCode
152ac35bc2 feat(docs): update README for version 0.3.74 with new features and improvements
fix(version): update version number to 0.3.74
refactor(async_webcrawler): enhance logging and add domain-based request delay
2024-11-17 21:09:26 +08:00
UncleCode
df63a40606 feat(docs): update examples and documentation to replace bypass_cache with cache_mode for improved clarity 2024-11-17 19:44:45 +08:00
UncleCode
a59c107b23 Update changelog for 0.3.74 2024-11-17 18:42:43 +08:00
UncleCode
f9fe6f89fe feat(database): implement version management and migration checks during initialization 2024-11-17 18:09:33 +08:00
UncleCode
2a82455b3d feat(crawl): implement direct crawl functionality and introduce CacheMode for improved caching control 2024-11-17 17:17:34 +08:00
UncleCode
3a524a3bdd fix(docs): remove unnecessary blank line in README for improved readability 2024-11-17 16:00:39 +08:00
UncleCode
3a66aa8a60 feat(cache): introduce CacheMode and CacheContext for enhanced caching behavior
chore(requirements): add colorama dependency
refactor(config): add SHOW_DEPRECATION_WARNINGS flag and clean up code
fix(docs): update example scripts for clarity and consistency
2024-11-17 15:30:56 +08:00
UncleCode
4b45b28f25 feat(docs): enhance deployment documentation with one-click setup, API security details, and Docker Compose examples 2024-11-16 18:44:47 +08:00
UncleCode
9139ef3125 feat(docker): update Dockerfile for improved installation process and enhance deployment documentation with Docker Compose setup and API token security 2024-11-16 18:19:44 +08:00
UncleCode
6360d0545a feat(api): add API token authentication and update Dockerfile description 2024-11-16 18:08:56 +08:00
UncleCode
1961adb530 refactor(docker): remove shared memory size configuration to streamline Dockerfile 2024-11-16 17:35:27 +08:00
UncleCode
79feab89c4 refactor(deploy): remove memory utilization alert configuration from deployment template 2024-11-16 17:28:42 +08:00
UncleCode
5d0b13294c feat(deploy): change instance size to professional-xs and update memory utilization alert window to 300 seconds 2024-11-16 17:25:07 +08:00
UncleCode
67edc2d641 feat(deploy): update instance size to professional-xs and add memory utilization alert parameters 2024-11-16 17:23:32 +08:00
UncleCode
6b569cceb5 feat(deploy): update branch to 0.3.74 and change instance size to basic-xs 2024-11-16 17:21:45 +08:00
UncleCode
6f2fe5954f feat(deploy): update instance size to professional-xs and add memory utilization alert 2024-11-16 17:12:41 +08:00
UncleCode
fca1319b7d feat(docker): add MkDocs installation and build step for documentation 2024-11-16 17:10:30 +08:00
UncleCode
f77f06a3bd feat(deploy): add deployment configuration and templates for crawl4ai 2024-11-16 16:43:31 +08:00
UncleCode
e62c807295 feat(deploy): add Railway deployment configuration and setup instructions 2024-11-16 16:38:13 +08:00
UncleCode
90df6921b7 feat(crawl_sync): add synchronous crawl endpoint and corresponding test 2024-11-16 15:34:30 +08:00
UncleCode
5098442086 refactor: migrate versioning to __version__.py and remove deprecated _version.py 2024-11-16 15:30:24 +08:00
UncleCode
d0014c6793 New async database manager and migration support
- Introduced AsyncDatabaseManager for async DB management.
- Added migration feature to transition to file-based storage.
- Enhanced web crawler with improved caching logic.
- Updated requirements and setup for async processing.
2024-11-16 14:54:41 +08:00
UncleCode
ae7ebc0bd8 chore: update .gitignore and enhance changelog with major feature additions and examples 2024-11-15 20:16:13 +08:00
UncleCode
1f269f9834 test(content_filter): add comprehensive tests for BM25ContentFilter functionality 2024-11-15 18:11:11 +08:00
UncleCode
7f1ae5adcf Update changelog 2024-11-14 22:51:51 +08:00
UncleCode
3d00fee6c2 - In this commit, the library is updated to process file downloads. Users can now specify a download folder and trigger the download process via JavaScript or other means, with all files being saved. The list of downloaded files is also added to the crawl result object.
- This commit also introduces the concept of the Relevance Content Filter, an improvement over Fit Markdown. This class of strategies aims to extract the main content from a given page: the part that really matters and is worth processing. One strategy has been created using the BM25 algorithm, which finds chunks of text from the web page that are relevant to its title, description, and keywords, or that match a user-supplied query. The result is then returned to the main engine to be converted to Markdown. Plans include adding approaches using language models as well.
- The cache database was updated to hold information about response headers and downloaded files.
2024-11-14 22:50:59 +08:00
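As a rough illustration of the BM25 idea mentioned in 3d00fee6c2, the sketch below scores text chunks against a query using the standard Okapi BM25 formula. It is a self-contained approximation, not the library's filter: the names `tokenize` and `bm25_scores`, the tokenizer, and the `k1`/`b` defaults are assumptions.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphanumeric tokens (a deliberately simple tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query, chunks, k1=1.5, b=0.75):
    """Return one Okapi BM25 score per chunk for the given query."""
    docs = [tokenize(chunk) for chunk in chunks]
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs if n_docs else 0.0
    # Document frequency: in how many chunks each term appears.
    doc_freq = Counter(term for d in docs for term in set(d))

    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in set(tokenize(query)):
            if tf[term] == 0:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            # Length normalisation: long chunks need more hits to score the same.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores
```

In a content filter, the query could be built from the page title, description, and keywords, and chunks scoring below some threshold would be dropped before Markdown conversion.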
UncleCode
17913f5acf feat(crawler): support local files and raw HTML input in AsyncWebCrawler 2024-11-13 20:00:29 +08:00
UncleCode
c38ac29edb perf(crawler): major performance improvements & raw HTML support
- Switch to lxml parser (~4x speedup)
- Add raw HTML & local file crawling support
- Fix cache headers & async cleanup
- Add browser process monitoring
- Optimize BeautifulSoup operations
- Pre-compile regex patterns

Breaking: Raw HTML handling requires new URL prefixes
Fixes: #256, #253
2024-11-13 19:40:40 +08:00
UncleCode
61b93ebf36 Update change log 2024-11-13 15:38:30 +08:00
UncleCode
bf91adf3f8 fix: Resolve unexpected BrowserContext closure during crawl in Docker
- Removed __del__ method in AsyncPlaywrightCrawlerStrategy to ensure reliable browser lifecycle management by using explicit context managers.
- Added process monitoring in ManagedBrowser to detect and log unexpected terminations of the browser subprocess.
- Updated Docker configuration to expose port 9222 for remote debugging and allocate extra shared memory to prevent browser crashes.
- Improved error handling and resource cleanup for browser instances, particularly in Docker environments.

Resolves Issue #256
2024-11-13 15:37:16 +08:00
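The move from `__del__` to explicit context managers described in bf91adf3f8 follows a common pattern, sketched below with the standard library's `asynccontextmanager`. `FakeBrowser` and `managed_browser` are hypothetical stand-ins, not the library's classes; the point is only that cleanup runs deterministically when the `async with` block exits, instead of whenever the garbage collector happens to finalize the object.

```python
import asyncio
from contextlib import asynccontextmanager


class FakeBrowser:
    """Hypothetical stand-in for a real browser handle."""

    def __init__(self):
        self.closed = False

    async def close(self):
        self.closed = True


@asynccontextmanager
async def managed_browser():
    """Yield a browser and always close it, even if the body raises."""
    browser = FakeBrowser()
    try:
        yield browser
    finally:
        await browser.close()


async def demo():
    async with managed_browser() as browser:
        still_open = not browser.closed
    # After the block exits, the browser has been closed explicitly.
    return still_open, browser.closed
```

Calling `asyncio.run(demo())` exercises the full open/close lifecycle in one event loop.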
UncleCode
b6d6631b12 Enhance Async Crawler with Playwright support
- Implemented new async crawler strategy using Playwright.
- Introduced ManagedBrowser for better browser management.
- Added support for persistent browser sessions and improved error handling.
- Updated version from 0.3.73 to 0.3.731.
- Enhanced logic in main.py for conditional mounting of static files.
- Updated requirements to replace playwright_stealth with tf-playwright-stealth.
2024-11-12 12:10:58 +08:00
94 changed files with 13328 additions and 5328 deletions

19
.do/app.yaml Normal file

@@ -0,0 +1,19 @@
alerts:
- rule: DEPLOYMENT_FAILED
- rule: DOMAIN_FAILED
name: crawl4ai
region: nyc
services:
- dockerfile_path: Dockerfile
  github:
    branch: 0.3.74
    deploy_on_push: true
    repo: unclecode/crawl4ai
  health_check:
    http_path: /health
  http_port: 11235
  instance_count: 1
  instance_size_slug: professional-xs
  name: web
  routes:
  - path: /

22
.do/deploy.template.yaml Normal file

@@ -0,0 +1,22 @@
spec:
  name: crawl4ai
  services:
  - name: crawl4ai
    git:
      branch: 0.3.74
      repo_clone_url: https://github.com/unclecode/crawl4ai.git
    dockerfile_path: Dockerfile
    http_port: 11235
    instance_count: 1
    instance_size_slug: professional-xs
    health_check:
      http_path: /health
    envs:
    - key: INSTALL_TYPE
      value: "basic"
    - key: PYTHON_VERSION
      value: "3.10"
    - key: ENABLE_GPU
      value: "false"
    routes:
    - path: /

11
.gitignore vendored

@@ -199,13 +199,22 @@ test_env/
**/.DS_Store
todo.md
todo_executor.md
git_changes.py
git_changes.md
pypi_build.sh
git_issues.py
git_issues.md
.next/
.tests/
.issues/
.docs/
.issues/
.issues/
.gitboss/
todo_executor.md
protect-all-except-feature.sh
manage-collab.sh
publish.sh
combine.sh
combined_output.txt

@@ -1,6 +1,498 @@
# Changelog
# CHANGELOG
## [0.4.1] December 8, 2024
### **File: `crawl4ai/async_crawler_strategy.py`**
#### **New Parameters and Attributes Added**
- **`text_only` (boolean)**: Enables text-only mode, disables images, JavaScript, and GPU-related features for faster, minimal rendering.
- **`light_mode` (boolean)**: Optimizes the browser by disabling unnecessary background processes and features for efficiency.
- **`viewport_width` and `viewport_height`**: Dynamically adjusts based on `text_only` mode (default values: 800x600 for `text_only`, 1920x1080 otherwise).
- **`extra_args`**: Adds browser-specific flags for `text_only` mode.
- **`adjust_viewport_to_content`**: Dynamically adjusts the viewport to the content size for accurate rendering.
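The viewport defaults listed above can be summarised in a short sketch. The function name `resolve_viewport` is hypothetical, and letting explicitly passed dimensions override the defaults is an assumption; only the 800x600 and 1920x1080 defaults come from the changelog.

```python
from typing import Optional, Tuple


def resolve_viewport(
    text_only: bool,
    width: Optional[int] = None,
    height: Optional[int] = None,
) -> Tuple[int, int]:
    """Pick viewport dimensions: small defaults in text-only mode, large otherwise."""
    default_width, default_height = (800, 600) if text_only else (1920, 1080)
    # Assumption: an explicitly supplied dimension wins over the mode default.
    return (
        width if width is not None else default_width,
        height if height is not None else default_height,
    )
```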
#### **Browser Context Adjustments**
- Added **`viewport` adjustments**: Dynamically computed based on `text_only` or custom configuration.
- Enhanced support for `light_mode` and `text_only` by adding specific browser arguments to reduce resource consumption.
#### **Dynamic Content Handling**
- **Full Page Scan Feature**:
  - Scrolls through the entire page while dynamically detecting content changes.
  - Ensures scrolling stops when no new dynamic content is loaded.
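The stop condition for a full page scan can be sketched as a loop that scrolls until the page height stops growing. This is a simplified model under stated assumptions: `FakePage` is a hypothetical object that simulates lazy loading, and the `max_scrolls` safety cap is not taken from the changelog.

```python
class FakePage:
    """Hypothetical page whose height grows a few times as it is scrolled."""

    def __init__(self, heights):
        self._heights = list(heights)
        self._index = 0
        self.scroll_calls = 0

    def height(self):
        return self._heights[self._index]

    def scroll_to_bottom(self):
        self.scroll_calls += 1
        # Simulate lazy loading: each scroll may reveal more content.
        if self._index < len(self._heights) - 1:
            self._index += 1


def scan_full_page(page, max_scrolls=50):
    """Scroll until the height stops changing or the safety cap is reached."""
    last_height = page.height()
    for _ in range(max_scrolls):
        page.scroll_to_bottom()
        new_height = page.height()
        if new_height == last_height:
            break  # No new dynamic content was loaded.
        last_height = new_height
    return last_height
```

The cap guards against pages with genuinely endless feeds, where the height would otherwise never settle.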
#### **Session Management**
- Added **`create_session`** method:
  - Creates a new browser session and assigns a unique ID.
  - Supports persistent and non-persistent contexts with full compatibility for cookies, headers, and proxies.
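The idea of creating a session and handing back a unique ID can be sketched as follows. `SessionRegistry` is a hypothetical illustration, not the library's `create_session` implementation; it stores plain dictionaries where a real strategy would hold browser contexts.

```python
import uuid


class SessionRegistry:
    """Hypothetical registry mapping unique session IDs to session settings."""

    def __init__(self):
        self._sessions = {}

    def create_session(self, cookies=None, headers=None, proxy=None):
        """Store the session settings and return a freshly generated ID."""
        session_id = uuid.uuid4().hex
        self._sessions[session_id] = {
            "cookies": list(cookies or []),
            "headers": dict(headers or {}),
            "proxy": proxy,
        }
        return session_id

    def get(self, session_id):
        return self._sessions[session_id]
```

Returning the ID rather than the session object lets later crawl calls refer to the same session by a simple string.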
#### **Improved Content Loading and Adjustment**
- **`adjust_viewport_to_content`**:
  - Automatically adjusts viewport to match content dimensions.
  - Includes scaling via Chrome DevTools Protocol (CDP).
- Enhanced content loading:
  - Waits for images to load and ensures network activity is idle before proceeding.
#### **Error Handling and Logging**
- Improved error handling and detailed logging for:
  - Viewport adjustment (`adjust_viewport_to_content`).
  - Full page scanning (`scan_full_page`).
  - Dynamic content loading.
#### **Refactoring and Cleanup**
- Removed hardcoded viewport dimensions in multiple places, replaced with dynamic values (`self.viewport_width`, `self.viewport_height`).
- Removed commented-out and unused code for better readability.
- Added default value for `delay_before_return_html` parameter.
#### **Optimizations**
- Reduced resource usage in `light_mode` by disabling unnecessary browser features such as extensions, background timers, and sync.
- Improved compatibility for different browser types (`chrome`, `firefox`, `webkit`).
---
### **File: `docs/examples/quickstart_async.py`**
#### **Schema Adjustment**
- Changed schema reference for `LLMExtractionStrategy`:
  - **Old**: `OpenAIModelFee.schema()`
  - **New**: `OpenAIModelFee.model_json_schema()`
  - `model_json_schema()` is the Pydantic v2 method for generating a model's JSON schema; `schema()` is deprecated in Pydantic v2.
#### **Documentation Comments Updated**
- Improved extraction instruction for schema-based LLM strategies.
---
### **New Features Added**
1. **Text-Only Mode**:
   - Focuses on minimal resource usage by disabling non-essential browser features.
2. **Light Mode**:
   - Optimizes browser for performance by disabling background tasks and unnecessary services.
3. **Full Page Scanning**:
   - Ensures the entire content of a page is crawled, including dynamic elements loaded during scrolling.
4. **Dynamic Viewport Adjustment**:
   - Automatically resizes the viewport to match content dimensions, improving compatibility and rendering accuracy.
5. **Session Management**:
   - Simplifies session handling with better support for persistent and non-persistent contexts.
---
### **Bug Fixes**
- Fixed potential viewport mismatches by ensuring consistent use of `self.viewport_width` and `self.viewport_height` throughout the code.
- Improved robustness of dynamic content loading to avoid timeouts and failed evaluations.
## [0.3.75] December 1, 2024
### PruningContentFilter
#### 1. Introduced PruningContentFilter (Dec 01, 2024)
A new content filtering strategy that removes less relevant nodes based on metrics like text and link density.
**Affected Files:**
- `crawl4ai/content_filter_strategy.py`: Enhancement of content filtering capabilities.
```diff
Implemented effective pruning algorithm with comprehensive scoring.
```
- `README.md`: Improved documentation regarding new features.
```diff
Updated to include usage and explanation for the PruningContentFilter.
```
- `docs/md_v2/basic/content_filtering.md`: Expanded documentation for users.
```diff
Added detailed section explaining the PruningContentFilter.
```
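To make "text and link density" concrete, here is a minimal sketch of density-based pruning. It is not the library's scoring algorithm: the `Node` fields, the `score` formula, and the `0.3` threshold are assumptions chosen for illustration only.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Node:
    name: str
    html_len: int        # length of the node's raw HTML
    text_len: int        # length of its visible text
    link_text_len: int   # length of the text that sits inside <a> tags


def score(node: Node) -> float:
    """Higher is better: lots of text per byte of markup, little of it in links."""
    if node.html_len == 0 or node.text_len == 0:
        return 0.0
    text_density = node.text_len / node.html_len
    link_density = node.link_text_len / node.text_len
    return text_density * (1.0 - link_density)


def prune(nodes: List[Node], threshold: float = 0.3) -> List[Node]:
    """Keep only nodes whose score reaches the (assumed) threshold."""
    return [n for n in nodes if score(n) >= threshold]


if __name__ == "__main__":
    page = [
        Node("article", html_len=1000, text_len=700, link_text_len=35),
        Node("nav", html_len=400, text_len=120, link_text_len=110),
        Node("footer", html_len=300, text_len=60, link_text_len=30),
    ]
    print([n.name for n in prune(page)])
```

Navigation bars and footers tend to score low on both metrics, which is why a density heuristic can remove them without any knowledge of the page's topic.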
#### 2. Added Unit Tests for PruningContentFilter (Dec 01, 2024)
Comprehensive tests added to ensure correct functionality of PruningContentFilter.
**Affected Files:**
- `tests/async/test_content_filter_prune.py`: Increased test coverage for content filtering strategies.
```diff
Created test cases for various scenarios using the PruningContentFilter.
```
### Development Updates
#### 3. Enhanced BM25ContentFilter tests (Dec 01, 2024)
Extended testing to cover additional edge cases and performance metrics.
**Affected Files:**
- `tests/async/test_content_filter_bm25.py`: Improved reliability and performance assurance.
```diff
Added tests for new extraction scenarios including malformed HTML.
```
### Infrastructure & Documentation
#### 4. Updated Examples (Dec 01, 2024)
Altered examples in documentation to promote the use of PruningContentFilter alongside existing strategies.
**Affected Files:**
- `docs/examples/quickstart_async.py`: Enhanced usability and clarity for new users.
- Revised example to illustrate usage of PruningContentFilter.
## [0.3.746] November 29, 2024
### Major Features
1. Enhanced Docker Support (Nov 29, 2024)
- Improved GPU support in Docker images.
- Dockerfile refactored for better platform-specific installations.
- Introduced new Docker commands for different platforms:
- `basic-amd64`, `all-amd64`, `gpu-amd64` for AMD64.
- `basic-arm64`, `all-arm64`, `gpu-arm64` for ARM64.
### Infrastructure & Documentation
- Enhanced README.md to improve user guidance and installation instructions.
- Added installation instructions for Playwright setup in README.
- Created and updated examples in `docs/examples/quickstart_async.py` to be more useful and user-friendly.
- Updated `requirements.txt` with a new `pydantic` dependency.
- Bumped version number in `crawl4ai/__version__.py` to 0.3.746.
### Breaking Changes
- Streamlined application structure:
- Removed static pages and related code from `main.py` which might affect existing deployments relying on static content.
### Development Updates
- Developed `post_install` method in `crawl4ai/install.py` to streamline post-installation setup tasks.
- Refined migration processes in `crawl4ai/migrations.py` with enhanced logging for better error visibility.
- Updated `docker-compose.yml` to support local and hub services for different architectures, enhancing build and deploy capabilities.
- Refactored example test cases in `docs/examples/docker_example.py` to facilitate comprehensive testing.
### README.md
Updated README with new docker commands and setup instructions.
Enhanced installation instructions and guidance.
### crawl4ai/install.py
Added post-install script functionality.
Introduced `post_install` method for automation of post-installation tasks.
### crawl4ai/migrations.py
Improved migration logging.
Refined migration processes and added better logging.
### docker-compose.yml
Refactored docker-compose for better service management.
Updated to define services for different platforms and versions.
### requirements.txt
Updated dependencies.
Added `pydantic` to requirements file.
### crawl4ai/__version__.py
Updated version number.
Bumped version number to 0.3.746.
### docs/examples/quickstart_async.py
Enhanced example scripts.
Uncommented the example calls in the async quickstart so users can run them directly.
### main.py
Refactored code to improve maintainability.
Streamlined app structure by removing static pages code.
## [0.3.743] November 27, 2024
Enhance features and documentation
- Updated version to 0.3.743
- Improved ManagedBrowser configuration with dynamic host/port
- Implemented fast HTML formatting in web crawler
- Enhanced markdown generation with a new generator class
- Improved sanitization and utility functions
- Added contributor details and pull request acknowledgments
- Updated documentation for clearer usage scenarios
- Adjusted tests to reflect class name changes
### CONTRIBUTORS.md
Added new contributors and pull request details.
Updated community contributions and acknowledged pull requests.
### crawl4ai/__version__.py
Version update.
Bumped version to 0.3.743.
### crawl4ai/async_crawler_strategy.py
Improved ManagedBrowser configuration.
Enhanced browser initialization with configurable host and debugging port; improved hook execution.
### crawl4ai/async_webcrawler.py
Optimized HTML processing.
Implemented 'fast_format_html' for optimized HTML formatting; applied it when 'prettiify' is enabled.
### crawl4ai/content_scraping_strategy.py
Enhanced markdown generation strategy.
Updated to use DefaultMarkdownGenerator and improved markdown generation with filters option.
### crawl4ai/markdown_generation_strategy.py
Refactored markdown generation class.
Renamed DefaultMarkdownGenerationStrategy to DefaultMarkdownGenerator; added content filter handling.
### crawl4ai/utils.py
Enhanced utility functions.
Improved input sanitization and enhanced HTML formatting method.
### docs/md_v2/advanced/hooks-auth.md
Improved documentation for hooks.
Updated code examples to include cookies in crawler strategy initialization.
### tests/async/test_markdown_genertor.py
Refactored tests to match class renaming.
Updated tests to use renamed DefaultMarkdownGenerator class.
## [0.3.74] November 17, 2024
This changelog details the updates and changes introduced in Crawl4AI version 0.3.74. It's designed to inform developers about new features, modifications to existing components, removals, and other important information.
### 1. File Download Processing
- Users can now specify download folders using the `downloads_path` parameter in the `AsyncWebCrawler` constructor or the `arun` method. If not specified, downloads are saved to a "downloads" folder within the `.crawl4ai` directory.
- File download tracking is integrated into the `CrawlResult` object. Successfully downloaded files are listed in the `downloaded_files` attribute, providing their paths.
- Added `accept_downloads` parameter to the crawler strategies (defaults to `False`). When set to `True`, you can trigger a download with `js_code` and use the `wait_for` parameter to give the download time to start.
**Example:**
```python
import asyncio
import os
from pathlib import Path

from crawl4ai import AsyncWebCrawler


async def download_example():
    downloads_path = os.path.join(Path.home(), ".crawl4ai", "downloads")
    os.makedirs(downloads_path, exist_ok=True)

    async with AsyncWebCrawler(
        accept_downloads=True,
        downloads_path=downloads_path,
        verbose=True
    ) as crawler:
        result = await crawler.arun(
            url="https://www.python.org/downloads/",
            js_code="""
                const downloadLink = document.querySelector('a[href$=".exe"]');
                if (downloadLink) { downloadLink.click(); }
            """,
            wait_for=5  # To ensure download has started
        )

        if result.downloaded_files:
            print("Downloaded files:")
            for file in result.downloaded_files:
                print(f"- {file}")


asyncio.run(download_example())
```
### 2. Refined Content Filtering
- Introduced the `RelevanceContentFilter` strategy (and its implementation `BM25ContentFilter`) for extracting relevant content from web pages, replacing Fit Markdown and other content cleaning strategies. This new strategy uses the BM25 algorithm to identify chunks of text relevant to the page's title, description, keywords, or a user-provided query.
- The `fit_markdown` flag in the content scraper is used to filter content based on title, meta description, and keywords.
**Example:**
```python
import asyncio

from crawl4ai import AsyncWebCrawler
from crawl4ai.content_filter_strategy import BM25ContentFilter


async def filter_content(url, query):
    async with AsyncWebCrawler() as crawler:
        content_filter = BM25ContentFilter(user_query=query)
        result = await crawler.arun(url=url, extraction_strategy=content_filter, fit_markdown=True)
        print(result.extracted_content)  # Or result.fit_markdown for the markdown version
        print(result.fit_html)  # HTML containing only the filtered content


asyncio.run(filter_content("https://en.wikipedia.org/wiki/Apple", "fruit nutrition health"))
```
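For readers unfamiliar with BM25, the ranking idea behind query-based filtering can be sketched with the standard library alone. This is a generic Okapi BM25 scorer, not the code inside `BM25ContentFilter`; the function names, the tokenizer, and the default `k1` and `b` values are assumptions.

```python
import math
import re
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase and split on anything that is not a letter or digit."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query: str, chunks: List[str], k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Return one BM25 relevance score per chunk for the given query."""
    docs = [tokenize(c) for c in chunks]
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs if n_docs else 0.0

    # Document frequency: in how many chunks does each term appear?
    df: Counter = Counter()
    for d in docs:
        df.update(set(d))

    query_terms = tokenize(query)
    scores = []
    for d in docs:
        tf = Counter(d)
        total = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            length_norm = k1 * (1 - b + b * len(d) / avg_len)
            total += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(total)
    return scores
```

A filter built on this would keep the chunks whose score exceeds some cutoff; where that cutoff sits is a tuning decision the sketch leaves open.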
### 3. Raw HTML and Local File Support
- Added support for crawling local files and raw HTML content directly.
- Use the `file://` prefix for local file paths.
- Use the `raw:` prefix for raw HTML strings.
**Example:**
```python
import asyncio
import os

from crawl4ai import AsyncWebCrawler


async def crawl_local_or_raw(crawler, content, content_type):
    prefix = "file://" if content_type == "local" else "raw:"
    url = f"{prefix}{content}"
    result = await crawler.arun(url=url)
    if result.success:
        print(f"Markdown Content from {content_type.title()} Source:")
        print(result.markdown)


# Example usage with local file and raw HTML
async def main():
    async with AsyncWebCrawler() as crawler:
        # Local File
        await crawl_local_or_raw(
            crawler, os.path.abspath('tests/async/sample_wikipedia.html'), "local"
        )
        # Raw HTML (reuses the helper above with the "raw" content type)
        await crawl_local_or_raw(
            crawler, "<h1>Raw Test</h1><p>This is raw HTML.</p>", "raw"
        )


asyncio.run(main())
```
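How a crawler might dispatch on these prefixes can be shown with a small helper. This is only a sketch of the convention described above; `classify_source` is a hypothetical name and not part of the Crawl4AI API.

```python
from typing import Tuple


def classify_source(url: str) -> Tuple[str, str]:
    """Return (kind, payload), where kind is 'raw', 'file', or 'web'."""
    if url.startswith("raw:"):
        return "raw", url[len("raw:"):]          # payload is the HTML string itself
    if url.startswith("file://"):
        return "file", url[len("file://"):]      # payload is a local filesystem path
    return "web", url                            # anything else is fetched over the network


if __name__ == "__main__":
    for candidate in ("raw:<h1>Hi</h1>", "file:///tmp/page.html", "https://example.com"):
        print(classify_source(candidate))
```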
### 4. Browser Management
- New asynchronous crawler strategy implemented using Playwright.
- `ManagedBrowser` class introduced for improved browser session handling, offering features like persistent browser sessions between requests (using `session_id` parameter) and browser process monitoring.
- Updated to tf-playwright-stealth for enhanced stealth capabilities.
- Added `use_managed_browser`, `use_persistent_context`, and `chrome_channel` parameters to AsyncPlaywrightCrawlerStrategy.
**Example:**
```python
import asyncio
import os
from pathlib import Path

from crawl4ai import AsyncWebCrawler


async def browser_management_demo():
    user_data_dir = os.path.join(Path.home(), ".crawl4ai", "user-data-dir")
    os.makedirs(user_data_dir, exist_ok=True)  # Ensure directory exists

    async with AsyncWebCrawler(
        use_managed_browser=True,
        user_data_dir=user_data_dir,
        use_persistent_context=True,
        verbose=True
    ) as crawler:
        # Both calls share one browser session via the same session_id
        result1 = await crawler.arun(
            url="https://example.com", session_id="my_session"
        )
        result2 = await crawler.arun(
            url="https://example.com/anotherpage", session_id="my_session"
        )


asyncio.run(browser_management_demo())
```
### 5. API Server & Cache Improvements
- Added CORS support to API server.
- Implemented static file serving.
- Enhanced root redirect functionality.
- Cache database updated to store response headers and downloaded files information. It utilizes a file system approach to manage large content efficiently.
- New, more efficient caching database built using xxhash and file system approach.
- Introduced `CacheMode` enum (`ENABLED`, `DISABLED`, `READ_ONLY`, `WRITE_ONLY`, `BYPASS`) and `always_bypass_cache` parameter in AsyncWebCrawler for fine-grained cache control. This replaces `bypass_cache`, `no_cache_read`, `no_cache_write`, and `always_by_pass_cache`.
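The "hash plus file system" idea for large cached content can be sketched as a content-addressed store. This is an illustration under stated assumptions: it uses `hashlib.blake2b` from the standard library as a stand-in for xxhash, and `ContentStore` with its `put`/`get` methods is a hypothetical class, not the library's cache implementation.

```python
import hashlib
import tempfile
from pathlib import Path


class ContentStore:
    """Stores large blobs as files named by their hash; the DB would keep only the key."""

    def __init__(self, root: Path) -> None:
        self.root = root
        self.root.mkdir(parents=True, exist_ok=True)

    def put(self, content: str) -> str:
        data = content.encode("utf-8")
        # blake2b is a stand-in here; the changelog names xxhash for this role.
        digest = hashlib.blake2b(data, digest_size=16).hexdigest()
        path = self.root / digest
        if not path.exists():       # identical content is written only once
            path.write_bytes(data)
        return digest

    def get(self, digest: str) -> str:
        return (self.root / digest).read_bytes().decode("utf-8")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        store = ContentStore(Path(tmp) / "html")
        key = store.put("<html>hello</html>")
        print(key, store.get(key))
```

Keeping only the short key in the database row keeps the table small, and identical pages are deduplicated for free.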
### 🗑️ Removals
- Removed deprecated: `crawl4ai/content_cleaning_strategy.py`.
- Removed internal class ContentCleaningStrategy
- Removed legacy cache control flags: `bypass_cache`, `disable_cache`, `no_cache_read`, `no_cache_write`, and `always_by_pass_cache`. These have been superseded by `cache_mode`.
### ⚙️ Other Changes
- Moved version file to `crawl4ai/__version__.py`.
- Added `crawl4ai/cache_context.py`.
- Added `crawl4ai/version_manager.py`.
- Added `crawl4ai/migrations.py`.
- Added `crawl4ai-migrate` entry point.
- Added config `NEED_MIGRATION` and `SHOW_DEPRECATION_WARNINGS`.
- API server now requires an API token for authentication, configurable with the `CRAWL4AI_API_TOKEN` environment variable. This enhances API security.
- Added synchronous crawl endpoint `/crawl_sync` for immediate result retrieval, and direct crawl endpoint `/crawl_direct` bypassing the task queue.
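A token check of the kind described above might look like the following sketch. The `Authorization: Bearer ...` header format, the `is_authorized` helper, and the behavior when no token is configured are all assumptions; only the `CRAWL4AI_API_TOKEN` variable name comes from the notes above.

```python
import hmac
import os
from typing import Mapping


def is_authorized(headers: Mapping[str, str], env: Mapping[str, str] = os.environ) -> bool:
    """Check a request's bearer token against CRAWL4AI_API_TOKEN."""
    expected = env.get("CRAWL4AI_API_TOKEN")
    if not expected:
        return True  # assumption: with no token configured, auth is switched off
    scheme, _, supplied = headers.get("Authorization", "").partition(" ")
    if scheme.lower() != "bearer":
        return False
    # Constant-time comparison avoids leaking the token through timing differences.
    return hmac.compare_digest(supplied.encode(), expected.encode())


if __name__ == "__main__":
    demo_env = {"CRAWL4AI_API_TOKEN": "s3cret"}
    print(is_authorized({"Authorization": "Bearer s3cret"}, demo_env))
    print(is_authorized({"Authorization": "Bearer wrong"}, demo_env))
```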
### ⚠️ Deprecation Notices
- The synchronous `WebCrawler` is being phased out. It is still available via `crawl4ai[sync]` but will eventually be removed, so transitioning to `AsyncWebCrawler` is strongly recommended. The boolean cache control flags in `arun` are also deprecated; migrate to the `cache_mode` parameter. See examples in the "New Features" section above for correct usage.
### 🐛 Bug Fixes
- Resolved issue with browser context closing unexpectedly in Docker. This significantly improves stability, particularly within containerized environments.
- Fixed memory leaks associated with incorrect asynchronous cleanup by removing the `__del__` method and ensuring the browser context is closed explicitly using context managers.
- Improved error handling in `WebScrapingStrategy`. More detailed error messages and suggestions for debugging will minimize frustration when running into unexpected issues.
- Fixed issue with incorrect text parsing in specific HTML structures.
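The reason for dropping `__del__` in favor of context managers is that a finalizer cannot reliably `await` anything, whereas `__aexit__` always runs inside a live event loop. A minimal sketch of the pattern, using hypothetical `FakeBrowserContext` and `ManagedContext` classes rather than the crawler's real ones:

```python
import asyncio


class FakeBrowserContext:
    """Hypothetical stand-in for a browser context with async teardown."""

    def __init__(self) -> None:
        self.closed = False

    async def close(self) -> None:
        await asyncio.sleep(0)   # stands in for real asynchronous cleanup work
        self.closed = True


class ManagedContext:
    async def __aenter__(self) -> FakeBrowserContext:
        self.ctx = FakeBrowserContext()
        return self.ctx

    async def __aexit__(self, exc_type, exc, tb) -> None:
        # Runs on normal exit and on exceptions, while the event loop is still alive.
        await self.ctx.close()


async def crawl_and_fail() -> FakeBrowserContext:
    holder = ManagedContext()
    try:
        async with holder:
            raise RuntimeError("navigation failed")
    except RuntimeError:
        pass                     # the error propagated, but cleanup already happened
    return holder.ctx


if __name__ == "__main__":
    print(asyncio.run(crawl_and_fail()).closed)
```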
### Example of migrating to the new CacheMode:
**Old way:**
```python
crawler = AsyncWebCrawler(always_by_pass_cache=True)
result = await crawler.arun(url="https://example.com", bypass_cache=True)
```
**New way:**
```python
from crawl4ai import CacheMode
crawler = AsyncWebCrawler(always_bypass_cache=True)
result = await crawler.arun(url="https://example.com", cache_mode=CacheMode.BYPASS)
```
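When migrating a larger codebase, it can help to translate the legacy booleans in one place. The sketch below shows one plausible mapping; the `CacheMode` enum here is a local stand-in that only mirrors the member names listed above, and both `legacy_to_cache_mode` and the precedence between flags are assumptions rather than the library's documented behavior.

```python
from enum import Enum


class CacheMode(Enum):
    """Local stand-in mirroring the member names from the changelog."""
    ENABLED = "enabled"
    DISABLED = "disabled"
    READ_ONLY = "read_only"
    WRITE_ONLY = "write_only"
    BYPASS = "bypass"


def legacy_to_cache_mode(
    disable_cache: bool = False,
    bypass_cache: bool = False,
    no_cache_read: bool = False,
    no_cache_write: bool = False,
) -> CacheMode:
    """Translate legacy boolean flags into a single cache mode (assumed precedence)."""
    if disable_cache:
        return CacheMode.DISABLED
    if bypass_cache:
        return CacheMode.BYPASS
    if no_cache_read and no_cache_write:
        return CacheMode.DISABLED      # neither reading nor writing leaves nothing to do
    if no_cache_read:
        return CacheMode.WRITE_ONLY    # skip reads, still store fresh results
    if no_cache_write:
        return CacheMode.READ_ONLY     # serve from cache, never update it
    return CacheMode.ENABLED
```

A single enum removes the ambiguous combinations that four independent booleans allow, which is presumably part of the motivation for the change.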
## [0.3.74] - November 13, 2024
1. **File Download Processing** (Nov 14, 2024)
- Added capability for users to specify download folders
- Implemented file download tracking in crowd result object
- Created new file: `tests/async/test_async_doanloader.py`
2. **Content Filtering Improvements** (Nov 14, 2024)
- Introduced Relevance Content Filter as an improvement over Fit Markdown
- Implemented BM25 algorithm for content relevance matching
- Added new file: `crawl4ai/content_filter_strategy.py`
- Removed deprecated: `crawl4ai/content_cleaning_strategy.py`
3. **Local File and Raw HTML Support** (Nov 13, 2024)
- Added support for processing local files
- Implemented raw HTML input handling in AsyncWebCrawler
- Enhanced `crawl4ai/async_webcrawler.py` with significant performance improvements
4. **Browser Management Enhancements** (Nov 12, 2024)
- Implemented new async crawler strategy using Playwright
- Introduced ManagedBrowser for better browser session handling
- Added support for persistent browser sessions
- Updated from playwright_stealth to tf-playwright-stealth
5. **API Server Component**
- Added CORS support
- Implemented static file serving
- Enhanced root redirect functionality
## [0.3.731] - November 13, 2024
### Added
- Support for raw HTML and local file crawling via URL prefixes ('raw:', 'file://')
- Browser process monitoring for managed browser instances
- Screenshot capability for raw HTML and local file content
- Response headers storage in cache database
- New `fit_markdown` flag for optional markdown generation
### Changed
- Switched HTML parser from 'html.parser' to 'lxml' for ~4x performance improvement
- Optimized BeautifulSoup text conversion and element selection
- Pre-compiled regular expressions for better performance
- Improved metadata extraction efficiency
- Response headers now stored alongside HTML in cache
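As a small illustration of the pre-compilation change, a pattern can be compiled once at import time and reused on every call. The pattern and the `normalize_whitespace` helper are hypothetical examples, not taken from the codebase.

```python
import re

# Compiled once when the module is imported, not on every call.
WHITESPACE_RE = re.compile(r"\s+")


def normalize_whitespace(text: str) -> str:
    """Collapse runs of whitespace into single spaces and trim the ends."""
    return WHITESPACE_RE.sub(" ", text).strip()


if __name__ == "__main__":
    print(normalize_whitespace("  Crawl4AI \n\t changelog  "))
```

Python's `re` module caches recently used patterns internally, so the gain comes mainly from skipping the cache lookup in hot loops such as per-node text cleanup.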
### Removed
- `__del__` method from AsyncPlaywrightCrawlerStrategy to prevent async cleanup issues
### Fixed
- Issue #256: Added support for crawling raw HTML content
- Issue #253: Implemented file:// protocol handling
- Missing response headers in cached results
- Memory leaks from improper async cleanup
## [v0.3.731] - 2024-11-13 Changelog for Issue 256 Fix
- Fixed: Browser context unexpectedly closing in Docker environment during crawl operations.
- Removed: `__del__` method from AsyncPlaywrightCrawlerStrategy to prevent unreliable asynchronous cleanup, ensuring the browser context is closed explicitly within context managers.
- Added: Monitoring for ManagedBrowser subprocess to detect and log unexpected terminations.
- Updated: Dockerfile configurations to expose debugging port (9222) and allocate additional shared memory for improved browser stability.
- Improved: Error handling and resource cleanup processes for browser lifecycle management within the Docker environment.
## [v0.3.73] - 2024-11-05
@@ -70,7 +562,7 @@
- Modified database connection management approach
- Updated API response structure for better consistency
## Migration Guide
### Migration Guide
When upgrading to v0.3.73, be aware of the following changes:
1. Docker Deployment:
@@ -92,7 +584,7 @@ When upgrading to v0.3.73, be aware of the following changes:
- Follow recommended fixes for any identified problems
## [2024-11-04 - 13:21:42] Comprehensive Update of Crawl4AI Features and Dependencies
## [v0.3.73] - 2024-11-04
This commit introduces several key enhancements, including improved error handling and robust database operations in `async_database.py`, which now features a connection pool and retry logic for better reliability. Updates to the README.md provide clearer instructions and a better user experience with links to documentation sections. The `.gitignore` file has been refined to include additional directories, while the async web crawler now utilizes a managed browser for more efficient crawling. Furthermore, multiple dependency updates and introduction of the `CustomHTML2Text` class enhance text extraction capabilities.
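The retry logic mentioned for database operations can be sketched as a generic helper with exponential backoff. This is not the code in `async_database.py`: `with_retry`, its defaults, and the choice to retry only on `sqlite3.OperationalError` are assumptions made for illustration.

```python
import asyncio
import sqlite3
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def with_retry(
    op: Callable[[], Awaitable[T]],
    max_retries: int = 3,
    base_delay: float = 0.01,
) -> T:
    """Run an async operation, retrying transient database errors with backoff."""
    for attempt in range(max_retries):
        try:
            return await op()
        except sqlite3.OperationalError:
            if attempt == max_retries - 1:
                raise                                   # out of attempts, surface the error
            await asyncio.sleep(base_delay * (2 ** attempt))  # 1x, 2x, 4x, ...
    raise ValueError("max_retries must be at least 1")


if __name__ == "__main__":
    attempts = {"count": 0}

    async def flaky_query() -> str:
        attempts["count"] += 1
        if attempts["count"] < 3:
            raise sqlite3.OperationalError("database is locked")
        return "ok"

    print(asyncio.run(with_retry(flaky_query)), attempts["count"])
```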
## [v0.3.73] - 2024-10-24
@@ -180,7 +672,7 @@ This commit introduces several key enhancements, including improved error handli
## [v0.3.72] - 2024-10-20
### Fixed
- Added support for parsing Base64 encoded images in WebScrappingStrategy
- Added support for parsing Base64 encoded images in WebScrapingStrategy
### Added
- Forked and integrated a customized version of the html2text library for more control over Markdown generation
@@ -203,7 +695,7 @@ This commit introduces several key enhancements, including improved error handli
### Developer Notes
- The customized html2text library is now located within the crawl4ai package
- New configuration options are available in the `config.py` file for external content handling
- The `WebScrappingStrategy` class has been updated to accommodate new external content exclusion options
- The `WebScrapingStrategy` class has been updated to accommodate new external content exclusion options
## [v0.3.71] - 2024-10-19
@@ -280,7 +772,7 @@ These updates aim to provide more flexibility in text processing, improve perfor
### Improvements
1. **Better Error Handling**:
- Enhanced error reporting in WebScrappingStrategy with detailed error messages and suggestions.
- Enhanced error reporting in WebScrapingStrategy with detailed error messages and suggestions.
- Added console message and error logging for better debugging.
2. **Image Processing Enhancements**:
@@ -338,43 +830,43 @@ These updates aim to provide more flexibility in text processing, improve perfor
- Allows retrieval of content after a specified delay, useful for dynamically loaded content.
- **How to use**: Access `result.get_delayed_content(delay_in_seconds)` after crawling.
## Improvements and Optimizations
### Improvements and Optimizations
### 1. AsyncWebCrawler Enhancements
#### 1. AsyncWebCrawler Enhancements
- **Flexible Initialization**: Now accepts arbitrary keyword arguments, passed directly to the crawler strategy.
- Allows for more customized setups.
### 2. Image Processing Optimization
- Enhanced image handling in WebScrappingStrategy.
#### 2. Image Processing Optimization
- Enhanced image handling in WebScrapingStrategy.
- Added filtering for small, invisible, or irrelevant images.
- Improved image scoring system for better content relevance.
- Implemented JavaScript-based image dimension updating for more accurate representation.
### 3. Database Schema Auto-updates
#### 3. Database Schema Auto-updates
- Automatic database schema updates ensure compatibility with the latest version.
### 4. Enhanced Error Handling and Logging
#### 4. Enhanced Error Handling and Logging
- Improved error messages and logging for easier debugging.
### 5. Content Extraction Refinements
#### 5. Content Extraction Refinements
- Refined HTML sanitization process.
- Improved handling of base64 encoded images.
- Enhanced Markdown conversion process.
- Optimized content extraction algorithms.
### 6. Utility Function Enhancements
#### 6. Utility Function Enhancements
- `perform_completion_with_backoff` function now supports additional arguments for more customized API calls to LLM providers.
## Bug Fixes
### Bug Fixes
- Fixed an issue where image tags were being prematurely removed during content extraction.
## Examples and Documentation
### Examples and Documentation
- Updated `quickstart_async.py` with examples of:
- Using custom headers in LLM extraction.
- Different LLM provider usage (OpenAI, Hugging Face, Ollama).
- Custom browser type usage.
## Developer Notes
### Developer Notes
- Refactored code for better maintainability, flexibility, and performance.
- Enhanced type hinting throughout the codebase for improved development experience.
- Expanded error handling for more robust operation.


@@ -10,11 +10,21 @@ We would like to thank the following people for their contributions to Crawl4AI:
## Community Contributors
- [aadityakanjolia4](https://github.com/aadityakanjolia4) - Fix for `CustomHTML2Text` is not defined.
- [FractalMind](https://github.com/FractalMind) - Created the first official Docker Hub image and fixed Dockerfile errors
- [ketonkss4](https://github.com/ketonkss4) - Identified Selenium's new capabilities, helping reduce dependencies
- [jonymusky](https://github.com/jonymusky) - Javascript execution documentation, and wait_for
- [datehoer](https://github.com/datehoer) - Add browser proxy support
## Pull Requests
- [dvschuyl](https://github.com/dvschuyl) - AsyncPlaywrightCrawlerStrategy page-evaluate context destroyed by navigation [#304](https://github.com/unclecode/crawl4ai/pull/304)
- [nelzomal](https://github.com/nelzomal) - Enhance development installation instructions [#286](https://github.com/unclecode/crawl4ai/pull/286)
- [HamzaFarhan](https://github.com/HamzaFarhan) - Handled the cases where markdown_with_citations, references_markdown, and filtered_html might not be defined [#293](https://github.com/unclecode/crawl4ai/pull/293)
- [NanmiCoder](https://github.com/NanmiCoder) - fix: crawler strategy exception handling and fixes [#271](https://github.com/unclecode/crawl4ai/pull/271)
- [paulokuong](https://github.com/paulokuong) - fix: RAWL4_AI_BASE_DIRECTORY should be Path object instead of string [#298](https://github.com/unclecode/crawl4ai/pull/298)
## Other Contributors
- [Gokhan](https://github.com/gkhngyk)


@@ -1,6 +1,9 @@
# syntax=docker/dockerfile:1.4
# Build arguments
ARG TARGETPLATFORM
ARG BUILDPLATFORM
# Other build arguments
ARG PYTHON_VERSION=3.10
# Base stage with system dependencies
@@ -12,7 +15,7 @@ ARG ENABLE_GPU=false
# Platform-specific labels
LABEL maintainer="unclecode"
LABEL description="Crawl4AI - Advanced Web Crawler with AI capabilities"
LABEL description="🔥🕷️ Crawl4AI: Open-source LLM Friendly Web Crawler & scraper"
LABEL version="1.0"
# Environment setup
@@ -62,12 +65,14 @@ RUN apt-get update && apt-get install -y --no-install-recommends \
libatspi2.0-0 \
&& rm -rf /var/lib/apt/lists/*
# GPU support if enabled
RUN if [ "$ENABLE_GPU" = "true" ] ; then \
# GPU support if enabled and architecture is supported
RUN if [ "$ENABLE_GPU" = "true" ] && [ "$TARGETPLATFORM" = "linux/amd64" ] ; then \
apt-get update && apt-get install -y --no-install-recommends \
nvidia-cuda-toolkit \
&& rm -rf /var/lib/apt/lists/* ; \
fi
else \
echo "Skipping NVIDIA CUDA Toolkit installation (unsupported platform or GPU disabled)"; \
fi
# Create and set working directory
WORKDIR /app
@@ -96,26 +101,36 @@ RUN if [ "$INSTALL_TYPE" = "all" ] ; then \
# Install the package
RUN if [ "$INSTALL_TYPE" = "all" ] ; then \
pip install -e ".[all]" && \
pip install ".[all]" && \
python -m crawl4ai.model_loader ; \
elif [ "$INSTALL_TYPE" = "torch" ] ; then \
pip install -e ".[torch]" ; \
pip install ".[torch]" ; \
elif [ "$INSTALL_TYPE" = "transformer" ] ; then \
pip install -e ".[transformer]" && \
pip install ".[transformer]" && \
python -m crawl4ai.model_loader ; \
else \
pip install -e "." ; \
pip install "." ; \
fi
# Install Playwright and browsers
RUN playwright install
# Install MkDocs and required plugins
RUN pip install --no-cache-dir \
mkdocs \
mkdocs-material \
mkdocs-terminal \
pymdown-extensions
# Health check
HEALTHCHECK --interval=30s --timeout=30s --start-period=5s --retries=3 \
CMD curl -f http://localhost:8000/health || exit 1
# Build MkDocs documentation
RUN mkdocs build
# Install Playwright and browsers
RUN if [ "$TARGETPLATFORM" = "linux/amd64" ]; then \
playwright install chromium; \
elif [ "$TARGETPLATFORM" = "linux/arm64" ]; then \
playwright install chromium; \
fi
# Expose port
EXPOSE 8000
EXPOSE 8000 11235 9222 8080
# Start the FastAPI server
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "11235"]


@@ -1 +1,2 @@
include requirements.txt
include requirements.txt
recursive-include crawl4ai/js_snippet *.js

776
README.md

@@ -1,4 +1,4 @@
# 🔥🕷️ Crawl4AI: LLM Friendly Web Crawler & Scraper
# 🚀🤖 Crawl4AI: Crawl Smarter, Faster, Freely. For AI.
<a href="https://trendshift.io/repositories/11716" target="_blank"><img src="https://trendshift.io/api/badge/repositories/11716" alt="unclecode%2Fcrawl4ai | Trendshift" style="width: 250px; height: 55px;" width="250" height="55"/></a>
@@ -9,23 +9,119 @@
[![GitHub Pull Requests](https://img.shields.io/github/issues-pr/unclecode/crawl4ai)](https://github.com/unclecode/crawl4ai/pulls)
[![License](https://img.shields.io/github/license/unclecode/crawl4ai)](https://github.com/unclecode/crawl4ai/blob/main/LICENSE)
Crawl4AI simplifies asynchronous web crawling and data extraction, making it accessible for large language models (LLMs) and AI applications. 🆓🌐
Crawl4AI is the #1 trending GitHub repository, actively maintained by a vibrant community. It delivers blazing-fast, AI-ready web crawling tailored for LLMs, AI agents, and data pipelines. Open source, flexible, and built for real-time performance, Crawl4AI empowers developers with unmatched speed, precision, and deployment ease.
## 🌟 Meet the Crawl4AI Assistant: Your Copilot for Crawling
[✨ Check out latest update v0.4.2](#-recent-updates)
Use the [Crawl4AI GPT Assistant](https://tinyurl.com/crawl4ai-gpt) as your AI-powered copilot! With this assistant, you can:
🎉 **Version 0.4.2 is out!** Introducing our experimental PruningContentFilter - a powerful new algorithm for smarter Markdown generation. Test it out and [share your feedback](https://github.com/unclecode/crawl4ai/issues)! [Read the release notes →](https://crawl4ai.com/mkdocs/blog)
- 🧑‍💻 Generate code for complex crawling and extraction tasks
- 💡 Get tailored support and examples
- 📘 Learn Crawl4AI faster with step-by-step guidance
## 🧐 Why Crawl4AI?
## New in 0.3.73 ✨
1. **Built for LLMs**: Creates smart, concise Markdown optimized for RAG and fine-tuning applications.
2. **Lightning Fast**: Delivers results 6x faster with real-time, cost-efficient performance.
3. **Flexible Browser Control**: Offers session management, proxies, and custom hooks for seamless data access.
4. **Heuristic Intelligence**: Uses advanced algorithms for efficient extraction, reducing reliance on costly models.
5. **Open Source & Deployable**: Fully open-source with no API keys—ready for Docker and cloud integration.
6. **Thriving Community**: Actively maintained by a vibrant community and the #1 trending GitHub repository.
- 🐳 Docker Ready: Full API server with seamless deployment & scaling
- 🎯 Browser Takeover: Use your own browser with cookies & history intact (CDP support)
- 📝 Mockdown+: Enhanced tag preservation & content extraction
- ⚡️ Parallel Power: Supercharged multi-URL crawling performance
- 🌟 And many more exciting updates...
## 🚀 Quick Start
1. Install Crawl4AI:
```bash
pip install crawl4ai
crawl4ai-setup # Setup the browser
```
2. Run a simple web crawl:
```python
import asyncio
from crawl4ai import AsyncWebCrawler, CacheMode

async def main():
    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(url="https://www.nbcnews.com/business")
        # Soon this will change to result.markdown
        print(result.markdown_v2.raw_markdown)

if __name__ == "__main__":
    asyncio.run(main())
```
## ✨ Features
<details>
<summary>📝 <strong>Markdown Generation</strong></summary>
- 🧹 **Clean Markdown**: Generates clean, structured Markdown with accurate formatting.
- 🎯 **Fit Markdown**: Heuristic-based filtering to remove noise and irrelevant parts for AI-friendly processing.
- 🔗 **Citations and References**: Converts page links into a numbered reference list with clean citations.
- 🛠️ **Custom Strategies**: Users can create their own Markdown generation strategies tailored to specific needs.
- 📚 **BM25 Algorithm**: Employs BM25-based filtering for extracting core information and removing irrelevant content.
</details>
<details>
<summary>📊 <strong>Structured Data Extraction</strong></summary>
- 🤖 **LLM-Driven Extraction**: Supports all LLMs (open-source and proprietary) for structured data extraction.
- 🧱 **Chunking Strategies**: Implements chunking (topic-based, regex, sentence-level) for targeted content processing.
- 🌌 **Cosine Similarity**: Find relevant content chunks based on user queries for semantic extraction.
- 🔎 **CSS-Based Extraction**: Fast schema-based data extraction using XPath and CSS selectors.
- 🔧 **Schema Definition**: Define custom schemas for extracting structured JSON from repetitive patterns.
</details>
<details>
<summary>🌐 <strong>Browser Integration</strong></summary>
- 🖥️ **Managed Browser**: Use user-owned browsers with full control, avoiding bot detection.
- 🔄 **Remote Browser Control**: Connect to Chrome Developer Tools Protocol for remote, large-scale data extraction.
- 🔒 **Session Management**: Preserve browser states and reuse them for multi-step crawling.
- 🧩 **Proxy Support**: Seamlessly connect to proxies with authentication for secure access.
- ⚙️ **Full Browser Control**: Modify headers, cookies, user agents, and more for tailored crawling setups.
- 🌍 **Multi-Browser Support**: Compatible with Chromium, Firefox, and WebKit.
- 📐 **Dynamic Viewport Adjustment**: Automatically adjusts the browser viewport to match page content, ensuring complete rendering and capturing of all elements.
</details>
<details>
<summary>🔎 <strong>Crawling & Scraping</strong></summary>
- 🖼️ **Media Support**: Extract images, audio, videos, and responsive image formats like `srcset` and `picture`.
- 🚀 **Dynamic Crawling**: Execute JavaScript and wait for asynchronous or synchronous conditions before extracting dynamic content.
- 📸 **Screenshots**: Capture page screenshots during crawling for debugging or analysis.
- 📂 **Raw Data Crawling**: Directly process raw HTML (`raw:`) or local files (`file://`).
- 🔗 **Comprehensive Link Extraction**: Extracts internal, external links, and embedded iframe content.
- 🛠️ **Customizable Hooks**: Define hooks at every step to customize crawling behavior.
- 💾 **Caching**: Cache data for improved speed and to avoid redundant fetches.
- 📄 **Metadata Extraction**: Retrieve structured metadata from web pages.
- 📡 **IFrame Content Extraction**: Seamless extraction from embedded iframe content.
- 🕵️ **Lazy Load Handling**: Waits for images to fully load, ensuring no content is missed due to lazy loading.
- 🔄 **Full-Page Scanning**: Simulates scrolling to load and capture all dynamic content, perfect for infinite scroll pages.
</details>
<details>
<summary>🚀 <strong>Deployment</strong></summary>
- 🐳 **Dockerized Setup**: Optimized Docker image with API server for easy deployment.
- 🔄 **API Gateway**: One-click deployment with secure token authentication for API-based workflows.
- 🌐 **Scalable Architecture**: Designed for mass-scale production and optimized server performance.
- ⚙️ **DigitalOcean Deployment**: Ready-to-deploy configurations for DigitalOcean and similar platforms.
</details>
<details>
<summary>🎯 <strong>Additional Features</strong></summary>
- 🕶️ **Stealth Mode**: Avoid bot detection by mimicking real users.
- 🏷️ **Tag-Based Content Extraction**: Refine crawling based on custom tags, headers, or metadata.
- 🔗 **Link Analysis**: Extract and analyze all links for detailed data exploration.
- 🛡️ **Error Handling**: Robust error management for seamless execution.
- 🔐 **CORS & Static Serving**: Supports filesystem-based caching and cross-origin requests.
- 📖 **Clear Documentation**: Simplified and updated guides for onboarding and advanced usage.
- 🙌 **Community Recognition**: Acknowledges contributors and pull requests for transparency.
</details>
## Try it Now!
@@ -33,53 +129,27 @@ Use the [Crawl4AI GPT Assistant](https://tinyurl.com/crawl4ai-gpt) as your AI-po
✨ Visit our [Documentation Website](https://crawl4ai.com/mkdocs/)
## Features ✨
- 🆓 Completely free and open-source
- 🚀 Blazing fast performance, outperforming many paid services
- 🤖 LLM-friendly output formats (JSON, cleaned HTML, markdown)
- 🌐 Multi-browser support (Chromium, Firefox, WebKit)
- 🌍 Supports crawling multiple URLs simultaneously
- 🎨 Extracts and returns all media tags (Images, Audio, and Video)
- 🔗 Extracts all external and internal links
- 📚 Extracts metadata from the page
- 🔄 Custom hooks for authentication, headers, and page modifications
- 🕵️ User-agent customization
- 🖼️ Takes screenshots of pages with enhanced error handling
- 📜 Executes multiple custom JavaScripts before crawling
- 📊 Generates structured output without LLM using JsonCssExtractionStrategy
- 📚 Various chunking strategies: topic-based, regex, sentence, and more
- 🧠 Advanced extraction strategies: cosine clustering, LLM, and more
- 🎯 CSS selector support for precise data extraction
- 📝 Passes instructions/keywords to refine extraction
- 🔒 Proxy support with authentication for enhanced access
- 🔄 Session management for complex multi-page crawling
- 🌐 Asynchronous architecture for improved performance
- 🖼️ Improved image processing with lazy-loading detection
- 🕰️ Enhanced handling of delayed content loading
- 🔑 Custom headers support for LLM interactions
- 🖼️ iframe content extraction for comprehensive analysis
- ⏱️ Flexible timeout and delayed content retrieval options
## Installation 🛠️
Crawl4AI offers flexible installation options to suit various use cases. You can install it as a Python package or use Docker.
### Using pip 🐍
<details>
<summary>🐍 <strong>Using pip</strong></summary>
Choose the installation option that best fits your needs:
#### Basic Installation
### Basic Installation
For basic web crawling and scraping tasks:
```bash
pip install crawl4ai
crawl4ai-setup # Setup the browser
```
By default, this will install the asynchronous version of Crawl4AI, using Playwright for web crawling.
👉 Note: When you install Crawl4AI, the setup script should automatically install and set up Playwright. However, if you encounter any Playwright-related errors, you can manually install it using one of these methods:
👉 **Note**: When you install Crawl4AI, the `crawl4ai-setup` should automatically install and set up Playwright. However, if you encounter any Playwright-related errors, you can manually install it using one of these methods:
1. Through the command line:
@@ -95,57 +165,237 @@ By default, this will install the asynchronous version of Crawl4AI, using Playwr
This second method has proven to be more reliable in some cases.
#### Installation with Synchronous Version
---
If you need the synchronous version using Selenium:
### Installation with Synchronous Version
The sync version is deprecated and will be removed in future versions. If you need the synchronous version using Selenium:
```bash
pip install crawl4ai[sync]
```
#### Development Installation
---
### Development Installation
For contributors who plan to modify the source code:
```bash
git clone https://github.com/unclecode/crawl4ai.git
cd crawl4ai
pip install -e .
pip install -e . # Basic installation in editable mode
```
### Using Docker 🐳
Install optional features:
```bash
pip install -e ".[torch]" # With PyTorch features
pip install -e ".[transformer]" # With Transformer features
pip install -e ".[cosine]" # With cosine similarity features
pip install -e ".[sync]" # With synchronous crawling (Selenium)
pip install -e ".[all]" # Install all optional features
```
</details>
<details>
<summary>🚀 <strong>One-Click Deployment</strong></summary>
Deploy your own instance of Crawl4AI with one click:
[![DigitalOcean Referral Badge](https://web-platforms.sfo2.cdn.digitaloceanspaces.com/WWW/Badge%203.svg)](https://www.digitalocean.com/?repo=https://github.com/unclecode/crawl4ai/tree/0.3.74&refcode=a0780f1bdb3d&utm_campaign=Referral_Invite&utm_medium=Referral_Program&utm_source=badge)
> 💡 **Recommended specs**: 4GB RAM minimum. Select "professional-xs" or higher when deploying for stable operation.
The deploy will:
- Set up a Docker container with Crawl4AI
- Configure Playwright and all dependencies
- Start the FastAPI server on port `11235`
- Set up health checks and auto-deployment
</details>
<details>
<summary>🐳 <strong>Using Docker</strong></summary>
Crawl4AI is available as Docker images for easy deployment. You can either pull directly from Docker Hub (recommended) or build from the repository.
#### Option 1: Docker Hub (Recommended)
---
<details>
<summary>🐳 <strong>Option 1: Docker Hub (Recommended)</strong></summary>
Choose the appropriate image based on your platform and needs:
### For AMD64 (Regular Linux/Windows):
```bash
# Pull and run from Docker Hub (choose one):
docker pull unclecode/crawl4ai:basic # Basic crawling features
docker pull unclecode/crawl4ai:all # Full installation (ML, LLM support)
docker pull unclecode/crawl4ai:gpu # GPU-enabled version
# Basic version (recommended)
docker pull unclecode/crawl4ai:basic-amd64
docker run -p 11235:11235 unclecode/crawl4ai:basic-amd64
# Run the container
docker run -p 11235:11235 unclecode/crawl4ai:basic # Replace 'basic' with your chosen version
# Full ML/LLM support
docker pull unclecode/crawl4ai:all-amd64
docker run -p 11235:11235 unclecode/crawl4ai:all-amd64
# With GPU support
docker pull unclecode/crawl4ai:gpu-amd64
docker run -p 11235:11235 unclecode/crawl4ai:gpu-amd64
```
#### Option 2: Build from Repository
### For ARM64 (M1/M2 Macs, ARM servers):
```bash
# Basic version (recommended)
docker pull unclecode/crawl4ai:basic-arm64
docker run -p 11235:11235 unclecode/crawl4ai:basic-arm64
# Full ML/LLM support
docker pull unclecode/crawl4ai:all-arm64
docker run -p 11235:11235 unclecode/crawl4ai:all-arm64
# With GPU support
docker pull unclecode/crawl4ai:gpu-arm64
docker run -p 11235:11235 unclecode/crawl4ai:gpu-arm64
```
Need more memory? Add `--shm-size`:
```bash
docker run --shm-size=2gb -p 11235:11235 unclecode/crawl4ai:basic-amd64
```
Test the installation:
```bash
curl http://localhost:11235/health
```
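The same health check can be scripted, for example as a readiness probe before sending crawl requests. This is a minimal sketch using only the standard library; it assumes a healthy server answers `GET /health` with HTTP 200 and does not inspect the response body, whose format is not documented here. The function name `is_healthy` and the injectable `opener` parameter are illustrative, not part of Crawl4AI.

```python
import urllib.error
import urllib.request


def is_healthy(base_url="http://localhost:11235", timeout=5.0,
               opener=urllib.request.urlopen):
    """Return True if GET {base_url}/health answers with HTTP 200.

    Assumption: a 200 status means the server is ready. `opener` is a
    hypothetical hook so the function can be exercised without a server.
    """
    url = base_url.rstrip("/") + "/health"
    try:
        with opener(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, or an HTTP error status.
        return False
```

Calling `is_healthy()` with no arguments probes the default port mapping used in the `docker run` commands above.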
### For Raspberry Pi (32-bit) (coming soon):
```bash
# Pull and run basic version (recommended for Raspberry Pi)
docker pull unclecode/crawl4ai:basic-armv7
docker run -p 11235:11235 unclecode/crawl4ai:basic-armv7
# With increased shared memory if needed
docker run --shm-size=2gb -p 11235:11235 unclecode/crawl4ai:basic-armv7
```
Note: Due to hardware constraints, only the basic version is recommended for Raspberry Pi.
</details>
<details>
<summary>🐳 <strong>Option 2: Build from Repository</strong></summary>
Build the image locally based on your platform:
```bash
# Clone the repository
git clone https://github.com/unclecode/crawl4ai.git
cd crawl4ai
# Build the image
docker build -t crawl4ai:local \
--build-arg INSTALL_TYPE=basic \ # Options: basic, all
# For AMD64 (Regular Linux/Windows)
docker build --platform linux/amd64 \
--tag crawl4ai:local \
--build-arg INSTALL_TYPE=basic \
.
# Run your local build
docker run -p 11235:11235 crawl4ai:local
# For ARM64 (M1/M2 Macs, ARM servers)
docker build --platform linux/arm64 \
--tag crawl4ai:local \
--build-arg INSTALL_TYPE=basic \
.
```
Quick test (works for both options):
Build options:
- INSTALL_TYPE=basic (default): Basic crawling features
- INSTALL_TYPE=all: Full ML/LLM support
- ENABLE_GPU=true: Add GPU support
Example with all options:
```bash
docker build --platform linux/amd64 \
--tag crawl4ai:local \
--build-arg INSTALL_TYPE=all \
--build-arg ENABLE_GPU=true \
.
```
Run your local build:
```bash
# Regular run
docker run -p 11235:11235 crawl4ai:local
# With increased shared memory
docker run --shm-size=2gb -p 11235:11235 crawl4ai:local
```
Test the installation:
```bash
curl http://localhost:11235/health
```
</details>
<details>
<summary>🐳 <strong>Option 3: Using Docker Compose</strong></summary>
Docker Compose provides a more structured way to run Crawl4AI, especially when dealing with environment variables and multiple configurations.
```bash
# Clone the repository
git clone https://github.com/unclecode/crawl4ai.git
cd crawl4ai
```
### For AMD64 (Regular Linux/Windows):
```bash
# Build and run locally
docker-compose --profile local-amd64 up
# Run from Docker Hub
VERSION=basic docker-compose --profile hub-amd64 up # Basic version
VERSION=all docker-compose --profile hub-amd64 up # Full ML/LLM support
VERSION=gpu docker-compose --profile hub-amd64 up # GPU support
```
### For ARM64 (M1/M2 Macs, ARM servers):
```bash
# Build and run locally
docker-compose --profile local-arm64 up
# Run from Docker Hub
VERSION=basic docker-compose --profile hub-arm64 up # Basic version
VERSION=all docker-compose --profile hub-arm64 up # Full ML/LLM support
VERSION=gpu docker-compose --profile hub-arm64 up # GPU support
```
Environment variables (optional):
```bash
# Create a .env file
CRAWL4AI_API_TOKEN=your_token
OPENAI_API_KEY=your_openai_key
CLAUDE_API_KEY=your_claude_key
```
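Before starting the stack, it can help to confirm which of these variables the `.env` file actually defines. The sketch below is a deliberately simplified parser (plain `KEY=VALUE` lines only, no quoting or `export` handling), and the helper names `parse_env` and `unset_keys` are hypothetical. The key names are the three listed above.

```python
KNOWN_KEYS = ("CRAWL4AI_API_TOKEN", "OPENAI_API_KEY", "CLAUDE_API_KEY")


def parse_env(text):
    """Parse simple KEY=VALUE lines, skipping blanks and '#' comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def unset_keys(text, keys=KNOWN_KEYS):
    """Return the keys that are absent or empty in the given .env text."""
    values = parse_env(text)
    return [key for key in keys if not values.get(key)]
```

For example, `unset_keys(open(".env").read())` returns the variables that are still missing; since all three are optional, an empty or partial file is not an error.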
The compose file includes:
- Memory management (4GB limit, 1GB reserved)
- Shared memory volume for browser support
- Health checks
- Auto-restart policy
- All necessary port mappings
Test the installation:
```bash
curl http://localhost:11235/health
```
</details>
---
### Quick Test
Run a quick test (works for all Docker options):
```python
import requests
@@ -156,149 +406,143 @@ response = requests.post(
)
task_id = response.json()["task_id"]
# Get results
# Continue polling until the task is complete (status="completed")
result = requests.get(f"http://localhost:11235/task/{task_id}")
```
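The polling step mentioned in the comment above can be sketched as a small loop. This version uses only the standard library so it runs without extra dependencies; it assumes the `/task/{task_id}` response is JSON with a `status` field that becomes `"completed"`, as the comment suggests. Other status values (for example a failure state) are not covered by the source, so the loop simply gives up after `max_wait` seconds. The names `fetch_task` and `wait_for_task` are illustrative.

```python
import json
import time
import urllib.request


def fetch_task(task_id, base_url="http://localhost:11235"):
    """GET /task/{task_id} and decode the JSON body."""
    with urllib.request.urlopen(f"{base_url}/task/{task_id}", timeout=10) as response:
        return json.load(response)


def wait_for_task(task_id, fetch=fetch_task, interval=1.0, max_wait=60.0,
                  sleep=time.sleep):
    """Poll until the task reports status == "completed" or max_wait elapses."""
    waited = 0.0
    while True:
        task = fetch(task_id)
        if task.get("status") == "completed":
            return task
        if waited >= max_wait:
            raise TimeoutError(f"task {task_id} not completed after {max_wait} seconds")
        sleep(interval)
        waited += interval
```

With a running container, `wait_for_task(task_id)` would replace the single `requests.get` call; the `fetch` and `sleep` parameters exist only to make the loop easy to test.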
For advanced configuration, environment variables, and usage examples, see our [Docker Deployment Guide](https://crawl4ai.com/mkdocs/basic/docker-deployment/).
For more examples, see our [Docker Examples](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/docker_example.py). For advanced configuration, environment variables, and usage examples, see our [Docker Deployment Guide](https://crawl4ai.com/mkdocs/basic/docker-deployment/).
</details>
## Quick Start 🚀
## 🔬 Advanced Usage Examples 🔬
The [docs/examples](docs/examples) directory contains a variety of examples; a few popular ones are shown here.
<details>
<summary>📝 <strong>Heuristic Markdown Generation with Clean and Fit Markdown</strong></summary>
```python
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai import AsyncWebCrawler, CacheMode
from crawl4ai.content_filter_strategy import PruningContentFilter, BM25ContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
async def main():
async with AsyncWebCrawler(verbose=True) as crawler:
result = await crawler.arun(url="https://www.nbcnews.com/business")
print(result.markdown)
if __name__ == "__main__":
asyncio.run(main())
```
## Advanced Usage 🔬
### Executing JavaScript and Using CSS Selectors
```python
import asyncio
from crawl4ai import AsyncWebCrawler
async def main():
async with AsyncWebCrawler(verbose=True) as crawler:
js_code = ["const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More')); loadMoreButton && loadMoreButton.click();"]
async with AsyncWebCrawler(
headless=True,
verbose=True,
) as crawler:
result = await crawler.arun(
url="https://www.nbcnews.com/business",
js_code=js_code,
css_selector=".wide-tease-item__description",
bypass_cache=True
url="https://docs.micronaut.io/4.7.6/guide/",
cache_mode=CacheMode.ENABLED,
markdown_generator=DefaultMarkdownGenerator(
content_filter=PruningContentFilter(threshold=0.48, threshold_type="fixed", min_word_threshold=0)
),
# markdown_generator=DefaultMarkdownGenerator(
# content_filter=BM25ContentFilter(user_query="WHEN_WE_FOCUS_BASED_ON_A_USER_QUERY", bm25_threshold=1.0)
# ),
)
print(result.extracted_content)
print(len(result.markdown))
print(len(result.fit_markdown))
print(len(result.markdown_v2.fit_markdown))
if __name__ == "__main__":
asyncio.run(main())
```
### Using a Proxy
</details>
<details>
<summary>🖥️ <strong>Executing JavaScript & Extract Structured Data without LLMs</strong></summary>
```python
import asyncio
from crawl4ai import AsyncWebCrawler
async def main():
async with AsyncWebCrawler(verbose=True, proxy="http://127.0.0.1:7890") as crawler:
result = await crawler.arun(
url="https://www.nbcnews.com/business",
bypass_cache=True
)
print(result.markdown)
if __name__ == "__main__":
asyncio.run(main())
```
### Extracting Structured Data without LLM
The `JsonCssExtractionStrategy` allows for precise extraction of structured data from web pages using CSS selectors.
```python
import asyncio
import json
from crawl4ai import AsyncWebCrawler
from crawl4ai import AsyncWebCrawler, CacheMode
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
import json
async def extract_news_teasers():
async def main():
schema = {
"name": "News Teaser Extractor",
"baseSelector": ".wide-tease-item__wrapper",
"fields": [
{
"name": "category",
"selector": ".unibrow span[data-testid='unibrow-text']",
"type": "text",
},
{
"name": "headline",
"selector": ".wide-tease-item__headline",
"type": "text",
},
{
"name": "summary",
"selector": ".wide-tease-item__description",
"type": "text",
},
{
"name": "time",
"selector": "[data-testid='wide-tease-date']",
"type": "text",
},
{
"name": "image",
"type": "nested",
"selector": "picture.teasePicture img",
"fields": [
{"name": "src", "type": "attribute", "attribute": "src"},
{"name": "alt", "type": "attribute", "attribute": "alt"},
],
},
{
"name": "link",
"selector": "a[href]",
"type": "attribute",
"attribute": "href",
},
],
}
"name": "KidoCode Courses",
"baseSelector": "section.charge-methodology .w-tab-content > div",
"fields": [
{
"name": "section_title",
"selector": "h3.heading-50",
"type": "text",
},
{
"name": "section_description",
"selector": ".charge-content",
"type": "text",
},
{
"name": "course_name",
"selector": ".text-block-93",
"type": "text",
},
{
"name": "course_description",
"selector": ".course-content-text",
"type": "text",
},
{
"name": "course_icon",
"selector": ".image-92",
"type": "attribute",
"attribute": "src"
}
]
}
extraction_strategy = JsonCssExtractionStrategy(schema, verbose=True)
async with AsyncWebCrawler(verbose=True) as crawler:
async with AsyncWebCrawler(
headless=False,
verbose=True
) as crawler:
# Create the JavaScript that handles clicking multiple times
js_click_tabs = """
(async () => {
const tabs = document.querySelectorAll("section.charge-methodology .tabs-menu-3 > div");
for(let tab of tabs) {
// scroll to the tab
tab.scrollIntoView();
tab.click();
// Wait for content to load and animations to complete
await new Promise(r => setTimeout(r, 500));
}
})();
"""
result = await crawler.arun(
url="https://www.nbcnews.com/business",
extraction_strategy=extraction_strategy,
bypass_cache=True,
url="https://www.kidocode.com/degrees/technology",
extraction_strategy=JsonCssExtractionStrategy(schema, verbose=True),
js_code=[js_click_tabs],
cache_mode=CacheMode.BYPASS
)
assert result.success, "Failed to crawl the page"
courses = json.loads(result.extracted_content)
print(f"Successfully extracted {len(courses)} courses")
print(json.dumps(courses[0], indent=2))
news_teasers = json.loads(result.extracted_content)
print(f"Successfully extracted {len(news_teasers)} news teasers")
print(json.dumps(news_teasers[0], indent=2))
if __name__ == "__main__":
asyncio.run(extract_news_teasers())
asyncio.run(main())
```
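Schemas like the ones above are plain dictionaries, so a typo in a key only surfaces at crawl time. The following sketch checks a schema's shape before it is passed to `JsonCssExtractionStrategy`. It is a hypothetical helper, not part of Crawl4AI, and the rules are inferred only from the examples shown here: top-level `name`, `baseSelector`, and `fields`; each field has `name` and `type`; `attribute` fields need an `attribute` key; `nested` fields need their own `fields`. The library may accept more field types than this check knows about.

```python
def check_schema(schema):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    for key in ("name", "baseSelector", "fields"):
        if key not in schema:
            problems.append(f"schema: missing '{key}'")
    problems.extend(_check_fields(schema.get("fields", []), "fields"))
    return problems


def _check_fields(fields, path):
    """Check each field dictionary, recursing into nested field lists."""
    problems = []
    for index, field in enumerate(fields):
        where = f"{path}[{index}]"
        for key in ("name", "type"):
            if key not in field:
                problems.append(f"{where}: missing '{key}'")
        kind = field.get("type")
        if kind == "attribute" and "attribute" not in field:
            problems.append(f"{where}: type 'attribute' needs an 'attribute' key")
        if kind == "nested":
            if "fields" not in field:
                problems.append(f"{where}: type 'nested' needs a 'fields' list")
            else:
                problems.extend(_check_fields(field["fields"], f"{where}.fields"))
    return problems
```

Running `check_schema(schema)` before `crawler.arun(...)` and printing any returned messages keeps schema mistakes separate from crawl failures.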
For more advanced usage examples, check out our [Examples](https://crawl4ai.com/mkdocs/extraction/css-advanced/) section in the documentation.
</details>
### Extracting Structured Data with OpenAI
<details>
<summary>📚 <strong>Extracting Structured Data with LLMs</strong></summary>
```python
import os
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai import AsyncWebCrawler, CacheMode
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from pydantic import BaseModel, Field
@@ -313,6 +557,8 @@ async def main():
url='https://openai.com/api/pricing/',
word_count_threshold=1,
extraction_strategy=LLMExtractionStrategy(
# Here you can use any provider that the LiteLLM library supports, for instance: ollama/qwen2
# provider="ollama/qwen2", api_token="no-token",
provider="openai/gpt-4o", api_token=os.getenv('OPENAI_API_KEY'),
schema=OpenAIModelFee.schema(),
extraction_type="schema",
@@ -320,7 +566,7 @@ async def main():
Do not miss any models in the entire content. One extracted model JSON format should look like this:
{"model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", "output_fee": "US$30.00 / 1M tokens"}."""
),
bypass_cache=True,
cache_mode=CacheMode.BYPASS,
)
print(result.extracted_content)
@@ -328,143 +574,107 @@ if __name__ == "__main__":
asyncio.run(main())
```
### Session Management and Dynamic Content Crawling
</details>
Crawl4AI excels at handling complex scenarios, such as crawling multiple pages with dynamic content loaded via JavaScript. Here's an example of crawling GitHub commits across multiple pages:
<details>
<summary>🤖 <strong>Using Your Own Browser with a Custom User Profile</strong></summary>
```python
import asyncio
import re
from bs4 import BeautifulSoup
import os, sys
from pathlib import Path
import asyncio, time
from crawl4ai import AsyncWebCrawler, CacheMode
async def crawl_typescript_commits():
first_commit = ""
async def on_execution_started(page):
nonlocal first_commit
try:
while True:
await page.wait_for_selector('li.Box-sc-g0xbh4-0 h4')
commit = await page.query_selector('li.Box-sc-g0xbh4-0 h4')
commit = await commit.evaluate('(element) => element.textContent')
commit = re.sub(r'\s+', '', commit)
if commit and commit != first_commit:
first_commit = commit
break
await asyncio.sleep(0.5)
except Exception as e:
print(f"Warning: New content didn't appear after JavaScript execution: {e}")
async def test_news_crawl():
# Create a persistent user data directory
user_data_dir = os.path.join(Path.home(), ".crawl4ai", "browser_profile")
os.makedirs(user_data_dir, exist_ok=True)
async with AsyncWebCrawler(verbose=True) as crawler:
crawler.crawler_strategy.set_hook('on_execution_started', on_execution_started)
url = "https://github.com/microsoft/TypeScript/commits/main"
session_id = "typescript_commits_session"
all_commits = []
js_next_page = """
const button = document.querySelector('a[data-testid="pagination-next-button"]');
if (button) button.click();
"""
for page in range(3): # Crawl 3 pages
result = await crawler.arun(
url=url,
session_id=session_id,
css_selector="li.Box-sc-g0xbh4-0",
js=js_next_page if page > 0 else None,
bypass_cache=True,
js_only=page > 0
)
assert result.success, f"Failed to crawl page {page + 1}"
soup = BeautifulSoup(result.cleaned_html, 'html.parser')
commits = soup.select("li")
all_commits.extend(commits)
print(f"Page {page + 1}: Found {len(commits)} commits")
await crawler.crawler_strategy.kill_session(session_id)
print(f"Successfully crawled {len(all_commits)} commits across 3 pages")
if __name__ == "__main__":
asyncio.run(crawl_typescript_commits())
async with AsyncWebCrawler(
verbose=True,
headless=True,
user_data_dir=user_data_dir,
use_persistent_context=True,
headers={
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
"Accept-Language": "en-US,en;q=0.5",
"Accept-Encoding": "gzip, deflate, br",
"DNT": "1",
"Connection": "keep-alive",
"Upgrade-Insecure-Requests": "1",
"Sec-Fetch-Dest": "document",
"Sec-Fetch-Mode": "navigate",
"Sec-Fetch-Site": "none",
"Sec-Fetch-User": "?1",
"Cache-Control": "max-age=0",
}
) as crawler:
url = "ADDRESS_OF_A_CHALLENGING_WEBSITE"
result = await crawler.arun(
url,
cache_mode=CacheMode.BYPASS,
magic=True,
)
print(f"Successfully crawled {url}")
print(f"Content length: {len(result.markdown)}")
```
This example demonstrates Crawl4AI's ability to handle complex scenarios where content is loaded asynchronously. It crawls multiple pages of GitHub commits, executing JavaScript to load new content and using custom hooks to ensure data is loaded before proceeding.
For more advanced usage examples, check out our [Examples](https://crawl4ai.com/mkdocs/tutorial/episode_12_Session-Based_Crawling_for_Dynamic_Websites/) section in the documentation.
</details>
## Speed Comparison 🚀
## ✨ Recent Updates
Crawl4AI is designed with speed as a primary focus. Our goal is to provide the fastest possible response with high-quality data extraction, minimizing abstractions between the data and the user.
- 🔧 **Configurable Crawlers and Browsers**: Simplified crawling with `BrowserConfig` and `CrawlerRunConfig`, making setups cleaner and more scalable.
- 🔐 **Session Management Enhancements**: Import/export local storage for personalized crawling with seamless session reuse.
- 📸 **Supercharged Screenshots**: Take lightning-fast, full-page screenshots of very long pages.
- 📜 **Full-Page PDF Export**: Convert any web page into a PDF for easy sharing or archiving.
- 🖼️ **Lazy Load Handling**: Improved support for websites with lazy-loaded images. The crawler now waits for all images to fully load, ensuring no content is missed.
- ⚡ **Text-Only Mode**: New mode for fast, lightweight crawling. Disables images, JavaScript, and GPU rendering, improving speed by 3-4x for text-focused crawls.
- 📐 **Dynamic Viewport Adjustment**: Automatically adjusts the browser viewport to fit page content, ensuring accurate rendering and capturing of all elements.
- 🔄 **Full-Page Scanning**: Added scrolling support for pages with infinite scroll or dynamic content loading. Ensures every part of the page is captured.
- 🧑‍💻 **Session Reuse**: Introduced `create_session` for efficient crawling by reusing the same browser session across multiple requests.
- 🌟 **Light Mode**: Optimized browser performance by disabling unnecessary features like extensions, background timers, and sync processes.
We've conducted a speed comparison between Crawl4AI and Firecrawl, a paid service. The results demonstrate Crawl4AI's superior performance:
```bash
Firecrawl:
Time taken: 7.02 seconds
Content length: 42074 characters
Images found: 49
Read the full details of this release in our [0.4.2 Release Notes](https://github.com/unclecode/crawl4ai/blob/main/docs/md_v2/blog/releases/0.4.2.md).
Crawl4AI (simple crawl):
Time taken: 1.60 seconds
Content length: 18238 characters
Images found: 49
## 📖 Documentation & Roadmap
Crawl4AI (with JavaScript execution):
Time taken: 4.64 seconds
Content length: 40869 characters
Images found: 89
```
> 🚨 **Documentation Update Alert**: We're undertaking a major documentation overhaul next week to reflect recent updates and improvements. Stay tuned for a more comprehensive and up-to-date guide!
As you can see, Crawl4AI outperforms Firecrawl significantly:
For current documentation, including installation instructions, advanced features, and API reference, visit our [Documentation Website](https://crawl4ai.com/mkdocs/).
- Simple crawl: Crawl4AI is over 4 times faster than Firecrawl.
- With JavaScript execution: Even when executing JavaScript to load more content (doubling the number of images found), Crawl4AI is still faster than Firecrawl's simple crawl.
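The ratios behind these two claims follow directly from the timings in the comparison output. The snippet below only redoes that arithmetic with the three reported times; it is not the benchmark itself, and the function name `speedup` is illustrative.

```python
def speedup(baseline_seconds, candidate_seconds):
    """How many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds


FIRECRAWL = 7.02          # seconds, from the output above
CRAWL4AI_SIMPLE = 1.60    # seconds
CRAWL4AI_WITH_JS = 4.64   # seconds

print(f"simple crawl: {speedup(FIRECRAWL, CRAWL4AI_SIMPLE):.1f}x")    # 4.4x
print(f"with JS:      {speedup(FIRECRAWL, CRAWL4AI_WITH_JS):.1f}x")   # 1.5x
```

These figures come from a single reported run, so they indicate relative speed rather than a statistically robust measurement.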
To check our development plans and upcoming features, visit our [Roadmap](https://github.com/unclecode/crawl4ai/blob/main/ROADMAP.md).
You can find the full comparison code in our repository at `docs/examples/crawl4ai_vs_firecrawl.py`.
<details>
<summary>📈 <strong>Development TODOs</strong></summary>
## Documentation 📚
For detailed documentation, including installation instructions, advanced features, and API reference, visit our [Documentation Website](https://crawl4ai.com/mkdocs/).
## Crawl4AI Roadmap 🗺️
For detailed information on our development plans and upcoming features, check out our [Roadmap](https://github.com/unclecode/crawl4ai/blob/main/ROADMAP.md).
### Advanced Crawling Systems 🔧
- [x] 0. Graph Crawler: Smart website traversal using graph search algorithms for comprehensive nested page extraction
- [ ] 1. Question-Based Crawler: Natural language driven web discovery and content extraction
- [ ] 2. Knowledge-Optimal Crawler: Smart crawling that maximizes knowledge while minimizing data extraction
- [ ] 3. Agentic Crawler: Autonomous system for complex multi-step crawling operations
### Specialized Features 🛠️
- [ ] 4. Automated Schema Generator: Convert natural language to extraction schemas
- [ ] 5. Domain-Specific Scrapers: Pre-configured extractors for common platforms (academic, e-commerce)
- [ ] 6. Web Embedding Index: Semantic search infrastructure for crawled content
### Development Tools 🔨
- [ ] 7. Interactive Playground: Web UI for testing, comparing strategies with AI assistance
- [ ] 8. Performance Monitor: Real-time insights into crawler operations
- [ ] 9. Cloud Integration: One-click deployment solutions across cloud providers
### Community & Growth 🌱
- [ ] 10. Sponsorship Program: Structured support system with tiered benefits
- [ ] 11. Educational Content: "How to Crawl" video series and interactive tutorials
## Contributing 🤝
</details>
## 🤝 Contributing
We welcome contributions from the open-source community. Check out our [contribution guidelines](https://github.com/unclecode/crawl4ai/blob/main/CONTRIBUTING.md) for more information.
## License 📄
## 📄 License
Crawl4AI is released under the [Apache 2.0 License](https://github.com/unclecode/crawl4ai/blob/main/LICENSE).
## Contact 📧
## 📧 Contact
For questions, suggestions, or feedback, feel free to reach out:
@@ -474,32 +684,32 @@ For questions, suggestions, or feedback, feel free to reach out:
Happy Crawling! 🕸️🚀
## 🗾 Mission
# Mission
Our mission is to unlock the value of personal and enterprise data by transforming digital footprints into structured, tradeable assets. Crawl4AI empowers individuals and organizations with open-source tools to extract and structure data, fostering a shared data economy.
Our mission is to unlock the untapped potential of personal and enterprise data in the digital age. In today's world, individuals and organizations generate vast amounts of valuable digital footprints, yet this data remains largely uncapitalized as a true asset.
We envision a future where AI is powered by real human knowledge, ensuring data creators directly benefit from their contributions. By democratizing data and enabling ethical sharing, we are laying the foundation for authentic AI advancement.
Our open-source solution empowers developers and innovators to build tools for data extraction and structuring, laying the foundation for a new era of data ownership. By transforming personal and enterprise data into structured, tradeable assets, we're creating opportunities for individuals to capitalize on their digital footprints and for organizations to unlock the value of their collective knowledge.
<details>
<summary>🔑 <strong>Key Opportunities</strong></summary>
- **Data Capitalization**: Transform digital footprints into measurable, valuable assets.
- **Authentic AI Data**: Provide AI systems with real human insights.
- **Shared Economy**: Create a fair data marketplace that benefits data creators.
This democratization of data represents the first step toward a shared data economy, where willing participation in data sharing drives AI advancement while ensuring the benefits flow back to data creators. Through this approach, we're building a future where AI development is powered by authentic human knowledge rather than synthetic alternatives.
</details>
![Mission Diagram](./docs/assets/pitch-dark.svg)
<details>
<summary>🚀 <strong>Development Pathway</strong></summary>
For a detailed exploration of our vision, opportunities, and pathway forward, please see our [full mission statement](./MISSION.md).
1. **Open-Source Tools**: Community-driven platforms for transparent data extraction.
2. **Digital Asset Structuring**: Tools to organize and value digital knowledge.
3. **Ethical Data Marketplace**: A secure, fair platform for exchanging structured data.
## Key Opportunities
For more details, see our [full mission statement](./MISSION.md).
</details>
- **Data Capitalization**: Transform digital footprints into valuable assets that can appear on personal and enterprise balance sheets
- **Authentic Data**: Unlock the vast reservoir of real human insights and knowledge for AI advancement
- **Shared Economy**: Create new value streams where data creators directly benefit from their contributions
## Development Pathway
1. **Open-Source Foundation**: Building transparent, community-driven data extraction tools
2. **Data Capitalization Platform**: Creating tools to structure and value digital assets
3. **Shared Data Marketplace**: Establishing an economic platform for ethical data exchange
For a detailed exploration of our vision, challenges, and solutions, please see our [full mission statement](./MISSION.md).
## Star History


@@ -1,244 +0,0 @@
# Crawl4AI v0.2.77 🕷️🤖
[![GitHub Stars](https://img.shields.io/github/stars/unclecode/crawl4ai?style=social)](https://github.com/unclecode/crawl4ai/stargazers)
[![GitHub Forks](https://img.shields.io/github/forks/unclecode/crawl4ai?style=social)](https://github.com/unclecode/crawl4ai/network/members)
[![GitHub Issues](https://img.shields.io/github/issues/unclecode/crawl4ai)](https://github.com/unclecode/crawl4ai/issues)
[![GitHub Pull Requests](https://img.shields.io/github/issues-pr/unclecode/crawl4ai)](https://github.com/unclecode/crawl4ai/pulls)
[![License](https://img.shields.io/github/license/unclecode/crawl4ai)](https://github.com/unclecode/crawl4ai/blob/main/LICENSE)
Crawl4AI simplifies web crawling and data extraction, making it accessible for large language models (LLMs) and AI applications. 🆓🌐
#### [v0.2.77] - 2024-08-02
Major improvements in functionality, performance, and cross-platform compatibility! 🚀
- 🐳 **Docker enhancements**:
- Significantly improved Dockerfile for easy installation on Linux, Mac, and Windows.
- 🌐 **Official Docker Hub image**:
- Launched our first official image on Docker Hub for streamlined deployment (unclecode/crawl4ai).
- 🔧 **Selenium upgrade**:
- Removed dependency on ChromeDriver, now using Selenium's built-in capabilities for better compatibility.
- 🖼️ **Image description**:
- Implemented ability to generate textual descriptions for extracted images from web pages.
- **Performance boost**:
- Various improvements to enhance overall speed and performance.
## Try it Now!
✨ Play around with this [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1sJPAmeLj5PMrg2VgOwMJ2ubGIcK0cJeX?usp=sharing)
✨ visit our [Documentation Website](https://crawl4ai.com/mkdocs/)
✨ Check [Demo](https://crawl4ai.com/mkdocs/demo)
## Features ✨
- 🆓 Completely free and open-source
- 🤖 LLM-friendly output formats (JSON, cleaned HTML, markdown)
- 🌍 Supports crawling multiple URLs simultaneously
- 🎨 Extracts and returns all media tags (Images, Audio, and Video)
- 🔗 Extracts all external and internal links
- 📚 Extracts metadata from the page
- 🔄 Custom hooks for authentication, headers, and page modifications before crawling
- 🕵️ User-agent customization
- 🖼️ Takes screenshots of the page
- 📜 Executes multiple custom JavaScripts before crawling
- 📚 Various chunking strategies: topic-based, regex, sentence, and more
- 🧠 Advanced extraction strategies: cosine clustering, LLM, and more
- 🎯 CSS selector support
- 📝 Passes instructions/keywords to refine extraction
# Crawl4AI
## 🌟 Shoutout to Contributors of v0.2.77!
A big thank you to the amazing contributors who've made this release possible:
- [@aravindkarnam](https://github.com/aravindkarnam) for the new image description feature
- [@FractalMind](https://github.com/FractalMind) for our official Docker Hub image
- [@ketonkss4](https://github.com/ketonkss4) for helping streamline our Selenium setup
Your contributions are driving Crawl4AI forward! 🚀
## Cool Examples 🚀
### Quick Start
```python
from crawl4ai import WebCrawler
# Create an instance of WebCrawler
crawler = WebCrawler()
# Warm up the crawler (load necessary models)
crawler.warmup()
# Run the crawler on a URL
result = crawler.run(url="https://www.nbcnews.com/business")
# Print the extracted content
print(result.markdown)
```
## How to install 🛠
### Using pip 🐍
```bash
virtualenv venv
source venv/bin/activate
pip install "crawl4ai @ git+https://github.com/unclecode/crawl4ai.git"
```
### Using Docker 🐳
```bash
# For Mac users (M1/M2)
# docker build --platform linux/amd64 -t crawl4ai .
docker build -t crawl4ai .
docker run -d -p 8000:80 crawl4ai
```
### Using Docker Hub 🐳
```bash
docker pull unclecode/crawl4ai:latest
docker run -d -p 8000:80 unclecode/crawl4ai:latest
```
## Speed-First Design 🚀
Perhaps the most important design principle for this library is speed. We need to ensure it can handle many links and resources in parallel as quickly as possible. By combining this speed with fast LLMs like Groq, the results will be truly amazing.
```python
import time
from crawl4ai.web_crawler import WebCrawler
crawler = WebCrawler()
crawler.warmup()
start = time.time()
url = r"https://www.nbcnews.com/business"
result = crawler.run( url, word_count_threshold=10, bypass_cache=True)
end = time.time()
print(f"Time taken: {end - start}")
```
Let's take a look at the measured time for the above code snippet:
```bash
[LOG] 🚀 Crawling done, success: True, time taken: 1.3623387813568115 seconds
[LOG] 🚀 Content extracted, success: True, time taken: 0.05715131759643555 seconds
[LOG] 🚀 Extraction, time taken: 0.05750393867492676 seconds.
Time taken: 1.439958095550537
```
Fetching the content from the page took 1.3623 seconds, and extracting the content took 0.0575 seconds. 🚀
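As a rough check on the figures above, the short sketch below computes how much of the total run each phase accounts for. The three numbers are copied from the log output; the variable names are illustrative only and are not part of the library.

```python
# Timings copied from the log output above (seconds).
fetch_time = 1.3623387813568115
extract_time = 0.05750393867492676
total_time = 1.439958095550537

# Share of the total wall-clock time spent in each phase.
fetch_share = fetch_time / total_time
extract_share = extract_time / total_time

# Whatever is left over is setup and bookkeeping outside the two logged phases.
overhead = total_time - fetch_time - extract_time

print(f"fetch: {fetch_share:.1%}, extraction: {extract_share:.1%}, other: {overhead:.4f}s")
```

Under these numbers the network fetch dominates the run, so extraction speed matters most when many pages are fetched in parallel.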
### Extract Structured Data from Web Pages 📊
Crawl all OpenAI models and their fees from the official page.
```python
import os
from crawl4ai import WebCrawler
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from pydantic import BaseModel, Field
class OpenAIModelFee(BaseModel):
model_name: str = Field(..., description="Name of the OpenAI model.")
input_fee: str = Field(..., description="Fee for input token for the OpenAI model.")
output_fee: str = Field(..., description="Fee for output token for the OpenAI model.")
url = 'https://openai.com/api/pricing/'
crawler = WebCrawler()
crawler.warmup()
result = crawler.run(
url=url,
word_count_threshold=1,
extraction_strategy= LLMExtractionStrategy(
provider= "openai/gpt-4o", api_token = os.getenv('OPENAI_API_KEY'),
schema=OpenAIModelFee.schema(),
extraction_type="schema",
instruction="""From the crawled content, extract all mentioned model names along with their fees for input and output tokens.
Do not miss any models in the entire content. One extracted model JSON format should look like this:
{"model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", "output_fee": "US$30.00 / 1M tokens"}."""
),
bypass_cache=True,
)
print(result.extracted_content)
```
### Execute JS, Filter Data with CSS Selector, and Clustering
```python
from crawl4ai import WebCrawler
from crawl4ai.extraction_strategy import CosineStrategy
js_code = ["const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More')); loadMoreButton && loadMoreButton.click();"]
crawler = WebCrawler()
crawler.warmup()
result = crawler.run(
url="https://www.nbcnews.com/business",
js=js_code,
css_selector="p",
extraction_strategy=CosineStrategy(semantic_filter="technology")
)
print(result.extracted_content)
```
### Extract Structured Data from Web Pages With Proxy and BaseUrl
```python
from crawl4ai import WebCrawler
from crawl4ai.extraction_strategy import LLMExtractionStrategy
def create_crawler():
crawler = WebCrawler(verbose=True, proxy="http://127.0.0.1:7890")
crawler.warmup()
return crawler
crawler = create_crawler()
result = crawler.run(
url="https://www.nbcnews.com/business",
extraction_strategy=LLMExtractionStrategy(
provider="openai/gpt-4o",
api_token="sk-",
base_url="https://api.openai.com/v1"
)
)
print(result.markdown)
```
## Documentation 📚
For detailed documentation, including installation instructions, advanced features, and API reference, visit our [Documentation Website](https://crawl4ai.com/mkdocs/).
## Contributing 🤝
We welcome contributions from the open-source community. Check out our [contribution guidelines](https://github.com/unclecode/crawl4ai/blob/main/CONTRIBUTING.md) for more information.
## License 📄
Crawl4AI is released under the [Apache 2.0 License](https://github.com/unclecode/crawl4ai/blob/main/LICENSE).
## Contact 📧
For questions, suggestions, or feedback, feel free to reach out:
- GitHub: [unclecode](https://github.com/unclecode)
- Twitter: [@unclecode](https://twitter.com/unclecode)
- Website: [crawl4ai.com](https://crawl4ai.com)
Happy Crawling! 🕸️🚀
## Star History
[![Star History Chart](https://api.star-history.com/svg?repos=unclecode/crawl4ai&type=Date)](https://star-history.com/#unclecode/crawl4ai&Date)

@@ -1,13 +1,29 @@
# __init__.py
from .async_webcrawler import AsyncWebCrawler
from .async_webcrawler import AsyncWebCrawler, CacheMode
from .async_configs import BrowserConfig, CrawlerRunConfig
from .extraction_strategy import ExtractionStrategy, LLMExtractionStrategy, CosineStrategy, JsonCssExtractionStrategy
from .chunking_strategy import ChunkingStrategy, RegexChunking
from .markdown_generation_strategy import DefaultMarkdownGenerator
from .content_filter_strategy import PruningContentFilter, BM25ContentFilter
from .models import CrawlResult
from ._version import __version__
# __version__ = "0.3.73"
from .__version__ import __version__
__all__ = [
"AsyncWebCrawler",
"CrawlResult",
"CacheMode",
'BrowserConfig',
'CrawlerRunConfig',
'ExtractionStrategy',
'LLMExtractionStrategy',
'CosineStrategy',
'JsonCssExtractionStrategy',
'ChunkingStrategy',
'RegexChunking',
'DefaultMarkdownGenerator',
'PruningContentFilter',
'BM25ContentFilter',
]
def is_sync_version_installed():
@@ -26,5 +42,5 @@ if is_sync_version_installed():
print("Warning: Failed to import WebCrawler even though selenium is installed. This might be due to other missing dependencies.")
else:
WebCrawler = None
import warnings
print("Warning: Synchronous WebCrawler is not available. Install crawl4ai[sync] for synchronous support. However, please note that the synchronous version will be deprecated soon.")
# import warnings
# print("Warning: Synchronous WebCrawler is not available. Install crawl4ai[sync] for synchronous support. However, please note that the synchronous version will be deprecated soon.")
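The body of `is_sync_version_installed()` is not shown in this hunk, so the following is only a minimal sketch of the general optional-dependency pattern it appears to follow: detect whether a package is importable and fall back to `None` instead of raising. The helper names `is_package_installed` and `load_optional`, and the package name used for the negative case, are hypothetical.

```python
import importlib.util


def is_package_installed(name: str) -> bool:
    """Return True if `name` can be imported, without actually importing it."""
    return importlib.util.find_spec(name) is not None


def load_optional(name: str):
    """Import `name` if it is available; otherwise return None instead of raising."""
    if not is_package_installed(name):
        return None
    return importlib.import_module(name)


# "json" ships with the standard library, so it is always found.
json_module = load_optional("json")
# Hypothetical package name that is assumed not to be installed.
missing_module = load_optional("surely_not_installed_pkg_xyz")

print(json_module is not None, missing_module is None)
```

Callers can then test the result against `None` before using the optional feature, which keeps the package importable when the extra dependency is absent.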

@@ -1,2 +1,2 @@
# crawl4ai/_version.py
__version__ = "0.3.73"
__version__ = "0.4.22"

406
crawl4ai/async_configs.py Normal file
@@ -0,0 +1,406 @@
from .config import (
MIN_WORD_THRESHOLD,
IMAGE_DESCRIPTION_MIN_WORD_THRESHOLD,
SCREENSHOT_HEIGHT_TRESHOLD,
PAGE_TIMEOUT
)
from .user_agent_generator import UserAgentGenerator
from .extraction_strategy import ExtractionStrategy
from .chunking_strategy import ChunkingStrategy
from .markdown_generation_strategy import MarkdownGenerationStrategy
class BrowserConfig:
"""
Configuration class for setting up a browser instance and its context in AsyncPlaywrightCrawlerStrategy.
This class centralizes all parameters that affect browser and context creation. Instead of passing
scattered keyword arguments, users can instantiate and modify this configuration object. The crawler
code will then reference these settings to initialize the browser in a consistent, documented manner.
Attributes:
browser_type (str): The type of browser to launch. Supported values: "chromium", "firefox", "webkit".
Default: "chromium".
headless (bool): Whether to run the browser in headless mode (no visible GUI).
Default: True.
use_managed_browser (bool): Launch the browser using a managed approach (e.g., via CDP), allowing
advanced manipulation. Default: False.
use_persistent_context (bool): Use a persistent browser context (like a persistent profile).
Automatically sets use_managed_browser=True. Default: False.
user_data_dir (str or None): Path to a user data directory for persistent sessions. If None, a
temporary directory may be used. Default: None.
chrome_channel (str): The Chrome channel to launch (e.g., "chrome", "msedge"). Only applies if browser_type
is "chromium". Default: "chrome".
proxy (str or None): Proxy server URL (e.g., "http://username:password@proxy:port"). If None, no proxy is used.
Default: None.
proxy_config (dict or None): Detailed proxy configuration, e.g. {"server": "...", "username": "..."}.
If None, no additional proxy config. Default: None.
viewport_width (int): Default viewport width for pages. Default: 1920.
viewport_height (int): Default viewport height for pages. Default: 1080.
verbose (bool): Enable verbose logging.
Default: True.
accept_downloads (bool): Whether to allow file downloads. If True, requires a downloads_path.
Default: False.
downloads_path (str or None): Directory to store downloaded files. If None and accept_downloads is True,
a default path will be created. Default: None.
storage_state (str or dict or None): Path or object describing storage state (cookies, localStorage).
Default: None.
ignore_https_errors (bool): Ignore HTTPS certificate errors. Default: True.
java_script_enabled (bool): Enable JavaScript execution in pages. Default: True.
cookies (list): List of cookies to add to the browser context. Each cookie is a dict with fields like
{"name": "...", "value": "...", "url": "..."}.
Default: [].
headers (dict): Extra HTTP headers to apply to all requests in this context.
Default: {}.
user_agent (str): Custom User-Agent string to use. Default: "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
"AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36".
user_agent_mode (str or None): Mode for generating the user agent (e.g., "random"). If None, use the provided
user_agent as-is. Default: None.
user_agent_generator_config (dict or None): Configuration for user agent generation if user_agent_mode is set.
Default: None.
text_only (bool): If True, disables images and other rich content for potentially faster load times.
Default: False.
light_mode (bool): Disables certain background features for performance gains. Default: False.
extra_args (list): Additional command-line arguments passed to the browser.
Default: [].
"""
def __init__(
self,
browser_type: str = "chromium",
headless: bool = True,
use_managed_browser: bool = False,
use_persistent_context: bool = False,
user_data_dir: str = None,
chrome_channel: str = "chrome",
proxy: str = None,
proxy_config: dict = None,
viewport_width: int = 1920,
viewport_height: int = 1080,
accept_downloads: bool = False,
downloads_path: str = None,
storage_state=None,
ignore_https_errors: bool = True,
java_script_enabled: bool = True,
sleep_on_close: bool = False,
verbose: bool = True,
cookies: list = None,
headers: dict = None,
user_agent: str = (
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:109.0) AppleWebKit/537.36 "
"(KHTML, like Gecko) Chrome/116.0.5845.187 Safari/604.1 Edg/117.0.2045.47"
),
user_agent_mode: str = None,
user_agent_generator_config: dict = None,
text_only: bool = False,
light_mode: bool = False,
extra_args: list = None,
):
self.browser_type = browser_type
self.headless = headless
self.use_managed_browser = use_managed_browser
self.use_persistent_context = use_persistent_context
self.user_data_dir = user_data_dir
if self.browser_type == "chromium":
self.chrome_channel = "chrome"
elif self.browser_type == "firefox":
self.chrome_channel = "firefox"
elif self.browser_type == "webkit":
self.chrome_channel = "webkit"
else:
self.chrome_channel = chrome_channel or "chrome"
self.proxy = proxy
self.proxy_config = proxy_config
self.viewport_width = viewport_width
self.viewport_height = viewport_height
self.accept_downloads = accept_downloads
self.downloads_path = downloads_path
self.storage_state = storage_state
self.ignore_https_errors = ignore_https_errors
self.java_script_enabled = java_script_enabled
self.cookies = cookies if cookies is not None else []
self.headers = headers if headers is not None else {}
self.user_agent = user_agent
self.user_agent_mode = user_agent_mode
self.user_agent_generator_config = user_agent_generator_config
self.text_only = text_only
self.light_mode = light_mode
self.extra_args = extra_args if extra_args is not None else []
self.sleep_on_close = sleep_on_close
self.verbose = verbose
user_agent_generator = UserAgentGenerator()
# Generate a user agent only when random mode is requested; otherwise keep the provided one
if self.user_agent_mode == "random":
self.user_agent = user_agent_generator.generate(
**(self.user_agent_generator_config or {})
)
self.browser_hint = user_agent_generator.generate_client_hints(self.user_agent)
self.headers.setdefault("sec-ch-ua", self.browser_hint)
# If persistent context is requested, ensure managed browser is enabled
if self.use_persistent_context:
self.use_managed_browser = True
@staticmethod
def from_kwargs(kwargs: dict) -> "BrowserConfig":
return BrowserConfig(
browser_type=kwargs.get("browser_type", "chromium"),
headless=kwargs.get("headless", True),
use_managed_browser=kwargs.get("use_managed_browser", False),
use_persistent_context=kwargs.get("use_persistent_context", False),
user_data_dir=kwargs.get("user_data_dir"),
chrome_channel=kwargs.get("chrome_channel", "chrome"),
proxy=kwargs.get("proxy"),
proxy_config=kwargs.get("proxy_config"),
viewport_width=kwargs.get("viewport_width", 1920),
viewport_height=kwargs.get("viewport_height", 1080),
accept_downloads=kwargs.get("accept_downloads", False),
downloads_path=kwargs.get("downloads_path"),
storage_state=kwargs.get("storage_state"),
ignore_https_errors=kwargs.get("ignore_https_errors", True),
java_script_enabled=kwargs.get("java_script_enabled", True),
cookies=kwargs.get("cookies", []),
headers=kwargs.get("headers", {}),
user_agent=kwargs.get("user_agent",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
"AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36"
),
user_agent_mode=kwargs.get("user_agent_mode"),
user_agent_generator_config=kwargs.get("user_agent_generator_config"),
text_only=kwargs.get("text_only", False),
light_mode=kwargs.get("light_mode", False),
extra_args=kwargs.get("extra_args", []),
sleep_on_close=kwargs.get("sleep_on_close", False),
verbose=kwargs.get("verbose", True)
)
class CrawlerRunConfig:
"""
Configuration class for controlling how the crawler runs each crawl operation.
This includes parameters for content extraction, page manipulation, waiting conditions,
caching, and other runtime behaviors.
This centralizes parameters that were previously scattered as kwargs to `arun()` and related methods.
By using this class, you have a single place to understand and adjust the crawling options.
Attributes:
word_count_threshold (int): Minimum word count threshold before processing content.
Default: MIN_WORD_THRESHOLD (typically 200).
extraction_strategy (ExtractionStrategy or None): Strategy to extract structured data from crawled pages.
Default: None (NoExtractionStrategy is used if None).
chunking_strategy (ChunkingStrategy): Strategy to chunk content before extraction.
Default: RegexChunking().
content_filter (RelevantContentFilter or None): Optional filter to prune irrelevant content.
Default: None.
cache_mode (CacheMode or None): Defines how caching is handled.
If None, defaults to CacheMode.ENABLED internally.
Default: None.
session_id (str or None): Optional session ID to persist the browser context and the created
page instance. If the ID already exists, the crawler does not
create a new page and uses the current page to preserve the state;
if not, it creates a new page and context then stores it in
memory with the given session ID.
bypass_cache (bool): Legacy parameter, if True acts like CacheMode.BYPASS.
Default: False.
disable_cache (bool): Legacy parameter, if True acts like CacheMode.DISABLED.
Default: False.
no_cache_read (bool): Legacy parameter, if True acts like CacheMode.WRITE_ONLY.
Default: False.
no_cache_write (bool): Legacy parameter, if True acts like CacheMode.READ_ONLY.
Default: False.
css_selector (str or None): CSS selector to extract a specific portion of the page.
Default: None.
screenshot (bool): Whether to take a screenshot after crawling.
Default: False.
pdf (bool): Whether to generate a PDF of the page.
Default: False.
verbose (bool): Enable verbose logging.
Default: True.
only_text (bool): If True, attempt to extract text-only content where applicable.
Default: False.
image_description_min_word_threshold (int): Minimum words for image description extraction.
Default: IMAGE_DESCRIPTION_MIN_WORD_THRESHOLD (e.g., 50).
prettiify (bool): If True, apply `fast_format_html` to produce prettified HTML output.
Default: False.
js_code (str or list of str or None): JavaScript code/snippets to run on the page.
Default: None.
wait_for (str or None): A CSS selector or JS condition to wait for before extracting content.
Default: None.
js_only (bool): If True, indicates subsequent calls are JS-driven updates, not full page loads.
Default: False.
wait_until (str): The condition to wait for when navigating, e.g. "domcontentloaded".
Default: "domcontentloaded".
page_timeout (int): Timeout in ms for page operations like navigation.
Default: 60000 (60 seconds).
ignore_body_visibility (bool): If True, ignore whether the body is visible before proceeding.
Default: True.
wait_for_images (bool): If True, wait for images to load before extracting content.
Default: True.
adjust_viewport_to_content (bool): If True, adjust viewport according to the page content dimensions.
Default: False.
scan_full_page (bool): If True, scroll through the entire page to load all content.
Default: False.
scroll_delay (float): Delay in seconds between scroll steps if scan_full_page is True.
Default: 0.2.
process_iframes (bool): If True, attempts to process and inline iframe content.
Default: False.
remove_overlay_elements (bool): If True, remove overlays/popups before extracting HTML.
Default: False.
delay_before_return_html (float): Delay in seconds before retrieving final HTML.
Default: 0.1.
log_console (bool): If True, log console messages from the page.
Default: False.
simulate_user (bool): If True, simulate user interactions (mouse moves, clicks) for anti-bot measures.
Default: False.
override_navigator (bool): If True, overrides navigator properties for more human-like behavior.
Default: False.
magic (bool): If True, attempts automatic handling of overlays/popups.
Default: False.
screenshot_wait_for (float or None): Additional wait time before taking a screenshot.
Default: None.
screenshot_height_threshold (int): Threshold for page height to decide screenshot strategy.
Default: SCREENSHOT_HEIGHT_TRESHOLD (from config, e.g. 20000).
mean_delay (float): Mean base delay between requests when calling arun_many.
Default: 0.1.
max_range (float): Max random additional delay range for requests in arun_many.
Default: 0.3.
# session_id and semaphore_count might be set at runtime, not needed as defaults here.
"""
def __init__(
self,
word_count_threshold: int = MIN_WORD_THRESHOLD ,
extraction_strategy : ExtractionStrategy=None, # Will default to NoExtractionStrategy if None
chunking_strategy : ChunkingStrategy= None, # Will default to RegexChunking if None
markdown_generator : MarkdownGenerationStrategy = None,
content_filter=None,
cache_mode=None,
session_id: str = None,
bypass_cache: bool = False,
disable_cache: bool = False,
no_cache_read: bool = False,
no_cache_write: bool = False,
css_selector: str = None,
screenshot: bool = False,
pdf: bool = False,
verbose: bool = True,
only_text: bool = False,
image_description_min_word_threshold: int = IMAGE_DESCRIPTION_MIN_WORD_THRESHOLD,
prettiify: bool = False,
js_code=None,
wait_for: str = None,
js_only: bool = False,
wait_until: str = "domcontentloaded",
page_timeout: int = PAGE_TIMEOUT,
ignore_body_visibility: bool = True,
wait_for_images: bool = True,
adjust_viewport_to_content: bool = False,
scan_full_page: bool = False,
scroll_delay: float = 0.2,
process_iframes: bool = False,
remove_overlay_elements: bool = False,
delay_before_return_html: float = 0.1,
log_console: bool = False,
simulate_user: bool = False,
override_navigator: bool = False,
magic: bool = False,
screenshot_wait_for: float = None,
screenshot_height_threshold: int = SCREENSHOT_HEIGHT_TRESHOLD,
mean_delay: float = 0.1,
max_range: float = 0.3,
semaphore_count: int = 5,
):
self.word_count_threshold = word_count_threshold
self.extraction_strategy = extraction_strategy
self.chunking_strategy = chunking_strategy
self.markdown_generator = markdown_generator
self.content_filter = content_filter
self.cache_mode = cache_mode
self.session_id = session_id
self.bypass_cache = bypass_cache
self.disable_cache = disable_cache
self.no_cache_read = no_cache_read
self.no_cache_write = no_cache_write
self.css_selector = css_selector
self.screenshot = screenshot
self.pdf = pdf
self.verbose = verbose
self.only_text = only_text
self.image_description_min_word_threshold = image_description_min_word_threshold
self.prettiify = prettiify
self.js_code = js_code
self.wait_for = wait_for
self.js_only = js_only
self.wait_until = wait_until
self.page_timeout = page_timeout
self.ignore_body_visibility = ignore_body_visibility
self.wait_for_images = wait_for_images
self.adjust_viewport_to_content = adjust_viewport_to_content
self.scan_full_page = scan_full_page
self.scroll_delay = scroll_delay
self.process_iframes = process_iframes
self.remove_overlay_elements = remove_overlay_elements
self.delay_before_return_html = delay_before_return_html
self.log_console = log_console
self.simulate_user = simulate_user
self.override_navigator = override_navigator
self.magic = magic
self.screenshot_wait_for = screenshot_wait_for
self.screenshot_height_threshold = screenshot_height_threshold
self.mean_delay = mean_delay
self.max_range = max_range
self.semaphore_count = semaphore_count
# Validate type of extraction strategy and chunking strategy if they are provided
if self.extraction_strategy is not None and not isinstance(self.extraction_strategy, ExtractionStrategy):
raise ValueError("extraction_strategy must be an instance of ExtractionStrategy")
if self.chunking_strategy is not None and not isinstance(self.chunking_strategy, ChunkingStrategy):
raise ValueError("chunking_strategy must be an instance of ChunkingStrategy")
# Set default chunking strategy if None
if self.chunking_strategy is None:
from .chunking_strategy import RegexChunking
self.chunking_strategy = RegexChunking()
@staticmethod
def from_kwargs(kwargs: dict) -> "CrawlerRunConfig":
return CrawlerRunConfig(
word_count_threshold=kwargs.get("word_count_threshold", 200),
extraction_strategy=kwargs.get("extraction_strategy"),
chunking_strategy=kwargs.get("chunking_strategy"),
markdown_generator=kwargs.get("markdown_generator"),
content_filter=kwargs.get("content_filter"),
cache_mode=kwargs.get("cache_mode"),
session_id=kwargs.get("session_id"),
bypass_cache=kwargs.get("bypass_cache", False),
disable_cache=kwargs.get("disable_cache", False),
no_cache_read=kwargs.get("no_cache_read", False),
no_cache_write=kwargs.get("no_cache_write", False),
css_selector=kwargs.get("css_selector"),
screenshot=kwargs.get("screenshot", False),
pdf=kwargs.get("pdf", False),
verbose=kwargs.get("verbose", True),
only_text=kwargs.get("only_text", False),
image_description_min_word_threshold=kwargs.get("image_description_min_word_threshold", IMAGE_DESCRIPTION_MIN_WORD_THRESHOLD),
prettiify=kwargs.get("prettiify", False),
js_code=kwargs.get("js_code"), # If not provided here, will default inside constructor
wait_for=kwargs.get("wait_for"),
js_only=kwargs.get("js_only", False),
wait_until=kwargs.get("wait_until", "domcontentloaded"),
page_timeout=kwargs.get("page_timeout", 60000),
ignore_body_visibility=kwargs.get("ignore_body_visibility", True),
wait_for_images=kwargs.get("wait_for_images", True),
adjust_viewport_to_content=kwargs.get("adjust_viewport_to_content", False),
scan_full_page=kwargs.get("scan_full_page", False),
scroll_delay=kwargs.get("scroll_delay", 0.2),
process_iframes=kwargs.get("process_iframes", False),
remove_overlay_elements=kwargs.get("remove_overlay_elements", False),
delay_before_return_html=kwargs.get("delay_before_return_html", 0.1),
log_console=kwargs.get("log_console", False),
simulate_user=kwargs.get("simulate_user", False),
override_navigator=kwargs.get("override_navigator", False),
magic=kwargs.get("magic", False),
screenshot_wait_for=kwargs.get("screenshot_wait_for"),
screenshot_height_threshold=kwargs.get("screenshot_height_threshold", 20000),
mean_delay=kwargs.get("mean_delay", 0.1),
max_range=kwargs.get("max_range", 0.3),
semaphore_count=kwargs.get("semaphore_count", 5)
)
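The two classes in this file share one pattern: a constructor with documented defaults, plus a `from_kwargs` factory that accepts legacy keyword dictionaries. The sketch below illustrates that pattern on a deliberately tiny stand-in. `MiniRunConfig` and its four fields are hypothetical and much smaller than `CrawlerRunConfig`; it also forwards only the keys that are present, which is a different design from the explicit `kwargs.get(...)` calls used in the diff.

```python
from typing import Optional


class MiniRunConfig:
    """Hypothetical, trimmed-down stand-in for a run configuration object."""

    def __init__(
        self,
        css_selector: Optional[str] = None,
        screenshot: bool = False,
        page_timeout: int = 60000,
        js_code: Optional[list] = None,
    ):
        self.css_selector = css_selector
        self.screenshot = screenshot
        self.page_timeout = page_timeout
        # Default to None and build a fresh list per instance,
        # so instances never share one mutable default.
        self.js_code = js_code if js_code is not None else []

    @staticmethod
    def from_kwargs(kwargs: dict) -> "MiniRunConfig":
        # Unknown keys are ignored; missing keys fall back to the constructor defaults.
        known = ("css_selector", "screenshot", "page_timeout", "js_code")
        return MiniRunConfig(**{k: kwargs[k] for k in known if k in kwargs})


# A legacy-style kwargs dict, including a key the config does not know about.
legacy_kwargs = {"css_selector": "article", "screenshot": True, "unused": 1}
config = MiniRunConfig.from_kwargs(legacy_kwargs)

print(config.css_selector, config.screenshot, config.page_timeout)
```

Forwarding only the known keys keeps each default in a single place (the constructor), so the factory cannot drift out of sync when a parameter is added.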

File diff suppressed because it is too large

@@ -1,32 +1,94 @@
import os
import os, sys
from pathlib import Path
import aiosqlite
import asyncio
from typing import Optional, Tuple, Dict
from contextlib import asynccontextmanager
import logging
import json # Added for serialization/deserialization
from .utils import ensure_content_dirs, generate_content_hash
from .models import CrawlResult
import xxhash
import aiofiles
from .config import NEED_MIGRATION
from .version_manager import VersionManager
from .async_logger import AsyncLogger
from .utils import get_error_context, create_box_message
# Set up logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
DB_PATH = os.path.join(os.getenv("CRAWL4_AI_BASE_DIRECTORY", Path.home()), ".crawl4ai")
base_directory = DB_PATH = os.path.join(os.getenv("CRAWL4_AI_BASE_DIRECTORY", Path.home()), ".crawl4ai")
os.makedirs(DB_PATH, exist_ok=True)
DB_PATH = os.path.join(DB_PATH, "crawl4ai.db")
DB_PATH = os.path.join(base_directory, "crawl4ai.db")
class AsyncDatabaseManager:
def __init__(self, pool_size: int = 10, max_retries: int = 3):
self.db_path = DB_PATH
self.content_paths = ensure_content_dirs(os.path.dirname(DB_PATH))
self.pool_size = pool_size
self.max_retries = max_retries
self.connection_pool: Dict[int, aiosqlite.Connection] = {}
self.pool_lock = asyncio.Lock()
self.init_lock = asyncio.Lock()
self.connection_semaphore = asyncio.Semaphore(pool_size)
self._initialized = False
self.version_manager = VersionManager()
self.logger = AsyncLogger(
log_file=os.path.join(base_directory, ".crawl4ai", "crawler_db.log"),
verbose=False,
tag_width=10
)
async def initialize(self):
"""Initialize the database and connection pool"""
await self.ainit_db()
try:
self.logger.info("Initializing database", tag="INIT")
# Ensure the database file exists
os.makedirs(os.path.dirname(self.db_path), exist_ok=True)
# Check if version update is needed
needs_update = self.version_manager.needs_update()
# Always ensure base table exists
await self.ainit_db()
# Verify the table exists
async with aiosqlite.connect(self.db_path, timeout=30.0) as db:
async with db.execute(
"SELECT name FROM sqlite_master WHERE type='table' AND name='crawled_data'"
) as cursor:
result = await cursor.fetchone()
if not result:
raise Exception("crawled_data table was not created")
# If version changed or fresh install, run updates
if needs_update:
self.logger.info("New version detected, running updates", tag="INIT")
await self.update_db_schema()
from .migrations import run_migration # Import here to avoid circular imports
await run_migration()
self.version_manager.update_version() # Update stored version after successful migration
self.logger.success("Version update completed successfully", tag="COMPLETE")
else:
self.logger.success("Database initialization completed successfully", tag="COMPLETE")
except Exception as e:
self.logger.error(
message="Database initialization error: {error}",
tag="ERROR",
params={"error": str(e)}
)
self.logger.info(
message="Database will be initialized on first use",
tag="INIT"
)
raise
async def cleanup(self):
"""Cleanup connections when shutting down"""
async with self.pool_lock:
@@ -36,30 +98,93 @@ class AsyncDatabaseManager:
@asynccontextmanager
async def get_connection(self):
"""Connection pool manager"""
async with self.connection_semaphore:
task_id = id(asyncio.current_task())
try:
async with self.pool_lock:
if task_id not in self.connection_pool:
"""Connection pool manager with enhanced error handling"""
if not self._initialized:
async with self.init_lock:
if not self._initialized:
try:
await self.initialize()
self._initialized = True
except Exception as e:
import sys
error_context = get_error_context(sys.exc_info())
self.logger.error(
message="Database initialization failed:\n{error}\n\nContext:\n{context}\n\nTraceback:\n{traceback}",
tag="ERROR",
force_verbose=True,
params={
"error": str(e),
"context": error_context["code_context"],
"traceback": error_context["full_traceback"]
}
)
raise
await self.connection_semaphore.acquire()
task_id = id(asyncio.current_task())
try:
async with self.pool_lock:
if task_id not in self.connection_pool:
try:
conn = await aiosqlite.connect(
self.db_path,
timeout=30.0
)
await conn.execute('PRAGMA journal_mode = WAL')
await conn.execute('PRAGMA busy_timeout = 5000')
# Verify database structure
async with conn.execute("PRAGMA table_info(crawled_data)") as cursor:
columns = await cursor.fetchall()
column_names = [col[1] for col in columns]
expected_columns = {
'url', 'html', 'cleaned_html', 'markdown', 'extracted_content',
'success', 'media', 'links', 'metadata', 'screenshot',
'response_headers', 'downloaded_files'
}
missing_columns = expected_columns - set(column_names)
if missing_columns:
raise ValueError(f"Database missing columns: {missing_columns}")
self.connection_pool[task_id] = conn
yield self.connection_pool[task_id]
except Exception as e:
logger.error(f"Connection error: {e}")
raise
finally:
async with self.pool_lock:
if task_id in self.connection_pool:
await self.connection_pool[task_id].close()
del self.connection_pool[task_id]
except Exception as e:
import sys
error_context = get_error_context(sys.exc_info())
error_message = (
f"Unexpected error in db get_connection at line {error_context['line_no']} "
f"in {error_context['function']} ({error_context['filename']}):\n"
f"Error: {str(e)}\n\n"
f"Code context:\n{error_context['code_context']}"
)
self.logger.error(
message=create_box_message(error_message, type= "error"),
)
raise
yield self.connection_pool[task_id]
except Exception as e:
import sys
error_context = get_error_context(sys.exc_info())
error_message = (
f"Unexpected error in db get_connection at line {error_context['line_no']} "
f"in {error_context['function']} ({error_context['filename']}):\n"
f"Error: {str(e)}\n\n"
f"Code context:\n{error_context['code_context']}"
)
self.logger.error(
message=create_box_message(error_message, type= "error"),
)
raise
finally:
async with self.pool_lock:
if task_id in self.connection_pool:
await self.connection_pool[task_id].close()
del self.connection_pool[task_id]
self.connection_semaphore.release()
async def execute_with_retry(self, operation, *args):
"""Execute database operations with retry logic"""
@@ -71,13 +196,21 @@ class AsyncDatabaseManager:
return result
except Exception as e:
if attempt == self.max_retries - 1:
logger.error(f"Operation failed after {self.max_retries} attempts: {e}")
self.logger.error(
message="Operation failed after {retries} attempts: {error}",
tag="ERROR",
force_verbose=True,
params={
"retries": self.max_retries,
"error": str(e)
}
)
raise
await asyncio.sleep(1 * (attempt + 1)) # Linear backoff: wait 1s, 2s, 3s, ...
async def ainit_db(self):
"""Initialize database schema"""
async def _init(db):
async with aiosqlite.connect(self.db_path, timeout=30.0) as db:
await db.execute('''
CREATE TABLE IF NOT EXISTS crawled_data (
url TEXT PRIMARY KEY,
@@ -89,71 +222,169 @@ class AsyncDatabaseManager:
media TEXT DEFAULT "{}",
links TEXT DEFAULT "{}",
metadata TEXT DEFAULT "{}",
screenshot TEXT DEFAULT ""
screenshot TEXT DEFAULT "",
response_headers TEXT DEFAULT "{}",
downloaded_files TEXT DEFAULT "{}" -- New column added
)
''')
await db.commit()
await self.execute_with_retry(_init)
await self.update_db_schema()
async def update_db_schema(self):
"""Update database schema if needed"""
async def _check_columns(db):
async with aiosqlite.connect(self.db_path, timeout=30.0) as db:
cursor = await db.execute("PRAGMA table_info(crawled_data)")
columns = await cursor.fetchall()
return [column[1] for column in columns]
column_names = [column[1] for column in columns]
# List of new columns to add
new_columns = ['media', 'links', 'metadata', 'screenshot', 'response_headers', 'downloaded_files']
for column in new_columns:
if column not in column_names:
await self.aalter_db_add_column(column, db)
await db.commit()
column_names = await self.execute_with_retry(_check_columns)
for column in ['media', 'links', 'metadata', 'screenshot']:
if column not in column_names:
await self.aalter_db_add_column(column)
async def aalter_db_add_column(self, new_column: str):
async def aalter_db_add_column(self, new_column: str, db):
"""Add new column to the database"""
async def _alter(db):
if new_column == 'response_headers':
await db.execute(f'ALTER TABLE crawled_data ADD COLUMN {new_column} TEXT DEFAULT "{{}}"')
else:
await db.execute(f'ALTER TABLE crawled_data ADD COLUMN {new_column} TEXT DEFAULT ""')
logger.info(f"Added column '{new_column}' to the database.")
self.logger.info(
message="Added column '{column}' to the database",
tag="INIT",
params={"column": new_column}
)
await self.execute_with_retry(_alter)
async def aget_cached_url(self, url: str) -> Optional[Tuple[str, str, str, str, str, str, str, bool, str]]:
"""Retrieve cached URL data"""
async def aget_cached_url(self, url: str) -> Optional[CrawlResult]:
"""Retrieve cached URL data as CrawlResult"""
async def _get(db):
async with db.execute(
'SELECT url, html, cleaned_html, markdown, extracted_content, success, media, links, metadata, screenshot FROM crawled_data WHERE url = ?',
(url,)
'SELECT * FROM crawled_data WHERE url = ?', (url,)
) as cursor:
return await cursor.fetchone()
row = await cursor.fetchone()
if not row:
return None
# Get column names
columns = [description[0] for description in cursor.description]
# Create dict from row data
row_dict = dict(zip(columns, row))
# Load content from files using stored hashes
content_fields = {
'html': row_dict['html'],
'cleaned_html': row_dict['cleaned_html'],
'markdown': row_dict['markdown'],
'extracted_content': row_dict['extracted_content'],
'screenshot': row_dict['screenshot'],
'screenshots': row_dict['screenshot'],
}
for field, hash_value in content_fields.items():
if hash_value:
content = await self._load_content(
hash_value,
field.split('_')[0] # Get content type from field name
)
row_dict[field] = content or ""
else:
row_dict[field] = ""
# Parse JSON fields
json_fields = ['media', 'links', 'metadata', 'response_headers']
for field in json_fields:
try:
row_dict[field] = json.loads(row_dict[field]) if row_dict[field] else {}
except json.JSONDecodeError:
row_dict[field] = {}
# Parse downloaded_files
try:
row_dict['downloaded_files'] = json.loads(row_dict['downloaded_files']) if row_dict['downloaded_files'] else []
except json.JSONDecodeError:
row_dict['downloaded_files'] = []
# Remove any fields not in CrawlResult model
valid_fields = CrawlResult.__annotations__.keys()
filtered_dict = {k: v for k, v in row_dict.items() if k in valid_fields}
return CrawlResult(**filtered_dict)
try:
return await self.execute_with_retry(_get)
except Exception as e:
logger.error(f"Error retrieving cached URL: {e}")
self.logger.error(
message="Error retrieving cached URL: {error}",
tag="ERROR",
force_verbose=True,
params={"error": str(e)}
)
return None
async def acache_url(self, url: str, html: str, cleaned_html: str, markdown: str, extracted_content: str, success: bool, media: str = "{}", links: str = "{}", metadata: str = "{}", screenshot: str = ""):
"""Cache URL data with retry logic"""
async def acache_url(self, result: CrawlResult):
"""Cache CrawlResult data"""
# Store content files and get hashes
content_map = {
'html': (result.html, 'html'),
'cleaned_html': (result.cleaned_html or "", 'cleaned'),
'markdown': (result.markdown or "", 'markdown'),
'extracted_content': (result.extracted_content or "", 'extracted'),
'screenshot': (result.screenshot or "", 'screenshots')
}
content_hashes = {}
for field, (content, content_type) in content_map.items():
content_hashes[field] = await self._store_content(content, content_type)
async def _cache(db):
await db.execute('''
INSERT INTO crawled_data (url, html, cleaned_html, markdown, extracted_content, success, media, links, metadata, screenshot)
VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
INSERT INTO crawled_data (
url, html, cleaned_html, markdown,
extracted_content, success, media, links, metadata,
screenshot, response_headers, downloaded_files
)
VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
ON CONFLICT(url) DO UPDATE SET
html = excluded.html,
cleaned_html = excluded.cleaned_html,
markdown = excluded.markdown,
extracted_content = excluded.extracted_content,
success = excluded.success,
media = excluded.media,
links = excluded.links,
metadata = excluded.metadata,
screenshot = excluded.screenshot
''', (url, html, cleaned_html, markdown, extracted_content, success, media, links, metadata, screenshot))
media = excluded.media,
links = excluded.links,
metadata = excluded.metadata,
screenshot = excluded.screenshot,
response_headers = excluded.response_headers,
downloaded_files = excluded.downloaded_files
''', (
result.url,
content_hashes['html'],
content_hashes['cleaned_html'],
content_hashes['markdown'],
content_hashes['extracted_content'],
result.success,
json.dumps(result.media),
json.dumps(result.links),
json.dumps(result.metadata or {}),
content_hashes['screenshot'],
json.dumps(result.response_headers or {}),
json.dumps(result.downloaded_files or [])
))
try:
await self.execute_with_retry(_cache)
except Exception as e:
logger.error(f"Error caching URL: {e}")
self.logger.error(
message="Error caching URL: {error}",
tag="ERROR",
force_verbose=True,
params={"error": str(e)}
)
async def aget_total_count(self) -> int:
"""Get total number of cached URLs"""
@@ -165,7 +396,12 @@ class AsyncDatabaseManager:
try:
return await self.execute_with_retry(_count)
except Exception as e:
logger.error(f"Error getting total count: {e}")
self.logger.error(
message="Error getting total count: {error}",
tag="ERROR",
force_verbose=True,
params={"error": str(e)}
)
return 0
async def aclear_db(self):
@@ -176,7 +412,12 @@ class AsyncDatabaseManager:
try:
await self.execute_with_retry(_clear)
except Exception as e:
logger.error(f"Error clearing database: {e}")
self.logger.error(
message="Error clearing database: {error}",
tag="ERROR",
force_verbose=True,
params={"error": str(e)}
)
async def aflush_db(self):
"""Drop the entire table"""
@@ -186,7 +427,46 @@ class AsyncDatabaseManager:
try:
await self.execute_with_retry(_flush)
except Exception as e:
logger.error(f"Error flushing database: {e}")
self.logger.error(
message="Error flushing database: {error}",
tag="ERROR",
force_verbose=True,
params={"error": str(e)}
)
async def _store_content(self, content: str, content_type: str) -> str:
"""Store content in filesystem and return hash"""
if not content:
return ""
content_hash = generate_content_hash(content)
file_path = os.path.join(self.content_paths[content_type], content_hash)
# Only write if file doesn't exist
if not os.path.exists(file_path):
async with aiofiles.open(file_path, 'w', encoding='utf-8') as f:
await f.write(content)
return content_hash
async def _load_content(self, content_hash: str, content_type: str) -> Optional[str]:
"""Load content from filesystem by hash"""
if not content_hash:
return None
file_path = os.path.join(self.content_paths[content_type], content_hash)
try:
async with aiofiles.open(file_path, 'r', encoding='utf-8') as f:
return await f.read()
except Exception:
self.logger.error(
message="Failed to load content: {file_path}",
tag="ERROR",
force_verbose=True,
params={"file_path": file_path}
)
return None
# Create a singleton instance
async_db_manager = AsyncDatabaseManager()
async_db_manager = AsyncDatabaseManager()
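The `_store_content` / `_load_content` pair in this diff writes large fields to disk under a hash of their content and keeps only the hash in SQLite. Below is a minimal synchronous sketch of that idea, not the project's implementation. It assumes SHA-256 as the hash, since the diff does not show how `generate_content_hash` works, and the names `store_content` and `load_content` are hypothetical.

```python
import hashlib
import os
import tempfile
from typing import Optional


def store_content(base_dir: str, content: str) -> str:
    """Write content to a file named after its hash and return the hash."""
    if not content:
        return ""
    # Assumption: SHA-256 stands in for the project's generate_content_hash().
    content_hash = hashlib.sha256(content.encode("utf-8")).hexdigest()
    file_path = os.path.join(base_dir, content_hash)
    # Identical content maps to the same file name, so skip the write if present.
    if not os.path.exists(file_path):
        with open(file_path, "w", encoding="utf-8") as f:
            f.write(content)
    return content_hash


def load_content(base_dir: str, content_hash: str) -> Optional[str]:
    """Read content back by hash; return None for an empty or unknown hash."""
    if not content_hash:
        return None
    try:
        with open(os.path.join(base_dir, content_hash), "r", encoding="utf-8") as f:
            return f.read()
    except OSError:
        return None


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as cache_dir:
        key = store_content(cache_dir, "<html>hello</html>")
        print(key, load_content(cache_dir, key))
```

Because the file name is derived from the content, caching the same page twice produces one file, and the database row only needs to hold the short hash.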

231
crawl4ai/async_logger.py Normal file

@@ -0,0 +1,231 @@
from enum import Enum
from typing import Optional, Dict, Any, Union
from colorama import Fore, Back, Style, init
import time
import os
from datetime import datetime
class LogLevel(Enum):
DEBUG = 1
INFO = 2
SUCCESS = 3
WARNING = 4
ERROR = 5
class AsyncLogger:
"""
Asynchronous logger with support for colored console output and file logging.
Supports templated messages with colored components.
"""
DEFAULT_ICONS = {
'INIT': '',
'READY': '',
'FETCH': '',
'SCRAPE': '',
'EXTRACT': '',
'COMPLETE': '',
'ERROR': '×',
'DEBUG': '',
'INFO': '',
'WARNING': '',
}
DEFAULT_COLORS = {
LogLevel.DEBUG: Fore.LIGHTBLACK_EX,
LogLevel.INFO: Fore.CYAN,
LogLevel.SUCCESS: Fore.GREEN,
LogLevel.WARNING: Fore.YELLOW,
LogLevel.ERROR: Fore.RED,
}
def __init__(
self,
log_file: Optional[str] = None,
log_level: LogLevel = LogLevel.INFO,
tag_width: int = 10,
icons: Optional[Dict[str, str]] = None,
colors: Optional[Dict[LogLevel, str]] = None,
verbose: bool = True
):
"""
Initialize the logger.
Args:
log_file: Optional file path for logging
log_level: Minimum log level to display
tag_width: Width for tag formatting
icons: Custom icons for different tags
colors: Custom colors for different log levels
verbose: Whether to output to console
"""
init() # Initialize colorama
self.log_file = log_file
self.log_level = log_level
self.tag_width = tag_width
self.icons = icons or self.DEFAULT_ICONS
self.colors = colors or self.DEFAULT_COLORS
self.verbose = verbose
# Create log file directory if needed
if log_file:
os.makedirs(os.path.dirname(os.path.abspath(log_file)), exist_ok=True)
def _format_tag(self, tag: str) -> str:
"""Format a tag with consistent width."""
return f"[{tag}]".ljust(self.tag_width, ".")
def _get_icon(self, tag: str) -> str:
"""Get the icon for a tag, defaulting to info icon if not found."""
return self.icons.get(tag, self.icons['INFO'])
def _write_to_file(self, message: str):
"""Write a message to the log file if configured."""
if self.log_file:
timestamp = datetime.now().strftime('%Y-%m-%d %H:%M:%S.%f')[:-3]
with open(self.log_file, 'a', encoding='utf-8') as f:
# Strip ANSI color codes for file output
clean_message = message.replace(Fore.RESET, '').replace(Style.RESET_ALL, '')
for color in vars(Fore).values():
if isinstance(color, str):
clean_message = clean_message.replace(color, '')
f.write(f"[{timestamp}] {clean_message}\n")
def _log(
self,
level: LogLevel,
message: str,
tag: str,
params: Optional[Dict[str, Any]] = None,
colors: Optional[Dict[str, str]] = None,
base_color: Optional[str] = None,
**kwargs
):
"""
Core logging method that handles message formatting and output.
Args:
level: Log level for this message
message: Message template string
tag: Tag for the message
params: Parameters to format into the message
colors: Color overrides for specific parameters
base_color: Base color for the entire message
"""
if level.value < self.log_level.value:
return
# Format the message with parameters if provided
if params:
try:
# First format the message with raw parameters
formatted_message = message.format(**params)
# Then apply colors if specified
if colors:
for key, color in colors.items():
# Find the formatted value in the message and wrap it with color
if key in params:
value_str = str(params[key])
formatted_message = formatted_message.replace(
value_str,
f"{color}{value_str}{Style.RESET_ALL}"
)
except KeyError as e:
formatted_message = f"LOGGING ERROR: Missing parameter {e} in message template"
level = LogLevel.ERROR
else:
formatted_message = message
# Construct the full log line
color = base_color or self.colors[level]
log_line = f"{color}{self._format_tag(tag)} {self._get_icon(tag)} {formatted_message}{Style.RESET_ALL}"
# Output to console if verbose
if self.verbose or kwargs.get("force_verbose", False):
print(log_line)
# Write to file if configured
self._write_to_file(log_line)
def debug(self, message: str, tag: str = "DEBUG", **kwargs):
"""Log a debug message."""
self._log(LogLevel.DEBUG, message, tag, **kwargs)
def info(self, message: str, tag: str = "INFO", **kwargs):
"""Log an info message."""
self._log(LogLevel.INFO, message, tag, **kwargs)
def success(self, message: str, tag: str = "SUCCESS", **kwargs):
"""Log a success message."""
self._log(LogLevel.SUCCESS, message, tag, **kwargs)
def warning(self, message: str, tag: str = "WARNING", **kwargs):
"""Log a warning message."""
self._log(LogLevel.WARNING, message, tag, **kwargs)
def error(self, message: str, tag: str = "ERROR", **kwargs):
"""Log an error message."""
self._log(LogLevel.ERROR, message, tag, **kwargs)
def url_status(
self,
url: str,
success: bool,
timing: float,
tag: str = "FETCH",
url_length: int = 50
):
"""
Convenience method for logging URL fetch status.
Args:
url: The URL being processed
success: Whether the operation was successful
timing: Time taken for the operation
tag: Tag for the message
url_length: Maximum length for URL in log
"""
self._log(
level=LogLevel.SUCCESS if success else LogLevel.ERROR,
message="{url:.{url_length}}... | Status: {status} | Time: {timing:.2f}s",
tag=tag,
params={
"url": url,
"url_length": url_length,
"status": success,
"timing": timing
},
colors={
"status": Fore.GREEN if success else Fore.RED,
"timing": Fore.YELLOW
}
)
def error_status(
self,
url: str,
error: str,
tag: str = "ERROR",
url_length: int = 50
):
"""
Convenience method for logging error status.
Args:
url: The URL being processed
error: Error message
tag: Tag for the message
url_length: Maximum length for URL in log
"""
self._log(
level=LogLevel.ERROR,
message="{url:.{url_length}}... | Error: {error}",
tag=tag,
params={
"url": url,
"url_length": url_length,
"error": error
}
)
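The `_log` method above first fills a message template with `str.format`, then wraps selected parameter values in color codes, and `_write_to_file` strips those codes again before writing. Here is a small standalone sketch of that flow, under some assumptions: it uses raw ANSI escape strings rather than colorama's `Fore`/`Style` constants, it strips codes with a regular expression instead of the replace loop in the diff, and `format_message` and `strip_ansi` are hypothetical names.

```python
import re
from typing import Any, Dict, Optional

# Raw ANSI escape sequences (assumption: the diff uses colorama constants instead).
GREEN = "\033[32m"
RED = "\033[31m"
RESET = "\033[0m"

# Matches SGR color sequences such as "\033[32m" or "\033[0m".
ANSI_RE = re.compile(r"\x1b\[[0-9;]*m")


def format_message(
    template: str,
    params: Dict[str, Any],
    colors: Optional[Dict[str, str]] = None,
) -> str:
    """Fill the template, then wrap the values of selected parameters in color."""
    message = template.format(**params)
    for key, color in (colors or {}).items():
        if key in params:
            value = str(params[key])
            # Note: this colors every occurrence of the value text in the message.
            message = message.replace(value, f"{color}{value}{RESET}")
    return message


def strip_ansi(text: str) -> str:
    """Remove color codes so the text can be written to a plain log file."""
    return ANSI_RE.sub("", text)


if __name__ == "__main__":
    line = format_message(
        "{url:.{url_length}}... | Status: {status}",
        {"url": "https://example.com/some/long/path", "url_length": 19, "status": True},
        colors={"status": GREEN},
    )
    print(line)
    print(strip_ansi(line))
```

The nested field in `{url:.{url_length}}` is standard `str.format` behavior: the inner value becomes the precision, which truncates the URL. One consequence of coloring by string replacement is that a short value such as `True` is also colored if the same text appears elsewhere in the message.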

183
crawl4ai/async_tools.py Normal file

@@ -0,0 +1,183 @@
import asyncio
import base64
import time
from abc import ABC, abstractmethod
from typing import Callable, Dict, Any, List, Optional, Awaitable
import os, sys, shutil
import tempfile, subprocess
from playwright.async_api import async_playwright, Page, Browser, Error
from playwright.async_api import TimeoutError as PlaywrightTimeoutError
from io import BytesIO
from PIL import Image, ImageDraw, ImageFont
from pathlib import Path
from playwright.async_api import ProxySettings
from pydantic import BaseModel
import hashlib
import json
import uuid
from .models import AsyncCrawlResponse
from .utils import create_box_message
from .user_agent_generator import UserAgentGenerator
from playwright_stealth import StealthConfig, stealth_async
class ManagedBrowser:
def __init__(self, browser_type: str = "chromium", user_data_dir: Optional[str] = None, headless: bool = False, logger = None, host: str = "localhost", debugging_port: int = 9222):
self.browser_type = browser_type
self.user_data_dir = user_data_dir
self.headless = headless
self.browser_process = None
self.temp_dir = None
self.debugging_port = debugging_port
self.host = host
self.logger = logger
self.shutting_down = False
async def start(self) -> str:
"""
Starts the browser process and returns the CDP endpoint URL.
If user_data_dir is not provided, creates a temporary directory.
"""
# Create temp dir if needed
if not self.user_data_dir:
self.temp_dir = tempfile.mkdtemp(prefix="browser-profile-")
self.user_data_dir = self.temp_dir
# Get browser path and args based on OS and browser type
browser_path = self._get_browser_path()
args = self._get_browser_args()
# Start browser process
try:
self.browser_process = subprocess.Popen(
args,
stdout=subprocess.PIPE,
stderr=subprocess.PIPE
)
# Monitor browser process output for errors
asyncio.create_task(self._monitor_browser_process())
await asyncio.sleep(2) # Give browser time to start
return f"http://{self.host}:{self.debugging_port}"
except Exception as e:
await self.cleanup()
raise Exception(f"Failed to start browser: {e}")
async def _monitor_browser_process(self):
"""Monitor the browser process for unexpected termination."""
if self.browser_process:
try:
stdout, stderr = await asyncio.gather(
asyncio.to_thread(self.browser_process.stdout.read),
asyncio.to_thread(self.browser_process.stderr.read)
)
# Check shutting_down flag BEFORE logging anything
if self.browser_process.poll() is not None:
if not self.shutting_down:
self.logger.error(
message="Browser process terminated unexpectedly | Code: {code} | STDOUT: {stdout} | STDERR: {stderr}",
tag="ERROR",
params={
"code": self.browser_process.returncode,
"stdout": stdout.decode(),
"stderr": stderr.decode()
}
)
await self.cleanup()
else:
self.logger.info(
message="Browser process terminated normally | Code: {code}",
tag="INFO",
params={"code": self.browser_process.returncode}
)
except Exception as e:
if not self.shutting_down:
self.logger.error(
message="Error monitoring browser process: {error}",
tag="ERROR",
params={"error": str(e)}
)
def _get_browser_path(self) -> str:
"""Returns the browser executable path based on OS and browser type"""
if sys.platform == "darwin": # macOS
paths = {
"chromium": "/Applications/Google Chrome.app/Contents/MacOS/Google Chrome",
"firefox": "/Applications/Firefox.app/Contents/MacOS/firefox",
"webkit": "/Applications/Safari.app/Contents/MacOS/Safari"
}
elif sys.platform == "win32": # Windows
paths = {
"chromium": "C:\\Program Files\\Google\\Chrome\\Application\\chrome.exe",
"firefox": "C:\\Program Files\\Mozilla Firefox\\firefox.exe",
"webkit": None # WebKit not supported on Windows
}
else: # Linux
paths = {
"chromium": "google-chrome",
"firefox": "firefox",
"webkit": None # WebKit not supported on Linux
}
return paths.get(self.browser_type)
def _get_browser_args(self) -> List[str]:
"""Returns browser-specific command line arguments"""
base_args = [self._get_browser_path()]
if self.browser_type == "chromium":
args = [
f"--remote-debugging-port={self.debugging_port}",
f"--user-data-dir={self.user_data_dir}",
]
if self.headless:
args.append("--headless=new")
elif self.browser_type == "firefox":
args = [
"--remote-debugging-port", str(self.debugging_port),
"--profile", self.user_data_dir,
]
if self.headless:
args.append("--headless")
else:
raise NotImplementedError(f"Browser type {self.browser_type} not supported")
return base_args + args
async def cleanup(self):
"""Cleanup browser process and temporary directory"""
# Set shutting_down flag BEFORE any termination actions
self.shutting_down = True
if self.browser_process:
try:
self.browser_process.terminate()
# Wait for process to end gracefully
for _ in range(10): # 10 attempts, 100ms each
if self.browser_process.poll() is not None:
break
await asyncio.sleep(0.1)
# Force kill if still running
if self.browser_process.poll() is None:
self.browser_process.kill()
await asyncio.sleep(0.1) # Brief wait for kill to take effect
except Exception as e:
self.logger.error(
message="Error terminating browser: {error}",
tag="ERROR",
params={"error": str(e)}
)
if self.temp_dir and os.path.exists(self.temp_dir):
try:
shutil.rmtree(self.temp_dir)
except Exception as e:
self.logger.error(
message="Error removing temporary directory: {error}",
tag="ERROR",
params={"error": str(e)}
)
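`ManagedBrowser.cleanup` follows a terminate, poll, then kill sequence. The sketch below isolates that pattern under stated assumptions: a sleeping Python child process stands in for the browser, and `stop_process` and `spawn_sleeper` are hypothetical helpers that do not appear in the diff.

```python
import asyncio
import subprocess
import sys


async def stop_process(
    proc: subprocess.Popen, attempts: int = 10, interval: float = 0.1
) -> int:
    """Ask a process to exit, poll briefly, then force-kill it if still alive."""
    proc.terminate()
    # Give the process a short window to exit on its own.
    for _ in range(attempts):
        if proc.poll() is not None:
            break
        await asyncio.sleep(interval)
    # Force kill if it ignored the terminate request.
    if proc.poll() is None:
        proc.kill()
    # Reap the process in a worker thread so the event loop is not blocked.
    return await asyncio.to_thread(proc.wait)


def spawn_sleeper() -> subprocess.Popen:
    """Hypothetical stand-in for a browser: a child process that just sleeps."""
    return subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])


if __name__ == "__main__":
    child = spawn_sleeper()
    print("exit code:", asyncio.run(stop_process(child)))
```

Polling with `asyncio.sleep` keeps the event loop free while waiting, which matters when other crawl tasks are running during shutdown. The exit code differs by platform, so callers should only rely on the process having ended.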


@@ -1,41 +1,131 @@
import os
import os, sys
import time
import warnings
from enum import Enum
from colorama import init, Fore, Back, Style
from pathlib import Path
from typing import Optional
from typing import Optional, List, Union
import json
import asyncio
from .models import CrawlResult
# from contextlib import nullcontext, asynccontextmanager
from contextlib import asynccontextmanager
from .models import CrawlResult, MarkdownGenerationResult
from .async_database import async_db_manager
from .chunking_strategy import *
from .content_filter_strategy import *
from .extraction_strategy import *
from .async_crawler_strategy import AsyncCrawlerStrategy, AsyncPlaywrightCrawlerStrategy, AsyncCrawlResponse
from .content_scrapping_strategy import WebScrappingStrategy
from .config import MIN_WORD_THRESHOLD, IMAGE_DESCRIPTION_MIN_WORD_THRESHOLD
from .cache_context import CacheMode, CacheContext, _legacy_to_cache_mode
from .markdown_generation_strategy import DefaultMarkdownGenerator, MarkdownGenerationStrategy
from .content_scraping_strategy import WebScrapingStrategy
from .async_logger import AsyncLogger
from .async_configs import BrowserConfig, CrawlerRunConfig
from .config import (
MIN_WORD_THRESHOLD,
IMAGE_DESCRIPTION_MIN_WORD_THRESHOLD,
URL_LOG_SHORTEN_LENGTH
)
from .utils import (
sanitize_input_encode,
InvalidCSSSelectorError,
format_html
format_html,
fast_format_html,
create_box_message
)
from ._version import __version__ as crawl4ai_version
from urllib.parse import urlparse
import random
from .__version__ import __version__ as crawl4ai_version
class AsyncWebCrawler:
"""
Asynchronous web crawler with flexible caching capabilities.
Migration Guide:
Old way (deprecated):
crawler = AsyncWebCrawler(always_by_pass_cache=True, browser_type="chromium", headless=True)
New way (recommended):
browser_config = BrowserConfig(browser_type="chromium", headless=True)
crawler = AsyncWebCrawler(config=browser_config)
"""
_domain_last_hit = {}
def __init__(
self,
crawler_strategy: Optional[AsyncCrawlerStrategy] = None,
always_by_pass_cache: bool = False,
config: Optional[BrowserConfig] = None,
always_bypass_cache: bool = False,
always_by_pass_cache: Optional[bool] = None, # Deprecated parameter
base_directory: str = str(os.getenv("CRAWL4_AI_BASE_DIRECTORY", Path.home())),
thread_safe: bool = False,
**kwargs,
):
self.crawler_strategy = crawler_strategy or AsyncPlaywrightCrawlerStrategy(
**kwargs
"""
Initialize the AsyncWebCrawler.
Args:
crawler_strategy: Strategy for crawling web pages. If None, will create AsyncPlaywrightCrawlerStrategy
config: Configuration object for browser settings. If None, will be created from kwargs
always_bypass_cache: Whether to always bypass cache (new parameter)
always_by_pass_cache: Deprecated, use always_bypass_cache instead
base_directory: Base directory for storing cache
thread_safe: Whether to use thread-safe operations
**kwargs: Additional arguments for backwards compatibility
"""
# Handle browser configuration
browser_config = config
if browser_config is not None:
if any(k in kwargs for k in ["browser_type", "headless", "viewport_width", "viewport_height"]):
self.logger.warning(
message="Both browser_config and legacy browser parameters provided. browser_config will take precedence.",
tag="WARNING"
)
else:
# Create browser config from kwargs for backwards compatibility
browser_config = BrowserConfig.from_kwargs(kwargs)
self.browser_config = browser_config
# Initialize logger first since other components may need it
self.logger = AsyncLogger(
log_file=os.path.join(base_directory, ".crawl4ai", "crawler.log"),
verbose=self.browser_config.verbose,
tag_width=10
)
self.always_by_pass_cache = always_by_pass_cache
# self.crawl4ai_folder = os.path.join(os.getenv("CRAWL4_AI_BASE_DIRECTORY", Path.home()), ".crawl4ai")
# Initialize crawler strategy
self.crawler_strategy = crawler_strategy or AsyncPlaywrightCrawlerStrategy(
browser_config=browser_config,
logger=self.logger,
**kwargs # Pass remaining kwargs for backwards compatibility
)
# Handle deprecated cache parameter
if always_by_pass_cache is not None:
if kwargs.get("warning", True):
warnings.warn(
"'always_by_pass_cache' is deprecated and will be removed in version 0.5.0. "
"Use 'always_bypass_cache' instead. "
"Pass warning=False to suppress this warning.",
DeprecationWarning,
stacklevel=2
)
self.always_bypass_cache = always_by_pass_cache
else:
self.always_bypass_cache = always_bypass_cache
# Thread safety setup
self._lock = asyncio.Lock() if thread_safe else None
# Initialize directories
self.crawl4ai_folder = os.path.join(base_directory, ".crawl4ai")
os.makedirs(self.crawl4ai_folder, exist_ok=True)
os.makedirs(f"{self.crawl4ai_folder}/cache", exist_ok=True)
self.ready = False
self.verbose = kwargs.get("verbose", False)
async def __aenter__(self):
await self.crawler_strategy.__aenter__()
@@ -44,246 +134,553 @@ class AsyncWebCrawler:
async def __aexit__(self, exc_type, exc_val, exc_tb):
await self.crawler_strategy.__aexit__(exc_type, exc_val, exc_tb)
async def awarmup(self):
# Print a message for crawl4ai and its version
print(f"[LOG] 🚀 Crawl4AI {crawl4ai_version}")
if self.verbose:
print("[LOG] 🌤️ Warming up the AsyncWebCrawler")
# await async_db_manager.ainit_db()
await async_db_manager.initialize()
await self.arun(
url="https://google.com/",
word_count_threshold=5,
bypass_cache=False,
verbose=False,
)
"""Initialize the crawler with warm-up sequence."""
self.logger.info(f"Crawl4AI {crawl4ai_version}", tag="INIT")
self.ready = True
if self.verbose:
print("[LOG] 🌞 AsyncWebCrawler is ready to crawl")
@asynccontextmanager
async def nullcontext(self):
"""Asynchronous no-op context manager."""
yield
async def arun(
self,
url: str,
word_count_threshold=MIN_WORD_THRESHOLD,
extraction_strategy: ExtractionStrategy = None,
chunking_strategy: ChunkingStrategy = RegexChunking(),
bypass_cache: bool = False,
css_selector: str = None,
screenshot: bool = False,
user_agent: str = None,
verbose=True,
**kwargs,
) -> CrawlResult:
try:
extraction_strategy = extraction_strategy or NoExtractionStrategy()
extraction_strategy.verbose = verbose
if not isinstance(extraction_strategy, ExtractionStrategy):
raise ValueError("Unsupported extraction strategy")
if not isinstance(chunking_strategy, ChunkingStrategy):
raise ValueError("Unsupported chunking strategy")
self,
url: str,
config: Optional[CrawlerRunConfig] = None,
# Legacy parameters maintained for backwards compatibility
word_count_threshold=MIN_WORD_THRESHOLD,
extraction_strategy: ExtractionStrategy = None,
chunking_strategy: ChunkingStrategy = RegexChunking(),
content_filter: RelevantContentFilter = None,
cache_mode: Optional[CacheMode] = None,
# Deprecated cache parameters
bypass_cache: bool = False,
disable_cache: bool = False,
no_cache_read: bool = False,
no_cache_write: bool = False,
# Other legacy parameters
css_selector: str = None,
screenshot: bool = False,
pdf: bool = False,
user_agent: str = None,
verbose=True,
**kwargs,
) -> CrawlResult:
"""
Runs the crawler for a single source: URL (web, local file, or raw HTML).
Migration Guide:
Old way (deprecated):
result = await crawler.arun(
url="https://example.com",
word_count_threshold=200,
screenshot=True,
...
)
word_count_threshold = max(word_count_threshold, MIN_WORD_THRESHOLD)
New way (recommended):
config = CrawlerRunConfig(
word_count_threshold=200,
screenshot=True,
...
)
result = await crawler.arun(url="https://example.com", config=config)
async_response: AsyncCrawlResponse = None
cached = None
screenshot_data = None
extracted_content = None
if not bypass_cache and not self.always_by_pass_cache:
cached = await async_db_manager.aget_cached_url(url)
Args:
url: The URL to crawl (http://, https://, file://, or raw:)
config: Configuration object controlling crawl behavior
[other parameters maintained for backwards compatibility]
Returns:
CrawlResult: The result of crawling and processing
"""
crawler_config = config
if not isinstance(url, str) or not url:
raise ValueError("Invalid URL, make sure the URL is a non-empty string")
async with self._lock or self.nullcontext():
try:
# Handle configuration
if crawler_config is not None:
if any(param is not None for param in [
word_count_threshold, extraction_strategy, chunking_strategy,
content_filter, cache_mode, css_selector, screenshot, pdf
]):
self.logger.warning(
message="Both crawler_config and legacy parameters provided. crawler_config will take precedence.",
tag="WARNING"
)
config = crawler_config
else:
# Merge all parameters into a single kwargs dict for config creation
config_kwargs = {
"word_count_threshold": word_count_threshold,
"extraction_strategy": extraction_strategy,
"chunking_strategy": chunking_strategy,
"content_filter": content_filter,
"cache_mode": cache_mode,
"bypass_cache": bypass_cache,
"disable_cache": disable_cache,
"no_cache_read": no_cache_read,
"no_cache_write": no_cache_write,
"css_selector": css_selector,
"screenshot": screenshot,
"pdf": pdf,
"verbose": verbose,
**kwargs
}
config = CrawlerRunConfig.from_kwargs(config_kwargs)
if kwargs.get("warmup", True) and not self.ready:
return None
# Handle deprecated cache parameters
if any([bypass_cache, disable_cache, no_cache_read, no_cache_write]):
if kwargs.get("warning", True):
warnings.warn(
"Cache control boolean flags are deprecated and will be removed in version 0.5.0. "
"Use 'cache_mode' parameter instead.",
DeprecationWarning,
stacklevel=2
)
# Convert legacy parameters if cache_mode not provided
if config.cache_mode is None:
config.cache_mode = _legacy_to_cache_mode(
disable_cache=disable_cache,
bypass_cache=bypass_cache,
no_cache_read=no_cache_read,
no_cache_write=no_cache_write
)
# Default to ENABLED if no cache mode specified
if config.cache_mode is None:
config.cache_mode = CacheMode.ENABLED
if cached:
html = sanitize_input_encode(cached[1])
extracted_content = sanitize_input_encode(cached[4])
if screenshot:
screenshot_data = cached[9]
if not screenshot_data:
cached = None
# Create cache context
cache_context = CacheContext(url, config.cache_mode, self.always_bypass_cache)
if not cached or not html:
t1 = time.time()
if user_agent:
self.crawler_strategy.update_user_agent(user_agent)
async_response: AsyncCrawlResponse = await self.crawler_strategy.crawl(url, screenshot=screenshot, **kwargs)
html = sanitize_input_encode(async_response.html)
screenshot_data = async_response.screenshot
t2 = time.time()
if verbose:
print(
f"[LOG] 🚀 Crawling done for {url}, success: {bool(html)}, time taken: {t2 - t1:.2f} seconds"
# Initialize processing variables
async_response: AsyncCrawlResponse = None
cached_result = None
screenshot_data = None
pdf_data = None
extracted_content = None
start_time = time.perf_counter()
# Try to get cached result if appropriate
if cache_context.should_read():
cached_result = await async_db_manager.aget_cached_url(url)
if cached_result:
html = sanitize_input_encode(cached_result.html)
extracted_content = sanitize_input_encode(cached_result.extracted_content or "")
# If a screenshot or PDF is requested but is not in the cache, discard the cached result
screenshot_data = cached_result.screenshot
pdf_data = cached_result.pdf
if (config.screenshot and not screenshot_data) or (config.pdf and not pdf_data):
cached_result = None
self.logger.url_status(
url=cache_context.display_url,
success=bool(html),
timing=time.perf_counter() - start_time,
tag="FETCH"
)
# Fetch fresh content if needed
if not cached_result or not html:
t1 = time.perf_counter()
if user_agent:
self.crawler_strategy.update_user_agent(user_agent)
# Pass config to crawl method
async_response = await self.crawler_strategy.crawl(
url,
config=config # Pass the entire config object
)
html = sanitize_input_encode(async_response.html)
screenshot_data = async_response.screenshot
pdf_data = async_response.pdf_data
t2 = time.perf_counter()
self.logger.url_status(
url=cache_context.display_url,
success=bool(html),
timing=t2 - t1,
tag="FETCH"
)
# Process the HTML content
crawl_result = await self.aprocess_html(
url=url,
html=html,
extracted_content=extracted_content,
config=config, # Pass the config object instead of individual parameters
screenshot=screenshot_data,
pdf_data=pdf_data,
verbose=config.verbose,
**kwargs
)
crawl_result = await self.aprocess_html(
url,
html,
extracted_content,
word_count_threshold,
extraction_strategy,
chunking_strategy,
css_selector,
screenshot_data,
verbose,
bool(cached),
async_response=async_response,
bypass_cache=bypass_cache,
**kwargs,
)
crawl_result.status_code = async_response.status_code if async_response else 200
crawl_result.response_headers = async_response.response_headers if async_response else {}
crawl_result.success = bool(html)
crawl_result.session_id = kwargs.get("session_id", None)
return crawl_result
except Exception as e:
if not hasattr(e, "msg"):
e.msg = str(e)
print(f"[ERROR] 🚫 arun(): Failed to crawl {url}, error: {e.msg}")
return CrawlResult(url=url, html="", markdown = f"[ERROR] 🚫 arun(): Failed to crawl {url}, error: {e.msg}", success=False, error_message=e.msg)
# Set response data
if async_response:
crawl_result.status_code = async_response.status_code
crawl_result.response_headers = async_response.response_headers
crawl_result.downloaded_files = async_response.downloaded_files
else:
crawl_result.status_code = 200
crawl_result.response_headers = cached_result.response_headers if cached_result else {}
async def arun_many(
self,
urls: List[str],
word_count_threshold=MIN_WORD_THRESHOLD,
extraction_strategy: ExtractionStrategy = None,
chunking_strategy: ChunkingStrategy = RegexChunking(),
bypass_cache: bool = False,
css_selector: str = None,
screenshot: bool = False,
user_agent: str = None,
verbose=True,
**kwargs,
) -> List[CrawlResult]:
tasks = [
self.arun(
url,
word_count_threshold,
extraction_strategy,
chunking_strategy,
bypass_cache,
css_selector,
screenshot,
user_agent,
verbose,
**kwargs
)
for url in urls
]
return await asyncio.gather(*tasks)
crawl_result.success = bool(html)
crawl_result.session_id = getattr(config, 'session_id', None)
self.logger.success(
message="{url:.50}... | Status: {status} | Total: {timing}",
tag="COMPLETE",
params={
"url": cache_context.display_url,
"status": crawl_result.success,
"timing": f"{time.perf_counter() - start_time:.2f}s"
},
colors={
"status": Fore.GREEN if crawl_result.success else Fore.RED,
"timing": Fore.YELLOW
}
)
# Update cache if appropriate
if cache_context.should_write() and not bool(cached_result):
await async_db_manager.acache_url(crawl_result)
return crawl_result
except Exception as e:
error_context = get_error_context(sys.exc_info())
error_message = (
f"Unexpected error in _crawl_web at line {error_context['line_no']} "
f"in {error_context['function']} ({error_context['filename']}):\n"
f"Error: {str(e)}\n\n"
f"Code context:\n{error_context['code_context']}"
)
# if not hasattr(e, "msg"):
# e.msg = str(e)
self.logger.error_status(
url=url,
error=create_box_message(error_message, type="error"),
tag="ERROR"
)
return CrawlResult(
url=url,
html="",
success=False,
error_message=error_message
)
async def aprocess_html(
self,
url: str,
html: str,
extracted_content: str,
word_count_threshold: int,
extraction_strategy: ExtractionStrategy,
chunking_strategy: ChunkingStrategy,
css_selector: str,
screenshot: str,
verbose: bool,
is_cached: bool,
**kwargs,
) -> CrawlResult:
t = time.time()
# Extract content from HTML
try:
t1 = time.time()
scrapping_strategy = WebScrappingStrategy()
# result = await scrapping_strategy.ascrap(
result = scrapping_strategy.scrap(
url,
html,
word_count_threshold=word_count_threshold,
css_selector=css_selector,
only_text=kwargs.get("only_text", False),
image_description_min_word_threshold=kwargs.get(
"image_description_min_word_threshold", IMAGE_DESCRIPTION_MIN_WORD_THRESHOLD
),
**kwargs,
self,
url: str,
html: str,
extracted_content: str,
config: CrawlerRunConfig,
screenshot: str,
pdf_data: str,
verbose: bool,
**kwargs,
) -> CrawlResult:
"""
Process HTML content using the provided configuration.
Args:
url: The URL being processed
html: Raw HTML content
extracted_content: Previously extracted content (if any)
config: Configuration object controlling processing behavior
screenshot: Screenshot data (if any)
pdf_data: PDF data (if any)
verbose: Whether to enable verbose logging
**kwargs: Additional parameters for backwards compatibility
Returns:
CrawlResult: Processed result containing extracted and formatted content
"""
try:
_url = url if not kwargs.get("is_raw_html", False) else "Raw HTML"
t1 = time.perf_counter()
# Initialize scraping strategy
scrapping_strategy = WebScrapingStrategy(logger=self.logger)
# Process HTML content
result = scrapping_strategy.scrap(
url,
html,
word_count_threshold=config.word_count_threshold,
css_selector=config.css_selector,
only_text=config.only_text,
image_description_min_word_threshold=config.image_description_min_word_threshold,
content_filter=config.content_filter,
**kwargs
)
if result is None:
raise ValueError(f"Process HTML, Failed to extract content from the website: {url}")
except InvalidCSSSelectorError as e:
raise ValueError(str(e))
except Exception as e:
raise ValueError(f"Process HTML, Failed to extract content from the website: {url}, error: {str(e)}")
# Extract results
cleaned_html = sanitize_input_encode(result.get("cleaned_html", ""))
fit_markdown = sanitize_input_encode(result.get("fit_markdown", ""))
fit_html = sanitize_input_encode(result.get("fit_html", ""))
media = result.get("media", [])
links = result.get("links", [])
metadata = result.get("metadata", {})
# Markdown Generation
markdown_generator: Optional[MarkdownGenerationStrategy] = config.markdown_generator or DefaultMarkdownGenerator()
if not config.content_filter and not markdown_generator.content_filter:
markdown_generator.content_filter = PruningContentFilter()
markdown_result: MarkdownGenerationResult = markdown_generator.generate_markdown(
cleaned_html=cleaned_html,
base_url=url,
# html2text_options=kwargs.get('html2text', {})
)
if verbose:
print(
f"[LOG] 🚀 Content extracted for {url}, success: True, time taken: {time.time() - t1:.2f} seconds"
markdown_v2 = markdown_result
markdown = sanitize_input_encode(markdown_result.raw_markdown)
# Log processing completion
self.logger.info(
message="Processed {url:.50}... | Time: {timing}ms",
tag="SCRAPE",
params={
"url": _url,
"timing": int((time.perf_counter() - t1) * 1000)
}
)
# Handle content extraction if needed
if (extracted_content is None and
config.extraction_strategy and
config.chunking_strategy and
not isinstance(config.extraction_strategy, NoExtractionStrategy)):
t1 = time.perf_counter()
# Handle different extraction strategy types
if isinstance(config.extraction_strategy, JsonCssExtractionStrategy):
config.extraction_strategy.verbose = verbose
extracted_content = config.extraction_strategy.run(url, [html])
extracted_content = json.dumps(extracted_content, indent=4, default=str, ensure_ascii=False)
else:
sections = config.chunking_strategy.chunk(markdown)
extracted_content = config.extraction_strategy.run(url, sections)
extracted_content = json.dumps(extracted_content, indent=4, default=str, ensure_ascii=False)
# Log extraction completion
self.logger.info(
message="Completed for {url:.50}... | Time: {timing}s",
tag="EXTRACT",
params={
"url": _url,
"timing": time.perf_counter() - t1
}
)
if result is None:
raise ValueError(f"Process HTML, Failed to extract content from the website: {url}")
except InvalidCSSSelectorError as e:
raise ValueError(str(e))
except Exception as e:
raise ValueError(f"Process HTML, Failed to extract content from the website: {url}, error: {str(e)}")
# Handle screenshot and PDF data
screenshot_data = None if not screenshot else screenshot
pdf_data = None if not pdf_data else pdf_data
cleaned_html = sanitize_input_encode(result.get("cleaned_html", ""))
markdown = sanitize_input_encode(result.get("markdown", ""))
fit_markdown = sanitize_input_encode(result.get("fit_markdown", ""))
fit_html = sanitize_input_encode(result.get("fit_html", ""))
media = result.get("media", [])
links = result.get("links", [])
metadata = result.get("metadata", {})
# Apply HTML formatting if requested
if config.prettiify:
cleaned_html = fast_format_html(cleaned_html)
if extracted_content is None and extraction_strategy and chunking_strategy:
if verbose:
print(
f"[LOG] 🔥 Extracting semantic blocks for {url}, Strategy: {self.__class__.__name__}"
# Return complete crawl result
return CrawlResult(
url=url,
html=html,
cleaned_html=cleaned_html,
markdown_v2=markdown_v2,
markdown=markdown,
fit_markdown=fit_markdown,
fit_html=fit_html,
media=media,
links=links,
metadata=metadata,
screenshot=screenshot_data,
pdf=pdf_data,
extracted_content=extracted_content,
success=True,
error_message="",
)
async def arun_many(
self,
urls: List[str],
config: Optional[CrawlerRunConfig] = None,
# Legacy parameters maintained for backwards compatibility
word_count_threshold=MIN_WORD_THRESHOLD,
extraction_strategy: ExtractionStrategy = None,
chunking_strategy: ChunkingStrategy = RegexChunking(),
content_filter: RelevantContentFilter = None,
cache_mode: Optional[CacheMode] = None,
bypass_cache: bool = False,
css_selector: str = None,
screenshot: bool = False,
pdf: bool = False,
user_agent: str = None,
verbose=True,
**kwargs,
) -> List[CrawlResult]:
"""
Runs the crawler for multiple URLs concurrently.
Migration Guide:
Old way (deprecated):
results = await crawler.arun_many(
urls,
word_count_threshold=200,
screenshot=True,
...
)
New way (recommended):
config = CrawlerRunConfig(
word_count_threshold=200,
screenshot=True,
...
)
results = await crawler.arun_many(urls, config=config)
# Check if extraction strategy is type of JsonCssExtractionStrategy
if isinstance(extraction_strategy, JsonCssExtractionStrategy) or isinstance(extraction_strategy, JsonCssExtractionStrategy):
extraction_strategy.verbose = verbose
extracted_content = extraction_strategy.run(url, [html])
extracted_content = json.dumps(extracted_content, indent=4, default=str, ensure_ascii=False)
Args:
urls: List of URLs to crawl
config: Configuration object controlling crawl behavior for all URLs
[other parameters maintained for backwards compatibility]
Returns:
List[CrawlResult]: Results for each URL
"""
crawler_config = config
# Handle configuration
if crawler_config is not None:
if any(param is not None for param in [
word_count_threshold, extraction_strategy, chunking_strategy,
content_filter, cache_mode, css_selector, screenshot, pdf
]):
self.logger.warning(
message="Both crawler_config and legacy parameters provided. crawler_config will take precedence.",
tag="WARNING"
)
config = crawler_config
else:
sections = chunking_strategy.chunk(markdown)
extracted_content = extraction_strategy.run(url, sections)
extracted_content = json.dumps(extracted_content, indent=4, default=str, ensure_ascii=False)
# Merge all parameters into a single kwargs dict for config creation
config_kwargs = {
"word_count_threshold": word_count_threshold,
"extraction_strategy": extraction_strategy,
"chunking_strategy": chunking_strategy,
"content_filter": content_filter,
"cache_mode": cache_mode,
"bypass_cache": bypass_cache,
"css_selector": css_selector,
"screenshot": screenshot,
"pdf": pdf,
"verbose": verbose,
**kwargs
}
config = CrawlerRunConfig.from_kwargs(config_kwargs)
if verbose:
print(
f"[LOG] 🚀 Extraction done for {url}, time taken: {time.time() - t:.2f} seconds."
if bypass_cache:
if kwargs.get("warning", True):
warnings.warn(
"'bypass_cache' is deprecated and will be removed in version 0.5.0. "
"Use 'cache_mode=CacheMode.BYPASS' instead. "
"Pass warning=False to suppress this warning.",
DeprecationWarning,
stacklevel=2
)
if config.cache_mode is None:
config.cache_mode = CacheMode.BYPASS
semaphore_count = config.semaphore_count or 5
semaphore = asyncio.Semaphore(semaphore_count)
async def crawl_with_semaphore(url):
# Handle rate limiting per domain
domain = urlparse(url).netloc
current_time = time.time()
self.logger.debug(
message="Started task for {url:.50}...",
tag="PARALLEL",
params={"url": url}
)
# Get delay settings from config
mean_delay = config.mean_delay
max_range = config.max_range
# Apply rate limiting
if domain in self._domain_last_hit:
time_since_last = current_time - self._domain_last_hit[domain]
if time_since_last < mean_delay:
delay = mean_delay + random.uniform(0, max_range)
await asyncio.sleep(delay)
self._domain_last_hit[domain] = current_time
async with semaphore:
return await self.arun(
url,
crawler_config=config, # Pass the entire config object
user_agent=user_agent # Maintain user_agent override capability
)
# Log start of concurrent crawling
self.logger.info(
message="Starting concurrent crawling for {count} URLs...",
tag="INIT",
params={"count": len(urls)}
)
screenshot = None if not screenshot else screenshot
# Execute concurrent crawls
start_time = time.perf_counter()
tasks = [crawl_with_semaphore(url) for url in urls]
results = await asyncio.gather(*tasks, return_exceptions=True)
end_time = time.perf_counter()
if not is_cached or kwargs.get("bypass_cache", False) or self.always_by_pass_cache:
await async_db_manager.acache_url(
url,
html,
cleaned_html,
markdown,
extracted_content,
True,
json.dumps(media),
json.dumps(links),
json.dumps(metadata),
screenshot=screenshot,
# Log completion
self.logger.success(
message="Concurrent crawling completed for {count} URLs | Total time: {timing}",
tag="COMPLETE",
params={
"count": len(urls),
"timing": f"{end_time - start_time:.2f}s"
},
colors={
"timing": Fore.YELLOW
}
)
return CrawlResult(
url=url,
html=html,
cleaned_html=format_html(cleaned_html),
markdown=markdown,
fit_markdown=fit_markdown,
fit_html= fit_html,
media=media,
links=links,
metadata=metadata,
screenshot=screenshot,
extracted_content=extracted_content,
success=True,
error_message="",
)
return [result if not isinstance(result, Exception) else str(result) for result in results]
async def aclear_cache(self):
# await async_db_manager.aclear_db()
"""Clear the cache database."""
await async_db_manager.cleanup()
async def aflush_cache(self):
"""Flush the cache database."""
await async_db_manager.aflush_db()
async def aget_cache_size(self):
"""Get the total number of cached items."""
return await async_db_manager.aget_total_count()
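The concurrency pattern in the new `arun_many` (a shared semaphore, a per-domain delay, and `asyncio.gather(..., return_exceptions=True)` with exceptions converted to strings) can be sketched in isolation. This is a minimal illustration, not the library's code: `fake_crawl`, `crawl_many`, and the example URLs are hypothetical stand-ins, and the delay values are shortened so the sketch runs quickly.

```python
import asyncio
import random
import time
from urllib.parse import urlparse


async def fake_crawl(url: str) -> str:
    """Hypothetical stand-in for crawler.arun(); fails for URLs containing 'bad'."""
    await asyncio.sleep(0.01)
    if "bad" in url:
        raise RuntimeError(f"cannot fetch {url}")
    return f"ok:{url}"


async def crawl_many(urls, max_concurrent=5, mean_delay=0.01, max_range=0.01):
    semaphore = asyncio.Semaphore(max_concurrent)  # caps simultaneous crawls
    domain_last_hit = {}                           # domain -> time of last request

    async def crawl_one(url):
        domain = urlparse(url).netloc
        now = time.time()
        last = domain_last_hit.get(domain)
        # Back off if the same domain was hit less than mean_delay seconds ago
        if last is not None and now - last < mean_delay:
            await asyncio.sleep(mean_delay + random.uniform(0, max_range))
        domain_last_hit[domain] = now
        async with semaphore:
            return await fake_crawl(url)

    # return_exceptions=True keeps one failure from cancelling the other tasks
    results = await asyncio.gather(
        *(crawl_one(u) for u in urls), return_exceptions=True
    )
    # Exceptions are reported as strings, in the same position as their URL
    return [r if not isinstance(r, Exception) else str(r) for r in results]


if __name__ == "__main__":
    print(asyncio.run(crawl_many(["https://a.example/1", "https://a.example/bad"])))
```

Because `gather` preserves input order, a caller can pair each result with its URL by position, but it has to check whether an entry is a string error message rather than a result object.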

79
crawl4ai/cache_context.py Normal file

@@ -0,0 +1,79 @@
from enum import Enum
class CacheMode(Enum):
"""
Defines the caching behavior for web crawling operations.
Modes:
- ENABLED: Normal caching behavior (read and write)
- DISABLED: No caching at all
- READ_ONLY: Only read from cache, don't write
- WRITE_ONLY: Only write to cache, don't read
- BYPASS: Bypass cache for this operation
"""
ENABLED = "enabled"
DISABLED = "disabled"
READ_ONLY = "read_only"
WRITE_ONLY = "write_only"
BYPASS = "bypass"
class CacheContext:
"""
Encapsulates cache-related decisions and URL handling.
This class centralizes all cache-related logic and URL type checking,
making the caching behavior more predictable and maintainable.
"""
def __init__(self, url: str, cache_mode: CacheMode, always_bypass: bool = False):
self.url = url
self.cache_mode = cache_mode
self.always_bypass = always_bypass
self.is_cacheable = url.startswith(('http://', 'https://', 'file://'))
self.is_web_url = url.startswith(('http://', 'https://'))
self.is_local_file = url.startswith("file://")
self.is_raw_html = url.startswith("raw:")
self._url_display = url if not self.is_raw_html else "Raw HTML"
def should_read(self) -> bool:
"""Determines if cache should be read based on context."""
if self.always_bypass or not self.is_cacheable:
return False
return self.cache_mode in [CacheMode.ENABLED, CacheMode.READ_ONLY]
def should_write(self) -> bool:
"""Determines if cache should be written based on context."""
if self.always_bypass or not self.is_cacheable:
return False
return self.cache_mode in [CacheMode.ENABLED, CacheMode.WRITE_ONLY]
@property
def display_url(self) -> str:
"""Returns the URL in display format."""
return self._url_display
def _legacy_to_cache_mode(
disable_cache: bool = False,
bypass_cache: bool = False,
no_cache_read: bool = False,
no_cache_write: bool = False
) -> CacheMode:
"""
Converts legacy cache parameters to the new CacheMode enum.
This is an internal function to help transition from the old boolean flags
to the new CacheMode system.
"""
if disable_cache:
return CacheMode.DISABLED
if bypass_cache:
return CacheMode.BYPASS
if no_cache_read and no_cache_write:
return CacheMode.DISABLED
if no_cache_read:
return CacheMode.WRITE_ONLY
if no_cache_write:
return CacheMode.READ_ONLY
return CacheMode.ENABLED
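To make the read/write rules in `CacheContext` easier to check at a glance, the sketch below restates them as a single function and prints the result for every mode. It is a simplified re-implementation for illustration only: `cache_permissions` is a hypothetical helper, and the enum is copied locally so the snippet runs on its own.

```python
from enum import Enum


class CacheMode(Enum):
    # Local copy of the enum from the diff, so the sketch is self-contained
    ENABLED = "enabled"
    DISABLED = "disabled"
    READ_ONLY = "read_only"
    WRITE_ONLY = "write_only"
    BYPASS = "bypass"


def cache_permissions(mode: CacheMode, url: str, always_bypass: bool = False):
    """Return (should_read, should_write) following the CacheContext logic."""
    # Only http(s) and file URLs are cacheable; 'raw:' HTML never is
    cacheable = url.startswith(("http://", "https://", "file://"))
    if always_bypass or not cacheable:
        return (False, False)
    return (
        mode in (CacheMode.ENABLED, CacheMode.READ_ONLY),
        mode in (CacheMode.ENABLED, CacheMode.WRITE_ONLY),
    )


for mode in CacheMode:
    print(mode.name, cache_permissions(mode, "https://example.com"))
```

Under these rules `BYPASS` and `DISABLED` both yield no read and no write, so any difference between them would have to come from code outside `CacheContext`.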


@@ -51,3 +51,12 @@ SOCIAL_MEDIA_DOMAINS = [
# If image format is in jpg, png or webp
# If image is in the first half of the total images extracted from the page
IMAGE_SCORE_THRESHOLD = 2
MAX_METRICS_HISTORY = 1000
NEED_MIGRATION = True
URL_LOG_SHORTEN_LENGTH = 30
SHOW_DEPRECATION_WARNINGS = True
SCREENSHOT_HEIGHT_TRESHOLD = 10000
PAGE_TIMEOUT=60000
DOWNLOAD_PAGE_TIMEOUT=60000


@@ -1,196 +0,0 @@
from bs4 import BeautifulSoup, Tag
import re
from typing import Optional
class ContentCleaningStrategy:
def __init__(self):
# Precompile regex patterns for performance
self.negative_patterns = re.compile(r'nav|footer|header|sidebar|ads|comment', re.I)
self.positive_patterns = re.compile(r'content|article|main|post', re.I)
self.priority_tags = {'article', 'main', 'section', 'div'}
self.non_content_tags = {'nav', 'footer', 'header', 'aside'}
# Thresholds
self.text_density_threshold = 9.0
self.min_word_count = 50
self.link_density_threshold = 0.2
self.max_dom_depth = 10 # To prevent excessive DOM traversal
def clean(self, clean_html: str) -> str:
"""
Main function that takes cleaned HTML and returns super cleaned HTML.
Args:
clean_html (str): The cleaned HTML content.
Returns:
str: The super cleaned HTML containing only the main content.
"""
try:
if not clean_html or not isinstance(clean_html, str):
return ''
soup = BeautifulSoup(clean_html, 'html.parser')
main_content = self.extract_main_content(soup)
if main_content:
super_clean_element = self.clean_element(main_content)
return str(super_clean_element)
else:
return ''
except Exception:
# Handle exceptions silently or log them as needed
return ''
def extract_main_content(self, soup: BeautifulSoup) -> Optional[Tag]:
"""
Identifies and extracts the main content element from the HTML.
Args:
soup (BeautifulSoup): The parsed HTML soup.
Returns:
Optional[Tag]: The Tag object containing the main content, or None if not found.
"""
candidates = []
for element in soup.find_all(self.priority_tags):
if self.is_non_content_tag(element):
continue
if self.has_negative_class_id(element):
continue
score = self.calculate_content_score(element)
candidates.append((score, element))
if not candidates:
return None
# Sort candidates by score in descending order
candidates.sort(key=lambda x: x[0], reverse=True)
# Select the element with the highest score
best_element = candidates[0][1]
return best_element
def calculate_content_score(self, element: Tag) -> float:
"""
Calculates a score for an element based on various heuristics.
Args:
element (Tag): The HTML element to score.
Returns:
float: The content score of the element.
"""
score = 0.0
if self.is_priority_tag(element):
score += 5.0
if self.has_positive_class_id(element):
score += 3.0
if self.has_negative_class_id(element):
score -= 3.0
if self.is_high_text_density(element):
score += 2.0
if self.is_low_link_density(element):
score += 2.0
if self.has_sufficient_content(element):
score += 2.0
if self.has_headings(element):
score += 3.0
dom_depth = self.calculate_dom_depth(element)
score += min(dom_depth, self.max_dom_depth) * 0.5 # Adjust weight as needed
return score
def is_priority_tag(self, element: Tag) -> bool:
"""Checks if the element is a priority tag."""
return element.name in self.priority_tags
def is_non_content_tag(self, element: Tag) -> bool:
"""Checks if the element is a non-content tag."""
return element.name in self.non_content_tags
def has_negative_class_id(self, element: Tag) -> bool:
"""Checks if the element has negative indicators in its class or id."""
class_id = ' '.join(filter(None, [
self.get_attr_str(element.get('class')),
element.get('id', '')
]))
return bool(self.negative_patterns.search(class_id))
def has_positive_class_id(self, element: Tag) -> bool:
"""Checks if the element has positive indicators in its class or id."""
class_id = ' '.join(filter(None, [
self.get_attr_str(element.get('class')),
element.get('id', '')
]))
return bool(self.positive_patterns.search(class_id))
@staticmethod
def get_attr_str(attr) -> str:
"""Converts an attribute value to a string."""
if isinstance(attr, list):
return ' '.join(attr)
elif isinstance(attr, str):
return attr
else:
return ''
def is_high_text_density(self, element: Tag) -> bool:
"""Determines if the element has high text density."""
text_density = self.calculate_text_density(element)
return text_density > self.text_density_threshold
def calculate_text_density(self, element: Tag) -> float:
"""Calculates the text density of an element."""
text_length = len(element.get_text(strip=True))
tag_count = len(element.find_all())
tag_count = tag_count or 1 # Prevent division by zero
return text_length / tag_count
def is_low_link_density(self, element: Tag) -> bool:
"""Determines if the element has low link density."""
link_density = self.calculate_link_density(element)
return link_density < self.link_density_threshold
def calculate_link_density(self, element: Tag) -> float:
"""Calculates the link density of an element."""
text = element.get_text(strip=True)
if not text:
return 0.0
link_text = ' '.join(a.get_text(strip=True) for a in element.find_all('a'))
return len(link_text) / len(text) if text else 0.0
def has_sufficient_content(self, element: Tag) -> bool:
"""Checks if the element has sufficient word count."""
word_count = len(element.get_text(strip=True).split())
return word_count >= self.min_word_count
def calculate_dom_depth(self, element: Tag) -> int:
"""Calculates the depth of an element in the DOM tree."""
depth = 0
current_element = element
while current_element.parent and depth < self.max_dom_depth:
depth += 1
current_element = current_element.parent
return depth
def has_headings(self, element: Tag) -> bool:
"""Checks if the element contains heading tags."""
return bool(element.find(['h1', 'h2', 'h3']))
def clean_element(self, element: Tag) -> Tag:
"""
Cleans the selected element by removing unnecessary attributes and nested non-content elements.
Args:
element (Tag): The HTML element to clean.
Returns:
Tag: The cleaned HTML element.
"""
for tag in element.find_all(['script', 'style', 'aside']):
tag.decompose()
for tag in element.find_all():
attrs = dict(tag.attrs)
for attr in attrs:
if attr in ['style', 'onclick', 'onmouseover', 'align', 'bgcolor']:
del tag.attrs[attr]
return element
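The additive scoring in the removed `calculate_content_score` can be traced by hand once the per-element measurements are known. The sketch below assumes those measurements are already computed and passed in as plain numbers; `ElementFeatures` and `content_score` are hypothetical names, and no HTML parsing is done here.

```python
from dataclasses import dataclass


@dataclass
class ElementFeatures:
    """Pre-computed measurements for one element (hypothetical container)."""
    is_priority_tag: bool
    positive_class_id: bool
    negative_class_id: bool
    text_length: int        # characters of text in the element
    tag_count: int          # number of descendant tags
    link_text_length: int   # characters of text inside <a> tags
    word_count: int
    has_headings: bool
    dom_depth: int


def content_score(f: ElementFeatures,
                  text_density_threshold: float = 9.0,
                  link_density_threshold: float = 0.2,
                  min_word_count: int = 50,
                  max_dom_depth: int = 10) -> float:
    score = 0.0
    if f.is_priority_tag:
        score += 5.0
    if f.positive_class_id:
        score += 3.0
    if f.negative_class_id:
        score -= 3.0
    # Text density: characters per tag, guarding against division by zero
    if f.text_length / (f.tag_count or 1) > text_density_threshold:
        score += 2.0
    # Link density: share of the text that sits inside links (0.0 when empty)
    link_density = f.link_text_length / f.text_length if f.text_length else 0.0
    if link_density < link_density_threshold:
        score += 2.0
    if f.word_count >= min_word_count:
        score += 2.0
    if f.has_headings:
        score += 3.0
    score += min(f.dom_depth, max_dom_depth) * 0.5
    return score


article = ElementFeatures(True, True, False, 1200, 20, 60, 200, True, 4)
sidebar = ElementFeatures(True, False, True, 200, 40, 160, 30, False, 6)
print(content_score(article), content_score(sidebar))
```

With these made-up inputs the article-like element collects every bonus, while the sidebar-like element keeps only the priority-tag and depth contributions, less the negative class/id penalty.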


@@ -0,0 +1,543 @@
import re
from bs4 import BeautifulSoup, Tag
from typing import List, Tuple, Dict
from rank_bm25 import BM25Okapi
from time import perf_counter
from collections import deque
from bs4 import BeautifulSoup, NavigableString, Tag, Comment
from .utils import clean_tokens
from abc import ABC, abstractmethod
import math
from snowballstemmer import stemmer
# import regex
# def tokenize_text(text):
# # Regular expression to match words or CJK (Chinese, Japanese, Korean) characters
# pattern = r'\p{L}+|\p{N}+|[\p{Script=Han}\p{Script=Hiragana}\p{Script=Katakana}ー]|[\p{P}]'
# return regex.findall(pattern, text)
# from nltk.stem import PorterStemmer
# ps = PorterStemmer()
class RelevantContentFilter(ABC):
def __init__(self, user_query: str = None):
self.user_query = user_query
self.included_tags = {
# Primary structure
'article', 'main', 'section', 'div',
# List structures
'ul', 'ol', 'li', 'dl', 'dt', 'dd',
# Text content
'p', 'span', 'blockquote', 'pre', 'code',
# Headers
'h1', 'h2', 'h3', 'h4', 'h5', 'h6',
# Tables
'table', 'thead', 'tbody', 'tr', 'td', 'th',
# Other semantic elements
'figure', 'figcaption', 'details', 'summary',
# Text formatting
'em', 'strong', 'b', 'i', 'mark', 'small',
# Rich content
'time', 'address', 'cite', 'q'
}
self.excluded_tags = {
'nav', 'footer', 'header', 'aside', 'script',
'style', 'form', 'iframe', 'noscript'
}
self.header_tags = {'h1', 'h2', 'h3', 'h4', 'h5', 'h6'}
self.negative_patterns = re.compile(
r'nav|footer|header|sidebar|ads|comment|promo|advert|social|share',
re.I
)
self.min_word_count = 2
@abstractmethod
def filter_content(self, html: str) -> List[str]:
"""Abstract method to be implemented by specific filtering strategies"""
pass
def extract_page_query(self, soup: BeautifulSoup, body: Tag) -> str:
"""Common method to extract page metadata with fallbacks"""
if self.user_query:
return self.user_query
query_parts = []
# Title
try:
title = soup.title.string
if title:
query_parts.append(title)
except Exception:
pass
if soup.find('h1'):
query_parts.append(soup.find('h1').get_text())
# Meta tags
temp = ""
for meta_name in ['keywords', 'description']:
meta = soup.find('meta', attrs={'name': meta_name})
if meta and meta.get('content'):
query_parts.append(meta['content'])
temp += meta['content']
# If still empty, grab first significant paragraph
if not temp:
# Find the first <p> tag whose text contains more than 150 characters
for p in body.find_all('p'):
if len(p.get_text()) > 150:
query_parts.append(p.get_text()[:150])
break
return ' '.join(filter(None, query_parts))
def extract_text_chunks(self, body: Tag, min_word_threshold: int = None) -> List[Tuple[int, str, str, Tag]]:
"""
Extracts text chunks from a BeautifulSoup body element while preserving order.
Returns a list of (index, text, tag_type, element) tuples for classification.
Args:
body: BeautifulSoup Tag object representing the body element
min_word_threshold: If set, drop chunks with fewer words than this
Returns:
List of (index, text, tag_type, element) tuples, where tag_type is 'header' or 'content'
"""
# Tags to ignore - inline elements that shouldn't break text flow
INLINE_TAGS = {
'a', 'abbr', 'acronym', 'b', 'bdo', 'big', 'br', 'button', 'cite', 'code',
'dfn', 'em', 'i', 'img', 'input', 'kbd', 'label', 'map', 'object', 'q',
'samp', 'script', 'select', 'small', 'span', 'strong', 'sub', 'sup',
'textarea', 'time', 'tt', 'var'
}
# Tags that typically contain meaningful headers
HEADER_TAGS = {'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'header'}
chunks = []
current_text = []
chunk_index = 0
def should_break_chunk(tag: Tag) -> bool:
"""Determine if a tag should cause a break in the current text chunk"""
return (
tag.name not in INLINE_TAGS
and not (tag.name == 'p' and len(current_text) == 0)
)
# Use deque for efficient push/pop operations
stack = deque([(body, False)])
while stack:
element, visited = stack.pop()
if visited:
# End of block element - flush accumulated text
if current_text and should_break_chunk(element):
text = ' '.join(''.join(current_text).split())
if text:
tag_type = 'header' if element.name in HEADER_TAGS else 'content'
chunks.append((chunk_index, text, tag_type, element))
chunk_index += 1
current_text = []
continue
if isinstance(element, NavigableString):
if str(element).strip():
current_text.append(str(element).strip())
continue
# Pre-allocate children to avoid multiple list operations
children = list(element.children)
if not children:
continue
# Mark block for revisit after processing children
stack.append((element, True))
# Add children in reverse order for correct processing
for child in reversed(children):
if isinstance(child, (Tag, NavigableString)):
stack.append((child, False))
# Handle any remaining text
if current_text:
text = ' '.join(''.join(current_text).split())
if text:
chunks.append((chunk_index, text, 'content', body))
if min_word_threshold:
chunks = [chunk for chunk in chunks if len(chunk[1].split()) >= min_word_threshold]
return chunks
def extract_text_chunks1(self, soup: BeautifulSoup) -> List[Tuple[int, str, Tag]]:
"""Common method for extracting text chunks"""
_text_cache = {}
def fast_text(element: Tag) -> str:
elem_id = id(element)
if elem_id in _text_cache:
return _text_cache[elem_id]
texts = []
for content in element.contents:
if isinstance(content, str):
text = content.strip()
if text:
texts.append(text)
result = ' '.join(texts)
_text_cache[elem_id] = result
return result
candidates = []
index = 0
def dfs(element):
nonlocal index
if isinstance(element, Tag):
if element.name in self.included_tags:
if not self.is_excluded(element):
text = fast_text(element)
word_count = len(text.split())
# Headers pass through with adjusted minimum
if element.name in self.header_tags:
if word_count >= 3: # Minimal sanity check for headers
candidates.append((index, text, element))
index += 1
# Regular content uses standard minimum
elif word_count >= self.min_word_count:
candidates.append((index, text, element))
index += 1
for child in element.children:
dfs(child)
dfs(soup.body if soup.body else soup)
return candidates
def is_excluded(self, tag: Tag) -> bool:
"""Common method for exclusion logic"""
if tag.name in self.excluded_tags:
return True
class_id = ' '.join(filter(None, [
' '.join(tag.get('class', [])),
tag.get('id', '')
]))
return bool(self.negative_patterns.search(class_id))
def clean_element(self, tag: Tag) -> str:
"""Common method for cleaning HTML elements with minimal overhead"""
if not tag or not isinstance(tag, Tag):
return ""
unwanted_tags = {'script', 'style', 'aside', 'form', 'iframe', 'noscript'}
unwanted_attrs = {'style', 'onclick', 'onmouseover', 'align', 'bgcolor', 'class', 'id'}
# Use string builder pattern for better performance
builder = []
def render_tag(elem):
if not isinstance(elem, Tag):
if isinstance(elem, str):
builder.append(elem.strip())
return
if elem.name in unwanted_tags:
return
# Start tag
builder.append(f'<{elem.name}')
# Add cleaned attributes
attrs = {k: v for k, v in elem.attrs.items() if k not in unwanted_attrs}
for key, value in attrs.items():
builder.append(f' {key}="{value}"')
builder.append('>')
# Process children
for child in elem.children:
render_tag(child)
# Close tag
builder.append(f'</{elem.name}>')
try:
render_tag(tag)
return ''.join(builder)
except Exception:
return str(tag) # Fallback to original if anything fails
class BM25ContentFilter(RelevantContentFilter):
def __init__(self, user_query: str = None, bm25_threshold: float = 1.0, language: str = 'english'):
super().__init__(user_query=user_query)
self.bm25_threshold = bm25_threshold
self.priority_tags = {
'h1': 5.0,
'h2': 4.0,
'h3': 3.0,
'title': 4.0,
'strong': 2.0,
'b': 1.5,
'em': 1.5,
'blockquote': 2.0,
'code': 2.0,
'pre': 1.5,
'th': 1.5, # Table headers
}
self.stemmer = stemmer(language)
def filter_content(self, html: str, min_word_threshold: int = None) -> List[str]:
"""Implements content filtering using BM25 algorithm with priority tag handling"""
if not html or not isinstance(html, str):
return []
soup = BeautifulSoup(html, 'lxml')
# Check if body is present
if not soup.body:
# Wrap in body tag if missing
soup = BeautifulSoup(f'<body>{html}</body>', 'lxml')
body = soup.find('body')
query = self.extract_page_query(soup, body)
if not query:
return []
# return [self.clean_element(soup)]
candidates = self.extract_text_chunks(body, min_word_threshold)
if not candidates:
return []
# Tokenize corpus
# tokenized_corpus = [chunk.lower().split() for _, chunk, _, _ in candidates]
# tokenized_query = query.lower().split()
# tokenized_corpus = [[ps.stem(word) for word in chunk.lower().split()]
# for _, chunk, _, _ in candidates]
# tokenized_query = [ps.stem(word) for word in query.lower().split()]
tokenized_corpus = [[self.stemmer.stemWord(word) for word in chunk.lower().split()]
for _, chunk, _, _ in candidates]
tokenized_query = [self.stemmer.stemWord(word) for word in query.lower().split()]
# tokenized_corpus = [[self.stemmer.stemWord(word) for word in tokenize_text(chunk.lower())]
# for _, chunk, _, _ in candidates]
# tokenized_query = [self.stemmer.stemWord(word) for word in tokenize_text(query.lower())]
# Clean from stop words and noise
tokenized_corpus = [clean_tokens(tokens) for tokens in tokenized_corpus]
tokenized_query = clean_tokens(tokenized_query)
bm25 = BM25Okapi(tokenized_corpus)
scores = bm25.get_scores(tokenized_query)
# Adjust scores with tag weights
adjusted_candidates = []
for score, (index, chunk, tag_type, tag) in zip(scores, candidates):
tag_weight = self.priority_tags.get(tag.name, 1.0)
adjusted_score = score * tag_weight
adjusted_candidates.append((adjusted_score, index, chunk, tag))
# Filter candidates by threshold
selected_candidates = [
(index, chunk, tag) for adjusted_score, index, chunk, tag in adjusted_candidates
if adjusted_score >= self.bm25_threshold
]
if not selected_candidates:
return []
# Sort selected candidates by original document order
selected_candidates.sort(key=lambda x: x[0])
return [self.clean_element(tag) for _, _, tag in selected_candidates]
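Review note: the scoring path in `BM25ContentFilter.filter_content` comes down to three steps: score each chunk against the query, multiply the score by a per-tag weight, then keep chunks at or above the threshold in document order. The sketch below reproduces that flow with a small pure-Python Okapi BM25 so that it runs without `rank_bm25`. The names `bm25_scores` and `select_chunks` are hypothetical, the IDF formula is an assumption that may differ from the library's, and stemming and stop-word cleaning are left out.

```python
import math
from collections import Counter


def bm25_scores(corpus, query, k1=1.5, b=0.75):
    """Score each tokenized document against the query with plain Okapi BM25.

    Stand-in for rank_bm25.BM25Okapi (assumption: non-negative IDF variant).
    """
    if not corpus:
        return []
    n_docs = len(corpus)
    avgdl = sum(len(doc) for doc in corpus) / n_docs
    # Document frequency: in how many documents each term appears
    df = Counter(term for doc in corpus for term in set(doc))
    scores = []
    for doc in corpus:
        tf = Counter(doc)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def select_chunks(candidates, query, tag_weights, threshold):
    """candidates: list of (index, text, tag_name); returns kept texts in document order."""
    corpus = [text.lower().split() for _, text, _ in candidates]
    scores = bm25_scores(corpus, query.lower().split())
    kept = [
        (index, text)
        for score, (index, text, tag_name) in zip(scores, candidates)
        if score * tag_weights.get(tag_name, 1.0) >= threshold
    ]
    kept.sort(key=lambda item: item[0])  # restore original document order
    return [text for _, text in kept]


candidates = [
    (2, "the crawler fetches pages with python", "p"),
    (0, "python web crawler tutorial", "h1"),
    (1, "cookie banner accept all", "div"),
]
print(select_chunks(candidates, "python crawler", {"h1": 5.0}, threshold=1.0))
```

Because the tag weight multiplies the raw score, a short heading can clear a threshold that a longer paragraph with the same query terms misses; lowering `threshold` admits the paragraph as well.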
class PruningContentFilter(RelevantContentFilter):
def __init__(self, user_query: str = None, min_word_threshold: int = None,
threshold_type: str = 'fixed', threshold: float = 0.48):
super().__init__(user_query)
self.min_word_threshold = min_word_threshold
self.threshold_type = threshold_type
self.threshold = threshold
# Add tag importance for dynamic threshold
self.tag_importance = {
'article': 1.5,
'main': 1.4,
'section': 1.3,
'p': 1.2,
'h1': 1.4,
'h2': 1.3,
'h3': 1.2,
'div': 0.7,
'span': 0.6
}
# Metric configuration
self.metric_config = {
'text_density': True,
'link_density': True,
'tag_weight': True,
'class_id_weight': True,
'text_length': True,
}
self.metric_weights = {
'text_density': 0.4,
'link_density': 0.2,
'tag_weight': 0.2,
'class_id_weight': 0.1,
'text_length': 0.1,
}
self.tag_weights = {
'div': 0.5,
'p': 1.0,
'article': 1.5,
'section': 1.0,
'span': 0.3,
'li': 0.5,
'ul': 0.5,
'ol': 0.5,
'h1': 1.2,
'h2': 1.1,
'h3': 1.0,
'h4': 0.9,
'h5': 0.8,
'h6': 0.7,
}
def filter_content(self, html: str, min_word_threshold: int = None) -> List[str]:
if not html or not isinstance(html, str):
return []
soup = BeautifulSoup(html, 'lxml')
if not soup.body:
soup = BeautifulSoup(f'<body>{html}</body>', 'lxml')
# Remove comments and unwanted tags
self._remove_comments(soup)
self._remove_unwanted_tags(soup)
# Prune tree starting from body
body = soup.find('body')
self._prune_tree(body)
# Extract remaining content as list of HTML strings
content_blocks = []
for element in body.children:
if isinstance(element, str) or not hasattr(element, 'name'):
continue
if len(element.get_text(strip=True)) > 0:
content_blocks.append(str(element))
return content_blocks
def _remove_comments(self, soup):
for element in soup(text=lambda text: isinstance(text, Comment)):
element.extract()
def _remove_unwanted_tags(self, soup):
for tag in self.excluded_tags:
for element in soup.find_all(tag):
element.decompose()
def _prune_tree(self, node):
if not node or not hasattr(node, 'name') or node.name is None:
return
text_len = len(node.get_text(strip=True))
tag_len = len(node.encode_contents().decode('utf-8'))
link_text_len = sum(len(s.strip()) for s in (a.string for a in node.find_all('a', recursive=False)) if s)
metrics = {
'node': node,
'tag_name': node.name,
'text_len': text_len,
'tag_len': tag_len,
'link_text_len': link_text_len
}
score = self._compute_composite_score(metrics, text_len, tag_len, link_text_len)
if self.threshold_type == 'fixed':
should_remove = score < self.threshold
else: # dynamic
tag_importance = self.tag_importance.get(node.name, 0.7)
text_ratio = text_len / tag_len if tag_len > 0 else 0
link_ratio = link_text_len / text_len if text_len > 0 else 1
threshold = self.threshold # base threshold
if tag_importance > 1:
threshold *= 0.8
if text_ratio > 0.4:
threshold *= 0.9
if link_ratio > 0.6:
threshold *= 1.2
should_remove = score < threshold
if should_remove:
node.decompose()
else:
children = [child for child in node.children if hasattr(child, 'name')]
for child in children:
self._prune_tree(child)
def _compute_composite_score(self, metrics, text_len, tag_len, link_text_len):
if self.min_word_threshold:
# Get raw text from metrics node - avoid extra processing
text = metrics['node'].get_text(strip=True)
word_count = text.count(' ') + 1
if word_count < self.min_word_threshold:
return -1.0 # Guaranteed removal
score = 0.0
total_weight = 0.0
if self.metric_config['text_density']:
density = text_len / tag_len if tag_len > 0 else 0
score += self.metric_weights['text_density'] * density
total_weight += self.metric_weights['text_density']
if self.metric_config['link_density']:
density = 1 - (link_text_len / text_len if text_len > 0 else 0)
score += self.metric_weights['link_density'] * density
total_weight += self.metric_weights['link_density']
if self.metric_config['tag_weight']:
tag_score = self.tag_weights.get(metrics['tag_name'], 0.5)
score += self.metric_weights['tag_weight'] * tag_score
total_weight += self.metric_weights['tag_weight']
if self.metric_config['class_id_weight']:
class_score = self._compute_class_id_weight(metrics['node'])
score += self.metric_weights['class_id_weight'] * max(0, class_score)
total_weight += self.metric_weights['class_id_weight']
if self.metric_config['text_length']:
score += self.metric_weights['text_length'] * math.log(text_len + 1)
total_weight += self.metric_weights['text_length']
return score / total_weight if total_weight > 0 else 0
def _compute_class_id_weight(self, node):
class_id_score = 0
if 'class' in node.attrs:
classes = ' '.join(node['class'])
# search() matches anywhere in the string; match() only matches at the start
if self.negative_patterns.search(classes):
class_id_score -= 0.5
if 'id' in node.attrs:
element_id = node['id']
if self.negative_patterns.search(element_id):
class_id_score -= 0.5
return class_id_score
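Review note: `_compute_composite_score` is a weighted average of five metrics. The sketch below restates it as a standalone function so the effect of each term can be checked by hand. It reuses the metric weights from this diff; the function name `composite_score` and its flat argument list are hypothetical, and the `min_word_threshold` shortcut is omitted.

```python
import math

# Metric weights copied from PruningContentFilter.__init__ in this diff
METRIC_WEIGHTS = {
    'text_density': 0.4,
    'link_density': 0.2,
    'tag_weight': 0.2,
    'class_id_weight': 0.1,
    'text_length': 0.1,
}


def composite_score(text_len, tag_len, link_text_len, tag_weight, class_id_score=0.0):
    """Weighted average of the pruning metrics for a single node."""
    parts = {
        # Share of the node's markup that is visible text
        'text_density': text_len / tag_len if tag_len > 0 else 0.0,
        # 1.0 when no text sits inside links, 0.0 when all of it does
        'link_density': 1 - (link_text_len / text_len if text_len > 0 else 0.0),
        'tag_weight': tag_weight,
        'class_id_weight': max(0, class_id_score),
        # Not normalized: grows with the logarithm of the text length
        'text_length': math.log(text_len + 1),
    }
    total_weight = sum(METRIC_WEIGHTS.values())
    weighted = sum(METRIC_WEIGHTS[name] * value for name, value in parts.items())
    return weighted / total_weight


# A text-heavy paragraph versus a short link-only span
print(composite_score(text_len=400, tag_len=500, link_text_len=20, tag_weight=1.0))
print(composite_score(text_len=10, tag_len=200, link_text_len=10, tag_weight=0.3))
```

Against the default fixed threshold of 0.48, the first node is kept and the second is pruned. Two things stand out when reading the code in this diff: `math.log(text_len + 1)` is not bounded to [0, 1], so for long nodes it can outweigh the other metrics, and since `_compute_class_id_weight` only returns 0 or a negative value, the `max(0, ...)` clamp means that metric never raises or lowers the score.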

@@ -0,0 +1,620 @@
import re # Point 1: Pre-Compile Regular Expressions
from abc import ABC, abstractmethod
from typing import Dict, Any, Optional
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor
import asyncio, requests, re, os
from .config import *
from bs4 import element, NavigableString, Comment
from bs4 import PageElement, Tag
from urllib.parse import urljoin
from requests.exceptions import InvalidSchema
# from .content_cleaning_strategy import ContentCleaningStrategy
from .content_filter_strategy import RelevantContentFilter, BM25ContentFilter#, HeuristicContentFilter
from .markdown_generation_strategy import MarkdownGenerationStrategy, DefaultMarkdownGenerator
from .models import MarkdownGenerationResult
from .utils import (
extract_metadata,
normalize_url,
is_external_url
)
# Pre-compile regular expressions for Open Graph and Twitter metadata
OG_REGEX = re.compile(r'^og:')
TWITTER_REGEX = re.compile(r'^twitter:')
DIMENSION_REGEX = re.compile(r"(\d+)(\D*)")
# Function to parse image height/width value and units
def parse_dimension(dimension):
if dimension:
# match = re.match(r"(\d+)(\D*)", dimension)
match = DIMENSION_REGEX.match(dimension)
if match:
number = int(match.group(1))
unit = match.group(2) or 'px' # Default unit is 'px' if not specified
return number, unit
return None, None
# Fetch image file metadata to extract size and extension
def fetch_image_file_size(img, base_url):
#If src is relative path construct full URL, if not it may be CDN URL
img_url = urljoin(base_url,img.get('src'))
try:
response = requests.head(img_url)
if response.status_code == 200:
return response.headers.get('Content-Length',None)
else:
print(f"Failed to retrieve file size for {img_url}")
return None
except InvalidSchema:
# No finally/return here: a return inside finally would discard the Content-Length value returned above
return None
class ContentScrapingStrategy(ABC):
@abstractmethod
def scrap(self, url: str, html: str, **kwargs) -> Dict[str, Any]:
pass
@abstractmethod
async def ascrap(self, url: str, html: str, **kwargs) -> Dict[str, Any]:
pass
class WebScrapingStrategy(ContentScrapingStrategy):
def __init__(self, logger=None):
self.logger = logger
def _log(self, level, message, tag="SCRAPE", **kwargs):
"""Helper method to safely use logger."""
if self.logger:
log_method = getattr(self.logger, level)
log_method(message=message, tag=tag, **kwargs)
def scrap(self, url: str, html: str, **kwargs) -> Dict[str, Any]:
return self._scrap(url, html, is_async=False, **kwargs)
async def ascrap(self, url: str, html: str, **kwargs) -> Dict[str, Any]:
return await asyncio.to_thread(self._scrap, url, html, **kwargs)
def _generate_markdown_content(self,
cleaned_html: str,
html: str,
url: str,
success: bool,
**kwargs) -> Dict[str, Any]:
markdown_generator: Optional[MarkdownGenerationStrategy] = kwargs.get('markdown_generator', DefaultMarkdownGenerator())
if markdown_generator:
try:
if kwargs.get('fit_markdown', False) and not markdown_generator.content_filter:
markdown_generator.content_filter = BM25ContentFilter(
user_query=kwargs.get('fit_markdown_user_query', None),
bm25_threshold=kwargs.get('fit_markdown_bm25_threshold', 1.0)
)
markdown_result: MarkdownGenerationResult = markdown_generator.generate_markdown(
cleaned_html=cleaned_html,
base_url=url,
html2text_options=kwargs.get('html2text', {})
)
return {
'markdown': markdown_result.raw_markdown,
'fit_markdown': markdown_result.fit_markdown,
'fit_html': markdown_result.fit_html,
'markdown_v2': markdown_result
}
except Exception as e:
self._log('error',
message="Error using new markdown generation strategy: {error}",
tag="SCRAPE",
params={"error": str(e)}
)
markdown_generator = None
return {
'markdown': f"Error using new markdown generation strategy: {str(e)}",
'fit_markdown': "Set flag 'fit_markdown' to True to get cleaned HTML content.",
'fit_html': "Set flag 'fit_markdown' to True to get cleaned HTML content.",
'markdown_v2': None
}
# Legacy method
"""
# h = CustomHTML2Text()
# h.update_params(**kwargs.get('html2text', {}))
# markdown = h.handle(cleaned_html)
# markdown = markdown.replace(' ```', '```')
# fit_markdown = "Set flag 'fit_markdown' to True to get cleaned HTML content."
# fit_html = "Set flag 'fit_markdown' to True to get cleaned HTML content."
# if kwargs.get('content_filter', None) or kwargs.get('fit_markdown', False):
# content_filter = kwargs.get('content_filter', None)
# if not content_filter:
# content_filter = BM25ContentFilter(
# user_query=kwargs.get('fit_markdown_user_query', None),
# bm25_threshold=kwargs.get('fit_markdown_bm25_threshold', 1.0)
# )
# fit_html = content_filter.filter_content(html)
# fit_html = '\n'.join('<div>{}</div>'.format(s) for s in fit_html)
# fit_markdown = h.handle(fit_html)
# markdown_v2 = MarkdownGenerationResult(
# raw_markdown=markdown,
# markdown_with_citations=markdown,
# references_markdown=markdown,
# fit_markdown=fit_markdown
# )
# return {
# 'markdown': markdown,
# 'fit_markdown': fit_markdown,
# 'fit_html': fit_html,
# 'markdown_v2' : markdown_v2
# }
"""
def flatten_nested_elements(self, node):
if isinstance(node, NavigableString):
return node
if len(node.contents) == 1 and isinstance(node.contents[0], Tag) and node.contents[0].name == node.name:
return self.flatten_nested_elements(node.contents[0])
node.contents = [self.flatten_nested_elements(child) for child in node.contents]
return node
def find_closest_parent_with_useful_text(self, tag, **kwargs):
image_description_min_word_threshold = kwargs.get('image_description_min_word_threshold', IMAGE_DESCRIPTION_MIN_WORD_THRESHOLD)
current_tag = tag
while current_tag:
current_tag = current_tag.parent
# Get the text content of the parent tag
if current_tag:
text_content = current_tag.get_text(separator=' ',strip=True)
# Check if the text content has at least image_description_min_word_threshold words
if len(text_content.split()) >= image_description_min_word_threshold:
return text_content
return None
def remove_unwanted_attributes(self, element, important_attrs, keep_data_attributes=False):
attrs_to_remove = []
for attr in element.attrs:
if attr not in important_attrs:
if keep_data_attributes:
if not attr.startswith('data-'):
attrs_to_remove.append(attr)
else:
attrs_to_remove.append(attr)
for attr in attrs_to_remove:
del element[attr]
def process_image(self, img, url, index, total_images, **kwargs):
parse_srcset = lambda s: [{'url': u.strip().split()[0], 'width': u.strip().split()[-1].rstrip('w')
if ' ' in u else None}
for u in [f"http{p}" for p in s.split("http") if p]]
# Constants for checks
classes_to_check = frozenset(['button', 'icon', 'logo'])
tags_to_check = frozenset(['button', 'input'])
# Pre-fetch commonly used attributes
style = img.get('style', '')
alt = img.get('alt', '')
src = img.get('src', '')
data_src = img.get('data-src', '')
width = img.get('width')
height = img.get('height')
parent = img.parent
parent_classes = parent.get('class', [])
# Quick validation checks
if ('display:none' in style or
parent.name in tags_to_check or
any(cls in c for c in parent_classes for cls in classes_to_check) or
any(c in src for c in classes_to_check) or
any(c in alt for c in classes_to_check)):
return None
# Quick score calculation
score = 0
if width and width.isdigit():
width_val = int(width)
score += 1 if width_val > 150 else 0
if height and height.isdigit():
height_val = int(height)
score += 1 if height_val > 150 else 0
if alt:
score += 1
score += 1 if index / total_images < 0.5 else 0
image_format = ''
if "data:image/" in src:
image_format = src.split(',')[0].split(';')[0].split('/')[1]
else:
image_format = os.path.splitext(src)[1].lower().strip('.').split('?')[0]
if image_format in ('jpg', 'png', 'webp', 'avif'):
score += 1
if score <= kwargs.get('image_score_threshold', IMAGE_SCORE_THRESHOLD):
return None
# Use set for deduplication
unique_urls = set()
image_variants = []
# Generate a unique group ID for this set of variants
group_id = index
# Base image info template
image_description_min_word_threshold = kwargs.get('image_description_min_word_threshold', IMAGE_DESCRIPTION_MIN_WORD_THRESHOLD)
base_info = {
'alt': alt,
'desc': self.find_closest_parent_with_useful_text(img, **kwargs),
'score': score,
'type': 'image',
'group_id': group_id # Group ID for this set of variants
}
# Inline function for adding variants
def add_variant(src, width=None):
if src and not src.startswith('data:') and src not in unique_urls:
unique_urls.add(src)
image_variants.append({**base_info, 'src': src, 'width': width})
# Process all sources
add_variant(src)
add_variant(data_src)
# Handle srcset and data-srcset in one pass
for attr in ('srcset', 'data-srcset'):
if value := img.get(attr):
for source in parse_srcset(value):
add_variant(source['url'], source['width'])
# Quick picture element check
if picture := img.find_parent('picture'):
for source in picture.find_all('source'):
if srcset := source.get('srcset'):
for src in parse_srcset(srcset):
add_variant(src['url'], src['width'])
# Framework-specific attributes in one pass
for attr, value in img.attrs.items():
if attr.startswith('data-') and ('src' in attr or 'srcset' in attr) and 'http' in value:
add_variant(value)
return image_variants if image_variants else None
def process_element(self, url, element: PageElement, **kwargs) -> Dict[str, Any]:
media = {'images': [], 'videos': [], 'audios': []}
internal_links_dict = {}
external_links_dict = {}
self._process_element(
url,
element,
media,
internal_links_dict,
external_links_dict,
**kwargs
)
return {
'media': media,
'internal_links_dict': internal_links_dict,
'external_links_dict': external_links_dict
}
def _process_element(self, url, element: PageElement, media: Dict[str, Any], internal_links_dict: Dict[str, Any], external_links_dict: Dict[str, Any], **kwargs) -> bool:
try:
if isinstance(element, NavigableString):
if isinstance(element, Comment):
element.extract()
return False
# if element.name == 'img':
# process_image(element, url, 0, 1)
# return True
if element.name in ['script', 'style', 'link', 'meta', 'noscript']:
element.decompose()
return False
keep_element = False
exclude_social_media_domains = SOCIAL_MEDIA_DOMAINS + kwargs.get('exclude_social_media_domains', [])
exclude_social_media_domains = list(set(exclude_social_media_domains))
try:
if element.name == 'a' and element.get('href'):
href = element.get('href', '').strip()
if not href: # Skip empty hrefs
return False
url_base = url.split('/')[2]
# Normalize the URL
try:
normalized_href = normalize_url(href, url)
except ValueError as e:
# logging.warning(f"Invalid URL format: {href}, Error: {str(e)}")
return False
link_data = {
'href': normalized_href,
'text': element.get_text().strip(),
'title': element.get('title', '').strip()
}
# Check for duplicates and add to appropriate dictionary
is_external = is_external_url(normalized_href, url_base)
if is_external:
if normalized_href not in external_links_dict:
external_links_dict[normalized_href] = link_data
else:
if normalized_href not in internal_links_dict:
internal_links_dict[normalized_href] = link_data
keep_element = True
# Handle external link exclusions
if is_external:
if kwargs.get('exclude_external_links', False):
element.decompose()
return False
elif kwargs.get('exclude_social_media_links', False):
if any(domain in normalized_href.lower() for domain in exclude_social_media_domains):
element.decompose()
return False
elif kwargs.get('exclude_domains', []):
if any(domain in normalized_href.lower() for domain in kwargs.get('exclude_domains', [])):
element.decompose()
return False
except Exception as e:
raise Exception(f"Error processing links: {str(e)}")
try:
if element.name == 'img':
potential_sources = ['src', 'data-src', 'srcset', 'data-lazy-src', 'data-original']
src = element.get('src', '')
while not src and potential_sources:
src = element.get(potential_sources.pop(0), '')
if not src:
element.decompose()
return False
# If it is srcset pick up the first image
if 'srcset' in element.attrs:
src = element.attrs['srcset'].split(',')[0].split(' ')[0]
# Check flag if we should remove external images
if kwargs.get('exclude_external_images', False):
src_url_base = src.split('/')[2]
url_base = url.split('/')[2]
if url_base not in src_url_base:
element.decompose()
return False
if not kwargs.get('exclude_external_images', False) and kwargs.get('exclude_social_media_links', False):
src_url_base = src.split('/')[2]
url_base = url.split('/')[2]
if any(domain in src for domain in exclude_social_media_domains):
element.decompose()
return False
# Handle exclude domains
if kwargs.get('exclude_domains', []):
if any(domain in src for domain in kwargs.get('exclude_domains', [])):
element.decompose()
return False
return True # Always keep image elements
except Exception as e:
raise Exception(f"Error processing images: {str(e)}")
# Check if flag to remove all forms is set
if kwargs.get('remove_forms', False) and element.name == 'form':
element.decompose()
return False
if element.name in ['video', 'audio']:
media[f"{element.name}s"].append({
'src': element.get('src'),
'alt': element.get('alt'),
'type': element.name,
'description': self.find_closest_parent_with_useful_text(element, **kwargs)
})
source_tags = element.find_all('source')
for source_tag in source_tags:
media[f"{element.name}s"].append({
'src': source_tag.get('src'),
'alt': element.get('alt'),
'type': element.name,
'description': self.find_closest_parent_with_useful_text(element, **kwargs)
})
return True # Always keep video and audio elements
if element.name in ONLY_TEXT_ELIGIBLE_TAGS:
if kwargs.get('only_text', False):
element.replace_with(element.get_text())
try:
self.remove_unwanted_attributes(element, IMPORTANT_ATTRS, kwargs.get('keep_data_attributes', False))
except Exception as e:
# print('Error removing unwanted attributes:', str(e))
self._log('error',
message="Error removing unwanted attributes: {error}",
tag="SCRAPE",
params={"error": str(e)}
)
# Process children
for child in list(element.children):
if isinstance(child, NavigableString) and not isinstance(child, Comment):
if len(child.strip()) > 0:
keep_element = True
else:
if self._process_element(url, child, media, internal_links_dict, external_links_dict, **kwargs):
keep_element = True
# Check word count
word_count_threshold = kwargs.get('word_count_threshold', MIN_WORD_THRESHOLD)
if not keep_element:
word_count = len(element.get_text(strip=True).split())
keep_element = word_count >= word_count_threshold
if not keep_element:
element.decompose()
return keep_element
except Exception as e:
# print('Error processing element:', str(e))
self._log('error',
message="Error processing element: {error}",
tag="SCRAPE",
params={"error": str(e)}
)
return False
def _scrap(self, url: str, html: str, word_count_threshold: int = MIN_WORD_THRESHOLD, css_selector: str = None, **kwargs) -> Dict[str, Any]:
success = True
if not html:
return None
soup = BeautifulSoup(html, 'lxml')
body = soup.body
try:
meta = extract_metadata("", soup)
except Exception as e:
self._log('error',
message="Error extracting metadata: {error}",
tag="SCRAPE",
params={"error": str(e)}
)
meta = {}
# Handle tag-based removal first - faster than CSS selection
excluded_tags = set(kwargs.get('excluded_tags', []) or [])
if excluded_tags:
for element in body.find_all(lambda tag: tag.name in excluded_tags):
element.extract()
# Handle CSS selector-based removal
excluded_selector = kwargs.get('excluded_selector', '')
if excluded_selector:
is_single_selector = ',' not in excluded_selector and ' ' not in excluded_selector
if is_single_selector:
while element := body.select_one(excluded_selector):
element.extract()
else:
for element in body.select(excluded_selector):
element.extract()
if css_selector:
selected_elements = body.select(css_selector)
if not selected_elements:
return {
'markdown': '',
'cleaned_html': '',
'success': True,
'media': {'images': [], 'videos': [], 'audios': []},
'links': {'internal': [], 'external': []},
'metadata': {},
'message': f"No elements found for CSS selector: {css_selector}"
}
# raise InvalidCSSSelectorError(f"Invalid CSS selector, No elements found for CSS selector: {css_selector}")
body = soup.new_tag('div')
for el in selected_elements:
body.append(el)
result_obj = self.process_element(
url,
body,
word_count_threshold = word_count_threshold,
**kwargs
)
links = {'internal': [], 'external': []}
media = result_obj['media']
internal_links_dict = result_obj['internal_links_dict']
external_links_dict = result_obj['external_links_dict']
# Update the links dictionary with unique links
links['internal'] = list(internal_links_dict.values())
links['external'] = list(external_links_dict.values())
# Process images sequentially and flatten each image's variants into one list
imgs = body.find_all('img')
media['images'] = [
img for result in (self.process_image(img, url, i, len(imgs))
for i, img in enumerate(imgs))
if result is not None
for img in result
]
body = self.flatten_nested_elements(body)
base64_pattern = re.compile(r'data:image/[^;]+;base64,([^"]+)')
for img in imgs:
src = img.get('src', '')
if base64_pattern.match(src):
# Replace base64 data with empty string
img['src'] = base64_pattern.sub('', src)
str_body = ""
try:
str_body = body.encode_contents().decode('utf-8')
except Exception as e:
# Reset body to the original HTML
success = False
body = BeautifulSoup(html, 'html.parser')
# Create a new div with a special ID
error_div = body.new_tag('div', id='crawl4ai_error_message')
error_div.string = '''
Crawl4AI Error: This page is not fully supported.
Possible reasons:
1. The page may have restrictions that prevent crawling.
2. The page might not be fully loaded.
Suggestions:
- Try calling the crawl function with these parameters:
magic=True,
- Set headless=False to visualize what's happening on the page.
If the issue persists, please check the page's structure and any potential anti-crawling measures.
'''
# Append the error div to the body
body.body.append(error_div)
str_body = body.encode_contents().decode('utf-8')
print(f"[LOG] 😧 Error: After processing the crawled HTML and removing irrelevant tags, nothing was left in the page. Check the markdown for further details.")
self._log('error',
message="After processing the crawled HTML and removing irrelevant tags, nothing was left in the page. Check the markdown for further details.",
tag="SCRAPE"
)
cleaned_html = str_body.replace('\n\n', '\n').replace('  ', ' ')
# markdown_content = self._generate_markdown_content(
# cleaned_html=cleaned_html,
# html=html,
# url=url,
# success=success,
# **kwargs
# )
return {
# **markdown_content,
'cleaned_html': cleaned_html,
'success': success,
'media': media,
'links': links,
'metadata': meta
}
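Review note: the `parse_srcset` lambda in `process_image` splits on the literal string `http`, which appears to assume absolute URLs; relative candidates such as `a.jpg 320w, b.jpg 640w` would be merged into one entry. A minimal alternative is sketched below, assuming candidates are separated by a comma followed by whitespace. The name `parse_srcset_candidates` is hypothetical, and this is not a full implementation of the HTML srcset parsing algorithm.

```python
import re


def parse_srcset_candidates(srcset):
    """Split a srcset value into [{'url': ..., 'width': ...}] entries.

    'width' is the number before a trailing 'w' descriptor, else None.
    """
    entries = []
    # Split on ", " rather than "," because URLs themselves may contain commas
    for candidate in re.split(r',\s+', srcset.strip()):
        parts = candidate.strip().rstrip(',').split()
        if not parts:
            continue
        url = parts[0]
        descriptor = parts[1] if len(parts) > 1 else None
        width = descriptor[:-1] if descriptor and descriptor.endswith('w') else None
        entries.append({'url': url, 'width': width})
    return entries


for entry in parse_srcset_candidates("a.jpg 320w, https://cdn.example.com/b.jpg 640w, c.jpg 2x"):
    print(entry)
```

Pixel-density descriptors such as `2x` are kept as candidates with `width` set to `None`, which matches how `add_variant` in this diff treats a missing width.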

@@ -1,541 +0,0 @@
from abc import ABC, abstractmethod
from typing import Dict, Any
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor
import asyncio, requests, re, os
from .config import *
from bs4 import element, NavigableString, Comment
from urllib.parse import urljoin
from requests.exceptions import InvalidSchema
from .content_cleaning_strategy import ContentCleaningStrategy
from .utils import (
sanitize_input_encode,
sanitize_html,
extract_metadata,
InvalidCSSSelectorError,
# CustomHTML2Text,
normalize_url,
is_external_url
)
from .html2text import HTML2Text
class CustomHTML2Text(HTML2Text):
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
self.inside_pre = False
self.inside_code = False
self.preserve_tags = set() # Set of tags to preserve
self.current_preserved_tag = None
self.preserved_content = []
self.preserve_depth = 0
# Configuration options
self.skip_internal_links = False
self.single_line_break = False
self.mark_code = False
self.include_sup_sub = False
self.body_width = 0
self.ignore_mailto_links = True
self.ignore_links = False
self.escape_backslash = False
self.escape_dot = False
self.escape_plus = False
self.escape_dash = False
self.escape_snob = False
def update_params(self, **kwargs):
"""Update parameters and set preserved tags."""
for key, value in kwargs.items():
if key == 'preserve_tags':
self.preserve_tags = set(value)
else:
setattr(self, key, value)
def handle_tag(self, tag, attrs, start):
# Handle preserved tags
if tag in self.preserve_tags:
if start:
if self.preserve_depth == 0:
self.current_preserved_tag = tag
self.preserved_content = []
# Format opening tag with attributes
attr_str = ''.join(f' {k}="{v}"' for k, v in attrs.items() if v is not None)
self.preserved_content.append(f'<{tag}{attr_str}>')
self.preserve_depth += 1
return
else:
self.preserve_depth -= 1
if self.preserve_depth == 0:
self.preserved_content.append(f'</{tag}>')
# Output the preserved HTML block with proper spacing
preserved_html = ''.join(self.preserved_content)
self.o('\n' + preserved_html + '\n')
self.current_preserved_tag = None
return
# If we're inside a preserved tag, collect all content
if self.preserve_depth > 0:
if start:
# Format nested tags with attributes
attr_str = ''.join(f' {k}="{v}"' for k, v in attrs.items() if v is not None)
self.preserved_content.append(f'<{tag}{attr_str}>')
else:
self.preserved_content.append(f'</{tag}>')
return
# Handle pre tags
if tag == 'pre':
if start:
self.o('```\n')
self.inside_pre = True
else:
self.o('\n```')
self.inside_pre = False
# elif tag in ["h1", "h2", "h3", "h4", "h5", "h6"]:
# pass
else:
super().handle_tag(tag, attrs, start)
def handle_data(self, data, entity_char=False):
"""Override handle_data to capture content within preserved tags."""
if self.preserve_depth > 0:
self.preserved_content.append(data)
return
super().handle_data(data, entity_char)
class ContentScrappingStrategy(ABC):
@abstractmethod
def scrap(self, url: str, html: str, **kwargs) -> Dict[str, Any]:
pass
@abstractmethod
async def ascrap(self, url: str, html: str, **kwargs) -> Dict[str, Any]:
pass
class WebScrappingStrategy(ContentScrappingStrategy):
def scrap(self, url: str, html: str, **kwargs) -> Dict[str, Any]:
return self._get_content_of_website_optimized(url, html, is_async=False, **kwargs)
async def ascrap(self, url: str, html: str, **kwargs) -> Dict[str, Any]:
return await asyncio.to_thread(self._get_content_of_website_optimized, url, html, **kwargs)
def _get_content_of_website_optimized(self, url: str, html: str, word_count_threshold: int = MIN_WORD_THRESHOLD, css_selector: str = None, **kwargs) -> Dict[str, Any]:
success = True
if not html:
return None
soup = BeautifulSoup(html, 'html.parser')
body = soup.body
image_description_min_word_threshold = kwargs.get('image_description_min_word_threshold', IMAGE_DESCRIPTION_MIN_WORD_THRESHOLD)
for tag in kwargs.get('excluded_tags', []) or []:
for el in body.select(tag):
el.decompose()
if css_selector:
selected_elements = body.select(css_selector)
if not selected_elements:
return {
'markdown': '',
'cleaned_html': '',
'success': True,
'media': {'images': [], 'videos': [], 'audios': []},
'links': {'internal': [], 'external': []},
'metadata': {},
'message': f"No elements found for CSS selector: {css_selector}"
}
# raise InvalidCSSSelectorError(f"Invalid CSS selector, No elements found for CSS selector: {css_selector}")
body = soup.new_tag('div')
for el in selected_elements:
body.append(el)
links = {'internal': [], 'external': []}
media = {'images': [], 'videos': [], 'audios': []}
internal_links_dict = {}
external_links_dict = {}
# Extract meaningful text for media files from closest parent
def find_closest_parent_with_useful_text(tag):
current_tag = tag
while current_tag:
current_tag = current_tag.parent
# Get the text content of the parent tag
if current_tag:
text_content = current_tag.get_text(separator=' ',strip=True)
# Check if the text content has at least word_count_threshold
if len(text_content.split()) >= image_description_min_word_threshold:
return text_content
return None
def process_image(img, url, index, total_images):
#Check if an image has valid display and inside undesired html elements
def is_valid_image(img, parent, parent_classes):
style = img.get('style', '')
src = img.get('src', '')
classes_to_check = ['button', 'icon', 'logo']
tags_to_check = ['button', 'input']
return all([
'display:none' not in style,
src,
not any(s in var for var in [src, img.get('alt', ''), *parent_classes] for s in classes_to_check),
parent.name not in tags_to_check
])
#Score an image for it's usefulness
def score_image_for_usefulness(img, base_url, index, images_count):
# Function to parse image height/width value and units
def parse_dimension(dimension):
if dimension:
match = re.match(r"(\d+)(\D*)", dimension)
if match:
number = int(match.group(1))
unit = match.group(2) or 'px' # Default unit is 'px' if not specified
return number, unit
return None, None
# Fetch image file metadata to extract size and extension
def fetch_image_file_size(img, base_url):
#If src is relative path construct full URL, if not it may be CDN URL
img_url = urljoin(base_url,img.get('src'))
try:
response = requests.head(img_url)
if response.status_code == 200:
return response.headers.get('Content-Length',None)
else:
print(f"Failed to retrieve file size for {img_url}")
return None
except InvalidSchema as e:
return None
finally:
return
image_height = img.get('height')
height_value, height_unit = parse_dimension(image_height)
image_width = img.get('width')
width_value, width_unit = parse_dimension(image_width)
image_size = 0 #int(fetch_image_file_size(img,base_url) or 0)
image_src = img.get('src','')
if "data:image/" in image_src:
image_format = image_src.split(',')[0].split(';')[0].split('/')[1]
else:
image_format = os.path.splitext(img.get('src',''))[1].lower()
# Remove . from format
image_format = image_format.strip('.').split('?')[0]
score = 0
if height_value:
if height_unit == 'px' and height_value > 150:
score += 1
if height_unit in ['%','vh','vmin','vmax'] and height_value >30:
score += 1
if width_value:
if width_unit == 'px' and width_value > 150:
score += 1
if width_unit in ['%','vh','vmin','vmax'] and width_value >30:
score += 1
if image_size > 10000:
score += 1
if img.get('alt') != '':
score+=1
if any(image_format==format for format in ['jpg','png','webp']):
score+=1
if index/images_count<0.5:
score+=1
return score
if not is_valid_image(img, img.parent, img.parent.get('class', [])):
return None
score = score_image_for_usefulness(img, url, index, total_images)
if score <= IMAGE_SCORE_THRESHOLD:
return None
return {
'src': img.get('src', ''),
'data-src': img.get('data-src', ''),
'alt': img.get('alt', ''),
'desc': find_closest_parent_with_useful_text(img),
'score': score,
'type': 'image'
}
def remove_unwanted_attributes(element, important_attrs, keep_data_attributes=False):
attrs_to_remove = []
for attr in element.attrs:
if attr not in important_attrs:
if keep_data_attributes:
if not attr.startswith('data-'):
attrs_to_remove.append(attr)
else:
attrs_to_remove.append(attr)
for attr in attrs_to_remove:
del element[attr]
def process_element(element: element.PageElement) -> bool:
try:
if isinstance(element, NavigableString):
if isinstance(element, Comment):
element.extract()
return False
# if element.name == 'img':
# process_image(element, url, 0, 1)
# return True
if element.name in ['script', 'style', 'link', 'meta', 'noscript']:
element.decompose()
return False
keep_element = False
exclude_social_media_domains = SOCIAL_MEDIA_DOMAINS + kwargs.get('exclude_social_media_domains', [])
exclude_social_media_domains = list(set(exclude_social_media_domains))
try:
if element.name == 'a' and element.get('href'):
href = element.get('href', '').strip()
if not href: # Skip empty hrefs
return False
url_base = url.split('/')[2]
# Normalize the URL
try:
normalized_href = normalize_url(href, url)
except ValueError as e:
# logging.warning(f"Invalid URL format: {href}, Error: {str(e)}")
return False
link_data = {
'href': normalized_href,
'text': element.get_text().strip(),
'title': element.get('title', '').strip()
}
# Check for duplicates and add to appropriate dictionary
is_external = is_external_url(normalized_href, url_base)
if is_external:
if normalized_href not in external_links_dict:
external_links_dict[normalized_href] = link_data
else:
if normalized_href not in internal_links_dict:
internal_links_dict[normalized_href] = link_data
keep_element = True
# Handle external link exclusions
if is_external:
if kwargs.get('exclude_external_links', False):
element.decompose()
return False
elif kwargs.get('exclude_social_media_links', False):
if any(domain in normalized_href.lower() for domain in exclude_social_media_domains):
element.decompose()
return False
elif kwargs.get('exclude_domains', []):
if any(domain in normalized_href.lower() for domain in kwargs.get('exclude_domains', [])):
element.decompose()
return False
except Exception as e:
raise Exception(f"Error processing links: {str(e)}")
try:
if element.name == 'img':
potential_sources = ['src', 'data-src', 'srcset', 'data-lazy-src', 'data-original']
src = element.get('src', '')
while not src and potential_sources:
src = element.get(potential_sources.pop(0), '')
if not src:
element.decompose()
return False
# If it is srcset pick up the first image
if 'srcset' in element.attrs:
src = element.attrs['srcset'].split(',')[0].split(' ')[0]
# Check flag if we should remove external images
if kwargs.get('exclude_external_images', False):
src_url_base = src.split('/')[2]
url_base = url.split('/')[2]
if url_base not in src_url_base:
element.decompose()
return False
if not kwargs.get('exclude_external_images', False) and kwargs.get('exclude_social_media_links', False):
src_url_base = src.split('/')[2]
url_base = url.split('/')[2]
if any(domain in src for domain in exclude_social_media_domains):
element.decompose()
return False
# Handle exclude domains
if kwargs.get('exclude_domains', []):
if any(domain in src for domain in kwargs.get('exclude_domains', [])):
element.decompose()
return False
return True # Always keep image elements
except Exception as e:
raise Exception(f"Error processing images: {str(e)}")
# Check if flag to remove all forms is set
if kwargs.get('remove_forms', False) and element.name == 'form':
element.decompose()
return False
if element.name in ['video', 'audio']:
media[f"{element.name}s"].append({
'src': element.get('src'),
'alt': element.get('alt'),
'type': element.name,
'description': find_closest_parent_with_useful_text(element)
})
source_tags = element.find_all('source')
for source_tag in source_tags:
media[f"{element.name}s"].append({
'src': source_tag.get('src'),
'alt': element.get('alt'),
'type': element.name,
'description': find_closest_parent_with_useful_text(element)
})
return True # Always keep video and audio elements
if element.name in ONLY_TEXT_ELIGIBLE_TAGS:
if kwargs.get('only_text', False):
element.replace_with(element.get_text())
try:
remove_unwanted_attributes(element, IMPORTANT_ATTRS, kwargs.get('keep_data_attributes', False))
except Exception as e:
print('Error removing unwanted attributes:', str(e))
# Process children
for child in list(element.children):
if isinstance(child, NavigableString) and not isinstance(child, Comment):
if len(child.strip()) > 0:
keep_element = True
else:
if process_element(child):
keep_element = True
# Check word count
if not keep_element:
word_count = len(element.get_text(strip=True).split())
keep_element = word_count >= word_count_threshold
if not keep_element:
element.decompose()
return keep_element
except Exception as e:
print('Error processing element:', str(e))
return False
#process images by filtering and extracting contextual text from the page
# imgs = body.find_all('img')
# media['images'] = [
# result for result in
# (process_image(img, url, i, len(imgs)) for i, img in enumerate(imgs))
# if result is not None
# ]
process_element(body)
# Update the links dictionary with unique links
links['internal'] = list(internal_links_dict.values())
links['external'] = list(external_links_dict.values())
# Process images using ThreadPoolExecutor
imgs = body.find_all('img')
with ThreadPoolExecutor() as executor:
image_results = list(executor.map(process_image, imgs, [url]*len(imgs), range(len(imgs)), [len(imgs)]*len(imgs)))
media['images'] = [result for result in image_results if result is not None]
def flatten_nested_elements(node):
if isinstance(node, NavigableString):
return node
if len(node.contents) == 1 and isinstance(node.contents[0], element.Tag) and node.contents[0].name == node.name:
return flatten_nested_elements(node.contents[0])
node.contents = [flatten_nested_elements(child) for child in node.contents]
return node
body = flatten_nested_elements(body)
base64_pattern = re.compile(r'data:image/[^;]+;base64,([^"]+)')
for img in imgs:
src = img.get('src', '')
if base64_pattern.match(src):
# Replace base64 data with empty string
img['src'] = base64_pattern.sub('', src)
try:
str(body)
except Exception as e:
# Reset body to the original HTML
success = False
body = BeautifulSoup(html, 'html.parser')
# Create a new div with a special ID
error_div = body.new_tag('div', id='crawl4ai_error_message')
error_div.string = '''
Crawl4AI Error: This page is not fully supported.
Possible reasons:
1. The page may have restrictions that prevent crawling.
2. The page might not be fully loaded.
Suggestions:
- Try calling the crawl function with these parameters:
magic=True,
- Set headless=False to visualize what's happening on the page.
If the issue persists, please check the page's structure and any potential anti-crawling measures.
'''
# Append the error div to the body
body.body.append(error_div)
print(f"[LOG] 😧 Error: After processing the crawled HTML and removing irrelevant tags, nothing was left in the page. Check the markdown for further details.")
cleaned_html = str(body).replace('\n\n', '\n').replace('  ', ' ')
try:
h = CustomHTML2Text()
h.update_params(**kwargs.get('html2text', {}))
markdown = h.handle(cleaned_html)
except Exception as e:
markdown = h.handle(sanitize_html(cleaned_html))
markdown = markdown.replace(' ```', '```')
try:
meta = extract_metadata(html, soup)
except Exception as e:
print('Error extracting metadata:', str(e))
meta = {}
cleaner = ContentCleaningStrategy()
fit_html = cleaner.clean(cleaned_html)
fit_markdown = h.handle(fit_html)
cleaned_html = sanitize_html(cleaned_html)
return {
'markdown': markdown,
'fit_markdown': fit_markdown,
'fit_html': fit_html,
'cleaned_html': cleaned_html,
'success': success,
'media': media,
'links': links,
'metadata': meta
}
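
The size checks in `score_image_for_usefulness` above can be tried in isolation. The following is a minimal sketch, not the library's code: `parse_dimension` mirrors the nested helper from the diff, while `dimension_score` is a hypothetical name introduced here to hold the width/height thresholds (more than 150 for `px`, more than 30 for relative units).

```python
import re
from typing import Optional, Tuple


def parse_dimension(dimension: Optional[str]) -> Tuple[Optional[int], Optional[str]]:
    """Split an HTML width/height attribute into (number, unit)."""
    if dimension:
        match = re.match(r"(\d+)(\D*)", dimension)
        if match:
            number = int(match.group(1))
            unit = match.group(2) or "px"  # Bare numbers are treated as pixels
            return number, unit
    return None, None


def dimension_score(value: Optional[int], unit: Optional[str]) -> int:
    """Hypothetical helper: 1 point if the dimension looks large enough, else 0."""
    if not value:
        return 0
    if unit == "px" and value > 150:
        return 1
    if unit in ("%", "vh", "vmin", "vmax") and value > 30:
        return 1
    return 0


if __name__ == "__main__":
    for raw in ("200", "50%", "100px", "auto", None):
        value, unit = parse_dimension(raw)
        print(raw, "->", (value, unit), "score:", dimension_score(value, unit))
```

Values such as `auto` do not start with a digit, so they parse to `(None, None)` and contribute nothing to the score.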

View File

@@ -283,7 +283,7 @@ class LocalSeleniumCrawlerStrategy(CrawlerStrategy):
print(f"[LOG] ✅ Crawled {url} successfully!")
return html
except InvalidArgumentException:
except InvalidArgumentException as e:
if not hasattr(e, 'msg'):
e.msg = sanitize_input_encode(str(e))
raise InvalidArgumentException(f"Failed to crawl {url}: {e.msg}")

View File

@@ -92,8 +92,10 @@ class LLMExtractionStrategy(ExtractionStrategy):
def extract(self, url: str, ix:int, html: str) -> List[Dict[str, Any]]:
# print("[LOG] Extracting blocks from URL:", url)
print(f"[LOG] Call LLM for {url} - block index: {ix}")
if self.verbose:
# print("[LOG] Extracting blocks from URL:", url)
print(f"[LOG] Call LLM for {url} - block index: {ix}")
variable_values = {
"URL": url,
"HTML": escape_json_string(sanitize_html(html)),
@@ -632,7 +634,7 @@ class ContentSummarizationStrategy(ExtractionStrategy):
# Sort summaries by the original section index to maintain order
summaries.sort(key=lambda x: x[0])
return [summary for _, summary in summaries]
class JsonCssExtractionStrategy(ExtractionStrategy):
def __init__(self, schema: Dict[str, Any], **kwargs):
super().__init__(**kwargs)
@@ -868,4 +870,4 @@ class JsonXPATHExtractionStrategy(ExtractionStrategy):
def run(self, url: str, sections: List[str], *q, **kwargs) -> List[Dict[str, Any]]:
combined_html = self.DEL.join(sections)
return self.extract(url, combined_html, **kwargs)
return self.extract(url, combined_html, **kwargs)

View File

@@ -1006,10 +1006,136 @@ class HTML2Text(html.parser.HTMLParser):
newlines += 1
return result
def html2text(html: str, baseurl: str = "", bodywidth: Optional[int] = None) -> str:
if bodywidth is None:
bodywidth = config.BODY_WIDTH
h = HTML2Text(baseurl=baseurl, bodywidth=bodywidth)
return h.handle(html)
class CustomHTML2Text(HTML2Text):
def __init__(self, *args, handle_code_in_pre=False, **kwargs):
super().__init__(*args, **kwargs)
self.inside_pre = False
self.inside_code = False
self.preserve_tags = set() # Set of tags to preserve
self.current_preserved_tag = None
self.preserved_content = []
self.preserve_depth = 0
self.handle_code_in_pre = handle_code_in_pre
# Configuration options
self.skip_internal_links = False
self.single_line_break = False
self.mark_code = False
self.include_sup_sub = False
self.body_width = 0
self.ignore_mailto_links = True
self.ignore_links = False
self.escape_backslash = False
self.escape_dot = False
self.escape_plus = False
self.escape_dash = False
self.escape_snob = False
def update_params(self, **kwargs):
"""Update parameters and set preserved tags."""
for key, value in kwargs.items():
if key == 'preserve_tags':
self.preserve_tags = set(value)
elif key == 'handle_code_in_pre':
self.handle_code_in_pre = value
else:
setattr(self, key, value)
def handle_tag(self, tag, attrs, start):
# Handle preserved tags
if tag in self.preserve_tags:
if start:
if self.preserve_depth == 0:
self.current_preserved_tag = tag
self.preserved_content = []
# Format opening tag with attributes
attr_str = ''.join(f' {k}="{v}"' for k, v in attrs.items() if v is not None)
self.preserved_content.append(f'<{tag}{attr_str}>')
self.preserve_depth += 1
return
else:
self.preserve_depth -= 1
if self.preserve_depth == 0:
self.preserved_content.append(f'</{tag}>')
# Output the preserved HTML block with proper spacing
preserved_html = ''.join(self.preserved_content)
self.o('\n' + preserved_html + '\n')
self.current_preserved_tag = None
return
# If we're inside a preserved tag, collect all content
if self.preserve_depth > 0:
if start:
# Format nested tags with attributes
attr_str = ''.join(f' {k}="{v}"' for k, v in attrs.items() if v is not None)
self.preserved_content.append(f'<{tag}{attr_str}>')
else:
self.preserved_content.append(f'</{tag}>')
return
# Handle pre tags
if tag == 'pre':
if start:
self.o('```\n') # Markdown code block start
self.inside_pre = True
else:
self.o('\n```\n') # Markdown code block end
self.inside_pre = False
elif tag == 'code':
if self.inside_pre and not self.handle_code_in_pre:
# Ignore code tags inside pre blocks if handle_code_in_pre is False
return
if start:
self.o('`') # Markdown inline code start
self.inside_code = True
else:
self.o('`') # Markdown inline code end
self.inside_code = False
else:
super().handle_tag(tag, attrs, start)
def handle_data(self, data, entity_char=False):
"""Override handle_data to capture content within preserved tags."""
if self.preserve_depth > 0:
self.preserved_content.append(data)
return
if self.inside_pre:
# Output the raw content for pre blocks, including content inside code tags
self.o(data) # Directly output the data as-is (preserve newlines)
return
if self.inside_code:
# Inline code: no newlines allowed
self.o(data.replace('\n', ' '))
return
# Default behavior for other tags
super().handle_data(data, entity_char)
# # Handle pre tags
# if tag == 'pre':
# if start:
# self.o('```\n')
# self.inside_pre = True
# else:
# self.o('\n```')
# self.inside_pre = False
# # elif tag in ["h1", "h2", "h3", "h4", "h5", "h6"]:
# # pass
# else:
# super().handle_tag(tag, attrs, start)
# def handle_data(self, data, entity_char=False):
# """Override handle_data to capture content within preserved tags."""
# if self.preserve_depth > 0:
# self.preserved_content.append(data)
# return
# super().handle_data(data, entity_char)
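
The `pre`/`code` handling in `CustomHTML2Text` above can be illustrated without the library. This is a simplified sketch built on the standard-library `html.parser.HTMLParser`, not on `HTML2Text`; the class name `PreToFence` and the `convert` helper are made up for the example, and it only covers the case where `handle_code_in_pre` is false.

```python
from html.parser import HTMLParser

FENCE = "`" * 3  # Markdown code fence, built this way to keep the example easy to embed


class PreToFence(HTMLParser):
    """Turn <pre> into fenced blocks and <code> outside <pre> into inline code."""

    def __init__(self):
        super().__init__()
        self.out = []
        self.inside_pre = False

    def handle_starttag(self, tag, attrs):
        if tag == "pre":
            self.out.append(FENCE + "\n")
            self.inside_pre = True
        elif tag == "code" and not self.inside_pre:
            self.out.append("`")

    def handle_endtag(self, tag):
        if tag == "pre":
            self.out.append("\n" + FENCE + "\n")
            self.inside_pre = False
        elif tag == "code" and not self.inside_pre:
            self.out.append("`")

    def handle_data(self, data):
        # Text is emitted as-is; <code> tags nested in <pre> add no backticks
        self.out.append(data)


def convert(html: str) -> str:
    parser = PreToFence()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


if __name__ == "__main__":
    print(convert("<p>Use <code>x = 1</code>:</p><pre><code>a\nb</code></pre>"))
```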

44
crawl4ai/install.py Normal file
View File

@@ -0,0 +1,44 @@
import subprocess
import sys
import asyncio
from .async_logger import AsyncLogger, LogLevel
# Initialize logger
logger = AsyncLogger(log_level=LogLevel.DEBUG, verbose=True)
def post_install():
"""Run all post-installation tasks"""
logger.info("Running post-installation setup...", tag="INIT")
install_playwright()
run_migration()
logger.success("Post-installation setup completed!", tag="COMPLETE")
def install_playwright():
logger.info("Installing Playwright browsers...", tag="INIT")
try:
subprocess.check_call([sys.executable, "-m", "playwright", "install"])
logger.success("Playwright installation completed successfully.", tag="COMPLETE")
except subprocess.CalledProcessError as e:
logger.error(f"Error during Playwright installation: {e}", tag="ERROR")
logger.warning(
"Please run 'python -m playwright install' manually after the installation."
)
except Exception as e:
logger.error(f"Unexpected error during Playwright installation: {e}", tag="ERROR")
logger.warning(
"Please run 'python -m playwright install' manually after the installation."
)
def run_migration():
"""Initialize database during installation"""
try:
logger.info("Starting database initialization...", tag="INIT")
from crawl4ai.async_database import async_db_manager
asyncio.run(async_db_manager.initialize())
logger.success("Database initialization completed successfully.", tag="COMPLETE")
except ImportError:
logger.warning("Database module not found. Will initialize on first use.")
except Exception as e:
logger.warning(f"Database initialization failed: {e}")
logger.warning("Database will be initialized on first use")

View File

@@ -0,0 +1,15 @@
import os, sys
# Load a JS script by name from the folder containing this file and return its content as a string
def load_js_script(script_name):
# Get the path of the current script
current_script_path = os.path.dirname(os.path.realpath(__file__))
# Get the path of the script to load
script_path = os.path.join(current_script_path, script_name + '.js')
# Check if the script exists
if not os.path.exists(script_path):
raise ValueError(f"Script {script_name} not found in the folder {current_script_path}")
# Load the content of the script
with open(script_path, 'r') as f:
script_content = f.read()
return script_content

View File

@@ -0,0 +1,25 @@
// Pass the Permissions Test.
const originalQuery = window.navigator.permissions.query;
window.navigator.permissions.query = (parameters) =>
parameters.name === "notifications"
? Promise.resolve({ state: Notification.permission })
: originalQuery(parameters);
Object.defineProperty(navigator, "webdriver", {
get: () => undefined,
});
window.navigator.chrome = {
runtime: {},
// Add other properties if necessary
};
Object.defineProperty(navigator, "plugins", {
get: () => [1, 2, 3, 4, 5],
});
Object.defineProperty(navigator, "languages", {
get: () => ["en-US", "en"],
});
Object.defineProperty(document, "hidden", {
get: () => false,
});
Object.defineProperty(document, "visibilityState", {
get: () => "visible",
});

View File

@@ -0,0 +1,119 @@
async () => {
// Function to check if element is visible
const isVisible = (elem) => {
const style = window.getComputedStyle(elem);
return style.display !== "none" && style.visibility !== "hidden" && style.opacity !== "0";
};
// Common selectors for popups and overlays
const commonSelectors = [
// Close buttons first
'button[class*="close" i]',
'button[class*="dismiss" i]',
'button[aria-label*="close" i]',
'button[title*="close" i]',
'a[class*="close" i]',
'span[class*="close" i]',
// Cookie notices
'[class*="cookie-banner" i]',
'[id*="cookie-banner" i]',
'[class*="cookie-consent" i]',
'[id*="cookie-consent" i]',
// Newsletter/subscription dialogs
'[class*="newsletter" i]',
'[class*="subscribe" i]',
// Generic popups/modals
'[class*="popup" i]',
'[class*="modal" i]',
'[class*="overlay" i]',
'[class*="dialog" i]',
'[role="dialog"]',
'[role="alertdialog"]',
];
// Try to click close buttons first
for (const selector of commonSelectors.slice(0, 6)) {
const closeButtons = document.querySelectorAll(selector);
for (const button of closeButtons) {
if (isVisible(button)) {
try {
button.click();
await new Promise((resolve) => setTimeout(resolve, 100));
} catch (e) {
console.log("Error clicking button:", e);
}
}
}
}
// Remove remaining overlay elements
const removeOverlays = () => {
// Find elements with high z-index
const allElements = document.querySelectorAll("*");
for (const elem of allElements) {
const style = window.getComputedStyle(elem);
const zIndex = parseInt(style.zIndex);
const position = style.position;
if (
isVisible(elem) &&
(zIndex > 999 || position === "fixed" || position === "absolute") &&
(elem.offsetWidth > window.innerWidth * 0.5 ||
elem.offsetHeight > window.innerHeight * 0.5 ||
style.backgroundColor.includes("rgba") ||
parseFloat(style.opacity) < 1)
) {
elem.remove();
}
}
// Remove elements matching common selectors
for (const selector of commonSelectors) {
const elements = document.querySelectorAll(selector);
elements.forEach((elem) => {
if (isVisible(elem)) {
elem.remove();
}
});
}
};
// Remove overlay elements
removeOverlays();
// Remove any fixed/sticky position elements at the top/bottom
const removeFixedElements = () => {
const elements = document.querySelectorAll("*");
elements.forEach((elem) => {
const style = window.getComputedStyle(elem);
if ((style.position === "fixed" || style.position === "sticky") && isVisible(elem)) {
elem.remove();
}
});
};
removeFixedElements();
// Remove empty block elements such as div, p, span, etc.
const removeEmptyBlockElements = () => {
const blockElements = document.querySelectorAll(
"div, p, span, section, article, header, footer, aside, nav, main, ul, ol, li, dl, dt, dd, h1, h2, h3, h4, h5, h6"
);
blockElements.forEach((elem) => {
if (elem.innerText.trim() === "") {
elem.remove();
}
});
};
// Remove margin-right and padding-right from body (often added by modal scripts)
document.body.style.marginRight = "0px";
document.body.style.paddingRight = "0px";
document.body.style.overflow = "auto";
// Wait a bit for any animations to complete
await new Promise((resolve) => setTimeout(resolve, 100));
};

View File

@@ -0,0 +1,54 @@
() => {
return new Promise((resolve) => {
const filterImage = (img) => {
// Filter out images that are too small
if (img.width < 100 && img.height < 100) return false;
// Filter out images that are not visible
const rect = img.getBoundingClientRect();
if (rect.width === 0 || rect.height === 0) return false;
// Filter out images with certain class names (e.g., icons, thumbnails)
if (img.classList.contains("icon") || img.classList.contains("thumbnail")) return false;
// Filter out images with certain patterns in their src (e.g., placeholder images)
if (img.src.includes("placeholder") || img.src.includes("icon")) return false;
return true;
};
const images = Array.from(document.querySelectorAll("img")).filter(filterImage);
let imagesLeft = images.length;
if (imagesLeft === 0) {
resolve();
return;
}
const checkImage = (img) => {
if (img.complete && img.naturalWidth !== 0) {
img.setAttribute("width", img.naturalWidth);
img.setAttribute("height", img.naturalHeight);
imagesLeft--;
if (imagesLeft === 0) resolve();
}
};
images.forEach((img) => {
checkImage(img);
if (!img.complete) {
img.onload = () => {
checkImage(img);
};
img.onerror = () => {
imagesLeft--;
if (imagesLeft === 0) resolve();
};
}
});
// The 5-second fallback timeout is disabled; resolve immediately instead of waiting for pending images
// setTimeout(() => resolve(), 5000);
resolve();
});
};

View File

@@ -0,0 +1,131 @@
from abc import ABC, abstractmethod
from typing import Optional, Dict, Any, Tuple
from .models import MarkdownGenerationResult
from .html2text import CustomHTML2Text
from .content_filter_strategy import RelevantContentFilter, BM25ContentFilter
import re
from urllib.parse import urljoin
# Pre-compile the regex pattern
LINK_PATTERN = re.compile(r'!?\[([^\]]+)\]\(([^)]+?)(?:\s+"([^"]*)")?\)')
def fast_urljoin(base: str, url: str) -> str:
"""Fast URL joining for common cases."""
if url.startswith(('http://', 'https://', 'mailto:', '//')):
return url
# Root-relative paths ('/x') fall through to urljoin, which resolves them
# against the site root rather than appending them to the full base path
return urljoin(base, url)
class MarkdownGenerationStrategy(ABC):
"""Abstract base class for markdown generation strategies."""
def __init__(self, content_filter: Optional[RelevantContentFilter] = None, options: Optional[Dict[str, Any]] = None):
self.content_filter = content_filter
self.options = options or {}
@abstractmethod
def generate_markdown(self,
cleaned_html: str,
base_url: str = "",
html2text_options: Optional[Dict[str, Any]] = None,
content_filter: Optional[RelevantContentFilter] = None,
citations: bool = True,
**kwargs) -> MarkdownGenerationResult:
"""Generate markdown from cleaned HTML."""
pass
class DefaultMarkdownGenerator(MarkdownGenerationStrategy):
"""Default implementation of markdown generation strategy."""
def __init__(self, content_filter: Optional[RelevantContentFilter] = None, options: Optional[Dict[str, Any]] = None):
super().__init__(content_filter, options)
def convert_links_to_citations(self, markdown: str, base_url: str = "") -> Tuple[str, str]:
link_map = {}
url_cache = {} # Cache for URL joins
parts = []
last_end = 0
counter = 1
for match in LINK_PATTERN.finditer(markdown):
parts.append(markdown[last_end:match.start()])
text, url, title = match.groups()
# Use cached URL if available, otherwise compute and cache
if base_url and not url.startswith(('http://', 'https://', 'mailto:')):
if url not in url_cache:
url_cache[url] = fast_urljoin(base_url, url)
url = url_cache[url]
if url not in link_map:
desc = []
if title: desc.append(title)
if text and text != title: desc.append(text)
link_map[url] = (counter, ": " + " - ".join(desc) if desc else "")
counter += 1
num = link_map[url][0]
parts.append(f"{text}⟨{num}⟩" if not match.group(0).startswith('!') else f"![{text}⟨{num}⟩]")
last_end = match.end()
parts.append(markdown[last_end:])
converted_text = ''.join(parts)
# Pre-build reference strings
references = ["\n\n## References\n\n"]
references.extend(
f"⟨{num}⟩ {url}{desc}\n"
for url, (num, desc) in sorted(link_map.items(), key=lambda x: x[1][0])
)
return converted_text, ''.join(references)
def generate_markdown(self,
cleaned_html: str,
base_url: str = "",
html2text_options: Optional[Dict[str, Any]] = None,
options: Optional[Dict[str, Any]] = None,
content_filter: Optional[RelevantContentFilter] = None,
citations: bool = True,
**kwargs) -> MarkdownGenerationResult:
"""Generate markdown with citations from cleaned HTML."""
# Initialize HTML2Text with options
h = CustomHTML2Text()
if html2text_options:
h.update_params(**html2text_options)
elif options:
h.update_params(**options)
elif self.options:
h.update_params(**self.options)
# Generate raw markdown
raw_markdown = h.handle(cleaned_html)
raw_markdown = raw_markdown.replace(' ```', '```')
# Convert links to citations
markdown_with_citations: str = ""
references_markdown: str = ""
if citations:
markdown_with_citations, references_markdown = self.convert_links_to_citations(
raw_markdown, base_url
)
# Generate fit markdown if content filter is provided
fit_markdown: Optional[str] = ""
filtered_html: Optional[str] = ""
if content_filter or self.content_filter:
content_filter = content_filter or self.content_filter
filtered_html = content_filter.filter_content(cleaned_html)
filtered_html = '\n'.join('<div>{}</div>'.format(s) for s in filtered_html)
fit_markdown = h.handle(filtered_html)
return MarkdownGenerationResult(
raw_markdown=raw_markdown,
markdown_with_citations=markdown_with_citations,
references_markdown=references_markdown,
fit_markdown=fit_markdown,
fit_html=filtered_html,
)
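
The link-to-citation pass in `convert_links_to_citations` above can be sketched on its own. This is a reduced version under stated assumptions: it reuses the same `LINK_PATTERN` regex, but `links_to_citations` is a hypothetical name, the `[n]` marker style is arbitrary, and URL joining, titles, and the image case are left out.

```python
import re
from typing import Dict, List, Tuple

LINK_PATTERN = re.compile(r'!?\[([^\]]+)\]\(([^)]+?)(?:\s+"([^"]*)")?\)')


def links_to_citations(markdown: str) -> Tuple[str, str]:
    """Replace inline links with numbered markers and build a reference list."""
    numbers: Dict[str, int] = {}  # url -> citation number, in first-seen order
    parts: List[str] = []
    last_end = 0
    for match in LINK_PATTERN.finditer(markdown):
        parts.append(markdown[last_end:match.start()])
        text, url, _title = match.groups()
        # A repeated URL reuses the number it was given the first time
        num = numbers.setdefault(url, len(numbers) + 1)
        parts.append(f"{text}[{num}]")
        last_end = match.end()
    parts.append(markdown[last_end:])
    references = "".join(f"[{num}] {url}\n" for url, num in numbers.items())
    return "".join(parts), references


if __name__ == "__main__":
    body, refs = links_to_citations(
        "See [docs](https://a.example/x) and [home](https://a.example/)."
    )
    print(body)
    print(refs)
```

Collecting the pieces in a list and joining once avoids repeated string concatenation on long documents, which matches the approach taken in the diff.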

168
crawl4ai/migrations.py Normal file
View File

@@ -0,0 +1,168 @@
import os
import asyncio
import logging
from pathlib import Path
import aiosqlite
from typing import Optional
import xxhash
import aiofiles
import shutil
import time
from datetime import datetime
from .async_logger import AsyncLogger, LogLevel
# Initialize logger
logger = AsyncLogger(log_level=LogLevel.DEBUG, verbose=True)
# logging.basicConfig(level=logging.INFO)
# logger = logging.getLogger(__name__)
class DatabaseMigration:
def __init__(self, db_path: str):
self.db_path = db_path
self.content_paths = self._ensure_content_dirs(os.path.dirname(db_path))
def _ensure_content_dirs(self, base_path: str) -> dict:
dirs = {
'html': 'html_content',
'cleaned': 'cleaned_html',
'markdown': 'markdown_content',
'extracted': 'extracted_content',
'screenshots': 'screenshots'
}
content_paths = {}
for key, dirname in dirs.items():
path = os.path.join(base_path, dirname)
os.makedirs(path, exist_ok=True)
content_paths[key] = path
return content_paths
def _generate_content_hash(self, content: str) -> str:
x = xxhash.xxh64()
x.update(content.encode())
content_hash = x.hexdigest()
return content_hash
# return hashlib.sha256(content.encode()).hexdigest()
async def _store_content(self, content: str, content_type: str) -> str:
if not content:
return ""
content_hash = self._generate_content_hash(content)
file_path = os.path.join(self.content_paths[content_type], content_hash)
if not os.path.exists(file_path):
async with aiofiles.open(file_path, 'w', encoding='utf-8') as f:
await f.write(content)
return content_hash
async def migrate_database(self):
"""Migrate existing database to file-based storage"""
# logger.info("Starting database migration...")
logger.info("Starting database migration...", tag="INIT")
try:
async with aiosqlite.connect(self.db_path) as db:
# Get all rows
async with db.execute(
'''SELECT url, html, cleaned_html, markdown,
extracted_content, screenshot FROM crawled_data'''
) as cursor:
rows = await cursor.fetchall()
migrated_count = 0
for row in rows:
url, html, cleaned_html, markdown, extracted_content, screenshot = row
# Store content in files and get hashes
html_hash = await self._store_content(html, 'html')
cleaned_hash = await self._store_content(cleaned_html, 'cleaned')
markdown_hash = await self._store_content(markdown, 'markdown')
extracted_hash = await self._store_content(extracted_content, 'extracted')
screenshot_hash = await self._store_content(screenshot, 'screenshots')
# Update database with hashes
await db.execute('''
UPDATE crawled_data
SET html = ?,
cleaned_html = ?,
markdown = ?,
extracted_content = ?,
screenshot = ?
WHERE url = ?
''', (html_hash, cleaned_hash, markdown_hash,
extracted_hash, screenshot_hash, url))
migrated_count += 1
if migrated_count % 100 == 0:
logger.info(f"Migrated {migrated_count} records...", tag="INIT")
await db.commit()
logger.success(f"Migration completed. {migrated_count} records processed.", tag="COMPLETE")
except Exception as e:
# logger.error(f"Migration failed: {e}")
logger.error(
message="Migration failed: {error}",
tag="ERROR",
params={"error": str(e)}
)
raise e
async def backup_database(db_path: str) -> Optional[str]:
"""Create backup of existing database"""
if not os.path.exists(db_path):
logger.info("No existing database found. Skipping backup.", tag="INIT")
return None
# Create backup with timestamp
timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
backup_path = f"{db_path}.backup_{timestamp}"
try:
# Wait for any potential write operations to finish
await asyncio.sleep(1)
# Create backup
shutil.copy2(db_path, backup_path)
logger.info(f"Database backup created at: {backup_path}", tag="COMPLETE")
return backup_path
except Exception as e:
# logger.error(f"Backup failed: {e}")
logger.error(
message="Backup failed: {error}",
tag="ERROR",
params={"error": str(e)}
)
raise e
async def run_migration(db_path: Optional[str] = None):
"""Run database migration"""
if db_path is None:
db_path = os.path.join(Path.home(), ".crawl4ai", "crawl4ai.db")
if not os.path.exists(db_path):
logger.info("No existing database found. Skipping migration.", tag="INIT")
return
# Create backup first
backup_path = await backup_database(db_path)
if not backup_path:
return
migration = DatabaseMigration(db_path)
await migration.migrate_database()
def main():
"""CLI entry point for migration"""
import argparse
parser = argparse.ArgumentParser(description='Migrate Crawl4AI database to file-based storage')
parser.add_argument('--db-path', help='Custom database path')
args = parser.parse_args()
asyncio.run(run_migration(args.db_path))
if __name__ == "__main__":
main()
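
The migration above moves each column's content into a file named after its hash and keeps only the hash in the database. A minimal synchronous sketch of that idea follows. It is an illustration, not the module's code: it swaps in `hashlib.sha256` from the standard library for `xxhash`, uses plain file I/O instead of `aiofiles`, and `load_content` is a hypothetical helper added for the round trip.

```python
import hashlib
import os
import tempfile


def store_content(content: str, directory: str) -> str:
    """Write content to a file named after its hash and return the hash."""
    if not content:
        return ""
    # Assumption: sha256 stands in for the xxhash digest used in the diff
    content_hash = hashlib.sha256(content.encode("utf-8")).hexdigest()
    file_path = os.path.join(directory, content_hash)
    if not os.path.exists(file_path):  # Identical content is stored only once
        with open(file_path, "w", encoding="utf-8") as f:
            f.write(content)
    return content_hash


def load_content(content_hash: str, directory: str) -> str:
    """Hypothetical helper: read back the content stored under a hash."""
    if not content_hash:
        return ""
    with open(os.path.join(directory, content_hash), "r", encoding="utf-8") as f:
        return f.read()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as directory:
        first = store_content("<html>hi</html>", directory)
        second = store_content("<html>hi</html>", directory)
        print(first == second, os.listdir(directory))
        print(load_content(first, directory))
```

Because the file name is derived from the content, rows that share the same HTML or markdown end up pointing at a single file.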

View File

@@ -1,10 +1,19 @@
from pydantic import BaseModel, HttpUrl
from typing import List, Dict, Optional
from typing import List, Dict, Optional, Callable, Awaitable, Union
class UrlModel(BaseModel):
url: HttpUrl
forced: bool = False
class MarkdownGenerationResult(BaseModel):
raw_markdown: str
markdown_with_citations: str
references_markdown: str
fit_markdown: Optional[str] = None
fit_html: Optional[str] = None
class CrawlResult(BaseModel):
url: str
html: str
@@ -12,8 +21,11 @@ class CrawlResult(BaseModel):
cleaned_html: Optional[str] = None
media: Dict[str, List[Dict]] = {}
links: Dict[str, List[Dict]] = {}
downloaded_files: Optional[List[str]] = None
screenshot: Optional[str] = None
markdown: Optional[str] = None
pdf : Optional[bytes] = None
markdown: Optional[Union[str, MarkdownGenerationResult]] = None
markdown_v2: Optional[MarkdownGenerationResult] = None
fit_markdown: Optional[str] = None
fit_html: Optional[str] = None
extracted_content: Optional[str] = None
@@ -21,4 +33,18 @@ class CrawlResult(BaseModel):
error_message: Optional[str] = None
session_id: Optional[str] = None
response_headers: Optional[dict] = None
status_code: Optional[int] = None
status_code: Optional[int] = None
class AsyncCrawlResponse(BaseModel):
html: str
response_headers: Dict[str, str]
status_code: int
screenshot: Optional[str] = None
pdf_data: Optional[bytes] = None
get_delayed_content: Optional[Callable[[Optional[float]], Awaitable[str]]] = None
downloaded_files: Optional[List[str]] = None
class Config:
arbitrary_types_allowed = True

View File

@@ -0,0 +1,263 @@
import random
from typing import Optional, Literal, List, Dict, Tuple
import re
class UserAgentGenerator:
def __init__(self):
# Desktop platform tokens, keyed by OS family
self.desktop_platforms = {
"windows": {
"10_64": "(Windows NT 10.0; Win64; x64)",
"10_32": "(Windows NT 10.0; WOW64)",
},
"macos": {
"intel": "(Macintosh; Intel Mac OS X 10_15_7)",
"newer": "(Macintosh; Intel Mac OS X 10.15; rv:109.0)",
},
"linux": {
"generic": "(X11; Linux x86_64)",
"ubuntu": "(X11; Ubuntu; Linux x86_64)",
"chrome_os": "(X11; CrOS x86_64 14541.0.0)",
}
}
self.mobile_platforms = {
"android": {
"samsung": "(Linux; Android 13; SM-S901B)",
"pixel": "(Linux; Android 12; Pixel 6)",
"oneplus": "(Linux; Android 13; OnePlus 9 Pro)",
"xiaomi": "(Linux; Android 12; M2102J20SG)",
},
"ios": {
"iphone": "(iPhone; CPU iPhone OS 16_5 like Mac OS X)",
"ipad": "(iPad; CPU OS 16_5 like Mac OS X)",
}
}
# Browser Combinations
self.browser_combinations = {
1: [
["chrome"],
["firefox"],
["safari"],
["edge"]
],
2: [
["gecko", "firefox"],
["chrome", "safari"],
["webkit", "safari"]
],
3: [
["chrome", "safari", "edge"],
["webkit", "chrome", "safari"]
]
}
# Rendering Engines with versions
self.rendering_engines = {
"chrome_webkit": "AppleWebKit/537.36",
"safari_webkit": "AppleWebKit/605.1.15",
"gecko": [ # Added Gecko versions
"Gecko/20100101",
"Gecko/20100101", # Firefox usually uses this constant version
"Gecko/20100101",
]
}
# Browser Versions
self.chrome_versions = [
"Chrome/119.0.6045.199",
"Chrome/118.0.5993.117",
"Chrome/117.0.5938.149",
"Chrome/116.0.5845.187",
"Chrome/115.0.5790.171",
]
self.edge_versions = [
"Edg/119.0.2151.97",
"Edg/118.0.2088.76",
"Edg/117.0.2045.47",
"Edg/116.0.1938.81",
"Edg/115.0.1901.203",
]
self.safari_versions = [
"Safari/537.36", # For Chrome-based
"Safari/605.1.15",
"Safari/604.1",
"Safari/602.1",
"Safari/601.5.17",
]
# Added Firefox versions
self.firefox_versions = [
"Firefox/119.0",
"Firefox/118.0.2",
"Firefox/117.0.1",
"Firefox/116.0",
"Firefox/115.0.3",
"Firefox/114.0.2",
"Firefox/113.0.1",
"Firefox/112.0",
"Firefox/111.0.1",
"Firefox/110.0",
]
def get_browser_stack(self, num_browsers: int = 1) -> List[str]:
"""Get a valid combination of browser versions"""
if num_browsers not in self.browser_combinations:
raise ValueError(f"Unsupported number of browsers: {num_browsers}")
combination = random.choice(self.browser_combinations[num_browsers])
browser_stack = []
for browser in combination:
if browser == "chrome":
browser_stack.append(random.choice(self.chrome_versions))
elif browser == "firefox":
browser_stack.append(random.choice(self.firefox_versions))
elif browser == "safari":
browser_stack.append(random.choice(self.safari_versions))
elif browser == "edge":
browser_stack.append(random.choice(self.edge_versions))
elif browser == "gecko":
browser_stack.append(random.choice(self.rendering_engines["gecko"]))
elif browser == "webkit":
browser_stack.append(self.rendering_engines["chrome_webkit"])
return browser_stack
def generate(self,
device_type: Optional[Literal['desktop', 'mobile']] = None,
os_type: Optional[str] = None,
device_brand: Optional[str] = None,
browser_type: Optional[Literal['chrome', 'edge', 'safari', 'firefox']] = None,
num_browsers: int = 3) -> str:
"""
Generate a random user agent with specified constraints.
Args:
device_type: 'desktop' or 'mobile'
os_type: 'windows', 'macos', 'linux', 'android', 'ios'
device_brand: Specific device brand
browser_type: 'chrome', 'edge', 'safari', or 'firefox'
num_browsers: Number of browser specifications (1-3)
"""
# Get platform string
platform = self.get_random_platform(device_type, os_type, device_brand)
# Start with Mozilla
components = ["Mozilla/5.0", platform]
# Add browser stack
browser_stack = self.get_browser_stack(num_browsers)
# Add appropriate legacy token based on browser stack
if "Firefox" in str(browser_stack):
components.append(random.choice(self.rendering_engines["gecko"]))
elif "Chrome" in str(browser_stack) or "Safari" in str(browser_stack):
components.append(self.rendering_engines["chrome_webkit"])
components.append("(KHTML, like Gecko)")
# Add browser versions
components.extend(browser_stack)
return " ".join(components)
def generate_with_client_hints(self, **kwargs) -> Tuple[str, str]:
"""Generate both user agent and matching client hints"""
user_agent = self.generate(**kwargs)
client_hints = self.generate_client_hints(user_agent)
return user_agent, client_hints
def get_random_platform(self, device_type, os_type, device_brand):
"""Helper method to get random platform based on constraints"""
platforms = self.desktop_platforms if device_type == 'desktop' else \
self.mobile_platforms if device_type == 'mobile' else \
{**self.desktop_platforms, **self.mobile_platforms}
if os_type:
for platform_group in [self.desktop_platforms, self.mobile_platforms]:
if os_type in platform_group:
platforms = {os_type: platform_group[os_type]}
break
os_key = random.choice(list(platforms.keys()))
if device_brand and device_brand in platforms[os_key]:
return platforms[os_key][device_brand]
return random.choice(list(platforms[os_key].values()))
def parse_user_agent(self, user_agent: str) -> Dict[str, str]:
"""Parse a user agent string to extract browser and version information"""
browsers = {
'chrome': r'Chrome/(\d+)',
'edge': r'Edg/(\d+)',
'safari': r'Version/(\d+)',
'firefox': r'Firefox/(\d+)'
}
result = {}
for browser, pattern in browsers.items():
match = re.search(pattern, user_agent)
if match:
result[browser] = match.group(1)
return result
def generate_client_hints(self, user_agent: str) -> str:
"""Generate Sec-CH-UA header value based on user agent string"""
browsers = self.parse_user_agent(user_agent)
# Client hints components
hints = []
# Handle different browser combinations
if 'chrome' in browsers:
hints.append(f'"Chromium";v="{browsers["chrome"]}"')
hints.append('"Not_A Brand";v="8"')
if 'edge' in browsers:
hints.append(f'"Microsoft Edge";v="{browsers["edge"]}"')
else:
hints.append(f'"Google Chrome";v="{browsers["chrome"]}"')
elif 'firefox' in browsers:
# Firefox doesn't typically send Sec-CH-UA
return '""'
elif 'safari' in browsers:
# Safari's format for client hints
hints.append(f'"Safari";v="{browsers["safari"]}"')
hints.append('"Not_A Brand";v="8"')
return ', '.join(hints)
# Example usage:
if __name__ == "__main__":
generator = UserAgentGenerator()
print(generator.generate())
print("\nSingle browser (Chrome):")
print(generator.generate(num_browsers=1, browser_type='chrome'))
print("\nTwo browsers (Gecko/Firefox):")
print(generator.generate(num_browsers=2))
print("\nThree browsers (Chrome/Safari/Edge):")
print(generator.generate(num_browsers=3))
print("\nFirefox on Linux:")
print(generator.generate(
device_type='desktop',
os_type='linux',
browser_type='firefox',
num_browsers=2
))
print("\nChrome/Safari/Edge on Windows:")
print(generator.generate(
device_type='desktop',
os_type='windows',
num_browsers=3
))
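
The `parse_user_agent` and `generate_client_hints` methods above can be tried in isolation. The following is a standalone sketch of the same regex-and-branching approach, not the class itself; `parse_major_versions` and `client_hints` are hypothetical helper names. Note that the Safari pattern looks for a `Version/` token, which the generator above never emits, so the Safari branch would only fire for user agents that come from elsewhere.

```python
import re
from typing import Dict

# Same patterns as parse_user_agent in the diff above.
BROWSER_PATTERNS = {
    "chrome": r"Chrome/(\d+)",
    "edge": r"Edg/(\d+)",
    "safari": r"Version/(\d+)",
    "firefox": r"Firefox/(\d+)",
}


def parse_major_versions(user_agent: str) -> Dict[str, str]:
    """Map each browser token found in the string to its major version."""
    found = {}
    for name, pattern in BROWSER_PATTERNS.items():
        match = re.search(pattern, user_agent)
        if match:
            found[name] = match.group(1)
    return found


def client_hints(user_agent: str) -> str:
    """Build a Sec-CH-UA style value following the branching used above."""
    browsers = parse_major_versions(user_agent)
    if "chrome" in browsers:
        # Edge is Chromium-based, so it reports Chromium plus its own brand.
        brand = "Microsoft Edge" if "edge" in browsers else "Google Chrome"
        version = browsers.get("edge", browsers["chrome"])
        return ", ".join([
            f'"Chromium";v="{browsers["chrome"]}"',
            '"Not_A Brand";v="8"',
            f'"{brand}";v="{version}"',
        ])
    if "firefox" in browsers:
        # Mirrors the source: an empty quoted value for Firefox.
        return '""'
    if "safari" in browsers:
        return f'"Safari";v="{browsers["safari"]}", "Not_A Brand";v="8"'
    return ""


ua = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
      "(KHTML, like Gecko) Chrome/119.0.6045.199 Safari/537.36 Edg/119.0.2151.97")
print(client_hints(ua))
```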


@@ -14,10 +14,75 @@ from typing import Dict, Any
from urllib.parse import urljoin
import requests
from requests.exceptions import InvalidSchema
import hashlib
from typing import Optional, Tuple, Dict, Any
import xxhash
from colorama import Fore, Style, init
import textwrap
import cProfile
import pstats
from functools import wraps
class InvalidCSSSelectorError(Exception):
pass
def create_box_message(
message: str,
type: str = "info",
width: int = 120,
add_newlines: bool = True,
double_line: bool = False
) -> str:
init()
# Define border and text colors for different types
styles = {
"warning": (Fore.YELLOW, Fore.LIGHTYELLOW_EX, ""),
"info": (Fore.BLUE, Fore.LIGHTBLUE_EX, ""),
"success": (Fore.GREEN, Fore.LIGHTGREEN_EX, ""),
"error": (Fore.RED, Fore.LIGHTRED_EX, "×"),
}
border_color, text_color, prefix = styles.get(type.lower(), styles["info"])
# Define box characters based on line style
box_chars = {
"single": ("─", "│", "┌", "┐", "└", "┘"),
"double": ("═", "║", "╔", "╗", "╚", "╝")
}
line_style = "double" if double_line else "single"
h_line, v_line, tl, tr, bl, br = box_chars[line_style]
# Process lines with lighter text color
formatted_lines = []
raw_lines = message.split('\n')
if raw_lines:
first_line = f"{prefix} {raw_lines[0].strip()}"
wrapped_first = textwrap.fill(first_line, width=width-4)
formatted_lines.extend(wrapped_first.split('\n'))
for line in raw_lines[1:]:
if line.strip():
wrapped = textwrap.fill(f" {line.strip()}", width=width-4)
formatted_lines.extend(wrapped.split('\n'))
else:
formatted_lines.append("")
# Create the box with colored borders and lighter text
horizontal_line = h_line * (width - 1)
box = [
f"{border_color}{tl}{horizontal_line}{tr}",
*[f"{border_color}{v_line}{text_color} {line:<{width-2}}{border_color}{v_line}" for line in formatted_lines],
f"{border_color}{bl}{horizontal_line}{br}{Style.RESET_ALL}"
]
result = "\n".join(box)
if add_newlines:
result = f"\n{result}\n"
return result
def calculate_semaphore_count():
cpu_count = os.cpu_count()
memory_gb = get_system_memory() / (1024 ** 3) # Convert to GB
@@ -142,12 +207,17 @@ def sanitize_html(html):
def sanitize_input_encode(text: str) -> str:
"""Sanitize input to handle potential encoding issues."""
try:
# Attempt to encode and decode as UTF-8 to handle potential encoding issues
return text.encode('utf-8', errors='ignore').decode('utf-8')
except UnicodeEncodeError as e:
print(f"Warning: Encoding issue detected. Some characters may be lost. Error: {e}")
# Fall back to ASCII if UTF-8 fails
return text.encode('ascii', errors='ignore').decode('ascii')
try:
if not text:
return ''
# Attempt to encode and decode as UTF-8 to handle potential encoding issues
return text.encode('utf-8', errors='ignore').decode('utf-8')
except UnicodeEncodeError as e:
print(f"Warning: Encoding issue detected. Some characters may be lost. Error: {e}")
# Fall back to ASCII if UTF-8 fails
return text.encode('ascii', errors='ignore').decode('ascii')
except Exception as e:
raise ValueError(f"Error sanitizing input: {str(e)}") from e
def escape_json_string(s):
"""
@@ -178,50 +248,6 @@ def escape_json_string(s):
return s
class CustomHTML2Text_v0(HTML2Text):
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
self.inside_pre = False
self.inside_code = False
self.skip_internal_links = False
self.single_line_break = False
self.mark_code = False
self.include_sup_sub = False
self.body_width = 0
self.ignore_mailto_links = True
self.ignore_links = False
self.escape_backslash = False
self.escape_dot = False
self.escape_plus = False
self.escape_dash = False
self.escape_snob = False
def handle_tag(self, tag, attrs, start):
if tag == 'pre':
if start:
self.o('```\n')
self.inside_pre = True
else:
self.o('\n```')
self.inside_pre = False
elif tag in ["h1", "h2", "h3", "h4", "h5", "h6"]:
pass
# elif tag == 'code' and not self.inside_pre:
# if start:
# if not self.inside_pre:
# self.o('`')
# self.inside_code = True
# else:
# if not self.inside_pre:
# self.o('`')
# self.inside_code = False
super().handle_tag(tag, attrs, start)
def replace_inline_tags(soup, tags, only_text=False):
tag_replacements = {
'b': lambda tag: f"**{tag.text}**",
@@ -736,44 +762,51 @@ def get_content_of_website_optimized(url: str, html: str, word_count_threshold:
'metadata': meta
}
def extract_metadata(html, soup = None):
def extract_metadata(html, soup=None):
metadata = {}
if not html:
if not html and not soup:
return {}
if not soup:
soup = BeautifulSoup(html, 'lxml')
head = soup.head
if not head:
return metadata
# Parse HTML content with BeautifulSoup
if not soup:
soup = BeautifulSoup(html, 'html.parser')
# Title
title_tag = soup.find('title')
metadata['title'] = title_tag.string if title_tag else None
title_tag = head.find('title')
metadata['title'] = title_tag.string.strip() if title_tag and title_tag.string else None
# Meta description
description_tag = soup.find('meta', attrs={'name': 'description'})
metadata['description'] = description_tag['content'] if description_tag else None
description_tag = head.find('meta', attrs={'name': 'description'})
metadata['description'] = description_tag.get('content', '').strip() if description_tag else None
# Meta keywords
keywords_tag = soup.find('meta', attrs={'name': 'keywords'})
metadata['keywords'] = keywords_tag['content'] if keywords_tag else None
keywords_tag = head.find('meta', attrs={'name': 'keywords'})
metadata['keywords'] = keywords_tag.get('content', '').strip() if keywords_tag else None
# Meta author
author_tag = soup.find('meta', attrs={'name': 'author'})
metadata['author'] = author_tag['content'] if author_tag else None
author_tag = head.find('meta', attrs={'name': 'author'})
metadata['author'] = author_tag.get('content', '').strip() if author_tag else None
# Open Graph metadata
og_tags = soup.find_all('meta', attrs={'property': lambda value: value and value.startswith('og:')})
og_tags = head.find_all('meta', attrs={'property': re.compile(r'^og:')})
for tag in og_tags:
property_name = tag['property']
metadata[property_name] = tag['content']
property_name = tag.get('property', '').strip()
content = tag.get('content', '').strip()
if property_name and content:
metadata[property_name] = content
# Twitter Card metadata
twitter_tags = soup.find_all('meta', attrs={'name': lambda value: value and value.startswith('twitter:')})
twitter_tags = head.find_all('meta', attrs={'name': re.compile(r'^twitter:')})
for tag in twitter_tags:
property_name = tag['name']
metadata[property_name] = tag['content']
property_name = tag.get('name', '').strip()
content = tag.get('content', '').strip()
if property_name and content:
metadata[property_name] = content
return metadata
def extract_xml_tags(string):
@@ -793,7 +826,6 @@ def extract_xml_data(tags, string):
return data
# Function to perform the completion with exponential backoff
def perform_completion_with_backoff(
provider,
prompt_with_variables,
@@ -807,7 +839,11 @@ def perform_completion_with_backoff(
max_attempts = 3
base_delay = 2 # Base delay in seconds, you can adjust this based on your needs
extra_args = {}
extra_args = {
"temperature": 0.01,
'api_key': api_token,
'base_url': base_url
}
if json_response:
extra_args["response_format"] = { "type": "json_object" }
@@ -816,14 +852,12 @@ def perform_completion_with_backoff(
for attempt in range(max_attempts):
try:
response =completion(
model=provider,
messages=[
{"role": "user", "content": prompt_with_variables}
],
temperature=0.01,
api_key=api_token,
base_url=base_url,
**extra_args
)
return response # Return the successful response
@@ -980,9 +1014,54 @@ def wrap_text(draw, text, font, max_width):
return '\n'.join(lines)
def format_html(html_string):
soup = BeautifulSoup(html_string, 'html.parser')
soup = BeautifulSoup(html_string, 'lxml')
return soup.prettify()
def fast_format_html(html_string):
"""
A fast HTML formatter that uses string operations instead of parsing.
Args:
html_string (str): The HTML string to format
Returns:
str: The formatted HTML string
"""
# Initialize variables
indent = 0
indent_str = " " # Two spaces for indentation
formatted = []
in_content = False
# Split by < and > to separate tags and content
parts = html_string.replace('>', '>\n').replace('<', '\n<').split('\n')
for part in parts:
if not part.strip():
continue
# Handle closing tags
if part.startswith('</'):
indent -= 1
formatted.append(indent_str * indent + part)
# Handle self-closing tags
elif part.startswith('<') and part.endswith('/>'):
formatted.append(indent_str * indent + part)
# Handle opening tags
elif part.startswith('<'):
formatted.append(indent_str * indent + part)
indent += 1
# Handle content between tags
else:
content = part.strip()
if content:
formatted.append(indent_str * indent + content)
return '\n'.join(formatted)
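
One limitation of the string-splitting formatter above is that void elements such as `<br>` or `<img>` are treated as opening tags, so the indentation drifts deeper after each one. The sketch below is a hypothetical variant (`format_html_lines` and `VOID_TAGS` are illustrative names, not library code) that keeps the same split-on-brackets technique but skips the indent increase for void tags and declarations. Like the original, it assumes `<` and `>` only appear as tag delimiters.

```python
import re

VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


def format_html_lines(html: str, indent_str: str = "  ") -> str:
    """Indent HTML by splitting on tag boundaries, without parsing it."""
    depth = 0
    out = []
    parts = html.replace(">", ">\n").replace("<", "\n<").split("\n")
    for part in parts:
        part = part.strip()
        if not part:
            continue
        if part.startswith("</"):
            # Closing tag: step back out first, never below zero.
            depth = max(depth - 1, 0)
            out.append(indent_str * depth + part)
        elif part.startswith("<"):
            out.append(indent_str * depth + part)
            match = re.match(r"<\s*([A-Za-z][\w-]*)", part)
            name = match.group(1).lower() if match else ""
            # Only real, non-void, non-self-closing tags open a new level.
            if name and name not in VOID_TAGS and not part.endswith("/>"):
                depth += 1
        else:
            out.append(indent_str * depth + part)
    return "\n".join(out)


print(format_html_lines("<div><p>Hi<br>there</p></div>"))
```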
def normalize_url(href, base_url):
"""Normalize URLs to ensure consistent format"""
from urllib.parse import urljoin, urlparse
@@ -1046,3 +1125,168 @@ def is_external_url(url, base_domain):
return False
return False
def clean_tokens(tokens: list[str]) -> list[str]:
# Set of tokens to remove
noise = {'ccp', 'up', '↑', '▲', '⬆️', 'a', 'an', 'at', 'by', 'in', 'of', 'on', 'to', 'the'}
STOP_WORDS = {
'a', 'an', 'and', 'are', 'as', 'at', 'be', 'by', 'for', 'from',
'has', 'he', 'in', 'is', 'it', 'its', 'of', 'on', 'that', 'the',
'to', 'was', 'were', 'will', 'with',
# Pronouns
'i', 'you', 'he', 'she', 'it', 'we', 'they',
'me', 'him', 'her', 'us', 'them',
'my', 'your', 'his', 'her', 'its', 'our', 'their',
'mine', 'yours', 'hers', 'ours', 'theirs',
'myself', 'yourself', 'himself', 'herself', 'itself', 'ourselves', 'themselves',
# Common verbs
'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being',
'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing',
# Prepositions
'about', 'above', 'across', 'after', 'against', 'along', 'among', 'around',
'at', 'before', 'behind', 'below', 'beneath', 'beside', 'between', 'beyond',
'by', 'down', 'during', 'except', 'for', 'from', 'in', 'inside', 'into',
'near', 'of', 'off', 'on', 'out', 'outside', 'over', 'past', 'through',
'to', 'toward', 'under', 'underneath', 'until', 'up', 'upon', 'with', 'within',
# Conjunctions
'and', 'but', 'or', 'nor', 'for', 'yet', 'so',
'although', 'because', 'since', 'unless',
# Articles
'a', 'an', 'the',
# Other common words
'this', 'that', 'these', 'those',
'what', 'which', 'who', 'whom', 'whose',
'when', 'where', 'why', 'how',
'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such',
'can', 'cannot', "can't", 'could', "couldn't",
'may', 'might', 'must', "mustn't",
'shall', 'should', "shouldn't",
'will', "won't", 'would', "wouldn't",
'not', "n't", 'no', 'nor', 'none'
}
# Single comprehension, more efficient than multiple passes
return [token for token in tokens
if len(token) > 2
and token not in noise
and token not in STOP_WORDS
and not token.startswith('↑')
and not token.startswith('▲')
and not token.startswith('⬆')]
def profile_and_time(func):
@wraps(func)
def wrapper(self, *args, **kwargs):
# Start timer
start_time = time.perf_counter()
# Setup profiler
profiler = cProfile.Profile()
profiler.enable()
# Run function
result = func(self, *args, **kwargs)
# Stop profiler
profiler.disable()
# Calculate elapsed time
elapsed_time = time.perf_counter() - start_time
# Print timing
print(f"[PROFILER] Scraping completed in {elapsed_time:.2f} seconds")
# Print profiling stats
stats = pstats.Stats(profiler)
stats.sort_stats('cumulative') # Sort by cumulative time
stats.print_stats(20) # Print top 20 time-consuming functions
return result
return wrapper
def generate_content_hash(content: str) -> str:
"""Generate a unique hash for content"""
return xxhash.xxh64(content.encode()).hexdigest()
# return hashlib.sha256(content.encode()).hexdigest()
def ensure_content_dirs(base_path: str) -> Dict[str, str]:
"""Create content directories if they don't exist"""
dirs = {
'html': 'html_content',
'cleaned': 'cleaned_html',
'markdown': 'markdown_content',
'extracted': 'extracted_content',
'screenshots': 'screenshots',
'screenshot': 'screenshots'
}
content_paths = {}
for key, dirname in dirs.items():
path = os.path.join(base_path, dirname)
os.makedirs(path, exist_ok=True)
content_paths[key] = path
return content_paths
def get_error_context(exc_info, context_lines: int = 5):
"""
Extract error context with more reliable line number tracking.
Args:
exc_info: The exception info from sys.exc_info()
context_lines: Number of lines to show before and after the error
Returns:
dict: Error context information
"""
import traceback
import linecache
import os
# Get the full traceback
tb = traceback.extract_tb(exc_info[2])
# Get the last frame (where the error occurred)
last_frame = tb[-1]
filename = last_frame.filename
line_no = last_frame.lineno
func_name = last_frame.name
# Get the source code context using linecache
# This is more reliable than inspect.getsourcelines
context_start = max(1, line_no - context_lines)
context_end = line_no + context_lines + 1
# Build the context lines with line numbers
context_lines = []
for i in range(context_start, context_end):
line = linecache.getline(filename, i)
if line:
# Remove any trailing whitespace/newlines and add the pointer for error line
line = line.rstrip()
pointer = '→' if i == line_no else ' '
context_lines.append(f"{i:4d} {pointer} {line}")
# Join the lines with newlines
code_context = '\n'.join(context_lines)
# Get relative path for cleaner output
try:
rel_path = os.path.relpath(filename)
except ValueError:
# Fallback if relpath fails (can happen on Windows with different drives)
rel_path = filename
return {
"filename": rel_path,
"line_no": line_no,
"function": func_name,
"code_context": code_context
}
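
To see what `get_error_context` is meant to return, here is a minimal standalone sketch of the same `traceback.extract_tb` plus `linecache` technique. `error_context` and `divide` are hypothetical names for illustration, and the sketch marks the failing line with a plain `>` instead of the glyph used in the diff.

```python
import linecache
import sys
import traceback


def error_context(exc_info, context_lines: int = 2) -> dict:
    """Collect file, line, function and nearby source for the innermost frame."""
    frame = traceback.extract_tb(exc_info[2])[-1]
    start = max(1, frame.lineno - context_lines)
    rows = []
    for number in range(start, frame.lineno + context_lines + 1):
        # linecache returns an empty string when the source is unavailable.
        line = linecache.getline(frame.filename, number)
        if line:
            marker = ">" if number == frame.lineno else " "
            rows.append(f"{number:4d} {marker} {line.rstrip()}")
    return {
        "filename": frame.filename,
        "line_no": frame.lineno,
        "function": frame.name,
        "code_context": "\n".join(rows),
    }


def divide(a, b):
    return a / b


try:
    divide(1, 0)
except ZeroDivisionError:
    info = error_context(sys.exc_info())
    print(f'{info["function"]} failed at line {info["line_no"]}')
```

When the code runs from a real file, `code_context` holds the surrounding source lines; in an interactive session it may be empty because `linecache` has nothing to read.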


@@ -0,0 +1,30 @@
# version_manager.py
import os
from pathlib import Path
from packaging import version
from . import __version__
class VersionManager:
def __init__(self):
self.home_dir = Path.home() / ".crawl4ai"
self.version_file = self.home_dir / "version.txt"
def get_installed_version(self):
"""Get the version recorded in home directory"""
if not self.version_file.exists():
return None
try:
return version.parse(self.version_file.read_text().strip())
except Exception:
return None
def update_version(self):
"""Update the version file to current library version"""
self.version_file.write_text(__version__.__version__)
def needs_update(self):
"""Check if database needs update based on version"""
installed = self.get_installed_version()
current = version.parse(__version__.__version__)
return installed is None or installed < current
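
The decision made by `needs_update` can be exercised without touching the real `~/.crawl4ai` directory. The sketch below swaps in a simple tuple comparison for `packaging.version` so it stays dependency-free; that substitution only handles plain dotted versions such as `0.4.22`, and `parse_version` and this free-standing `needs_update` are hypothetical helpers rather than the class methods above.

```python
import tempfile
from pathlib import Path
from typing import Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Parse a plain dotted version such as '0.4.22' (no pre-release tags)."""
    return tuple(int(part) for part in text.strip().split("."))


def needs_update(version_file: Path, current: str) -> bool:
    """True when no version was recorded yet or the recorded one is older."""
    if not version_file.exists():
        return True
    try:
        installed = parse_version(version_file.read_text())
    except ValueError:
        # Unreadable record: treat it like a missing one.
        return True
    return installed < parse_version(current)


with tempfile.TemporaryDirectory() as tmp:
    version_file = Path(tmp) / "version.txt"
    print(needs_update(version_file, "0.4.22"))  # no file recorded yet
    version_file.write_text("0.4.21")
    print(needs_update(version_file, "0.4.22"))  # older record on disk
    version_file.write_text("0.4.22")
    print(needs_update(version_file, "0.4.22"))  # record matches current
```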


@@ -10,6 +10,7 @@ from .extraction_strategy import *
from .crawler_strategy import *
from typing import List
from concurrent.futures import ThreadPoolExecutor
from .content_scraping_strategy import WebScrapingStrategy
from .config import *
import warnings
import json
@@ -181,7 +182,21 @@ class WebCrawler:
# Extract content from HTML
try:
t1 = time.time()
result = get_content_of_website_optimized(url, html, word_count_threshold, css_selector=css_selector, only_text=kwargs.get("only_text", False))
scrapping_strategy = WebScrapingStrategy()
extra_params = {k: v for k, v in kwargs.items() if k not in ["only_text", "image_description_min_word_threshold"]}
result = scrapping_strategy.scrap(
url,
html,
word_count_threshold=word_count_threshold,
css_selector=css_selector,
only_text=kwargs.get("only_text", False),
image_description_min_word_threshold=kwargs.get(
"image_description_min_word_threshold", IMAGE_DESCRIPTION_MIN_WORD_THRESHOLD
),
**extra_params,
)
# result = get_content_of_website_optimized(url, html, word_count_threshold, css_selector=css_selector, only_text=kwargs.get("only_text", False))
if verbose:
print(f"[LOG] 🚀 Content extracted for {url}, success: True, time taken: {time.time() - t1:.2f} seconds")

docker-compose.yml Normal file

@@ -0,0 +1,67 @@
services:
# Local build services for different platforms
crawl4ai-amd64:
build:
context: .
dockerfile: Dockerfile
args:
PYTHON_VERSION: "3.10"
INSTALL_TYPE: ${INSTALL_TYPE:-basic}
ENABLE_GPU: false
platforms:
- linux/amd64
profiles: ["local-amd64"]
extends: &base-config
file: docker-compose.yml
service: base-config
crawl4ai-arm64:
build:
context: .
dockerfile: Dockerfile
args:
PYTHON_VERSION: "3.10"
INSTALL_TYPE: ${INSTALL_TYPE:-basic}
ENABLE_GPU: false
platforms:
- linux/arm64
profiles: ["local-arm64"]
extends: *base-config
# Hub services for different platforms and versions
crawl4ai-hub-amd64:
image: unclecode/crawl4ai:${VERSION:-basic}-amd64
profiles: ["hub-amd64"]
extends: *base-config
crawl4ai-hub-arm64:
image: unclecode/crawl4ai:${VERSION:-basic}-arm64
profiles: ["hub-arm64"]
extends: *base-config
# Base configuration to be extended
base-config:
ports:
- "11235:11235"
- "8000:8000"
- "9222:9222"
- "8080:8080"
environment:
- CRAWL4AI_API_TOKEN=${CRAWL4AI_API_TOKEN:-}
- OPENAI_API_KEY=${OPENAI_API_KEY:-}
- CLAUDE_API_KEY=${CLAUDE_API_KEY:-}
volumes:
- /dev/shm:/dev/shm
deploy:
resources:
limits:
memory: 4G
reservations:
memory: 1G
restart: unless-stopped
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:11235/health"]
interval: 30s
timeout: 10s
retries: 3
start_period: 40s
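
A client script that starts one of these services usually wants to wait for the same `/health` endpoint that the healthcheck probes. The following is a rough client-side analogue, not a reimplementation of Docker's healthcheck semantics (it ignores `start_period`, for instance). `http_probe` and `wait_until_healthy` are hypothetical helpers; the URL and the retry and interval defaults are taken from the compose file above.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_probe(url: str = "http://localhost:11235/health", timeout: float = 10.0) -> bool:
    """Return True when the endpoint answers with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status.
        return False


def wait_until_healthy(probe: Callable[[], bool], retries: int = 3,
                       interval: float = 30.0,
                       sleep: Callable[[float], None] = time.sleep) -> bool:
    """Call probe up to `retries` times, pausing `interval` seconds between attempts."""
    for attempt in range(retries):
        if probe():
            return True
        if attempt < retries - 1:
            sleep(interval)
    return False
```

Calling `wait_until_healthy(http_probe)` blocks until the service answers or the attempts run out. Passing the probe and the sleep function as parameters keeps the loop testable without a running container.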


@@ -7,12 +7,16 @@ import os
from typing import Dict, Any
class Crawl4AiTester:
def __init__(self, base_url: str = "http://localhost:11235"):
def __init__(self, base_url: str = "http://localhost:11235", api_token: str = None):
self.base_url = base_url
self.api_token = api_token or os.getenv('CRAWL4AI_API_TOKEN') or "test_api_code" # Check environment variable as fallback
self.headers = {'Authorization': f'Bearer {self.api_token}'} if self.api_token else {}
def submit_and_wait(self, request_data: Dict[str, Any], timeout: int = 300) -> Dict[str, Any]:
# Submit crawl job
response = requests.post(f"{self.base_url}/crawl", json=request_data)
response = requests.post(f"{self.base_url}/crawl", json=request_data, headers=self.headers)
if response.status_code == 403:
raise Exception("API token is invalid or missing")
task_id = response.json()["task_id"]
print(f"Task ID: {task_id}")
@@ -22,7 +26,7 @@ class Crawl4AiTester:
if time.time() - start_time > timeout:
raise TimeoutError(f"Task {task_id} did not complete within {timeout} seconds")
result = requests.get(f"{self.base_url}/task/{task_id}")
result = requests.get(f"{self.base_url}/task/{task_id}", headers=self.headers)
status = result.json()
if status["status"] == "failed":
@@ -33,9 +37,30 @@ class Crawl4AiTester:
return status
time.sleep(2)
def submit_sync(self, request_data: Dict[str, Any]) -> Dict[str, Any]:
response = requests.post(f"{self.base_url}/crawl_sync", json=request_data, headers=self.headers, timeout=60)
if response.status_code == 408:
raise TimeoutError("Task did not complete within server timeout")
response.raise_for_status()
return response.json()
def crawl_direct(self, request_data: Dict[str, Any]) -> Dict[str, Any]:
"""Directly crawl without using task queue"""
response = requests.post(
f"{self.base_url}/crawl_direct",
json=request_data,
headers=self.headers
)
response.raise_for_status()
return response.json()
def test_docker_deployment(version="basic"):
tester = Crawl4AiTester()
tester = Crawl4AiTester(
base_url="http://localhost:11235" ,
# base_url="https://api.crawl4ai.com" # just for example
# api_token="test" # just for example
)
print(f"Testing Crawl4AI Docker {version} version")
# Health check with timeout and retry
@@ -53,24 +78,28 @@ def test_docker_deployment(version="basic"):
time.sleep(5)
# Test cases based on version
test_basic_crawl_direct(tester)
test_basic_crawl(tester)
test_basic_crawl(tester)
test_basic_crawl_sync(tester)
# if version in ["full", "transformer"]:
# test_cosine_extraction(tester)
if version in ["full", "transformer"]:
test_cosine_extraction(tester)
# test_js_execution(tester)
# test_css_selector(tester)
# test_structured_extraction(tester)
# test_llm_extraction(tester)
# test_llm_with_ollama(tester)
# test_screenshot(tester)
test_js_execution(tester)
test_css_selector(tester)
test_structured_extraction(tester)
test_llm_extraction(tester)
test_llm_with_ollama(tester)
test_screenshot(tester)
def test_basic_crawl(tester: Crawl4AiTester):
print("\n=== Testing Basic Crawl ===")
request = {
"urls": "https://www.nbcnews.com/business",
"priority": 10
"priority": 10,
"session_id": "test"
}
result = tester.submit_and_wait(request)
@@ -78,6 +107,34 @@ def test_basic_crawl(tester: Crawl4AiTester):
assert result["result"]["success"]
assert len(result["result"]["markdown"]) > 0
def test_basic_crawl_sync(tester: Crawl4AiTester):
print("\n=== Testing Basic Crawl (Sync) ===")
request = {
"urls": "https://www.nbcnews.com/business",
"priority": 10,
"session_id": "test"
}
result = tester.submit_sync(request)
print(f"Basic crawl result length: {len(result['result']['markdown'])}")
assert result['status'] == 'completed'
assert result['result']['success']
assert len(result['result']['markdown']) > 0
def test_basic_crawl_direct(tester: Crawl4AiTester):
print("\n=== Testing Basic Crawl (Direct) ===")
request = {
"urls": "https://www.nbcnews.com/business",
"priority": 10,
# "session_id": "test"
"cache_mode": "bypass" # or "enabled", "disabled", "read_only", "write_only"
}
result = tester.crawl_direct(request)
print(f"Basic crawl result length: {len(result['result']['markdown'])}")
assert result['result']['success']
assert len(result['result']['markdown']) > 0
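
The token handling and the polling loop in `Crawl4AiTester` can be separated from the HTTP calls, which makes them easier to test. This is a hedged sketch of that idea: `auth_headers` and `wait_for_task` are hypothetical helpers, the `"completed"` and `"failed"` status values are the ones the tests above check for, and the `error` key is an assumption about the response body.

```python
import time
from typing import Any, Callable, Dict, Optional


def auth_headers(api_token: Optional[str]) -> Dict[str, str]:
    """Bearer header when a token is configured, otherwise no extra headers."""
    return {"Authorization": f"Bearer {api_token}"} if api_token else {}


def wait_for_task(get_status: Callable[[], Dict[str, Any]], timeout: float = 300.0,
                  poll_interval: float = 2.0,
                  sleep: Callable[[float], None] = time.sleep) -> Dict[str, Any]:
    """Poll get_status until the task completes, fails, or the deadline passes."""
    deadline = time.monotonic() + timeout
    while True:
        status = get_status()
        if status["status"] == "completed":
            return status
        if status["status"] == "failed":
            # The "error" key is assumed; .get avoids a KeyError if it is absent.
            raise RuntimeError(f"Task failed: {status.get('error')}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Task did not complete within {timeout} seconds")
        sleep(poll_interval)
```

In the tester, `get_status` would wrap the `requests.get(f"{base_url}/task/{task_id}", headers=...)` call and return its JSON body.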
def test_js_execution(tester: Crawl4AiTester):
print("\n=== Testing JS Execution ===")
request = {


@@ -0,0 +1,58 @@
# Capturing Full-Page Screenshots and PDFs from Massive Webpages with Crawl4AI
When dealing with very long web pages, traditional full-page screenshots can be slow or fail entirely. For large pages (like extensive Wikipedia articles), generating a single massive screenshot often leads to delays, memory issues, or style differences.
**The New Approach:**
We've introduced a feature that handles even the biggest pages by first exporting them as a PDF, then converting that PDF into a high-quality image. This approach leverages the browser's built-in PDF rendering, making it both stable and efficient for very long content. You can also save the PDF directly for your own use, with no need for multiple passes or complex stitching logic.
**Key Benefits:**
- **Reliability:** The PDF export avoids the timeouts and memory failures that single-image screenshots run into, and it works regardless of page length.
- **Versatility:** Get both the PDF and a screenshot in one crawl, without reloading or reprocessing.
- **Performance:** Skips manual scrolling and stitching images, reducing complexity and runtime.
**Simple Example:**
```python
import asyncio
import os
import sys
from base64 import b64decode

from crawl4ai import AsyncWebCrawler, CacheMode

# Adjust paths as needed
parent_dir = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
sys.path.append(parent_dir)
__location__ = os.path.realpath(os.path.join(os.getcwd(), os.path.dirname(__file__)))


async def main():
    async with AsyncWebCrawler() as crawler:
        # Request both PDF and screenshot
        result = await crawler.arun(
            url='https://en.wikipedia.org/wiki/List_of_common_misconceptions',
            cache_mode=CacheMode.BYPASS,
            pdf=True,
            screenshot=True
        )
        if result.success:
            # Save screenshot (returned as a base64-encoded string)
            if result.screenshot:
                with open(os.path.join(__location__, "screenshot.png"), "wb") as f:
                    f.write(b64decode(result.screenshot))
            # Save PDF (CrawlResult.pdf is typed as raw bytes, so no decoding)
            if result.pdf:
                with open(os.path.join(__location__, "page.pdf"), "wb") as f:
                    f.write(result.pdf)


if __name__ == "__main__":
    asyncio.run(main())
```
**What Happens Under the Hood:**
- Crawl4AI navigates to the target page.
- If `pdf=True`, it exports the current page as a full PDF, capturing all of its content no matter the length.
- If `screenshot=True`, and a PDF is already available, it directly converts the first page of that PDF to an image for you—no repeated loading or scrolling.
- Finally, you get your PDF and/or screenshot ready to use.
**Conclusion:**
With this feature, Crawl4AI becomes even more robust and versatile for large-scale content extraction. Whether you need a PDF snapshot or a quick screenshot, you now have a reliable solution for even the most extensive webpages.
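
If you save captures from several crawls, the write-to-disk step can be pulled into a small helper. This is a minimal sketch under two assumptions: the screenshot arrives as a base64-encoded string (as decoded in the example above) and the PDF arrives as raw bytes (as `CrawlResult.pdf` is typed). `save_capture` is a hypothetical name, not part of Crawl4AI.

```python
import base64
from pathlib import Path
from typing import List, Optional


def save_capture(out_dir: Path, screenshot_b64: Optional[str] = None,
                 pdf_bytes: Optional[bytes] = None) -> List[Path]:
    """Write whichever artifacts are present and return the paths created."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    if screenshot_b64:
        path = out_dir / "screenshot.png"
        # Screenshot is assumed to be base64 text, so decode before writing.
        path.write_bytes(base64.b64decode(screenshot_b64))
        written.append(path)
    if pdf_bytes:
        path = out_dir / "page.pdf"
        # PDF is assumed to be raw bytes already.
        path.write_bytes(pdf_bytes)
        written.append(path)
    return written
```

Inside `main()`, this would be called as `save_capture(Path(__location__), result.screenshot, result.pdf)`.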


@@ -1,41 +1,40 @@
import os
import time
from crawl4ai.web_crawler import WebCrawler
from crawl4ai.chunking_strategy import *
from crawl4ai.extraction_strategy import *
from crawl4ai.crawler_strategy import *
import asyncio
from pydantic import BaseModel, Field
url = r'https://openai.com/api/pricing/'
crawler = WebCrawler()
crawler.warmup()
from pydantic import BaseModel, Field
class OpenAIModelFee(BaseModel):
model_name: str = Field(..., description="Name of the OpenAI model.")
input_fee: str = Field(..., description="Fee for input token for the OpenAI model.")
output_fee: str = Field(..., description="Fee for output token for the OpenAI model.")
result = crawler.run(
url=url,
word_count_threshold=1,
extraction_strategy= LLMExtractionStrategy(
# provider= "openai/gpt-4o", api_token = os.getenv('OPENAI_API_KEY'),
provider= "groq/llama-3.1-70b-versatile", api_token = os.getenv('GROQ_API_KEY'),
schema=OpenAIModelFee.model_json_schema(),
extraction_type="schema",
instruction="From the crawled content, extract all mentioned model names along with their "\
"fees for input and output tokens. Make sure not to miss anything in the entire content. "\
'One extracted model JSON format should look like this: '\
'{ "model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", "output_fee": "US$30.00 / 1M tokens" }'
),
bypass_cache=True,
)
from crawl4ai import AsyncWebCrawler
model_fees = json.loads(result.extracted_content)
async def main():
# Use AsyncWebCrawler
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(
url=url,
word_count_threshold=1,
extraction_strategy= LLMExtractionStrategy(
# provider= "openai/gpt-4o", api_token = os.getenv('OPENAI_API_KEY'),
provider= "groq/llama-3.1-70b-versatile", api_token = os.getenv('GROQ_API_KEY'),
schema=OpenAIModelFee.model_json_schema(),
extraction_type="schema",
instruction="From the crawled content, extract all mentioned model names along with their " \
"fees for input and output tokens. Make sure not to miss anything in the entire content. " \
'One extracted model JSON format should look like this: ' \
'{ "model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", "output_fee": "US$30.00 / 1M tokens" }'
),
print(len(model_fees))
)
print("Success:", result.success)
model_fees = json.loads(result.extracted_content)
print(len(model_fees))
with open(".data/data.json", "w", encoding="utf-8") as f:
f.write(result.extracted_content)
with open(".data/data.json", "w", encoding="utf-8") as f:
f.write(result.extracted_content)
asyncio.run(main())


@@ -0,0 +1,518 @@
import os, sys
sys.path.append(os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__)))))
# Supply the key through the environment; never commit a real key to the repository.
os.environ.setdefault('FIRECRAWL_API_KEY', "YOUR-FIRECRAWL-API-KEY")
import asyncio
import time
import json
import re
from typing import Dict, List
from bs4 import BeautifulSoup
from pydantic import BaseModel, Field
from crawl4ai import AsyncWebCrawler, CacheMode, BrowserConfig, CrawlerRunConfig
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
from crawl4ai.content_filter_strategy import BM25ContentFilter, PruningContentFilter
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy, LLMExtractionStrategy
__location__ = os.path.realpath(os.path.join(os.getcwd(), os.path.dirname(__file__)))
print("Crawl4AI: Advanced Web Crawling and Data Extraction")
print("GitHub Repository: https://github.com/unclecode/crawl4ai")
print("Twitter: @unclecode")
print("Website: https://crawl4ai.com")
# Basic Example - Simple Crawl
async def simple_crawl():
    print("\n--- Basic Usage ---")
    browser_config = BrowserConfig(headless=True)
    crawler_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url="https://www.nbcnews.com/business",
            config=crawler_config
        )
        print(result.markdown[:500])
# JavaScript Execution Example
async def simple_example_with_running_js_code():
    print("\n--- Executing JavaScript and Using CSS Selectors ---")
    browser_config = BrowserConfig(
        headless=True,
        java_script_enabled=True
    )
    crawler_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        js_code=["const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More')); loadMoreButton && loadMoreButton.click();"],
        # wait_for="() => { return Array.from(document.querySelectorAll('article.tease-card')).length > 10; }"
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url="https://www.nbcnews.com/business",
            config=crawler_config
        )
        print(result.markdown[:500])
# CSS Selector Example
async def simple_example_with_css_selector():
    print("\n--- Using CSS Selectors ---")
    browser_config = BrowserConfig(headless=True)
    crawler_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        css_selector=".wide-tease-item__description"
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url="https://www.nbcnews.com/business",
            config=crawler_config
        )
        print(result.markdown[:500])
# Proxy Example
async def use_proxy():
    print("\n--- Using a Proxy ---")
    browser_config = BrowserConfig(
        headless=True,
        proxy="http://your-proxy-url:port"
    )
    crawler_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url="https://www.nbcnews.com/business",
            config=crawler_config
        )
        if result.success:
            print(result.markdown[:500])
# Screenshot Example
async def capture_and_save_screenshot(url: str, output_path: str):
    browser_config = BrowserConfig(headless=True)
    crawler_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        screenshot=True
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url=url,
            config=crawler_config
        )

        if result.success and result.screenshot:
            import base64
            # The screenshot is returned base64-encoded; decode it before writing.
            screenshot_data = base64.b64decode(result.screenshot)
            with open(output_path, 'wb') as f:
                f.write(screenshot_data)
            print(f"Screenshot saved successfully to {output_path}")
        else:
            print("Failed to capture screenshot")
# LLM Extraction Example
class OpenAIModelFee(BaseModel):
    model_name: str = Field(..., description="Name of the OpenAI model.")
    input_fee: str = Field(..., description="Fee for input token for the OpenAI model.")
    output_fee: str = Field(..., description="Fee for output token for the OpenAI model.")


async def extract_structured_data_using_llm(provider: str, api_token: str = None, extra_headers: Dict[str, str] = None):
    print(f"\n--- Extracting Structured Data with {provider} ---")

    if api_token is None and provider != "ollama":
        print(f"API token is required for {provider}. Skipping this example.")
        return

    browser_config = BrowserConfig(headless=True)

    extra_args = {
        "temperature": 0,
        "top_p": 0.9,
        "max_tokens": 2000
    }
    if extra_headers:
        extra_args["extra_headers"] = extra_headers

    crawler_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        word_count_threshold=1,
        page_timeout=80000,
        extraction_strategy=LLMExtractionStrategy(
            provider=provider,
            api_token=api_token,
            schema=OpenAIModelFee.model_json_schema(),
            extraction_type="schema",
            instruction="""From the crawled content, extract all mentioned model names along with their fees for input and output tokens.
            Do not miss any models in the entire content.""",
            extra_args=extra_args
        )
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url="https://openai.com/api/pricing/",
            config=crawler_config
        )
        print(result.extracted_content)
# CSS Extraction Example
async def extract_structured_data_using_css_extractor():
    print("\n--- Using JsonCssExtractionStrategy for Fast Structured Output ---")
    schema = {
        "name": "KidoCode Courses",
        "baseSelector": "section.charge-methodology .w-tab-content > div",
        "fields": [
            {
                "name": "section_title",
                "selector": "h3.heading-50",
                "type": "text",
            },
            {
                "name": "section_description",
                "selector": ".charge-content",
                "type": "text",
            },
            {
                "name": "course_name",
                "selector": ".text-block-93",
                "type": "text",
            },
            {
                "name": "course_description",
                "selector": ".course-content-text",
                "type": "text",
            },
            {
                "name": "course_icon",
                "selector": ".image-92",
                "type": "attribute",
                "attribute": "src"
            }
        ]
    }

    browser_config = BrowserConfig(
        headless=True,
        java_script_enabled=True
    )

    # Click through every tab so that all tab panels are rendered before extraction.
    js_click_tabs = """
    (async () => {
        const tabs = document.querySelectorAll("section.charge-methodology .tabs-menu-3 > div");
        for(let tab of tabs) {
            tab.scrollIntoView();
            tab.click();
            await new Promise(r => setTimeout(r, 500));
        }
    })();
    """

    crawler_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        extraction_strategy=JsonCssExtractionStrategy(schema),
        js_code=[js_click_tabs]
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url="https://www.kidocode.com/degrees/technology",
            config=crawler_config
        )

        # Each extracted item corresponds to one course section on the page.
        courses = json.loads(result.extracted_content)
        print(f"Successfully extracted {len(courses)} courses")
        print(json.dumps(courses[0], indent=2))
# Dynamic Content Examples - Method 1
async def crawl_dynamic_content_pages_method_1():
    print("\n--- Advanced Multi-Page Crawling with JavaScript Execution ---")
    first_commit = ""

    async def on_execution_started(page, **kwargs):
        # Poll until the first commit title differs from the one seen on the previous page.
        nonlocal first_commit
        try:
            while True:
                await page.wait_for_selector("li.Box-sc-g0xbh4-0 h4")
                commit = await page.query_selector("li.Box-sc-g0xbh4-0 h4")
                commit = await commit.evaluate("(element) => element.textContent")
                commit = re.sub(r"\s+", "", commit)
                if commit and commit != first_commit:
                    first_commit = commit
                    break
                await asyncio.sleep(0.5)
        except Exception as e:
            print(f"Warning: New content didn't appear after JavaScript execution: {e}")

    browser_config = BrowserConfig(
        headless=False,
        java_script_enabled=True
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:
        crawler.crawler_strategy.set_hook("on_execution_started", on_execution_started)

        url = "https://github.com/microsoft/TypeScript/commits/main"
        session_id = "typescript_commits_session"
        all_commits = []

        js_next_page = """
        const button = document.querySelector('a[data-testid="pagination-next-button"]');
        if (button) button.click();
        """

        for page in range(3):
            crawler_config = CrawlerRunConfig(
                cache_mode=CacheMode.BYPASS,
                css_selector="li.Box-sc-g0xbh4-0",
                js_code=js_next_page if page > 0 else None,
                js_only=page > 0,
                session_id=session_id
            )

            result = await crawler.arun(url=url, config=crawler_config)
            assert result.success, f"Failed to crawl page {page + 1}"

            soup = BeautifulSoup(result.cleaned_html, "html.parser")
            commits = soup.select("li")
            all_commits.extend(commits)

            print(f"Page {page + 1}: Found {len(commits)} commits")

        print(f"Successfully crawled {len(all_commits)} commits across 3 pages")
# Dynamic Content Examples - Method 2
async def crawl_dynamic_content_pages_method_2():
    print("\n--- Advanced Multi-Page Crawling with JavaScript Execution ---")

    browser_config = BrowserConfig(
        headless=False,
        java_script_enabled=True
    )

    # Click "next" and wait in the page itself until the first commit title changes.
    js_next_page_and_wait = """
    (async () => {
        const getCurrentCommit = () => {
            const commits = document.querySelectorAll('li.Box-sc-g0xbh4-0 h4');
            return commits.length > 0 ? commits[0].textContent.trim() : null;
        };

        const initialCommit = getCurrentCommit();
        const button = document.querySelector('a[data-testid="pagination-next-button"]');
        if (button) button.click();

        while (true) {
            await new Promise(resolve => setTimeout(resolve, 100));
            const newCommit = getCurrentCommit();
            if (newCommit && newCommit !== initialCommit) {
                break;
            }
        }
    })();
    """

    schema = {
        "name": "Commit Extractor",
        "baseSelector": "li.Box-sc-g0xbh4-0",
        "fields": [
            {
                "name": "title",
                "selector": "h4.markdown-title",
                "type": "text",
                "transform": "strip",
            },
        ],
    }

    async with AsyncWebCrawler(config=browser_config) as crawler:
        url = "https://github.com/microsoft/TypeScript/commits/main"
        session_id = "typescript_commits_session"
        all_commits = []

        extraction_strategy = JsonCssExtractionStrategy(schema)

        for page in range(3):
            crawler_config = CrawlerRunConfig(
                cache_mode=CacheMode.BYPASS,
                css_selector="li.Box-sc-g0xbh4-0",
                extraction_strategy=extraction_strategy,
                js_code=js_next_page_and_wait if page > 0 else None,
                js_only=page > 0,
                session_id=session_id
            )

            result = await crawler.arun(url=url, config=crawler_config)
            assert result.success, f"Failed to crawl page {page + 1}"

            commits = json.loads(result.extracted_content)
            all_commits.extend(commits)
            print(f"Page {page + 1}: Found {len(commits)} commits")

        print(f"Successfully crawled {len(all_commits)} commits across 3 pages")
# Browser Comparison
async def crawl_custom_browser_type():
    print("\n--- Browser Comparison ---")

    # Firefox
    browser_config_firefox = BrowserConfig(
        browser_type="firefox",
        headless=True
    )
    start = time.time()
    async with AsyncWebCrawler(config=browser_config_firefox) as crawler:
        result = await crawler.arun(
            url="https://www.example.com",
            config=CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
        )
        print("Firefox:", time.time() - start)
        print(result.markdown[:500])

    # WebKit
    browser_config_webkit = BrowserConfig(
        browser_type="webkit",
        headless=True
    )
    start = time.time()
    async with AsyncWebCrawler(config=browser_config_webkit) as crawler:
        result = await crawler.arun(
            url="https://www.example.com",
            config=CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
        )
        print("WebKit:", time.time() - start)
        print(result.markdown[:500])

    # Chromium (default)
    browser_config_chromium = BrowserConfig(
        browser_type="chromium",
        headless=True
    )
    start = time.time()
    async with AsyncWebCrawler(config=browser_config_chromium) as crawler:
        result = await crawler.arun(
            url="https://www.example.com",
            config=CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
        )
        print("Chromium:", time.time() - start)
        print(result.markdown[:500])
# Anti-Bot and User Simulation
async def crawl_with_user_simulation():
    browser_config = BrowserConfig(
        headless=True,
        user_agent_mode="random",
        user_agent_generator_config={
            "device_type": "mobile",
            "os_type": "android"
        }
    )

    crawler_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        magic=True,
        simulate_user=True,
        override_navigator=True
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url="YOUR-URL-HERE",
            config=crawler_config
        )
        print(result.markdown)
# Speed Comparison
async def speed_comparison():
    print("\n--- Speed Comparison ---")

    # Firecrawl comparison
    from firecrawl import FirecrawlApp
    app = FirecrawlApp(api_key=os.environ['FIRECRAWL_API_KEY'])
    start = time.time()
    scrape_status = app.scrape_url(
        'https://www.nbcnews.com/business',
        params={'formats': ['markdown', 'html']}
    )
    end = time.time()
    print("Firecrawl:")
    print(f"Time taken: {end - start:.2f} seconds")
    print(f"Content length: {len(scrape_status['markdown'])} characters")
    print(f"Images found: {scrape_status['markdown'].count('cldnry.s-nbcnews.com')}")
    print()

    # Crawl4AI comparisons
    browser_config = BrowserConfig(headless=True)

    # Simple crawl
    async with AsyncWebCrawler(config=browser_config) as crawler:
        start = time.time()
        result = await crawler.arun(
            url="https://www.nbcnews.com/business",
            config=CrawlerRunConfig(
                cache_mode=CacheMode.BYPASS,
                word_count_threshold=0
            )
        )
        end = time.time()
        print("Crawl4AI (simple crawl):")
        print(f"Time taken: {end - start:.2f} seconds")
        print(f"Content length: {len(result.markdown)} characters")
        print(f"Images found: {result.markdown.count('cldnry.s-nbcnews.com')}")
        print()

        # Advanced filtering
        start = time.time()
        result = await crawler.arun(
            url="https://www.nbcnews.com/business",
            config=CrawlerRunConfig(
                cache_mode=CacheMode.BYPASS,
                word_count_threshold=0,
                markdown_generator=DefaultMarkdownGenerator(
                    content_filter=PruningContentFilter(
                        threshold=0.48,
                        threshold_type="fixed",
                        min_word_threshold=0
                    )
                )
            )
        )
        end = time.time()
        print("Crawl4AI (Markdown Plus):")
        print(f"Time taken: {end - start:.2f} seconds")
        print(f"Content length: {len(result.markdown_v2.raw_markdown)} characters")
        print(f"Fit Markdown: {len(result.markdown_v2.fit_markdown)} characters")
        print(f"Images found: {result.markdown.count('cldnry.s-nbcnews.com')}")
        print()
# Main execution
async def main():
    # Basic examples
    # await simple_crawl()
    # await simple_example_with_running_js_code()
    # await simple_example_with_css_selector()

    # Advanced examples
    # await extract_structured_data_using_css_extractor()
    await extract_structured_data_using_llm("openai/gpt-4o", os.getenv("OPENAI_API_KEY"))
    # await crawl_dynamic_content_pages_method_1()
    # await crawl_dynamic_content_pages_method_2()

    # Browser comparisons
    # await crawl_custom_browser_type()

    # Performance testing
    # await speed_comparison()

    # Screenshot example
    # await capture_and_save_screenshot(
    #     "https://www.example.com",
    #     os.path.join(__location__, "tmp/example_screenshot.jpg")
    # )


if __name__ == "__main__":
    asyncio.run(main())


@@ -13,7 +13,9 @@ import re
from typing import Dict, List
from bs4 import BeautifulSoup
from pydantic import BaseModel, Field
from crawl4ai import AsyncWebCrawler
from crawl4ai import AsyncWebCrawler, CacheMode
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
from crawl4ai.content_filter_strategy import BM25ContentFilter, PruningContentFilter
from crawl4ai.extraction_strategy import (
JsonCssExtractionStrategy,
LLMExtractionStrategy,
@@ -30,7 +32,7 @@ print("Website: https://crawl4ai.com")
async def simple_crawl():
print("\n--- Basic Usage ---")
async with AsyncWebCrawler(verbose=True) as crawler:
result = await crawler.arun(url="https://www.nbcnews.com/business")
result = await crawler.arun(url="https://www.nbcnews.com/business", cache_mode= CacheMode.BYPASS)
print(result.markdown[:500]) # Print first 500 characters
async def simple_example_with_running_js_code():
@@ -51,7 +53,7 @@ async def simple_example_with_running_js_code():
url="https://www.nbcnews.com/business",
js_code=js_code,
# wait_for=wait_for,
bypass_cache=True,
cache_mode=CacheMode.BYPASS,
)
print(result.markdown[:500]) # Print first 500 characters
@@ -61,7 +63,7 @@ async def simple_example_with_css_selector():
result = await crawler.arun(
url="https://www.nbcnews.com/business",
css_selector=".wide-tease-item__description",
bypass_cache=True,
cache_mode=CacheMode.BYPASS,
)
print(result.markdown[:500]) # Print first 500 characters
@@ -71,19 +73,20 @@ async def use_proxy():
"Note: Replace 'http://your-proxy-url:port' with a working proxy to run this example."
)
# Uncomment and modify the following lines to use a proxy
# async with AsyncWebCrawler(verbose=True, proxy="http://your-proxy-url:port") as crawler:
# result = await crawler.arun(
# url="https://www.nbcnews.com/business",
# bypass_cache=True
# )
# print(result.markdown[:500]) # Print first 500 characters
async with AsyncWebCrawler(verbose=True, proxy="http://your-proxy-url:port") as crawler:
result = await crawler.arun(
url="https://www.nbcnews.com/business",
cache_mode= CacheMode.BYPASS
)
if result.success:
print(result.markdown[:500]) # Print first 500 characters
async def capture_and_save_screenshot(url: str, output_path: str):
async with AsyncWebCrawler(verbose=True) as crawler:
result = await crawler.arun(
url=url,
screenshot=True,
bypass_cache=True
cache_mode= CacheMode.BYPASS
)
if result.success and result.screenshot:
@@ -114,7 +117,13 @@ async def extract_structured_data_using_llm(provider: str, api_token: str = None
print(f"API token is required for {provider}. Skipping this example.")
return
extra_args = {}
# extra_args = {}
extra_args={
"temperature": 0,
"top_p": 0.9,
"max_tokens": 2000,
# any other supported parameters for litellm
}
if extra_headers:
extra_args["extra_headers"] = extra_headers
@@ -125,55 +134,82 @@ async def extract_structured_data_using_llm(provider: str, api_token: str = None
extraction_strategy=LLMExtractionStrategy(
provider=provider,
api_token=api_token,
schema=OpenAIModelFee.schema(),
schema=OpenAIModelFee.model_json_schema(),
extraction_type="schema",
instruction="""From the crawled content, extract all mentioned model names along with their fees for input and output tokens.
Do not miss any models in the entire content. One extracted model JSON format should look like this:
{"model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", "output_fee": "US$30.00 / 1M tokens"}.""",
extra_args=extra_args
),
bypass_cache=True,
cache_mode=CacheMode.BYPASS,
)
print(result.extracted_content)
async def extract_structured_data_using_css_extractor():
print("\n--- Using JsonCssExtractionStrategy for Fast Structured Output ---")
schema = {
"name": "Coinbase Crypto Prices",
"baseSelector": ".cds-tableRow-t45thuk",
"fields": [
{
"name": "crypto",
"selector": "td:nth-child(1) h2",
"type": "text",
},
{
"name": "symbol",
"selector": "td:nth-child(1) p",
"type": "text",
},
{
"name": "price",
"selector": "td:nth-child(2)",
"type": "text",
"name": "KidoCode Courses",
"baseSelector": "section.charge-methodology .w-tab-content > div",
"fields": [
{
"name": "section_title",
"selector": "h3.heading-50",
"type": "text",
},
{
"name": "section_description",
"selector": ".charge-content",
"type": "text",
},
{
"name": "course_name",
"selector": ".text-block-93",
"type": "text",
},
{
"name": "course_description",
"selector": ".course-content-text",
"type": "text",
},
{
"name": "course_icon",
"selector": ".image-92",
"type": "attribute",
"attribute": "src"
}
]
}
async with AsyncWebCrawler(
headless=True,
verbose=True
) as crawler:
# Create the JavaScript that handles clicking multiple times
js_click_tabs = """
(async () => {
const tabs = document.querySelectorAll("section.charge-methodology .tabs-menu-3 > div");
for(let tab of tabs) {
// scroll to the tab
tab.scrollIntoView();
tab.click();
// Wait for content to load and animations to complete
await new Promise(r => setTimeout(r, 500));
}
],
}
})();
"""
extraction_strategy = JsonCssExtractionStrategy(schema, verbose=True)
async with AsyncWebCrawler(verbose=True) as crawler:
result = await crawler.arun(
url="https://www.coinbase.com/explore",
extraction_strategy=extraction_strategy,
bypass_cache=True,
url="https://www.kidocode.com/degrees/technology",
extraction_strategy=JsonCssExtractionStrategy(schema, verbose=True),
js_code=[js_click_tabs],
cache_mode=CacheMode.BYPASS
)
assert result.success, "Failed to crawl the page"
news_teasers = json.loads(result.extracted_content)
print(f"Successfully extracted {len(news_teasers)} news teasers")
print(json.dumps(news_teasers[0], indent=2))
companies = json.loads(result.extracted_content)
print(f"Successfully extracted {len(companies)} companies")
print(json.dumps(companies[0], indent=2))
# Advanced Session-Based Crawling with Dynamic Content 🔄
async def crawl_dynamic_content_pages_method_1():
@@ -203,8 +239,10 @@ async def crawl_dynamic_content_pages_method_1():
all_commits = []
js_next_page = """
const button = document.querySelector('a[data-testid="pagination-next-button"]');
if (button) button.click();
(() => {
const button = document.querySelector('a[data-testid="pagination-next-button"]');
if (button) button.click();
})();
"""
for page in range(3): # Crawl 3 pages
@@ -213,7 +251,7 @@ async def crawl_dynamic_content_pages_method_1():
session_id=session_id,
css_selector="li.Box-sc-g0xbh4-0",
js=js_next_page if page > 0 else None,
bypass_cache=True,
cache_mode=CacheMode.BYPASS,
js_only=page > 0,
headless=False,
)
@@ -282,7 +320,7 @@ async def crawl_dynamic_content_pages_method_2():
extraction_strategy=extraction_strategy,
js_code=js_next_page_and_wait if page > 0 else None,
js_only=page > 0,
bypass_cache=True,
cache_mode=CacheMode.BYPASS,
headless=False,
)
@@ -343,7 +381,7 @@ async def crawl_dynamic_content_pages_method_3():
js_code=js_next_page if page > 0 else None,
wait_for=wait_for if page > 0 else None,
js_only=page > 0,
bypass_cache=True,
cache_mode=CacheMode.BYPASS,
headless=False,
)
@@ -361,21 +399,21 @@ async def crawl_custom_browser_type():
# Use Firefox
start = time.time()
async with AsyncWebCrawler(browser_type="firefox", verbose=True, headless = True) as crawler:
result = await crawler.arun(url="https://www.example.com", bypass_cache=True)
result = await crawler.arun(url="https://www.example.com", cache_mode= CacheMode.BYPASS)
print(result.markdown[:500])
print("Time taken: ", time.time() - start)
# Use WebKit
start = time.time()
async with AsyncWebCrawler(browser_type="webkit", verbose=True, headless = True) as crawler:
result = await crawler.arun(url="https://www.example.com", bypass_cache=True)
result = await crawler.arun(url="https://www.example.com", cache_mode= CacheMode.BYPASS)
print(result.markdown[:500])
print("Time taken: ", time.time() - start)
# Use Chromium (default)
start = time.time()
async with AsyncWebCrawler(verbose=True, headless = True) as crawler:
result = await crawler.arun(url="https://www.example.com", bypass_cache=True)
result = await crawler.arun(url="https://www.example.com", cache_mode= CacheMode.BYPASS)
print(result.markdown[:500])
print("Time taken: ", time.time() - start)
@@ -384,7 +422,7 @@ async def crawl_with_user_simultion():
url = "YOUR-URL-HERE"
result = await crawler.arun(
url=url,
bypass_cache=True,
cache_mode=CacheMode.BYPASS,
magic = True, # Automatically detects and removes overlays, popups, and other elements that block content
# simulate_user = True,# Causes a series of random mouse movements and clicks to simulate user interaction
# override_navigator = True # Overrides the navigator object to make it look like a real user
@@ -408,7 +446,7 @@ async def speed_comparison():
params={'formats': ['markdown', 'html']}
)
end = time.time()
print("Firecrawl (simulated):")
print("Firecrawl:")
print(f"Time taken: {end - start:.2f} seconds")
print(f"Content length: {len(scrape_status['markdown'])} characters")
print(f"Images found: {scrape_status['markdown'].count('cldnry.s-nbcnews.com')}")
@@ -420,7 +458,7 @@ async def speed_comparison():
result = await crawler.arun(
url="https://www.nbcnews.com/business",
word_count_threshold=0,
bypass_cache=True,
cache_mode=CacheMode.BYPASS,
verbose=False,
)
end = time.time()
@@ -430,6 +468,26 @@ async def speed_comparison():
print(f"Images found: {result.markdown.count('cldnry.s-nbcnews.com')}")
print()
# Crawl4AI with advanced content filtering
start = time.time()
result = await crawler.arun(
url="https://www.nbcnews.com/business",
word_count_threshold=0,
markdown_generator=DefaultMarkdownGenerator(
content_filter = PruningContentFilter(threshold=0.48, threshold_type="fixed", min_word_threshold=0)
# content_filter=BM25ContentFilter(user_query=None, bm25_threshold=1.0)
),
cache_mode=CacheMode.BYPASS,
verbose=False,
)
end = time.time()
print("Crawl4AI (Markdown Plus):")
print(f"Time taken: {end - start:.2f} seconds")
print(f"Content length: {len(result.markdown_v2.raw_markdown)} characters")
print(f"Fit Markdown: {len(result.markdown_v2.fit_markdown)} characters")
print(f"Images found: {result.markdown.count('cldnry.s-nbcnews.com')}")
print()
# Crawl4AI with JavaScript execution
start = time.time()
result = await crawler.arun(
@@ -438,13 +496,18 @@ async def speed_comparison():
"const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More')); loadMoreButton && loadMoreButton.click();"
],
word_count_threshold=0,
bypass_cache=True,
cache_mode=CacheMode.BYPASS,
markdown_generator=DefaultMarkdownGenerator(
content_filter = PruningContentFilter(threshold=0.48, threshold_type="fixed", min_word_threshold=0)
# content_filter=BM25ContentFilter(user_query=None, bm25_threshold=1.0)
),
verbose=False,
)
end = time.time()
print("Crawl4AI (with JavaScript execution):")
print(f"Time taken: {end - start:.2f} seconds")
print(f"Content length: {len(result.markdown)} characters")
print(f"Fit Markdown: {len(result.markdown_v2.fit_markdown)} characters")
print(f"Images found: {result.markdown.count('cldnry.s-nbcnews.com')}")
print("\nNote on Speed Comparison:")
@@ -483,7 +546,7 @@ async def generate_knowledge_graph():
url = "https://paulgraham.com/love.html"
result = await crawler.arun(
url=url,
bypass_cache=True,
cache_mode=CacheMode.BYPASS,
extraction_strategy=extraction_strategy,
# magic=True
)
@@ -492,45 +555,80 @@ async def generate_knowledge_graph():
f.write(result.extracted_content)
async def fit_markdown_remove_overlay():
async with AsyncWebCrawler(headless = False) as crawler:
url = "https://janineintheworld.com/places-to-visit-in-central-mexico"
async with AsyncWebCrawler(
headless=True, # Set to False to see what is happening
verbose=True,
user_agent_mode="random",
user_agent_generator_config={
"device_type": "mobile",
"os_type": "android"
},
) as crawler:
result = await crawler.arun(
url=url,
bypass_cache=True,
word_count_threshold = 10,
remove_overlay_elements=True,
screenshot = True
url='https://www.kidocode.com/degrees/technology',
cache_mode=CacheMode.BYPASS,
markdown_generator=DefaultMarkdownGenerator(
content_filter=PruningContentFilter(
threshold=0.48, threshold_type="fixed", min_word_threshold=0
),
options={
"ignore_links": True
}
),
# markdown_generator=DefaultMarkdownGenerator(
# content_filter=BM25ContentFilter(user_query="", bm25_threshold=1.0),
# options={
# "ignore_links": True
# }
# ),
)
# Save markdown to file
with open(os.path.join(__location__, "mexico_places.md"), "w") as f:
f.write(result.fit_markdown)
if result.success:
print(len(result.markdown_v2.raw_markdown))
print(len(result.markdown_v2.markdown_with_citations))
print(len(result.markdown_v2.fit_markdown))
# Save clean html
with open(os.path.join(__location__, "output/cleaned_html.html"), "w") as f:
f.write(result.cleaned_html)
with open(os.path.join(__location__, "output/output_raw_markdown.md"), "w") as f:
f.write(result.markdown_v2.raw_markdown)
with open(os.path.join(__location__, "output/output_markdown_with_citations.md"), "w") as f:
f.write(result.markdown_v2.markdown_with_citations)
with open(os.path.join(__location__, "output/output_fit_markdown.md"), "w") as f:
f.write(result.markdown_v2.fit_markdown)
print("Done")
async def main():
await simple_crawl()
await simple_example_with_running_js_code()
await simple_example_with_css_selector()
await use_proxy()
await capture_and_save_screenshot("https://www.example.com", os.path.join(__location__, "tmp/example_screenshot.jpg"))
await extract_structured_data_using_css_extractor()
# await extract_structured_data_using_llm("openai/gpt-4o", os.getenv("OPENAI_API_KEY"))
# await simple_crawl()
# await simple_example_with_running_js_code()
# await simple_example_with_css_selector()
# # await use_proxy()
# await capture_and_save_screenshot("https://www.example.com", os.path.join(__location__, "tmp/example_screenshot.jpg"))
# await extract_structured_data_using_css_extractor()
# LLM extraction examples
await extract_structured_data_using_llm()
await extract_structured_data_using_llm("huggingface/meta-llama/Meta-Llama-3.1-8B-Instruct", os.getenv("HUGGINGFACE_API_KEY"))
await extract_structured_data_using_llm("openai/gpt-4o", os.getenv("OPENAI_API_KEY"))
await extract_structured_data_using_llm("ollama/llama3.2")
# await extract_structured_data_using_llm()
# await extract_structured_data_using_llm("huggingface/meta-llama/Meta-Llama-3.1-8B-Instruct", os.getenv("HUGGINGFACE_API_KEY"))
# await extract_structured_data_using_llm("ollama/llama3.2")
# You can always pass custom headers to the extraction strategy
custom_headers = {
"Authorization": "Bearer your-custom-token",
"X-Custom-Header": "Some-Value"
}
await extract_structured_data_using_llm(extra_headers=custom_headers)
# custom_headers = {
# "Authorization": "Bearer your-custom-token",
# "X-Custom-Header": "Some-Value"
# }
# await extract_structured_data_using_llm(extra_headers=custom_headers)
# await crawl_dynamic_content_pages_method_1()
# await crawl_dynamic_content_pages_method_2()
await crawl_dynamic_content_pages_method_1()
await crawl_dynamic_content_pages_method_2()
await crawl_dynamic_content_pages_method_3()
await crawl_custom_browser_type()


@@ -0,0 +1,225 @@
### Using `storage_state` to Pre-Load Cookies and LocalStorage
Crawl4AI's `AsyncWebCrawler` lets you preserve and reuse session data, including cookies and localStorage, across multiple runs. By providing a `storage_state`, you can start your crawls already "logged in" or with any other necessary session data, so there is no need to repeat the login flow every time.
#### What is `storage_state`?
`storage_state` can be:
- A dictionary containing cookies and localStorage data.
- A path to a JSON file that holds this information.
When you pass `storage_state` to the crawler, it applies these cookies and localStorage entries before loading any pages. This means your crawler effectively starts in a known authenticated or pre-configured state.
#### Example Structure
Here's an example storage state:
```json
{
  "cookies": [
    {
      "name": "session",
      "value": "abcd1234",
      "domain": "example.com",
      "path": "/",
      "expires": 1675363572.037711,
      "httpOnly": false,
      "secure": false,
      "sameSite": "None"
    }
  ],
  "origins": [
    {
      "origin": "https://example.com",
      "localStorage": [
        { "name": "token", "value": "my_auth_token" },
        { "name": "refreshToken", "value": "my_refresh_token" }
      ]
    }
  ]
}
```
This JSON sets a `session` cookie and two localStorage entries (`token` and `refreshToken`) for `https://example.com`.
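When the cookie and localStorage values already live in plain Python mappings, a dictionary of the same shape can be assembled programmatically. The sketch below is only an illustration: `make_storage_state` is a hypothetical helper (not part of Crawl4AI), and the `path`, `httpOnly`, and `sameSite` defaults are assumptions that should be adjusted to match the target site.

```python
import json


def make_storage_state(domain, cookies, local_storage, expires, secure=True):
    """Build a storage-state dict from plain name/value mappings.

    domain        -- cookie domain, e.g. "example.com"
    cookies       -- mapping of cookie name -> value
    local_storage -- mapping of localStorage key -> value
    expires       -- expiry as a Unix timestamp, applied to every cookie
    secure        -- whether cookies are HTTPS-only; also picks the origin scheme
    """
    scheme = "https" if secure else "http"
    return {
        "cookies": [
            {
                "name": name,
                "value": value,
                "domain": domain,
                "path": "/",          # assumed default
                "expires": expires,
                "httpOnly": False,    # assumed default
                "secure": secure,
                "sameSite": "Lax",    # assumed default
            }
            for name, value in cookies.items()
        ],
        "origins": [
            {
                "origin": f"{scheme}://{domain}",
                "localStorage": [
                    {"name": key, "value": value}
                    for key, value in local_storage.items()
                ],
            }
        ],
    }


example_state = make_storage_state(
    "example.com",
    cookies={"session": "abcd1234"},
    local_storage={"token": "my_auth_token", "refreshToken": "my_refresh_token"},
    expires=1675363572.037711,
)
# Pretty-printed form, ready to be written to a JSON file.
example_json = json.dumps(example_state, indent=2)
```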
---
### Passing `storage_state` as a Dictionary
You can directly provide the data as a dictionary:
```python
import asyncio
from crawl4ai import AsyncWebCrawler


async def main():
    storage_dict = {
        "cookies": [
            {
                "name": "session",
                "value": "abcd1234",
                "domain": "example.com",
                "path": "/",
                "expires": 1675363572.037711,
                "httpOnly": False,
                "secure": False,
                "sameSite": "None"
            }
        ],
        "origins": [
            {
                "origin": "https://example.com",
                "localStorage": [
                    {"name": "token", "value": "my_auth_token"},
                    {"name": "refreshToken", "value": "my_refresh_token"}
                ]
            }
        ]
    }

    async with AsyncWebCrawler(
        headless=True,
        storage_state=storage_dict
    ) as crawler:
        result = await crawler.arun(url='https://example.com/protected')
        if result.success:
            print("Crawl succeeded with pre-loaded session data!")
            print("Page HTML length:", len(result.html))


if __name__ == "__main__":
    asyncio.run(main())
```
---
### Passing `storage_state` as a File
If you prefer a file-based approach, save the JSON above to `mystate.json` and reference it:
```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler(
        headless=True,
        storage_state="mystate.json"  # Uses a JSON file instead of a dictionary
    ) as crawler:
        result = await crawler.arun(url='https://example.com/protected')
        if result.success:
            print("Crawl succeeded with pre-loaded session data!")
            print("Page HTML length:", len(result.html))

if __name__ == "__main__":
    asyncio.run(main())
```
---
### Using `storage_state` to Avoid Repeated Logins (Sign In Once, Use Later)
A common scenario is when you need to log in to a site (entering username/password, etc.) to access protected pages. Doing so every crawl is cumbersome. Instead, you can:
1. Perform the login once in a hook.
2. After login completes, export the resulting `storage_state` to a file.
3. On subsequent runs, provide that `storage_state` to skip the login step.
**Step-by-Step Example:**
**First Run (Perform Login and Save State):**
```python
import asyncio
from crawl4ai import AsyncWebCrawler, CacheMode
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

async def on_browser_created_hook(browser):
    # Access the default context and create a page
    context = browser.contexts[0]
    page = await context.new_page()
    # Navigate to the login page
    await page.goto("https://example.com/login", wait_until="domcontentloaded")
    # Fill in credentials and submit
    await page.fill("input[name='username']", "myuser")
    await page.fill("input[name='password']", "mypassword")
    await page.click("button[type='submit']")
    await page.wait_for_load_state("networkidle")
    # Now the site sets tokens in localStorage and cookies
    # Export this state to a file so we can reuse it
    await context.storage_state(path="my_storage_state.json")
    await page.close()

async def main():
    # First run: perform login and export the storage_state
    async with AsyncWebCrawler(
        headless=True,
        verbose=True,
        hooks={"on_browser_created": on_browser_created_hook},
        use_persistent_context=True,
        user_data_dir="./my_user_data"
    ) as crawler:
        # After on_browser_created_hook runs, we have storage_state saved to my_storage_state.json
        result = await crawler.arun(
            url='https://example.com/protected-page',
            cache_mode=CacheMode.BYPASS,
            markdown_generator=DefaultMarkdownGenerator(options={"ignore_links": True}),
        )
        print("First run result success:", result.success)
        if result.success:
            print("Protected page HTML length:", len(result.html))

if __name__ == "__main__":
    asyncio.run(main())
```
**Second Run (Reuse Saved State, No Login Needed):**
```python
import asyncio
from crawl4ai import AsyncWebCrawler, CacheMode
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

async def main():
    # Second run: no need to hook on_browser_created this time.
    # Just provide the previously saved storage state.
    async with AsyncWebCrawler(
        headless=True,
        verbose=True,
        use_persistent_context=True,
        user_data_dir="./my_user_data",
        storage_state="my_storage_state.json"  # Reuse previously exported state
    ) as crawler:
        # Now the crawler starts already logged in
        result = await crawler.arun(
            url='https://example.com/protected-page',
            cache_mode=CacheMode.BYPASS,
            markdown_generator=DefaultMarkdownGenerator(options={"ignore_links": True}),
        )
        print("Second run result success:", result.success)
        if result.success:
            print("Protected page HTML length:", len(result.html))

if __name__ == "__main__":
    asyncio.run(main())
```
**What's Happening Here?**
- During the first run, the `on_browser_created_hook` logs into the site.
- After logging in, the hook exports the current session (cookies, localStorage, etc.) to `my_storage_state.json`.
- On subsequent runs, passing `storage_state="my_storage_state.json"` starts the browser context with these tokens already in place, skipping the login steps.
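
To decide at startup whether the saved state can still be reused or the login hook should run again, a small check along the following lines may help. This is a sketch, not a Crawl4AI feature: the function name `saved_state_is_usable` is hypothetical, and it assumes the exported file follows the Playwright-style layout shown earlier, where `expires` is a Unix timestamp and `-1` marks a session cookie.

```python
import json
import time
from pathlib import Path


def saved_state_is_usable(path, now=None):
    """Return True if `path` exists and holds at least one non-expired cookie."""
    state_file = Path(path)
    if not state_file.is_file():
        return False
    state = json.loads(state_file.read_text(encoding="utf-8"))
    now = time.time() if now is None else now
    for cookie in state.get("cookies", []):
        expires = cookie.get("expires", -1)
        # Assumption: -1 (or a missing value) means a session cookie with no expiry.
        if expires == -1 or expires > now:
            return True
    return False
```

A script could call `saved_state_is_usable("my_storage_state.json")` and pass `storage_state` only when it returns `True`, falling back to the login hook otherwise. The check cannot tell whether the server has revoked the session, so a failed crawl of a protected page should still trigger a fresh login.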
**Sign Out Scenario:**
If the website allows you to sign out by clearing tokens or by navigating to a sign-out URL, you can also run a script that uses `on_browser_created_hook` or `arun` to simulate signing out, then export the resulting `storage_state` again. That would give you a baseline “logged out” state to start fresh from next time.
---
### Conclusion
By using `storage_state`, you can skip repetitive actions, like logging in, and jump straight into crawling protected content. Whether you provide a file path or a dictionary, this powerful feature helps maintain state between crawls, simplifying your data extraction pipelines.


@@ -0,0 +1,117 @@
# Tutorial: Clicking Buttons to Load More Content with Crawl4AI
## Introduction
When scraping dynamic websites, it's common to encounter "Load More" or "Next" buttons that must be clicked to reveal new content. Crawl4AI provides a straightforward way to handle these situations using JavaScript execution and waiting conditions. In this tutorial, we'll cover two approaches:
1. **Step-by-step (Session-based) Approach:** Multiple calls to `arun()` to progressively load more content.
2. **Single-call Approach:** Execute a more complex JavaScript snippet inside a single `arun()` call to handle all clicks at once before the extraction.
## Prerequisites
- A working installation of Crawl4AI
- Basic familiarity with Python's `async`/`await` syntax
## Step-by-Step Approach
Use a session ID to maintain state across multiple `arun()` calls:
```python
import asyncio
from crawl4ai import AsyncWebCrawler, CacheMode

js_code = [
    # This JS finds the "Next" button and clicks it
    "const nextButton = document.querySelector('button.next'); nextButton && nextButton.click();"
]
wait_for_condition = "css:.new-content-class"

async def main():
    async with AsyncWebCrawler(headless=True, verbose=True) as crawler:
        # 1. Load the initial page
        result_initial = await crawler.arun(
            url="https://example.com",
            cache_mode=CacheMode.BYPASS,
            session_id="my_session"
        )
        # 2. Click the 'Next' button and wait for new content
        result_next = await crawler.arun(
            url="https://example.com",
            session_id="my_session",
            js_code=js_code,
            wait_for=wait_for_condition,
            js_only=True,
            cache_mode=CacheMode.BYPASS
        )
        # `result_next` now contains the updated HTML after clicking 'Next'

asyncio.run(main())
```
**Key Points:**
- **`session_id`**: Keeps the same browser context open.
- **`js_code`**: Executes JavaScript in the context of the already loaded page.
- **`wait_for`**: Ensures the crawler waits until new content is fully loaded.
- **`js_only=True`**: Runs the JS in the current session without reloading the page.
By repeating the `arun()` call multiple times and modifying the `js_code` (e.g., clicking different modules or pages), you can iteratively load all the desired content.
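
The repetition can be wrapped in a loop. The sketch below is an illustration under stated assumptions: `click_through` is a hypothetical helper, `crawler` is any object exposing an async `arun()` that accepts the parameters used above, and a result with `success` set to `False` is treated as the signal to stop.

```python
async def click_through(crawler, url, js_code, wait_for, max_clicks=3, **arun_kwargs):
    """Load `url`, then repeat the click step up to `max_clicks` times in one session."""
    session_id = "my_session"
    # Initial load: no JavaScript, just open the page in the session.
    results = [await crawler.arun(url=url, session_id=session_id, **arun_kwargs)]
    for _ in range(max_clicks):
        result = await crawler.arun(
            url=url,
            session_id=session_id,
            js_code=js_code,
            wait_for=wait_for,
            js_only=True,  # run the JS in the already loaded page
            **arun_kwargs,
        )
        if not result.success:
            break  # assumed here to mean that no further content could be loaded
        results.append(result)
    return results
```

Inside an `async with AsyncWebCrawler(...) as crawler:` block this could be called as `await click_through(crawler, "https://example.com", js_code, wait_for_condition, cache_mode=CacheMode.BYPASS)`.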
## Single-call Approach
If the page allows it, you can run a single `arun()` call with a more elaborate JavaScript snippet that:
- Iterates over all the modules or "Next" buttons
- Clicks them one by one
- Waits for content updates between each click
- Once done, returns control to Crawl4AI for extraction.
Example snippet:
```python
import asyncio
from crawl4ai import AsyncWebCrawler, CacheMode

js_code = [
    # Example JS that clicks multiple modules:
    """
    (async () => {
        const modules = document.querySelectorAll('.module-item');
        for (let i = 0; i < modules.length; i++) {
            modules[i].scrollIntoView();
            modules[i].click();
            // Wait for each module's content to load, adjust 100ms as needed
            await new Promise(r => setTimeout(r, 100));
        }
    })();
    """
]

async def main():
    async with AsyncWebCrawler(headless=True, verbose=True) as crawler:
        result = await crawler.arun(
            url="https://example.com",
            js_code=js_code,
            wait_for="css:.final-loaded-content-class",
            cache_mode=CacheMode.BYPASS
        )
        # `result` now contains all content after all modules have been clicked in one go.

asyncio.run(main())
```
**Key Points:**
- All interactions (clicks and waits) happen before the extraction.
- Ideal for pages where all steps can be done in a single pass.
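
When the selector or the pause between clicks varies from site to site, the JavaScript can be generated instead of hard-coded. The helper below is a hypothetical sketch (the name `build_click_all_js` is not part of Crawl4AI); it only builds a string of the same shape as the snippet above.

```python
import json


def build_click_all_js(selector, delay_ms=100):
    """Return a JS snippet that clicks every element matching `selector` in order."""
    # json.dumps produces a safely quoted JavaScript string literal for the selector.
    return f"""
(async () => {{
    const items = document.querySelectorAll({json.dumps(selector)});
    for (const item of items) {{
        item.scrollIntoView();
        item.click();
        // Pause so each item's content can load before the next click
        await new Promise(r => setTimeout(r, {int(delay_ms)}));
    }}
}})();
"""


js_code = [build_click_all_js(".module-item", delay_ms=250)]
```

The resulting `js_code` list can be passed to `arun()` exactly like the hand-written version.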
## Choosing the Right Approach
- **Step-by-Step (Session-based)**:
  - Good when you need fine-grained control or must dynamically check conditions before clicking the next page.
  - Useful if the page requires multiple conditions checked at runtime.
- **Single-call**:
  - Perfect if the sequence of interactions is known in advance.
  - Cleaner code if the page's structure is consistent and predictable.
## Conclusion
Crawl4AI makes it easy to handle dynamic content:
- Use session IDs and multiple `arun()` calls for stepwise crawling.
- Or pack all actions into one `arun()` call if the interactions are well-defined upfront.
This flexibility ensures you can handle a wide range of dynamic web pages efficiently.


@@ -0,0 +1,277 @@
import os, sys
# append the parent directory to the sys.path
parent_dir = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
sys.path.append(parent_dir)
parent_parent_dir = os.path.dirname(parent_dir)
sys.path.append(parent_parent_dir)
__location__ = os.path.realpath(os.path.join(os.getcwd(), os.path.dirname(__file__)))
__data__ = os.path.join(__location__, "__data")
import asyncio
from pathlib import Path
import aiohttp
import json
from crawl4ai import AsyncWebCrawler, CacheMode
from crawl4ai.content_filter_strategy import BM25ContentFilter
# 1. File Download Processing Example
async def download_example():
    """Example of downloading files from Python.org"""
    # downloads_path = os.path.join(os.getcwd(), "downloads")
    downloads_path = os.path.join(Path.home(), ".crawl4ai", "downloads")
    os.makedirs(downloads_path, exist_ok=True)
    print(f"Downloads will be saved to: {downloads_path}")
    async with AsyncWebCrawler(
        accept_downloads=True,
        downloads_path=downloads_path,
        verbose=True
    ) as crawler:
        result = await crawler.arun(
            url="https://www.python.org/downloads/",
            js_code="""
            // Find and click the first Windows installer link
            const downloadLink = document.querySelector('a[href$=".exe"]');
            if (downloadLink) {
                console.log('Found download link:', downloadLink.href);
                downloadLink.click();
            } else {
                console.log('No .exe download link found');
            }
            """,
            delay_before_return_html=1,  # Wait 1 second to ensure download starts
            cache_mode=CacheMode.BYPASS
        )
        if result.downloaded_files:
            print("\nDownload successful!")
            print("Downloaded files:")
            for file_path in result.downloaded_files:
                print(f"- {file_path}")
                print(f"  File size: {os.path.getsize(file_path) / (1024*1024):.2f} MB")
        else:
            print("\nNo files were downloaded")
# 2. Local File and Raw HTML Processing Example
async def local_and_raw_html_example():
    """Example of processing local files and raw HTML"""
    # Create a sample HTML file (make sure the data directory exists first)
    os.makedirs(__data__, exist_ok=True)
    sample_file = os.path.join(__data__, "sample.html")
    with open(sample_file, "w") as f:
        f.write("""
        <html><body>
        <h1>Test Content</h1>
        <p>This is a test paragraph.</p>
        </body></html>
        """)
    async with AsyncWebCrawler(verbose=True) as crawler:
        # Process local file
        local_result = await crawler.arun(
            url=f"file://{os.path.abspath(sample_file)}"
        )
        # Process raw HTML
        raw_html = """
        <html><body>
        <h1>Raw HTML Test</h1>
        <p>This is a test of raw HTML processing.</p>
        </body></html>
        """
        raw_result = await crawler.arun(
            url=f"raw:{raw_html}"
        )
        # Clean up
        os.remove(sample_file)
        print("Local file content:", local_result.markdown)
        print("\nRaw HTML content:", raw_result.markdown)
# 3. Enhanced Markdown Generation Example
async def markdown_generation_example():
    """Example of enhanced markdown generation with citations and LLM-friendly features"""
    async with AsyncWebCrawler(verbose=True) as crawler:
        # Create a content filter (optional)
        content_filter = BM25ContentFilter(
            # user_query="History and cultivation",
            bm25_threshold=1.0
        )
        result = await crawler.arun(
            url="https://en.wikipedia.org/wiki/Apple",
            css_selector="main div#bodyContent",
            content_filter=content_filter,
            cache_mode=CacheMode.BYPASS
        )
        from crawl4ai import AsyncWebCrawler
        from crawl4ai.content_filter_strategy import BM25ContentFilter
        result = await crawler.arun(
            url="https://en.wikipedia.org/wiki/Apple",
            css_selector="main div#bodyContent",
            content_filter=BM25ContentFilter()
        )
        print(result.markdown_v2.fit_markdown)
        print("\nMarkdown Generation Results:")
        print(f"1. Original markdown length: {len(result.markdown)}")
        print(f"2. New markdown versions (markdown_v2):")
        print(f"   - Raw markdown length: {len(result.markdown_v2.raw_markdown)}")
        print(f"   - Citations markdown length: {len(result.markdown_v2.markdown_with_citations)}")
        print(f"   - References section length: {len(result.markdown_v2.references_markdown)}")
        if result.markdown_v2.fit_markdown:
            print(f"   - Filtered markdown length: {len(result.markdown_v2.fit_markdown)}")
        # Save examples to files
        output_dir = os.path.join(__data__, "markdown_examples")
        os.makedirs(output_dir, exist_ok=True)
        # Save different versions
        with open(os.path.join(output_dir, "1_raw_markdown.md"), "w") as f:
            f.write(result.markdown_v2.raw_markdown)
        with open(os.path.join(output_dir, "2_citations_markdown.md"), "w") as f:
            f.write(result.markdown_v2.markdown_with_citations)
        with open(os.path.join(output_dir, "3_references.md"), "w") as f:
            f.write(result.markdown_v2.references_markdown)
        if result.markdown_v2.fit_markdown:
            with open(os.path.join(output_dir, "4_filtered_markdown.md"), "w") as f:
                f.write(result.markdown_v2.fit_markdown)
        print(f"\nMarkdown examples saved to: {output_dir}")
        # Show a sample of citations and references
        print("\nSample of markdown with citations:")
        print(result.markdown_v2.markdown_with_citations[:500] + "...\n")
        print("Sample of references:")
        print('\n'.join(result.markdown_v2.references_markdown.split('\n')[:10]) + "...")
# 4. Browser Management Example
async def browser_management_example():
    """Example of using enhanced browser management features"""
    # Use the specified user directory path
    user_data_dir = os.path.join(Path.home(), ".crawl4ai", "browser_profile")
    os.makedirs(user_data_dir, exist_ok=True)
    print(f"Browser profile will be saved to: {user_data_dir}")
    async with AsyncWebCrawler(
        use_managed_browser=True,
        user_data_dir=user_data_dir,
        headless=False,
        verbose=True
    ) as crawler:
        result = await crawler.arun(
            url="https://crawl4ai.com",
            # session_id="persistent_session_1",
            cache_mode=CacheMode.BYPASS
        )
        # Use GitHub as an example - it's a good test for browser management
        # because it requires proper browser handling
        result = await crawler.arun(
            url="https://github.com/trending",
            # session_id="persistent_session_1",
            cache_mode=CacheMode.BYPASS
        )
        print("\nBrowser session result:", result.success)
        if result.success:
            print("Page title:", result.metadata.get('title', 'No title found'))
# 5. API Usage Example
async def api_example():
    """Example of using the new API endpoints"""
    api_token = os.getenv('CRAWL4AI_API_TOKEN') or "test_api_code"
    headers = {'Authorization': f'Bearer {api_token}'}
    async with aiohttp.ClientSession() as session:
        # Submit crawl job
        crawl_request = {
            "urls": ["https://news.ycombinator.com"],  # Hacker News as an example
            "extraction_config": {
                "type": "json_css",
                "params": {
                    "schema": {
                        "name": "Hacker News Articles",
                        "baseSelector": ".athing",
                        "fields": [
                            {
                                "name": "title",
                                "selector": ".title a",
                                "type": "text"
                            },
                            {
                                "name": "score",
                                "selector": ".score",
                                "type": "text"
                            },
                            {
                                "name": "url",
                                "selector": ".title a",
                                "type": "attribute",
                                "attribute": "href"
                            }
                        ]
                    }
                }
            },
            "crawler_params": {
                "headless": True,
                # "use_managed_browser": True
            },
            "cache_mode": "bypass",
            # "screenshot": True,
            # "magic": True
        }
        async with session.post(
            "http://localhost:11235/crawl",
            json=crawl_request,
            headers=headers
        ) as response:
            task_data = await response.json()
            task_id = task_data["task_id"]
            # Check task status
            while True:
                async with session.get(
                    f"http://localhost:11235/task/{task_id}",
                    headers=headers
                ) as status_response:
                    result = await status_response.json()
                    print(f"Task status: {result['status']}")
                    if result["status"] == "completed":
                        print("Task completed!")
                        print("Results:")
                        news = json.loads(result["results"][0]['extracted_content'])
                        print(json.dumps(news[:4], indent=2))
                        break
                    else:
                        await asyncio.sleep(1)
# Main execution
async def main():
    # print("Running Crawl4AI feature examples...")
    # print("\n1. Running Download Example:")
    # await download_example()
    # print("\n2. Running Markdown Generation Example:")
    # await markdown_generation_example()
    # print("\n3. Running Local and Raw HTML Example:")
    # await local_and_raw_html_example()
    # print("\n4. Running Browser Management Example:")
    await browser_management_example()
    # print("\n5. Running API Example:")
    await api_example()

if __name__ == "__main__":
    asyncio.run(main())


@@ -18,7 +18,7 @@ Let's see how we can customize the AsyncWebCrawler using hooks! In this example,
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.async_crawler_strategy import AsyncPlaywrightCrawlerStrategy
from playwright.async_api import Page, Browser
from playwright.async_api import Page, Browser, BrowserContext
async def on_browser_created(browser: Browser):
print("[HOOK] on_browser_created")
@@ -71,7 +71,11 @@ from crawl4ai.async_crawler_strategy import AsyncPlaywrightCrawlerStrategy
async def main():
print("\n🔗 Using Crawler Hooks: Let's see how we can customize the AsyncWebCrawler using hooks!")
crawler_strategy = AsyncPlaywrightCrawlerStrategy(verbose=True)
initial_cookies = [
{"name": "sessionId", "value": "abc123", "domain": ".example.com"},
{"name": "userId", "value": "12345", "domain": ".example.com"}
]
crawler_strategy = AsyncPlaywrightCrawlerStrategy(verbose=True, cookies=initial_cookies)
crawler_strategy.set_hook('on_browser_created', on_browser_created)
crawler_strategy.set_hook('before_goto', before_goto)
crawler_strategy.set_hook('after_goto', after_goto)


@@ -0,0 +1,136 @@
# Content Filtering in Crawl4AI
This guide explains how to use content filtering strategies in Crawl4AI to extract the most relevant information from crawled web pages. You'll learn how to use the built-in `BM25ContentFilter` and how to create your own custom content filtering strategies.
## Relevance Content Filter
The `RelevantContentFilter` is an abstract class that provides a common interface for content filtering strategies. Specific filtering algorithms, like `PruningContentFilter` or `BM25ContentFilter`, inherit from this class and implement the `filter_content` method. This method takes the HTML content as input and returns a list of filtered text blocks.
## Pruning Content Filter
The `PruningContentFilter` is a tree-shaking algorithm that analyzes the HTML DOM structure and removes less relevant nodes based on various metrics like text density, link density, and tag importance. It evaluates each node using a composite scoring system and "prunes" nodes that fall below a certain threshold.
### Usage
```python
from crawl4ai import AsyncWebCrawler
from crawl4ai.content_filter_strategy import PruningContentFilter

async def filter_content(url):
    async with AsyncWebCrawler() as crawler:
        content_filter = PruningContentFilter(
            min_word_threshold=5,
            threshold_type='dynamic',
            threshold=0.45
        )
        result = await crawler.arun(url=url, extraction_strategy=content_filter, fit_markdown=True)
        if result.success:
            print(f"Cleaned Markdown:\n{result.fit_markdown}")
```
### Parameters
- **`min_word_threshold`**: (Optional) Minimum number of words a node must contain to be considered relevant. Nodes with fewer words are automatically pruned.
- **`threshold_type`**: (Optional, default 'fixed') Controls how pruning thresholds are calculated:
  - `'fixed'`: Uses a constant threshold value for all nodes
  - `'dynamic'`: Adjusts threshold based on node characteristics like tag importance and text/link ratios
- **`threshold`**: (Optional, default 0.48) Base threshold value for node pruning:
  - For fixed threshold: Nodes scoring below this value are removed
  - For dynamic threshold: This value is adjusted based on node properties
### How It Works
The pruning algorithm evaluates each node using multiple metrics:
- Text density: Ratio of actual text to overall node content
- Link density: Proportion of text within links
- Tag importance: Weight based on HTML tag type (e.g., article, p, div)
- Content quality: Metrics like text length and structural importance
Nodes scoring below the threshold are removed, effectively "shaking" less relevant content from the DOM tree. This results in a cleaner document containing only the most relevant content blocks.
The algorithm is particularly effective for:
- Removing boilerplate content
- Eliminating navigation menus and sidebars
- Preserving main article content
- Maintaining document structure while removing noise
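
The scoring idea can be illustrated with a toy example. This is not the `PruningContentFilter` implementation: the tag weights, the metric weights (0.4 / 0.3 / 0.3), and the `NodeStats` structure are all assumptions chosen to show how text density, link density, and tag importance might combine into one score that is compared against a fixed threshold.

```python
from dataclasses import dataclass

# Hypothetical tag weights, chosen only for illustration.
TAG_WEIGHTS = {"article": 1.0, "p": 0.8, "div": 0.5, "nav": 0.1}


@dataclass
class NodeStats:
    tag: str
    text_len: int       # characters of visible text in the node
    markup_len: int     # characters of the node's full HTML
    link_text_len: int  # characters of text that sit inside <a> tags


def composite_score(node: NodeStats) -> float:
    """Combine text density, link density, and tag importance into one score."""
    text_density = node.text_len / node.markup_len if node.markup_len else 0.0
    link_density = node.link_text_len / node.text_len if node.text_len else 1.0
    tag_weight = TAG_WEIGHTS.get(node.tag, 0.3)
    return 0.4 * text_density + 0.3 * (1.0 - link_density) + 0.3 * tag_weight


def prune(nodes, threshold=0.48):
    """Keep only the nodes whose score reaches the fixed threshold."""
    return [n for n in nodes if composite_score(n) >= threshold]


nodes = [
    NodeStats("article", text_len=800, markup_len=1000, link_text_len=40),
    NodeStats("nav", text_len=100, markup_len=600, link_text_len=100),
]
kept = prune(nodes)  # the link-heavy <nav> node falls below the threshold
```

A dynamic threshold would adjust `threshold` per node instead of using one constant; how the library does that is not shown here.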
## BM25 Algorithm
The `BM25ContentFilter` uses the BM25 algorithm, a ranking function used in information retrieval to estimate the relevance of documents to a given search query. In Crawl4AI, this algorithm helps to identify and extract text chunks that are most relevant to the page's metadata or a user-specified query.
### Usage
To use the `BM25ContentFilter`, initialize it and then pass it as the `extraction_strategy` parameter to the `arun` method of the crawler.
```python
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.content_filter_strategy import BM25ContentFilter

async def filter_content(url, query=None):
    async with AsyncWebCrawler() as crawler:
        content_filter = BM25ContentFilter(user_query=query)
        result = await crawler.arun(url=url, extraction_strategy=content_filter, fit_markdown=True)  # Set fit_markdown flag to True to trigger BM25 filtering
        if result.success:
            print(f"Filtered Content (JSON):\n{result.extracted_content}")
            print(f"\nFiltered Markdown:\n{result.fit_markdown}")  # New field in CrawlResult object
            print(f"\nFiltered HTML:\n{result.fit_html}")  # New field in CrawlResult object. Note that raw HTML may have tags re-organized due to internal parsing.
        else:
            print("Error:", result.error_message)

# Example usage:
asyncio.run(filter_content("https://en.wikipedia.org/wiki/Apple", "fruit nutrition health"))  # with query
asyncio.run(filter_content("https://en.wikipedia.org/wiki/Apple"))  # without query, metadata will be used as the query.
```
### Parameters
- **`user_query`**: (Optional) A string representing the search query. If not provided, the filter extracts relevant metadata (title, description, keywords) from the page and uses that as the query.
- **`bm25_threshold`**: (Optional, default 1.0) A float value that controls the threshold for relevance. Higher values result in stricter filtering, returning only the most relevant text chunks. Lower values result in more lenient filtering.
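
For intuition about what the threshold acts on, here is a self-contained sketch of classic Okapi BM25 scoring over text chunks. It is not Crawl4AI's code: whitespace tokenization, the `k1` and `b` defaults, and the function name `bm25_scores` are assumptions made for the example.

```python
import math
from collections import Counter


def bm25_scores(query, chunks, k1=1.5, b=0.75):
    """Score each text chunk against `query` with the Okapi BM25 formula."""
    docs = [chunk.lower().split() for chunk in chunks]
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many chunks each term appears.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


chunks = [
    "Apples are a fruit rich in fiber and vitamin C",
    "The company was founded in a garage",
    "Fruit nutrition varies by cultivar",
]
scores = bm25_scores("fruit nutrition", chunks)
# Keep only chunks at or above a relevance threshold, analogous to bm25_threshold.
relevant = [c for c, s in zip(chunks, scores) if s >= 1.0]
```

Raising the threshold keeps fewer chunks; lowering it admits chunks that match the query only weakly, such as the first one here.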
## Fit Markdown Flag
Setting the `fit_markdown` flag to `True` in the `arun` method activates the BM25 content filtering during the crawl. The `fit_markdown` parameter instructs the scraper to extract and clean the HTML, primarily to prepare for a Large Language Model that cannot process large amounts of data. Setting this flag not only improves the quality of the extracted content but also adds the filtered content to two new attributes in the returned `CrawlResult` object: `fit_markdown` and `fit_html`.
## Custom Content Filtering Strategies
You can create your own custom filtering strategies by inheriting from the `RelevantContentFilter` class and implementing the `filter_content` method. This allows you to tailor the filtering logic to your specific needs.
```python
from crawl4ai import AsyncWebCrawler
from crawl4ai.content_filter_strategy import RelevantContentFilter
from bs4 import BeautifulSoup, Tag
from typing import List

class MyCustomFilter(RelevantContentFilter):
    def filter_content(self, html: str) -> List[str]:
        soup = BeautifulSoup(html, 'lxml')
        # Implement custom filtering logic here
        # Example: extract all paragraphs within divs with class "article-body"
        filtered_paragraphs = []
        for tag in soup.select("div.article-body p"):
            if isinstance(tag, Tag):
                filtered_paragraphs.append(str(tag))  # Add the cleaned HTML element.
        return filtered_paragraphs

async def custom_filter_demo(url: str):
    async with AsyncWebCrawler() as crawler:
        custom_filter = MyCustomFilter()
        result = await crawler.arun(url, extraction_strategy=custom_filter)
        if result.success:
            print(result.extracted_content)
```
This example demonstrates extracting paragraphs from a specific div class. You can customize this logic to implement different filtering strategies, use regular expressions, analyze text density, or apply other relevant techniques.
## Conclusion
Content filtering strategies provide a powerful way to refine the output of your crawls. By using `BM25ContentFilter` or creating custom strategies, you can focus on the most pertinent information and improve the efficiency of your data processing pipeline.


@@ -30,7 +30,7 @@ Let's start with a basic example of session-based crawling:
```python
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai import AsyncWebCrawler, CacheMode
async def basic_session_crawl():
async with AsyncWebCrawler(verbose=True) as crawler:
@@ -43,7 +43,7 @@ async def basic_session_crawl():
session_id=session_id,
js_code="document.querySelector('.load-more-button').click();" if page > 0 else None,
css_selector=".content-item",
bypass_cache=True
cache_mode=CacheMode.BYPASS
)
print(f"Page {page + 1}: Found {result.extracted_content.count('.content-item')} items")
@@ -102,7 +102,7 @@ async def advanced_session_crawl_with_hooks():
session_id=session_id,
css_selector="li.commit-item",
js_code=js_next_page if page > 0 else None,
bypass_cache=True,
cache_mode=CacheMode.BYPASS,
js_only=page > 0
)
@@ -174,7 +174,7 @@ async def integrated_js_and_wait_crawl():
extraction_strategy=extraction_strategy,
js_code=js_next_page_and_wait if page > 0 else None,
js_only=page > 0,
bypass_cache=True
cache_mode=CacheMode.BYPASS
)
commits = json.loads(result.extracted_content)
@@ -241,7 +241,7 @@ async def wait_for_parameter_crawl():
js_code=js_next_page if page > 0 else None,
wait_for=wait_for if page > 0 else None,
js_only=page > 0,
bypass_cache=True
cache_mode=CacheMode.BYPASS
)
commits = json.loads(result.extracted_content)


@@ -75,7 +75,7 @@ async def crawl_dynamic_content():
js_code=js_next_page if page > 0 else None,
wait_for=wait_for if page > 0 else None,
js_only=page > 0,
bypass_cache=True
cache_mode=CacheMode.BYPASS
)
if result.success:


@@ -8,11 +8,26 @@ The following parameters can be passed to the `arun()` method. They are organize
await crawler.arun(
url="https://example.com", # Required: URL to crawl
verbose=True, # Enable detailed logging
bypass_cache=False, # Skip cache for this request
cache_mode=CacheMode.ENABLED, # Control cache behavior
warmup=True # Whether to run warmup check
)
```
## Cache Control
```python
from crawl4ai import CacheMode
await crawler.arun(
cache_mode=CacheMode.ENABLED, # Normal caching (read/write)
# Other cache modes:
# cache_mode=CacheMode.DISABLED # No caching at all
# cache_mode=CacheMode.READ_ONLY # Only read from cache
# cache_mode=CacheMode.WRITE_ONLY # Only write to cache
# cache_mode=CacheMode.BYPASS # Skip cache for this operation
)
```
## Content Processing Parameters
### Text Processing
@@ -162,14 +177,13 @@ await crawler.arun(
## Parameter Interactions and Notes
1. **Magic Mode Combinations**
1. **Cache and Performance Setup**
```python
# Full anti-detection setup
# Optimal caching for repeated crawls
await crawler.arun(
magic=True,
headless=False,
simulate_user=True,
override_navigator=True
cache_mode=CacheMode.ENABLED,
word_count_threshold=10,
process_iframes=False
)
```
@@ -179,7 +193,8 @@ await crawler.arun(
await crawler.arun(
js_code="window.scrollTo(0, document.body.scrollHeight);",
wait_for="css:.lazy-content",
delay_before_return_html=2.0
delay_before_return_html=2.0,
cache_mode=CacheMode.WRITE_ONLY # Cache results after dynamic load
)
```
@@ -192,7 +207,8 @@ await crawler.arun(
extraction_strategy=my_strategy,
chunking_strategy=my_chunking,
process_iframes=True,
remove_overlay_elements=True
remove_overlay_elements=True,
cache_mode=CacheMode.ENABLED
)
```
@@ -201,7 +217,7 @@ await crawler.arun(
1. **Performance Optimization**
```python
await crawler.arun(
bypass_cache=False, # Use cache when possible
cache_mode=CacheMode.ENABLED, # Use full caching
word_count_threshold=10, # Filter out noise
process_iframes=False # Skip iframes if not needed
)
@@ -212,7 +228,8 @@ await crawler.arun(
await crawler.arun(
magic=True, # Enable anti-detection
delay_before_return_html=1.0, # Wait for dynamic content
page_timeout=60000 # Longer timeout for slow pages
page_timeout=60000, # Longer timeout for slow pages
cache_mode=CacheMode.WRITE_ONLY # Cache results after successful crawl
)
```
@@ -221,6 +238,7 @@ await crawler.arun(
await crawler.arun(
remove_overlay_elements=True, # Remove popups
excluded_tags=['nav', 'aside'],# Remove unnecessary elements
keep_data_attributes=False # Remove data attributes
keep_data_attributes=False, # Remove data attributes
cache_mode=CacheMode.ENABLED # Use cache for faster processing
)
```


@@ -20,6 +20,7 @@ class CrawlResult(BaseModel):
fit_html: Optional[str] = None # Most relevant HTML content
markdown: Optional[str] = None # HTML converted to markdown
fit_markdown: Optional[str] = None # Most relevant markdown content
downloaded_files: Optional[List[str]] = None # Downloaded files
# Extracted Data
extracted_content: Optional[str] = None # Content from extraction strategy


@@ -32,4 +32,5 @@
| async_webcrawler.py | warmup | `kwargs.get("warmup", True)` | AsyncWebCrawler | Initialize crawler with warmup request |
| async_webcrawler.py | session_id | `kwargs.get("session_id", None)` | AsyncWebCrawler | Session identifier for browser reuse |
| async_webcrawler.py | only_text | `kwargs.get("only_text", False)` | AsyncWebCrawler | Extract only text content |
| async_webcrawler.py | bypass_cache | `kwargs.get("bypass_cache", False)` | AsyncWebCrawler | Skip cache and force fresh crawl |
| async_webcrawler.py | bypass_cache | `kwargs.get("bypass_cache", False)` | AsyncWebCrawler | Skip cache and force fresh crawl |
| async_webcrawler.py | cache_mode | `kwargs.get("cache_mode", CacheMode.ENABLED)` | AsyncWebCrawler | Cache handling mode for request |


@@ -0,0 +1,79 @@
# Crawl4AI Cache System and Migration Guide
## Overview
Starting from version 0.5.0, Crawl4AI introduces a new caching system that replaces the old boolean flags with a more intuitive `CacheMode` enum. This change simplifies cache control and makes the behavior more predictable.
## Old vs New Approach
### Old Way (Deprecated)
The old system used multiple boolean flags:
- `bypass_cache`: Skip cache entirely
- `disable_cache`: Disable all caching
- `no_cache_read`: Don't read from cache
- `no_cache_write`: Don't write to cache
### New Way (Recommended)
The new system uses a single `CacheMode` enum:
- `CacheMode.ENABLED`: Normal caching (read/write)
- `CacheMode.DISABLED`: No caching at all
- `CacheMode.READ_ONLY`: Only read from cache
- `CacheMode.WRITE_ONLY`: Only write to cache
- `CacheMode.BYPASS`: Skip cache for this operation
## Migration Example
### Old Code (Deprecated)
```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def use_proxy():
    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(
            url="https://www.nbcnews.com/business",
            bypass_cache=True  # Old way
        )
        print(len(result.markdown))

async def main():
    await use_proxy()

if __name__ == "__main__":
    asyncio.run(main())
```
### New Code (Recommended)
```python
import asyncio
from crawl4ai import AsyncWebCrawler, CacheMode  # Import CacheMode

async def use_proxy():
    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(
            url="https://www.nbcnews.com/business",
            cache_mode=CacheMode.BYPASS  # New way
        )
        print(len(result.markdown))

async def main():
    await use_proxy()

if __name__ == "__main__":
    asyncio.run(main())
```
## Common Migration Patterns
Old Flag | New Mode
---------|----------
`bypass_cache=True` | `cache_mode=CacheMode.BYPASS`
`disable_cache=True` | `cache_mode=CacheMode.DISABLED`
`no_cache_read=True` | `cache_mode=CacheMode.WRITE_ONLY`
`no_cache_write=True` | `cache_mode=CacheMode.READ_ONLY`
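
The table above can be sketched as a small translation function, which may help when migrating call sites in bulk. This is only an illustration: `CacheMode` is a local stand-in for the enum described in this guide, `legacy_flags_to_cache_mode` is a hypothetical name, and the precedence applied when several old flags are set at once is an assumption, not documented behavior.

```python
from enum import Enum


class CacheMode(Enum):
    """Local stand-in mirroring the documented mode names (not the library's class)."""
    ENABLED = "enabled"
    DISABLED = "disabled"
    READ_ONLY = "read_only"
    WRITE_ONLY = "write_only"
    BYPASS = "bypass"


def legacy_flags_to_cache_mode(
    bypass_cache: bool = False,
    disable_cache: bool = False,
    no_cache_read: bool = False,
    no_cache_write: bool = False,
) -> CacheMode:
    """Hypothetical helper that applies the migration table to the old boolean flags."""
    # Assumption: when several flags are set, the most restrictive one wins.
    if disable_cache or (no_cache_read and no_cache_write):
        return CacheMode.DISABLED
    if bypass_cache:
        return CacheMode.BYPASS
    if no_cache_read:
        return CacheMode.WRITE_ONLY
    if no_cache_write:
        return CacheMode.READ_ONLY
    return CacheMode.ENABLED


print(legacy_flags_to_cache_mode(bypass_cache=True))   # CacheMode.BYPASS
print(legacy_flags_to_cache_mode(no_cache_read=True))  # CacheMode.WRITE_ONLY
```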
## Suppressing Deprecation Warnings
If you need time to migrate, you can temporarily suppress deprecation warnings:
```python
# In your config.py
SHOW_DEPRECATION_WARNINGS = False
```


@@ -0,0 +1,136 @@
# Content Filtering in Crawl4AI
This guide explains how to use content filtering strategies in Crawl4AI to extract the most relevant information from crawled web pages. You'll learn how to use the built-in `BM25ContentFilter` and how to create your own custom content filtering strategies.
## Relevant Content Filter
The `RelevantContentFilter` is an abstract class that provides a common interface for content filtering strategies. Specific filtering algorithms, like `PruningContentFilter` or `BM25ContentFilter`, inherit from this class and implement the `filter_content` method. This method takes the HTML content as input and returns a list of filtered text blocks.
## Pruning Content Filter
The `PruningContentFilter` is a tree-shaking algorithm that analyzes the HTML DOM structure and removes less relevant nodes based on various metrics like text density, link density, and tag importance. It evaluates each node using a composite scoring system and "prunes" nodes that fall below a certain threshold.
### Usage
```python
from crawl4ai import AsyncWebCrawler
from crawl4ai.content_filter_strategy import PruningContentFilter

async def filter_content(url):
    async with AsyncWebCrawler() as crawler:
        content_filter = PruningContentFilter(
            min_word_threshold=5,
            threshold_type='dynamic',
            threshold=0.45
        )
        result = await crawler.arun(url=url, extraction_strategy=content_filter, fit_markdown=True)
        if result.success:
            print(f"Cleaned Markdown:\n{result.fit_markdown}")
```
### Parameters
- **`min_word_threshold`**: (Optional) Minimum number of words a node must contain to be considered relevant. Nodes with fewer words are automatically pruned.
- **`threshold_type`**: (Optional, default 'fixed') Controls how pruning thresholds are calculated:
- `'fixed'`: Uses a constant threshold value for all nodes
- `'dynamic'`: Adjusts threshold based on node characteristics like tag importance and text/link ratios
- **`threshold`**: (Optional, default 0.48) Base threshold value for node pruning:
- For fixed threshold: Nodes scoring below this value are removed
- For dynamic threshold: This value is adjusted based on node properties
### How It Works
The pruning algorithm evaluates each node using multiple metrics:
- Text density: Ratio of actual text to overall node content
- Link density: Proportion of text within links
- Tag importance: Weight based on HTML tag type (e.g., article, p, div)
- Content quality: Metrics like text length and structural importance
Nodes scoring below the threshold are removed, effectively "shaking" less relevant content from the DOM tree. This results in a cleaner document containing only the most relevant content blocks.
The algorithm is particularly effective for:
- Removing boilerplate content
- Eliminating navigation menus and sidebars
- Preserving main article content
- Maintaining document structure while removing noise
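
To make the scoring idea concrete, here is a toy sketch of a composite score with fixed and dynamic thresholds. It is not the library's implementation: the `NodeStats` fields, the tag weights, the metric weights, and the dynamic adjustment formula are all assumptions chosen for illustration. Only the default threshold of 0.48 and the `'fixed'` / `'dynamic'` names come from the parameter list above.

```python
from dataclasses import dataclass

# Hypothetical weights, chosen only for illustration.
TAG_WEIGHTS = {"article": 1.0, "p": 0.8, "div": 0.5, "nav": 0.1}
METRIC_WEIGHTS = {"text_density": 0.4, "link_density": 0.3, "tag_weight": 0.3}


@dataclass
class NodeStats:
    tag: str
    text_chars: int       # characters of visible text in the node
    markup_chars: int     # characters of the node's full HTML
    link_text_chars: int  # characters of visible text inside <a> tags


def composite_score(node: NodeStats) -> float:
    """Combine text density, link density, and tag importance into one score."""
    text_density = node.text_chars / max(node.markup_chars, 1)
    link_density = node.link_text_chars / max(node.text_chars, 1)
    tag_weight = TAG_WEIGHTS.get(node.tag, 0.3)
    return (
        METRIC_WEIGHTS["text_density"] * text_density
        + METRIC_WEIGHTS["link_density"] * (1.0 - link_density)  # fewer links scores higher
        + METRIC_WEIGHTS["tag_weight"] * tag_weight
    )


def keep_node(node: NodeStats, threshold: float = 0.48, threshold_type: str = "fixed") -> bool:
    """Return True if the node survives pruning."""
    if threshold_type == "dynamic":
        link_density = node.link_text_chars / max(node.text_chars, 1)
        tag_weight = TAG_WEIGHTS.get(node.tag, 0.3)
        # Assumed adjustment: important tags get a lower bar, link-heavy nodes a higher one.
        threshold = threshold * (1.0 - 0.2 * tag_weight) * (1.0 + 0.5 * link_density)
    return composite_score(node) >= threshold


paragraph = NodeStats(tag="p", text_chars=400, markup_chars=500, link_text_chars=20)
menu = NodeStats(tag="nav", text_chars=100, markup_chars=600, link_text_chars=95)
print("paragraph kept:", keep_node(paragraph))
print("menu kept:", keep_node(menu))
```

With these made-up numbers, the text-heavy paragraph stays and the link-heavy navigation block is pruned under both threshold types.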
## BM25 Algorithm
The `BM25ContentFilter` uses the BM25 algorithm, a ranking function used in information retrieval to estimate the relevance of documents to a given search query. In Crawl4AI, this algorithm helps to identify and extract text chunks that are most relevant to the page's metadata or a user-specified query.
### Usage
To use the `BM25ContentFilter`, initialize it and then pass it as the `extraction_strategy` parameter to the `arun` method of the crawler.
```python
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.content_filter_strategy import BM25ContentFilter

async def filter_content(url, query=None):
    async with AsyncWebCrawler() as crawler:
        content_filter = BM25ContentFilter(user_query=query)
        # Set the fit_markdown flag to True to trigger BM25 filtering
        result = await crawler.arun(url=url, extraction_strategy=content_filter, fit_markdown=True)
        if result.success:
            print(f"Filtered Content (JSON):\n{result.extracted_content}")
            print(f"\nFiltered Markdown:\n{result.fit_markdown}")  # New field in CrawlResult object
            # New field in CrawlResult object. Note that raw HTML may have tags re-organized due to internal parsing.
            print(f"\nFiltered HTML:\n{result.fit_html}")
        else:
            print("Error:", result.error_message)

# Example usage:
asyncio.run(filter_content("https://en.wikipedia.org/wiki/Apple", "fruit nutrition health"))  # with query
asyncio.run(filter_content("https://en.wikipedia.org/wiki/Apple"))  # without query, metadata will be used as the query.
```
### Parameters
- **`user_query`**: (Optional) A string representing the search query. If not provided, the filter extracts relevant metadata (title, description, keywords) from the page and uses that as the query.
- **`bm25_threshold`**: (Optional, default 1.0) A float value that controls the threshold for relevance. Higher values result in stricter filtering, returning only the most relevant text chunks. Lower values result in more lenient filtering.
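
For intuition about how a relevance threshold behaves, the following is a minimal, self-contained sketch of Okapi BM25 scoring over text chunks. It uses the textbook formula with common default values `k1=1.5` and `b=0.75`, and it may differ from the exact variant and tokenization the library uses. The function names `tokenize`, `bm25_scores`, and `filter_chunks` are hypothetical.

```python
import math
import re
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(chunks: List[str], query: str, k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score each chunk against the query with the standard Okapi BM25 formula."""
    docs = [tokenize(chunk) for chunk in chunks]
    n_docs = len(docs)
    avg_len = (sum(len(d) for d in docs) / max(n_docs, 1)) or 1.0
    # Document frequency: the number of chunks in which each term appears.
    df = Counter(term for d in docs for term in set(d))
    query_terms = set(tokenize(query))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1.0 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1.0) / norm
        scores.append(score)
    return scores


def filter_chunks(chunks: List[str], query: str, bm25_threshold: float = 1.0) -> List[str]:
    """Keep only the chunks whose BM25 score reaches the threshold."""
    return [c for c, s in zip(chunks, bm25_scores(chunks, query)) if s >= bm25_threshold]


chunks = [
    "Apples are a nutritious fruit rich in fiber and vitamin C, with well studied health benefits.",
    "Edit this page. View history. Log in. Create account.",
    "The apple tree is cultivated worldwide and originated in Central Asia.",
]
kept = filter_chunks(chunks, "fruit nutrition health")
print(kept)
```

Raising `bm25_threshold` in this sketch drops more chunks and lowering it keeps more, which matches the stricter versus more lenient behavior described above.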
## Fit Markdown Flag
Setting the `fit_markdown` flag to `True` in the `arun` method activates BM25 content filtering during the crawl. The flag instructs the scraper to extract and clean the HTML, primarily to prepare the content for a Large Language Model that cannot process large amounts of data. Besides improving the quality of the extracted content, it adds the filtered content to two new attributes of the returned `CrawlResult` object: `fit_markdown` and `fit_html`.
## Custom Content Filtering Strategies
You can create your own custom filtering strategies by inheriting from the `RelevantContentFilter` class and implementing the `filter_content` method. This allows you to tailor the filtering logic to your specific needs.
```python
from crawl4ai import AsyncWebCrawler
from crawl4ai.content_filter_strategy import RelevantContentFilter
from bs4 import BeautifulSoup, Tag
from typing import List

class MyCustomFilter(RelevantContentFilter):
    def filter_content(self, html: str) -> List[str]:
        soup = BeautifulSoup(html, 'lxml')
        # Implement custom filtering logic here
        # Example: extract all paragraphs within divs with class "article-body"
        filtered_paragraphs = []
        for tag in soup.select("div.article-body p"):
            if isinstance(tag, Tag):
                filtered_paragraphs.append(str(tag))  # Add the cleaned HTML element.
        return filtered_paragraphs

async def custom_filter_demo(url: str):
    async with AsyncWebCrawler() as crawler:
        custom_filter = MyCustomFilter()
        result = await crawler.arun(url, extraction_strategy=custom_filter)
        if result.success:
            print(result.extracted_content)
```
This example demonstrates extracting paragraphs from a specific div class. You can customize this logic to implement different filtering strategies, use regular expressions, analyze text density, or apply other relevant techniques.
## Conclusion
Content filtering strategies provide a powerful way to refine the output of your crawls. By using `BM25ContentFilter` or creating custom strategies, you can focus on the most pertinent information and improve the efficiency of your data processing pipeline.


@@ -7,66 +7,325 @@ Crawl4AI provides official Docker images for easy deployment and scalability. Th
Pull and run the basic version:
```bash
# Basic run without security
docker pull unclecode/crawl4ai:basic
docker run -p 11235:11235 unclecode/crawl4ai:basic
# Run with API security enabled
docker run -p 11235:11235 -e CRAWL4AI_API_TOKEN=your_secret_token unclecode/crawl4ai:basic
```
Test the deployment:
## Running with Docker Compose 🐳
### Use Docker Compose (From Local Dockerfile or Docker Hub)
Crawl4AI provides flexibility to use Docker Compose for managing your containerized services. You can either build the image locally from the provided `Dockerfile` or use the pre-built image from Docker Hub.
### **Option 1: Using Docker Compose to Build Locally**
If you want to build the image locally, use the provided `docker-compose.local.yml` file.
```bash
docker-compose -f docker-compose.local.yml up -d
```
This will:
1. Build the Docker image from the provided `Dockerfile`.
2. Start the container and expose it on `http://localhost:11235`.
---
### **Option 2: Using Docker Compose with Pre-Built Image from Hub**
If you prefer using the pre-built image on Docker Hub, use the `docker-compose.hub.yml` file.
```bash
docker-compose -f docker-compose.hub.yml up -d
```
This will:
1. Pull the pre-built image `unclecode/crawl4ai:basic` (or `all`, depending on your configuration).
2. Start the container and expose it on `http://localhost:11235`.
---
### **Stopping the Running Services**
To stop the services started via Docker Compose, you can use:
```bash
docker-compose -f docker-compose.local.yml down
# OR
docker-compose -f docker-compose.hub.yml down
```
If the containers don't stop and the application is still running, check the running containers:
```bash
docker ps
```
Find the `CONTAINER ID` of the running service and stop it forcefully:
```bash
docker stop <CONTAINER_ID>
```
---
### **Debugging with Docker Compose**
- **Check Logs**: To view the container logs:
```bash
docker-compose -f docker-compose.local.yml logs -f
```
- **Remove Orphaned Containers**: If the service is still running unexpectedly:
```bash
docker-compose -f docker-compose.local.yml down --remove-orphans
```
- **Manually Remove Network**: If the network is still in use:
```bash
docker network ls
docker network rm crawl4ai_default
```
---
### Why Use Docker Compose?
Docker Compose is the recommended way to deploy Crawl4AI because:
1. It simplifies multi-container setups.
2. Allows you to define environment variables, resources, and ports in a single file.
3. Makes it easier to switch between local development and production-ready images.
For example, your `docker-compose.yml` could include API keys, token settings, and memory limits, making deployment quick and consistent.
## API Security 🔒
### Understanding CRAWL4AI_API_TOKEN
The `CRAWL4AI_API_TOKEN` provides optional security for your Crawl4AI instance:
- If `CRAWL4AI_API_TOKEN` is set: All API endpoints (except `/health`) require authentication
- If `CRAWL4AI_API_TOKEN` is not set: The API is publicly accessible
```bash
# Secured Instance
docker run -p 11235:11235 -e CRAWL4AI_API_TOKEN=your_secret_token unclecode/crawl4ai:all
# Unsecured Instance
docker run -p 11235:11235 unclecode/crawl4ai:all
```
### Making API Calls
For secured instances, include the token in all requests:
```python
import requests
# Test health endpoint
health = requests.get("http://localhost:11235/health")
print("Health check:", health.json())
# Setup headers if token is being used
api_token = "your_secret_token" # Same token set in CRAWL4AI_API_TOKEN
headers = {"Authorization": f"Bearer {api_token}"} if api_token else {}
# Test basic crawl
# Making authenticated requests
response = requests.post(
"http://localhost:11235/crawl",
headers=headers,
json={
"urls": "https://www.nbcnews.com/business",
"urls": "https://example.com",
"priority": 10
}
)
# Checking task status
task_id = response.json()["task_id"]
print("Task ID:", task_id)
status = requests.get(
f"http://localhost:11235/task/{task_id}",
headers=headers
)
```
## Available Images 🏷️
### Using with Docker Compose
- `unclecode/crawl4ai:basic` - Basic web crawling capabilities
- `unclecode/crawl4ai:all` - Full installation with all features
- `unclecode/crawl4ai:gpu` - GPU-enabled version for ML features
In your `docker-compose.yml`:
```yaml
services:
  crawl4ai:
    image: unclecode/crawl4ai:all
    environment:
      - CRAWL4AI_API_TOKEN=${CRAWL4AI_API_TOKEN:-}  # Optional
    # ... other configuration
```
Then either:
1. Set in `.env` file:
```env
CRAWL4AI_API_TOKEN=your_secret_token
```
2. Or set via command line:
```bash
CRAWL4AI_API_TOKEN=your_secret_token docker-compose up
```
> **Security Note**: If you enable the API token, make sure to keep it secure and never commit it to version control. The token will be required for all API endpoints except the health check endpoint (`/health`).
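
One way to keep the token out of source code is to read it from the environment on the client side as well. The sketch below shows that idea. `build_auth_headers` is a hypothetical helper, and it assumes the client process sees the same `CRAWL4AI_API_TOKEN` variable as the container.

```python
import os
from typing import Dict, Mapping, Optional


def build_auth_headers(env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Hypothetical helper: return a Bearer header if CRAWL4AI_API_TOKEN is set, else no headers."""
    env = os.environ if env is None else env
    token = env.get("CRAWL4AI_API_TOKEN", "").strip()
    return {"Authorization": f"Bearer {token}"} if token else {}


# Explicit mappings are passed here so the demo does not depend on your shell environment.
print(build_auth_headers({"CRAWL4AI_API_TOKEN": "your_secret_token"}))
print(build_auth_headers({}))
```

The result can be passed as `headers=build_auth_headers()` to `requests.post(...)`. An unsecured instance then receives no `Authorization` header at all.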
## Configuration Options 🔧
### Environment Variables
You can configure the service using environment variables:
```bash
# Basic configuration
docker run -p 11235:11235 \
-e MAX_CONCURRENT_TASKS=5 \
-e OPENAI_API_KEY=your_key \
unclecode/crawl4ai:all
```
### Volume Mounting
Mount a directory for persistent data:
```bash
# With security and LLM support
docker run -p 11235:11235 \
-v $(pwd)/data:/app/data \
-e CRAWL4AI_API_TOKEN=your_secret_token \
-e OPENAI_API_KEY=sk-... \
-e ANTHROPIC_API_KEY=sk-ant-... \
unclecode/crawl4ai:all
```
### Resource Limits
### Using Docker Compose (Recommended) 🐳
Control container resources:
Create a `docker-compose.yml`:
```yaml
version: '3.8'
services:
  crawl4ai:
    image: unclecode/crawl4ai:all
    ports:
      - "11235:11235"
    environment:
      - CRAWL4AI_API_TOKEN=${CRAWL4AI_API_TOKEN:-}  # Optional API security
      - MAX_CONCURRENT_TASKS=5
      # LLM Provider Keys
      - OPENAI_API_KEY=${OPENAI_API_KEY:-}
      - ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY:-}
    volumes:
      - /dev/shm:/dev/shm
    deploy:
      resources:
        limits:
          memory: 4G
        reservations:
          memory: 1G
```
You can run it in two ways:
1. Using environment variables directly:
```bash
docker run -p 11235:11235 \
--memory=4g \
--cpus=2 \
unclecode/crawl4ai:all
CRAWL4AI_API_TOKEN=secret123 OPENAI_API_KEY=sk-... docker-compose up
```
2. Using a `.env` file (recommended):
Create a `.env` file in the same directory:
```env
# API Security (optional)
CRAWL4AI_API_TOKEN=your_secret_token
# LLM Provider Keys
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
# Other Configuration
MAX_CONCURRENT_TASKS=5
```
Then simply run:
```bash
docker-compose up
```
### Testing the Deployment 🧪
```python
import requests

# For unsecured instances
def test_unsecured():
    # Health check
    health = requests.get("http://localhost:11235/health")
    print("Health check:", health.json())

    # Basic crawl
    response = requests.post(
        "http://localhost:11235/crawl",
        json={
            "urls": "https://www.nbcnews.com/business",
            "priority": 10
        }
    )
    task_id = response.json()["task_id"]
    print("Task ID:", task_id)

# For secured instances
def test_secured(api_token):
    headers = {"Authorization": f"Bearer {api_token}"}

    # Basic crawl with authentication
    response = requests.post(
        "http://localhost:11235/crawl",
        headers=headers,
        json={
            "urls": "https://www.nbcnews.com/business",
            "priority": 10
        }
    )
    task_id = response.json()["task_id"]
    print("Task ID:", task_id)
```
### LLM Extraction Example 🤖
When you've configured your LLM provider keys (via environment variables or `.env`), you can use LLM extraction:
```python
import requests

request = {
    "urls": "https://example.com",
    "extraction_config": {
        "type": "llm",
        "params": {
            "provider": "openai/gpt-4",
            "instruction": "Extract main topics from the page"
        }
    }
}

# Make the request (add headers if using API security)
response = requests.post("http://localhost:11235/crawl", json=request)
```
> **Note**: Remember to add `.env` to your `.gitignore` to keep your API keys secure!
## Usage Examples 📝
### Basic Crawling


@@ -0,0 +1,148 @@
# Download Handling in Crawl4AI
This guide explains how to use Crawl4AI to handle file downloads during crawling. You'll learn how to trigger downloads, specify download locations, and access downloaded files.
## Enabling Downloads
By default, Crawl4AI does not download files. To enable downloads, set the `accept_downloads` parameter to `True` in either the `AsyncWebCrawler` constructor or the `arun` method.
```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler(accept_downloads=True) as crawler:  # Globally enable downloads
        # ... your crawling logic ...
        pass

asyncio.run(main())
```
Or, enable it for a specific crawl:
```python
async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="...", accept_downloads=True)
        # ...
```
## Specifying Download Location
You can specify the download directory using the `downloads_path` parameter. If not provided, Crawl4AI creates a "downloads" directory inside the `.crawl4ai` folder in your home directory.
```python
import os
from pathlib import Path
# ... inside your crawl function:
downloads_path = os.path.join(os.getcwd(), "my_downloads") # Custom download path
os.makedirs(downloads_path, exist_ok=True)
result = await crawler.arun(url="...", downloads_path=downloads_path, accept_downloads=True)
# ...
```
If you are setting it globally, provide the path to the AsyncWebCrawler:
```python
async def crawl_with_downloads(url: str, download_path: str):
    async with AsyncWebCrawler(
        accept_downloads=True,
        downloads_path=download_path,  # or set it on arun
        verbose=True
    ) as crawler:
        result = await crawler.arun(url=url)  # you still need to enable downloads per call.
        # ...
```
## Triggering Downloads
Downloads are typically triggered by user interactions on a web page (e.g., clicking a download button). You can simulate these actions with the `js_code` parameter, injecting JavaScript code to be executed within the browser context. The `wait_for` parameter might also be crucial to allowing sufficient time for downloads to initiate before the crawler proceeds.
```python
result = await crawler.arun(
    url="https://www.python.org/downloads/",
    js_code="""
        // Find and click the first Windows installer link
        const downloadLink = document.querySelector('a[href$=".exe"]');
        if (downloadLink) {
            downloadLink.click();
        }
    """,
    wait_for=5  # Wait for 5 seconds for the download to start
)
```
## Accessing Downloaded Files
Downloaded file paths are stored in the `downloaded_files` attribute of the returned `CrawlResult` object. This is a list of strings, with each string representing the absolute path to a downloaded file.
```python
if result.downloaded_files:
    print("Downloaded files:")
    for file_path in result.downloaded_files:
        print(f"- {file_path}")
        # Perform operations with downloaded files, e.g., check file size
        file_size = os.path.getsize(file_path)
        print(f"- File size: {file_size} bytes")
else:
    print("No files downloaded.")
```
## Example: Downloading Multiple Files
```python
import asyncio
import os
from pathlib import Path
from crawl4ai import AsyncWebCrawler

async def download_multiple_files(url: str, download_path: str):
    async with AsyncWebCrawler(
        accept_downloads=True,
        downloads_path=download_path,
        verbose=True
    ) as crawler:
        result = await crawler.arun(
            url=url,
            js_code="""
                // Trigger multiple downloads (example)
                const downloadLinks = document.querySelectorAll('a[download]'); // Or a more specific selector
                for (const link of downloadLinks) {
                    link.click();
                    await new Promise(r => setTimeout(r, 2000)); // Add a small delay between clicks if needed
                }
            """,
            wait_for=10  # Adjust the timeout to match the expected time for all downloads to start
        )

        if result.downloaded_files:
            print("Downloaded files:")
            for file in result.downloaded_files:
                print(f"- {file}")
        else:
            print("No files downloaded.")

# Example usage
download_path = os.path.join(Path.home(), ".crawl4ai", "downloads")
os.makedirs(download_path, exist_ok=True)  # Create directory if it doesn't exist
asyncio.run(download_multiple_files("https://www.python.org/downloads/windows/", download_path))
```
## Important Considerations
- **Browser Context:** Downloads are managed within the browser context. Ensure your `js_code` correctly targets the download triggers on the specific web page.
- **Waiting:** Use `wait_for` to give downloads time to start when they do not begin immediately.
- **Error Handling:** Implement proper error handling to gracefully manage failed downloads or incorrect file paths.
- **Security:** Downloaded files should be scanned for potential security threats before use.
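
As a starting point for the error-handling and security notes above, the sketch below checks each reported path and fingerprints the file before use. `verify_downloads` is a hypothetical helper, not part of Crawl4AI. It assumes the paths come from `result.downloaded_files` and that files are expected to live under a known downloads directory. A SHA-256 digest is only a fingerprint for comparing against a known-good value and does not replace a malware scan.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Dict, Iterable


def verify_downloads(file_paths: Iterable[str], downloads_dir: str) -> Dict[str, str]:
    """Hypothetical helper: map each valid downloaded file to its SHA-256 hex digest."""
    root = Path(downloads_dir).resolve()
    digests: Dict[str, str] = {}
    for raw_path in file_paths:
        path = Path(raw_path).resolve()
        # Skip paths that are missing or that fall outside the expected directory.
        if not path.is_file() or root not in path.parents:
            print(f"Skipping unexpected path: {raw_path}")
            continue
        sha256 = hashlib.sha256()
        with open(path, "rb") as fh:
            # Read in blocks so large downloads are not loaded into memory at once.
            for block in iter(lambda: fh.read(65536), b""):
                sha256.update(block)
        digests[str(path)] = sha256.hexdigest()
    return digests


# Demo with a temporary directory standing in for the real downloads folder.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.bin"
    sample.write_bytes(b"hello")
    demo = verify_downloads([str(sample), str(Path(tmp) / "missing.bin")], tmp)
    print(demo)
```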
This guide provides a foundation for handling downloads with Crawl4AI. You can adapt these techniques to manage downloads in various scenarios and integrate them into more complex crawling workflows.


@@ -58,6 +58,51 @@ crawl4ai-download-models
This is optional but will boost the performance and speed of the crawler. You only need to do this once after installation.
## Playwright Installation Note for Ubuntu
If you encounter issues with Playwright installation on Ubuntu, you may need to install additional dependencies:
```bash
sudo apt-get install -y \
libwoff1 \
libopus0 \
libwebp7 \
libwebpdemux2 \
libenchant-2-2 \
libgudev-1.0-0 \
libsecret-1-0 \
libhyphen0 \
libgdk-pixbuf2.0-0 \
libegl1 \
libnotify4 \
libxslt1.1 \
libevent-2.1-7 \
libgles2 \
libxcomposite1 \
libatk1.0-0 \
libatk-bridge2.0-0 \
libepoxy0 \
libgtk-3-0 \
libharfbuzz-icu0 \
libgstreamer-gl1.0-0 \
libgstreamer-plugins-bad1.0-0 \
gstreamer1.0-plugins-good \
gstreamer1.0-plugins-bad \
libxt6 \
libxaw7 \
xvfb \
fonts-noto-color-emoji \
libfontconfig \
libfreetype6 \
xfonts-cyrillic \
xfonts-scalable \
fonts-liberation \
fonts-ipafont-gothic \
fonts-wqy-zenhei \
fonts-tlwg-loma-otf \
fonts-freefont-ttf
```
## Option 2: Using Docker (Coming Soon)
Docker support for Crawl4AI is currently in progress and will be available soon. This will allow you to run Crawl4AI in a containerized environment, ensuring consistency across different systems.


@@ -0,0 +1,235 @@
# Prefix-Based Input Handling in Crawl4AI
This guide will walk you through using the Crawl4AI library to crawl web pages, local HTML files, and raw HTML strings. We'll demonstrate these capabilities using a Wikipedia page as an example.
## Table of Contents
- [Prefix-Based Input Handling in Crawl4AI](#prefix-based-input-handling-in-crawl4ai)
- [Table of Contents](#table-of-contents)
- [Crawling a Web URL](#crawling-a-web-url)
- [Crawling a Local HTML File](#crawling-a-local-html-file)
- [Crawling Raw HTML Content](#crawling-raw-html-content)
- [Complete Example](#complete-example)
- [**How It Works**](#how-it-works)
- [**Running the Example**](#running-the-example)
- [Conclusion](#conclusion)
---
### Crawling a Web URL
To crawl a live web page, provide the URL starting with `http://` or `https://`.
```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def crawl_web():
    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(url="https://en.wikipedia.org/wiki/apple", bypass_cache=True)
        if result.success:
            print("Markdown Content:")
            print(result.markdown)
        else:
            print(f"Failed to crawl: {result.error_message}")

asyncio.run(crawl_web())
```
### Crawling a Local HTML File
To crawl a local HTML file, prefix the file path with `file://`.
```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def crawl_local_file():
    local_file_path = "/path/to/apple.html"  # Replace with your file path
    file_url = f"file://{local_file_path}"

    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(url=file_url, bypass_cache=True)
        if result.success:
            print("Markdown Content from Local File:")
            print(result.markdown)
        else:
            print(f"Failed to crawl local file: {result.error_message}")

asyncio.run(crawl_local_file())
```
### Crawling Raw HTML Content
To crawl raw HTML content, prefix the HTML string with `raw:`.
```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def crawl_raw_html():
    raw_html = "<html><body><h1>Hello, World!</h1></body></html>"
    raw_html_url = f"raw:{raw_html}"

    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(url=raw_html_url, bypass_cache=True)
        if result.success:
            print("Markdown Content from Raw HTML:")
            print(result.markdown)
        else:
            print(f"Failed to crawl raw HTML: {result.error_message}")

asyncio.run(crawl_raw_html())
```
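
The three prefixes above suggest a simple dispatch on the start of the `url` string. The sketch below illustrates that idea. It is not the library's internal code: `classify_input` and `strip_prefix` are hypothetical names, and raising `ValueError` for unknown prefixes is an assumption.

```python
def classify_input(url: str) -> str:
    """Hypothetical dispatcher: decide how a `url` argument would be treated."""
    if url.startswith(("http://", "https://")):
        return "web"
    if url.startswith("file://"):
        return "file"
    if url.startswith("raw:"):
        return "raw"
    raise ValueError(f"Unsupported input (expected http(s)://, file://, or raw:): {url[:30]!r}")


def strip_prefix(url: str) -> str:
    """Return the local path or HTML payload behind a file:// or raw: prefix."""
    kind = classify_input(url)
    if kind == "file":
        return url[len("file://"):]
    if kind == "raw":
        return url[len("raw:"):]
    return url  # Web URLs are passed through unchanged.


print(classify_input("https://en.wikipedia.org/wiki/apple"))  # web
print(strip_prefix("file:///path/to/apple.html"))             # /path/to/apple.html
print(strip_prefix("raw:<h1>Hello, World!</h1>"))             # <h1>Hello, World!</h1>
```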
---
## Complete Example
Below is a comprehensive script that:
1. **Crawls the Wikipedia page for "Apple".**
2. **Saves the HTML content to a local file (`apple.html`).**
3. **Crawls the local HTML file and verifies the markdown length matches the original crawl.**
4. **Crawls the raw HTML content from the saved file and verifies consistency.**
```python
import os
import sys
import asyncio
from pathlib import Path

# Adjust the parent directory to include the crawl4ai module
parent_dir = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
sys.path.append(parent_dir)

from crawl4ai import AsyncWebCrawler

async def main():
    # Define the URL to crawl
    wikipedia_url = "https://en.wikipedia.org/wiki/apple"

    # Define the path to save the HTML file
    # Save the file in the same directory as the script
    script_dir = Path(__file__).parent
    html_file_path = script_dir / "apple.html"

    async with AsyncWebCrawler(verbose=True) as crawler:
        print("\n=== Step 1: Crawling the Wikipedia URL ===")
        # Crawl the Wikipedia URL
        result = await crawler.arun(url=wikipedia_url, bypass_cache=True)

        # Check if crawling was successful
        if not result.success:
            print(f"Failed to crawl {wikipedia_url}: {result.error_message}")
            return

        # Save the HTML content to a local file
        with open(html_file_path, 'w', encoding='utf-8') as f:
            f.write(result.html)
        print(f"Saved HTML content to {html_file_path}")

        # Store the length of the generated markdown
        web_crawl_length = len(result.markdown)
        print(f"Length of markdown from web crawl: {web_crawl_length}\n")

        print("=== Step 2: Crawling from the Local HTML File ===")
        # Construct the file URL with 'file://' prefix
        file_url = f"file://{html_file_path.resolve()}"

        # Crawl the local HTML file
        local_result = await crawler.arun(url=file_url, bypass_cache=True)

        # Check if crawling was successful
        if not local_result.success:
            print(f"Failed to crawl local file {file_url}: {local_result.error_message}")
            return

        # Store the length of the generated markdown from local file
        local_crawl_length = len(local_result.markdown)
        print(f"Length of markdown from local file crawl: {local_crawl_length}")

        # Compare the lengths
        assert web_crawl_length == local_crawl_length, (
            f"Markdown length mismatch: Web crawl ({web_crawl_length}) != Local file crawl ({local_crawl_length})"
        )
        print("✅ Markdown length matches between web crawl and local file crawl.\n")

        print("=== Step 3: Crawling Using Raw HTML Content ===")
        # Read the HTML content from the saved file
        with open(html_file_path, 'r', encoding='utf-8') as f:
            raw_html_content = f.read()

        # Prefix the raw HTML content with 'raw:'
        raw_html_url = f"raw:{raw_html_content}"

        # Crawl using the raw HTML content
        raw_result = await crawler.arun(url=raw_html_url, bypass_cache=True)

        # Check if crawling was successful
        if not raw_result.success:
            print(f"Failed to crawl raw HTML content: {raw_result.error_message}")
            return

        # Store the length of the generated markdown from raw HTML
        raw_crawl_length = len(raw_result.markdown)
        print(f"Length of markdown from raw HTML crawl: {raw_crawl_length}")

        # Compare the lengths
        assert web_crawl_length == raw_crawl_length, (
            f"Markdown length mismatch: Web crawl ({web_crawl_length}) != Raw HTML crawl ({raw_crawl_length})"
        )
        print("✅ Markdown length matches between web crawl and raw HTML crawl.\n")

        print("All tests passed successfully!")

    # Clean up by removing the saved HTML file
    if html_file_path.exists():
        os.remove(html_file_path)
        print(f"Removed the saved HTML file: {html_file_path}")

# Run the main function
if __name__ == "__main__":
    asyncio.run(main())
```
### **How It Works**
1. **Step 1: Crawl the Web URL**
   - Crawls `https://en.wikipedia.org/wiki/apple`.
   - Saves the HTML content to `apple.html`.
   - Records the length of the generated markdown.
2. **Step 2: Crawl from the Local HTML File**
   - Uses the `file://` prefix to crawl `apple.html`.
   - Ensures the markdown length matches the original web crawl.
3. **Step 3: Crawl Using Raw HTML Content**
   - Reads the HTML from `apple.html`.
   - Prefixes it with `raw:` and crawls.
   - Verifies the markdown length matches the previous results.
4. **Cleanup**
   - Deletes the `apple.html` file after testing.
### **Running the Example**
1. **Save the Script:**
   - Save the above code as `test_crawl4ai.py` in your project directory.
2. **Execute the Script:**
   - Run the script using:
     ```bash
     python test_crawl4ai.py
     ```
3. **Observe the Output:**
   - The script will print logs detailing each step.
   - Assertions ensure consistency across different crawling methods.
   - Upon success, it confirms that all markdown lengths match.
---
## Conclusion
With the new prefix-based input handling in **Crawl4AI**, you can effortlessly crawl web URLs, local HTML files, and raw HTML strings using a unified `url` parameter. This enhancement simplifies the API usage and provides greater flexibility for diverse crawling scenarios.


@@ -8,7 +8,7 @@ First, let's import the necessary modules and create an instance of `AsyncWebCra
```python
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai import AsyncWebCrawler, CacheMode
async def main():
async with AsyncWebCrawler(verbose=True) as crawler:
@@ -42,7 +42,7 @@ async def capture_and_save_screenshot(url: str, output_path: str):
result = await crawler.arun(
url=url,
screenshot=True,
bypass_cache=True
cache_mode=CacheMode.BYPASS
)
if result.success and result.screenshot:
@@ -62,15 +62,15 @@ Crawl4AI supports multiple browser engines. Here's how to use different browsers
```python
# Use Firefox
async with AsyncWebCrawler(browser_type="firefox", verbose=True, headless=True) as crawler:
result = await crawler.arun(url="https://www.example.com", bypass_cache=True)
result = await crawler.arun(url="https://www.example.com", cache_mode=CacheMode.BYPASS)
# Use WebKit
async with AsyncWebCrawler(browser_type="webkit", verbose=True, headless=True) as crawler:
result = await crawler.arun(url="https://www.example.com", bypass_cache=True)
result = await crawler.arun(url="https://www.example.com", cache_mode=CacheMode.BYPASS)
# Use Chromium (default)
async with AsyncWebCrawler(verbose=True, headless=True) as crawler:
result = await crawler.arun(url="https://www.example.com", bypass_cache=True)
result = await crawler.arun(url="https://www.example.com", cache_mode=CacheMode.BYPASS)
```
### User Simulation 🎭
@@ -81,7 +81,7 @@ Simulate real user behavior to avoid detection:
async with AsyncWebCrawler(verbose=True, headless=True) as crawler:
result = await crawler.arun(
url="YOUR-URL-HERE",
bypass_cache=True,
cache_mode=CacheMode.BYPASS,
simulate_user=True, # Causes random mouse movements and clicks
override_navigator=True # Makes the browser appear more like a real user
)
@@ -99,7 +99,7 @@ async def main():
print(f"First crawl result: {result1.markdown[:100]}...")
# Force to crawl again
result2 = await crawler.arun(url="https://www.nbcnews.com/business", bypass_cache=True)
result2 = await crawler.arun(url="https://www.nbcnews.com/business", cache_mode=CacheMode.BYPASS)
print(f"Second crawl result: {result2.markdown[:100]}...")
asyncio.run(main())
@@ -189,7 +189,7 @@ extraction_strategy = LLMExtractionStrategy(
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(
url="https://paulgraham.com/love.html",
bypass_cache=True,
cache_mode=CacheMode.BYPASS,
extraction_strategy=extraction_strategy
)
```
@@ -239,7 +239,7 @@ async def crawl_dynamic_content():
js_code=js_next_page if page > 0 else None,
wait_for=wait_for if page > 0 else None,
js_only=page > 0,
bypass_cache=True,
cache_mode=CacheMode.BYPASS,
headless=False,
)
@@ -254,7 +254,7 @@ Remove overlay elements and fit content appropriately:
async with AsyncWebCrawler(headless=False) as crawler:
result = await crawler.arun(
url="your-url-here",
bypass_cache=True,
cache_mode=CacheMode.BYPASS,
word_count_threshold=10,
remove_overlay_elements=True,
screenshot=True
@@ -282,7 +282,7 @@ async with AsyncWebCrawler() as crawler:
result = await crawler.arun(
url="https://www.nbcnews.com/business",
word_count_threshold=0,
bypass_cache=True,
cache_mode=CacheMode.BYPASS,
verbose=False,
)
end = time.time()


@@ -12,7 +12,9 @@ from crawl4ai import AsyncWebCrawler
async def main():
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(url="https://example.com")
result = await crawler.arun(
url="https://example.com"
)
print(result.markdown) # Print clean markdown content
if __name__ == "__main__":
@@ -24,7 +26,7 @@ if __name__ == "__main__":
The `arun()` method returns a `CrawlResult` object with several useful properties. Here's a quick overview (see [CrawlResult](../api/crawl-result.md) for complete details):
```python
result = await crawler.arun(url="https://example.com")
result = await crawler.arun(url="https://example.com", fit_markdown=True)
# Different content formats
print(result.html) # Raw HTML
@@ -81,7 +83,7 @@ Here's a more comprehensive example showing common usage patterns:
```python
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai import AsyncWebCrawler, CacheMode
async def main():
async with AsyncWebCrawler(verbose=True) as crawler:
@@ -97,7 +99,7 @@ async def main():
remove_overlay_elements=True,
# Cache control
bypass_cache=False # Use cache if available
cache_mode=CacheMode.ENABLED # Use cache if available
)
if result.success:

47
docs/md_v2/blog/index.md Normal file

@@ -0,0 +1,47 @@
# Crawl4AI Blog
Welcome to the Crawl4AI blog! Here you'll find detailed release notes, technical insights, and updates about the project. Whether you're looking for the latest improvements or want to dive deep into web crawling techniques, this is the place.
## Latest Release
### [0.4.2 - Configurable Crawlers, Session Management, and Smarter Screenshots](releases/0.4.2.md)
*December 12, 2024*
The 0.4.2 update brings massive improvements to configuration, making crawlers and browsers easier to manage with dedicated objects. You can now import/export local storage for seamless session management. Plus, long-page screenshots are faster and cleaner, and full-page PDF exports are now possible. Check out all the new features to make your crawling experience even smoother.
[Read full release notes →](releases/0.4.2.md)
---
### [0.4.1 - Smarter Crawling with Lazy-Load Handling, Text-Only Mode, and More](releases/0.4.1.md)
*December 8, 2024*
This release brings major improvements to handling lazy-loaded images, a blazing-fast Text-Only Mode, full-page scanning for infinite scrolls, dynamic viewport adjustments, and session reuse for efficient crawling. If you're looking to improve speed, reliability, or handle dynamic content with ease, this update has you covered.
[Read full release notes →](releases/0.4.1.md)
---
### [0.4.0 - Major Content Filtering Update](releases/0.4.0.md)
*December 1, 2024*
Introduced significant improvements to content filtering, multi-threaded environment handling, and user-agent generation. This release features the new PruningContentFilter, enhanced thread safety, and improved test coverage.
[Read full release notes →](releases/0.4.0.md)
## Project History
Curious about how Crawl4AI has evolved? Check out our [complete changelog](https://github.com/unclecode/crawl4ai/blob/main/CHANGELOG.md) for a detailed history of all versions and updates.
## Categories
- [Technical Deep Dives](/blog/technical) - Coming soon
- [Tutorials & Guides](/blog/tutorials) - Coming soon
- [Community Updates](/blog/community) - Coming soon
## Stay Updated
- Star us on [GitHub](https://github.com/unclecode/crawl4ai)
- Follow [@unclecode](https://twitter.com/unclecode) on Twitter
- Join our community discussions on GitHub


@@ -0,0 +1,62 @@
# Release Summary for Version 0.4.0 (December 1, 2024)
## Overview
The 0.4.0 release introduces significant improvements to content filtering, multi-threaded environment handling, user-agent generation, and test coverage. Key highlights include the introduction of the PruningContentFilter, designed to automatically identify and extract the most valuable parts of an HTML document, as well as enhancements to the BM25ContentFilter to extend its versatility and effectiveness.
## Major Features and Enhancements
### 1. PruningContentFilter
- Introduced a new unsupervised content filtering strategy that scores and prunes less relevant nodes in an HTML document based on metrics like text and link density.
- Focuses on retaining the most valuable parts of the content, making it highly effective for extracting relevant information from complex web pages.
- Fully documented with updated README and expanded user guides.
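
To make the scoring idea concrete, here is a minimal sketch of density-based pruning. It is not the library's implementation: the `Node` structure, the 0.6/0.4 weights, and the `threshold` value are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical, simplified stand-in for an HTML node."""
    tag: str
    text_len: int       # characters of visible text
    markup_len: int     # characters of raw HTML for the node
    link_text_len: int  # characters of text that sit inside <a> tags


def score(node: Node) -> float:
    """Combine text density and link density into a single relevance score."""
    text_density = node.text_len / node.markup_len if node.markup_len else 0.0
    link_density = node.link_text_len / node.text_len if node.text_len else 1.0
    # Illustrative weights: reward dense text, penalise link-heavy nodes.
    return 0.6 * text_density + 0.4 * (1.0 - link_density)


def prune(nodes: list[Node], threshold: float = 0.5) -> list[Node]:
    """Keep only the nodes whose score reaches the threshold."""
    return [n for n in nodes if score(n) >= threshold]


nodes = [
    Node("article", text_len=800, markup_len=1000, link_text_len=40),
    Node("nav", text_len=100, markup_len=600, link_text_len=95),
]
kept = prune(nodes)
print([n.tag for n in kept])
```

In this toy input the text-heavy `article` node survives, while the link-dominated `nav` node falls below the threshold and is dropped.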
### 2. User-Agent Generator
- Added a user-agent generator utility that resolves compatibility issues and supports customizable user-agent strings.
- By default, the generator randomizes user agents for each request, adding diversity, but users can customize it for tailored scenarios.
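
The default-versus-custom behaviour described above might look roughly like the sketch below. The `USER_AGENTS` pool and the `pick_user_agent` helper are hypothetical names used only for illustration; the actual generator's API may differ.

```python
import random

# Hypothetical pool of user-agent strings, for illustration only.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:120.0) Gecko/20100101 Firefox/120.0",
]


def pick_user_agent(custom=None, rng=random):
    """Return the caller's user agent if one is given, else a random one from the pool."""
    return custom if custom else rng.choice(USER_AGENTS)


print(pick_user_agent())             # a random entry from the pool
print(pick_user_agent("MyBot/1.0"))  # the custom value is used as-is
```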
### 3. Enhanced Thread Safety
- Improved handling of multi-threaded environments by adding better thread locks for parallel processing, ensuring consistency and stability when running multiple threads.
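
As a generic illustration of why locks matter here (this is not code from the library; `CrawlStats` is a hypothetical class), shared state updated from several threads stays consistent when each update happens under a `threading.Lock`:

```python
import threading


class CrawlStats:
    """Hypothetical shared counter updated by several worker threads."""

    def __init__(self):
        self._lock = threading.Lock()
        self.pages = 0

    def record_page(self):
        # The lock makes the read-modify-write update atomic across threads.
        with self._lock:
            self.pages += 1


stats = CrawlStats()


def worker(count):
    for _ in range(count):
        stats.record_page()


threads = [threading.Thread(target=worker, args=(1000,)) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(stats.pages)
```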
### 4. Extended Content Filtering Strategies
- Users now have access to both the PruningContentFilter for unsupervised extraction and the BM25ContentFilter for supervised filtering based on user queries.
- Enhanced BM25ContentFilter with improved capabilities to process page titles, meta tags, and descriptions, allowing for more effective classification and clustering of text chunks.
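
For readers unfamiliar with BM25, the sketch below scores text chunks against a user query with the standard Okapi BM25 formula. It is a simplified stand-in, not the BM25ContentFilter itself: the whitespace tokenizer, the `k1` and `b` defaults, and the sample chunks are assumptions.

```python
import math
from collections import Counter


def bm25_scores(query, chunks, k1=1.5, b=0.75):
    """Score each chunk against the query with the Okapi BM25 formula."""
    docs = [c.lower().split() for c in chunks]
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many chunks does each term appear?
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        total = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            total += idf * tf[term] * (k1 + 1) / norm
        scores.append(total)
    return scores


chunks = [
    "pricing for input and output tokens",
    "contact us and follow our newsletter",
    "token pricing table with model names",
]
scores = bm25_scores("pricing tokens", chunks)
best = max(range(len(chunks)), key=scores.__getitem__)
print(chunks[best])
```

The first chunk matches both query terms and ranks highest; the second matches none and scores zero, which is the kind of chunk a query-driven filter would discard.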
### 5. Documentation Updates
- Updated examples and tutorials to promote the use of the PruningContentFilter alongside the BM25ContentFilter, providing clear instructions for selecting the appropriate filter for each use case.
### 6. Unit Test Enhancements
- Added unit tests for PruningContentFilter to ensure accuracy and reliability.
- Enhanced BM25ContentFilter tests to cover additional edge cases and performance metrics, particularly for malformed HTML inputs.
## Revised Change Logs for Version 0.4.0
### PruningContentFilter (Dec 01, 2024)
- Introduced the PruningContentFilter to optimize content extraction by pruning less relevant HTML nodes.
- **Affected Files:**
- **crawl4ai/content_filter_strategy.py**: Added a scoring-based pruning algorithm.
- **README.md**: Updated to include PruningContentFilter usage.
- **docs/md_v2/basic/content_filtering.md**: Expanded user documentation, detailing the use and benefits of PruningContentFilter.
### Unit Tests for PruningContentFilter (Dec 01, 2024)
- Added comprehensive unit tests for PruningContentFilter to ensure correctness and efficiency.
- **Affected Files:**
- **tests/async/test_content_filter_prune.py**: Created tests covering different pruning scenarios to ensure stability and correctness.
### Enhanced BM25ContentFilter Tests (Dec 01, 2024)
- Expanded tests to cover additional extraction scenarios and performance metrics, improving robustness.
- **Affected Files:**
- **tests/async/test_content_filter_bm25.py**: Added tests for edge cases, including malformed HTML inputs.
### Documentation and Example Updates (Dec 01, 2024)
- Revised examples to illustrate the use of PruningContentFilter alongside existing content filtering methods.
- **Affected Files:**
- **docs/examples/quickstart_async.py**: Enhanced example clarity and usability for new users.
## Experimental Features
- The PruningContentFilter is still under experimental development, and we continue to gather feedback for further refinements.
## Conclusion
This release significantly enhances the content extraction capabilities of Crawl4AI with the introduction of the PruningContentFilter, improved supervised filtering with BM25ContentFilter, and robust multi-threaded handling. Additionally, the user-agent generator provides much-needed versatility, resolving compatibility issues faced by many users.
Users are encouraged to experiment with the new content filtering methods to determine which best suits their needs.


@@ -0,0 +1,145 @@
# Release Summary for Version 0.4.1 (December 8, 2024): Major Efficiency Boosts with New Features!
_This post was generated with the help of ChatGPT, take everything with a grain of salt. 🧂_
Hi everyone,
I just finished putting together version 0.4.1 of Crawl4AI, and there are a few changes in here that I think you'll find really helpful. I'll explain what's new, why it matters, and exactly how you can use these features (with the code to back it up). Let's get into it.
---
### Handling Lazy Loading Better (Images Included)
One thing that always bugged me with crawlers is how often they miss lazy-loaded content, especially images. In this version, I made sure Crawl4AI **waits for all images to load** before moving forward. This is useful because many modern websites only load images when they're in the viewport or after some JavaScript executes.
Here's how to enable it:
```python
await crawler.crawl(
url="https://example.com",
wait_for_images=True # Add this argument to ensure images are fully loaded
)
```
What this does is:
1. Waits for the page to reach a "network idle" state.
2. Ensures all images on the page have been completely loaded.
This single change handles the majority of lazy-loading cases you're likely to encounter.
---
### Text-Only Mode (Fast, Lightweight Crawling)
Sometimes, you don't need to download images or process JavaScript at all. For example, if you're crawling to extract text data, you can enable **text-only mode** to speed things up. By disabling images, JavaScript, and other heavy resources, this mode makes crawling **3-4 times faster** in most cases.
Here's how to turn it on:
```python
crawler = AsyncPlaywrightCrawlerStrategy(
text_only=True # Set this to True to enable text-only crawling
)
```
When `text_only=True`, the crawler automatically:
- Disables GPU processing.
- Blocks image and JavaScript resources.
- Reduces the viewport size to 800x600 (you can override this with `viewport_width` and `viewport_height`).
If you need to crawl thousands of pages where you only care about text, this mode will save you a ton of time and resources.
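
Conceptually, the resource blocking boils down to a per-request decision like the one sketched here. This is only an illustration: the `should_block` helper and the exact set of blocked types are assumptions, not the crawler's actual code. The strings are the resource types Playwright reports for a request, so a predicate like this could sit inside a `page.route` handler.

```python
# Resource types to skip in text-only mode (assumed set, matching the list above).
BLOCKED_IN_TEXT_ONLY = {"image", "script"}


def should_block(resource_type: str, text_only: bool) -> bool:
    """Return True when a request of this type should be aborted."""
    return text_only and resource_type in BLOCKED_IN_TEXT_ONLY


for kind in ("document", "image", "script", "stylesheet"):
    print(kind, should_block(kind, text_only=True))
```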
---
### Adjusting the Viewport Dynamically
Another useful addition is the ability to **dynamically adjust the viewport size** to match the content on the page. This is particularly helpful when you're working with responsive layouts or want to ensure all parts of the page load properly.
Here's how it works:
1. The crawler calculates the page's width and height after it loads.
2. It adjusts the viewport to fit the content dimensions.
3. (Optional) It uses Chrome DevTools Protocol (CDP) to simulate zooming out so everything fits in the viewport.
To enable this, use:
```python
await crawler.crawl(
url="https://example.com",
adjust_viewport_to_content=True # Dynamically adjusts the viewport
)
```
This approach makes sure the entire page gets loaded into the viewport, especially for layouts that load content based on visibility.
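
The "zoom out so everything fits" step comes down to a small piece of arithmetic. The sketch below shows one plausible way to compute the scale factor; the `fit_scale` name and the choice to cap the result at 1.0 are my assumptions, not necessarily what the crawler does internally.

```python
def fit_scale(content_w: float, content_h: float,
              viewport_w: float, viewport_h: float) -> float:
    """Zoom factor that makes the whole page fit in the viewport (never above 1.0)."""
    if content_w <= 0 or content_h <= 0:
        raise ValueError("content dimensions must be positive")
    return min(1.0, viewport_w / content_w, viewport_h / content_h)


# A page five viewports tall needs to be scaled down to one fifth.
print(fit_scale(1920, 5400, 1920, 1080))
# A page smaller than the viewport is left at its natural size.
print(fit_scale(800, 600, 1920, 1080))
```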
---
### Simulating Full-Page Scrolling
Some websites load data dynamically as you scroll down the page. To handle these cases, I added support for **full-page scanning**. It simulates scrolling to the bottom of the page, checking for new content, and capturing it all.
Here's an example:
```python
await crawler.crawl(
url="https://example.com",
scan_full_page=True, # Enables scrolling
scroll_delay=0.2 # Waits 200ms between scrolls (optional)
)
```
What happens here:
1. The crawler scrolls down in increments, waiting for content to load after each scroll.
2. It stops when no new content appears (i.e., dynamic elements stop loading).
3. It scrolls back to the top before finishing (if necessary).
If you've ever had to deal with infinite scroll pages, this is going to save you a lot of headaches.
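
The three steps above can be sketched as a simple loop. To keep the example runnable without a browser, it uses a hypothetical `FakePage` that grows when scrolled to the bottom; the `scan_full_page` function here is an illustration of the idea, not the library's implementation.

```python
import asyncio


class FakePage:
    """Hypothetical stand-in for a browser page that grows as it is scrolled."""

    def __init__(self, initial_height: int, max_height: int, growth: int):
        self.height = initial_height
        self.max_height = max_height
        self.growth = growth
        self.scroll_y = 0

    async def scroll_to(self, y: int) -> None:
        self.scroll_y = y
        # Reaching the bottom triggers another batch of content, up to a cap.
        if y >= self.height and self.height < self.max_height:
            self.height = min(self.max_height, self.height + self.growth)


async def scan_full_page(page: FakePage, step: int = 600, scroll_delay: float = 0.0) -> int:
    """Scroll down in steps until the page stops growing, then return to the top."""
    position = 0
    scrolls = 0
    while position < page.height:
        position = min(position + step, page.height)
        await page.scroll_to(position)
        scrolls += 1
        await asyncio.sleep(scroll_delay)  # give new content time to load
    await page.scroll_to(0)
    return scrolls


page = FakePage(initial_height=1000, max_height=2500, growth=1000)
scrolls = asyncio.run(scan_full_page(page))
print(scrolls, page.height, page.scroll_y)
```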
---
### Reusing Browser Sessions (Save Time on Setup)
By default, every time you crawl a page, a new browser context (or tab) is created. That's fine for small crawls, but if you're working on a large dataset, it's more efficient to reuse the same session.
I added a method called `create_session` for this:
```python
session_id = await crawler.create_session()
# Use the same session for multiple crawls
await crawler.crawl(
url="https://example.com/page1",
session_id=session_id # Reuse the session
)
await crawler.crawl(
url="https://example.com/page2",
session_id=session_id
)
```
This avoids creating a new tab for every page, speeding up the crawl and reducing memory usage.
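
If it helps to picture what reuse means, here is a toy registry that hands back the same context for a known session id. `SessionPool` and its dictionary "contexts" are hypothetical; a real crawler would store browser contexts or pages instead.

```python
import itertools


class SessionPool:
    """Hypothetical registry that returns the same context for a known session id."""

    def __init__(self):
        self._sessions = {}
        self._ids = itertools.count(1)
        self.contexts_created = 0

    def _new_context(self):
        self.contexts_created += 1
        return {"id": next(self._ids), "cookies": {}}

    def get(self, session_id=None):
        if session_id is None:
            return self._new_context()  # one-off context, not tracked
        if session_id not in self._sessions:
            self._sessions[session_id] = self._new_context()
        return self._sessions[session_id]


pool = SessionPool()
first = pool.get("s1")
second = pool.get("s1")   # reuses the context created above
one_off = pool.get()      # no session id, so a fresh context is created
print(first is second, pool.contexts_created)
```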
---
### Other Updates
Here are a few smaller updates I've made:
- **Light Mode**: Use `light_mode=True` to disable background processes, extensions, and other unnecessary features, making the browser more efficient.
- **Logging**: Improved logs to make debugging easier.
- **Defaults**: Added sensible defaults for things like `delay_before_return_html` (now set to 0.1 seconds).
---
### How to Get the Update
You can install or upgrade to version `0.4.1` like this:
```bash
pip install crawl4ai --upgrade
```
As always, I'd love to hear your thoughts. If there's something you think could be improved or if you have suggestions for future versions, let me know!
Enjoy the new features, and happy crawling! 🕷️
---


@@ -0,0 +1,86 @@
## 🚀 Crawl4AI 0.4.2 Update: Smarter Crawling Just Got Easier (Dec 12, 2024)
### Hey Developers,
I'm excited to share Crawl4AI 0.4.2, a major upgrade that makes crawling smarter, faster, and a whole lot more intuitive. I've packed in a bunch of new features to simplify your workflows and improve your experience. Let's cut to the chase!
---
### 🔧 **Configurable Browser and Crawler Behavior**
You've asked for better control over how browsers and crawlers are configured, and now you've got it. With the new `BrowserConfig` and `CrawlerRunConfig` objects, you can set up your browser and crawling behavior exactly how you want. No more cluttering `arun` with a dozen arguments: just pass in your configs and go.
**Example:**
```python
from crawl4ai import AsyncWebCrawler, BrowserConfig, CacheMode, CrawlerRunConfig

browser_config = BrowserConfig(headless=True, viewport_width=1920, viewport_height=1080)
# Use the CacheMode enum rather than a plain string
crawler_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)

async with AsyncWebCrawler(config=browser_config) as crawler:
    result = await crawler.arun(url="https://example.com", config=crawler_config)
    print(result.markdown[:500])
```
This setup is a game-changer for scalability, keeping your code clean and flexible as we add more parameters in the future.
Remember: if you prefer the old way, you can still pass arguments directly to `arun` as before, no worries!
---
### 🔐 **Streamlined Session Management**
Here's the big one: You can now pass local storage and cookies directly. Whether it's setting values programmatically or importing a saved JSON state, managing sessions has never been easier. This is a must-have for authenticated crawls: just export your storage state once and reuse it effortlessly across runs.
**Example:**
1. Open a browser, log in manually, and export the storage state.
2. Import the JSON file for seamless authenticated crawling:
```python
result = await crawler.arun(
url="https://example.com/protected",
storage_state="my_storage_state.json"
)
```
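
If you want to build the state file programmatically instead of exporting it from a browser, the sketch below writes one in the shape Playwright uses for storage state (top-level `cookies` and `origins` keys). Every value is a placeholder, and writing to the temp directory is just a choice for this example.

```python
import json
import os
import tempfile

# Placeholder values only; replace them with the cookies and keys your site uses.
storage_state = {
    "cookies": [
        {
            "name": "session",
            "value": "placeholder-token",
            "domain": "example.com",
            "path": "/",
            "expires": -1,
            "httpOnly": True,
            "secure": True,
            "sameSite": "Lax",
        }
    ],
    "origins": [
        {
            "origin": "https://example.com",
            "localStorage": [{"name": "theme", "value": "dark"}],
        }
    ],
}

path = os.path.join(tempfile.gettempdir(), "my_storage_state.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump(storage_state, f, indent=2)

# Read it back to confirm the file round-trips cleanly.
with open(path, encoding="utf-8") as f:
    loaded = json.load(f)
print(sorted(loaded))
```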
---
### 🔢 **Handling Large Pages: Supercharged Screenshots and PDF Conversion**
Two big upgrades here:
- **Blazing-fast long-page screenshots**: Turn extremely long web pages into clean, high-quality screenshots without breaking a sweat. It's optimized to handle large content without lag.
- **Full-page PDF exports**: Now, you can also convert any page into a PDF with all the details intact. Perfect for archiving or sharing complex layouts.
---
### 🔧 **Other Cool Stuff**
- **Anti-bot enhancements**: Magic mode now handles overlays, user simulation, and anti-detection features like a pro.
- **JavaScript execution**: Execute custom JS snippets to handle dynamic content. No more wrestling with endless page interactions.
---
### 📊 **Performance Boosts and Dev-friendly Updates**
- Faster rendering and viewport adjustments for better performance.
- Improved cookie and local storage handling for seamless authentication.
- Better debugging with detailed logs and actionable error messages.
---
### 🔠 **Use Cases You'll Love**
1. **Authenticated Crawls**: Login once, export your storage state, and reuse it across multiple requests without the headache.
2. **Long-page Screenshots**: Perfect for blogs, e-commerce pages, or any endless-scroll website.
3. **PDF Export**: Create professional-looking page PDFs in seconds.
---
### Let's Get Crawling
Crawl4AI 0.4.2 is ready for you to download and try. I'm always looking for ways to improve, so don't hold back: share your thoughts and feedback.
Happy Crawling! 🚀


@@ -52,7 +52,7 @@ Heres a comprehensive outline for the **LLM Extraction Strategy** video, cove
extraction_type="schema",
instruction="Extract model names and fees for input and output tokens from the page."
),
bypass_cache=True
cache_mode=CacheMode.BYPASS
)
print(result.extracted_content)
```
@@ -98,7 +98,7 @@ Heres a comprehensive outline for the **LLM Extraction Strategy** video, cove
result = await crawler.arun(
url="https://example.com/some-article",
extraction_strategy=extraction_strategy,
bypass_cache=True
cache_mode=CacheMode.BYPASS
)
print(result.extracted_content)
```


@@ -55,7 +55,7 @@ Heres a structured outline for the **Cosine Similarity Strategy** video, cove
result = await crawler.arun(
url=url,
extraction_strategy=extraction_strategy,
bypass_cache=True
cache_mode=CacheMode.BYPASS
)
print(result.extracted_content)
```
@@ -103,7 +103,7 @@ Heres a structured outline for the **Cosine Similarity Strategy** video, cove
result = await crawler.arun(
url=url,
extraction_strategy=extraction_strategy,
bypass_cache=True
cache_mode=CacheMode.BYPASS
)
print(result.extracted_content)
```


@@ -26,7 +26,7 @@ Here's a condensed outline of the **Installation and Setup** video content:
- Walk through a simple test script to confirm the setup:
```python
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai import AsyncWebCrawler, CacheMode
async def main():
async with AsyncWebCrawler(verbose=True) as crawler:
@@ -1093,7 +1093,7 @@ Heres a comprehensive outline for the **LLM Extraction Strategy** video, cove
extraction_type="schema",
instruction="Extract model names and fees for input and output tokens from the page."
),
bypass_cache=True
cache_mode=CacheMode.BYPASS
)
print(result.extracted_content)
```
@@ -1139,7 +1139,7 @@ Heres a comprehensive outline for the **LLM Extraction Strategy** video, cove
result = await crawler.arun(
url="https://example.com/some-article",
extraction_strategy=extraction_strategy,
bypass_cache=True
cache_mode=CacheMode.BYPASS
)
print(result.extracted_content)
```
@@ -1248,7 +1248,7 @@ Heres a structured outline for the **Cosine Similarity Strategy** video, cove
result = await crawler.arun(
url=url,
extraction_strategy=extraction_strategy,
bypass_cache=True
cache_mode=CacheMode.BYPASS
)
print(result.extracted_content)
```
@@ -1296,7 +1296,7 @@ Heres a structured outline for the **Cosine Similarity Strategy** video, cove
result = await crawler.arun(
url=url,
extraction_strategy=extraction_strategy,
bypass_cache=True
cache_mode=CacheMode.BYPASS
)
print(result.extracted_content)
```

127
main.py

@@ -10,6 +10,8 @@ from fastapi.exceptions import RequestValidationError
from starlette.middleware.base import BaseHTTPMiddleware
from starlette.responses import FileResponse
from fastapi.responses import RedirectResponse
from fastapi.security import HTTPBearer, HTTPAuthorizationCredentials
from fastapi import Depends, Security
from pydantic import BaseModel, HttpUrl, Field
from typing import Optional, List, Dict, Any, Union
@@ -23,7 +25,8 @@ import logging
from enum import Enum
from dataclasses import dataclass
import json
from crawl4ai import AsyncWebCrawler, CrawlResult
from crawl4ai import AsyncWebCrawler, CrawlResult, CacheMode
from crawl4ai.config import MIN_WORD_THRESHOLD
from crawl4ai.extraction_strategy import (
LLMExtractionStrategy,
CosineStrategy,
@@ -51,18 +54,31 @@ class ExtractionConfig(BaseModel):
type: CrawlerType
params: Dict[str, Any] = {}
class ChunkingStrategy(BaseModel):
type: str
params: Dict[str, Any] = {}
class ContentFilter(BaseModel):
type: str = "bm25"
params: Dict[str, Any] = {}
class CrawlRequest(BaseModel):
urls: Union[HttpUrl, List[HttpUrl]]
word_count_threshold: int = MIN_WORD_THRESHOLD
extraction_config: Optional[ExtractionConfig] = None
crawler_params: Dict[str, Any] = {}
priority: int = Field(default=5, ge=1, le=10)
ttl: Optional[int] = 3600
chunking_strategy: Optional[ChunkingStrategy] = None
content_filter: Optional[ContentFilter] = None
js_code: Optional[List[str]] = None
wait_for: Optional[str] = None
css_selector: Optional[str] = None
screenshot: bool = False
magic: bool = False
extra: Optional[Dict[str, Any]] = {}
session_id: Optional[str] = None
cache_mode: Optional[CacheMode] = CacheMode.ENABLED
priority: int = Field(default=5, ge=1, le=10)
ttl: Optional[int] = 3600
crawler_params: Dict[str, Any] = {}
@dataclass
class TaskInfo:
@@ -276,12 +292,15 @@ class CrawlerService:
if isinstance(request.urls, list):
results = await crawler.arun_many(
urls=[str(url) for url in request.urls],
word_count_threshold=MIN_WORD_THRESHOLD,
extraction_strategy=extraction_strategy,
js_code=request.js_code,
wait_for=request.wait_for,
css_selector=request.css_selector,
screenshot=request.screenshot,
magic=request.magic,
session_id=request.session_id,
cache_mode=request.cache_mode,
**request.extra,
)
else:
@@ -293,6 +312,8 @@ class CrawlerService:
css_selector=request.css_selector,
screenshot=request.screenshot,
magic=request.magic,
session_id=request.session_id,
cache_mode=request.cache_mode,
**request.extra,
)
@@ -319,11 +340,27 @@ app.add_middleware(
allow_headers=["*"], # Allows all headers
)
# Mount the pages directory as a static directory
app.mount("/pages", StaticFiles(directory=__location__ + "/pages"), name="pages")
app.mount("/mkdocs", StaticFiles(directory="site", html=True), name="mkdocs")
# API token security
security = HTTPBearer()
CRAWL4AI_API_TOKEN = os.getenv("CRAWL4AI_API_TOKEN")
async def verify_token(credentials: HTTPAuthorizationCredentials = Security(security)):
if not CRAWL4AI_API_TOKEN:
return credentials # No token verification if CRAWL4AI_API_TOKEN is not set
if credentials.credentials != CRAWL4AI_API_TOKEN:
raise HTTPException(status_code=401, detail="Invalid token")
return credentials
# Helper function to conditionally apply security
def secure_endpoint():
return Depends(verify_token) if CRAWL4AI_API_TOKEN else None
# Check if site directory exists
if os.path.exists(__location__ + "/site"):
# Mount the site directory as a static directory
app.mount("/mkdocs", StaticFiles(directory="site", html=True), name="mkdocs")
site_templates = Jinja2Templates(directory=__location__ + "/site")
templates = Jinja2Templates(directory=__location__ + "/pages")
crawler_service = CrawlerService()
@@ -337,15 +374,18 @@ async def shutdown_event():
@app.get("/")
def read_root():
return RedirectResponse(url="/mkdocs")
if os.path.exists(__location__ + "/site"):
return RedirectResponse(url="/mkdocs")
# Return a json response
return {"message": "Crawl4AI API service is running"}
@app.post("/crawl")
@app.post("/crawl", dependencies=[Depends(verify_token)])
async def crawl(request: CrawlRequest) -> Dict[str, str]:
task_id = await crawler_service.submit_task(request)
return {"task_id": task_id}
@app.get("/task/{task_id}")
@app.get("/task/{task_id}", dependencies=[Depends(verify_token)])
async def get_task_status(task_id: str):
task_info = crawler_service.task_manager.get_task(task_id)
if not task_info:
@@ -367,6 +407,71 @@ async def get_task_status(task_id: str):
return response
@app.post("/crawl_sync", dependencies=[Depends(verify_token)])
async def crawl_sync(request: CrawlRequest) -> Dict[str, Any]:
task_id = await crawler_service.submit_task(request)
# Wait up to 60 seconds for task completion
for _ in range(60):
task_info = crawler_service.task_manager.get_task(task_id)
if not task_info:
raise HTTPException(status_code=404, detail="Task not found")
if task_info.status == TaskStatus.COMPLETED:
# Return same format as /task/{task_id} endpoint
if isinstance(task_info.result, list):
return {"status": task_info.status, "results": [result.dict() for result in task_info.result]}
return {"status": task_info.status, "result": task_info.result.dict()}
if task_info.status == TaskStatus.FAILED:
raise HTTPException(status_code=500, detail=task_info.error)
await asyncio.sleep(1)
# If we get here, task didn't complete within timeout
raise HTTPException(status_code=408, detail="Task timed out")
@app.post("/crawl_direct", dependencies=[Depends(verify_token)])
async def crawl_direct(request: CrawlRequest) -> Dict[str, Any]:
try:
crawler = await crawler_service.crawler_pool.acquire(**request.crawler_params)
extraction_strategy = crawler_service._create_extraction_strategy(request.extraction_config)
try:
if isinstance(request.urls, list):
results = await crawler.arun_many(
urls=[str(url) for url in request.urls],
extraction_strategy=extraction_strategy,
js_code=request.js_code,
wait_for=request.wait_for,
css_selector=request.css_selector,
screenshot=request.screenshot,
magic=request.magic,
cache_mode=request.cache_mode,
session_id=request.session_id,
**request.extra,
)
return {"results": [result.dict() for result in results]}
else:
result = await crawler.arun(
url=str(request.urls),
extraction_strategy=extraction_strategy,
js_code=request.js_code,
wait_for=request.wait_for,
css_selector=request.css_selector,
screenshot=request.screenshot,
magic=request.magic,
cache_mode=request.cache_mode,
session_id=request.session_id,
**request.extra,
)
return {"result": result.dict()}
finally:
await crawler_service.crawler_pool.release(crawler)
except Exception as e:
logger.error(f"Error in direct crawl: {str(e)}")
raise HTTPException(status_code=500, detail=str(e))
@app.get("/health")
async def health_check():
available_slots = await crawler_service.resource_monitor.get_available_slots()


@@ -10,13 +10,18 @@ nav:
- 'Installation': 'basic/installation.md'
- 'Docker Deployment': 'basic/docker-deploymeny.md'
- 'Quick Start': 'basic/quickstart.md'
- Changelog & Blog:
- 'Blog Home': 'blog/index.md'
- 'Latest (0.4.1)': 'blog/releases/0.4.1.md'
- 'Changelog': 'https://github.com/unclecode/crawl4ai/blob/main/CHANGELOG.md'
- Basic:
- 'Simple Crawling': 'basic/simple-crawling.md'
- 'Output Formats': 'basic/output-formats.md'
- 'Browser Configuration': 'basic/browser-config.md'
- 'Page Interaction': 'basic/page-interaction.md'
- 'Content Selection': 'basic/content-selection.md'
- 'Cache Modes': 'basic/cache-modes.md'
- Advanced:
- 'Content Processing': 'advanced/content-processing.md'
@@ -49,12 +54,12 @@ nav:
- '5. Dynamic Content': 'tutorial/episode_05_JavaScript_Execution_and_Dynamic_Content_Handling.md'
- '6. Magic Mode': 'tutorial/episode_06_Magic_Mode_and_Anti-Bot_Protection.md'
- '7. Content Cleaning': 'tutorial/episode_07_Content_Cleaning_and_Fit_Markdown.md'
- '8. Media Handling': 'tutorial/episode_08_Media_Handling:_Images,_Videos,_and_Audio.md'
- '8. Media Handling': 'tutorial/episode_08_Media_Handling_Images_Videos_and_Audio.md'
- '9. Link Analysis': 'tutorial/episode_09_Link_Analysis_and_Smart_Filtering.md'
- '10. User Simulation': 'tutorial/episode_10_Custom_Headers,_Identity,_and_User_Simulation.md'
- '11.1. JSON CSS': 'tutorial/episode_11_1_Extraction_Strategies:_JSON_CSS.md'
- '11.2. LLM Strategy': 'tutorial/episode_11_2_Extraction_Strategies:_LLM.md'
- '11.3. Cosine Strategy': 'tutorial/episode_11_3_Extraction_Strategies:_Cosine.md'
- '11.1. JSON CSS': 'tutorial/episode_11_1_Extraction_Strategies_JSON_CSS.md'
- '11.2. LLM Strategy': 'tutorial/episode_11_2_Extraction_Strategies_LLM.md'
- '11.3. Cosine Strategy': 'tutorial/episode_11_3_Extraction_Strategies_Cosine.md'
- '12. Session Crawling': 'tutorial/episode_12_Session-Based_Crawling_for_Dynamic_Websites.md'
- '13. Text Chunking': 'tutorial/episode_13_Chunking_Strategies_for_Large_Text_Processing.md'
- '14. Custom Workflows': 'tutorial/episode_14_Hooks_and_Custom_Workflow_with_AsyncWebCrawler.md'


@@ -1,131 +0,0 @@
:root {
--ifm-font-size-base: 100%;
--ifm-line-height-base: 1.65;
--ifm-font-family-base: system-ui, -apple-system, Segoe UI, Roboto, Ubuntu, Cantarell, Noto Sans, sans-serif,
BlinkMacSystemFont, "Segoe UI", Helvetica, Arial, sans-serif, "Apple Color Emoji", "Segoe UI Emoji",
"Segoe UI Symbol";
}
html {
-webkit-font-smoothing: antialiased;
-webkit-text-size-adjust: 100%;
text-size-adjust: 100%;
font: var(--ifm-font-size-base) / var(--ifm-line-height-base) var(--ifm-font-family-base);
}
body {
background-color: #1a202c;
color: #fff;
}
.tab-content {
max-height: 400px;
overflow: auto;
}
pre {
white-space: pre-wrap;
font-size: 14px;
}
pre code {
width: 100%;
}
/* Custom styling for docs-item class and Markdown generated elements */
.docs-item {
background-color: #2d3748; /* bg-gray-800 */
padding: 1rem; /* p-4 */
border-radius: 0.375rem; /* rounded */
box-shadow: 0 4px 6px rgba(0, 0, 0, 0.1); /* shadow-md */
margin-bottom: 1rem; /* space between items */
line-height: 1.5; /* leading-normal */
}
.docs-item h3,
.docs-item h4 {
color: #ffffff; /* text-white */
font-size: 1.25rem; /* text-xl */
font-weight: 700; /* font-bold */
margin-bottom: 0.5rem; /* mb-2 */
}
.docs-item h4 {
font-size: 1rem; /* text-base */
}
.docs-item p {
color: #e2e8f0; /* text-gray-300 */
margin-bottom: 0.5rem; /* mb-2 */
}
.docs-item code {
background-color: #1a202c; /* bg-gray-900 */
color: #e2e8f0; /* text-gray-300 */
padding: 0.25rem 0.5rem; /* px-2 py-1 */
border-radius: 0.25rem; /* rounded */
font-size: 0.875rem; /* text-sm */
}
.docs-item pre {
background-color: #1a202c; /* bg-gray-900 */
color: #e2e8f0; /* text-gray-300 */
padding: 0.5rem; /* p-2 */
border-radius: 0.375rem; /* rounded */
overflow: auto; /* overflow-auto */
margin-bottom: 0.5rem; /* mb-2 */
}
.docs-item div {
color: #e2e8f0; /* text-gray-300 */
font-size: 1rem; /* prose prose-sm */
line-height: 1.25rem; /* line-height for readability */
}
/* Adjustments to make prose class more suitable for dark mode */
.prose {
max-width: none; /* max-w-none */
}
.prose p,
.prose ul {
margin-bottom: 1rem; /* mb-4 */
}
.prose code {
/* background-color: #4a5568; */ /* bg-gray-700 */
color: #65a30d; /* text-lime-600 */
padding: 0.25rem 0.5rem; /* px-1 py-0.5 */
border-radius: 0.25rem; /* rounded */
display: inline-block; /* inline-block */
}
.prose pre {
background-color: #1a202c; /* bg-gray-900 */
color: #ffffff; /* text-white */
padding: 0.5rem; /* p-2 */
border-radius: 0.375rem; /* rounded */
}
.prose h3 {
color: #65a30d; /* text-white */
font-size: 1.25rem; /* text-xl */
font-weight: 700; /* font-bold */
margin-bottom: 0.5rem; /* mb-2 */
}
body {
background-color: #1a1a1a;
color: #b3ff00;
}
.sidebar {
color: #b3ff00;
border-right: 1px solid #333;
}
.sidebar a {
color: #b3ff00;
text-decoration: none;
}
.sidebar a:hover {
background-color: #555;
}
.content-section {
display: none;
}
.content-section.active {
display: block;
}


@@ -1,356 +0,0 @@
// JavaScript to manage dynamic form changes and logic
document.getElementById("extraction-strategy-select").addEventListener("change", function () {
const strategy = this.value;
const providerModelSelect = document.getElementById("provider-model-select");
const tokenInput = document.getElementById("token-input");
const instruction = document.getElementById("instruction");
const semantic_filter = document.getElementById("semantic_filter");
const instruction_div = document.getElementById("instruction_div");
const semantic_filter_div = document.getElementById("semantic_filter_div");
const llm_settings = document.getElementById("llm_settings");
if (strategy === "LLMExtractionStrategy") {
// providerModelSelect.disabled = false;
// tokenInput.disabled = false;
// semantic_filter.disabled = true;
// instruction.disabled = false;
llm_settings.classList.remove("hidden");
instruction_div.classList.remove("hidden");
semantic_filter_div.classList.add("hidden");
} else if (strategy === "NoExtractionStrategy") {
semantic_filter_div.classList.add("hidden");
instruction_div.classList.add("hidden");
llm_settings.classList.add("hidden");
} else {
// providerModelSelect.disabled = true;
// tokenInput.disabled = true;
// semantic_filter.disabled = false;
// instruction.disabled = true;
llm_settings.classList.add("hidden");
instruction_div.classList.add("hidden");
semantic_filter_div.classList.remove("hidden");
}
});
// Get the selected provider model and token from local storage
const storedProviderModel = localStorage.getItem("provider_model");
const storedToken = localStorage.getItem(storedProviderModel);
if (storedProviderModel) {
document.getElementById("provider-model-select").value = storedProviderModel;
}
if (storedToken) {
document.getElementById("token-input").value = storedToken;
}
// Handle provider model dropdown change
document.getElementById("provider-model-select").addEventListener("change", () => {
const selectedProviderModel = document.getElementById("provider-model-select").value;
const storedToken = localStorage.getItem(selectedProviderModel);
if (storedToken) {
document.getElementById("token-input").value = storedToken;
} else {
document.getElementById("token-input").value = "";
}
});
// Fetch total count from the database
axios
.get("/total-count")
.then((response) => {
document.getElementById("total-count").textContent = response.data.count;
})
.catch((error) => console.error(error));
// Handle crawl button click
document.getElementById("crawl-btn").addEventListener("click", () => {
// validate input to have both URL and API token
// if selected extraction strategy is LLMExtractionStrategy, then API token is required
if (document.getElementById("extraction-strategy-select").value === "LLMExtractionStrategy") {
if (!document.getElementById("url-input").value || !document.getElementById("token-input").value) {
alert("Please enter both URL(s) and API token.");
return;
}
}
const selectedProviderModel = document.getElementById("provider-model-select").value;
const apiToken = document.getElementById("token-input").value;
const extractBlocks = document.getElementById("extract-blocks-checkbox").checked;
const bypassCache = document.getElementById("bypass-cache-checkbox").checked;
// Save the selected provider model and token to local storage
localStorage.setItem("provider_model", selectedProviderModel);
localStorage.setItem(selectedProviderModel, apiToken);
const urlsInput = document.getElementById("url-input").value;
const urls = urlsInput.split(",").map((url) => url.trim());
const data = {
urls: urls,
include_raw_html: true,
bypass_cache: bypassCache,
extract_blocks: extractBlocks,
word_count_threshold: parseInt(document.getElementById("threshold").value, 10),
extraction_strategy: document.getElementById("extraction-strategy-select").value,
extraction_strategy_args: {
provider: selectedProviderModel,
api_token: apiToken,
instruction: document.getElementById("instruction").value,
semantic_filter: document.getElementById("semantic_filter").value,
},
chunking_strategy: document.getElementById("chunking-strategy-select").value,
chunking_strategy_args: {},
css_selector: document.getElementById("css-selector").value,
screenshot: document.getElementById("screenshot-checkbox").checked,
// instruction: document.getElementById("instruction").value,
// semantic_filter: document.getElementById("semantic_filter").value,
verbose: true,
};
// import requests
// data = {
// "urls": [
// "https://www.nbcnews.com/business"
// ],
// "word_count_threshold": 10,
// "extraction_strategy": "NoExtractionStrategy",
// }
// response = requests.post("https://crawl4ai.com/crawl", json=data) # or localhost if you run locally
// print(response.json())
// save api token to local storage
localStorage.setItem("api_token", document.getElementById("token-input").value);
document.getElementById("loading").classList.remove("hidden");
document.getElementById("result").style.visibility = "hidden";
document.getElementById("code_help").style.visibility = "hidden";
axios
.post("/crawl", data)
.then((response) => {
const result = response.data.results[0];
const parsedJson = JSON.parse(result.extracted_content);
document.getElementById("json-result").textContent = JSON.stringify(parsedJson, null, 2);
document.getElementById("cleaned-html-result").textContent = result.cleaned_html;
document.getElementById("markdown-result").textContent = result.markdown;
document.getElementById("media-result").textContent = JSON.stringify( result.media, null, 2);
if (result.screenshot){
const imgElement = document.createElement("img");
// Set the src attribute with the base64 data
imgElement.src = `data:image/png;base64,${result.screenshot}`;
document.getElementById("screenshot-result").innerHTML = "";
document.getElementById("screenshot-result").appendChild(imgElement);
}
// Update code examples dynamically
const extractionStrategy = data.extraction_strategy;
const isLLMExtraction = extractionStrategy === "LLMExtractionStrategy";
// REMOVE API TOKEN FROM CODE EXAMPLES
data.extraction_strategy_args.api_token = "your_api_token";
if (data.extraction_strategy === "NoExtractionStrategy") {
delete data.extraction_strategy_args;
delete data.extract_blocks;
}
if (data.chunking_strategy === "RegexChunking") {
delete data.chunking_strategy_args;
}
delete data.verbose;
if (data.css_selector === "") {
delete data.css_selector;
}
if (!data.bypass_cache) {
delete data.bypass_cache;
}
if (!data.extract_blocks) {
delete data.extract_blocks;
}
if (!data.include_raw_html) {
delete data.include_raw_html;
}
document.getElementById(
"curl-code"
).textContent = `curl -X POST -H "Content-Type: application/json" -d '${JSON.stringify({
...data,
api_token: isLLMExtraction ? "your_api_token" : undefined,
}, null, 2)}' https://crawl4ai.com/crawl`;
document.getElementById("python-code").textContent = `import requests\n\ndata = ${JSON.stringify(
{ ...data, api_token: isLLMExtraction ? "your_api_token" : undefined },
null,
2
)}\n\nresponse = requests.post("https://crawl4ai.com/crawl", json=data) # or localhost if you run locally\nprint(response.json())`;
document.getElementById(
"nodejs-code"
).textContent = `const axios = require('axios');\n\nconst data = ${JSON.stringify(
{ ...data, api_token: isLLMExtraction ? "your_api_token" : undefined },
null,
2
)};\n\naxios.post("https://crawl4ai.com/crawl", data) // or localhost if you run locally\n .then(response => console.log(response.data))\n .catch(error => console.error(error));`;
document.getElementById(
"library-code"
).textContent = `from crawl4ai.web_crawler import WebCrawler\nfrom crawl4ai.extraction_strategy import *\nfrom crawl4ai.chunking_strategy import *\n\ncrawler = WebCrawler()\ncrawler.warmup()\n\nresult = crawler.run(\n url='${
urls[0]
}',\n word_count_threshold=${data.word_count_threshold},\n extraction_strategy=${
isLLMExtraction
? `${extractionStrategy}(provider="${data.extraction_strategy_args.provider}", api_token="${data.extraction_strategy_args.api_token}")`
: extractionStrategy + "()"
},\n chunking_strategy=${data.chunking_strategy}(),\n bypass_cache=${
data.bypass_cache
},\n css_selector="${data.css_selector}"\n)\nprint(result)`;
// Highlight code syntax
hljs.highlightAll();
// Select JSON tab by default
document.querySelector('.tab-btn[data-tab="json"]').click();
document.getElementById("loading").classList.add("hidden");
document.getElementById("result").style.visibility = "visible";
document.getElementById("code_help").style.visibility = "visible";
// increment the total count
document.getElementById("total-count").textContent =
parseInt(document.getElementById("total-count").textContent, 10) + 1;
})
.catch((error) => {
console.error(error);
document.getElementById("loading").classList.add("hidden");
});
});
// Handle tab clicks
document.querySelectorAll(".tab-btn").forEach((btn) => {
btn.addEventListener("click", () => {
const tab = btn.dataset.tab;
document.querySelectorAll(".tab-btn").forEach((b) => b.classList.remove("bg-lime-700", "text-white"));
btn.classList.add("bg-lime-700", "text-white");
document.querySelectorAll(".tab-content.code pre").forEach((el) => el.classList.add("hidden"));
document.getElementById(`${tab}-result`).parentElement.classList.remove("hidden");
});
});
// Handle code tab clicks
document.querySelectorAll(".code-tab-btn").forEach((btn) => {
btn.addEventListener("click", () => {
const tab = btn.dataset.tab;
document.querySelectorAll(".code-tab-btn").forEach((b) => b.classList.remove("bg-lime-700", "text-white"));
btn.classList.add("bg-lime-700", "text-white");
document.querySelectorAll(".tab-content.result pre").forEach((el) => el.classList.add("hidden"));
document.getElementById(`${tab}-code`).parentElement.classList.remove("hidden");
});
});
// Handle copy to clipboard button clicks
async function copyToClipboard(text) {
if (navigator.clipboard && navigator.clipboard.writeText) {
return navigator.clipboard.writeText(text);
} else {
return fallbackCopyTextToClipboard(text);
}
}
function fallbackCopyTextToClipboard(text) {
return new Promise((resolve, reject) => {
const textArea = document.createElement("textarea");
textArea.value = text;
// Avoid scrolling to bottom
textArea.style.top = "0";
textArea.style.left = "0";
textArea.style.position = "fixed";
document.body.appendChild(textArea);
textArea.focus();
textArea.select();
try {
const successful = document.execCommand("copy");
if (successful) {
resolve();
} else {
reject();
}
} catch (err) {
reject(err);
}
document.body.removeChild(textArea);
});
}
document.querySelectorAll(".copy-btn").forEach((btn) => {
btn.addEventListener("click", () => {
const target = btn.dataset.target;
const code = document.getElementById(target).textContent;
//navigator.clipboard.writeText(code).then(() => {
copyToClipboard(code).then(() => {
btn.textContent = "Copied!";
setTimeout(() => {
btn.textContent = "Copy";
}, 2000);
}).catch((error) => console.error("Copy failed:", error));
});
});
document.addEventListener("DOMContentLoaded", async () => {
try {
const extractionResponse = await fetch("/strategies/extraction");
const extractionStrategies = await extractionResponse.json();
const chunkingResponse = await fetch("/strategies/chunking");
const chunkingStrategies = await chunkingResponse.json();
renderStrategies("extraction-strategies", extractionStrategies);
renderStrategies("chunking-strategies", chunkingStrategies);
} catch (error) {
console.error("Error fetching strategies:", error);
}
});
function renderStrategies(containerId, strategies) {
const container = document.getElementById(containerId);
container.innerHTML = ""; // Clear any existing content
strategies = JSON.parse(strategies);
Object.entries(strategies).forEach(([strategy, description]) => {
const strategyElement = document.createElement("div");
strategyElement.classList.add("bg-zinc-800", "p-4", "rounded", "shadow-md", "docs-item");
const strategyDescription = document.createElement("div");
strategyDescription.classList.add("text-gray-300", "prose", "prose-sm");
strategyDescription.innerHTML = marked.parse(description);
strategyElement.appendChild(strategyDescription);
container.appendChild(strategyElement);
});
}
document.querySelectorAll(".sidebar a").forEach((link) => {
link.addEventListener("click", function (event) {
event.preventDefault();
document.querySelectorAll(".content-section").forEach((section) => {
section.classList.remove("active");
});
const target = event.target.getAttribute("data-target");
document.getElementById(target).classList.add("active");
});
});
// Highlight code syntax
hljs.highlightAll();
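
The crawl handler in the script above reuses the request payload to build the cURL, Python, and Node.js samples, deleting keys from `data` in place as it goes. A minimal sketch of the same cleanup as a pure function follows. The helper name `buildExamplePayload` and the sample `request` object are assumptions for illustration and are not part of the original script.

```javascript
// Hypothetical helper (not in the original script): returns a cleaned copy of
// the request payload for display and leaves the input object untouched.
function buildExamplePayload(data) {
  // Copy the top level, and the nested strategy args if they are present.
  const example = { ...data };
  if (data.extraction_strategy_args) {
    example.extraction_strategy_args = {
      ...data.extraction_strategy_args,
      api_token: "your_api_token", // never echo the real token into a sample
    };
  }

  // NoExtractionStrategy takes no arguments and extracts no blocks.
  if (example.extraction_strategy === "NoExtractionStrategy") {
    delete example.extraction_strategy_args;
    delete example.extract_blocks;
  }

  // The original script drops the chunking args for RegexChunking.
  if (example.chunking_strategy === "RegexChunking") {
    delete example.chunking_strategy_args;
  }

  // Drop options that only add noise to a code sample.
  delete example.verbose;
  if (example.css_selector === "") {
    delete example.css_selector;
  }
  for (const flag of ["bypass_cache", "extract_blocks", "include_raw_html"]) {
    if (!example[flag]) {
      delete example[flag];
    }
  }
  return example;
}

// Illustrative payload with the same keys the form handler sends.
const request = {
  urls: ["https://www.nbcnews.com/business"],
  include_raw_html: true,
  bypass_cache: false,
  extract_blocks: true,
  word_count_threshold: 10,
  extraction_strategy: "LLMExtractionStrategy",
  extraction_strategy_args: { provider: "groq/llama3-70b-8192", api_token: "secret" },
  chunking_strategy: "RegexChunking",
  chunking_strategy_args: {},
  css_selector: "",
  verbose: true,
};
console.log(JSON.stringify(buildExamplePayload(request), null, 2));
```

Because the helper works on a copy, the object sent to `/crawl` keeps the real token and every option, while the rendered samples only ever see the masked version.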


@@ -1,971 +0,0 @@
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<title>Crawl4AI</title>
<link rel="preconnect" href="https://fonts.googleapis.com" />
<link rel="preconnect" href="https://fonts.gstatic.com" crossorigin />
<link href="https://fonts.googleapis.com/css2?family=Inter:wght@100..900&display=swap" rel="stylesheet" />
<!-- <link href="https://cdn.jsdelivr.net/npm/tailwindcss@3.4.3/dist/tailwind.min.css" rel="stylesheet" /> -->
<script src="https://cdn.tailwindcss.com"></script>
<script src="https://cdn.jsdelivr.net/npm/axios/dist/axios.min.js"></script>
<link
rel="stylesheet"
href="https://cdnjs.cloudflare.com/ajax/libs/highlight.js/11.7.0/styles/monokai.min.css"
/>
<script src="https://cdn.jsdelivr.net/npm/marked/marked.min.js"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/highlight.js/11.7.0/highlight.min.js"></script>
<style>
:root {
--ifm-font-size-base: 100%;
--ifm-line-height-base: 1.65;
--ifm-font-family-base: system-ui, -apple-system, Segoe UI, Roboto, Ubuntu, Cantarell, Noto Sans,
sans-serif, BlinkMacSystemFont, "Segoe UI", Helvetica, Arial, sans-serif, "Apple Color Emoji",
"Segoe UI Emoji", "Segoe UI Symbol";
}
html {
-webkit-font-smoothing: antialiased;
-webkit-text-size-adjust: 100%;
text-size-adjust: 100%;
font: var(--ifm-font-size-base) / var(--ifm-line-height-base) var(--ifm-font-family-base);
}
body {
background-color: #1a202c;
color: #fff;
}
.tab-content {
max-height: 400px;
overflow: auto;
}
pre {
white-space: pre-wrap;
font-size: 14px;
}
pre code {
width: 100%;
}
</style>
<style>
/* Custom styling for docs-item class and Markdown generated elements */
.docs-item {
background-color: #2d3748; /* bg-gray-800 */
padding: 1rem; /* p-4 */
border-radius: 0.375rem; /* rounded */
box-shadow: 0 4px 6px rgba(0, 0, 0, 0.1); /* shadow-md */
margin-bottom: 1rem; /* space between items */
}
.docs-item h3,
.docs-item h4 {
color: #ffffff; /* text-white */
font-size: 1.25rem; /* text-xl */
font-weight: 700; /* font-bold */
margin-bottom: 0.5rem; /* mb-2 */
}
.docs-item p {
color: #e2e8f0; /* text-gray-300 */
margin-bottom: 0.5rem; /* mb-2 */
}
.docs-item code {
background-color: #1a202c; /* bg-gray-900 */
color: #e2e8f0; /* text-gray-300 */
padding: 0.25rem 0.5rem; /* px-2 py-1 */
border-radius: 0.25rem; /* rounded */
}
.docs-item pre {
background-color: #1a202c; /* bg-gray-900 */
color: #e2e8f0; /* text-gray-300 */
padding: 0.5rem; /* p-2 */
border-radius: 0.375rem; /* rounded */
overflow: auto; /* overflow-auto */
margin-bottom: 0.5rem; /* mb-2 */
}
.docs-item div {
color: #e2e8f0; /* text-gray-300 */
font-size: 1rem; /* prose prose-sm */
line-height: 1.25rem; /* line-height for readability */
}
/* Adjustments to make prose class more suitable for dark mode */
.prose {
max-width: none; /* max-w-none */
}
.prose p,
.prose ul {
margin-bottom: 1rem; /* mb-4 */
}
.prose code {
/* background-color: #4a5568; */ /* bg-gray-700 */
color: #65a30d; /* text-lime-600 */
padding: 0.25rem 0.5rem; /* px-2 py-1 */
border-radius: 0.25rem; /* rounded */
display: inline-block; /* inline-block */
}
.prose pre {
background-color: #1a202c; /* bg-gray-900 */
color: #ffffff; /* text-white */
padding: 0.5rem; /* p-2 */
border-radius: 0.375rem; /* rounded */
}
.prose h3 {
color: #65a30d; /* text-lime-600 */
font-size: 1.25rem; /* text-xl */
font-weight: 700; /* font-bold */
margin-bottom: 0.5rem; /* mb-2 */
}
</style>
</head>
<body class="bg-black text-gray-200">
<header class="bg-zinc-950 text-white py-4 flex">
<div class="mx-auto px-4">
<h1 class="text-2xl font-bold">🔥🕷️ Crawl4AI: Web Data for your Thoughts</h1>
</div>
<div class="mx-auto px-4 flex font-bold text-xl gap-2">
<span>📊 Total Websites Processed</span>
<span id="total-count" class="text-lime-400">2</span>
</div>
</header>
<section class="try-it py-8 px-16 pb-20">
<div class="container mx-auto px-4">
<h2 class="text-2xl font-bold mb-4">Try It Now</h2>
<div class="grid grid-cols-1 lg:grid-cols-3 gap-4">
<div class="space-y-4">
<div class="flex flex-col">
<label for="url-input" class="text-lime-500 font-bold text-xs">URL(s)</label>
<input
type="text"
id="url-input"
value="https://www.nbcnews.com/business"
class="border border-zinc-700 rounded px-4 py-0 bg-zinc-900 text-lime-500"
placeholder="Enter URL(s) separated by commas"
/>
</div>
<div class="flex flex-col">
<label for="threshold" class="text-lime-500 font-bold text-xs">Min Words Threshold</label>
<select
id="threshold"
class="border border-zinc-700 rounded px-4 py-0 bg-zinc-900 text-lime-500"
>
<option value="5">5</option>
<option value="10" selected>10</option>
<option value="15">15</option>
<option value="20">20</option>
<option value="25">25</option>
</select>
</div>
<div class="flex flex-col">
<label for="css-selector" class="text-lime-500 font-bold text-xs">CSS Selector</label>
<input
type="text"
id="css-selector"
class="border border-zinc-700 rounded px-4 py-0 bg-zinc-900 text-lime-500"
placeholder="Enter CSS Selector"
/>
</div>
<div class="flex flex-col">
<label for="extraction-strategy-select" class="text-lime-500 font-bold text-xs"
>Extraction Strategy</label
>
<select
id="extraction-strategy-select"
class="border border-zinc-700 rounded px-4 py-1 bg-zinc-900 text-lime-500"
>
<option value="CosineStrategy">CosineStrategy</option>
<option value="LLMExtractionStrategy">LLMExtractionStrategy</option>
<option value="NoExtractionStrategy">NoExtractionStrategy</option>
</select>
</div>
<div class="flex flex-col">
<label for="chunking-strategy-select" class="text-lime-500 font-bold text-xs"
>Chunking Strategy</label
>
<select
id="chunking-strategy-select"
class="border border-zinc-700 rounded px-4 py-1 bg-zinc-900 text-lime-500"
>
<option value="RegexChunking">RegexChunking</option>
<option value="NlpSentenceChunking">NlpSentenceChunking</option>
<option value="TopicSegmentationChunking">TopicSegmentationChunking</option>
<option value="FixedLengthWordChunking">FixedLengthWordChunking</option>
<option value="SlidingWindowChunking">SlidingWindowChunking</option>
</select>
</div>
<div class="flex flex-col">
<label for="provider-model-select" class="text-lime-500 font-bold text-xs"
>Provider Model</label
>
<select
id="provider-model-select"
class="border border-zinc-700 rounded px-4 py-0 bg-zinc-900 text-lime-500"
disabled
>
<option value="groq/llama3-70b-8192">groq/llama3-70b-8192</option>
<option value="groq/llama3-8b-8192">groq/llama3-8b-8192</option>
<option value="openai/gpt-4-turbo">gpt-4-turbo</option>
<option value="openai/gpt-3.5-turbo">gpt-3.5-turbo</option>
<option value="anthropic/claude-3-haiku-20240307">claude-3-haiku</option>
<option value="anthropic/claude-3-opus-20240229">claude-3-opus</option>
<option value="anthropic/claude-3-sonnet-20240229">claude-3-sonnet</option>
</select>
</div>
<div class="flex flex-col">
<label for="token-input" class="text-lime-500 font-bold text-xs">API Token</label>
<input
type="password"
id="token-input"
class="border border-zinc-700 rounded px-4 py-0 bg-zinc-900 text-lime-500"
placeholder="Enter Groq API token"
disabled
/>
</div>
<div class="flex gap-3">
<div class="flex items-center gap-2">
<input type="checkbox" id="bypass-cache-checkbox" />
<label for="bypass-cache-checkbox" class="text-lime-500 font-bold">Bypass Cache</label>
</div>
<div class="flex items-center gap-2">
<input type="checkbox" id="extract-blocks-checkbox" checked />
<label for="extract-blocks-checkbox" class="text-lime-500 font-bold"
>Extract Blocks</label
>
</div>
<button id="crawl-btn" class="bg-lime-600 text-black font-bold px-4 py-0 rounded">
Crawl
</button>
</div>
</div>
<div id="result" class=" ">
<div id="loading" class="hidden">
<p class="text-white">Loading... Please wait.</p>
</div>
<div class="tab-buttons flex gap-2">
<button
class="tab-btn px-4 py-1 text-sm bg-zinc-700 rounded-t text-lime-500"
data-tab="json"
>
JSON
</button>
<button
class="tab-btn px-4 py-1 text-sm bg-zinc-700 rounded-t text-lime-500"
data-tab="cleaned-html"
>
Cleaned HTML
</button>
<button
class="tab-btn px-4 py-1 text-sm bg-zinc-700 rounded-t text-lime-500"
data-tab="markdown"
>
Markdown
</button>
</div>
<div class="tab-content code bg-zinc-900 p-2 rounded h-full border border-zinc-700 text-sm">
<pre class="h-full flex"><code id="json-result" class="language-json"></code></pre>
<pre
class="hidden h-full flex"
><code id="cleaned-html-result" class="language-html"></code></pre>
<pre
class="hidden h-full flex"
><code id="markdown-result" class="language-markdown"></code></pre>
</div>
</div>
<div id="code_help" class=" ">
<div class="tab-buttons flex gap-2">
<button
class="code-tab-btn px-4 py-1 text-sm bg-zinc-700 rounded-t text-lime-500"
data-tab="curl"
>
cURL
</button>
<button
class="code-tab-btn px-4 py-1 text-sm bg-zinc-700 rounded-t text-lime-500"
data-tab="library"
>
Python Library
</button>
<button
class="code-tab-btn px-4 py-1 text-sm bg-zinc-700 rounded-t text-lime-500"
data-tab="python"
>
Python (Request)
</button>
<button
class="code-tab-btn px-4 py-1 text-sm bg-zinc-700 rounded-t text-lime-500"
data-tab="nodejs"
>
Node.js
</button>
</div>
<div class="tab-content result bg-zinc-900 p-2 rounded h-full border border-zinc-700 text-sm">
<pre class="h-full flex relative">
<code id="curl-code" class="language-bash"></code>
<button class="absolute top-2 right-2 bg-zinc-700 text-white px-2 py-1 rounded copy-btn" data-target="curl-code">Copy</button>
</pre>
<pre class="hidden h-full flex relative">
<code id="python-code" class="language-python"></code>
<button class="absolute top-2 right-2 bg-zinc-700 text-white px-2 py-1 rounded copy-btn" data-target="python-code">Copy</button>
</pre>
<pre class="hidden h-full flex relative">
<code id="nodejs-code" class="language-javascript"></code>
<button class="absolute top-2 right-2 bg-zinc-700 text-white px-2 py-1 rounded copy-btn" data-target="nodejs-code">Copy</button>
</pre>
<pre class="hidden h-full flex relative">
<code id="library-code" class="language-python"></code>
<button class="absolute top-2 right-2 bg-zinc-700 text-white px-2 py-1 rounded copy-btn" data-target="library-code">Copy</button>
</pre>
</div>
</div>
</div>
</div>
</section>
<section class="bg-zinc-900 text-zinc-300 p-6 px-20">
<div class="grid grid-cols-2 gap-4 p-4 bg-zinc-900 text-lime-500">
<!-- Step 1 -->
<div class="col-span-2 bg-yellow-500 p-2 rounded text-zinc-900">
🌟 <strong>Welcome to the Crawl4AI Quickstart Guide! Let's dive into some web crawling fun!</strong>
</div>
<div class="bg-zinc-800 p-2 rounded">
First Step: Create an instance of WebCrawler and call the <code>warmup()</code> function.
</div>
<div>
<pre><code class="language-python">crawler = WebCrawler()
crawler.warmup()</code></pre>
</div>
<!-- Step 2 -->
<div class="col-span-2 bg-yellow-500 p-2 rounded text-zinc-900">
🧠 <strong>Understanding 'bypass_cache' and 'include_raw_html' parameters:</strong>
</div>
<div class="bg-zinc-800 p-2 rounded">First crawl (caches the result):</div>
<div>
<pre><code class="language-python">result = crawler.run(url="https://www.nbcnews.com/business")</code></pre>
</div>
<div class="bg-zinc-800 p-2 rounded">Second crawl (forces a fresh crawl):</div>
<div>
<pre><code class="language-python">result = crawler.run(url="https://www.nbcnews.com/business", bypass_cache=True)</code></pre>
</div>
<div class="bg-zinc-800 p-2 rounded">Crawl result without raw HTML content:</div>
<div>
<pre><code class="language-python">result = crawler.run(url="https://www.nbcnews.com/business", include_raw_html=False)</code></pre>
</div>
<!-- Step 3 -->
<div class="col-span-2 bg-yellow-500 p-2 rounded text-zinc-900">
📄
<strong
>The 'include_raw_html' parameter, when set to True, includes the raw HTML content in the
response. By default, it is set to True.</strong
>
</div>
<div class="bg-zinc-800 p-2 rounded">Set <code>always_by_pass_cache</code> to True:</div>
<div>
<pre><code class="language-python">crawler.always_by_pass_cache = True</code></pre>
</div>
<!-- Step 4 -->
<div class="col-span-2 bg-yellow-500 p-2 rounded text-zinc-900">
🧩 <strong>Let's add a chunking strategy: RegexChunking!</strong>
</div>
<div class="bg-zinc-800 p-2 rounded">Using RegexChunking:</div>
<div>
<pre><code class="language-python">result = crawler.run(
url="https://www.nbcnews.com/business",
chunking_strategy=RegexChunking(patterns=["\n\n"])
)</code></pre>
</div>
<div class="bg-zinc-800 p-2 rounded">Using NlpSentenceChunking:</div>
<div>
<pre><code class="language-python">result = crawler.run(
url="https://www.nbcnews.com/business",
chunking_strategy=NlpSentenceChunking()
)</code></pre>
</div>
<!-- Step 5 -->
<div class="col-span-2 bg-yellow-500 p-2 rounded text-zinc-900">
🧠 <strong>Let's get smarter with an extraction strategy: CosineStrategy!</strong>
</div>
<div class="bg-zinc-800 p-2 rounded">Using CosineStrategy:</div>
<div>
<pre><code class="language-python">result = crawler.run(
url="https://www.nbcnews.com/business",
extraction_strategy=CosineStrategy(word_count_threshold=10, max_dist=0.2, linkage_method="ward", top_k=3)
)</code></pre>
</div>
<!-- Step 6 -->
<div class="col-span-2 bg-yellow-500 p-2 rounded text-zinc-900">
🤖 <strong>Time to bring in the big guns: LLMExtractionStrategy without instructions!</strong>
</div>
<div class="bg-zinc-800 p-2 rounded">Using LLMExtractionStrategy without instructions:</div>
<div>
<pre><code class="language-python">result = crawler.run(
url="https://www.nbcnews.com/business",
extraction_strategy=LLMExtractionStrategy(provider="openai/gpt-4o", api_token=os.getenv('OPENAI_API_KEY'))
)</code></pre>
</div>
<!-- Step 7 -->
<div class="col-span-2 bg-yellow-500 p-2 rounded text-zinc-900">
📜 <strong>Let's make it even more interesting: LLMExtractionStrategy with instructions!</strong>
</div>
<div class="bg-zinc-800 p-2 rounded">Using LLMExtractionStrategy with instructions:</div>
<div>
<pre><code class="language-python">result = crawler.run(
url="https://www.nbcnews.com/business",
extraction_strategy=LLMExtractionStrategy(
provider="openai/gpt-4o",
api_token=os.getenv('OPENAI_API_KEY'),
instruction="I am interested in only financial news"
)
)</code></pre>
</div>
<!-- Step 8 -->
<div class="col-span-2 bg-yellow-500 p-2 rounded text-zinc-900">
🎯 <strong>Targeted extraction: Let's use a CSS selector to extract only H2 tags!</strong>
</div>
<div class="bg-zinc-800 p-2 rounded">Using CSS selector to extract H2 tags:</div>
<div>
<pre><code class="language-python">result = crawler.run(
url="https://www.nbcnews.com/business",
css_selector="h2"
)</code></pre>
</div>
<!-- Step 9 -->
<div class="col-span-2 bg-yellow-500 p-2 rounded text-zinc-900">
🖱️ <strong>Let's get interactive: pass JavaScript code to click the 'Load More' button!</strong>
</div>
<div class="bg-zinc-800 p-2 rounded">Using JavaScript to click the 'Load More' button:</div>
<div>
<pre><code class="language-python">js_code = """
const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More'));
loadMoreButton && loadMoreButton.click();
"""
crawler_strategy = LocalSeleniumCrawlerStrategy(js_code=js_code)
crawler = WebCrawler(crawler_strategy=crawler_strategy, always_by_pass_cache=True)
result = crawler.run(url="https://www.nbcnews.com/business")</code></pre>
</div>
<!-- Conclusion -->
<div class="col-span-2 bg-yellow-500 p-2 rounded text-zinc-900">
🎉
<strong
>Congratulations! You've made it through the Crawl4AI Quickstart Guide! Now go forth and crawl
the web like a pro! 🕸️</strong
>
</div>
</div>
</section>
<section class="bg-zinc-900 text-zinc-300 p-6 px-20">
<h1 class="text-3xl font-bold mb-4">Installation 💻</h1>
<p class="mb-4">
There are two ways to use Crawl4AI: as a library in your Python projects or as a standalone local
server.
</p>
<p class="mb-4">
You can also try Crawl4AI in Google Colab
<a href="https://colab.research.google.com/drive/1wz8u30rvbq6Scodye9AGCw8Qg_Z8QGsk"
><img
src="https://colab.research.google.com/assets/colab-badge.svg"
alt="Open In Colab"
style="display: inline-block; width: 100px; height: 20px"
/></a>
</p>
<h2 class="text-2xl font-bold mb-2">Using Crawl4AI as a Library 📚</h2>
<p class="mb-4">To install Crawl4AI as a library, follow these steps:</p>
<ol class="list-decimal list-inside mb-4">
<li class="mb-2">
Install the package from GitHub:
<pre
class="bg-zinc-800 p-4 rounded mt-2 text-zinc-100"
><code>pip install git+https://github.com/unclecode/crawl4ai.git</code></pre>
</li>
<li class="mb-2">
Alternatively, you can clone the repository and install the package locally:
<pre
class="bg-zinc-800 p-4 rounded mt-2 text-zinc-100"
><code class="language-bash">virtualenv venv
source venv/bin/activate
git clone https://github.com/unclecode/crawl4ai.git
cd crawl4ai
pip install -e .
</code></pre>
</li>
<li>
Import the necessary modules in your Python script:
<pre
class="bg-zinc-800 p-4 rounded mt-2 text-zinc-100"
><code class="language-python hljs">import os
from crawl4ai.web_crawler import WebCrawler
from crawl4ai.chunking_strategy import *
from crawl4ai.extraction_strategy import *
crawler = WebCrawler()
crawler.warmup()
# Single page crawl
result = crawler.run(
    url='https://www.nbcnews.com/business',
    word_count_threshold=5, # Minimum word count for an HTML tag to be considered a worthy block
    chunking_strategy=RegexChunking(patterns=["\\n\\n"]), # Default is RegexChunking
    extraction_strategy=CosineStrategy(word_count_threshold=10, max_dist=0.2, linkage_method='ward', top_k=3), # Default is CosineStrategy
    # extraction_strategy=LLMExtractionStrategy(provider="openai/gpt-4o", api_token=os.getenv('OPENAI_API_KEY')),
    bypass_cache=False,
    extract_blocks=True, # Whether to extract semantic blocks of text from the HTML
    css_selector="", # E.g. "div.article-body"
    verbose=True,
    include_raw_html=True, # Whether to include the raw HTML content in the response
)
print(result.model_dump())
</code></pre>
</li>
</ol>
<p class="mb-4">
For more information about how to run Crawl4AI as a local server, please refer to the
<a href="https://github.com/unclecode/crawl4ai" class="text-blue-400">GitHub repository</a>.
</p>
</section>
<section class="bg-zinc-900 text-zinc-300 p-6 px-20">
<h1 class="text-3xl font-bold mb-4">📖 Parameters</h1>
<div class="overflow-x-auto">
<table class="min-w-full bg-zinc-800 border border-zinc-700">
<thead>
<tr>
<th class="py-2 px-4 border-b border-zinc-700">Parameter</th>
<th class="py-2 px-4 border-b border-zinc-700">Description</th>
<th class="py-2 px-4 border-b border-zinc-700">Required</th>
<th class="py-2 px-4 border-b border-zinc-700">Default Value</th>
</tr>
</thead>
<tbody>
<tr>
<td class="py-2 px-4 border-b border-zinc-700">urls</td>
<td class="py-2 px-4 border-b border-zinc-700">
A list of URLs to crawl and extract data from.
</td>
<td class="py-2 px-4 border-b border-zinc-700">Yes</td>
<td class="py-2 px-4 border-b border-zinc-700">-</td>
</tr>
<tr>
<td class="py-2 px-4 border-b border-zinc-700">include_raw_html</td>
<td class="py-2 px-4 border-b border-zinc-700">
Whether to include the raw HTML content in the response.
</td>
<td class="py-2 px-4 border-b border-zinc-700">No</td>
<td class="py-2 px-4 border-b border-zinc-700">false</td>
</tr>
<tr>
<td class="py-2 px-4 border-b border-zinc-700">bypass_cache</td>
<td class="py-2 px-4 border-b border-zinc-700">
Whether to force a fresh crawl even if the URL has been previously crawled.
</td>
<td class="py-2 px-4 border-b border-zinc-700">No</td>
<td class="py-2 px-4 border-b border-zinc-700">false</td>
</tr>
<tr>
<td class="py-2 px-4 border-b border-zinc-700">extract_blocks</td>
<td class="py-2 px-4 border-b border-zinc-700">
Whether to extract semantic blocks of text from the HTML.
</td>
<td class="py-2 px-4 border-b border-zinc-700">No</td>
<td class="py-2 px-4 border-b border-zinc-700">true</td>
</tr>
<tr>
<td class="py-2 px-4 border-b border-zinc-700">word_count_threshold</td>
<td class="py-2 px-4 border-b border-zinc-700">
The minimum number of words a block must contain to be considered meaningful (minimum
value is 5).
</td>
<td class="py-2 px-4 border-b border-zinc-700">No</td>
<td class="py-2 px-4 border-b border-zinc-700">5</td>
</tr>
<tr>
<td class="py-2 px-4 border-b border-zinc-700">extraction_strategy</td>
<td class="py-2 px-4 border-b border-zinc-700">
The strategy to use for extracting content from the HTML (e.g., "CosineStrategy").
</td>
<td class="py-2 px-4 border-b border-zinc-700">No</td>
<td class="py-2 px-4 border-b border-zinc-700">CosineStrategy</td>
</tr>
<tr>
<td class="py-2 px-4 border-b border-zinc-700">chunking_strategy</td>
<td class="py-2 px-4 border-b border-zinc-700">
The strategy to use for chunking the text before processing (e.g., "RegexChunking").
</td>
<td class="py-2 px-4 border-b border-zinc-700">No</td>
<td class="py-2 px-4 border-b border-zinc-700">RegexChunking</td>
</tr>
<tr>
<td class="py-2 px-4 border-b border-zinc-700">css_selector</td>
<td class="py-2 px-4 border-b border-zinc-700">
The CSS selector to target specific parts of the HTML for extraction.
</td>
<td class="py-2 px-4 border-b border-zinc-700">No</td>
<td class="py-2 px-4 border-b border-zinc-700">None</td>
</tr>
<tr>
<td class="py-2 px-4">verbose</td>
<td class="py-2 px-4">Whether to enable verbose logging.</td>
<td class="py-2 px-4">No</td>
<td class="py-2 px-4">true</td>
</tr>
</tbody>
</table>
</div>
</section>
<section id="extraction" class="py-8 px-20">
<div class="overflow-x-auto mx-auto px-6">
<h2 class="text-2xl font-bold mb-4">Extraction Strategies</h2>
<div id="extraction-strategies" class="space-y-4"></div>
</div>
</section>
<section id="chunking" class="py-8 px-20">
<div class="overflow-x-auto mx-auto px-6">
<h2 class="text-2xl font-bold mb-4">Chunking Strategies</h2>
<div id="chunking-strategies" class="space-y-4"></div>
</div>
</section>
<section class="hero bg-zinc-900 py-8 px-20">
<div class="container mx-auto px-4">
<h2 class="text-3xl font-bold mb-4">🤔 Why build this?</h2>
<p class="text-lg mb-4">
In recent times, we've witnessed a surge of startups emerging, riding the AI hype wave and charging
for services that should rightfully be accessible to everyone. 🌍💸 One such example is scraping and
crawling web pages and transforming them into a format suitable for Large Language Models (LLMs).
🕸️🤖 We believe that building a business around this is not the right approach; instead, it should
definitely be open-source. 🆓🌟 So, if you possess the skills to build such tools and share our
philosophy, we invite you to join our "Robinhood" band and help set these products free for the
benefit of all. 🤝💪
</p>
</div>
</section>
<section class="installation py-8 px-20">
<div class="container mx-auto px-4">
<h2 class="text-2xl font-bold mb-4">⚙️ Installation</h2>
<p class="mb-4">
To install and run Crawl4AI as a library or a local server, please refer to the 📚
<a href="https://github.com/unclecode/crawl4ai" class="text-blue-400">GitHub repository</a>.
</p>
</div>
</section>
<footer class="bg-zinc-900 text-white py-4">
<div class="container mx-auto px-4">
<div class="flex justify-between items-center">
<p>© 2024 Crawl4AI. All rights reserved.</p>
<div class="social-links">
<a
href="https://github.com/unclecode/crawl4ai"
class="text-white hover:text-gray-300 mx-2"
target="_blank"
>😺 GitHub</a
>
<a
href="https://twitter.com/unclecode"
class="text-white hover:text-gray-300 mx-2"
target="_blank"
>🐦 Twitter</a
>
</div>
</div>
</div>
</footer>
<script>
// JavaScript to manage dynamic form changes and logic
document.getElementById("extraction-strategy-select").addEventListener("change", function () {
const strategy = this.value;
const providerModelSelect = document.getElementById("provider-model-select");
const tokenInput = document.getElementById("token-input");
if (strategy === "LLMExtractionStrategy") {
providerModelSelect.disabled = false;
tokenInput.disabled = false;
} else {
providerModelSelect.disabled = true;
tokenInput.disabled = true;
}
});
// Get the selected provider model and token from local storage
const storedProviderModel = localStorage.getItem("provider_model");
// Only look up a token when a provider model was actually stored
const storedToken = storedProviderModel ? localStorage.getItem(storedProviderModel) : null;
if (storedProviderModel) {
document.getElementById("provider-model-select").value = storedProviderModel;
}
if (storedToken) {
document.getElementById("token-input").value = storedToken;
}
// Handle provider model dropdown change
document.getElementById("provider-model-select").addEventListener("change", () => {
const selectedProviderModel = document.getElementById("provider-model-select").value;
const storedToken = localStorage.getItem(selectedProviderModel);
if (storedToken) {
document.getElementById("token-input").value = storedToken;
} else {
document.getElementById("token-input").value = "";
}
});
// Fetch total count from the database
axios
.get("/total-count")
.then((response) => {
document.getElementById("total-count").textContent = response.data.count;
})
.catch((error) => console.error(error));
// Handle crawl button click
document.getElementById("crawl-btn").addEventListener("click", () => {
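// Validate input: a URL is always required; an API token only for LLM extraction
const needsToken = document.getElementById("extraction-strategy-select").value === "LLMExtractionStrategy";
if (!document.getElementById("url-input").value || (needsToken && !document.getElementById("token-input").value)) {
alert(needsToken ? "Please enter both URL(s) and API token." : "Please enter at least one URL.");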
return;
}
const selectedProviderModel = document.getElementById("provider-model-select").value;
const apiToken = document.getElementById("token-input").value;
const extractBlocks = document.getElementById("extract-blocks-checkbox").checked;
const bypassCache = document.getElementById("bypass-cache-checkbox").checked;
// Save the selected provider model and token to local storage
localStorage.setItem("provider_model", selectedProviderModel);
localStorage.setItem(selectedProviderModel, apiToken);
const urlsInput = document.getElementById("url-input").value;
const urls = urlsInput.split(",").map((url) => url.trim());
const data = {
urls: urls,
provider_model: selectedProviderModel,
api_token: apiToken,
include_raw_html: true,
bypass_cache: bypassCache,
extract_blocks: extractBlocks,
word_count_threshold: parseInt(document.getElementById("threshold").value),
extraction_strategy: document.getElementById("extraction-strategy-select").value,
chunking_strategy: document.getElementById("chunking-strategy-select").value,
css_selector: document.getElementById("css-selector").value,
verbose: true,
};
// save api token to local storage
localStorage.setItem("api_token", document.getElementById("token-input").value);
document.getElementById("loading").classList.remove("hidden");
//document.getElementById("result").classList.add("hidden");
//document.getElementById("code_help").classList.add("hidden");
axios
.post("/crawl", data)
.then((response) => {
const result = response.data.results[0];
const parsedJson = JSON.parse(result.extracted_content);
document.getElementById("json-result").textContent = JSON.stringify(parsedJson, null, 2);
document.getElementById("cleaned-html-result").textContent = result.cleaned_html;
document.getElementById("markdown-result").textContent = result.markdown;
// Update code examples dynamically
const extractionStrategy = data.extraction_strategy;
const isLLMExtraction = extractionStrategy === "LLMExtractionStrategy";
document.getElementById(
"curl-code"
).textContent = `curl -X POST -H "Content-Type: application/json" -d '${JSON.stringify({
...data,
api_token: isLLMExtraction ? "your_api_token" : undefined,
})}' http://crawl4ai.uccode.io/crawl`;
document.getElementById(
"python-code"
).textContent = `import requests\n\ndata = ${JSON.stringify(
{ ...data, api_token: isLLMExtraction ? "your_api_token" : undefined },
null,
2
)}\n\nresponse = requests.post("http://crawl4ai.uccode.io/crawl", json=data) # or use localhost if you run it locally\nprint(response.json())`;
document.getElementById(
"nodejs-code"
).textContent = `const axios = require('axios');\n\nconst data = ${JSON.stringify(
{ ...data, api_token: isLLMExtraction ? "your_api_token" : undefined },
null,
2
)};\n\naxios.post("http://crawl4ai.uccode.io/crawl", data) // or use localhost if you run it locally\n .then(response => console.log(response.data))\n .catch(error => console.error(error));`;
document.getElementById(
"library-code"
).textContent = `from crawl4ai.web_crawler import WebCrawler\nfrom crawl4ai.extraction_strategy import *\nfrom crawl4ai.chunking_strategy import *\n\ncrawler = WebCrawler()\ncrawler.warmup()\n\nresult = crawler.run(\n url='${
urls[0]
}',\n word_count_threshold=${data.word_count_threshold},\n extraction_strategy=${
isLLMExtraction
? `${extractionStrategy}(provider="${data.provider_model}", api_token="your_api_token")`
: extractionStrategy + "()"
},\n chunking_strategy=${data.chunking_strategy}(),\n bypass_cache=${
data.bypass_cache ? "True" : "False"
},\n css_selector="${data.css_selector}"\n)\nprint(result)`;
// Highlight code syntax
hljs.highlightAll();
// Select JSON tab by default
document.querySelector('.tab-btn[data-tab="json"]').click();
document.getElementById("loading").classList.add("hidden");
document.getElementById("result").classList.remove("hidden");
document.getElementById("code_help").classList.remove("hidden");
// increment the total count
document.getElementById("total-count").textContent =
parseInt(document.getElementById("total-count").textContent) + 1;
})
.catch((error) => {
console.error(error);
document.getElementById("loading").classList.add("hidden");
});
});
// Handle tab clicks
document.querySelectorAll(".tab-btn").forEach((btn) => {
btn.addEventListener("click", () => {
const tab = btn.dataset.tab;
document
.querySelectorAll(".tab-btn")
.forEach((b) => b.classList.remove("bg-lime-700", "text-white"));
btn.classList.add("bg-lime-700", "text-white");
document.querySelectorAll(".tab-content.code pre").forEach((el) => el.classList.add("hidden"));
document.getElementById(`${tab}-result`).parentElement.classList.remove("hidden");
});
});
// Handle code tab clicks
document.querySelectorAll(".code-tab-btn").forEach((btn) => {
btn.addEventListener("click", () => {
const tab = btn.dataset.tab;
document
.querySelectorAll(".code-tab-btn")
.forEach((b) => b.classList.remove("bg-lime-700", "text-white"));
btn.classList.add("bg-lime-700", "text-white");
document.querySelectorAll(".tab-content.result pre").forEach((el) => el.classList.add("hidden"));
document.getElementById(`${tab}-code`).parentElement.classList.remove("hidden");
});
});
// Handle copy to clipboard button clicks
async function copyToClipboard(text) {
if (navigator.clipboard && navigator.clipboard.writeText) {
return navigator.clipboard.writeText(text);
} else {
return fallbackCopyTextToClipboard(text);
}
}
function fallbackCopyTextToClipboard(text) {
return new Promise((resolve, reject) => {
const textArea = document.createElement("textarea");
textArea.value = text;
// Avoid scrolling to bottom
textArea.style.top = "0";
textArea.style.left = "0";
textArea.style.position = "fixed";
document.body.appendChild(textArea);
textArea.focus();
textArea.select();
try {
const successful = document.execCommand("copy");
if (successful) {
resolve();
} else {
reject();
}
} catch (err) {
reject(err);
}
document.body.removeChild(textArea);
});
}
document.querySelectorAll(".copy-btn").forEach((btn) => {
btn.addEventListener("click", () => {
const target = btn.dataset.target;
const code = document.getElementById(target).textContent;
//navigator.clipboard.writeText(code).then(() => {
copyToClipboard(code).then(() => {
btn.textContent = "Copied!";
setTimeout(() => {
btn.textContent = "Copy";
}, 2000);
});
});
});
document.addEventListener("DOMContentLoaded", async () => {
try {
const extractionResponse = await fetch("/strategies/extraction");
const extractionStrategies = await extractionResponse.json();
const chunkingResponse = await fetch("/strategies/chunking");
const chunkingStrategies = await chunkingResponse.json();
renderStrategies("extraction-strategies", extractionStrategies);
renderStrategies("chunking-strategies", chunkingStrategies);
} catch (error) {
console.error("Error fetching strategies:", error);
}
});
function renderStrategies(containerId, strategies) {
const container = document.getElementById(containerId);
container.innerHTML = ""; // Clear any existing content
strategies = JSON.parse(strategies);
Object.entries(strategies).forEach(([strategy, description]) => {
const strategyElement = document.createElement("div");
strategyElement.classList.add("bg-zinc-800", "p-4", "rounded", "shadow-md", "docs-item");
const strategyDescription = document.createElement("div");
strategyDescription.classList.add("text-gray-300", "prose", "prose-sm");
strategyDescription.innerHTML = marked.parse(description);
strategyElement.appendChild(strategyDescription);
container.appendChild(strategyElement);
});
}
// Highlight code syntax
hljs.highlightAll();
</script>
</body>
</html>
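
The request parameters tabulated in the page above can be summarised as a request body for `POST /crawl`. The following is a minimal sketch, assuming the defaults listed in that table; `build_crawl_payload` and `DEFAULTS` are hypothetical names introduced here for illustration and are not part of Crawl4AI. Dropping empty URL entries is an addition of this sketch; the page's own form handler only splits on commas and trims.

```python
# Hypothetical helper, not part of Crawl4AI. Defaults are copied from the
# request-parameter table in the page above.
DEFAULTS = {
    "include_raw_html": False,
    "bypass_cache": False,
    "extract_blocks": True,
    "word_count_threshold": 5,
    "extraction_strategy": "CosineStrategy",
    "chunking_strategy": "RegexChunking",
    "css_selector": None,
    "verbose": True,
}

MIN_WORD_COUNT_THRESHOLD = 5  # the table states a minimum value of 5


def build_crawl_payload(urls, **overrides):
    """Return a JSON-serialisable body for POST /crawl."""
    if isinstance(urls, str):
        # Same idea as the page's form handler: comma-separated, trimmed.
        urls = [u.strip() for u in urls.split(",")]
    # Assumption of this sketch: silently drop empty entries.
    urls = [u for u in urls if u]
    if not urls:
        raise ValueError("'urls' is required and must contain at least one URL")

    # Later keys win, so overrides replace the table defaults.
    payload = {"urls": urls, **DEFAULTS, **overrides}
    if payload["word_count_threshold"] < MIN_WORD_COUNT_THRESHOLD:
        raise ValueError("word_count_threshold must be at least 5")
    return payload
```

The resulting dictionary could then be passed as the `json=` argument of `requests.post`, which is the pattern the page's generated Python snippet uses.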


@@ -1,73 +0,0 @@
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<title>Crawl4AI</title>
<link rel="preconnect" href="https://fonts.googleapis.com" />
<link rel="preconnect" href="https://fonts.gstatic.com" crossorigin />
<link href="https://fonts.googleapis.com/css2?family=Inter:wght@100..900&display=swap" rel="stylesheet" />
<!-- <link href="https://cdn.jsdelivr.net/npm/tailwindcss@3.4.3/dist/tailwind.min.css" rel="stylesheet" /> -->
<script src="https://cdn.tailwindcss.com"></script>
<script src="https://cdn.jsdelivr.net/npm/axios/dist/axios.min.js"></script>
<link rel="stylesheet" href="/pages/app.css" />
<link
rel="stylesheet"
href="https://cdnjs.cloudflare.com/ajax/libs/highlight.js/11.7.0/styles/monokai.min.css"
/>
<script src="https://cdn.jsdelivr.net/npm/marked/marked.min.js"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/highlight.js/11.7.0/highlight.min.js"></script>
</head>
<body class="bg-black text-gray-200">
<header class="bg-zinc-950 text-lime-500 py-4 flex">
<div class="mx-auto px-4">
<h1 class="text-2xl font-bold">🔥🕷️ Crawl4AI: Web Data for your Thoughts</h1>
</div>
<div class="mx-auto px-4 flex font-bold text-xl gap-2">
<span>📊 Total Websites Processed</span>
<span id="total-count" class="text-lime-400">2</span>
</div>
</header>
{{ try_it | safe }}
<div class="mx-auto p-4 bg-zinc-950 text-lime-500 min-h-screen">
<div class="container mx-auto">
<div class="flex h-full px-20">
<div class="sidebar w-1/4 p-4">
<h2 class="text-lg font-bold mb-4">Outline</h2>
<ul>
<li class="mb-2"><a href="#" data-target="installation">Installation</a></li>
<li class="mb-2"><a href="#" data-target="how-to-guide">How to Guide</a></li>
<li class="mb-2"><a href="#" data-target="chunking-strategies">Chunking Strategies</a></li>
<li class="mb-2">
<a href="#" data-target="extraction-strategies">Extraction Strategies</a>
</li>
</ul>
</div>
<!-- Main Content -->
<div class="w-3/4 p-4">
{{installation | safe}} {{how_to_guide | safe}}
<section id="chunking-strategies" class="content-section">
<h1 class="text-2xl font-bold">Chunking Strategies</h1>
<p>Content for chunking strategies...</p>
</section>
<section id="extraction-strategies" class="content-section">
<h1 class="text-2xl font-bold">Extraction Strategies</h1>
<p>Content for extraction strategies...</p>
</section>
</div>
</div>
</div>
</div>
{{ footer | safe }}
<script src="/pages/app.js"></script>
</body>
</html>


@@ -1,425 +0,0 @@
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<title>Crawl4AI</title>
<link rel="preconnect" href="https://fonts.googleapis.com" />
<link rel="preconnect" href="https://fonts.gstatic.com" crossorigin />
<link href="https://fonts.googleapis.com/css2?family=Inter:wght@100..900&display=swap" rel="stylesheet" />
<link href="https://cdn.jsdelivr.net/npm/tailwindcss@2.2.19/dist/tailwind.min.css" rel="stylesheet" />
<script src="https://cdn.jsdelivr.net/npm/axios/dist/axios.min.js"></script>
<link
rel="stylesheet"
href="https://cdnjs.cloudflare.com/ajax/libs/highlight.js/11.7.0/styles/vs2015.min.css"
/>
<script src="https://cdnjs.cloudflare.com/ajax/libs/highlight.js/11.7.0/highlight.min.js"></script>
<style>
:root {
--ifm-font-size-base: 100%;
--ifm-line-height-base: 1.65;
--ifm-font-family-base: system-ui, -apple-system, Segoe UI, Roboto, Ubuntu, Cantarell, Noto Sans,
sans-serif, BlinkMacSystemFont, "Segoe UI", Helvetica, Arial, sans-serif, "Apple Color Emoji",
"Segoe UI Emoji", "Segoe UI Symbol";
}
html {
-webkit-font-smoothing: antialiased;
-webkit-text-size-adjust: 100%;
text-size-adjust: 100%;
font: var(--ifm-font-size-base) / var(--ifm-line-height-base) var(--ifm-font-family-base);
}
body {
background-color: #1a202c;
color: #fff;
}
.tab-content {
max-height: 400px;
overflow: auto;
}
pre {
white-space: pre-wrap;
font-size: 14px;
}
pre code {
width: 100%;
}
</style>
</head>
<body>
<header class="bg-gray-900 text-white py-4">
<div class="container mx-auto px-4">
<h1 class="text-2xl font-bold">🔥🕷️ Crawl4AI: Open-source LLM Friendly Web scraper</h1>
</div>
</header>
<section class="try-it py-8 pb-20">
<div class="container mx-auto px-4">
<h2 class="text-2xl font-bold mb-4">Try It Now</h2>
<div class="mb-4 flex w-full gap-2">
<input
type="text"
id="url-input"
value="https://kidocode.com"
class="border border-gray-600 rounded px-4 py-2 flex-grow bg-gray-800 text-white"
placeholder="Enter URL(s) separated by commas"
/>
<select
id="provider-model-select"
class="border border-gray-600 rounded px-4 py-2 bg-gray-800 text-white"
>
<!-- Add your option values here -->
<option value="groq/llama3-70b-8192">groq/llama3-70b-8192</option>
<option value="groq/llama3-8b-8192">groq/llama3-8b-8192</option>
<option value="openai/gpt-4-turbo">gpt-4-turbo</option>
<option value="openai/gpt-3.5-turbo">gpt-3.5-turbo</option>
<option value="anthropic/claude-3-haiku-20240307">claude-3-haiku</option>
<option value="anthropic/claude-3-opus-20240229">claude-3-opus</option>
<option value="anthropic/claude-3-sonnet-20240229">claude-3-sonnet</option>
</select>
<input
type="password"
id="token-input"
class="border border-gray-600 rounded px-4 py-2 flex-grow bg-gray-800 text-white"
placeholder="Enter API token"
/>
<div class="flex items-center justify-center">
<input type="checkbox" id="extract-blocks-checkbox" class="mr-2" checked>
<label for="extract-blocks-checkbox" class="text-white">Extract Blocks</label>
</div>
<button id="crawl-btn" class="bg-blue-600 text-white px-4 py-2 rounded">Crawl</button>
</div>
<div class="grid grid-cols-1 md:grid-cols-2 gap-8">
<div id="loading" class="hidden mt-4">
<p>Loading...</p>
</div>
<div id="result" class="tab-container flex-1 h-full flex-col">
<div class="tab-buttons flex gap-2">
<button class="tab-btn px-4 py-2 bg-gray-700 rounded-t" data-tab="json">JSON</button>
<button class="tab-btn px-4 py-2 bg-gray-700 rounded-t" data-tab="cleaned-html">
Cleaned HTML
</button>
<button class="tab-btn px-4 py-2 bg-gray-700 rounded-t" data-tab="markdown">
Markdown
</button>
</div>
<div class="tab-content code bg-gray-800 p-2 rounded h-full flex-1 border border-gray-600">
<pre class="h-full flex"><code id="json-result" class="language-json "></code></pre>
<pre
class="hidden h-full flex"
><code id="cleaned-html-result" class="language-html "></code></pre>
<pre
class="hidden h-full flex"
><code id="markdown-result" class="language-markdown "></code></pre>
</div>
</div>
<div id="code_help" class="tab-container flex-1 h-full">
<div class="tab-buttons flex gap-2">
<button class="code-tab-btn px-4 py-2 bg-gray-700 rounded-t" data-tab="curl">cURL</button>
<button class="code-tab-btn px-4 py-2 bg-gray-700 rounded-t" data-tab="python">
Python
</button>
<button class="code-tab-btn px-4 py-2 bg-gray-700 rounded-t" data-tab="nodejs">
Node.js
</button>
</div>
<div class="tab-content result bg-gray-800 p-2 rounded h-full flex-1 border border-gray-600">
<pre class="h-full flex relative">
<code id="curl-code" class="language-bash"></code>
<button class="absolute top-2 right-2 bg-gray-700 text-white px-2 py-1 rounded copy-btn" data-target="curl-code">Copy</button>
</pre>
<pre class="hidden h-full flex relative">
<code id="python-code" class="language-python"></code>
<button class="absolute top-2 right-2 bg-gray-700 text-white px-2 py-1 rounded copy-btn" data-target="python-code">Copy</button>
</pre>
<pre class="hidden h-full flex relative">
<code id="nodejs-code" class="language-javascript"></code>
<button class="absolute top-2 right-2 bg-gray-700 text-white px-2 py-1 rounded copy-btn" data-target="nodejs-code">Copy</button>
</pre>
</div>
</div>
</div>
</div>
</section>
<section class="hero bg-gray-900 py-8">
<div class="container mx-auto px-4">
<h2 class="text-3xl font-bold mb-4">🤔 Why build this?</h2>
<p class="text-lg mb-4">
In recent times, we've seen numerous startups emerging, riding the AI hype wave and charging for
services that should rightfully be accessible to everyone. 🌍💸 One example is scraping and crawling
a web page and transforming it into a form suitable for LLMs. We don't think one should build a business
out of this; it should definitely be open source. So if you possess the skills to build such things
and share this philosophy, you should join our "Robinhood" band and help set
these products free. 🆓🤝
</p>
</div>
</section>
<section class="installation py-8">
<div class="container mx-auto px-4">
<h2 class="text-2xl font-bold mb-4">⚙️ Installation</h2>
<p class="mb-4">
To install and run Crawl4AI locally or on your own service, the best way is to use Docker. 🐳 Follow
these steps:
</p>
<ol class="list-decimal list-inside mb-4">
<li>
Clone the GitHub repository: 📥
<code>git clone https://github.com/unclecode/crawl4ai.git</code>
</li>
<li>Navigate to the project directory: 📂 <code>cd crawl4ai</code></li>
<li>
Build the Docker image: 🛠️ <code>docker build -t crawl4ai .</code> On Mac, follow: 🍎
<code>docker build --platform linux/amd64 -t crawl4ai .</code>
</li>
<li>Run the Docker container: ▶️ <code>docker run -p 8000:80 crawl4ai</code></li>
</ol>
<p>
For more detailed instructions and advanced configuration options, please refer to the 📚
<a href="https://github.com/unclecode/crawl4ai" class="text-blue-400">GitHub repository</a>.
</p>
</div>
</section>
<footer class="bg-gray-900 text-white py-4">
<div class="container mx-auto px-4">
<div class="flex justify-between items-center">
<p>© 2024 Crawl4AI. All rights reserved.</p>
<div class="social-links">
<a
href="https://github.com/unclecode/crawl4ai"
class="text-white hover:text-gray-300 mx-2"
target="_blank"
>😺 GitHub</a
>
<a
href="https://twitter.com/unclecode"
class="text-white hover:text-gray-300 mx-2"
target="_blank"
>🐦 Twitter</a
>
<a
href="https://discord.gg/your-invite-link"
class="text-white hover:text-gray-300 mx-2"
target="_blank"
>💬 Discord</a
>
</div>
</div>
</div>
</footer>
<script>
// Get the selected provider model and token from local storage
const storedProviderModel = localStorage.getItem("provider_model");
// Only look up a token when a provider model was actually stored
const storedToken = storedProviderModel ? localStorage.getItem(storedProviderModel) : null;
if (storedProviderModel) {
document.getElementById("provider-model-select").value = storedProviderModel;
}
if (storedToken) {
document.getElementById("token-input").value = storedToken;
}
// Handle provider model dropdown change
document.getElementById("provider-model-select").addEventListener("change", () => {
const selectedProviderModel = document.getElementById("provider-model-select").value;
const storedToken = localStorage.getItem(selectedProviderModel);
if (storedToken) {
document.getElementById("token-input").value = storedToken;
} else {
document.getElementById("token-input").value = "";
}
});
// Fetch total count from the database
axios
.get("/total-count")
.then((response) => {
document.getElementById("total-count").textContent = response.data.count;
})
.catch((error) => console.error(error));
// Handle crawl button click
document.getElementById("crawl-btn").addEventListener("click", () => {
// validate input to have both URL and API token
if (!document.getElementById("url-input").value || !document.getElementById("token-input").value) {
alert("Please enter both URL(s) and API token.");
return;
}
const selectedProviderModel = document.getElementById("provider-model-select").value;
const apiToken = document.getElementById("token-input").value;
const extractBlocks = document.getElementById("extract-blocks-checkbox").checked;
// Save the selected provider model and token to local storage
localStorage.setItem("provider_model", selectedProviderModel);
localStorage.setItem(selectedProviderModel, apiToken);
const urlsInput = document.getElementById("url-input").value;
const urls = urlsInput.split(",").map((url) => url.trim());
const data = {
urls: urls,
provider_model: selectedProviderModel,
api_token: apiToken,
include_raw_html: true,
forced: false,
extract_blocks: extractBlocks,
};
// save api token to local storage
localStorage.setItem("api_token", document.getElementById("token-input").value);
document.getElementById("loading").classList.remove("hidden");
document.getElementById("result").classList.add("hidden");
document.getElementById("code_help").classList.add("hidden");
axios
.post("/crawl", data)
.then((response) => {
const result = response.data.results[0];
const parsedJson = JSON.parse(result.extracted_content);
document.getElementById("json-result").textContent = JSON.stringify(parsedJson, null, 2);
document.getElementById("cleaned-html-result").textContent = result.cleaned_html;
document.getElementById("markdown-result").textContent = result.markdown;
// Update code examples dynamically
document.getElementById(
"curl-code"
).textContent = `curl -X POST -H "Content-Type: application/json" -d '${JSON.stringify({
...data,
api_token: "your_api_token",
})}' http://localhost:8000/crawl`;
document.getElementById(
"python-code"
).textContent = `import requests\n\ndata = ${JSON.stringify(
{ ...data, api_token: "your_api_token" },
null,
2
)}\n\nresponse = requests.post("http://localhost:8000/crawl", json=data)\nprint(response.json())`;
document.getElementById(
"nodejs-code"
).textContent = `const axios = require('axios');\n\nconst data = ${JSON.stringify(
{ ...data, api_token: "your_api_token" },
null,
2
)};\n\naxios.post("http://localhost:8000/crawl", data)\n .then(response => console.log(response.data))\n .catch(error => console.error(error));`;
// Highlight code syntax
hljs.highlightAll();
// Select JSON tab by default
document.querySelector('.tab-btn[data-tab="json"]').click();
document.getElementById("loading").classList.add("hidden");
document.getElementById("result").classList.remove("hidden");
document.getElementById("code_help").classList.remove("hidden");
})
.catch((error) => {
console.error(error);
document.getElementById("loading").classList.add("hidden");
});
});
// Handle tab clicks
document.querySelectorAll(".tab-btn").forEach((btn) => {
btn.addEventListener("click", () => {
const tab = btn.dataset.tab;
document
.querySelectorAll(".tab-btn")
.forEach((b) => b.classList.remove("bg-blue-600", "text-white"));
btn.classList.add("bg-blue-600", "text-white");
document.querySelectorAll(".tab-content.code pre").forEach((el) => el.classList.add("hidden"));
document.getElementById(`${tab}-result`).parentElement.classList.remove("hidden");
});
});
// Handle code tab clicks
document.querySelectorAll(".code-tab-btn").forEach((btn) => {
btn.addEventListener("click", () => {
const tab = btn.dataset.tab;
document
.querySelectorAll(".code-tab-btn")
.forEach((b) => b.classList.remove("bg-blue-600", "text-white"));
btn.classList.add("bg-blue-600", "text-white");
document.querySelectorAll(".tab-content.result pre").forEach((el) => el.classList.add("hidden"));
document.getElementById(`${tab}-code`).parentElement.classList.remove("hidden");
});
});
// Handle copy to clipboard button clicks
document.querySelectorAll(".copy-btn").forEach((btn) => {
btn.addEventListener("click", () => {
const target = btn.dataset.target;
const code = document.getElementById(target).textContent;
navigator.clipboard.writeText(code).then(() => {
btn.textContent = "Copied!";
setTimeout(() => {
btn.textContent = "Copy";
}, 2000);
});
});
});
document.getElementById("crawl-btn").addEventListener("click", () => {
const urlsInput = document.getElementById("url-input").value;
const urls = urlsInput.split(",").map(url => url.trim());
const apiToken = document.getElementById("token-input").value;
const selectedProviderModel = document.getElementById("provider-model-select").value;
const extractBlocks = document.getElementById("extract-blocks-checkbox").checked;
const data = {
urls: urls,
provider_model: selectedProviderModel,
api_token: apiToken,
include_raw_html: true,
forced: false,
extract_blocks: extractBlocks
};
localStorage.setItem("api_token", apiToken);
document.getElementById("loading").classList.remove("hidden");
document.getElementById("result").classList.add("hidden");
document.getElementById("code_help").classList.add("hidden");
axios.post("/crawl", data)
.then(response => {
const taskId = response.data.task_id;
pollTaskStatus(taskId);
})
.catch(error => {
console.error('Error during fetch:', error);
document.getElementById("loading").classList.add("hidden");
});
});
function pollTaskStatus(taskId) {
axios.get(`/task/${taskId}`)
.then(response => {
const task = response.data;
if (task.status === 'done') {
displayResults(task.results[0]);
} else if (task.status === 'pending') {
setTimeout(() => pollTaskStatus(taskId), 2000); // Poll every 2 seconds
} else {
console.error('Task failed:', task.error);
document.getElementById("loading").classList.add("hidden");
}
})
.catch(error => {
console.error('Error polling task status:', error);
document.getElementById("loading").classList.add("hidden");
});
}
</script>
</body>
</html>
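
The `pollTaskStatus` function in the script above checks a task every 2 seconds until it is `done` or has failed. The same loop can be sketched for a Python client. This is an illustration under stated assumptions: `poll_task` and its `fetch_status` callback are hypothetical names, the response shape (`status`, `results`, `error`) is taken from the JavaScript above, and the `max_attempts` cap is an addition of this sketch, since the original polls without a limit.

```python
import time


def poll_task(fetch_status, task_id, interval=2.0, max_attempts=30, sleep=time.sleep):
    """Poll until the task is done, fails, or the attempt budget runs out.

    fetch_status(task_id) is a caller-supplied function (hypothetical) that
    returns a dict shaped like the /task/<id> response used by the page:
    {"status": ..., "results": [...], "error": ...}.
    """
    for _ in range(max_attempts):
        task = fetch_status(task_id)
        status = task.get("status")
        if status == "done":
            # The page displays only the first result.
            return task["results"][0]
        if status != "pending":
            # Any other status is treated as a failure, as in the script.
            raise RuntimeError(f"Task failed: {task.get('error')}")
        sleep(interval)  # wait before the next poll
    raise TimeoutError(f"Task {task_id} still pending after {max_attempts} attempts")
```

Passing the fetch and sleep functions in as arguments keeps the loop independent of any HTTP library and makes it easy to exercise without a running server.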


@@ -1,36 +0,0 @@
<section class="hero bg-zinc-900 py-8 px-20 text-zinc-400">
<div class="container mx-auto px-4">
<h2 class="text-3xl font-bold mb-4">🤔 Why build this?</h2>
<p class="text-lg mb-4">
In recent times, we've witnessed a surge of startups emerging, riding the AI hype wave and charging
for services that should rightfully be accessible to everyone. 🌍💸 One such example is scraping and
crawling web pages and transforming them into a format suitable for Large Language Models (LLMs).
🕸️🤖 We believe that building a business around this is not the right approach; instead, it should
definitely be open-source. 🆓🌟 So, if you possess the skills to build such tools and share our
philosophy, we invite you to join our "Robinhood" band and help set these products free for the
benefit of all. 🤝💪
</p>
</div>
</section>
<footer class="bg-zinc-900 text-zinc-400 py-4">
<div class="container mx-auto px-4">
<div class="flex justify-between items-center">
<p>© 2024 Crawl4AI. All rights reserved.</p>
<div class="social-links">
<a
href="https://github.com/unclecode/crawl4ai"
class="text-zinc-400 hover:text-gray-300 mx-2"
target="_blank"
>😺 GitHub</a
>
<a
href="https://twitter.com/unclecode"
class="text-zinc-400 hover:text-gray-300 mx-2"
target="_blank"
>🐦 Twitter</a
>
</div>
</div>
</div>
</footer>


@@ -1,174 +0,0 @@
<section id="how-to-guide" class="content-section">
<h1 class="text-2xl font-bold">How to Guide</h1>
<div class="flex flex-col gap-4 p-4 bg-zinc-900 text-lime-500">
<!-- Step 1 -->
<div class="col-span-2 bg-lime-800 p-2 rounded text-zinc-50">
🌟
<strong
>Welcome to the Crawl4AI Quickstart Guide! Let's dive into some web crawling
fun!</strong
>
</div>
<div class="">
First Step: Create an instance of WebCrawler and call the
<code>warmup()</code> function.
</div>
<div>
<pre><code class="language-python">from crawl4ai.web_crawler import WebCrawler

crawler = WebCrawler()
crawler.warmup()</code></pre>
</div>
<!-- Step 2 -->
<div class="col-span-2 bg-lime-800 p-2 rounded text-zinc-50">
🧠 <strong>Understanding 'bypass_cache' and 'include_raw_html' parameters:</strong>
</div>
<div class="">First crawl (caches the result):</div>
<div>
<pre><code class="language-python">result = crawler.run(url="https://www.nbcnews.com/business")</code></pre>
</div>
<div class="">Second crawl (forces a fresh crawl, bypassing the cache):</div>
<div>
<pre><code class="language-python">result = crawler.run(url="https://www.nbcnews.com/business", bypass_cache=True)</code></pre>
<div class="bg-red-900 p-2 text-zinc-50">
⚠️ Don't forget to set <code>bypass_cache</code> to True if you want to try different strategies for the same URL; otherwise, the cached result is returned. You can also set <code>always_by_pass_cache</code> to True in the constructor to always bypass the cache.
</div>
</div>
<div class="">Crawl result without raw HTML content:</div>
<div>
<pre><code class="language-python">result = crawler.run(url="https://www.nbcnews.com/business", include_raw_html=False)</code></pre>
</div>
<!-- Step 3 -->
<div class="col-span-2 bg-lime-800 p-2 rounded text-zinc-50">
📄
<strong
>The 'include_raw_html' parameter, when set to True, includes the raw HTML content
in the response. By default, it is set to True.</strong
>
</div>
<div class="">Set <code>always_by_pass_cache</code> to True:</div>
<div>
<pre><code class="language-python">crawler.always_by_pass_cache = True</code></pre>
</div>
<!-- Step 3.5 Screenshot -->
<div class="col-span-2 bg-lime-800 p-2 rounded text-zinc-50">
📸
<strong>Let's take a screenshot of the page!</strong>
</div>
<div>
<pre><code class="language-python">import base64

result = crawler.run(
    url="https://www.nbcnews.com/business",
    screenshot=True
)
with open("screenshot.png", "wb") as f:
    f.write(base64.b64decode(result.screenshot))</code></pre>
</div>
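The decoding step can be tried on its own, assuming (as the snippet above implies) that the screenshot arrives as a base64-encoded string. The payload below is fake bytes rather than a real image, and <code>save_screenshot</code> is a hypothetical helper name.

```python
import base64
import tempfile
from pathlib import Path


def save_screenshot(b64_data, path):
    """Decode a base64 string and write the raw bytes to `path`."""
    data = base64.b64decode(b64_data)
    Path(path).write_bytes(data)
    return len(data)


# Fake payload standing in for result.screenshot (not a valid PNG image).
raw_bytes = b"\x89PNG\r\n\x1a\n fake image bytes"
fake_payload = base64.b64encode(raw_bytes).decode("ascii")

out_path = Path(tempfile.gettempdir()) / "screenshot_demo.png"
written = save_screenshot(fake_payload, out_path)
```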
<!-- Step 4 -->
<div class="col-span-2 bg-lime-800 p-2 rounded text-zinc-50">
🧩 <strong>Let's add a chunking strategy: RegexChunking!</strong>
</div>
<div class="">Using RegexChunking:</div>
<div>
<pre><code class="language-python">result = crawler.run(
url="https://www.nbcnews.com/business",
chunking_strategy=RegexChunking(patterns=["\n\n"])
)</code></pre>
</div>
<div class="">Using NlpSentenceChunking:</div>
<div>
<pre><code class="language-python">result = crawler.run(
url="https://www.nbcnews.com/business",
chunking_strategy=NlpSentenceChunking()
)</code></pre>
</div>
<!-- Step 5 -->
<div class="col-span-2 bg-lime-800 p-2 rounded text-zinc-50">
🧠 <strong>Let's get smarter with an extraction strategy: CosineStrategy!</strong>
</div>
<div class="">Using CosineStrategy:</div>
<div>
<pre><code class="language-python">result = crawler.run(
url="https://www.nbcnews.com/business",
extraction_strategy=CosineStrategy(word_count_threshold=10, max_dist=0.2, linkage_method="ward", top_k=3)
)</code></pre>
</div>
<!-- Step 6 -->
<div class="col-span-2 bg-lime-800 p-2 rounded text-zinc-50">
🤖
<strong
>Time to bring in the big guns: LLMExtractionStrategy without instructions!</strong
>
</div>
<div class="">Using LLMExtractionStrategy without instructions:</div>
<div>
<pre><code class="language-python">result = crawler.run(
url="https://www.nbcnews.com/business",
extraction_strategy=LLMExtractionStrategy(provider="openai/gpt-4o", api_token=os.getenv('OPENAI_API_KEY'))
)</code></pre>
</div>
<!-- Step 7 -->
<div class="col-span-2 bg-lime-800 p-2 rounded text-zinc-50">
📜
<strong
>Let's make it even more interesting: LLMExtractionStrategy with
instructions!</strong
>
</div>
<div class="">Using LLMExtractionStrategy with instructions:</div>
<div>
<pre><code class="language-python">result = crawler.run(
url="https://www.nbcnews.com/business",
extraction_strategy=LLMExtractionStrategy(
provider="openai/gpt-4o",
api_token=os.getenv('OPENAI_API_KEY'),
instruction="I am interested in only financial news"
)
)</code></pre>
</div>
<!-- Step 8 -->
<div class="col-span-2 bg-lime-800 p-2 rounded text-zinc-50">
🎯
<strong>Targeted extraction: Let's use a CSS selector to extract only H2 tags!</strong>
</div>
<div class="">Using CSS selector to extract H2 tags:</div>
<div>
<pre><code class="language-python">result = crawler.run(
url="https://www.nbcnews.com/business",
css_selector="h2"
)</code></pre>
</div>
<!-- Step 9 -->
<div class="col-span-2 bg-lime-800 p-2 rounded text-zinc-50">
🖱️
<strong
>Let's get interactive: Passing JavaScript code to click 'Load More' button!</strong
>
</div>
<div class="">Using JavaScript to click 'Load More' button:</div>
<div>
<pre><code class="language-python">js_code = ["""
const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More'));
loadMoreButton && loadMoreButton.click();
"""]
crawler = WebCrawler(always_by_pass_cache=True)
result = crawler.run(url="https://www.nbcnews.com/business", js=js_code)</code></pre>
<div class="">Remember that you can pass multiple JavaScript code snippets in the list. They are executed in the order they are passed.</div>
</div>
<!-- Conclusion -->
<div class="col-span-2 bg-lime-800 p-2 rounded text-zinc-50">
🎉
<strong
>Congratulations! You've made it through the Crawl4ai Quickstart Guide! Now go forth
and crawl the web like a pro! 🕸️</strong
>
</div>
</div>
</section>


@@ -1,65 +0,0 @@
<section id="installation" class="content-section active">
<h1 class="text-2xl font-bold">Installation 💻</h1>
<p class="mb-4">There are three ways to use Crawl4AI:</p>
<ol class="list-decimal list-inside mb-4">
<li class="">
As a library
</li>
<li class="">
As a local server (Docker)
</li>
<li class="">
As a Google Colab notebook. <a href="https://colab.research.google.com/drive/1wz8u30rvbq6Scodye9AGCw8Qg_Z8QGsk"
><img
src="https://colab.research.google.com/assets/colab-badge.svg"
alt="Open In Colab"
style="display: inline-block; width: 100px; height: 20px"
/></a>
</li>
</ol>
<p class="my-4">To install Crawl4AI as a library, follow these steps:</p>
<ol class="list-decimal list-inside mb-4">
<li class="mb-4">
Install the package from GitHub:
<pre
class="bg-zinc-800 p-4 rounded mt-2 text-zinc-100"
><code>virtualenv venv
source venv/bin/activate
pip install "crawl4ai[all] @ git+https://github.com/unclecode/crawl4ai.git"
</code></pre>
</li>
<li class="mb-4">
Run the following command to load the required models. This is optional, but it will boost the performance and speed of the crawler. You need to do this only once.
<pre
class="bg-zinc-800 p-4 rounded mt-2 text-zinc-100"
><code>crawl4ai-download-models</code></pre>
</li>
<li class="mb-4">
Alternatively, you can clone the repository and install the package locally:
<pre
class="bg-zinc-800 p-4 rounded mt-2 text-zinc-100"
><code class = "language-python bash">virtualenv venv
source venv/bin/activate
git clone https://github.com/unclecode/crawl4ai.git
cd crawl4ai
pip install -e .[all]
</code></pre>
</li>
<li class="">
Use docker to run the local server:
<pre
class="bg-zinc-800 p-4 rounded mt-2 text-zinc-100"
><code class = "language-python bash">docker build -t crawl4ai .
# For Mac users: docker build --platform linux/amd64 -t crawl4ai .
docker run -d -p 8000:80 crawl4ai</code></pre>
</li>
</ol>
<p class="mb-4">
For more information about how to run Crawl4AI as a local server, please refer to the
<a href="https://github.com/unclecode/crawl4ai" class="text-blue-400">GitHub repository</a>.
</p>
</section>


@@ -1,217 +0,0 @@
<section class="try-it py-8 px-16 pb-20 bg-zinc-900 overflow-hidden">
<div class="container mx-auto ">
<h2 class="text-2xl font-bold mb-4 text-lime-500">Try It Now</h2>
<div class="flex gap-4">
<div class="flex flex-col flex-1 gap-2">
<div class="flex flex-col">
<label for="url-input" class="text-lime-500 font-bold text-xs">URL(s)</label>
<input
type="text"
id="url-input"
value="https://www.nbcnews.com/business"
class="border border-zinc-700 rounded px-4 py-0 bg-zinc-900 text-zinc-300"
placeholder="Enter URL(s) separated by commas"
/>
</div>
<div class="flex gap-2">
<div class="flex flex-col">
<label for="threshold" class="text-lime-500 font-bold text-xs">Min Words Threshold</label>
<select
id="threshold"
class="border border-zinc-700 rounded px-4 py-1 bg-zinc-900 text-zinc-300"
>
<option value="1">1</option>
<option value="5">5</option>
<option value="10" selected>10</option>
<option value="15">15</option>
<option value="20">20</option>
<option value="25">25</option>
</select>
</div>
<div class="flex flex-col flex-1">
<label for="css-selector" class="text-lime-500 font-bold text-xs">CSS Selector</label>
<input
type="text"
id="css-selector"
class="border border-zinc-700 rounded px-4 py-0 bg-zinc-900 text-zinc-300 placeholder-lime-700"
placeholder="CSS Selector (e.g. .content, #main, article)"
/>
</div>
</div>
<div class="flex gap-2">
<div class="flex flex-col">
<label for="extraction-strategy-select" class="text-lime-500 font-bold text-xs"
>Extraction Strategy</label
>
<select
id="extraction-strategy-select"
class="border border-zinc-700 rounded px-4 py-1 bg-zinc-900 text-zinc-300"
>
<option value="NoExtractionStrategy" selected>NoExtractionStrategy</option>
<option value="CosineStrategy">CosineStrategy</option>
<option value="LLMExtractionStrategy">LLMExtractionStrategy</option>
</select>
</div>
<div class="flex flex-col">
<label for="chunking-strategy-select" class="text-lime-500 font-bold text-xs"
>Chunking Strategy</label
>
<select
id="chunking-strategy-select"
class="border border-zinc-700 rounded px-4 py-1 bg-zinc-900 text-zinc-300"
>
<option value="RegexChunking">RegexChunking</option>
<option value="NlpSentenceChunking">NlpSentenceChunking</option>
<option value="TopicSegmentationChunking">TopicSegmentationChunking</option>
<option value="FixedLengthWordChunking">FixedLengthWordChunking</option>
<option value="SlidingWindowChunking">SlidingWindowChunking</option>
</select>
</div>
</div>
<div id="llm_settings" class="flex gap-2 hidden">
<div class="flex flex-col">
<label for="provider-model-select" class="text-lime-500 font-bold text-xs"
>Provider Model</label
>
<select
id="provider-model-select"
class="border border-zinc-700 rounded px-4 py-1 bg-zinc-900 text-zinc-300"
>
<option value="groq/llama3-70b-8192">groq/llama3-70b-8192</option>
<option value="groq/llama3-8b-8192">groq/llama3-8b-8192</option>
<option value="groq/mixtral-8x7b-32768">groq/mixtral-8x7b-32768</option>
<option value="openai/gpt-4-turbo">gpt-4-turbo</option>
<option value="openai/gpt-3.5-turbo">gpt-3.5-turbo</option>
<option value="openai/gpt-4o">gpt-4o</option>
<option value="anthropic/claude-3-haiku-20240307">claude-3-haiku</option>
<option value="anthropic/claude-3-opus-20240229">claude-3-opus</option>
<option value="anthropic/claude-3-sonnet-20240229">claude-3-sonnet</option>
</select>
</div>
<div class="flex flex-col flex-1">
<label for="token-input" class="text-lime-500 font-bold text-xs">API Token</label>
<input
type="password"
id="token-input"
class="border border-zinc-700 rounded px-4 py-0 bg-zinc-900 text-zinc-300"
placeholder="Enter Groq API token"
/>
</div>
</div>
<div class="flex gap-2">
<!-- Two textareas: one for the Keyword Filter and one for the Instruction; both grow to the full width -->
<div id = "semantic_filter_div" class="flex flex-col flex-1 hidden">
<label for="semantic_filter" class="text-lime-500 font-bold text-xs">Keyword Filter</label>
<textarea
id="semantic_filter"
rows="3"
class="border border-zinc-700 rounded px-4 py-0 bg-zinc-900 text-zinc-300 placeholder-zinc-700"
placeholder="Enter keywords for CosineStrategy to narrow down the content."
></textarea>
</div>
<div id = "instruction_div" class="flex flex-col flex-1 hidden">
<label for="instruction" class="text-lime-500 font-bold text-xs">Instruction</label>
<textarea
id="instruction"
rows="3"
class="border border-zinc-700 rounded px-4 py-0 bg-zinc-900 text-zinc-300 placeholder-zinc-700"
placeholder="Enter an instruction for the LLMExtractionStrategy to guide the model."
></textarea>
</div>
</div>
<div class="flex gap-3">
<div class="flex items-center gap-2">
<input type="checkbox" id="bypass-cache-checkbox" />
<label for="bypass-cache-checkbox" class="text-lime-500 font-bold">Bypass Cache</label>
</div>
<div class="flex items-center gap-2">
<input type="checkbox" id="screenshot-checkbox" checked />
<label for="screenshot-checkbox" class="text-lime-500 font-bold">Screenshot</label>
</div>
<div class="flex items-center gap-2 hidden">
<input type="checkbox" id="extract-blocks-checkbox" />
<label for="extract-blocks-checkbox" class="text-lime-500 font-bold">Extract Blocks</label>
</div>
<button id="crawl-btn" class="bg-lime-600 text-black font-bold px-4 py-0 rounded">Crawl</button>
</div>
</div>
<div id="loading" class="hidden">
<p class="text-white">Loading... Please wait.</p>
</div>
<div id="result" class="flex-1 overflow-x-auto">
<div class="tab-buttons flex gap-2">
<button class="tab-btn px-4 py-1 text-sm bg-zinc-700 rounded-t text-lime-500" data-tab="json">
JSON
</button>
<button
class="tab-btn px-4 py-1 text-sm bg-zinc-700 rounded-t text-lime-500"
data-tab="cleaned-html"
>
Cleaned HTML
</button>
<button class="tab-btn px-4 py-1 text-sm bg-zinc-700 rounded-t text-lime-500" data-tab="markdown">
Markdown
</button>
<button class="tab-btn px-4 py-1 text-sm bg-zinc-700 rounded-t text-lime-500" data-tab="media">
Media
</button>
<button class="tab-btn px-4 py-1 text-sm bg-zinc-700 rounded-t text-lime-500" data-tab="screenshot">
Screenshot
</button>
</div>
<div class="tab-content code bg-zinc-900 p-2 rounded h-full border border-zinc-700 text-sm">
<pre class="h-full flex"><code id="json-result" class="language-json"></code></pre>
<pre class="hidden h-full flex"><code id="cleaned-html-result" class="language-html"></code></pre>
<pre class="hidden h-full flex"><code id="markdown-result" class="language-markdown"></code></pre>
<pre class="hidden h-full flex"><code id="media-result" class="language-json"></code></pre>
<pre class="hidden h-full flex"><code id="screenshot-result"></code></pre>
</div>
</div>
<div id="code_help" class="flex-1 overflow-x-auto">
<div class="tab-buttons flex gap-2">
<button class="code-tab-btn px-4 py-1 text-sm bg-zinc-700 rounded-t text-lime-500" data-tab="curl">
cURL
</button>
<button
class="code-tab-btn px-4 py-1 text-sm bg-zinc-700 rounded-t text-lime-500"
data-tab="library"
>
Python
</button>
<button
class="code-tab-btn px-4 py-1 text-sm bg-zinc-700 rounded-t text-lime-500"
data-tab="python"
>
REST API
</button>
<!-- <button
class="code-tab-btn px-4 py-1 text-sm bg-zinc-700 rounded-t text-lime-500"
data-tab="nodejs"
>
Node.js
</button> -->
</div>
<div class="tab-content result bg-zinc-900 p-2 rounded h-full border border-zinc-700 text-sm">
<pre class="h-full flex relative overflow-x-auto">
<code id="curl-code" class="language-bash"></code>
<button class="absolute top-2 right-2 bg-zinc-700 text-white px-2 py-1 rounded copy-btn" data-target="curl-code">Copy</button>
</pre>
<pre class="hidden h-full flex relative overflow-x-auto">
<code id="python-code" class="language-python"></code>
<button class="absolute top-2 right-2 bg-zinc-700 text-white px-2 py-1 rounded copy-btn" data-target="python-code">Copy</button>
</pre>
<pre class="hidden h-full flex relative overflow-x-auto">
<code id="nodejs-code" class="language-javascript"></code>
<button class="absolute top-2 right-2 bg-zinc-700 text-white px-2 py-1 rounded copy-btn" data-target="nodejs-code">Copy</button>
</pre>
<pre class="hidden h-full flex relative overflow-x-auto">
<code id="library-code" class="language-python"></code>
<button class="absolute top-2 right-2 bg-zinc-700 text-white px-2 py-1 rounded copy-btn" data-target="library-code">Copy</button>
</pre>
</div>
</div>
</div>
</div>
</section>


@@ -1,434 +0,0 @@
<div class="w-3/4 p-4">
<section id="installation" class="content-section active">
<h1 class="text-2xl font-bold">Installation 💻</h1>
<p class="mb-4">There are three ways to use Crawl4AI:</p>
<ol class="list-decimal list-inside mb-4">
<li class="">As a library</li>
<li class="">As a local server (Docker)</li>
<li class="">
As a Google Colab notebook.
<a href="https://colab.research.google.com/drive/1wz8u30rvbq6Scodye9AGCw8Qg_Z8QGsk"
><img
src="https://colab.research.google.com/assets/colab-badge.svg"
alt="Open In Colab"
style="display: inline-block; width: 100px; height: 20px"
/></a>
</li>
<p></p>
<p class="my-4">To install Crawl4AI as a library, follow these steps:</p>
<ol class="list-decimal list-inside mb-4">
<li class="mb-4">
Install the package from GitHub:
<pre
class="bg-zinc-800 p-4 rounded mt-2 text-zinc-100"
><code class="hljs language-bash">pip install git+https://github.com/unclecode/crawl4ai.git</code></pre>
</li>
<li class="mb-4">
Alternatively, you can clone the repository and install the package locally:
<pre
class="bg-zinc-800 p-4 rounded mt-2 text-zinc-100"
><code class="language-python bash hljs">virtualenv venv
source venv/<span class="hljs-built_in">bin</span>/activate
git clone https://github.com/unclecode/crawl4ai.git
cd crawl4ai
pip install -e .
</code></pre>
</li>
<li class="">
Use docker to run the local server:
<pre
class="bg-zinc-800 p-4 rounded mt-2 text-zinc-100"
><code class="language-python bash hljs">docker build -t crawl4ai .
<span class="hljs-comment"># For Mac users: docker build --platform linux/amd64 -t crawl4ai .</span>
docker run -d -p <span class="hljs-number">8000</span>:<span class="hljs-number">80</span> crawl4ai</code></pre>
</li>
</ol>
<p class="mb-4">
For more information about how to run Crawl4AI as a local server, please refer to the
<a href="https://github.com/unclecode/crawl4ai" class="text-blue-400">GitHub repository</a>.
</p>
</ol>
</section>
<section id="how-to-guide" class="content-section">
<h1 class="text-2xl font-bold">How to Guide</h1>
<div class="flex flex-col gap-4 p-4 bg-zinc-900 text-lime-500">
<!-- Step 1 -->
<div class="col-span-2 bg-lime-800 p-2 rounded text-zinc-50">
🌟
<strong>Welcome to the Crawl4ai Quickstart Guide! Let's dive into some web crawling fun!</strong>
</div>
<div class="">
First Step: Create an instance of WebCrawler and call the
<code>warmup()</code> function.
</div>
<div>
<pre><code class="language-python hljs">crawler = WebCrawler()
crawler.warmup()</code></pre>
</div>
<!-- Step 2 -->
<div class="col-span-2 bg-lime-800 p-2 rounded text-zinc-50">
🧠 <strong>Understanding 'bypass_cache' and 'include_raw_html' parameters:</strong>
</div>
<div class="">First crawl (caches the result):</div>
<div>
<pre><code class="language-python hljs">result = crawler.run(url=<span class="hljs-string">"https://www.nbcnews.com/business"</span>)</code></pre>
</div>
<div class="">Second crawl (Force to crawl again):</div>
<div>
<pre><code class="language-python hljs">result = crawler.run(url=<span class="hljs-string">"https://www.nbcnews.com/business"</span>, bypass_cache=<span class="hljs-literal">True</span>)</code></pre>
<div class="bg-red-900 p-2 text-zinc-50">
⚠️ Don't forget to set <code>bypass_cache</code> to True if you want to try different strategies
for the same URL. Otherwise, the cached result will be returned. You can also set
<code>always_by_pass_cache</code> to True in the constructor to always bypass the cache.
</div>
</div>
<div class="">Crawl result without raw HTML content:</div>
<div>
<pre><code class="language-python hljs">result = crawler.run(url=<span class="hljs-string">"https://www.nbcnews.com/business"</span>, include_raw_html=<span class="hljs-literal">False</span>)</code></pre>
</div>
<!-- Step 3 -->
<div class="col-span-2 bg-lime-800 p-2 rounded text-zinc-50">
📄
<strong
>The 'include_raw_html' parameter, when set to True, includes the raw HTML content in the response.
By default, it is set to True.</strong
>
</div>
<div class="">Set <code>always_by_pass_cache</code> to True:</div>
<div>
<pre><code class="language-python hljs">crawler.always_by_pass_cache = <span class="hljs-literal">True</span></code></pre>
</div>
<!-- Step 4 -->
<div class="col-span-2 bg-lime-800 p-2 rounded text-zinc-50">
🧩 <strong>Let's add a chunking strategy: RegexChunking!</strong>
</div>
<div class="">Using RegexChunking:</div>
<div>
<pre><code class="language-python hljs">result = crawler.run(
url=<span class="hljs-string">"https://www.nbcnews.com/business"</span>,
chunking_strategy=RegexChunking(patterns=[<span class="hljs-string">"\n\n"</span>])
)</code></pre>
</div>
<div class="">Using NlpSentenceChunking:</div>
<div>
<pre><code class="language-python hljs">result = crawler.run(
url=<span class="hljs-string">"https://www.nbcnews.com/business"</span>,
chunking_strategy=NlpSentenceChunking()
)</code></pre>
</div>
<!-- Step 5 -->
<div class="col-span-2 bg-lime-800 p-2 rounded text-zinc-50">
🧠 <strong>Let's get smarter with an extraction strategy: CosineStrategy!</strong>
</div>
<div class="">Using CosineStrategy:</div>
<div>
<pre><code class="language-python hljs">result = crawler.run(
url=<span class="hljs-string">"https://www.nbcnews.com/business"</span>,
extraction_strategy=CosineStrategy(word_count_threshold=<span class="hljs-number">20</span>, max_dist=<span class="hljs-number">0.2</span>, linkage_method=<span class="hljs-string">"ward"</span>, top_k=<span class="hljs-number">3</span>)
)</code></pre>
</div>
<!-- Step 6 -->
<div class="col-span-2 bg-lime-800 p-2 rounded text-zinc-50">
🤖
<strong>Time to bring in the big guns: LLMExtractionStrategy without instructions!</strong>
</div>
<div class="">Using LLMExtractionStrategy without instructions:</div>
<div>
<pre><code class="language-python hljs">result = crawler.run(
url=<span class="hljs-string">"https://www.nbcnews.com/business"</span>,
extraction_strategy=LLMExtractionStrategy(provider=<span class="hljs-string">"openai/gpt-4o"</span>, api_token=os.getenv(<span class="hljs-string">'OPENAI_API_KEY'</span>))
)</code></pre>
</div>
<!-- Step 7 -->
<div class="col-span-2 bg-lime-800 p-2 rounded text-zinc-50">
📜
<strong>Let's make it even more interesting: LLMExtractionStrategy with instructions!</strong>
</div>
<div class="">Using LLMExtractionStrategy with instructions:</div>
<div>
<pre><code class="language-python hljs">result = crawler.run(
url=<span class="hljs-string">"https://www.nbcnews.com/business"</span>,
extraction_strategy=LLMExtractionStrategy(
provider=<span class="hljs-string">"openai/gpt-4o"</span>,
api_token=os.getenv(<span class="hljs-string">'OPENAI_API_KEY'</span>),
instruction=<span class="hljs-string">"I am interested in only financial news"</span>
)
)</code></pre>
</div>
<!-- Step 8 -->
<div class="col-span-2 bg-lime-800 p-2 rounded text-zinc-50">
🎯
<strong>Targeted extraction: Let's use a CSS selector to extract only H2 tags!</strong>
</div>
<div class="">Using CSS selector to extract H2 tags:</div>
<div>
<pre><code class="language-python hljs">result = crawler.run(
url=<span class="hljs-string">"https://www.nbcnews.com/business"</span>,
css_selector=<span class="hljs-string">"h2"</span>
)</code></pre>
</div>
<!-- Step 9 -->
<div class="col-span-2 bg-lime-800 p-2 rounded text-zinc-50">
🖱️
<strong>Let's get interactive: Passing JavaScript code to click 'Load More' button!</strong>
</div>
<div class="">Using JavaScript to click 'Load More' button:</div>
<div>
<pre><code class="language-python hljs">js_code = <span class="hljs-string">"""
const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button =&gt; button.textContent.includes('Load More'));
loadMoreButton &amp;&amp; loadMoreButton.click();
"""</span>
crawler_strategy = LocalSeleniumCrawlerStrategy(js_code=js_code)
crawler = WebCrawler(crawler_strategy=crawler_strategy, always_by_pass_cache=<span class="hljs-literal">True</span>)
result = crawler.run(url=<span class="hljs-string">"https://www.nbcnews.com/business"</span>)</code></pre>
</div>
<!-- Conclusion -->
<div class="col-span-2 bg-lime-800 p-2 rounded text-zinc-50">
🎉
<strong
>Congratulations! You've made it through the Crawl4ai Quickstart Guide! Now go forth and crawl the
web like a pro! 🕸️</strong
>
</div>
</div>
</section>
<section id="chunking-strategies" class="content-section">
<div class="bg-zinc-800 p-4 rounded shadow-md docs-item">
<div class="text-gray-300 prose prose-sm">
<h3>RegexChunking</h3>
<p>
<code>RegexChunking</code> is a text chunking strategy that splits a given text into smaller parts
using regular expressions. This is useful for preparing large texts for processing by language
models, ensuring they are divided into manageable segments.
</p>
<h4>Constructor Parameters:</h4>
<ul>
<li>
<code>patterns</code> (list, optional): A list of regular expression patterns used to split the
text. Default is to split by double newlines (<code>['\n\n']</code>).
</li>
</ul>
<h4>Example usage:</h4>
<pre><code class="language-python">chunker = RegexChunking(patterns=[r'\n\n', r'\. '])
chunks = chunker.chunk("This is a sample text. It will be split into chunks.")
</code></pre>
</div>
</div>
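The splitting idea behind <code>RegexChunking</code> can be sketched with the standard <code>re</code> module alone. This is a minimal approximation under the assumption that patterns are applied one after another; <code>regex_chunk</code> is a hypothetical helper, not the library's code.

```python
import re


def regex_chunk(text, patterns=None):
    """Split `text` on each regex pattern in turn (illustrative sketch only)."""
    patterns = patterns or [r"\n\n"]   # default mirrors the documented default
    chunks = [text]
    for pattern in patterns:
        next_chunks = []
        for chunk in chunks:
            # Each existing chunk is split again by the next pattern.
            next_chunks.extend(re.split(pattern, chunk))
        chunks = next_chunks
    # Drop empty or whitespace-only pieces left behind by the splits.
    return [c for c in chunks if c.strip()]


sample = "First paragraph. Still first.\n\nSecond paragraph."
chunks = regex_chunk(sample, [r"\n\n", r"\. "])
```

Note that <code>re.split</code> consumes the matched separator, so the ". " between sentences does not appear in the resulting chunks.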
<div class="bg-zinc-800 p-4 rounded shadow-md docs-item">
<div class="text-gray-300 prose prose-sm">
<h3>NlpSentenceChunking</h3>
<p>
<code>NlpSentenceChunking</code> uses a natural language processing model to chunk a given text into
sentences. This approach leverages SpaCy to accurately split text based on sentence boundaries.
</p>
<h4>Constructor Parameters:</h4>
<ul>
<li>
None.
</li>
</ul>
<h4>Example usage:</h4>
<pre><code class="language-python">chunker = NlpSentenceChunking()
chunks = chunker.chunk("This is a sample text. It will be split into sentences.")
</code></pre>
</div>
</div>
<div class="bg-zinc-800 p-4 rounded shadow-md docs-item">
<div class="text-gray-300 prose prose-sm">
<h3>TopicSegmentationChunking</h3>
<p>
<code>TopicSegmentationChunking</code> uses the TextTiling algorithm to segment a given text into
topic-based chunks. This method identifies thematic boundaries in the text.
</p>
<h4>Constructor Parameters:</h4>
<ul>
<li>
<code>num_keywords</code> (int, optional): The number of keywords to extract for each topic
segment. Default is <code>3</code>.
</li>
</ul>
<h4>Example usage:</h4>
<pre><code class="language-python">chunker = TopicSegmentationChunking(num_keywords=3)
chunks = chunker.chunk("This is a sample text. It will be split into topic-based segments.")
</code></pre>
</div>
</div>
<div class="bg-zinc-800 p-4 rounded shadow-md docs-item">
<div class="text-gray-300 prose prose-sm">
<h3>FixedLengthWordChunking</h3>
<p>
<code>FixedLengthWordChunking</code> splits a given text into chunks of fixed length, based on the
number of words.
</p>
<h4>Constructor Parameters:</h4>
<ul>
<li>
<code>chunk_size</code> (int, optional): The number of words in each chunk. Default is
<code>100</code>.
</li>
</ul>
<h4>Example usage:</h4>
<pre><code class="language-python">chunker = FixedLengthWordChunking(chunk_size=100)
chunks = chunker.chunk("This is a sample text. It will be split into fixed-length word chunks.")
</code></pre>
</div>
</div>
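Fixed-length word chunking comes down to slicing a word list. The sketch below assumes whitespace tokenization, which the description above does not confirm; <code>fixed_length_word_chunks</code> is a hypothetical name.

```python
def fixed_length_word_chunks(text, chunk_size=100):
    """Group whitespace-separated words into chunks of `chunk_size` words."""
    words = text.split()
    return [
        " ".join(words[i:i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]


chunks = fixed_length_word_chunks("one two three four five", chunk_size=2)
```

The last chunk may be shorter than <code>chunk_size</code> when the word count is not an exact multiple.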
<div class="bg-zinc-800 p-4 rounded shadow-md docs-item">
<div class="text-gray-300 prose prose-sm">
<h3>SlidingWindowChunking</h3>
<p>
<code>SlidingWindowChunking</code> uses a sliding window approach to chunk a given text. Each chunk
has a fixed length, and the window slides by a specified step size.
</p>
<h4>Constructor Parameters:</h4>
<ul>
<li>
<code>window_size</code> (int, optional): The number of words in each chunk. Default is
<code>100</code>.
</li>
<li>
<code>step</code> (int, optional): The number of words to slide the window. Default is
<code>50</code>.
</li>
</ul>
<h4>Example usage:</h4>
<pre><code class="language-python">chunker = SlidingWindowChunking(window_size=100, step=50)
chunks = chunker.chunk("This is a sample text. It will be split using a sliding window approach.")
</code></pre>
</div>
</div>
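A sliding window differs from fixed-length chunking in that consecutive chunks overlap whenever <code>step</code> is smaller than <code>window_size</code>. The sketch below is one possible reading of the description, again assuming whitespace tokenization; <code>sliding_window_chunks</code> is a hypothetical name.

```python
def sliding_window_chunks(text, window_size=100, step=50):
    """Return overlapping word windows (illustrative sketch only)."""
    words = text.split()
    if not words:
        return []
    if len(words) <= window_size:
        # Text shorter than one window: return it as a single chunk.
        return [" ".join(words)]
    chunks = []
    # In this sketch, trailing words that do not fill a whole window are dropped.
    for start in range(0, len(words) - window_size + 1, step):
        chunks.append(" ".join(words[start:start + window_size]))
    return chunks


chunks = sliding_window_chunks("a b c d e f", window_size=4, step=2)
```

With a window of 4 and a step of 2, each chunk shares its last two words with the start of the next one.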
</section>
<section id="extraction-strategies" class="content-section">
<div class="bg-zinc-800 p-4 rounded shadow-md docs-item">
<div class="text-gray-300 prose prose-sm">
<h3>NoExtractionStrategy</h3>
<p>
<code>NoExtractionStrategy</code> is a basic extraction strategy that returns the entire HTML
content without any modification. It is useful for cases where no specific extraction is required
and only the cleaned HTML and Markdown are needed.
</p>
<h4>Constructor Parameters:</h4>
<p>None.</p>
<h4>Example usage:</h4>
<pre><code class="language-python">extractor = NoExtractionStrategy()
extracted_content = extractor.extract(url, html)
</code></pre>
</div>
</div>
<div class="bg-zinc-800 p-4 rounded shadow-md docs-item">
<div class="text-gray-300 prose prose-sm">
<h3>LLMExtractionStrategy</h3>
<p>
<code>LLMExtractionStrategy</code> uses a large language model (LLM) to extract meaningful blocks or
chunks from the given HTML content. This strategy relies on an external provider for language model
completions.
</p>
<h4>Constructor Parameters:</h4>
<ul>
<li>
<code>provider</code> (str, optional): The provider to use for the language model completions.
Default is <code>DEFAULT_PROVIDER</code> (e.g., openai/gpt-4).
</li>
<li>
<code>api_token</code> (str, optional): The API token for the provider. If not provided, it will
try to load from the environment variable <code>OPENAI_API_KEY</code>.
</li>
<li>
<code>instruction</code> (str, optional): An instruction to guide the LLM on how to perform the
extraction. This allows users to specify the type of data they are interested in or set the tone
of the response. Default is <code>None</code>.
</li>
</ul>
<h4>Example usage:</h4>
<pre><code class="language-python">extractor = LLMExtractionStrategy(provider='openai/gpt-4o', api_token='your_api_token', instruction='Extract only news about AI.')
extracted_content = extractor.extract(url, html)
</code></pre>
<p>
By providing clear instructions, users can tailor the extraction process to their specific needs,
enhancing the relevance and utility of the extracted content.
</p>
</div>
</div>
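The token fallback described for <code>api_token</code> (explicit argument first, then the environment) might look roughly like this. <code>resolve_api_token</code> and the <code>DEMO_API_KEY</code> variable are hypothetical names used only for illustration; the library's actual lookup may differ.

```python
import os


def resolve_api_token(api_token=None, env_var="OPENAI_API_KEY"):
    """Prefer an explicit token; otherwise fall back to an environment variable."""
    token = api_token or os.getenv(env_var)
    if not token:
        raise ValueError(f"No API token given and {env_var} is not set")
    return token


# Demo with a throwaway variable so a real key is never touched.
os.environ["DEMO_API_KEY"] = "token-from-env"
explicit = resolve_api_token("token-from-argument", env_var="DEMO_API_KEY")
fallback = resolve_api_token(env_var="DEMO_API_KEY")
```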
<div class="bg-zinc-800 p-4 rounded shadow-md docs-item">
<div class="text-gray-300 prose prose-sm">
<h3>CosineStrategy</h3>
<p>
<code>CosineStrategy</code> uses hierarchical clustering based on cosine similarity to extract
clusters of text from the given HTML content. This strategy is suitable for identifying related
content sections.
</p>
<h4>Constructor Parameters:</h4>
<ul>
<li>
<code>semantic_filter</code> (str, optional): A string containing keywords for filtering relevant
documents before clustering. If provided, documents are filtered based on their cosine
similarity to the keyword filter embedding. Default is <code>None</code>.
</li>
<li>
<code>word_count_threshold</code> (int, optional): Minimum number of words per cluster. Default
is <code>20</code>.
</li>
<li>
<code>max_dist</code> (float, optional): The maximum cophenetic distance on the dendrogram to
form clusters. Default is <code>0.2</code>.
</li>
<li>
<code>linkage_method</code> (str, optional): The linkage method for hierarchical clustering.
Default is <code>'ward'</code>.
</li>
<li>
<code>top_k</code> (int, optional): Number of top categories to extract. Default is
<code>3</code>.
</li>
<li>
<code>model_name</code> (str, optional): The model name for embedding generation. Default is
<code>'BAAI/bge-small-en-v1.5'</code>.
</li>
</ul>
<h4>Example usage:</h4>
<pre><code class="language-python">extractor = CosineStrategy(semantic_filter='artificial intelligence', word_count_threshold=10, max_dist=0.2, linkage_method='ward', top_k=3, model_name='BAAI/bge-small-en-v1.5')
extracted_content = extractor.extract(url, html)
</code></pre>
<h4>Cosine Similarity Filtering</h4>
<p>
When a <code>semantic_filter</code> is provided, the <code>CosineStrategy</code> applies an
embedding-based filtering process to select relevant documents before performing hierarchical
clustering.
</p>
</div>
</div>
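The filtering step can be illustrated with plain Python. The vectors below are made-up toy embeddings, and the <code>threshold</code> value is an assumption: the description above does not say how the cutoff is chosen, and a real run would use embeddings from the configured model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0   # treat zero vectors as unrelated
    return dot / (norm_a * norm_b)


def filter_by_similarity(doc_vectors, query_vector, threshold=0.5):
    """Keep the names of documents whose similarity to the query meets the threshold."""
    return [
        name
        for name, vector in doc_vectors.items()
        if cosine_similarity(vector, query_vector) >= threshold
    ]


# Toy 3-dimensional "embeddings" (hypothetical values).
docs = {
    "ai_article": [0.9, 0.1, 0.0],
    "sports_recap": [0.0, 0.2, 0.9],
    "ml_tutorial": [0.8, 0.3, 0.1],
}
query = [1.0, 0.0, 0.0]
kept = filter_by_similarity(docs, query, threshold=0.5)
```

Only the documents that survive this filter would then be passed on to hierarchical clustering.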
<div class="bg-zinc-800 p-4 rounded shadow-md docs-item">
<div class="text-gray-300 prose prose-sm">
<h3>TopicExtractionStrategy</h3>
<p>
<code>TopicExtractionStrategy</code> uses the TextTiling algorithm to segment the HTML content into
topics and extracts keywords for each segment. This strategy is useful for identifying and
summarizing thematic content.
</p>
<h4>Constructor Parameters:</h4>
<ul>
<li>
<code>num_keywords</code> (int, optional): Number of keywords to represent each topic segment.
Default is <code>3</code>.
</li>
</ul>
<h4>Example usage:</h4>
<pre><code class="language-python">extractor = TopicExtractionStrategy(num_keywords=3)
extracted_content = extractor.extract(url, html)
</code></pre>
</div>
</div>
</section>
</div>


@@ -1,5 +0,0 @@
-r requirements.txt
pytest
pytest-asyncio
selenium
setuptools


@@ -1,11 +1,16 @@
aiosqlite~=0.20
html2text~=2024.2
lxml~=5.3
litellm~=1.48
litellm>=1.53.1
numpy>=1.26.0,<3
pillow~=10.4
playwright>=1.47,<1.48
playwright>=1.49.0
python-dotenv~=1.0
requests~=2.26
beautifulsoup4~=4.12
playwright_stealth~=1.0
tf-playwright-stealth>=1.1.0
xxhash~=3.4
rank-bm25~=0.2
aiofiles>=24.1.0
colorama~=0.4
snowballstemmer~=2.2
pydantic>=2.10


@@ -1,60 +1,51 @@
from setuptools import setup, find_packages
from setuptools.command.install import install
import os
from pathlib import Path
import shutil
import subprocess
import sys
# Create the .crawl4ai folder in the user's home directory if it doesn't exist
# If the folder already exists, remove the cache folder
crawl4ai_folder = os.getenv("CRAWL4_AI_BASE_DIRECTORY", Path.home()) / ".crawl4ai"
base_dir = os.getenv("CRAWL4_AI_BASE_DIRECTORY")
crawl4ai_folder = Path(base_dir) if base_dir else Path.home()
crawl4ai_folder = crawl4ai_folder / ".crawl4ai"
cache_folder = crawl4ai_folder / "cache"
content_folders = [
"html_content",
"cleaned_html",
"markdown_content",
"extracted_content",
"screenshots",
]
# Clean up old cache if exists
if cache_folder.exists():
shutil.rmtree(cache_folder)
# Create new folder structure
crawl4ai_folder.mkdir(exist_ok=True)
cache_folder.mkdir(exist_ok=True)
for folder in content_folders:
(crawl4ai_folder / folder).mkdir(exist_ok=True)
# Read the requirements from requirements.txt
# Read requirements and version
__location__ = os.path.realpath(os.path.join(os.getcwd(), os.path.dirname(__file__)))
with open(os.path.join(__location__, "requirements.txt")) as f:
requirements = f.read().splitlines()
# Read version from __init__.py
with open("crawl4ai/_version.py") as f:
with open("crawl4ai/__version__.py") as f:
for line in f:
if line.startswith("__version__"):
version = line.split("=")[1].strip().strip('"')
break
# Define the requirements for different environments
# Define requirements
default_requirements = requirements
# torch_requirements = ["torch", "nltk", "spacy", "scikit-learn"]
# transformer_requirements = ["transformers", "tokenizers", "onnxruntime"]
torch_requirements = ["torch", "nltk", "scikit-learn"]
torch_requirements = ["torch", "nltk", "scikit-learn"]
transformer_requirements = ["transformers", "tokenizers"]
cosine_similarity_requirements = ["torch", "transformers", "nltk" ]
cosine_similarity_requirements = ["torch", "transformers", "nltk"]
sync_requirements = ["selenium"]
def install_playwright():
print("Installing Playwright browsers...")
try:
subprocess.check_call([sys.executable, "-m", "playwright", "install"])
print("Playwright installation completed successfully.")
except subprocess.CalledProcessError as e:
print(f"Error during Playwright installation: {e}")
print("Please run 'python -m playwright install' manually after the installation.")
except Exception as e:
print(f"Unexpected error during Playwright installation: {e}")
print("Please run 'python -m playwright install' manually after the installation.")
class PostInstallCommand(install):
def run(self):
install.run(self)
install_playwright()
setup(
name="Crawl4AI",
version=version,
@@ -66,17 +57,27 @@ setup(
author_email="unclecode@kidocode.com",
license="MIT",
packages=find_packages(),
install_requires=default_requirements + ["playwright"], # Add playwright to default requirements
package_data={
'crawl4ai': ['js_snippet/*.js'] # This matches the exact path structure
},
install_requires=default_requirements
+ ["playwright", "aiofiles"], # Added aiofiles
extras_require={
"torch": torch_requirements,
"transformer": transformer_requirements,
"cosine": cosine_similarity_requirements,
"sync": sync_requirements,
"all": default_requirements + torch_requirements + transformer_requirements + cosine_similarity_requirements + sync_requirements,
"all": default_requirements
+ torch_requirements
+ transformer_requirements
+ cosine_similarity_requirements
+ sync_requirements,
},
entry_points={
'console_scripts': [
'crawl4ai-download-models=crawl4ai.model_loader:main',
"console_scripts": [
"crawl4ai-download-models=crawl4ai.model_loader:main",
"crawl4ai-migrate=crawl4ai.migrations:main",
'crawl4ai-setup=crawl4ai.install:post_install',
],
},
classifiers=[
@@ -90,7 +91,4 @@ setup(
"Programming Language :: Python :: 3.10",
],
python_requires=">=3.7",
cmdclass={
'install': PostInstallCommand,
},
)
)
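The change to how `crawl4ai_folder` is resolved in this file matters when `CRAWL4_AI_BASE_DIRECTORY` is set: `os.getenv` then returns a `str`, and a `str` cannot be joined to another `str` with `/`. The sketch below reproduces the new logic in a small helper. The name `resolve_crawl4ai_folder` and its `env` parameter are hypothetical, introduced only so the behaviour can be shown without touching the real environment.

```python
import os
from pathlib import Path


def resolve_crawl4ai_folder(env=None) -> Path:
    """Mirror the new setup.py logic: wrap the env value in Path before joining."""
    env = os.environ if env is None else env
    base_dir = env.get("CRAWL4_AI_BASE_DIRECTORY")
    base = Path(base_dir) if base_dir else Path.home()
    return base / ".crawl4ai"


# With the variable set, the string is converted to a Path first.
print(resolve_crawl4ai_folder({"CRAWL4_AI_BASE_DIRECTORY": "/tmp/crawl"}))

# Without it, the user's home directory is used.
print(resolve_crawl4ai_folder({}))

# The one-line form breaks when the variable is set, because both operands are str.
try:
    "/tmp/crawl" / ".crawl4ai"
except TypeError as exc:
    print("TypeError:", exc)
```

The old one-line form only worked when the variable was unset, since the `Path.home()` default is already a `Path`.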

File diff suppressed because one or more lines are too long

View File

@@ -0,0 +1,153 @@
import os, sys
parent_dir = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
sys.path.append(parent_dir)
__location__ = os.path.realpath(os.path.join(os.getcwd(), os.path.dirname(__file__)))
import asyncio
from crawl4ai import AsyncWebCrawler, CacheMode
from crawl4ai.content_filter_strategy import PruningContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
# Assuming that the changes made allow different configurations
# for managed browser, persistent context, and so forth.
async def test_default_headless():
async with AsyncWebCrawler(
headless=True,
verbose=True,
user_agent_mode="random",
user_agent_generator_config={"device_type": "mobile", "os_type": "android"},
use_managed_browser=False,
use_persistent_context=False,
ignore_https_errors=True,
# Testing normal ephemeral context
) as crawler:
result = await crawler.arun(
url='https://www.kidocode.com/degrees/technology',
cache_mode=CacheMode.BYPASS,
markdown_generator=DefaultMarkdownGenerator(options={"ignore_links": True}),
)
print("[test_default_headless] success:", result.success)
print("HTML length:", len(result.html if result.html else ""))
async def test_managed_browser_persistent():
# Treating use_persistent_context=True as managed_browser scenario.
async with AsyncWebCrawler(
headless=False,
verbose=True,
user_agent_mode="random",
user_agent_generator_config={"device_type": "desktop", "os_type": "mac"},
use_managed_browser=True,
use_persistent_context=True, # now should behave same as managed browser
user_data_dir="./output/test_profile",
# This should store and reuse profile data across runs
) as crawler:
result = await crawler.arun(
url='https://www.google.com',
cache_mode=CacheMode.BYPASS,
markdown_generator=DefaultMarkdownGenerator(options={"ignore_links": True})
)
print("[test_managed_browser_persistent] success:", result.success)
print("HTML length:", len(result.html if result.html else ""))
async def test_session_reuse():
# Test creating a session, using it for multiple calls
session_id = "my_session"
async with AsyncWebCrawler(
headless=False,
verbose=True,
user_agent="Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
# Fixed user-agent for consistency
use_managed_browser=False,
use_persistent_context=False,
) as crawler:
# First call: create session
result1 = await crawler.arun(
url='https://www.example.com',
cache_mode=CacheMode.BYPASS,
session_id=session_id,
markdown_generator=DefaultMarkdownGenerator(options={"ignore_links": True})
)
print("[test_session_reuse first call] success:", result1.success)
# Second call: reuse the same session, so cookies may be retained
result2 = await crawler.arun(
url='https://www.example.com/about',
cache_mode=CacheMode.BYPASS,
session_id=session_id,
markdown_generator=DefaultMarkdownGenerator(options={"ignore_links": True})
)
print("[test_session_reuse second call] success:", result2.success)
async def test_magic_mode():
# Test magic mode with override_navigator and simulate_user
async with AsyncWebCrawler(
headless=False,
verbose=True,
user_agent_mode="random",
user_agent_generator_config={"device_type": "desktop", "os_type": "windows"},
use_managed_browser=False,
use_persistent_context=False,
magic=True,
override_navigator=True,
simulate_user=True,
) as crawler:
result = await crawler.arun(
url='https://www.kidocode.com/degrees/business',
cache_mode=CacheMode.BYPASS,
markdown_generator=DefaultMarkdownGenerator(options={"ignore_links": True})
)
print("[test_magic_mode] success:", result.success)
print("HTML length:", len(result.html if result.html else ""))
async def test_proxy_settings():
# Test with a proxy (if available) to ensure code runs with proxy
async with AsyncWebCrawler(
headless=True,
verbose=False,
user_agent="Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
proxy="http://127.0.0.1:8080", # Assuming local proxy server for test
use_managed_browser=False,
use_persistent_context=False,
) as crawler:
result = await crawler.arun(
url='https://httpbin.org/ip',
cache_mode=CacheMode.BYPASS,
markdown_generator=DefaultMarkdownGenerator(options={"ignore_links": True})
)
print("[test_proxy_settings] success:", result.success)
if result.success:
print("HTML preview:", result.html[:200] if result.html else "")
async def test_ignore_https_errors():
# Test ignoring HTTPS errors against a domain with a self-signed or invalid certificate.
# The target must be a URL that triggers an SSL error;
# self-signed.badssl.com serves a self-signed certificate.
async with AsyncWebCrawler(
headless=True,
verbose=True,
user_agent="Mozilla/5.0",
ignore_https_errors=True,
use_managed_browser=False,
use_persistent_context=False,
) as crawler:
result = await crawler.arun(
url='https://self-signed.badssl.com/',
cache_mode=CacheMode.BYPASS,
markdown_generator=DefaultMarkdownGenerator(options={"ignore_links": True})
)
print("[test_ignore_https_errors] success:", result.success)
async def main():
print("Running tests...")
# await test_default_headless()
# await test_managed_browser_persistent()
# await test_session_reuse()
# await test_magic_mode()
# await test_proxy_settings()
await test_ignore_https_errors()
if __name__ == "__main__":
asyncio.run(main())

View File

@@ -0,0 +1,231 @@
import os, sys
parent_dir = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
sys.path.append(parent_dir)
__location__ = os.path.realpath(os.path.join(os.getcwd(), os.path.dirname(__file__)))
import asyncio
from crawl4ai import AsyncWebCrawler, CacheMode
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig
from crawl4ai.content_filter_strategy import PruningContentFilter
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
from crawl4ai.chunking_strategy import RegexChunking
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
# Category 1: Browser Configuration Tests
async def test_browser_config_object():
"""Test the new BrowserConfig object with various browser settings"""
browser_config = BrowserConfig(
browser_type="chromium",
headless=False,
viewport_width=1920,
viewport_height=1080,
use_managed_browser=True,
user_agent_mode="random",
user_agent_generator_config={"device_type": "desktop", "os_type": "windows"}
)
async with AsyncWebCrawler(config=browser_config, verbose=True) as crawler:
result = await crawler.arun('https://example.com', cache_mode=CacheMode.BYPASS)
assert result.success, "Browser config crawl failed"
assert len(result.html) > 0, "No HTML content retrieved"
async def test_browser_performance_config():
"""Test browser configurations focused on performance"""
browser_config = BrowserConfig(
text_only=True,
light_mode=True,
extra_args=['--disable-gpu', '--disable-software-rasterizer'],
ignore_https_errors=True,
java_script_enabled=False
)
async with AsyncWebCrawler(config=browser_config) as crawler:
result = await crawler.arun('https://example.com')
assert result.success, "Performance optimized crawl failed"
assert result.status_code == 200, "Unexpected status code"
# Category 2: Content Processing Tests
async def test_content_extraction_config():
"""Test content extraction with various strategies"""
crawler_config = CrawlerRunConfig(
word_count_threshold=300,
extraction_strategy=JsonCssExtractionStrategy(
schema={
"name": "article",
"baseSelector": "div",
"fields": [{
"name": "title",
"selector": "h1",
"type": "text"
}]
}
),
chunking_strategy=RegexChunking(),
content_filter=PruningContentFilter()
)
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(
'https://example.com/article',
config=crawler_config
)
assert result.extracted_content is not None, "Content extraction failed"
assert 'title' in result.extracted_content, "Missing expected content field"
# Category 3: Cache and Session Management Tests
async def test_cache_and_session_management():
"""Test different cache modes and session handling"""
browser_config = BrowserConfig(use_persistent_context=True)
crawler_config = CrawlerRunConfig(
cache_mode=CacheMode.WRITE_ONLY,
process_iframes=True,
remove_overlay_elements=True
)
async with AsyncWebCrawler(config=browser_config) as crawler:
# First request - should write to cache
result1 = await crawler.arun(
'https://example.com',
config=crawler_config
)
# Second request - should use fresh fetch due to WRITE_ONLY mode
result2 = await crawler.arun(
'https://example.com',
config=crawler_config
)
assert result1.success and result2.success, "Cache mode crawl failed"
assert result1.html == result2.html, "Inconsistent results between requests"
# Category 4: Media Handling Tests
async def test_media_handling_config():
"""Test configurations related to media handling"""
# Make sure the downloads directory under the home directory (~/.crawl4ai/downloads) exists
os.makedirs(os.path.expanduser("~/.crawl4ai/downloads"), exist_ok=True)
browser_config = BrowserConfig(
viewport_width=1920,
viewport_height=1080,
accept_downloads=True,
downloads_path= os.path.expanduser("~/.crawl4ai/downloads")
)
crawler_config = CrawlerRunConfig(
screenshot=True,
pdf=True,
adjust_viewport_to_content=True,
wait_for_images=True,
screenshot_height_threshold=20000
)
async with AsyncWebCrawler(config=browser_config) as crawler:
result = await crawler.arun(
'https://example.com',
config=crawler_config
)
assert result.screenshot is not None, "Screenshot capture failed"
assert result.pdf is not None, "PDF generation failed"
# Category 5: Anti-Bot and Site Interaction Tests
async def test_antibot_config():
"""Test configurations for handling anti-bot measures"""
crawler_config = CrawlerRunConfig(
simulate_user=True,
override_navigator=True,
magic=True,
wait_for="js:()=>document.querySelector('body')",
delay_before_return_html=1.0,
log_console=True,
cache_mode=CacheMode.BYPASS
)
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(
'https://example.com',
config=crawler_config
)
assert result.success, "Anti-bot measure handling failed"
# Category 6: Parallel Processing Tests
async def test_parallel_processing():
"""Test parallel processing capabilities"""
crawler_config = CrawlerRunConfig(
mean_delay=0.5,
max_range=1.0,
semaphore_count=5
)
urls = [
'https://example.com/1',
'https://example.com/2',
'https://example.com/3'
]
async with AsyncWebCrawler() as crawler:
results = await crawler.arun_many(
urls,
config=crawler_config
)
assert len(results) == len(urls), "Not all URLs were processed"
assert all(r.success for r in results), "Some parallel requests failed"
# Category 7: Backwards Compatibility Tests
async def test_legacy_parameter_support():
"""Test that legacy parameters still work"""
async with AsyncWebCrawler(
headless=True,
browser_type="chromium",
viewport_width=1024,
viewport_height=768
) as crawler:
result = await crawler.arun(
'https://example.com',
screenshot=True,
word_count_threshold=200,
bypass_cache=True,
css_selector=".main-content"
)
assert result.success, "Legacy parameter support failed"
# Category 8: Mixed Configuration Tests
async def test_mixed_config_usage():
"""Test mixing new config objects with legacy parameters"""
browser_config = BrowserConfig(headless=True)
crawler_config = CrawlerRunConfig(screenshot=True)
async with AsyncWebCrawler(
config=browser_config,
verbose=True # legacy parameter
) as crawler:
result = await crawler.arun(
'https://example.com',
config=crawler_config,
cache_mode=CacheMode.BYPASS, # legacy parameter
css_selector="body" # legacy parameter
)
assert result.success, "Mixed configuration usage failed"
if __name__ == "__main__":
async def run_tests():
test_functions = [
test_browser_config_object,
# test_browser_performance_config,
# test_content_extraction_config,
# test_cache_and_session_management,
# test_media_handling_config,
# test_antibot_config,
# test_parallel_processing,
# test_legacy_parameter_support,
# test_mixed_config_usage
]
for test in test_functions:
print(f"\nRunning {test.__name__}...")
try:
await test()
print(f"{test.__name__} passed")
except AssertionError as e:
print(f"{test.__name__} failed: {str(e)}")
except Exception as e:
print(f"{test.__name__} error: {str(e)}")
asyncio.run(run_tests())
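The runner at the end of this file reports assertion failures separately from unexpected exceptions. A minimal, self-contained sketch of that pattern is shown below. The three stub coroutines are hypothetical stand-ins for the real crawler tests, and the runner returns a dictionary instead of only printing so the outcome can be inspected.

```python
import asyncio


async def passing_test():
    assert 1 + 1 == 2


async def failing_test():
    assert 1 + 1 == 3, "arithmetic is broken"


async def erroring_test():
    raise RuntimeError("browser did not start")


async def run_tests(test_functions):
    """Run coroutine tests in order and classify each outcome."""
    outcomes = {}
    for test in test_functions:
        try:
            await test()
            outcomes[test.__name__] = "passed"
        except AssertionError as e:
            # The test ran but an expectation did not hold.
            outcomes[test.__name__] = f"failed: {e}"
        except Exception as e:
            # The test could not complete (setup, network, browser, ...).
            outcomes[test.__name__] = f"error: {e}"
    return outcomes


outcomes = asyncio.run(run_tests([passing_test, failing_test, erroring_test]))
for name, outcome in outcomes.items():
    print(f"{name} {outcome}")
```

`AssertionError` has to be caught before the generic `Exception` clause, since it is a subclass of `Exception` and would otherwise be reported as an error.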

View File

@@ -0,0 +1,229 @@
import os
import sys
import asyncio
import shutil
from typing import List
import tempfile
import time
# Add the parent directory to the Python path
parent_dir = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
sys.path.append(parent_dir)
from crawl4ai.async_webcrawler import AsyncWebCrawler
class TestDownloads:
def __init__(self):
self.temp_dir = tempfile.mkdtemp(prefix="crawl4ai_test_")
self.download_dir = os.path.join(self.temp_dir, "downloads")
os.makedirs(self.download_dir, exist_ok=True)
self.results: List[str] = []
def cleanup(self):
shutil.rmtree(self.temp_dir)
def log_result(self, test_name: str, success: bool, message: str = ""):
result = f"{'✅' if success else '❌'} {test_name}: {message}"
self.results.append(result)
print(result)
async def test_basic_download(self):
"""Test basic file download functionality"""
try:
async with AsyncWebCrawler(
accept_downloads=True,
downloads_path=self.download_dir,
verbose=True
) as crawler:
# Python.org downloads page typically has stable download links
result = await crawler.arun(
url="https://www.python.org/downloads/",
js_code="""
// Click first download link
const downloadLink = document.querySelector('a[href$=".exe"]');
if (downloadLink) downloadLink.click();
"""
)
success = result.downloaded_files is not None and len(result.downloaded_files) > 0
self.log_result(
"Basic Download",
success,
f"Downloaded {len(result.downloaded_files or [])} files" if success else "No files downloaded"
)
except Exception as e:
self.log_result("Basic Download", False, str(e))
async def test_persistent_context_download(self):
"""Test downloads with persistent context"""
try:
user_data_dir = os.path.join(self.temp_dir, "user_data")
os.makedirs(user_data_dir, exist_ok=True)
async with AsyncWebCrawler(
accept_downloads=True,
downloads_path=self.download_dir,
use_persistent_context=True,
user_data_dir=user_data_dir,
verbose=True
) as crawler:
result = await crawler.arun(
url="https://www.python.org/downloads/",
js_code="""
const downloadLink = document.querySelector('a[href$=".exe"]');
if (downloadLink) downloadLink.click();
"""
)
success = result.downloaded_files is not None and len(result.downloaded_files) > 0
self.log_result(
"Persistent Context Download",
success,
f"Downloaded {len(result.downloaded_files or [])} files" if success else "No files downloaded"
)
except Exception as e:
self.log_result("Persistent Context Download", False, str(e))
async def test_multiple_downloads(self):
"""Test multiple simultaneous downloads"""
try:
async with AsyncWebCrawler(
accept_downloads=True,
downloads_path=self.download_dir,
verbose=True
) as crawler:
result = await crawler.arun(
url="https://www.python.org/downloads/",
js_code="""
// Click multiple download links
const downloadLinks = document.querySelectorAll('a[href$=".exe"]');
downloadLinks.forEach(link => link.click());
"""
)
success = result.downloaded_files is not None and len(result.downloaded_files) > 1
self.log_result(
"Multiple Downloads",
success,
f"Downloaded {len(result.downloaded_files or [])} files" if success else "Not enough files downloaded"
)
except Exception as e:
self.log_result("Multiple Downloads", False, str(e))
async def test_different_browsers(self):
"""Test downloads across different browser types"""
browsers = ["chromium", "firefox", "webkit"]
for browser_type in browsers:
try:
async with AsyncWebCrawler(
accept_downloads=True,
downloads_path=self.download_dir,
browser_type=browser_type,
verbose=True
) as crawler:
result = await crawler.arun(
url="https://www.python.org/downloads/",
js_code="""
const downloadLink = document.querySelector('a[href$=".exe"]');
if (downloadLink) downloadLink.click();
"""
)
success = result.downloaded_files is not None and len(result.downloaded_files) > 0
self.log_result(
f"{browser_type.title()} Download",
success,
f"Downloaded {len(result.downloaded_files or [])} files" if success else "No files downloaded"
)
except Exception as e:
self.log_result(f"{browser_type.title()} Download", False, str(e))
async def test_edge_cases(self):
"""Test various edge cases"""
# Test 1: Downloads without specifying download path
try:
async with AsyncWebCrawler(
accept_downloads=True,
verbose=True
) as crawler:
result = await crawler.arun(
url="https://www.python.org/downloads/",
js_code="document.querySelector('a[href$=\".exe\"]').click()"
)
self.log_result(
"Default Download Path",
True,
f"Downloaded to default path: {result.downloaded_files[0] if result.downloaded_files else 'None'}"
)
except Exception as e:
self.log_result("Default Download Path", False, str(e))
# Test 2: Downloads with invalid path
try:
async with AsyncWebCrawler(
accept_downloads=True,
downloads_path="/invalid/path/that/doesnt/exist",
verbose=True
) as crawler:
result = await crawler.arun(
url="https://www.python.org/downloads/",
js_code="document.querySelector('a[href$=\".exe\"]').click()"
)
self.log_result("Invalid Download Path", False, "Should have raised an error")
except Exception as e:
self.log_result("Invalid Download Path", True, "Correctly handled invalid path")
# Test 3: Download with accept_downloads=False
try:
async with AsyncWebCrawler(
accept_downloads=False,
verbose=True
) as crawler:
result = await crawler.arun(
url="https://www.python.org/downloads/",
js_code="document.querySelector('a[href$=\".exe\"]').click()"
)
success = result.downloaded_files is None
self.log_result(
"Disabled Downloads",
success,
"Correctly ignored downloads" if success else "Unexpectedly downloaded files"
)
except Exception as e:
self.log_result("Disabled Downloads", False, str(e))
async def run_all_tests(self):
"""Run all test cases"""
print("\n🧪 Running Download Tests...\n")
test_methods = [
self.test_basic_download,
self.test_persistent_context_download,
self.test_multiple_downloads,
self.test_different_browsers,
self.test_edge_cases
]
for test in test_methods:
print(f"\n📝 Running {test.__doc__}...")
await test()
await asyncio.sleep(2) # Brief pause between tests
print("\n📊 Test Results Summary:")
for result in self.results:
print(result)
successes = len([r for r in self.results if '✅' in r])
total = len(self.results)
print(f"\nTotal: {successes}/{total} tests passed")
self.cleanup()
async def main():
tester = TestDownloads()
await tester.run_all_tests()
if __name__ == "__main__":
asyncio.run(main())
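The summary in `run_all_tests` counts successes by searching the formatted result strings for a marker. One alternative, sketched here under the assumption that a small refactor is acceptable, is to store the success flag as data and format it only when printing. The `Outcome` class and `summarize` function are hypothetical and do not exist in the test file.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Outcome:
    name: str
    success: bool
    message: str = ""

    def render(self) -> str:
        """Format one result line for display."""
        marker = "PASS" if self.success else "FAIL"
        return f"[{marker}] {self.name}: {self.message}"


def summarize(outcomes: List[Outcome]) -> str:
    """Count passes from the stored flag rather than from the rendered text."""
    passed = sum(1 for o in outcomes if o.success)
    return f"Total: {passed}/{len(outcomes)} tests passed"


outcomes = [
    Outcome("Basic Download", True, "Downloaded 1 files"),
    Outcome("Invalid Download Path", False, "Should have raised an error"),
]
for outcome in outcomes:
    print(outcome.render())
print(summarize(outcomes))
```

Keeping the flag separate means the count cannot be thrown off by a test name or message that happens to contain the marker text.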

View File

@@ -0,0 +1,175 @@
import os, sys
import pytest
from bs4 import BeautifulSoup
from typing import List
# Add the parent directory to the Python path
parent_dir = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
sys.path.append(parent_dir)
from crawl4ai.content_filter_strategy import BM25ContentFilter
@pytest.fixture
def basic_html():
return """
<html>
<head>
<title>Test Article</title>
<meta name="description" content="Test description">
<meta name="keywords" content="test, keywords">
</head>
<body>
<h1>Main Heading</h1>
<article>
<p>This is a long paragraph with more than fifty words. It continues with more text to ensure we meet the minimum word count threshold. We need to make sure this paragraph is substantial enough to be considered for extraction according to our filtering rules. This should be enough words now.</p>
<div class="navigation">Skip this nav content</div>
</article>
</body>
</html>
"""
@pytest.fixture
def wiki_html():
return """
<html>
<head>
<title>Wikipedia Article</title>
</head>
<body>
<h1>Article Title</h1>
<h2>Section 1</h2>
<p>Short but important section header description.</p>
<div class="content">
<p>Long paragraph with sufficient words to meet the minimum threshold. This paragraph continues with more text to ensure we have enough content for proper testing. We need to make sure this has enough words to pass our filters and be considered valid content for extraction purposes.</p>
</div>
</body>
</html>
"""
@pytest.fixture
def no_meta_html():
return """
<html>
<body>
<h1>Simple Page</h1>
<p>First paragraph that should be used as fallback for query when no meta tags exist. This text needs to be long enough to serve as a meaningful fallback for our content extraction process.</p>
</body>
</html>
"""
class TestBM25ContentFilter:
def test_basic_extraction(self, basic_html):
"""Test basic content extraction functionality"""
filter = BM25ContentFilter()
contents = filter.filter_content(basic_html)
assert contents, "Should extract content"
assert len(contents) >= 1, "Should extract at least one content block"
assert "long paragraph" in ' '.join(contents).lower()
assert "navigation" not in ' '.join(contents).lower()
def test_user_query_override(self, basic_html):
"""Test that user query overrides metadata extraction"""
user_query = "specific test query"
filter = BM25ContentFilter(user_query=user_query)
# Access internal state to verify query usage
soup = BeautifulSoup(basic_html, 'lxml')
extracted_query = filter.extract_page_query(soup.find('head'))
assert extracted_query == user_query
assert "Test description" not in extracted_query
def test_header_extraction(self, wiki_html):
"""Test that headers are properly extracted despite length"""
filter = BM25ContentFilter()
contents = filter.filter_content(wiki_html)
combined_content = ' '.join(contents).lower()
assert "section 1" in combined_content, "Should include section header"
assert "article title" in combined_content, "Should include main title"
def test_no_metadata_fallback(self, no_meta_html):
"""Test fallback behavior when no metadata is present"""
filter = BM25ContentFilter()
contents = filter.filter_content(no_meta_html)
assert contents, "Should extract content even without metadata"
assert "First paragraph" in ' '.join(contents), "Should use first paragraph content"
def test_empty_input(self):
"""Test handling of empty input"""
filter = BM25ContentFilter()
assert filter.filter_content("") == []
assert filter.filter_content(None) == []
def test_malformed_html(self):
"""Test handling of malformed HTML"""
malformed_html = "<p>Unclosed paragraph<div>Nested content</p></div>"
filter = BM25ContentFilter()
contents = filter.filter_content(malformed_html)
assert isinstance(contents, list), "Should return list even with malformed HTML"
def test_threshold_behavior(self, basic_html):
"""Test different BM25 threshold values"""
strict_filter = BM25ContentFilter(bm25_threshold=2.0)
lenient_filter = BM25ContentFilter(bm25_threshold=0.5)
strict_contents = strict_filter.filter_content(basic_html)
lenient_contents = lenient_filter.filter_content(basic_html)
assert len(strict_contents) <= len(lenient_contents), \
"Strict threshold should extract fewer elements"
def test_html_cleaning(self, basic_html):
"""Test HTML cleaning functionality"""
filter = BM25ContentFilter()
contents = filter.filter_content(basic_html)
cleaned_content = ' '.join(contents)
assert 'class=' not in cleaned_content, "Should remove class attributes"
assert 'style=' not in cleaned_content, "Should remove style attributes"
assert '<script' not in cleaned_content, "Should remove script tags"
def test_large_content(self):
"""Test handling of large content blocks"""
large_html = f"""
<html><body>
<article>{'<p>Test content. ' * 1000}</article>
</body></html>
"""
filter = BM25ContentFilter()
contents = filter.filter_content(large_html)
assert contents, "Should handle large content blocks"
@pytest.mark.parametrize("unwanted_tag", [
'script', 'style', 'nav', 'footer', 'header'
])
def test_excluded_tags(self, unwanted_tag):
"""Test that specific tags are properly excluded"""
html = f"""
<html><body>
<{unwanted_tag}>Should not appear</{unwanted_tag}>
<p>Should appear</p>
</body></html>
"""
filter = BM25ContentFilter()
contents = filter.filter_content(html)
combined_content = ' '.join(contents).lower()
assert "should not appear" not in combined_content
def test_performance(self, basic_html):
"""Test performance with timer"""
filter = BM25ContentFilter()
import time
start = time.perf_counter()
filter.filter_content(basic_html)
duration = time.perf_counter() - start
assert duration < 1.0, f"Processing took too long: {duration:.2f} seconds"
if __name__ == "__main__":
pytest.main([__file__])
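`test_threshold_behavior` relies on the idea that a higher BM25 threshold keeps no more blocks than a lower one. The sketch below shows why with a plain-Python Okapi BM25 scorer. It is an assumption-laden illustration, not the code used by `BM25ContentFilter` or by the `rank-bm25` package: the tokenizer is a simple whitespace split, the IDF uses the non-negative `log(... + 1)` variant, and `bm25_scores` and `filter_blocks` are hypothetical names.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query: str, blocks: List[str], k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score each text block against the query with Okapi BM25."""
    tokenized = [block.lower().split() for block in blocks]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs if n_docs else 0.0
    # Number of blocks that contain each term at least once.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            tf = counts[term]
            if tf == 0:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            length_norm = 1 - b + b * len(tokens) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


def filter_blocks(query: str, blocks: List[str], threshold: float) -> List[str]:
    """Keep blocks whose BM25 score reaches the threshold."""
    return [blk for blk, s in zip(blocks, bm25_scores(query, blocks)) if s >= threshold]


blocks = [
    "test article about keywords and test description",
    "skip this nav content",
    "a long paragraph that mentions the test only once among many other words",
]
query = "test keywords"
lenient = filter_blocks(query, blocks, threshold=0.5)
strict = filter_blocks(query, blocks, threshold=2.0)
print(len(strict), "<=", len(lenient))
```

Because both filters use the same scores, every block kept by the strict threshold is also kept by the lenient one, which is the property the test asserts with `<=`.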

View File

@@ -0,0 +1,159 @@
import os, sys
import pytest
from bs4 import BeautifulSoup
parent_dir = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
sys.path.append(parent_dir)
from crawl4ai.content_filter_strategy import PruningContentFilter
@pytest.fixture
def basic_html():
return """
<html>
<body>
<article>
<h1>Main Article</h1>
<p>This is a high-quality paragraph with substantial text content. It contains enough words to pass the threshold and has good text density without too many links. This kind of content should survive the pruning process.</p>
<div class="sidebar">Low quality sidebar content</div>
<div class="social-share">Share buttons</div>
</article>
</body>
</html>
"""
@pytest.fixture
def link_heavy_html():
return """
<html>
<body>
<div class="content">
<p>Good content paragraph that should remain.</p>
<div class="links">
<a href="#">Link 1</a>
<a href="#">Link 2</a>
<a href="#">Link 3</a>
<a href="#">Link 4</a>
</div>
</div>
</body>
</html>
"""
@pytest.fixture
def mixed_content_html():
return """
<html>
<body>
<article>
<h1>Article Title</h1>
<p class="summary">Short summary.</p>
<div class="content">
<p>Long high-quality paragraph with substantial content that should definitely survive the pruning process. This content has good text density and proper formatting which makes it valuable for retention.</p>
</div>
<div class="comments">
<p>Short comment 1</p>
<p>Short comment 2</p>
</div>
</article>
</body>
</html>
"""
class TestPruningContentFilter:
def test_basic_pruning(self, basic_html):
"""Test basic content pruning functionality"""
filter = PruningContentFilter(min_word_threshold=5)
contents = filter.filter_content(basic_html)
combined_content = ' '.join(contents).lower()
assert "high-quality paragraph" in combined_content
assert "sidebar content" not in combined_content
assert "share buttons" not in combined_content
def test_min_word_threshold(self, mixed_content_html):
"""Test minimum word threshold filtering"""
filter = PruningContentFilter(min_word_threshold=10)
contents = filter.filter_content(mixed_content_html)
combined_content = ' '.join(contents).lower()
assert "short summary" not in combined_content
assert "long high-quality paragraph" in combined_content
assert "short comment" not in combined_content
def test_threshold_types(self, basic_html):
"""Test fixed vs dynamic thresholds"""
fixed_filter = PruningContentFilter(threshold_type='fixed', threshold=0.48)
dynamic_filter = PruningContentFilter(threshold_type='dynamic', threshold=0.45)
fixed_contents = fixed_filter.filter_content(basic_html)
dynamic_contents = dynamic_filter.filter_content(basic_html)
assert len(fixed_contents) != len(dynamic_contents), \
"Fixed and dynamic thresholds should yield different results"
def test_link_density_impact(self, link_heavy_html):
"""Test handling of link-heavy content"""
filter = PruningContentFilter(threshold_type='dynamic')
contents = filter.filter_content(link_heavy_html)
combined_content = ' '.join(contents).lower()
assert "good content paragraph" in combined_content
assert len([c for c in contents if 'href' in c]) < 2, \
"Should prune link-heavy sections"
def test_tag_importance(self, mixed_content_html):
"""Test tag importance in scoring"""
filter = PruningContentFilter(threshold_type='dynamic')
contents = filter.filter_content(mixed_content_html)
has_article = any('article' in c.lower() for c in contents)
has_h1 = any('h1' in c.lower() for c in contents)
assert has_article or has_h1, "Should retain important tags"
def test_empty_input(self):
"""Test handling of empty input"""
filter = PruningContentFilter()
assert filter.filter_content("") == []
assert filter.filter_content(None) == []
def test_malformed_html(self):
"""Test handling of malformed HTML"""
malformed_html = "<div>Unclosed div<p>Nested<span>content</div>"
filter = PruningContentFilter()
contents = filter.filter_content(malformed_html)
assert isinstance(contents, list)
def test_performance(self, basic_html):
"""Test performance with timer"""
filter = PruningContentFilter()
import time
start = time.perf_counter()
filter.filter_content(basic_html)
duration = time.perf_counter() - start
# Strict bound: filtering a small fixture should finish well under 100 ms
assert duration < 0.1, f"Processing took too long: {duration:.3f} seconds"
@pytest.mark.parametrize("threshold,expected_count", [
(0.3, 4), # Very lenient
(0.48, 2), # Default
(0.7, 1), # Very strict
])
def test_threshold_levels(self, mixed_content_html, threshold, expected_count):
"""Test different threshold levels"""
filter = PruningContentFilter(threshold_type='fixed', threshold=threshold)
contents = filter.filter_content(mixed_content_html)
assert len(contents) <= expected_count, \
f"Expected {expected_count} or fewer elements with threshold {threshold}"
def test_consistent_output(self, basic_html):
"""Test output consistency across multiple runs"""
filter = PruningContentFilter()
first_run = filter.filter_content(basic_html)
second_run = filter.filter_content(basic_html)
assert first_run == second_run, "Output should be consistent"
if __name__ == "__main__":
pytest.main([__file__])
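
The single-shot timing in `test_performance` can fail spuriously on a loaded machine. Below is a minimal sketch of a steadier measurement, assuming a hypothetical helper `best_of` (not part of crawl4ai or pytest) that keeps the fastest of several runs. The lambda workload is a stand-in for `filter.filter_content(basic_html)`.

```python
import time
from typing import Callable, List


def best_of(fn: Callable[[], object], repeats: int = 5) -> float:
    """Return the fastest wall-clock time in seconds over `repeats` calls to fn.

    Hypothetical helper for illustration only.
    """
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    timings: List[float] = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    # The minimum is the run least disturbed by other processes.
    return min(timings)


# Stand-in workload; in the test this would call the content filter.
duration = best_of(lambda: sum(range(10_000)))
assert duration < 0.1, f"Processing took too long: {duration:.3f} seconds"
```

Taking the minimum rather than the mean keeps one slow outlier from failing the assertion, at the cost of running the workload several times.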


@@ -0,0 +1,162 @@
import asyncio
import csv
import os
import sys
import time
from dataclasses import dataclass
from typing import Any, Dict, List
from bs4 import BeautifulSoup
from tabulate import tabulate
# Make the repository root importable when this file is run directly
parent_dir = os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
sys.path.append(parent_dir)
__location__ = os.path.realpath(os.path.join(os.getcwd(), os.path.dirname(__file__)))
from crawl4ai.content_scraping_strategy import WebScrapingStrategy
# Both names currently resolve to the same class; swap in the commented import below to compare against another implementation
from crawl4ai.content_scraping_strategy import WebScrapingStrategy as WebScrapingStrategyCurrent
# from crawl4ai.content_scrapping_strategy_current import WebScrapingStrategy as WebScrapingStrategyCurrent
@dataclass
class TestResult:
name: str
success: bool
images: int
internal_links: int
external_links: int
markdown_length: int
execution_time: float
class StrategyTester:
def __init__(self):
self.new_scraper = WebScrapingStrategy()
self.current_scraper = WebScrapingStrategyCurrent()
with open(__location__ + '/sample_wikipedia.html', 'r', encoding='utf-8') as f:
self.WIKI_HTML = f.read()
self.results = {'new': [], 'current': []}
def run_test(self, name: str, **kwargs) -> tuple[TestResult, TestResult]:
results = []
for scraper in [self.new_scraper, self.current_scraper]:
start_time = time.time()
result = scraper._get_content_of_website_optimized(
url="https://en.wikipedia.org/wiki/Test",
html=self.WIKI_HTML,
**kwargs
)
execution_time = time.time() - start_time
test_result = TestResult(
name=name,
success=result['success'],
images=len(result['media']['images']),
internal_links=len(result['links']['internal']),
external_links=len(result['links']['external']),
markdown_length=len(result['markdown']),
execution_time=execution_time
)
results.append(test_result)
return results[0], results[1] # new, current
def run_all_tests(self):
test_cases = [
("Basic Extraction", {}),
("Exclude Tags", {'excluded_tags': ['table', 'div.infobox', 'div.navbox']}),
("Word Threshold", {'word_count_threshold': 50}),
("CSS Selector", {'css_selector': 'div.mw-parser-output > p'}),
("Link Exclusions", {
'exclude_external_links': True,
'exclude_social_media_links': True,
'exclude_domains': ['facebook.com', 'twitter.com']
}),
("Media Handling", {
'exclude_external_images': True,
'image_description_min_word_threshold': 20
}),
("Text Only", {
'only_text': True,
'remove_forms': True
}),
("HTML Cleaning", {
'clean_html': True,
'keep_data_attributes': True
}),
("HTML2Text Options", {
'html2text': {
'skip_internal_links': True,
'single_line_break': True,
'mark_code': True,
'preserve_tags': ['pre', 'code']
}
})
]
all_results = []
for name, kwargs in test_cases:
try:
new_result, current_result = self.run_test(name, **kwargs)
all_results.append((name, new_result, current_result))
except Exception as e:
print(f"Error in {name}: {str(e)}")
self.save_results_to_csv(all_results)
self.print_comparison_table(all_results)
def save_results_to_csv(self, all_results: List[tuple]):
csv_file = os.path.join(__location__, 'strategy_comparison_results.csv')
with open(csv_file, 'w', newline='') as f:
writer = csv.writer(f)
writer.writerow(['Test Name', 'Strategy', 'Success', 'Images', 'Internal Links',
'External Links', 'Markdown Length', 'Execution Time'])
for name, new_result, current_result in all_results:
writer.writerow([name, 'New', new_result.success, new_result.images,
new_result.internal_links, new_result.external_links,
new_result.markdown_length, f"{new_result.execution_time:.3f}"])
writer.writerow([name, 'Current', current_result.success, current_result.images,
current_result.internal_links, current_result.external_links,
current_result.markdown_length, f"{current_result.execution_time:.3f}"])
def print_comparison_table(self, all_results: List[tuple]):
table_data = []
headers = ['Test Name', 'Strategy', 'Success', 'Images', 'Internal Links',
'External Links', 'Markdown Length', 'Time (s)']
for name, new_result, current_result in all_results:
# Check for differences
differences = []
if new_result.images != current_result.images: differences.append('images')
if new_result.internal_links != current_result.internal_links: differences.append('internal_links')
if new_result.external_links != current_result.external_links: differences.append('external_links')
if new_result.markdown_length != current_result.markdown_length: differences.append('markdown')
# Add row for new strategy
new_row = [
name, 'New', new_result.success, new_result.images,
new_result.internal_links, new_result.external_links,
new_result.markdown_length, f"{new_result.execution_time:.3f}"
]
table_data.append(new_row)
# Add row for current strategy
current_row = [
'', 'Current', current_result.success, current_result.images,
current_result.internal_links, current_result.external_links,
current_result.markdown_length, f"{current_result.execution_time:.3f}"
]
table_data.append(current_row)
# Add difference summary if any
if differences:
table_data.append(['', '⚠️ Differences', ', '.join(differences), '', '', '', '', ''])
# Add empty row for better readability
table_data.append([''] * len(headers))
print("\nStrategy Comparison Results:")
print(tabulate(table_data, headers=headers, tablefmt='grid'))
if __name__ == "__main__":
tester = StrategyTester()
tester.run_all_tests()
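
The four hand-written comparisons in `print_comparison_table` can be expressed as one loop over field names. This is a minimal sketch, assuming a hypothetical `Result` dataclass that mirrors the fields of `TestResult` above and a hypothetical helper `diff_fields`. It reports `markdown_length` under its field name rather than the shorter `markdown` label used in the table.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Result:
    """Stand-in for TestResult with the same fields (illustration only)."""
    name: str
    success: bool
    images: int
    internal_links: int
    external_links: int
    markdown_length: int
    execution_time: float


# Fields whose values should match between the two strategies.
COMPARED_FIELDS = ("images", "internal_links", "external_links", "markdown_length")


def diff_fields(new: Result, current: Result,
                names: Sequence[str] = COMPARED_FIELDS) -> List[str]:
    """Return the names of the compared fields whose values differ."""
    return [n for n in names if getattr(new, n) != getattr(current, n)]


new = Result("Basic Extraction", True, 3, 10, 5, 1200, 0.012)
current = Result("Basic Extraction", True, 4, 10, 5, 1180, 0.015)
print(", ".join(diff_fields(new, current)))
```

Execution time is deliberately left out of the comparison, since it varies from run to run even when the output is identical.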


@@ -0,0 +1,165 @@
# ## Issue #236
# - **Last Updated:** 2024-11-11 01:42:14
# - **Title:** [user data crawling opens two windows, unable to control correct user browser](https://github.com/unclecode/crawl4ai/issues/236)
# - **State:** open
import os
import sys
import time
import asyncio
from typing import Dict, Any
# Make the repository root importable when this file is run directly
parent_dir = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
sys.path.append(parent_dir)
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
# Directory containing this file; used to locate data/ and output/
__location__ = os.path.realpath(os.path.join(os.getcwd(), os.path.dirname(__file__)))
def print_test_result(name: str, result: Dict[str, Any], execution_time: float):
    """Helper function to print test results."""
    print(f"\n{'='*20} {name} {'='*20}")
    print(f"Execution time: {execution_time:.4f} seconds")
    # Save each markdown variant to its own file under output/
    output_dir = os.path.join(__location__, "output")
    os.makedirs(output_dir, exist_ok=True)  # open() fails if the directory is missing
    for key, content in result.items():
        if isinstance(content, str):
            path = os.path.join(output_dir, f"{name.lower()}_{key}.md")
            with open(path, "w", encoding="utf-8") as f:
                f.write(content)
    # # Print first few lines of each markdown version
    # for key, content in result.items():
    #     if isinstance(content, str):
    #         preview = '\n'.join(content.split('\n')[:3])
    #         print(f"\n{key} (first 3 lines):")
    #         print(preview)
    #         print(f"Total length: {len(content)} characters")
def test_basic_markdown_conversion():
"""Test basic markdown conversion with links."""
with open(__location__ + "/data/wikipedia.html", "r") as f:
cleaned_html = f.read()
generator = DefaultMarkdownGenerator()
start_time = time.perf_counter()
result = generator.generate_markdown(
cleaned_html=cleaned_html,
base_url="https://en.wikipedia.org"
)
execution_time = time.perf_counter() - start_time
print_test_result("Basic Markdown Conversion", {
'raw': result.raw_markdown,
'with_citations': result.markdown_with_citations,
'references': result.references_markdown
}, execution_time)
# Basic assertions
assert result.raw_markdown, "Raw markdown should not be empty"
assert result.markdown_with_citations, "Markdown with citations should not be empty"
assert result.references_markdown, "References should not be empty"
assert "⟨" in result.markdown_with_citations, "Citations should use ⟨⟩ brackets"
assert "## References" in result.references_markdown, "Should contain references section"
def test_relative_links():
"""Test handling of relative links with base URL."""
markdown = """
Here's a [relative link](/wiki/Apple) and an [absolute link](https://example.com).
Also an [image](/images/test.png) and another [page](/wiki/Banana).
"""
generator = DefaultMarkdownGenerator()
result = generator.generate_markdown(
cleaned_html=markdown,
base_url="https://en.wikipedia.org"
)
assert "https://en.wikipedia.org/wiki/Apple" in result.references_markdown
assert "https://example.com" in result.references_markdown
assert "https://en.wikipedia.org/images/test.png" in result.references_markdown
def test_duplicate_links():
"""Test handling of duplicate links."""
markdown = """
Here's a [link](/test) and another [link](/test) and a [different link](/other).
"""
generator = DefaultMarkdownGenerator()
result = generator.generate_markdown(
cleaned_html=markdown,
base_url="https://example.com"
)
# Count citations in markdown
citations = result.markdown_with_citations.count("⟨1⟩")
assert citations == 2, "Same link should use same citation number"
def test_link_descriptions():
"""Test handling of link titles and descriptions."""
markdown = """
Here's a [link with title](/test "Test Title") and a [link with description](/other) to test.
"""
generator = DefaultMarkdownGenerator()
result = generator.generate_markdown(
cleaned_html=markdown,
base_url="https://example.com"
)
assert "Test Title" in result.references_markdown, "Link title should be in references"
assert "link with description" in result.references_markdown, "Link text should be in references"
def test_performance_large_document():
"""Test performance with large document."""
with open(__location__ + "/data/wikipedia.md", "r") as f:
markdown = f.read()
# Test with multiple iterations
iterations = 5
times = []
generator = DefaultMarkdownGenerator()
for i in range(iterations):
start_time = time.perf_counter()
result = generator.generate_markdown(
cleaned_html=markdown,
base_url="https://en.wikipedia.org"
)
end_time = time.perf_counter()
times.append(end_time - start_time)
avg_time = sum(times) / len(times)
print(f"\n{'='*20} Performance Test {'='*20}")
print(f"Average execution time over {iterations} iterations: {avg_time:.4f} seconds")
print(f"Min time: {min(times):.4f} seconds")
print(f"Max time: {max(times):.4f} seconds")
def test_image_links():
"""Test handling of image links."""
markdown = """
Here's an ![image](/image.png "Image Title") and another ![image](/other.jpg).
And a regular [link](/page).
"""
generator = DefaultMarkdownGenerator()
result = generator.generate_markdown(
cleaned_html=markdown,
base_url="https://example.com"
)
assert "![" in result.markdown_with_citations, "Image markdown syntax should be preserved"
assert "Image Title" in result.references_markdown, "Image title should be in references"
if __name__ == "__main__":
print("Running markdown generation strategy tests...")
test_basic_markdown_conversion()
test_relative_links()
test_duplicate_links()
test_link_descriptions()
test_performance_large_document()
test_image_links()
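
The tests above check three behaviors: relative links are resolved against `base_url`, a repeated URL reuses its citation number, and link titles appear in the references. The sketch below shows one way to get those behaviors. It is not the `DefaultMarkdownGenerator` implementation: the function `to_citations`, the regular expression, and the exact output layout are all assumptions made for illustration, and URLs containing parentheses or spaces are not handled.

```python
import re
from typing import Dict, List, Tuple
from urllib.parse import urljoin

# Matches [text](url) and [text](url "title"); a leading "!" marks an image.
LINK_RE = re.compile(r'(!?)\[([^\]]*)\]\(([^)\s]+)(?:\s+"([^"]*)")?\)')


def to_citations(markdown: str, base_url: str) -> Tuple[str, str]:
    """Replace inline links with ⟨n⟩ markers and build a references section.

    Hypothetical helper: the output format is an assumption, not crawl4ai's.
    """
    numbers: Dict[str, int] = {}  # absolute URL -> citation number
    refs: List[str] = []

    def replace(match: "re.Match[str]") -> str:
        bang, text, href, title = match.groups()
        url = urljoin(base_url, href)  # resolves relative links, keeps absolute ones
        if url not in numbers:
            numbers[url] = len(numbers) + 1
            refs.append(f"⟨{numbers[url]}⟩ {url}: {title or text}")
        marker = f"{text}⟨{numbers[url]}⟩"
        # Keep image syntax recognizable by retaining the "![" prefix.
        return f"![{marker}]" if bang else marker

    body = LINK_RE.sub(replace, markdown)
    references = "## References\n\n" + "\n".join(refs) + "\n" if refs else ""
    return body, references


body, references = to_citations(
    "Here's a [link](/test) and another [link](/test) and a [different link](/other).",
    "https://example.com",
)
print(body)
print(references)
```

Keying the number table on the resolved absolute URL is what makes `/test` and `https://example.com/test` share one citation.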

tests/docker_example.py Normal file

@@ -0,0 +1,332 @@
import requests
import json
import time
import sys
import base64
import os
from typing import Any, Dict, Optional


class Crawl4AiTester:
    def __init__(self, base_url: str = "http://localhost:11235", api_token: Optional[str] = None):
        self.base_url = base_url
        self.api_token = api_token or os.getenv('CRAWL4AI_API_TOKEN')  # Check environment variable as fallback
        self.headers = {'Authorization': f'Bearer {self.api_token}'} if self.api_token else {}

    def submit_and_wait(self, request_data: Dict[str, Any], timeout: int = 300) -> Dict[str, Any]:
        # Submit crawl job
        response = requests.post(f"{self.base_url}/crawl", json=request_data, headers=self.headers)
        if response.status_code == 403:
            raise Exception("API token is invalid or missing")
        task_id = response.json()["task_id"]
        print(f"Task ID: {task_id}")
        # Poll for result
        start_time = time.time()
        while True:
            if time.time() - start_time > timeout:
                raise TimeoutError(f"Task {task_id} did not complete within {timeout} seconds")
            result = requests.get(f"{self.base_url}/task/{task_id}", headers=self.headers)
            status = result.json()
            if status["status"] == "failed":
                print("Task failed:", status.get("error"))
                raise Exception(f"Task failed: {status.get('error')}")
            if status["status"] == "completed":
                return status
            time.sleep(2)

    def submit_sync(self, request_data: Dict[str, Any]) -> Dict[str, Any]:
        # Blocking variant: the server replies only once the crawl has finished
        response = requests.post(f"{self.base_url}/crawl_sync", json=request_data, headers=self.headers, timeout=60)
        if response.status_code == 408:
            raise TimeoutError("Task did not complete within server timeout")
        response.raise_for_status()
        return response.json()
def test_docker_deployment(version="basic"):
tester = Crawl4AiTester(
# base_url="http://localhost:11235" ,
base_url="https://crawl4ai-sby74.ondigitalocean.app",
api_token="test"
)
print(f"Testing Crawl4AI Docker {version} version")
# Health check with timeout and retry
max_retries = 5
for i in range(max_retries):
try:
health = requests.get(f"{tester.base_url}/health", timeout=10)
print("Health check:", health.json())
break
except requests.exceptions.RequestException as e:
if i == max_retries - 1:
print(f"Failed to connect after {max_retries} attempts")
sys.exit(1)
print(f"Waiting for service to start (attempt {i+1}/{max_retries})...")
time.sleep(5)
# Test cases based on version
test_basic_crawl(tester)
test_basic_crawl(tester)
test_basic_crawl_sync(tester)
# if version in ["full", "transformer"]:
# test_cosine_extraction(tester)
# test_js_execution(tester)
# test_css_selector(tester)
# test_structured_extraction(tester)
# test_llm_extraction(tester)
# test_llm_with_ollama(tester)
# test_screenshot(tester)
def test_basic_crawl(tester: Crawl4AiTester):
print("\n=== Testing Basic Crawl ===")
request = {
"urls": "https://www.nbcnews.com/business",
"priority": 10,
"session_id": "test"
}
result = tester.submit_and_wait(request)
print(f"Basic crawl result length: {len(result['result']['markdown'])}")
assert result["result"]["success"]
assert len(result["result"]["markdown"]) > 0
def test_basic_crawl_sync(tester: Crawl4AiTester):
print("\n=== Testing Basic Crawl (Sync) ===")
request = {
"urls": "https://www.nbcnews.com/business",
"priority": 10,
"session_id": "test"
}
result = tester.submit_sync(request)
print(f"Basic crawl result length: {len(result['result']['markdown'])}")
assert result['status'] == 'completed'
assert result['result']['success']
assert len(result['result']['markdown']) > 0
def test_js_execution(tester: Crawl4AiTester):
print("\n=== Testing JS Execution ===")
request = {
"urls": "https://www.nbcnews.com/business",
"priority": 8,
"js_code": [
"const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More')); loadMoreButton && loadMoreButton.click();"
],
"wait_for": "article.tease-card:nth-child(10)",
"crawler_params": {
"headless": True
}
}
result = tester.submit_and_wait(request)
print(f"JS execution result length: {len(result['result']['markdown'])}")
assert result["result"]["success"]
def test_css_selector(tester: Crawl4AiTester):
print("\n=== Testing CSS Selector ===")
request = {
"urls": "https://www.nbcnews.com/business",
"priority": 7,
"css_selector": ".wide-tease-item__description",
"crawler_params": {
"headless": True
},
"extra": {"word_count_threshold": 10}
}
result = tester.submit_and_wait(request)
print(f"CSS selector result length: {len(result['result']['markdown'])}")
assert result["result"]["success"]
def test_structured_extraction(tester: Crawl4AiTester):
print("\n=== Testing Structured Extraction ===")
schema = {
"name": "Coinbase Crypto Prices",
"baseSelector": ".cds-tableRow-t45thuk",
"fields": [
{
"name": "crypto",
"selector": "td:nth-child(1) h2",
"type": "text",
},
{
"name": "symbol",
"selector": "td:nth-child(1) p",
"type": "text",
},
{
"name": "price",
"selector": "td:nth-child(2)",
"type": "text",
}
],
}
request = {
"urls": "https://www.coinbase.com/explore",
"priority": 9,
"extraction_config": {
"type": "json_css",
"params": {
"schema": schema
}
}
}
result = tester.submit_and_wait(request)
extracted = json.loads(result["result"]["extracted_content"])
print(f"Extracted {len(extracted)} items")
print("Sample item:", json.dumps(extracted[0], indent=2))
assert result["result"]["success"]
assert len(extracted) > 0
def test_llm_extraction(tester: Crawl4AiTester):
print("\n=== Testing LLM Extraction ===")
schema = {
"type": "object",
"properties": {
"model_name": {
"type": "string",
"description": "Name of the OpenAI model."
},
"input_fee": {
"type": "string",
"description": "Fee for input token for the OpenAI model."
},
"output_fee": {
"type": "string",
"description": "Fee for output token for the OpenAI model."
}
},
"required": ["model_name", "input_fee", "output_fee"]
}
request = {
"urls": "https://openai.com/api/pricing",
"priority": 8,
"extraction_config": {
"type": "llm",
"params": {
"provider": "openai/gpt-4o-mini",
"api_token": os.getenv("OPENAI_API_KEY"),
"schema": schema,
"extraction_type": "schema",
"instruction": """From the crawled content, extract all mentioned model names along with their fees for input and output tokens."""
}
},
"crawler_params": {"word_count_threshold": 1}
}
try:
result = tester.submit_and_wait(request)
extracted = json.loads(result["result"]["extracted_content"])
print(f"Extracted {len(extracted)} model pricing entries")
print("Sample entry:", json.dumps(extracted[0], indent=2))
assert result["result"]["success"]
except Exception as e:
print(f"LLM extraction test failed (might be due to missing API key): {str(e)}")
def test_llm_with_ollama(tester: Crawl4AiTester):
print("\n=== Testing LLM with Ollama ===")
schema = {
"type": "object",
"properties": {
"article_title": {
"type": "string",
"description": "The main title of the news article"
},
"summary": {
"type": "string",
"description": "A brief summary of the article content"
},
"main_topics": {
"type": "array",
"items": {"type": "string"},
"description": "Main topics or themes discussed in the article"
}
}
}
request = {
"urls": "https://www.nbcnews.com/business",
"priority": 8,
"extraction_config": {
"type": "llm",
"params": {
"provider": "ollama/llama2",
"schema": schema,
"extraction_type": "schema",
"instruction": "Extract the main article information including title, summary, and main topics."
}
},
"extra": {"word_count_threshold": 1},
"crawler_params": {"verbose": True}
}
try:
result = tester.submit_and_wait(request)
extracted = json.loads(result["result"]["extracted_content"])
print("Extracted content:", json.dumps(extracted, indent=2))
assert result["result"]["success"]
except Exception as e:
print(f"Ollama extraction test failed: {str(e)}")
def test_cosine_extraction(tester: Crawl4AiTester):
print("\n=== Testing Cosine Extraction ===")
request = {
"urls": "https://www.nbcnews.com/business",
"priority": 8,
"extraction_config": {
"type": "cosine",
"params": {
"semantic_filter": "business finance economy",
"word_count_threshold": 10,
"max_dist": 0.2,
"top_k": 3
}
}
}
try:
result = tester.submit_and_wait(request)
extracted = json.loads(result["result"]["extracted_content"])
print(f"Extracted {len(extracted)} text clusters")
print("First cluster tags:", extracted[0]["tags"])
assert result["result"]["success"]
except Exception as e:
print(f"Cosine extraction test failed: {str(e)}")
def test_screenshot(tester: Crawl4AiTester):
print("\n=== Testing Screenshot ===")
request = {
"urls": "https://www.nbcnews.com/business",
"priority": 5,
"screenshot": True,
"crawler_params": {
"headless": True
}
}
result = tester.submit_and_wait(request)
print("Screenshot captured:", bool(result["result"]["screenshot"]))
if result["result"]["screenshot"]:
# Save screenshot
screenshot_data = base64.b64decode(result["result"]["screenshot"])
with open("test_screenshot.jpg", "wb") as f:
f.write(screenshot_data)
print("Screenshot saved as test_screenshot.jpg")
assert result["result"]["success"]
if __name__ == "__main__":
version = sys.argv[1] if len(sys.argv) > 1 else "basic"
# version = "full"
test_docker_deployment(version)
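
The polling loop in `submit_and_wait` is hard to exercise without a running server. Below is a minimal sketch of the same submit-then-poll pattern with the HTTP call, the sleep, and the clock injected, so it can be tested offline. The helper `poll_until_done` and its parameters are hypothetical; only the `status` values `completed` and `failed` and the `error` key are taken from the code above.

```python
import time
from typing import Any, Callable, Dict


def poll_until_done(
    fetch_status: Callable[[], Dict[str, Any]],
    timeout: float = 300.0,
    interval: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Dict[str, Any]:
    """Call fetch_status until the task completes, fails, or the timeout passes.

    Hypothetical helper for illustration; fetch_status stands in for the
    GET request to the task endpoint.
    """
    deadline = clock() + timeout
    while True:
        status = fetch_status()
        if status["status"] == "completed":
            return status
        if status["status"] == "failed":
            raise RuntimeError(f"Task failed: {status.get('error')}")
        if clock() >= deadline:
            raise TimeoutError(f"Task did not complete within {timeout} seconds")
        sleep(interval)


# Offline usage with canned responses instead of a live service.
responses = iter([
    {"status": "pending"},
    {"status": "processing"},
    {"status": "completed", "result": {"success": True}},
])
final = poll_until_done(lambda: next(responses), timeout=5, interval=0, sleep=lambda s: None)
print(final["result"]["success"])
```

Using a monotonic clock for the deadline avoids surprises if the system time is adjusted while a long crawl is in flight.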