Commit Graph

99 Commits

Author SHA1 Message Date
Soham Kukreti
7a133e22cc feat: make LLM backoff configurable end-to-end
- extend LLMConfig with backoff delay/attempt/factor fields and thread them
  through LLMExtractionStrategy, LLMContentFilter, table extraction, and
  Docker API handlers
- expose the backoff parameter knobs on perform_completion_with_backoff/aperform_completion_with_backoff
  and document them in the md_v2 guides
2025-11-28 18:50:04 +05:30
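The delay/attempt/factor knobs named in 7a133e22cc can be pictured with a minimal retry loop. This is only a sketch: `call_with_backoff` and its parameter names are hypothetical, not the signature of perform_completion_with_backoff, and the injectable `sleep` argument exists purely to make the example testable. It also re-raises on the final attempt, the behaviour 487839640f describes.

```python
import time


def call_with_backoff(fn, base_delay=2.0, max_attempts=3, factor=2.0, sleep=time.sleep):
    """Hypothetical helper: retry fn() with exponential backoff."""
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                # Last attempt: surface the error instead of swallowing it.
                raise
            sleep(delay)
            delay *= factor  # grow the wait before the next attempt
```

With base_delay=2.0 and factor=2.0, the waits between attempts are 2 s, then 4 s, and so on until max_attempts is reached.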
ntohidi
a30548a98f This commit resolves issue #1055 where LLM extraction was blocking async
execution, causing URLs to be processed sequentially instead of in parallel.

  Changes:
  - Added aperform_completion_with_backoff() using litellm.acompletion for async LLM calls
  - Implemented arun() method in ExtractionStrategy base class with thread pool fallback
  - Created async arun() and aextract() methods in LLMExtractionStrategy using asyncio.gather
  - Updated AsyncWebCrawler.arun() to detect and use arun() when available
  - Added comprehensive test suite to verify parallel execution

  Impact:
  - LLM extraction now runs truly in parallel across multiple URLs
  - Significant performance improvement for multi-URL crawls with LLM strategies
  - Backward compatible - existing extraction strategies continue to work
  - No breaking changes to public API

  Technical details:
  - Uses litellm.acompletion for non-blocking LLM calls
  - Leverages asyncio.gather for concurrent chunk processing
  - Maintains backward compatibility via asyncio.to_thread fallback
  - Works seamlessly with MemoryAdaptiveDispatcher and other dispatchers
2025-11-06 11:22:45 +01:00
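The pattern a30548a98f describes (an async arun() that falls back to a worker thread, with chunks gathered concurrently) might look roughly like the following. The classes here are simplified stand-ins written for illustration, not the real ExtractionStrategy or LLMExtractionStrategy.

```python
import asyncio


class BaseStrategy:
    """Stand-in for a strategy base class that only has a blocking run()."""

    def run(self, chunk: str) -> str:
        raise NotImplementedError

    async def arun(self, chunk: str) -> str:
        # Fallback: push the blocking call onto a worker thread so the
        # event loop stays free for other URLs.
        return await asyncio.to_thread(self.run, chunk)


class UpperStrategy(BaseStrategy):
    """Toy subclass standing in for a real extraction strategy."""

    def run(self, chunk: str) -> str:
        return chunk.upper()


async def extract_all(strategy: BaseStrategy, chunks: list) -> list:
    # asyncio.gather schedules every chunk concurrently and keeps input order.
    return await asyncio.gather(*(strategy.arun(c) for c in chunks))
```

A strategy with a natively async implementation would override arun() directly; the thread fallback only keeps older synchronous strategies from blocking the loop.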
ntohidi
a3f057e19f feat: Add hooks utility for function-based hooks with Docker client integration. ref #1377
Add hooks_to_string() utility function that converts Python function objects
   to string representations for the Docker API, enabling developers to write hooks
   as regular Python functions instead of strings.

   Core Changes:
   - New hooks_to_string() utility in crawl4ai/utils.py using inspect.getsource()
   - Docker client now accepts both function objects and strings for hooks
   - Automatic detection and conversion in Crawl4aiDockerClient._prepare_request()
   - New hooks and hooks_timeout parameters in client.crawl() method

   Documentation:
   - Docker client examples with function-based hooks (docs/examples/docker_client_hooks_example.py)
   - Updated main Docker deployment guide with comprehensive hooks section
   - Added unit tests for hooks utility (tests/docker/test_hooks_utility.py)
2025-10-13 12:34:08 +08:00
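A rough idea of how function objects can be turned into strings with inspect.getsource(), as a3f057e19f describes. `hooks_to_string_sketch` is a hypothetical simplification, not the code in crawl4ai/utils.py, and the hook name in the usage note is illustrative only.

```python
import inspect
import textwrap


def hooks_to_string_sketch(hooks: dict) -> dict:
    """Hypothetical sketch: accept hook functions or source strings."""
    converted = {}
    for name, hook in hooks.items():
        if isinstance(hook, str):
            converted[name] = hook  # already source text, pass through
        elif callable(hook):
            # getsource needs the function to live in a real source file;
            # dedent removes indentation left over from nested definitions.
            converted[name] = textwrap.dedent(inspect.getsource(hook))
        else:
            raise TypeError(f"hook {name!r} must be a function or a string")
    return converted
```

Calling `hooks_to_string_sketch({"before_goto": my_hook})` would return a dict mapping the hook name to the function's source text, which can then be serialized into a request body.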
ntohidi
487839640f fix: raise error on last attempt failure in perform_completion_with_backoff. ref #989 2025-09-02 16:49:01 +08:00
Nasrin
5e7fcb17e1 Merge pull request #1448 from unclecode/fix/https-reditrect
feat: add preserve_https_for_internal_links flag to maintain HTTPS during crawling
2025-09-01 16:11:25 +08:00
ntohidi
f566c5a376 feat: add preserve_https_for_internal_links flag to maintain HTTPS during crawling. Ref #1410
Added a new `preserve_https_for_internal_links` configuration flag that preserves the original HTTPS scheme for same-domain links even when the server redirects to HTTP.
2025-08-28 17:38:40 +08:00
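One way to picture the behaviour behind `preserve_https_for_internal_links`: if the crawl started on HTTPS and a same-host link comes back as HTTP, upgrade the scheme. The helper below is a hypothetical sketch that compares hostnames exactly; the actual same-domain check in the library may differ.

```python
from urllib.parse import urlsplit, urlunsplit


def preserve_https_sketch(link: str, original_url: str) -> str:
    """Hypothetical helper: keep https for links on the original host."""
    orig = urlsplit(original_url)
    parts = urlsplit(link)
    same_host = parts.hostname is not None and parts.hostname == orig.hostname
    if orig.scheme == "https" and parts.scheme == "http" and same_host:
        return urlunsplit(parts._replace(scheme="https"))
    return link  # external links and non-http schemes are left untouched
```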
ntohidi
40ab287c90 fix(utils): Improve URL normalization by avoiding quote/unquote to preserve '+' signs. ref #1332 2025-08-22 12:05:21 +08:00
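To illustrate why 40ab287c90 avoids quote/unquote: urllib.parse.quote percent-encodes '+', so a normalizer that re-encodes the URL rewrites it. The sketch below only lowercases the scheme and host and drops the fragment, leaving path and query byte-for-byte. `normalize_url_sketch` is a hypothetical name, not the function in crawl4ai/utils.py.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def normalize_url_sketch(url: str) -> str:
    """Hypothetical normalizer that never re-encodes path or query."""
    parts = urlsplit(url)
    # Note: lowercasing netloc also lowercases any userinfo; acceptable here.
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       parts.path, parts.query, ""))


# For contrast: re-encoding turns each '+' into %2B.
reencoded = quote("c++")
```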
UncleCode
2ab0bf27c2 refactor(utils): move memory utilities to utils and update imports 2025-08-17 19:14:55 +08:00
ntohidi
7a6ad547f0 Squashed commit of the following:
commit 2def6524cdacb69c72760bf55a41089257c0bb07
Author: ntohidi <nasrin@kidocode.com>
Date:   Mon Aug 4 18:59:10 2025 +0800

    refactor: consolidate WebScrapingStrategy to use LXML implementation only

    BREAKING CHANGE: None - full backward compatibility maintained

    This commit simplifies the content scraping architecture by removing the
    redundant BeautifulSoup-based WebScrapingStrategy implementation and making
    it an alias for LXMLWebScrapingStrategy.

    Changes:
    - Remove ~1000 lines of BeautifulSoup-based WebScrapingStrategy code
    - Make WebScrapingStrategy an alias for LXMLWebScrapingStrategy
    - Update LXMLWebScrapingStrategy to inherit directly from ContentScrapingStrategy
    - Add required methods (scrap, ascrap, process_element, _log) to LXMLWebScrapingStrategy
    - Maintain 100% backward compatibility - existing code continues to work

    Code changes:
    - crawl4ai/content_scraping_strategy.py: Remove WebScrapingStrategy class, add alias
    - crawl4ai/async_configs.py: Remove WebScrapingStrategy from imports
    - crawl4ai/__init__.py: Update imports to show alias relationship
    - crawl4ai/types.py: Update type definitions
    - crawl4ai/legacy/web_crawler.py: Update import to use alias
    - tests/async/test_content_scraper_strategy.py: Update to use LXMLWebScrapingStrategy
    - docs/examples/scraping_strategies_performance.py: Update to use single strategy

    Documentation updates:
    - docs/md_v2/core/content-selection.md: Update scraping modes section
    - docs/md_v2/migration/webscraping-strategy-migration.md: Add migration guide
    - CHANGELOG.md: Document the refactoring under [Unreleased]

    Benefits:
    - 10-20x faster HTML parsing for large documents
    - Reduced memory usage and simplified codebase
    - Consistent parsing behavior
    - No migration required for existing users

    All existing code using WebScrapingStrategy continues to work without
    modification, while benefiting from LXML's superior performance.
2025-08-04 19:02:01 +08:00
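The alias approach in the squashed commit above boils down to binding the old name to the new class. The class bodies below are toy stand-ins for illustration; only the alias line reflects what the commit message states.

```python
class ContentScrapingStrategy:
    """Stand-in base class."""

    def scrap(self, url: str, html: str) -> dict:
        raise NotImplementedError


class LXMLWebScrapingStrategy(ContentScrapingStrategy):
    """Stand-in implementation; the real one parses HTML with lxml."""

    def scrap(self, url: str, html: str) -> dict:
        return {"url": url, "length": len(html)}


# Old name kept as an alias, so existing imports and isinstance checks still work.
WebScrapingStrategy = LXMLWebScrapingStrategy
```

Because the alias is the same class object, code that instantiates WebScrapingStrategy gets the LXML-based behaviour without modification.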
UncleCode
9546773a07 fix: Move sentence-transformers to optional dependencies
- Moved sentence-transformers from core to optional dependencies in pyproject.toml
- Removed sentence-transformers from requirements.txt
- Added proper ImportError handling with helpful installation message
- This prevents ~2.5GB of NVIDIA CUDA libraries from being installed by default
- Users who need embedding features can install with: pip install 'crawl4ai[transformer]'
2025-07-24 21:24:40 +08:00
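The "proper ImportError handling with helpful installation message" in 9546773a07 could take a shape like this. `load_optional` is a hypothetical helper; the extra name follows the pip command quoted in the commit message.

```python
import importlib


def load_optional(module_name: str, extra: str):
    """Hypothetical helper: import an optional dependency or explain how to get it."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"{module_name} is required for this feature. "
            f"Install it with: pip install 'crawl4ai[{extra}]'"
        ) from exc
```

Deferring the import to the point of use is what keeps the heavy dependency out of a default install.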
ntohidi
36429a63de fix: Improve comments for article metadata extraction in extract_metadata functions. ref #1105 2025-07-08 12:54:33 +02:00
ntohidi
0f210f6e02 Merge branch '2025-MAY-2' into next-MAY 2025-07-08 11:46:13 +02:00
UncleCode
1a73fb60db feat(crawl4ai): Implement adaptive crawling feature
This commit introduces the adaptive crawling feature to the crawl4ai project. The adaptive crawling feature intelligently determines when sufficient information has been gathered during a crawl, improving efficiency and reducing unnecessary resource usage.

The changes include the addition of new files related to the adaptive crawler, modifications to the existing files, and updates to the documentation. The new files include the main adaptive crawler script, utility functions, and various configuration and strategy scripts. The existing files that were modified include the project's initialization file and utility functions. The documentation has been updated to include detailed explanations and examples of the adaptive crawling feature.

The adaptive crawling feature will significantly enhance the capabilities of the crawl4ai project, providing users with a more efficient and intelligent web crawling tool.

Significant modifications:
- Added adaptive_crawler.py and related scripts
- Modified __init__.py and utils.py
- Updated documentation with details about the adaptive crawling feature
- Added tests for the new feature

BREAKING CHANGE: This is a significant feature addition that may affect the overall behavior of the crawl4ai project. Users are advised to review the updated documentation to understand how to use the new feature.

Refs: #123, #456
2025-07-04 15:16:53 +08:00
UncleCode
5c9c305dbf feat: Add advanced link head extraction with three-layer scoring system (#1)
Squashed commit from feature/link-extractor branch implementing comprehensive link analysis:

- Extract HTML head content from discovered links with parallel processing
- Three-layer scoring: Intrinsic (URL quality), Contextual (BM25), and Total scores
- New LinkExtractionConfig class for type-safe configuration
- Pattern-based filtering for internal/external links
- Comprehensive documentation and examples
2025-06-27 20:06:04 +08:00
ntohidi
28125c1980 Merge branch 'next' into 2025-MAY-2 2025-06-02 20:26:40 +02:00
ntohidi
773ed7b281 Merge branch '2025-APR-1' into 2025-MAY-2 2025-06-02 20:25:58 +02:00
UncleCode
7d0b447e1c Update setup script to clarify virtual display setup message 2025-05-25 16:55:18 +08:00
UncleCode
33b0e222ca Add Colab utilities and rename setup function for clarity 2025-05-25 16:50:56 +08:00
UncleCode
1fc45ffac8 Fix temperature typo and enhance LinkedIn extraction with Colab support
- Fixed widespread typo: `temprature` → `temperature` across LLMConfig and related files
- Enhanced CSS/XPath selector guidance for more reliable LinkedIn data extraction
- Added Google Colab display server support for running Crawl4AI in notebook environments
- Improved browser debugging with verbose startup args logging
- Updated LinkedIn schemas and HTML snippets for better parsing accuracy

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-05-25 16:47:12 +08:00
Ahmed-Tawfik94
b4fc60a555 #1103 fix(url): enhance URL normalization to handle invalid schemes and trailing slashes 2025-05-19 13:51:16 +08:00
Ahmed-Tawfik94
137ac014fb #1105 fix(metadata): optimize article metadata extraction using XPath for improved performance 2025-05-19 13:48:02 +08:00
Ahmed-Tawfik94
faa98eefbc #1105 got fixed (metadata now matches with meta property article:*) 2025-05-19 11:35:13 +08:00
UncleCode
754ba731fa Fix chunk splitting utilities (#1122)
* Fix merge_chunks splitter usage and remove incorrect return

* 📝 Add docstrings to `codex/find-and-fix-a-bug` (#1123)

Docstrings generation was requested by @unclecode.

* https://github.com/unclecode/crawl4ai/pull/1122#issuecomment-2887985865

The following files were modified:

* `crawl4ai/utils.py`

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
2025-05-17 15:06:53 +08:00
Aravind Karnam
f6e25e2a6b fix: check_robots_txt to support wildcard rules ref: #699 2025-05-07 17:53:30 +05:30
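Wildcard rules in robots.txt conventionally use `*` for any run of characters and a trailing `$` to anchor the end of the path. A simplified matcher, assuming those two conventions only and ignoring Allow/Disallow precedence, might look like this. Both function names are hypothetical and not part of the check_robots_txt implementation.

```python
import re


def robots_pattern_to_regex(pattern: str):
    """Hypothetical helper: translate a robots.txt path pattern to a regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape literal pieces and join them with '.*' where '*' appeared.
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))


def is_disallowed(path: str, disallow_patterns: list) -> bool:
    # An empty Disallow value means "allow everything", so skip it.
    return any(robots_pattern_to_regex(p).match(path)
               for p in disallow_patterns if p)
```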
UncleCode
9b5ccac76e feat(extraction): add RegexExtractionStrategy for pattern-based extraction
Add new RegexExtractionStrategy for fast, zero-LLM extraction of common data types:
- Built-in patterns for emails, URLs, phones, dates, and more
- Support for custom regex patterns
- LLM-assisted pattern generation utility
- Optimized HTML preprocessing with fit_html field
- Enhanced network response body capture

Breaking changes: None
2025-05-02 21:15:24 +08:00
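Pattern-based, zero-LLM extraction of the kind 9b5ccac76e introduces can be sketched with a dictionary of compiled expressions. The patterns and the `extract_with_patterns` function below are deliberately simple illustrations, not the built-in patterns or API of RegexExtractionStrategy.

```python
import re

# Illustrative patterns only; real-world email and URL matching is messier.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "url": re.compile(r"https?://[^\s<>\"']+"),
}


def extract_with_patterns(text: str, patterns: dict = PATTERNS) -> list:
    """Hypothetical helper: return one record per regex match."""
    return [
        {"label": label, "value": match.group(0), "span": match.span()}
        for label, regex in patterns.items()
        for match in regex.finditer(text)
    ]
```

Custom patterns fit the same shape: pass a dict of label to compiled regex as the second argument.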
UncleCode
0e5d672763 Merge branch 'pr-971' into merge-pr971 2025-05-01 18:57:28 +08:00
wakaka6
b2f3cb0dfa WIP: migrate logger to rich 2025-04-11 00:44:43 +08:00
Aravind Karnam
7be5427283 Merge branch 'next' into 2025-MAR-ALPHA-1 2025-03-27 12:29:32 +05:30
UncleCode
4a20d7f7c2 feat(cli): add quick JSON extraction and global config management
Adds new features to improve user experience and configuration:
- Quick JSON extraction with -j flag for direct LLM-based structured data extraction
- Global configuration management with 'crwl config' commands
- Enhanced LLM extraction with better JSON handling and error management
- New user settings for default behaviors (LLM provider, browser settings, etc.)

Breaking changes: None
2025-03-25 20:30:25 +08:00
Aravind Karnam
471d110c5e fix: url normalisation ref: https://github.com/unclecode/crawl4ai/issues/841 2025-03-21 16:48:07 +05:30
Aravind Karnam
6740e87b4d fix: remove trailing slash when the path is empty. This is causing duplicate crawls 2025-03-21 13:41:31 +05:30
UncleCode
dc36997a08 feat(schema): improve HTML preprocessing for schema generation
Add new preprocess_html_for_schema utility function to better handle HTML cleaning
for schema generation. This replaces the previous optimize_html function in the
GoogleSearchCrawler and includes smarter attribute handling and pattern detection.

Other changes:
- Update default provider to gpt-4o
- Add DEFAULT_PROVIDER_API_KEY constant
- Make LLMConfig creation more flexible with create_llm_config helper
- Add new dependencies: zstandard and msgpack

This change improves schema generation reliability while reducing noise in the
processed HTML.
2025-03-12 22:40:46 +08:00
UncleCode
f78c46446b feat(deep-crawling): improve URL normalization and domain filtering
Enhance URL handling in deep crawling with:
- New URL normalization functions for consistent URL formats
- Improved domain filtering with subdomain support
- Added URLPatternFilter to public API
- Better URL deduplication in BFS strategy

These changes improve crawling accuracy and reduce duplicate visits.
2025-03-06 22:45:57 +08:00
UncleCode
b957ff2ecd refactor(crawler): improve HTML handling and cleanup codebase
- Add HTML attribute preservation in GoogleSearchCrawler
- Fix lxml import references in utils.py
- Remove unused ssl_certificate.json
- Clean up imports and code organization in hub.py
- Update test case formatting and remove unused image search test

BREAKING CHANGE: Removed ssl_certificate.json file which might affect existing certificate validations
2025-02-07 21:56:27 +08:00
UncleCode
a9415aaaf6 refactor(deep-crawling): reorganize deep crawling strategies and add new implementations
Split deep crawling code into separate strategy files for better organization and maintainability. Added new BFF (Best First) and DFS crawling strategies. Introduced base strategy class and common types.

BREAKING CHANGE: Deep crawling implementation has been split into multiple files. Import paths for deep crawling strategies have changed.
2025-02-05 22:50:39 +08:00
UncleCode
33a21d6a7a refactor(docker): improve server architecture and configuration
Complete overhaul of Docker deployment setup with improved architecture:
- Add Redis integration for task management
- Implement rate limiting and security middleware
- Add Prometheus metrics and health checks
- Improve error handling and logging
- Add support for streaming responses
- Implement proper configuration management
- Add platform-specific optimizations for ARM64/AMD64

BREAKING CHANGE: Docker deployment now requires Redis and new config.yml structure
2025-02-02 20:19:51 +08:00
UncleCode
f81712eb91 refactor(core): reorganize project structure and remove legacy code
Major reorganization of the project structure:
- Moved legacy synchronous crawler code to legacy folder
- Removed deprecated CLI and docs manager
- Consolidated version manager into utils.py
- Added CrawlerHub to __init__.py exports
- Fixed type hints in async_webcrawler.py
- Fixed minor bugs in chunking and crawler strategies

BREAKING CHANGE: Removed synchronous WebCrawler, CLI, and docs management functionality. Users should migrate to AsyncWebCrawler.
2025-01-30 19:35:06 +08:00
UncleCode
31938fb922 feat(crawler): enhance JavaScript execution and PDF processing
Add JavaScript execution result handling and improve PDF processing capabilities:
- Add js_execution_result to CrawlResult and AsyncCrawlResponse models
- Implement execution result capture in AsyncPlaywrightCrawlerStrategy
- Add batch processing for PDF pages with configurable batch size
- Enhance JsonElementExtractionStrategy with better schema generation
- Add HTML optimization utilities

BREAKING CHANGE: PDF processing now uses batch processing by default
2025-01-29 21:03:39 +08:00
UncleCode
d09c611d15 feat(robots): add robots.txt compliance support
Add support for checking and respecting robots.txt rules before crawling websites:
- Implement RobotsParser class with SQLite caching
- Add check_robots_txt parameter to CrawlerRunConfig
- Integrate robots.txt checking in AsyncWebCrawler
- Update documentation with robots.txt compliance examples
- Add tests for robot parser functionality

The cache uses WAL mode for better concurrency and has a default TTL of 7 days.
2025-01-21 17:54:13 +08:00
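The caching idea in d09c611d15 (SQLite in WAL mode with a 7-day TTL) can be approximated as below. `RobotsCacheSketch`, its table layout, and the `now` arguments are assumptions made for illustration; they are not taken from the RobotsParser class.

```python
import sqlite3
import time


class RobotsCacheSketch:
    """Hypothetical TTL cache for robots.txt bodies, keyed by domain."""

    def __init__(self, path: str, ttl_seconds: float = 7 * 24 * 3600):
        self.ttl = ttl_seconds
        self.db = sqlite3.connect(path)
        # WAL lets readers proceed while a writer is active (file-backed DBs).
        self.db.execute("PRAGMA journal_mode=WAL")
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS robots "
            "(domain TEXT PRIMARY KEY, rules TEXT, fetched_at REAL)"
        )

    def put(self, domain: str, rules: str, now: float = None) -> None:
        stamp = time.time() if now is None else now
        self.db.execute("INSERT OR REPLACE INTO robots VALUES (?, ?, ?)",
                        (domain, rules, stamp))
        self.db.commit()

    def get(self, domain: str, now: float = None):
        current = time.time() if now is None else now
        row = self.db.execute(
            "SELECT rules, fetched_at FROM robots WHERE domain = ?", (domain,)
        ).fetchone()
        if row is None or current - row[1] > self.ttl:
            return None  # missing or expired: caller should refetch
        return row[0]
```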
UncleCode
2d6b19e1a2 refactor(browser): improve browser path management
Implement more robust browser executable path handling using playwright's built-in browser management. This change:
- Adds async browser path resolution
- Implements path caching in the home folder
- Removes hardcoded browser paths
- Adds httpx dependency
- Removes obsolete test result files

This change makes the browser path resolution more reliable across different platforms and environments.
2025-01-17 22:14:37 +08:00
UncleCode
8ec12d7d68 Apply Ruff Corrections 2025-01-13 19:19:58 +08:00
UncleCode
f3ae5a657c feat(scraping): add LXML-based scraping mode for improved performance
Adds a new ScrapingMode enum to allow switching between BeautifulSoup and LXML parsing.
LXML mode offers 10-20x better performance for large HTML documents.

Key changes:
- Added ScrapingMode enum with BEAUTIFULSOUP and LXML options
- Implemented LXMLWebScrapingStrategy class
- Added LXML-based metadata extraction
- Updated documentation with scraping mode usage and performance considerations
- Added cssselect dependency

BREAKING CHANGE: None
2025-01-12 20:46:23 +08:00
UncleCode
72fbdac467 fix(extraction): JsonCss selector and crawler improvements
- Fix JsonCssExtractionStrategy._get_elements to return all matching elements instead of just one
- Add robust error handling to page_need_scroll with default fallback
- Improve JSON extraction strategies documentation
- Refactor content scraping strategy
- Update version to 0.4.247
2025-01-05 19:26:46 +08:00
UncleCode
fb33a24891 Commit Message:
- Added examples for Amazon product data extraction methods
  - Updated configuration options and enhanced documentation
  - Minor refactoring for improved performance and readability
  - Cleaned up version control settings.
2024-12-29 20:05:18 +08:00
UncleCode
d5ed451299 Enhance crawler capabilities and documentation
- Add llm.txt generator
  - Added SSL certificate extraction in AsyncWebCrawler.
  - Introduced new content filters and chunking strategies for more robust data extraction.
  - Updated documentation.
2024-12-25 21:34:31 +08:00
UncleCode
0982c639ae Enhance AsyncWebCrawler and related configurations
- Introduced new configuration classes: BrowserConfig and CrawlerRunConfig.
  - Refactored AsyncWebCrawler to leverage the new configuration system for cleaner parameter management.
  - Updated AsyncPlaywrightCrawlerStrategy for better flexibility and reduced legacy parameters.
  - Improved error handling with detailed context extraction during exceptions.
  - Enhanced overall maintainability and usability of the web crawler.
2024-12-12 19:35:09 +08:00
UncleCode
5188b7a6a0 Add full-page screenshot and PDF export features
- Introduced a new approach for capturing full-page screenshots by exporting them as PDFs first, enhancing reliability and performance.
  - Added documentation for the feature in `docs/examples/full_page_screenshot_and_pdf_export.md`.
  - Refactored `perform_completion_with_backoff` in `crawl4ai/utils.py` to include necessary extra parameters.
  - Updated `quickstart_async.py` to utilize LLM extraction with refined arguments.
2024-12-10 20:59:31 +08:00
UncleCode
e130fd8db9 Implement new async crawler features and stability updates
- Introduced new async crawl strategy with session management.
  - Added BrowserManager for improved browser management.
  - Enhanced documentation, focusing on storage state and usage examples.
  - Improved error handling and logging for sessions.
  - Added JavaScript snippets for customizing navigator properties.
2024-12-10 17:55:29 +08:00
UncleCode
2d31915f0a Commit Message:
Enhance Async Crawler with storage state handling
  - Updated Async Crawler to support storage state management.
  - Added error handling for URL validation in Async Web Crawler.
  - Modified README logo and improved .gitignore entries.
  - Fixed issues in multiple files for better code robustness.
2024-12-09 20:04:59 +08:00
UncleCode
8c611dcb4b Refactored web scraping components
- Enhanced the web scraping strategy with new methods for optimized media handling.
  - Added new utility functions for better content processing.
  - Refined existing features for improved accuracy and efficiency in scraping tasks.
  - Introduced more robust filtering criteria for media elements.
2024-12-05 22:33:47 +08:00