* Fix: Use correct URL variable for raw HTML extraction (#1116)
- Prevents full HTML content from being passed as URL to extraction strategies
- Added unit tests to verify raw HTML and regular URL processing
* Fix #1181: Preserve whitespace in code blocks during HTML scraping
The remove_empty_elements_fast() method was removing whitespace-only
span elements inside <pre> and <code> tags, causing import statements
like "import torch" to become "importtorch". Now skips elements inside
code blocks where whitespace is significant.
* Refactor Pydantic model configuration to use ConfigDict for arbitrary types
* Fix EmbeddingStrategy: Uncomment response handling for the variations and clean up mock data. ref #1621
* Fix: permission issues with .cache/url_seeder and other runtime cache dirs. ref #1638
* fix: ensure BrowserConfig.to_dict serializes proxy_config
* feat: make LLM backoff configurable end-to-end
- extend LLMConfig with backoff delay/attempt/factor fields and thread them
through LLMExtractionStrategy, LLMContentFilter, table extraction, and
Docker API handlers
- expose the backoff parameter knobs on perform_completion_with_backoff/aperform_completion_with_backoff
and document them in the md_v2 guides
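A sketch of how the new knobs might be wired up; the exact field names (`backoff_delay`, `backoff_max_attempts`, `backoff_factor` below) are assumptions for illustration, not confirmed API:
```python
from crawl4ai import LLMConfig, LLMExtractionStrategy

llm_config = LLMConfig(
    provider="openai/gpt-4o-mini",
    api_token="sk-...",
    # Hypothetical knob names -- the commit adds delay/attempt/factor
    # fields, but the released identifiers may differ.
    backoff_delay=2.0,        # initial retry delay in seconds
    backoff_max_attempts=5,   # retries before giving up
    backoff_factor=2.0,       # exponential multiplier per attempt
)

strategy = LLMExtractionStrategy(
    llm_config=llm_config,
    instruction="Extract the article title and author",
)
```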
* reproduced AttributeError from #1642
* pass timeout parameter to docker client request
* added missing deep crawling objects to init
* generalized query in ContentRelevanceFilter to be a str or list
* import modules from enhanceable deserialization
* parameterized tests
* Fix: capture current page URL to reflect JavaScript navigation and add test for delayed redirects. ref #1268
* refactor: replace PyPDF2 with pypdf across the codebase. ref #1412
* Add browser_context_id and target_id parameters to BrowserConfig
Enable Crawl4AI to connect to pre-created CDP browser contexts, which is
essential for cloud browser services that pre-create isolated contexts.
Changes:
- Add browser_context_id and target_id parameters to BrowserConfig
- Update from_kwargs() and to_dict() methods
- Modify BrowserManager.start() to use existing context when provided
- Add _get_page_by_target_id() helper method
- Update get_page() to handle pre-existing targets
- Add test for browser_context_id functionality
This enables cloud services to:
1. Create isolated CDP contexts before Crawl4AI connects
2. Pass context/target IDs to BrowserConfig
3. Have Crawl4AI reuse existing contexts instead of creating new ones
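A minimal sketch of that cloud flow; the `cdp_url` and IDs are placeholders a browser service would supply, and depending on the release you may also need to enable the CDP connection mode on BrowserConfig:
```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig

browser_config = BrowserConfig(
    cdp_url="ws://browser-service:9222/devtools/browser/<id>",  # placeholder
    browser_context_id="CONTEXT_ID_FROM_SERVICE",  # pre-created CDP context
    target_id="TARGET_ID_FROM_SERVICE",            # optional existing target
)

async def main():
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(url="https://example.com")
        print(result.success)

asyncio.run(main())
```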
* Add cdp_cleanup_on_close flag to prevent memory leaks in cloud/server scenarios
* Fix: add cdp_cleanup_on_close to from_kwargs
* Fix: find context by target_id for concurrent CDP connections
* Fix: use target_id to find correct page in get_page
* Fix: use CDP to find context by browserContextId for concurrent sessions
* Revert context matching attempts - Playwright cannot see CDP-created contexts
* Add create_isolated_context flag for concurrent CDP crawls
When True, forces creation of a new browser context instead of reusing
the default context. Essential for concurrent crawls on the same browser
to prevent navigation conflicts.
* Add context caching to create_isolated_context branch
Uses contexts_by_config cache (same as non-CDP mode) to reuse contexts
for multiple URLs with same config. Still creates new page per crawl
for navigation isolation. Benefits batch/deep crawls.
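A hedged sketch, assuming the flag is exposed on BrowserConfig next to the other CDP options:
```python
from crawl4ai import BrowserConfig

# Assumption: the flag lives on BrowserConfig alongside cdp_url. Each
# crawl then gets its own context, so concurrent navigations on one
# shared browser don't collide; contexts are still cached per config.
config = BrowserConfig(
    cdp_url="ws://browser-service:9222/devtools/browser/<id>",
    create_isolated_context=True,
)
```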
* Add init_scripts support to BrowserConfig for pre-page-load JS injection
This adds the ability to inject JavaScript that runs before any page loads,
useful for stealth evasions (canvas/audio fingerprinting, userAgentData).
- Add init_scripts parameter to BrowserConfig (list of JS strings)
- Apply init_scripts in setup_context() via context.add_init_script()
- Update from_kwargs() and to_dict() for serialization
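A short example, assuming init_scripts takes raw JS strings as described:
```python
from crawl4ai import BrowserConfig

# Applied via context.add_init_script(), so the JS runs before any page
# in the context loads -- useful for fingerprinting evasions.
stealth_js = """
Object.defineProperty(navigator, 'webdriver', { get: () => undefined });
"""

browser_config = BrowserConfig(init_scripts=[stealth_js])
```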
* Fix CDP connection handling: support WS URLs and proper cleanup
Changes to browser_manager.py:
1. _verify_cdp_ready(): Support multiple URL formats
- WebSocket URLs (ws://, wss://): Skip HTTP verification, Playwright handles directly
- HTTP URLs with query params: Properly parse with urlparse to preserve query string
- Fixes issue where naive f"{cdp_url}/json/version" broke WS URLs and query params
2. close(): Proper cleanup when cdp_cleanup_on_close=True
- Close all sessions (pages)
- Close all contexts
- Call browser.close() to disconnect (doesn't terminate browser, just releases connection)
- Wait 1 second for CDP connection to fully release
- Stop Playwright instance to prevent memory leaks
This enables:
- Connecting to specific browsers via WS URL
- Reusing the same browser with multiple sequential connections
- No user wait needed between connections (internal 1s delay handles it)
Added tests/browser/test_cdp_cleanup_reuse.py with comprehensive tests.
* Update gitignore
* Some debugging for caching
* Add _generate_screenshot_from_html for raw: and file:// URLs
Implements the missing method that was being called but never defined.
Now raw: and file:// URLs can generate screenshots by:
1. Loading HTML into a browser page via page.set_content()
2. Taking screenshot using existing take_screenshot() method
3. Cleaning up the page afterward
This enables cached HTML to be rendered with screenshots in crawl4ai-cloud.
* Add PDF and MHTML support for raw: and file:// URLs
- Replace _generate_screenshot_from_html with _generate_media_from_html
- New method handles screenshot, PDF, and MHTML in one browser session
- Update raw: and file:// URL handlers to use new method
- Enables cached HTML to generate all media types
* Add crash recovery for deep crawl strategies
Add optional resume_state and on_state_change parameters to all deep
crawl strategies (BFS, DFS, Best-First) for cloud deployment crash
recovery.
Features:
- resume_state: Pass saved state to resume from checkpoint
- on_state_change: Async callback fired after each URL for real-time
state persistence to external storage (Redis, DB, etc.)
- export_state(): Get last captured state manually
- Zero overhead when features are disabled (None defaults)
State includes visited URLs, pending queue/stack, depths, and
pages_crawled count. All state is JSON-serializable.
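A sketch of the recovery hooks under the parameter names listed above (BFS shown; DFS and Best-First take the same parameters):
```python
import json
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy

async def persist_state(state: dict):
    # Fired after each crawled URL; in production, write to Redis/DB.
    with open("crawl_state.json", "w") as f:
        json.dump(state, f)  # state is JSON-serializable by design

saved_state = None  # after a crash: json.load(open("crawl_state.json"))
strategy = BFSDeepCrawlStrategy(
    max_depth=2,
    resume_state=saved_state,       # resume from checkpoint (or None)
    on_state_change=persist_state,  # real-time persistence callback
)
# strategy.export_state() returns the last captured state on demand
```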
* Fix: HTTP strategy raw: URL parsing truncates at # character
The AsyncHTTPCrawlerStrategy.crawl() method used urlparse() to extract
content from raw: URLs. This caused HTML with CSS color codes like #eee
to be truncated because # is treated as a URL fragment delimiter.
Before: raw:body{background:#eee} -> parsed.path = 'body{background:'
After: raw:body{background:#eee} -> raw_content = 'body{background:#eee}'
Fix: Strip the raw: or raw:// prefix directly instead of using urlparse,
matching how the browser strategy handles it.
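A minimal sketch of the fixed parsing logic (illustrative, not the library's exact code):
```python
def extract_raw_content(url: str) -> str:
    # Strip the prefix directly; urlparse() would cut at '#', breaking
    # HTML that contains CSS color codes like #eee.
    if url.startswith("raw://"):
        return url[len("raw://"):]
    if url.startswith("raw:"):
        return url[len("raw:"):]
    raise ValueError("not a raw: URL")

assert extract_raw_content("raw:body{background:#eee}") == "body{background:#eee}"
```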
* Add base_url parameter to CrawlerRunConfig for raw HTML processing
When processing raw: HTML (e.g., from cache), the URL parameter is meaningless
for markdown link resolution. This adds a base_url parameter that can be set
explicitly to provide proper URL resolution context.
Changes:
- Add base_url parameter to CrawlerRunConfig.__init__
- Add base_url to CrawlerRunConfig.from_kwargs
- Update aprocess_html to use base_url for markdown generation
Usage:
config = CrawlerRunConfig(base_url='https://example.com')
result = await crawler.arun(url='raw:{html}', config=config)
* Add prefetch mode for two-phase deep crawling
- Add `prefetch` parameter to CrawlerRunConfig
- Add `quick_extract_links()` function for fast link extraction
- Add short-circuit in aprocess_html() for prefetch mode
- Add 42 tests (unit, integration, regression)
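A hedged sketch of the two-phase flow, assuming prefetch results still expose the extracted links on `result.links`:
```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    async with AsyncWebCrawler() as crawler:
        # Phase 1: link discovery only -- aprocess_html() short-circuits
        # before the full scraping/markdown pipeline.
        seed = await crawler.arun(
            url="https://example.com",
            config=CrawlerRunConfig(prefetch=True),
        )
        # Phase 2: full crawl of the discovered links.
        for link in seed.links.get("internal", [])[:10]:
            await crawler.arun(url=link["href"])

asyncio.run(main())
```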
* Updates on proxy rotation and proxy configuration
* Add proxy support to HTTP crawler strategy
* Add browser pipeline support for raw:/file:// URLs
- Add process_in_browser parameter to CrawlerRunConfig
- Route raw:/file:// URLs through _crawl_web() when browser operations needed
- Use page.set_content() instead of goto() for local content
- Fix cookie handling for non-HTTP URLs in browser_manager
- Auto-detect browser requirements: js_code, wait_for, screenshot, etc.
- Maintain fast path for raw:/file:// without browser params
Fixes #310
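A short example of the browser pipeline for raw: content; `process_in_browser` and the auto-detection come from the entry above, the HTML is illustrative:
```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

html = "<html><body><h1 id='t'>Hi</h1></body></html>"

# screenshot/js_code/wait_for auto-route raw: content through the
# browser; process_in_browser forces the browser pipeline explicitly.
config = CrawlerRunConfig(process_in_browser=True, screenshot=True)

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=f"raw:{html}", config=config)
        print(result.screenshot is not None)

asyncio.run(main())
```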
* Add smart TTL cache for sitemap URL seeder
- Add cache_ttl_hours and validate_sitemap_lastmod params to SeedingConfig
- New JSON cache format with metadata (version, created_at, lastmod, url_count)
- Cache validation by TTL expiry and sitemap lastmod comparison
- Auto-migration from old .jsonl to new .json format
- Fixes bug where incomplete cache was used indefinitely
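Example usage with the new parameters (values illustrative):
```python
import asyncio
from crawl4ai import AsyncUrlSeeder, SeedingConfig

config = SeedingConfig(
    source="sitemap",
    cache_ttl_hours=24,             # expire the cached sitemap after 24h
    validate_sitemap_lastmod=True,  # also invalidate when lastmod changes
)

async def main():
    async with AsyncUrlSeeder() as seeder:
        urls = await seeder.urls("example.com", config)
        print(len(urls))

asyncio.run(main())
```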
* Update URL seeder docs with smart TTL cache parameters
- Add cache_ttl_hours and validate_sitemap_lastmod to parameter table
- Document smart TTL cache validation with examples
- Add cache-related troubleshooting entries
- Update key features summary
* Add MEMORY.md to gitignore
* Docs: Add multi-sample schema generation section
Add documentation explaining how to pass multiple HTML samples
to generate_schema() for stable selectors that work across pages
with varying DOM structures.
Includes:
- Problem explanation (fragile nth-child selectors)
- Solution with code example
- Key points for multi-sample queries
- Comparison table of fragile vs stable selectors
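A sketch of the documented pattern, assuming generate_schema() accepts a list of HTML samples as the new docs describe:
```python
from crawl4ai import JsonCssExtractionStrategy, LLMConfig

# Several sample pages push the LLM toward class/attribute selectors
# that survive DOM variation, instead of brittle nth-child paths.
samples = [open(p).read() for p in ("page1.html", "page2.html", "page3.html")]

schema = JsonCssExtractionStrategy.generate_schema(
    html=samples,  # a list of samples, per the docs section above
    query="Extract product name and price",
    llm_config=LLMConfig(provider="openai/gpt-4o-mini", api_token="sk-..."),
)
```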
* Fix critical RCE and LFI vulnerabilities in Docker API deployment
Security fixes for vulnerabilities reported by ProjectDiscovery:
1. Remote Code Execution via Hooks (CVE pending)
- Remove __import__ from allowed_builtins in hook_manager.py
- Prevents arbitrary module imports (os, subprocess, etc.)
- Hooks now disabled by default via CRAWL4AI_HOOKS_ENABLED env var
2. Local File Inclusion via file:// URLs (CVE pending)
- Add URL scheme validation to /execute_js, /screenshot, /pdf, /html
- Block file://, javascript:, data: and other dangerous schemes
- Only allow http://, https://, and raw: (where appropriate)
3. Security hardening
- Add CRAWL4AI_HOOKS_ENABLED=false as default (opt-in for hooks)
- Add security warning comments in config.yml
- Add validate_url_scheme() helper for consistent validation
Testing:
- Add unit tests (test_security_fixes.py) - 16 tests
- Add integration tests (run_security_tests.py) for live server
Affected endpoints:
- POST /crawl (hooks disabled by default)
- POST /crawl/stream (hooks disabled by default)
- POST /execute_js (URL validation added)
- POST /screenshot (URL validation added)
- POST /pdf (URL validation added)
- POST /html (URL validation added)
Breaking changes:
- Hooks require CRAWL4AI_HOOKS_ENABLED=true to function
- file:// URLs no longer work on API endpoints (use library directly)
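The real helper lives in the Docker API code; this is only a sketch of the validation it performs:
```python
from urllib.parse import urlparse

ALLOWED_SCHEMES = {"http", "https"}

def validate_url_scheme(url: str, allow_raw: bool = False) -> None:
    # Reject file://, javascript:, data: and anything else that could
    # read local files or execute code on the server.
    scheme = urlparse(url).scheme.lower()
    allowed = ALLOWED_SCHEMES | ({"raw"} if allow_raw else set())
    if scheme not in allowed:
        raise ValueError(f"URL scheme '{scheme}' is not allowed")
```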
* Enhance authentication flow by implementing JWT token retrieval and adding authorization headers to API requests
* Add release notes for v0.7.9, detailing breaking changes, security fixes, new features, bug fixes, and documentation updates
* Add release notes for v0.8.0, detailing breaking changes, security fixes, new features, bug fixes, and documentation updates
Documentation for v0.8.0 release:
- SECURITY.md: Security policy and vulnerability reporting guidelines
- RELEASE_NOTES_v0.8.0.md: Comprehensive release notes
- migration/v0.8.0-upgrade-guide.md: Step-by-step migration guide
- security/GHSA-DRAFT-RCE-LFI.md: GitHub security advisory drafts
- CHANGELOG.md: Updated with v0.8.0 changes
Breaking changes documented:
- Docker API hooks disabled by default (CRAWL4AI_HOOKS_ENABLED)
- file:// URLs blocked on Docker API endpoints
Security fixes credited to Neo by ProjectDiscovery
* Add examples for deep crawl crash recovery and prefetch mode in documentation
* Release v0.8.0: The v0.8.0 Update
- Updated version to 0.8.0
- Added comprehensive demo and release notes
- Updated all documentation
* Update security researcher acknowledgment with a hyperlink for Neo by ProjectDiscovery
* Add async agenerate_schema method for schema generation
- Extract prompt building to shared _build_schema_prompt() method
- Add agenerate_schema() async version using aperform_completion_with_backoff
- Refactor generate_schema() to use shared prompt builder
- Fixes Gemini/Vertex AI compatibility in async contexts (FastAPI)
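A sketch of the async variant, assuming agenerate_schema() mirrors generate_schema()'s signature:
```python
from crawl4ai import JsonCssExtractionStrategy, LLMConfig

async def build_schema(html: str) -> dict:
    # Async version -- safe to call from FastAPI handlers and other
    # running event loops where the sync version could block.
    return await JsonCssExtractionStrategy.agenerate_schema(
        html=html,
        llm_config=LLMConfig(provider="gemini/gemini-1.5-pro", api_token="..."),
    )
```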
* Fix: Enable litellm.drop_params for O-series/GPT-5 model compatibility
O-series (o1, o3) and GPT-5 models only support temperature=1.
Setting litellm.drop_params=True auto-drops unsupported parameters
instead of throwing UnsupportedParamsError.
Fixes temperature=0.01 error for these models in LLM extraction.
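For reference, the litellm switch this fix enables behaves like this (the library now sets it internally):
```python
import litellm

# With drop_params enabled, litellm silently drops parameters a model
# doesn't support (e.g. temperature != 1 on o1/o3/gpt-5) instead of
# raising UnsupportedParamsError.
litellm.drop_params = True

response = litellm.completion(
    model="o3-mini",
    messages=[{"role": "user", "content": "hi"}],
    temperature=0.01,  # dropped for O-series models rather than erroring
)
```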
---------
Co-authored-by: rbushria <rbushri@gmail.com>
Co-authored-by: AHMET YILMAZ <tawfik@kidocode.com>
Co-authored-by: Soham Kukreti <kukretisoham@gmail.com>
Co-authored-by: Chris Murphy <chris.murphy@klaviyo.com>
Co-authored-by: unclecode <unclecode@kidocode.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
* announcement: add application form for cloud API closed beta
* Release v0.7.8: Stability & Bug Fix Release
- Updated version to 0.7.8
- Introduced focused stability release addressing 11 community-reported bugs.
- Key fixes include Docker API improvements, LLM extraction enhancements, URL handling corrections, and dependency updates.
- Added detailed release notes for v0.7.8 in the blog and created a dedicated verification script to ensure all fixes are functioning as intended.
- Updated documentation to reflect recent changes and improvements.
* docs: add section for Crawl4AI Cloud API closed beta with application link
* fix: add disk cleanup step to Docker workflow
---------
Co-authored-by: rbushria <rbushri@gmail.com>
Co-authored-by: AHMET YILMAZ <tawfik@kidocode.com>
Co-authored-by: Soham Kukreti <kukretisoham@gmail.com>
Co-authored-by: Chris Murphy <chris.murphy@klaviyo.com>
Co-authored-by: Aravind Karnam <aravind.karanam@gmail.com>
* Fix: LLM extraction used blocking, synchronous completion calls during async
execution, causing URLs to be processed sequentially instead of in parallel.
Changes:
- Added aperform_completion_with_backoff() using litellm.acompletion for async LLM calls
- Implemented arun() method in ExtractionStrategy base class with thread pool fallback
- Created async arun() and aextract() methods in LLMExtractionStrategy using asyncio.gather
- Updated AsyncWebCrawler.arun() to detect and use arun() when available
- Added comprehensive test suite to verify parallel execution
Impact:
- LLM extraction now runs truly in parallel across multiple URLs
- Significant performance improvement for multi-URL crawls with LLM strategies
- Backward compatible - existing extraction strategies continue to work
- No breaking changes to public API
Technical details:
- Uses litellm.acompletion for non-blocking LLM calls
- Leverages asyncio.gather for concurrent chunk processing
- Maintains backward compatibility via asyncio.to_thread fallback
- Works seamlessly with MemoryAdaptiveDispatcher and other dispatchers
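With these changes, a multi-URL crawl with an LLM strategy parallelizes without any API change; a sketch:
```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, LLMConfig, LLMExtractionStrategy

strategy = LLMExtractionStrategy(
    llm_config=LLMConfig(provider="openai/gpt-4o-mini", api_token="sk-..."),
    instruction="Extract the page title and a one-line summary",
)
config = CrawlerRunConfig(extraction_strategy=strategy)

async def main():
    urls = ["https://example.com/a", "https://example.com/b", "https://example.com/c"]
    async with AsyncWebCrawler() as crawler:
        # LLM calls across these URLs now overlap via asyncio.gather
        # instead of running one at a time.
        results = await crawler.arun_many(urls=urls, config=config)
        print([r.success for r in results])

asyncio.run(main())
```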
- Fixed widespread typo: `temprature` → `temperature` across LLMConfig and related files
- Enhanced CSS/XPath selector guidance for more reliable LinkedIn data extraction
- Added Google Colab display server support for running Crawl4AI in notebook environments
- Improved browser debugging with verbose startup args logging
- Updated LinkedIn schemas and HTML snippets for better parsing accuracy
This patch ensures consistent handling of `response.choices[0].message.content` by avoiding redefinition
of the `response` variable, which caused downstream exceptions during error handling.
Add new RegexExtractionStrategy for fast, zero-LLM extraction of common data types:
- Built-in patterns for emails, URLs, phones, dates, and more
- Support for custom regex patterns
- LLM-assisted pattern generation utility
- Optimized HTML preprocessing with fit_html field
- Enhanced network response body capture
Breaking changes: None
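A hedged usage sketch; the built-in patterns are combinable flags, though exact names may differ by release:
```python
from crawl4ai import CrawlerRunConfig, RegexExtractionStrategy

# Zero-LLM extraction: combine built-in patterns as flags.
strategy = RegexExtractionStrategy(
    pattern=RegexExtractionStrategy.Email | RegexExtractionStrategy.Url,
)
config = CrawlerRunConfig(extraction_strategy=strategy)
```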
Enhance error handling and stability across multiple components:
- Add safety checks in async_configs.py for type and params existence
- Fix browser manager initialization and cleanup logic
- Add default LLM config fallback in extraction strategy
- Add comprehensive Docker deployment guide and server tests
BREAKING CHANGE: BrowserManager.start() now automatically closes existing instances
Adds new features to improve user experience and configuration:
- Quick JSON extraction with -j flag for direct LLM-based structured data extraction
- Global configuration management with 'crwl config' commands
- Enhanced LLM extraction with better JSON handling and error management
- New user settings for default behaviors (LLM provider, browser settings, etc.)
Breaking changes: None
Add new preprocess_html_for_schema utility function to better handle HTML cleaning
for schema generation. This replaces the previous optimize_html function in the
GoogleSearchCrawler and includes smarter attribute handling and pattern detection.
Other changes:
- Update default provider to gpt-4o
- Add DEFAULT_PROVIDER_API_KEY constant
- Make LLMConfig creation more flexible with create_llm_config helper
- Add new dependencies: zstandard and msgpack
This change improves schema generation reliability while reducing noise in the
processed HTML.
Add new features to enhance browser automation and HTML extraction:
- Add CDP browser launch capability with customizable ports and profiles
- Implement JsonLxmlExtractionStrategy for faster HTML parsing
- Add CLI command 'crwl cdp' for launching standalone CDP browsers
- Support connecting to external CDP browsers via URL
- Optimize selector caching and context-sensitive queries
BREAKING CHANGE: LLMConfig import path changed from crawl4ai.types to crawl4ai
Modify CrawlStats class to handle both datetime and float timestamp formats for start_time and end_time fields. This change improves compatibility with different time formats while maintaining existing functionality.
Other minor changes:
- Add datetime import in async_dispatcher
- Update JsonElementExtractionStrategy kwargs handling
No breaking changes.
Rename LlmConfig to LLMConfig across the codebase to follow consistent naming conventions.
Update all imports and usages to use the new name.
Update documentation and examples to reflect the change.
BREAKING CHANGE: LlmConfig has been renamed to LLMConfig. Users need to update their imports and usage.
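The migration is a one-line import change:
```python
# Before (pre-rename):
# from crawl4ai import LlmConfig
# llm = LlmConfig(provider="openai/gpt-4o", api_token="sk-...")

# After:
from crawl4ai import LLMConfig
llm = LLMConfig(provider="openai/gpt-4o", api_token="sk-...")
```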
* feature: Add LlmConfig to easily configure and pass LLM configs to different strategies
* pulled in next branch and resolved conflicts
* feat: Add gemini and deepseek providers. Make ignore_cache in llm content filter to true by default to avoid confusions
* Refactor: Update LlmConfig in LLMExtractionStrategy class and deprecate old params
* updated tests, docs and readme
Remove content filter related code and parameters as part of simplifying the crawler configuration. This includes:
- Removing ContentFilter import and related classes
- Removing content_filter parameter from CrawlerRunConfig
- Cleaning up LLMExtractionStrategy constructor parameters
BREAKING CHANGE: Removed content_filter parameter from CrawlerRunConfig. Users should migrate to using extraction strategies for content filtering.
Implements a full-featured CLI for Crawl4AI with the following capabilities:
- Basic and advanced web crawling
- Configuration management via YAML/JSON files
- Multiple extraction strategies (CSS, XPath, LLM)
- Content filtering and optimization
- Interactive Q&A capabilities
- Various output formats
- Comprehensive documentation and examples
Also includes:
- Home directory setup for configuration and cache
- Environment variable support for API tokens
- Test suite for CLI functionality
Complete overhaul of Docker deployment setup with improved architecture:
- Add Redis integration for task management
- Implement rate limiting and security middleware
- Add Prometheus metrics and health checks
- Improve error handling and logging
- Add support for streaming responses
- Implement proper configuration management
- Add platform-specific optimizations for ARM64/AMD64
BREAKING CHANGE: Docker deployment now requires Redis and new config.yml structure
Add JavaScript execution result handling and improve PDF processing capabilities:
- Add js_execution_result to CrawlResult and AsyncCrawlResponse models
- Implement execution result capture in AsyncPlaywrightCrawlerStrategy
- Add batch processing for PDF pages with configurable batch size
- Enhance JsonElementExtractionStrategy with better schema generation
- Add HTML optimization utilities
BREAKING CHANGE: PDF processing now uses batch processing by default
Add new PDF processing module with the following features:
- PDF text extraction and formatting to HTML/Markdown
- Image extraction with multiple format support (JPEG, PNG, TIFF)
- Link extraction from PDF documents
- Metadata extraction including title, author, dates
- Support for both local and remote PDF files
Also includes:
- New configuration options for HTML attribute handling
- Internal/external link filtering improvements
- Version bump to 0.4.300b4
Adds new static method generate_schema() to JsonElementExtractionStrategy classes
that can automatically generate extraction schemas using LLM (OpenAI or Ollama).
This provides a convenient way to bootstrap extraction schemas while maintaining
the performance benefits of selector-based extraction.
Key changes:
- Added generate_schema() static method to base extraction strategy
- Added support for both CSS and XPath schema generation
- Updated documentation with examples and best practices
- Added new prompt templates for schema generation
- Remove .pre-commit-config.yaml and duplicate mkdocs configuration files
- Add Optional type hint for proxy parameter in BrowserConfig
- Fix type annotation for results list in AsyncWebCrawler
- Move calculate_batch_size function import to model_loader
- Update prompt imports in extraction_strategy.py
No breaking changes.
Reorganize documentation into core/advanced/extraction sections for better navigation.
Update terminal theme styles and add rich library for better CLI output.
Remove redundant tutorial files and consolidate content into core sections.
Add personal story to index page for project context.
BREAKING CHANGE: Documentation structure has been significantly reorganized
Add explanatory comments to JsonCssExtractionStrategy._get_elements() method to clarify that it returns all matching elements using select() instead of select_one(). This helps developers understand the method's behavior and its difference from single element selection.
Removed trailing whitespace at end of file.
- Fix JsonCssExtractionStrategy._get_elements to return all matching elements instead of just one
- Add robust error handling to page_need_scroll with default fallback
- Improve JSON extraction strategies documentation
- Refactor content scraping strategy
- Update version to 0.4.247
- Added examples for Amazon product data extraction methods
- Updated configuration options and enhance documentation
- Minor refactoring for improved performance and readability
- Cleaned up version control settings.
- Add llm.txt generator
- Added SSL certificate extraction in AsyncWebCrawler.
- Introduced new content filters and chunking strategies for more robust data extraction.
- Updated documentation.
- Fix crawler text mode for improved performance; cover missing `srcset` and `data_srcset` attributes in image tags.
- Introduced Managed Browsers for enhanced crawling experience.
- Updated documentation for clearer navigation on configuration.
- Changed 'text_only' to 'text_mode' in configuration and methods.
- Improved performance and relevance in content filtering strategies.
- ReImplemented JsonXPathExtractionStrategy for enhanced JSON data extraction.
- Updated existing extraction strategies for better performance.
- Improved handling of response status codes during crawls.
Enhance Async Crawler with storage state handling
- Updated Async Crawler to support storage state management.
- Added error handling for URL validation in Async Web Crawler.
- Modified README logo and improved .gitignore entries.
- Fixed issues in multiple files for better code robustness.
• Add smart overlay removal system for handling popups and modals
• Improve screenshot functionality with configurable timing controls
• Implement URL normalization and enhanced link processing
• Add custom base directory support for cache storage
• Refine external content filtering and social media domain handling
This commit significantly improves the crawler's ability to handle modern
websites by automatically removing intrusive overlays and providing better
screenshot capabilities. URL handling is now more robust with proper
normalization and duplicate detection. The cache system is more flexible
with customizable base directory support.
Breaking changes: None
Issue numbers: None
- Add before_retrieve_html hook and delay_before_return_html option
- Implement flexible page_timeout for smart_wait function
- Support extra_args and custom headers in LLM extraction
- Allow arbitrary kwargs in AsyncWebCrawler initialization
- Improve perform_completion_with_backoff for custom API calls
- Update examples with new features and diverse LLM providers
Update the `LLMExtractionStrategy` class so schema extraction runs only when `extract_type` is "schema" and a non-empty schema is provided. Previously it was triggered whenever `extract_type` was "schema", even without a schema, causing unnecessary extraction work. Also, `numpy` was removed from the default installation mode.
Significant improvements in text processing and performance:
- 🚀 **Dependency reduction**: Removed dependency on spaCy model for text chunk labeling in cosine extraction strategy.
- 🤖 **Transformer upgrade**: Implemented text sequence classification using a transformer model for labeling text chunks.
- ⚡ **Performance enhancement**: Improved model loading speed due to removal of spaCy dependency.
- 🔧 **Future-proofing**: Laid groundwork for potential complete removal of spaCy dependency in future versions.
These changes address issue #68 and provide a foundation for faster, more efficient text processing in Crawl4AI.
A slew of exciting updates to improve the crawler's stability and robustness! 🎉
- 💻 **UTF encoding fix**: Resolved the Windows "charmap" error by adding UTF encoding.
- 🛡️ **Error handling**: Implemented MaxRetryError exception handling in LocalSeleniumCrawlerStrategy.
- 🧹 **Input sanitization**: Improved input sanitization and handled encoding issues in LLMExtractionStrategy.
- 🚮 **Database cleanup**: Removed existing database file and initialized a new one.
This commit updates the version number to v0.2.73 and makes corresponding changes in the README.md and Dockerfile.
The Dockerfile now installs the default mode, which resolves many installation issues.
Additionally, the installation instructions now cover the different installation modes, and setup.py no longer depends on spaCy.
The change log is also updated to reflect these changes.
Support websites that need a with-head (headful) browser.