- Change API_BASE to relative '/api' for production
- Move marketplace to /marketplace instead of /marketplace/frontend
- Update MkDocs navigation
- Fix logo path in marketplace index
- Implement marketplace frontend and admin dashboard
- Add FastAPI backend with environment-based configuration
- Use .env file for secrets management
- Include data generation scripts
- Add proper CORS configuration (see the sketch after this list)
- Remove hardcoded password from admin login
- Update gitignore for security
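A minimal sketch of what the CORS setup can look like with FastAPI's stock middleware; `ALLOWED_ORIGINS` is an assumed variable name for this sketch, not necessarily the one used in the backend:

```python
import os

from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware

app = FastAPI()

# Origins come from the environment rather than being hardcoded;
# ALLOWED_ORIGINS is an assumed name, illustrating the .env-driven config.
origins = os.getenv("ALLOWED_ORIGINS", "http://localhost:3000").split(",")

app.add_middleware(
    CORSMiddleware,
    allow_origins=origins,          # explicit origin list instead of "*"
    allow_credentials=True,
    allow_methods=["GET", "POST"],
    allow_headers=["*"],
)
```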
- Add comprehensive brand book with color system, typography, components
- Add page copy dropdown with markdown copy/view functionality
- Update mkdocs.yml with new assets and branding navigation
- Use terminal-style ASCII icons and condensed menu design
- Updated ProxyConfig.from_string to support multiple proxy formats, including URLs with credentials.
- Deprecated the 'proxy' parameter in BrowserConfig, replacing it with 'proxy_config' for better flexibility.
- Added warnings for deprecated usage and clarified behavior when both parameters are provided.
- Updated documentation and tests to reflect changes in proxy configuration handling.
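A short sketch of the new configuration path; the proxy string shown is one example of the formats `from_string` is described as accepting, and the top-level import location may differ by version:

```python
from crawl4ai import BrowserConfig, ProxyConfig

# URL form with embedded credentials (one of the newly supported formats)
proxy = ProxyConfig.from_string("http://user:pass@proxy.example.com:8080")

# Preferred going forward: pass proxy_config instead of the deprecated
# 'proxy' string; supplying both triggers a deprecation warning.
browser_cfg = BrowserConfig(proxy_config=proxy)
```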
Implement hierarchical configuration for LLM parameters with support for:
- Temperature control (0.0-2.0) to adjust response creativity
- Custom base_url for proxy servers and alternative endpoints
- 4-tier priority: request params > provider env > global env > defaults
Add helper functions in utils.py, update API schemas and handlers,
support environment variables (LLM_TEMPERATURE, OPENAI_TEMPERATURE, etc.),
and provide comprehensive documentation with examples.
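A sketch of the 4-tier resolution order for temperature; `resolve_temperature` is an illustrative helper, not necessarily the one shipped in utils.py:

```python
import os

DEFAULT_TEMPERATURE = 0.7  # illustrative built-in default

def resolve_temperature(request_value=None, provider="openai"):
    """Resolve temperature with the 4-tier priority described above:
    request params > provider env > global env > defaults."""
    if request_value is not None:                       # 1. request params
        return float(request_value)
    provider_env = os.getenv(f"{provider.upper()}_TEMPERATURE")
    if provider_env is not None:                        # 2. provider env, e.g. OPENAI_TEMPERATURE
        return float(provider_env)
    global_env = os.getenv("LLM_TEMPERATURE")
    if global_env is not None:                          # 3. global env
        return float(global_env)
    return DEFAULT_TEMPERATURE                          # 4. built-in default
```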
- Replace BaseStrategy with CrawlStrategy in custom strategy examples (DomainSpecificStrategy, HybridStrategy)
- Remove “Custom Link Scoring” and “Caching Strategy” sections no longer aligned with current library
- Revise memory pruning example to use adaptive.get_relevant_content and index-based retention of top 500 docs
- Correct Quickstart note: the default cache mode is CacheMode.BYPASS, so caching must be enabled explicitly with CacheMode.ENABLED
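For reference, enabling the cache explicitly looks like this (CacheMode and CrawlerRunConfig are the library's existing names):

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CacheMode, CrawlerRunConfig

async def main():
    # Default is CacheMode.BYPASS, so caching must be opted into explicitly.
    config = CrawlerRunConfig(cache_mode=CacheMode.ENABLED)
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com", config=config)
        print(result.success)

asyncio.run(main())
```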
Previously, the system incorrectly used OPENAI_API_KEY for all LLM providers
due to a hardcoded api_key_env fallback in config.yml. This caused authentication
errors when using non-OpenAI providers like Gemini.
Changes:
- Remove api_key_env from config.yml to let litellm handle provider-specific env vars
- Simplify get_llm_api_key() to return None, allowing litellm to auto-detect keys
- Update validate_llm_provider() to trust litellm's built-in key detection
- Update documentation to reflect the new automatic key handling
The fix leverages litellm's existing capability to automatically find the correct
environment variable for each provider (OPENAI_API_KEY, GEMINI_API_KEY, etc.)
without manual configuration.
ref #1291
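A sketch of the simplified handlers; the function names come from this change, the bodies are illustrative (litellm.validate_environment is litellm's own helper for this):

```python
import litellm

def get_llm_api_key(provider=None):
    # Return None so litellm resolves the provider-specific variable itself
    # (OPENAI_API_KEY for openai/*, GEMINI_API_KEY for gemini/*, etc.).
    return None

def validate_llm_provider(provider):
    # Trust litellm's built-in detection instead of re-checking env vars here.
    check = litellm.validate_environment(model=provider)
    return check["keys_in_environment"], check["missing_keys"]
```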
- Add comprehensive v0.7.4 release blog post with LLMTableExtraction feature highlight
- Update blog index to feature v0.7.4 as latest release
- Update README.md to showcase v0.7.4 features alongside v0.7.3
- Accurately describe dispatcher fix as bug fix rather than major enhancement
- Include practical code examples for new LLMTableExtraction capabilities
BREAKING CHANGE: Table extraction now uses Strategy Design Pattern
This commit introduces a strategy-based approach to table extraction in Crawl4AI:
✨ NEW FEATURES:
- LLMTableExtraction: AI-powered extraction for complex HTML tables with rowspan/colspan
- Smart Chunking: Automatically splits massive tables into optimal chunks at row boundaries
- Parallel Processing: Processes multiple chunks simultaneously for blazing-fast extraction
- Intelligent Merging: Seamlessly combines chunk results into complete tables
- Header Preservation: Each chunk maintains context with original headers
- Auto-retry Logic: Built-in resilience with configurable retry attempts
🏗️ ARCHITECTURE:
- Strategy Design Pattern for pluggable table extraction strategies
- ThreadPoolExecutor for concurrent chunk processing
- Token-based chunking with configurable thresholds
- Handles tables without headers gracefully
⚡ PERFORMANCE:
- Process 1000+ row tables without timeout
- Parallel processing with up to 5 concurrent chunks
- Smart token estimation prevents LLM context overflow
- Optimized for providers like Groq for massive tables
🔧 CONFIGURATION:
- enable_chunking: Auto-handle large tables (default: True)
- chunk_token_threshold: When to split (default: 3000 tokens)
- min_rows_per_chunk: Meaningful chunk sizes (default: 10)
- max_parallel_chunks: Concurrent processing (default: 5)
📚 BACKWARD COMPATIBILITY:
- Existing code continues to work unchanged
- DefaultTableExtraction remains the default strategy
- Progressive enhancement approach
This takes table extraction from simple tables all the way to massive, complex data grids with merged cells and nested structures, and the chunking stays completely transparent to users.
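A hedged usage sketch wiring the configuration knobs above together; the import path and the CrawlerRunConfig parameter name follow the release notes and should be verified against the shipped API:

```python
from crawl4ai import CrawlerRunConfig, LLMConfig
from crawl4ai.table_extraction import LLMTableExtraction  # import path assumed

strategy = LLMTableExtraction(
    llm_config=LLMConfig(provider="groq/llama-3.3-70b-versatile"),  # illustrative provider
    enable_chunking=True,         # auto-handle large tables (default: True)
    chunk_token_threshold=3000,   # split tables beyond ~3000 tokens
    min_rows_per_chunk=10,        # keep chunks meaningful
    max_parallel_chunks=5,        # up to 5 chunks processed concurrently
)

config = CrawlerRunConfig(table_extraction=strategy)  # parameter name assumed
```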
- Wrap all AsyncUrlSeeder usage with async context managers
- Update URL seeding adventure example to use "sitemap+cc" source, focus on course posts, and add stream=True parameter to fix runtime error
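The context-manager pattern now used everywhere, in miniature (AsyncUrlSeeder, SeedingConfig, and the "sitemap+cc" source are the library's own names):

```python
import asyncio
from crawl4ai import AsyncUrlSeeder, SeedingConfig

async def main():
    # The async context manager guarantees the seeder's resources are released.
    async with AsyncUrlSeeder() as seeder:
        config = SeedingConfig(source="sitemap+cc")  # sitemaps + Common Crawl
        urls = await seeder.urls("example.com", config)
        print(len(urls))

asyncio.run(main())
```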
- Update Table-to-DataFrame Extraction example in README.md
- Replace old method of accessing tables via result.media directly with result.tables in the documentation
- Remove tables section from links & media page.
- Add tables section to crawler result page.
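The documented access pattern, before and after (the headers/rows shape of each table entry follows the docs and may vary):

```python
import pandas as pd

# Old, removed from the docs:
# tables = result.media["tables"]

# New:
for table in result.tables:
    df = pd.DataFrame(table["rows"], columns=table["headers"])
    print(df.head())
```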
- Support LLM_PROVIDER env var to override default provider (openai/gpt-4o-mini)
- Add optional 'provider' parameter to API endpoints for per-request overrides
- Implement provider validation to ensure API keys exist
- Update documentation and examples with new configuration options
Removes the need to hardcode providers in config.yml
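A sketch of the resulting precedence (per-request 'provider' > LLM_PROVIDER env var > built-in default); `resolve_provider` is illustrative:

```python
import os

DEFAULT_PROVIDER = "openai/gpt-4o-mini"

def resolve_provider(request_provider=None):
    """Per-request 'provider' wins, then LLM_PROVIDER, then the default."""
    return request_provider or os.getenv("LLM_PROVIDER", DEFAULT_PROVIDER)
```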
refactor: consolidate WebScrapingStrategy to use LXML implementation only
BREAKING CHANGE: None - full backward compatibility maintained
This commit simplifies the content scraping architecture by removing the
redundant BeautifulSoup-based WebScrapingStrategy implementation and making
it an alias for LXMLWebScrapingStrategy.
Changes:
- Remove ~1000 lines of BeautifulSoup-based WebScrapingStrategy code
- Make WebScrapingStrategy an alias for LXMLWebScrapingStrategy (see the sketch after this message)
- Update LXMLWebScrapingStrategy to inherit directly from ContentScrapingStrategy
- Add required methods (scrap, ascrap, process_element, _log) to LXMLWebScrapingStrategy
- Maintain 100% backward compatibility - existing code continues to work
Code changes:
- crawl4ai/content_scraping_strategy.py: Remove WebScrapingStrategy class, add alias
- crawl4ai/async_configs.py: Remove WebScrapingStrategy from imports
- crawl4ai/__init__.py: Update imports to show alias relationship
- crawl4ai/types.py: Update type definitions
- crawl4ai/legacy/web_crawler.py: Update import to use alias
- tests/async/test_content_scraper_strategy.py: Update to use LXMLWebScrapingStrategy
- docs/examples/scraping_strategies_performance.py: Update to use single strategy
Documentation updates:
- docs/md_v2/core/content-selection.md: Update scraping modes section
- docs/md_v2/migration/webscraping-strategy-migration.md: Add migration guide
- CHANGELOG.md: Document the refactoring under [Unreleased]
Benefits:
- 10-20x faster HTML parsing for large documents
- Reduced memory usage and simplified codebase
- Consistent parsing behavior
- No migration required for existing users
All existing code using WebScrapingStrategy continues to work without
modification, while benefiting from LXML's superior performance.
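If the alias is a plain assignment, as this commit describes, the relationship is observable directly:

```python
from crawl4ai import LXMLWebScrapingStrategy, WebScrapingStrategy

# A simple-assignment alias means both names refer to the same class,
# so existing WebScrapingStrategy code picks up LXML parsing for free.
assert WebScrapingStrategy is LXMLWebScrapingStrategy
```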
Fix critical issue where unmatched URLs incorrectly used the first config instead of failing safely. Also clarify that configs without url_matcher match ALL URLs by design, and improve memory usage monitoring.
Bug fixes:
- Change select_config() to return None when no config matches instead of using first config
- Add proper error handling in dispatchers when no config matches a URL
- Return failed CrawlResult with "No matching configuration found" error message
- Fix is_match() to return True when url_matcher is None (matches all URLs)
- Import and use get_true_memory_usage_percent() for more accurate memory monitoring
Behavior clarification:
- CrawlerRunConfig with url_matcher=None matches ALL URLs (not nothing)
- This is the intended behavior for default/fallback configurations
- Enables the clean pattern: specific configs first, default config last (see the sketch after this message)
Documentation updates:
- Clarify that configs without url_matcher match everything
- Explain "No matching configuration found" error when no default config
- Add examples showing proper default config usage
- Update all relevant docs: multi-url-crawling.md, arun_many.md, parameters.md
- Simplify API config examples by removing extraction_strategy
Demo and test updates:
- Update demo_multi_config_clean.py with commented default config to show behavior
- Change example URL to w3schools.com to demonstrate no-match scenario
- Uncomment all test URLs in test_multi_config.py for comprehensive testing
Breaking changes: None - this restores the intended behavior
This ensures URLs only get processed with appropriate configs, preventing
issues like HTML pages being processed with PDF extraction strategies.
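A sketch of the specific-first, default-last pattern this change enables; the URL patterns themselves are illustrative:

```python
from crawl4ai import CrawlerRunConfig

configs = [
    CrawlerRunConfig(url_matcher="*.pdf"),      # PDFs only
    CrawlerRunConfig(url_matcher="*/blog/*"),   # blog pages only
    CrawlerRunConfig(),  # url_matcher=None matches ALL URLs: the fallback
]
# Without that final fallback, a URL matching nothing now produces a failed
# CrawlResult with "No matching configuration found" instead of silently
# falling back to the first config.
```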
- Remove unused StealthConfig from browser_manager.py
- Update LinkPreviewConfig import path in __init__.py and examples
- Fix infinity handling in content_scraping_strategy.py (use 0 instead of float('inf'))
- Remove sanitize_json_data functions from API endpoints
- Add comprehensive C4A Script documentation to release notes
- Update v0.7.0 release notes with improved code examples
- Create v0.7.1 release notes focusing on cleanup and documentation improvements
- Update demo files with corrected import paths and examples
- Fix virtual scroll and adaptive crawling examples across documentation
- Update mkdocs.yml site name to v0.7.x
- Add v0.7.0 to blog index as latest release
- Move v0.6.0 to Previous Releases section
- Copy release notes to proper location in docs/md_v2/blog/releases/
- Bump version to 0.7.0
- Add release notes and demo files
- Update README with v0.7.0 features
- Update Docker configurations for v0.7.0-r1
- Move v0.7.0 demo files to releases_review
- Fix BM25 scoring bug in URLSeeder
Major features:
- Adaptive Crawling with pattern learning
- Virtual Scroll support for infinite pages
- Link Preview with 3-layer scoring
- Async URL Seeder for massive discovery
- Performance optimizations