feat(release): prepare v0.4.3 beta release
Prepare the v0.4.3 beta release with major feature additions and improvements:

- Add JsonXPathExtractionStrategy and LLMContentFilter to exports
- Update version to 0.4.3b1
- Improve documentation for dispatchers and markdown generation
- Update development status to Beta
- Reorganize changelog format

BREAKING CHANGE: Memory threshold in MemoryAdaptiveDispatcher increased to 90% and SemaphoreDispatcher parameter renamed to max_session_permit
This commit is contained in:
150
CHANGELOG.md
@@ -1,19 +1,3 @@
### [Added] 2025-01-21
- Added robots.txt compliance support with efficient SQLite-based caching
- New `check_robots_txt` parameter in CrawlerRunConfig to enable robots.txt checking
- Documentation updates for robots.txt compliance features and examples
- Automated robots.txt checking integrated into AsyncWebCrawler with 403 status codes for blocked URLs

### [Added] 2025-01-20
- Added proxy configuration support to CrawlerRunConfig allowing dynamic proxy settings per crawl request
- Updated documentation with examples for using proxy configuration in crawl operations

### [Added] 2025-01-20
- New LLM-powered schema generation utility for JsonElementExtractionStrategy
- Support for automatic CSS and XPath schema generation using OpenAI or Ollama
- Comprehensive documentation and examples for schema generation
- New prompt templates optimized for HTML schema analysis

# Changelog

All notable changes to Crawl4AI will be documented in this file.
@@ -21,6 +5,140 @@ All notable changes to Crawl4AI will be documented in this file.
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
## Version 0.4.3 (2025-01-21)

This release introduces several powerful new features, including robots.txt compliance, dynamic proxy support, LLM-powered schema generation, and improved documentation.

### Features
- **Robots.txt Compliance:**
  - Added robots.txt compliance support with efficient SQLite-based caching.
  - New `check_robots_txt` parameter in `CrawlerRunConfig` to enable robots.txt checking before crawling a URL.
  - Automated robots.txt checking is now integrated into `AsyncWebCrawler` with 403 status codes for blocked URLs.
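The caching idea can be sketched with the standard library alone. This is an illustrative sketch of the mechanism (SQLite-cached robots.txt decisions), not Crawl4AI's actual implementation; `RobotsCache` and `fake_fetch` are invented names:

```python
# Sketch: cache each domain's robots.txt body in SQLite so repeated checks
# against the same domain skip the network fetch.
import sqlite3
import urllib.robotparser

class RobotsCache:
    def __init__(self, db_path=":memory:"):
        self.con = sqlite3.connect(db_path)
        self.con.execute(
            "CREATE TABLE IF NOT EXISTS robots (domain TEXT PRIMARY KEY, body TEXT)"
        )

    def _get_body(self, domain, fetch):
        row = self.con.execute(
            "SELECT body FROM robots WHERE domain = ?", (domain,)
        ).fetchone()
        if row is None:
            body = fetch(domain)  # network fetch happens only on a cache miss
            self.con.execute("INSERT INTO robots VALUES (?, ?)", (domain, body))
            self.con.commit()
            return body
        return row[0]

    def can_fetch(self, url, fetch, agent="*"):
        domain = url.split("/")[2]
        parser = urllib.robotparser.RobotFileParser()
        parser.parse(self._get_body(domain, fetch).splitlines())
        return parser.can_fetch(agent, url)

fetches = []
def fake_fetch(domain):
    fetches.append(domain)
    return "User-agent: *\nDisallow: /private/"

cache = RobotsCache()
print(cache.can_fetch("https://example.com/public/page", fake_fetch))   # True
print(cache.can_fetch("https://example.com/private/page", fake_fetch))  # False
print(len(fetches))  # 1 -- the second check hit the SQLite cache
```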
- **Proxy Configuration:**
  - Added proxy configuration support to `CrawlerRunConfig`, allowing dynamic proxy settings per crawl request.
  - Updated documentation with examples for using proxy configuration in crawl operations.
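A hedged sketch of what a per-request proxy setting might look like; the `proxy_config` parameter name and its keys are assumptions based on this release note, not a confirmed API:

```python
# Hypothetical illustration only: shape of a per-crawl proxy configuration.
proxy_config = {
    "server": "http://proxy.example.com:8080",
    "username": "proxy-user",  # optional credentials
    "password": "proxy-pass",
}
# Sketch of how such a dict might be passed per request:
# run_config = CrawlerRunConfig(proxy_config=proxy_config)
print(sorted(proxy_config))  # ['password', 'server', 'username']
```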
- **LLM-Powered Schema Generation:**
  - Introduced a new utility for automatic CSS and XPath schema generation using OpenAI or Ollama models.
  - Added comprehensive documentation and examples for schema generation.
  - New prompt templates optimized for HTML schema analysis.
- **URL Redirection Tracking:**
  - Added URL redirection tracking to capture the final URL after any redirects.
  - The final URL is now available in the `final_url` field of the `AsyncCrawlResponse` object.
- **Streamlined Documentation:**
  - Refactored and improved the documentation structure for clarity and ease of use.
  - Added detailed explanations of new features and updated examples.
- **Improved Browser Context Management:**
  - Enhanced the management of browser contexts and added shared data support.
  - Introduced the `shared_data` parameter in `CrawlerRunConfig` to pass data between hooks.
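The shared-data idea is simple to sketch: a dict created once per crawl is handed to every hook, so earlier hooks can stash values for later ones. The hook names below are hypothetical; Crawl4AI's actual hook API may differ:

```python
# Illustrative sketch of hooks communicating through a shared dict.
def on_page_created(page, shared_data):
    # e.g. capture a value after login that later hooks will need
    shared_data["auth_token"] = "token-123"

def before_return_html(page, shared_data):
    # a later hook reads what an earlier hook stored
    return f"crawled with {shared_data.get('auth_token', 'no token')}"

shared = {}
on_page_created(page=None, shared_data=shared)
print(before_return_html(page=None, shared_data=shared))
# -> crawled with token-123
```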
- **Memory Dispatcher System:**
  - Migrated to a memory dispatcher system with enhanced monitoring capabilities.
  - Introduced `MemoryAdaptiveDispatcher` and `SemaphoreDispatcher` for improved resource management.
  - Added `RateLimiter` for rate limiting support.
  - New `CrawlerMonitor` for real-time monitoring of crawler operations.
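The memory-adaptive idea can be sketched generically: cap concurrency with a semaphore and hold new work while a memory probe reads above the threshold. This is a stdlib-only sketch, not the `MemoryAdaptiveDispatcher` implementation; `read_memory_percent` stands in for a real probe such as psutil:

```python
import asyncio

async def run_adaptive(tasks, read_memory_percent, memory_threshold_percent=90.0,
                       max_session_permit=5, check_interval=0.01):
    sem = asyncio.Semaphore(max_session_permit)  # hard concurrency cap
    results = []

    async def run_one(task):
        async with sem:
            # back off while the reported memory usage is above the threshold
            while read_memory_percent() >= memory_threshold_percent:
                await asyncio.sleep(check_interval)
            results.append(await task())

    await asyncio.gather(*(run_one(t) for t in tasks))
    return results

async def fake_crawl():
    await asyncio.sleep(0)
    return "ok"

out = asyncio.run(run_adaptive([fake_crawl] * 4, read_memory_percent=lambda: 42.0))
print(out)  # ['ok', 'ok', 'ok', 'ok']
```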
- **Streaming Support:**
  - Added streaming support for processing crawled URLs as they complete.
  - Enabled streaming mode with the `stream` parameter in `CrawlerRunConfig`.
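The pattern behind `stream=True` can be sketched with plain asyncio, mirroring the dispatcher's `run_urls_stream`: yield each result as its task finishes instead of gathering everything first (a generic sketch, not the library's code):

```python
import asyncio

async def run_urls_stream(urls, crawl):
    tasks = [asyncio.create_task(crawl(u)) for u in urls]
    for finished in asyncio.as_completed(tasks):
        yield await finished  # hand back results in completion order

async def fake_crawl(url):
    # the "slow" URL finishes last
    await asyncio.sleep(0.01 if "slow" in url else 0)
    return url

async def collect():
    return [r async for r in run_urls_stream(
        ["https://a.test/slow", "https://b.test"], fake_crawl)]

order = asyncio.run(collect())
print(order)  # fast URL arrives first: ['https://b.test', 'https://a.test/slow']
```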
- **Content Scraping Strategy:**
  - Introduced a new `LXMLWebScrapingStrategy` for faster content scraping.
  - Added support for selecting the scraping strategy via the `scraping_strategy` parameter in `CrawlerRunConfig`.

### Bug Fixes

- **Browser Path Management:**
  - Improved browser path management for consistent behavior across different environments.

- **Memory Threshold:**
  - Adjusted the default memory threshold to improve resource utilization.

- **Pydantic Model Fields:**
  - Made several model fields optional with default values to improve flexibility.
### Refactor

- **Documentation Structure:**
  - Reorganized documentation structure to improve navigation and readability.
  - Updated styles and added new sections for advanced features.

- **Scraping Mode:**
  - Replaced the `ScrapingMode` enum with a strategy pattern for more flexible content scraping.

- **Version Update:**
  - Updated the version to `0.4.248`.

- **Code Cleanup:**
  - Removed unused files and improved type hints.
  - Applied Ruff corrections for code quality.

- **Updated Dependencies:**
  - Updated dependencies to their latest versions to ensure compatibility and security.

- **Ignored Patterns and Directories:**
  - Updated `.gitignore` and `.codeiumignore` to ignore additional patterns and directories, streamlining the development environment.

- **Simplified Personal Story in README:**
  - Streamlined the personal story and project vision in the `README.md` for clarity.

- **Removed Deprecated Files:**
  - Deleted several deprecated files and examples that are no longer relevant.

---
**Previous Releases:**

### 0.4.24x (2024-12-31)
- **Enhanced SSL & Security**: New SSL certificate handling with custom paths and validation options for secure crawling.
- **Smart Content Filtering**: Advanced filtering system with regex support and efficient chunking strategies.
- **Improved JSON Extraction**: Support for complex JSONPath, JSON-CSS, and Microdata extraction.
- **New Field Types**: Added `computed`, `conditional`, `aggregate`, and `template` field types.
- **Performance Boost**: Optimized caching, parallel processing, and memory management.
- **Better Error Handling**: Enhanced debugging capabilities with detailed error tracking.
- **Security Features**: Improved input validation and safe expression evaluation.
### 0.4.247 (2025-01-06)

#### Added
- **Windows Event Loop Configuration**: Introduced a utility function `configure_windows_event_loop` to resolve `NotImplementedError` for asyncio subprocesses on Windows. ([#utils.py](crawl4ai/utils.py), [#tutorials/async-webcrawler-basics.md](docs/md_v3/tutorials/async-webcrawler-basics.md))
- **`page_need_scroll` Method**: Added a method to determine if a page requires scrolling before taking actions in `AsyncPlaywrightCrawlerStrategy`. ([#async_crawler_strategy.py](crawl4ai/async_crawler_strategy.py))
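A sketch of what such a helper presumably does (an assumption; the real `configure_windows_event_loop` may differ): asyncio subprocess support on Windows requires the Proactor event loop, so the policy is switched only on that platform:

```python
import asyncio
import sys

def configure_windows_event_loop_sketch():
    if sys.platform == "win32":
        # The Proactor loop supports asyncio subprocesses; the default
        # selector loop on older setups does not.
        asyncio.set_event_loop_policy(asyncio.WindowsProactorEventLoopPolicy())

configure_windows_event_loop_sketch()  # no-op on non-Windows platforms
print(asyncio.get_event_loop_policy() is not None)  # True
```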
#### Changed
- **Version Bump**: Updated the version from `0.4.246` to `0.4.247`. ([#__version__.py](crawl4ai/__version__.py))
- **Improved Scrolling Logic**: Enhanced scrolling methods in `AsyncPlaywrightCrawlerStrategy` by adding a `scroll_delay` parameter for better control. ([#async_crawler_strategy.py](crawl4ai/async_crawler_strategy.py))
- **Markdown Generation Example**: Updated the `hello_world.py` example to reflect the latest API changes and better illustrate features. ([#examples/hello_world.py](docs/examples/hello_world.py))
- **Documentation Update**:
  - Added Windows-specific instructions for handling asyncio event loops. ([#async-webcrawler-basics.md](docs/md_v3/tutorials/async-webcrawler-basics.md))

#### Removed
- **Legacy Markdown Generation Code**: Removed outdated and unused code for markdown generation in `content_scraping_strategy.py`. ([#content_scraping_strategy.py](crawl4ai/content_scraping_strategy.py))
#### Fixed
- **Page Closing to Prevent Memory Leaks**:
  - **Description**: Added a `finally` block to ensure pages are closed when no `session_id` is provided.
  - **Impact**: Prevents memory leaks caused by lingering pages after a crawl.
  - **File**: [`async_crawler_strategy.py`](crawl4ai/async_crawler_strategy.py)
  - **Code**:
    ```python
    finally:
        # If no session_id is given we should close the page
        if not config.session_id:
            await page.close()
    ```
- **Multiple Element Selection**: Modified `_get_elements` in `JsonCssExtractionStrategy` to return all matching elements instead of just the first one, ensuring comprehensive extraction. ([#extraction_strategy.py](crawl4ai/extraction_strategy.py))
- **Error Handling in Scrolling**: Added robust error handling to ensure scrolling proceeds safely even if a configuration is missing. ([#async_crawler_strategy.py](crawl4ai/async_crawler_strategy.py))

#### Other
- **Git Ignore Update**: Added `/plans` to `.gitignore` for better development environment consistency. ([#.gitignore](.gitignore))
## [0.4.24] - 2024-12-31

### Added
@@ -12,10 +12,11 @@ from .extraction_strategy import (
     LLMExtractionStrategy,
     CosineStrategy,
     JsonCssExtractionStrategy,
+    JsonXPathExtractionStrategy
 )
 from .chunking_strategy import ChunkingStrategy, RegexChunking
 from .markdown_generation_strategy import DefaultMarkdownGenerator
-from .content_filter_strategy import PruningContentFilter, BM25ContentFilter
+from .content_filter_strategy import PruningContentFilter, BM25ContentFilter, LLMContentFilter
 from .models import CrawlResult, MarkdownGenerationResult
 from .async_dispatcher import (
     MemoryAdaptiveDispatcher,
@@ -39,11 +40,13 @@ __all__ = [
     "LLMExtractionStrategy",
     "CosineStrategy",
     "JsonCssExtractionStrategy",
+    "JsonXPathExtractionStrategy",
     "ChunkingStrategy",
     "RegexChunking",
     "DefaultMarkdownGenerator",
     "PruningContentFilter",
     "BM25ContentFilter",
+    "LLMContentFilter",
     "BaseDispatcher",
     "MemoryAdaptiveDispatcher",
     "SemaphoreDispatcher",

@@ -1,2 +1,2 @@
 # crawl4ai/_version.py
-__version__ = "0.4.248"
+__version__ = "0.4.3b1"

@@ -12,6 +12,7 @@ from crawl4ai import (
     CrawlerMonitor,
     DisplayMode,
     CacheMode,
+    LXMLWebScrapingStrategy,
 )

@@ -113,7 +114,7 @@ def create_performance_table(results):
 async def main():
     urls = [f"https://example.com/page{i}" for i in range(1, 20)]
     browser_config = BrowserConfig(headless=True, verbose=False)
-    run_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
+    run_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS, scraping_strategy=LXMLWebScrapingStrategy())

     results = {
         "Memory Adaptive": await memory_adaptive(urls, browser_config, run_config),
87
docs/examples/llm_markdown_generator.py
Normal file
@@ -0,0 +1,87 @@
import os
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from crawl4ai.content_filter_strategy import LLMContentFilter

async def test_llm_filter():
    # Create an HTML source that needs intelligent filtering
    url = "https://docs.python.org/3/tutorial/classes.html"

    browser_config = BrowserConfig(
        headless=True,
        verbose=True
    )

    # run_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
    run_config = CrawlerRunConfig(cache_mode=CacheMode.ENABLED)

    async with AsyncWebCrawler(config=browser_config) as crawler:
        # First get the raw HTML
        result = await crawler.arun(url, config=run_config)
        html = result.cleaned_html

        # Initialize LLM filter with focused instruction
        filter = LLMContentFilter(
            provider="openai/gpt-4o",
            api_token=os.getenv('OPENAI_API_KEY'),
            instruction="""
            Focus on extracting the core educational content about Python classes.
            Include:
            - Key concepts and their explanations
            - Important code examples
            - Essential technical details
            Exclude:
            - Navigation elements
            - Sidebars
            - Footer content
            - Version information
            - Any non-essential UI elements

            Format the output as clean markdown with proper code blocks and headers.
            """,
            verbose=True
        )

        # This second definition overrides the first: same provider, but with a
        # larger chunk budget and an instruction that preserves original wording
        filter = LLMContentFilter(
            provider="openai/gpt-4o",
            api_token=os.getenv('OPENAI_API_KEY'),
            chunk_token_threshold=2 ** 12 * 2,  # 4096 * 2 = 8192 tokens per chunk
            instruction="""
            Extract the main educational content while preserving its original wording and substance completely. Your task is to:

            1. Maintain the exact language and terminology used in the main content
            2. Keep all technical explanations, examples, and educational content intact
            3. Preserve the original flow and structure of the core content
            4. Remove only clearly irrelevant elements like:
               - Navigation menus
               - Advertisement sections
               - Cookie notices
               - Footers with site information
               - Sidebars with external links
               - Any UI elements that don't contribute to learning

            The goal is to create a clean markdown version that reads exactly like the original article,
            keeping all valuable content but free from distracting elements. Imagine you're creating
            a perfect reading experience where nothing valuable is lost, but all noise is removed.
            """,
            verbose=True
        )

        # Apply filtering
        filtered_content = filter.filter_content(html, ignore_cache=True)

        # Show results
        print("\nFiltered Content Length:", len(filtered_content))
        print("\nFirst 500 chars of filtered content:")
        if filtered_content:
            print(filtered_content[0][:500])

        # Save the markdown version to disk
        with open("filtered_content.md", "w", encoding="utf-8") as f:
            f.write("\n".join(filtered_content))

        # Show token usage
        filter.show_usage()

if __name__ == "__main__":
    asyncio.run(test_llm_filter())
135
docs/examples/scraping_strategies_performance.py
Normal file
@@ -0,0 +1,135 @@
import time
import functools
from collections import defaultdict

from crawl4ai.content_scraping_strategy import WebScrapingStrategy, LXMLWebScrapingStrategy

class TimingStats:
    def __init__(self):
        self.stats = defaultdict(lambda: defaultdict(lambda: {"calls": 0, "total_time": 0}))

    def add(self, strategy_name, func_name, elapsed):
        self.stats[strategy_name][func_name]["calls"] += 1
        self.stats[strategy_name][func_name]["total_time"] += elapsed

    def report(self):
        for strategy_name, funcs in self.stats.items():
            print(f"\n{strategy_name} Timing Breakdown:")
            print("-" * 60)
            print(f"{'Function':<30} {'Calls':<10} {'Total(s)':<10} {'Avg(ms)':<10}")
            print("-" * 60)

            for func, data in sorted(funcs.items(), key=lambda x: x[1]["total_time"], reverse=True):
                avg_ms = (data["total_time"] / data["calls"]) * 1000
                print(f"{func:<30} {data['calls']:<10} {data['total_time']:<10.3f} {avg_ms:<10.2f}")

timing_stats = TimingStats()

# Timing decorator that records per-strategy call counts and durations
def timing_decorator(strategy_name):
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.time()
            result = func(*args, **kwargs)
            elapsed = time.time() - start
            timing_stats.add(strategy_name, func.__name__, elapsed)
            return result
        return wrapper
    return decorator

# Patch a method on a strategy class with the timing decorator
def apply_decorators(cls, method_name, strategy_name):
    try:
        original_method = getattr(cls, method_name)
        decorated_method = timing_decorator(strategy_name)(original_method)
        setattr(cls, method_name, decorated_method)
    except AttributeError:
        print(f"Method {method_name} not found in class {cls.__name__}.")

# Apply to key methods
methods_to_profile = [
    '_scrap',
    # 'process_element',
    '_process_element',
    'process_image',
]

# Apply decorators to both strategies
for strategy, name in [(WebScrapingStrategy, "Original"), (LXMLWebScrapingStrategy, "LXML")]:
    for method in methods_to_profile:
        apply_decorators(strategy, method, name)

def generate_large_html(n_elements=1000):
    html = ['<!DOCTYPE html><html><head></head><body>']
    for i in range(n_elements):
        html.append(f'''
        <div class="article">
            <h2>Heading {i}</h2>
            <div>
                <div>
                    <p>This is paragraph {i} with some content and a <a href="http://example.com/{i}">link</a></p>
                </div>
            </div>
            <img src="image{i}.jpg" alt="Image {i}">
            <ul>
                <li>List item {i}.1</li>
                <li>List item {i}.2</li>
            </ul>
        </div>
        ''')
    html.append('</body></html>')
    return ''.join(html)

def test_scraping():
    # Initialize both scrapers
    original_scraper = WebScrapingStrategy()
    selected_scraper = LXMLWebScrapingStrategy()

    # Generate test HTML
    print("Generating HTML...")
    html = generate_large_html(5000)
    print(f"HTML Size: {len(html)/1024:.2f} KB")

    # Time the scraping
    print("\nStarting scrape...")
    start_time = time.time()

    kwargs = {
        "url": "http://example.com",
        "html": html,
        "word_count_threshold": 5,
        "keep_data_attributes": True
    }

    t1 = time.perf_counter()
    result_selected = selected_scraper.scrap(**kwargs)
    t2 = time.perf_counter()

    result_original = original_scraper.scrap(**kwargs)
    t3 = time.perf_counter()

    elapsed = t3 - start_time
    print(f"\nScraping completed in {elapsed:.2f} seconds")

    timing_stats.report()

    # Print stats of LXML output
    print("\nLXML Output:")
    print(f"\nExtracted links: {len(result_selected['links']['internal']) + len(result_selected['links']['external'])}")
    print(f"Extracted images: {len(result_selected['media']['images'])}")
    print(f"Clean HTML size: {len(result_selected['cleaned_html'])/1024:.2f} KB")
    print(f"Scraping time: {t2 - t1:.2f} seconds")

    # Print stats of original output
    print("\nOriginal Output:")
    print(f"\nExtracted links: {len(result_original['links']['internal']) + len(result_original['links']['external'])}")
    print(f"Extracted images: {len(result_original['media']['images'])}")
    print(f"Clean HTML size: {len(result_original['cleaned_html'])/1024:.2f} KB")
    print(f"Scraping time: {t3 - t2:.2f} seconds")

if __name__ == "__main__":
    test_scraping()
252
docs/examples/v0_4_3_features_demo.py
Normal file
@@ -0,0 +1,252 @@
"""
Crawl4ai v0.4.3 Features Demo
=============================

This example demonstrates the major new features introduced in Crawl4ai v0.4.3.
Each section showcases a specific feature with practical examples and explanations.
"""

import asyncio
import json
import os
from crawl4ai import *


async def demo_memory_dispatcher():
    """
    1. Memory Dispatcher System Demo
    ================================
    Shows how to use the new memory dispatcher with monitoring.
    """
    print("\n=== 1. Memory Dispatcher System Demo ===")

    # Configure crawler
    browser_config = BrowserConfig(headless=True, verbose=True)
    crawler_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS, markdown_generator=DefaultMarkdownGenerator()
    )

    # Test URLs
    urls = ["http://example.com", "http://example.org", "http://example.net"] * 3

    async with AsyncWebCrawler(config=browser_config) as crawler:
        # Initialize dispatcher with monitoring
        monitor = CrawlerMonitor(
            max_visible_rows=10,
            display_mode=DisplayMode.DETAILED,  # Can be DETAILED or AGGREGATED
        )

        dispatcher = MemoryAdaptiveDispatcher(
            memory_threshold_percent=80.0,  # Memory usage threshold
            check_interval=0.5,  # How often to check memory
            max_session_permit=5,  # Max concurrent crawls
            monitor=monitor,  # Pass the monitor
        )

        # Run with memory monitoring
        print("Starting batch crawl with memory monitoring...")
        results = await dispatcher.run_urls(
            urls=urls,
            crawler=crawler,
            config=crawler_config,
        )
        print(f"Completed {len(results)} URLs")


async def demo_streaming_support():
    """
    2. Streaming Support Demo
    =========================
    Shows how to process URLs as they complete using streaming.
    """
    print("\n=== 2. Streaming Support Demo ===")

    browser_config = BrowserConfig(headless=True, verbose=True)
    crawler_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS, stream=True)

    # Test URLs
    urls = ["http://example.com", "http://example.org", "http://example.net"] * 2

    async with AsyncWebCrawler(config=browser_config) as crawler:
        # Initialize dispatcher for streaming
        dispatcher = MemoryAdaptiveDispatcher(max_session_permit=3, check_interval=0.5)

        print("Starting streaming crawl...")
        async for result in dispatcher.run_urls_stream(
            urls=urls, crawler=crawler, config=crawler_config
        ):
            # Process each result as it arrives
            print(
                f"Received result for {result.url} - Success: {result.result.success}"
            )
            if result.result.success:
                print(f"Content length: {len(result.result.markdown)}")


async def demo_content_scraping():
    """
    3. Content Scraping Strategy Demo
    =================================
    Demonstrates the new LXMLWebScrapingStrategy for faster content scraping.
    """
    print("\n=== 3. Content Scraping Strategy Demo ===")

    crawler = AsyncWebCrawler()
    url = "https://example.com/article"

    # Configure with the new LXML strategy
    config = CrawlerRunConfig(scraping_strategy=LXMLWebScrapingStrategy(), verbose=True)

    print("Scraping content with LXML strategy...")
    async with crawler:
        result = await crawler.arun(url, config=config)
        if result.success:
            print("Successfully scraped content using LXML strategy")


async def demo_llm_markdown():
    """
    4. LLM-Powered Markdown Generation Demo
    =======================================
    Shows how to use the new LLM-powered content filtering and markdown generation.
    """
    print("\n=== 4. LLM-Powered Markdown Generation Demo ===")

    crawler = AsyncWebCrawler()
    url = "https://docs.python.org/3/tutorial/classes.html"

    content_filter = LLMContentFilter(
        provider="openai/gpt-4o",
        api_token=os.getenv("OPENAI_API_KEY"),
        instruction="""
        Focus on extracting the core educational content about Python classes.
        Include:
        - Key concepts and their explanations
        - Important code examples
        - Essential technical details
        Exclude:
        - Navigation elements
        - Sidebars
        - Footer content
        - Version information
        - Any non-essential UI elements

        Format the output as clean markdown with proper code blocks and headers.
        """,
        verbose=True,
    )

    # Configure LLM-powered markdown generation
    config = CrawlerRunConfig(
        markdown_generator=DefaultMarkdownGenerator(
            content_filter=content_filter
        ),
        cache_mode=CacheMode.BYPASS,
        verbose=True,
    )

    print("Generating focused markdown with LLM...")
    async with crawler:
        result = await crawler.arun(url, config=config)
        if result.success and result.markdown_v2:
            print("Successfully generated LLM-filtered markdown")
            print("First 500 chars of filtered content:")
            print(result.markdown_v2.fit_markdown[:500])


async def demo_robots_compliance():
    """
    5. Robots.txt Compliance Demo
    =============================
    Demonstrates the new robots.txt compliance feature with SQLite caching.
    """
    print("\n=== 5. Robots.txt Compliance Demo ===")

    crawler = AsyncWebCrawler()
    urls = ["https://example.com", "https://facebook.com", "https://twitter.com"]

    # Enable robots.txt checking
    config = CrawlerRunConfig(check_robots_txt=True, verbose=True)

    print("Crawling with robots.txt compliance...")
    async with crawler:
        results = await crawler.arun_many(urls, config=config)
        for result in results:
            if result.status_code == 403:
                print(f"Access blocked by robots.txt: {result.url}")
            elif result.success:
                print(f"Successfully crawled: {result.url}")


async def demo_llm_schema_generation():
    """
    6. LLM-Powered Schema Generation Demo
    =====================================
    Demonstrates automatic CSS and XPath schema generation using LLM models.
    """
    print("\n=== 6. LLM-Powered Schema Generation Demo ===")

    # Example HTML content for a job listing
    html_content = """
    <div class="job-listing">
        <h1 class="job-title">Senior Software Engineer</h1>
        <div class="job-details">
            <span class="location">San Francisco, CA</span>
            <span class="salary">$150,000 - $200,000</span>
            <div class="requirements">
                <h2>Requirements</h2>
                <ul>
                    <li>5+ years Python experience</li>
                    <li>Strong background in web crawling</li>
                </ul>
            </div>
        </div>
    </div>
    """

    print("Generating CSS selectors schema...")
    # Generate CSS selectors with a specific query
    css_schema = JsonCssExtractionStrategy.generate_schema(
        html_content,
        schema_type="CSS",
        query="Extract job title, location, and salary information",
        provider="openai/gpt-4o",  # or use other providers like "ollama"
    )
    print("\nGenerated CSS Schema:")
    print(css_schema)

    # Example of using the generated schema with the crawler
    crawler = AsyncWebCrawler()
    url = "https://example.com/job-listing"

    # Create an extraction strategy with the generated schema
    extraction_strategy = JsonCssExtractionStrategy(schema=css_schema)

    config = CrawlerRunConfig(extraction_strategy=extraction_strategy, verbose=True)

    print("\nTesting generated schema with crawler...")
    async with crawler:
        result = await crawler.arun(url, config=config)
        if result.success:
            print(json.dumps(result.extracted_content, indent=2) if result.extracted_content else None)
            print("Successfully used generated schema for crawling")


async def main():
    """Run all feature demonstrations."""
    await demo_memory_dispatcher()
    print("\n" + "=" * 50 + "\n")
    await demo_streaming_support()
    print("\n" + "=" * 50 + "\n")
    await demo_content_scraping()
    print("\n" + "=" * 50 + "\n")
    await demo_llm_schema_generation()
    print("\n" + "=" * 50 + "\n")
    await demo_llm_markdown()
    print("\n" + "=" * 50 + "\n")
    await demo_robots_compliance()
    print("\n" + "=" * 50 + "\n")


if __name__ == "__main__":
    asyncio.run(main())
@@ -1,264 +0,0 @@
# Optimized Multi-URL Crawling

> **Note**: We’re developing a new **executor module** that uses a sophisticated algorithm to dynamically manage multi-URL crawling, optimizing for speed and memory usage. The approaches in this document remain fully valid, but keep an eye on **Crawl4AI**’s upcoming releases for this powerful feature! Follow [@unclecode](https://twitter.com/unclecode) on X and check the changelogs to stay updated.

Crawl4AI’s **AsyncWebCrawler** can handle multiple URLs in a single run, which can greatly reduce overhead and speed up crawling. This guide shows how to:

1. **Sequentially** crawl a list of URLs using the **same** session, avoiding repeated browser creation.
2. **Parallel**-crawl subsets of URLs in batches, again reusing the same browser.

When the entire process finishes, you close the browser once—**minimizing** memory and resource usage.

---
## 1. Why Avoid Simple Loops per URL?

If you naively do:

```python
for url in urls:
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url)
```

You end up:

1. Spinning up a **new** browser for each URL
2. Closing it immediately after the single crawl
3. Potentially using a lot of CPU/memory for short-lived browsers
4. Missing out on session reusability if you have login or ongoing states

**Better** approaches ensure you **create** the browser once, then crawl multiple URLs with minimal overhead.

---
## 2. Sequential Crawling with Session Reuse

### 2.1 Overview

1. **One** `AsyncWebCrawler` instance for **all** URLs.
2. **One** session (via `session_id`) so we can preserve local storage or cookies across URLs if needed.
3. The crawler is only closed at the **end**.

**This** is the simplest pattern if your workload is moderate (dozens to a few hundred URLs).

### 2.2 Example Code
```python
import asyncio
from typing import List
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

async def crawl_sequential(urls: List[str]):
    print("\n=== Sequential Crawling with Session Reuse ===")

    browser_config = BrowserConfig(
        headless=True,
        # For better performance in Docker or low-memory environments:
        extra_args=["--disable-gpu", "--disable-dev-shm-usage", "--no-sandbox"],
    )

    crawl_config = CrawlerRunConfig(
        markdown_generator=DefaultMarkdownGenerator()
    )

    # Create the crawler (opens the browser)
    crawler = AsyncWebCrawler(config=browser_config)
    await crawler.start()

    try:
        session_id = "session1"  # Reuse the same session across all URLs
        for url in urls:
            result = await crawler.arun(
                url=url,
                config=crawl_config,
                session_id=session_id
            )
            if result.success:
                print(f"Successfully crawled: {url}")
                # E.g. check markdown length
                print(f"Markdown length: {len(result.markdown_v2.raw_markdown)}")
            else:
                print(f"Failed: {url} - Error: {result.error_message}")
    finally:
        # After all URLs are done, close the crawler (and the browser)
        await crawler.close()

async def main():
    urls = [
        "https://example.com/page1",
        "https://example.com/page2",
        "https://example.com/page3"
    ]
    await crawl_sequential(urls)

if __name__ == "__main__":
    asyncio.run(main())
```
**Why It’s Good**:
|
||||
|
||||
- **One** browser launch.
|
||||
- Minimal memory usage.
|
||||
- If the site requires login, you can log in once in `session_id` context and preserve auth across all URLs.
|
||||
|
||||
---
## 3. Parallel Crawling with Browser Reuse

### 3.1 Overview

To speed up crawling further, you can crawl multiple URLs in **parallel** (in batches or with a concurrency limit). The crawler still uses **one** browser but spawns a separate session (or the same one, depending on your logic) for each task.

### 3.2 Example Code

For this example, make sure to install the [psutil](https://pypi.org/project/psutil/) package:

```bash
pip install psutil
```

Then you can run the following code:
```python
import os
import sys
import psutil
import asyncio

__location__ = os.path.dirname(os.path.abspath(__file__))
__output__ = os.path.join(__location__, "output")

# Append parent directory to system path
parent_dir = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
sys.path.append(parent_dir)

from typing import List
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode

async def crawl_parallel(urls: List[str], max_concurrent: int = 3):
    print("\n=== Parallel Crawling with Browser Reuse + Memory Check ===")

    # We'll keep track of peak memory usage across all tasks
    peak_memory = 0
    process = psutil.Process(os.getpid())

    def log_memory(prefix: str = ""):
        nonlocal peak_memory
        current_mem = process.memory_info().rss  # in bytes
        if current_mem > peak_memory:
            peak_memory = current_mem
        print(f"{prefix} Current Memory: {current_mem // (1024 * 1024)} MB, Peak: {peak_memory // (1024 * 1024)} MB")

    # Minimal browser config
    browser_config = BrowserConfig(
        headless=True,
        verbose=False,
        extra_args=["--disable-gpu", "--disable-dev-shm-usage", "--no-sandbox"],
    )
    crawl_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)

    # Create the crawler instance
    crawler = AsyncWebCrawler(config=browser_config)
    await crawler.start()

    try:
        # We'll chunk the URLs in batches of 'max_concurrent'
        success_count = 0
        fail_count = 0
        for i in range(0, len(urls), max_concurrent):
            batch = urls[i : i + max_concurrent]
            tasks = []

            for j, url in enumerate(batch):
                # Unique session_id per concurrent sub-task
                session_id = f"parallel_session_{i + j}"
                task = crawler.arun(url=url, config=crawl_config, session_id=session_id)
                tasks.append(task)

            # Check memory usage prior to launching tasks
            log_memory(prefix=f"Before batch {i//max_concurrent + 1}: ")

            # Gather results
            results = await asyncio.gather(*tasks, return_exceptions=True)

            # Check memory usage after tasks complete
            log_memory(prefix=f"After batch {i//max_concurrent + 1}: ")

            # Evaluate results
            for url, result in zip(batch, results):
                if isinstance(result, Exception):
                    print(f"Error crawling {url}: {result}")
                    fail_count += 1
                elif result.success:
                    success_count += 1
                else:
                    fail_count += 1

        print("\nSummary:")
        print(f"  - Successfully crawled: {success_count}")
        print(f"  - Failed: {fail_count}")

    finally:
        print("\nClosing crawler...")
        await crawler.close()
        # Final memory log
        log_memory(prefix="Final: ")
        print(f"\nPeak memory usage (MB): {peak_memory // (1024 * 1024)}")

async def main():
    urls = [
        "https://example.com/page1",
        "https://example.com/page2",
        "https://example.com/page3",
        "https://example.com/page4"
    ]
    await crawl_parallel(urls, max_concurrent=2)

if __name__ == "__main__":
    asyncio.run(main())
```
**Notes**:

- We **reuse** the same `AsyncWebCrawler` instance for all parallel tasks, launching **one** browser.
- Each parallel sub-task can get its own `session_id` so tasks don’t share cookies/localStorage (unless that’s desired).
- We limit concurrency to `max_concurrent=2` or 3 to avoid saturating CPU/memory.
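The batching loop above slices the URL list into windows of `max_concurrent`. That chunking step on its own can be sketched as a small helper (the helper name is ours, not part of crawl4ai):

```python
from typing import Iterator, List

def chunked(items: List[str], size: int) -> Iterator[List[str]]:
    # Yield successive slices of at most `size` items, mirroring the
    # `urls[i : i + max_concurrent]` expression in the example above
    for i in range(0, len(items), size):
        yield items[i : i + size]

urls = [f"https://example.com/page{n}" for n in range(1, 8)]
batches = list(chunked(urls, 3))
print([len(b) for b in batches])  # [3, 3, 1]
```

The last batch is simply shorter when the URL count is not a multiple of the batch size; no padding is needed.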
---

## 4. Performance Tips

1. **Extra Browser Args**
   - `--disable-gpu` and `--no-sandbox` can help in Docker or restricted environments.
   - `--disable-dev-shm-usage` avoids using `/dev/shm`, which can be small on some systems.

2. **Session Reuse**
   - If your site requires a login or you want to maintain local data across URLs, share the **same** `session_id`.
   - If you want isolation (each URL fresh), create unique sessions.

3. **Batching**
   - If you have **many** URLs (thousands, say), you can do parallel crawling in chunks (e.g. `max_concurrent=5`).
   - Use `arun_many()` for a built-in approach if you prefer, but the example above is often more flexible.

4. **Cache**
   - If your pages share many resources or you’re re-crawling the same domain repeatedly, consider setting `cache_mode=CacheMode.ENABLED` in `CrawlerRunConfig`.
   - If you need fresh data each time, keep `cache_mode=CacheMode.BYPASS`.

5. **Hooks**
   - You can set up global hooks for each crawler (for example, to block images) or per-run if you want.
   - Keep them consistent if you’re reusing sessions.

---
## 5. Summary

- **One** `AsyncWebCrawler` plus multiple calls to `.arun()` is far more efficient than launching a new crawler per URL.
- The **sequential** approach with a shared session is simple and memory-friendly for moderate sets of URLs.
- The **parallel** approach can speed up large crawls through concurrency, but keep concurrency balanced to avoid overhead.
- Close the crawler once at the end, ensuring the browser is only opened and closed once.

For even more advanced memory optimizations or dynamic concurrency patterns, see future sections on hooking or distributed crawling. The patterns above suffice for the majority of multi-URL scenarios, **giving you speed, simplicity, and minimal resource usage**. Enjoy your optimized crawling!
`MemoryAdaptiveDispatcher` automatically manages concurrency based on system memory usage:

```python
dispatcher = MemoryAdaptiveDispatcher(
    memory_threshold_percent=90.0,  # Pause if memory exceeds this
    check_interval=1.0,             # How often to check memory
    max_session_permit=10,          # Maximum concurrent tasks
    rate_limiter=RateLimiter(       # Optional rate limiting
        base_delay=(0.5, 1.0),
        max_delay=10.0
    )
)
```

`SemaphoreDispatcher` provides simple concurrency control with a fixed limit:

```python
dispatcher = SemaphoreDispatcher(
    max_session_permit=5,           # Fixed concurrent tasks
    rate_limiter=RateLimiter(       # Optional rate limiting
        base_delay=(0.5, 1.0),
        max_delay=10.0
    )
)
```
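The memory-adaptive behavior boils down to a simple rule: stop launching new tasks while system memory usage sits at or above `memory_threshold_percent` (90% by default). A toy sketch of that decision, with a hypothetical helper name and a supplied reading instead of a real psutil sample:

```python
def should_pause_dispatch(memory_used_percent: float,
                          threshold_percent: float = 90.0) -> bool:
    """Return True when new crawl tasks should wait, i.e. when the
    sampled system memory usage has reached the dispatcher's threshold."""
    return memory_used_percent >= threshold_percent

print(should_pause_dispatch(85.0))  # False: keep dispatching
print(should_pause_dispatch(93.5))  # True: hold new tasks until memory drops
```

The real dispatcher re-samples memory every `check_interval` seconds and resumes dispatching once usage falls back below the threshold.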
# Crawl4AI 0.4.3b1 is Here: Faster, Smarter, and Ready for Real-World Crawling!

Hey, Crawl4AI enthusiasts! We're thrilled to announce the release of **Crawl4AI 0.4.3b1**, packed with powerful new features and enhancements that take web crawling to a whole new level of efficiency and intelligence. This release is all about giving you more control, better performance, and deeper insights into your crawled data.

Let's dive into what's new!

## 🚀 Major Feature Highlights

### 1. LLM-Powered Schema Generation: Zero to Structured Data in Seconds!

Tired of manually crafting CSS or XPath selectors? We've got you covered! Crawl4AI now features a revolutionary **schema generator** that uses the power of Large Language Models (LLMs) to automatically create extraction schemas for you.

**How it Works:**

1. **Provide HTML**: Feed in a sample HTML snippet that contains the type of data you want to extract (e.g., product listings, article sections).
2. **Describe Your Needs (Optional)**: Provide a natural-language query like "extract all product names and prices" to guide the schema creation.
3. **Choose Your LLM**: Use either **OpenAI** (GPT-4o recommended) for top-tier accuracy or **Ollama** for a local, open-source option.
4. **Get Your Schema**: The tool outputs a ready-to-use JSON schema that works seamlessly with `JsonCssExtractionStrategy` or `JsonXPathExtractionStrategy`.

**Why You'll Love It:**

- **No More Tedious Selector Writing**: Let the LLM analyze the HTML and create the selectors for you!
- **One-Time Cost**: Schema generation uses an LLM, but once you have your schema, subsequent extractions are fast and LLM-free.
- **Handles Complex Structures**: The LLM can understand nested elements, lists, and variations in layout, going well beyond what simple hand-written selectors usually cover.
- **Learn by Example**: The generated schemas are a fantastic way to learn best practices for writing your own schemas.

**Example:**
```python
import json

from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

# Sample HTML snippet (imagine this is part of a product listing page)
html = """
<div class="product">
  <h2 class="name">Awesome Gadget</h2>
  <span class="price">$99.99</span>
</div>
"""

# Generate schema using OpenAI
schema = JsonCssExtractionStrategy.generate_schema(
    html,
    llm_provider="openai/gpt-4o",
    api_token="YOUR_API_TOKEN"
)

# Or use Ollama for a local, open-source option
# schema = JsonCssExtractionStrategy.generate_schema(
#     html,
#     llm_provider="ollama/llama3"
# )

print(json.dumps(schema, indent=2))
```
**Output (Schema):**

```json
{
  "name": null,
  "baseSelector": "div.product",
  "fields": [
    {
      "name": "name",
      "selector": "h2.name",
      "type": "text"
    },
    {
      "name": "price",
      "selector": "span.price",
      "type": "text"
    }
  ]
}
```

You can now **save** this schema and use it for all your extractions on pages with the same structure. No more LLM costs, just **fast, reliable** data extraction!
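Because the generated schema is plain JSON, caching it to disk is one way to make the LLM call a genuine one-time cost. A minimal sketch (the file name is arbitrary):

```python
import json
import tempfile
from pathlib import Path

schema = {
    "name": None,
    "baseSelector": "div.product",
    "fields": [
        {"name": "name", "selector": "h2.name", "type": "text"},
        {"name": "price", "selector": "span.price", "type": "text"},
    ],
}

# Persist the schema once, right after generation...
path = Path(tempfile.gettempdir()) / "product_schema.json"
path.write_text(json.dumps(schema, indent=2))

# ...and on later runs reload it instead of calling the LLM again
reloaded = json.loads(path.read_text())
print(reloaded == schema)  # True
```

The reloaded dict can be passed straight to `JsonCssExtractionStrategy` on every subsequent crawl of the same page structure.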
### 2. Robots.txt Compliance: Crawl Responsibly

Crawl4AI now respects website rules! With the new `check_robots_txt=True` option in `CrawlerRunConfig`, the crawler automatically fetches, parses, and obeys each site's `robots.txt` file.

**Key Features**:

- **Efficient Caching**: Stores parsed `robots.txt` files locally for 7 days to avoid re-fetching.
- **Automatic Integration**: Works seamlessly with both `arun()` and `arun_many()`.
- **Clear Status Codes**: Returns a 403 status code if a URL is disallowed.
- **Customizable**: Adjust the cache directory and TTL if needed.

**Example**:
```python
import asyncio

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode

async def main():
    config = CrawlerRunConfig(
        cache_mode=CacheMode.ENABLED,
        check_robots_txt=True
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com/private-page", config=config)
        if result.status_code == 403:
            print("Access denied by robots.txt")

if __name__ == "__main__":
    asyncio.run(main())
```
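For intuition about the allow/deny decision itself, Python's standard library can evaluate robots.txt rules directly. This sketch is independent of crawl4ai's own implementation; it just shows the kind of check that maps a disallowed URL to the 403 result above:

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

# A disallowed path: the crawler would report status 403
print(rp.can_fetch("*", "https://example.com/private/page"))  # False
# An allowed path: crawling proceeds normally
print(rp.can_fetch("*", "https://example.com/index.html"))    # True
```

`RobotFileParser` can also fetch the file itself via `set_url()` + `read()`; parsing the lines in memory keeps the example self-contained.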
### 3. Proxy Support in `CrawlerRunConfig`

Need more control over your proxy settings? You can now configure proxies directly within `CrawlerRunConfig` for each crawl:

```python
import asyncio

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    config = CrawlerRunConfig(
        proxy_config={
            "server": "http://your-proxy.com:8080",
            "username": "your_username",  # Optional
            "password": "your_password"   # Optional
        }
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com", config=config)

if __name__ == "__main__":
    asyncio.run(main())
```

This allows for dynamic proxy assignment per URL or even per request.
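One way to exploit per-crawl proxy settings is a simple round-robin over a proxy pool; each pairing below could become the `proxy_config` of a fresh `CrawlerRunConfig`. The proxy hosts are placeholders, not real servers:

```python
from itertools import cycle

# Hypothetical proxy pool; substitute your real servers
proxy_pool = cycle([
    {"server": "http://proxy-a.example:8080"},
    {"server": "http://proxy-b.example:8080"},
])

urls = ["https://example.com/1", "https://example.com/2", "https://example.com/3"]

# Pair each URL with the next proxy in rotation
assignments = [(url, next(proxy_pool)) for url in urls]
for url, proxy in assignments:
    print(url, "->", proxy["server"])
```

With two proxies and three URLs, the third URL wraps back around to the first proxy.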
### 4. LLM-Powered Markdown Filtering (Beta)

We're introducing an experimental **`LLMContentFilter`**! Used with the `DefaultMarkdownGenerator`, this filter can produce highly focused markdown output by using an LLM to analyze content relevance.

**How it Works:**

1. You provide an **instruction** (e.g., "extract only the key technical details").
2. The LLM analyzes each section of the page based on your instruction.
3. Only the most relevant content is included in the final `fit_markdown`.

**Example**:
```python
import asyncio

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.content_filter_strategy import LLMContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

async def main():
    llm_filter = LLMContentFilter(
        provider="openai/gpt-4o",
        api_token="YOUR_API_TOKEN",  # Or use "ollama/llama3" with no token
        instruction="Extract the core educational content about Python classes."
    )

    config = CrawlerRunConfig(
        markdown_generator=DefaultMarkdownGenerator(content_filter=llm_filter)
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            "https://docs.python.org/3/tutorial/classes.html",
            config=config
        )
        print(result.markdown_v2.fit_markdown)

if __name__ == "__main__":
    asyncio.run(main())
```
**Note**: This is a beta feature. We're actively working on improving its accuracy and performance.

### 5. Streamlined `arun_many()` with Dispatchers

We've simplified concurrent crawling! `arun_many()` now intelligently handles multiple URLs, returning either a **list** of results or an **async generator** for streaming.

**Basic Usage (Batch)**:

```python
results = await crawler.arun_many(
    urls=["https://site1.com", "https://site2.com"],
    config=CrawlerRunConfig()
)

for res in results:
    print(res.url, "crawled successfully:", res.success)
```

**Streaming Mode**:

```python
async for result in await crawler.arun_many(
    urls=["https://site1.com", "https://site2.com"],
    config=CrawlerRunConfig(stream=True)
):
    print("Just finished:", result.url)
    # Process each result immediately
```

**Advanced:** You can now customize how `arun_many()` handles concurrency by passing a **dispatcher**. See [Advanced Multi-URL Crawling](../advanced/multi-url-crawling.md) for details.
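The batch/streaming distinction mirrors plain asyncio: `gather` returns everything at once in input order, while `as_completed` yields results as each task finishes. A standalone sketch, using `asyncio.sleep` as a stand-in for a crawl:

```python
import asyncio

async def fake_crawl(url: str, seconds: float) -> str:
    # Stand-in for crawler.arun(): completes after `seconds`
    await asyncio.sleep(seconds)
    return url

async def main():
    # Batch: wait for all results, returned in input order
    batch = await asyncio.gather(fake_crawl("site1", 0.2), fake_crawl("site2", 0.05))

    # Streaming: handle each result as soon as it completes
    streamed = []
    for finished in asyncio.as_completed([fake_crawl("site1", 0.2),
                                          fake_crawl("site2", 0.05)]):
        streamed.append(await finished)
    return batch, streamed

batch, streamed = asyncio.run(main())
print(batch)     # ['site1', 'site2'] (input order)
print(streamed)  # ['site2', 'site1'] (completion order)
```

Streaming is the better fit when individual results feed further processing immediately and you don't want the slowest URL to delay everything.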
### 6. Enhanced Browser Context Management

We've improved how Crawl4AI manages browser contexts for better resource utilization and session handling.

- **`shared_data` in `CrawlerRunConfig`**: Pass data between hooks using the `shared_data` dictionary.
- **Context Reuse**: The crawler now intelligently reuses browser contexts based on configuration, reducing overhead.
### 7. Faster Scraping with `LXMLWebScrapingStrategy`

Introducing a new, optional **`LXMLWebScrapingStrategy`** that can be **10-20x faster** than the default BeautifulSoup approach for large, complex pages.

**How to Use**:

```python
from crawl4ai import LXMLWebScrapingStrategy

config = CrawlerRunConfig(
    scraping_strategy=LXMLWebScrapingStrategy()  # Add this line
)
```

**When to Use**:

- If profiling shows a bottleneck in `WebScrapingStrategy`.
- For very large HTML documents where parsing speed matters.

**Caveats**:

- It might not handle malformed HTML as gracefully as BeautifulSoup.
- We're still gathering data, so report any issues!
---

## Try the Feature Demo Script!

We've prepared a Python script demonstrating these new features. You can find it at:

[**`features_demo.py`**](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/0_4_3b1_feature_demo.py)

**To run the demo:**

1. Make sure you have Crawl4AI installed (`pip install crawl4ai`).
2. Copy the `features_demo.py` script to your local environment.
3. Set your OpenAI API key as an environment variable (if using OpenAI models):
   ```bash
   export OPENAI_API_KEY="your_api_key"
   ```
4. Run the script:
   ```bash
   python features_demo.py
   ```

The script will execute various crawl scenarios, showcasing the new features and printing results to your console.
## Conclusion

Crawl4AI 0.4.3b1 is a major step forward in flexibility, performance, and ease of use. With automatic schema generation, robots.txt handling, advanced content filtering, and streamlined multi-URL crawling, you can build powerful, efficient, and responsible web scrapers.

We encourage you to try out these new capabilities, explore the updated documentation, and share your feedback! Your input is invaluable as we continue to improve Crawl4AI.

**Stay Connected:**

- **Star** us on [GitHub](https://github.com/unclecode/crawl4ai) to show your support!
- **Follow** [@unclecode](https://twitter.com/unclecode) on Twitter for updates and tips.
- **Join** our community on Discord (link coming soon) to discuss your projects and get help.

Happy crawling!
The `LLMContentFilter` documentation example now recommends `openai/gpt-4o` as the provider:

```python
from crawl4ai.content_filter_strategy import LLMContentFilter

async def main():
    # Initialize LLM filter with specific instruction
    filter = LLMContentFilter(
        provider="openai/gpt-4o",    # or your preferred provider
        api_token="your-api-token",  # or use environment variable
        instruction="""
        Focus on extracting the core educational content.
        """
    )
```
In `pyproject.toml`, the release pins `httpx==0.27.2` in its dependencies and moves the development status classifier from Alpha to Beta:

```toml
classifiers = [
    "Development Status :: 4 - Beta",
    "Intended Audience :: Developers",
    "License :: OSI Approved :: Apache Software License",
    "Programming Language :: Python :: 3",
]
```