feat(extraction): add RegexExtractionStrategy for pattern-based extraction

Add new RegexExtractionStrategy for fast, zero-LLM extraction of common data types:

- Built-in patterns for emails, URLs, phones, dates, and more
- Support for custom regex patterns
- LLM-assisted pattern generation utility
- Optimized HTML preprocessing with fit_html field
- Enhanced network response body capture

Breaking changes: None
2025-05-02 21:15:24 +08:00
parent 94e9959fe0
commit 9b5ccac76e
13 changed files with 984 additions and 124 deletions
--- a/docs/md_v2/api/crawl-result.md
+++ b/docs/md_v2/api/crawl-result.md
@@ -10,6 +10,7 @@ class CrawlResult(BaseModel):
    html: str
    success: bool
    cleaned_html: Optional[str] = None
+    fit_html: Optional[str] = None  # Preprocessed HTML optimized for extraction
    media: Dict[str, List[Dict]] = {}
    links: Dict[str, List[Dict]] = {}
    downloaded_files: Optional[List[str]] = None
@@ -50,7 +51,7 @@ if not result.success:
 ```

 ### 1.3 **`status_code`** *(Optional[int])*  
-**What**: The page’s HTTP status code (e.g., 200, 404).  
+**What**: The page's HTTP status code (e.g., 200, 404).  
 **Usage**:
 ```python
 if result.status_code == 404:
@@ -82,7 +83,7 @@ if result.response_headers:
 ```

 ### 1.7 **`ssl_certificate`** *(Optional[SSLCertificate])*  
-**What**: If `fetch_ssl_certificate=True` in your CrawlerRunConfig, **`result.ssl_certificate`** contains a  [**`SSLCertificate`**](../advanced/ssl-certificate.md) object describing the site’s certificate. You can export the cert in multiple formats (PEM/DER/JSON) or access its properties like `issuer`, 
+**What**: If `fetch_ssl_certificate=True` in your CrawlerRunConfig, **`result.ssl_certificate`** contains a  [**`SSLCertificate`**](../advanced/ssl-certificate.md) object describing the site's certificate. You can export the cert in multiple formats (PEM/DER/JSON) or access its properties like `issuer`, 
 `subject`, `valid_from`, `valid_until`, etc. 
 **Usage**:
 ```python
@@ -109,14 +110,6 @@ print(len(result.html))
 print(result.cleaned_html[:500])  # Show a snippet
 ```

-### 2.3 **`fit_html`** *(Optional[str])*  
-**What**: If a **content filter** or heuristic (e.g., Pruning/BM25) modifies the HTML, the “fit” or post-filter version.  
-**When**: This is **only** present if your `markdown_generator` or `content_filter` produces it.  
-**Usage**:
-```python
-if result.markdown.fit_html:
-    print("High-value HTML content:", result.markdown.fit_html[:300])
-```

 ---

@@ -135,7 +128,7 @@ Crawl4AI can convert HTML→Markdown, optionally including:
 - **`raw_markdown`** *(str)*: The full HTML→Markdown conversion.  
 - **`markdown_with_citations`** *(str)*: Same markdown, but with link references as academic-style citations.  
 - **`references_markdown`** *(str)*: The reference list or footnotes at the end.  
-- **`fit_markdown`** *(Optional[str])*: If content filtering (Pruning/BM25) was applied, the filtered “fit” text.  
+- **`fit_markdown`** *(Optional[str])*: If content filtering (Pruning/BM25) was applied, the filtered "fit" text.  
 - **`fit_html`** *(Optional[str])*: The HTML that led to `fit_markdown`.

 **Usage**:
@@ -157,7 +150,7 @@ print(result.markdown.raw_markdown[:200])
 print(result.markdown.fit_markdown)
 print(result.markdown.fit_html)
 ```
-**Important**: “Fit” content (in `fit_markdown`/`fit_html`) exists in result.markdown, only if you used a **filter** (like **PruningContentFilter** or **BM25ContentFilter**) within a `MarkdownGenerationStrategy`.
+**Important**: "Fit" content (in `fit_markdown`/`fit_html`) exists in result.markdown, only if you used a **filter** (like **PruningContentFilter** or **BM25ContentFilter**) within a `MarkdownGenerationStrategy`.

 ---

@@ -169,7 +162,7 @@ print(result.markdown.fit_html)

 - `src` *(str)*: Media URL  
 - `alt` or `title` *(str)*: Descriptive text  
-- `score` *(float)*: Relevance score if the crawler’s heuristic found it “important”  
+- `score` *(float)*: Relevance score if the crawler's heuristic found it "important"  
 - `desc` or `description` *(Optional[str])*: Additional context extracted from surrounding text  

 **Usage**:
@@ -263,7 +256,7 @@ A `DispatchResult` object providing additional concurrency and resource usage in

 - **`task_id`**: A unique identifier for the parallel task.
 - **`memory_usage`** (float): The memory (in MB) used at the time of completion.
-- **`peak_memory`** (float): The peak memory usage (in MB) recorded during the task’s execution.
+- **`peak_memory`** (float): The peak memory usage (in MB) recorded during the task's execution.
 - **`start_time`** / **`end_time`** (datetime): Time range for this crawling task.
 - **`error_message`** (str): Any dispatcher- or concurrency-related error encountered.

@@ -358,7 +351,7 @@ async def handle_result(result: CrawlResult):
    # HTML
    print("Original HTML size:", len(result.html))
    print("Cleaned HTML size:", len(result.cleaned_html or ""))
-
+    
    # Markdown output
    if result.markdown:
        print("Raw Markdown:", result.markdown.raw_markdown[:300])
--- a/docs/md_v2/api/strategies.md
+++ b/docs/md_v2/api/strategies.md
@@ -36,6 +36,45 @@ LLMExtractionStrategy(
 )
 ```

+### RegexExtractionStrategy
+
+Used for fast pattern-based extraction of common entities using regular expressions.
+
+```python
+RegexExtractionStrategy(
+    # Pattern Configuration
+    pattern: IntFlag = RegexExtractionStrategy.Nothing,  # Bit flags of built-in patterns to use
+    custom: Optional[Dict[str, str]] = None,           # Custom pattern dictionary {label: regex}
+    
+    # Input Format
+    input_format: str = "fit_html",                    # "html", "markdown", "text" or "fit_html"
+)
+
+# Built-in Patterns as Bit Flags
+RegexExtractionStrategy.Email           # Email addresses
+RegexExtractionStrategy.PhoneIntl       # International phone numbers 
+RegexExtractionStrategy.PhoneUS         # US-format phone numbers
+RegexExtractionStrategy.Url             # HTTP/HTTPS URLs
+RegexExtractionStrategy.IPv4            # IPv4 addresses
+RegexExtractionStrategy.IPv6            # IPv6 addresses
+RegexExtractionStrategy.Uuid            # UUIDs
+RegexExtractionStrategy.Currency        # Currency values (USD, EUR, etc)
+RegexExtractionStrategy.Percentage      # Percentage values
+RegexExtractionStrategy.Number          # Numeric values
+RegexExtractionStrategy.DateIso         # ISO format dates
+RegexExtractionStrategy.DateUS          # US format dates
+RegexExtractionStrategy.Time24h         # 24-hour format times
+RegexExtractionStrategy.PostalUS        # US postal codes
+RegexExtractionStrategy.PostalUK        # UK postal codes
+RegexExtractionStrategy.HexColor        # HTML hex color codes
+RegexExtractionStrategy.TwitterHandle   # Twitter handles
+RegexExtractionStrategy.Hashtag         # Hashtags
+RegexExtractionStrategy.MacAddr         # MAC addresses
+RegexExtractionStrategy.Iban            # International bank account numbers
+RegexExtractionStrategy.CreditCard      # Credit card numbers
+RegexExtractionStrategy.All             # All available patterns
+```
+
 ### CosineStrategy

 Used for content similarity-based extraction and clustering.
@@ -156,6 +195,55 @@ result = await crawler.arun(
 data = json.loads(result.extracted_content)
 ```

+### Regex Extraction
+
+```python
+import json
+from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, RegexExtractionStrategy
+
+# Method 1: Use built-in patterns
+strategy = RegexExtractionStrategy(
+    pattern = RegexExtractionStrategy.Email | RegexExtractionStrategy.Url
+)
+
+# Method 2: Use custom patterns
+price_pattern = {"usd_price": r"\$\s?\d{1,3}(?:,\d{3})*(?:\.\d{2})?"}
+strategy = RegexExtractionStrategy(custom=price_pattern)
+
+# Method 3: Generate pattern with LLM assistance (one-time)
+from crawl4ai import LLMConfig
+
+async with AsyncWebCrawler() as crawler:
+    # Get sample HTML first
+    sample_result = await crawler.arun("https://example.com/products")
+    html = sample_result.fit_html
+    
+    # Generate regex pattern once
+    pattern = RegexExtractionStrategy.generate_pattern(
+        label="price",
+        html=html,
+        query="Product prices in USD format",
+        llm_config=LLMConfig(provider="openai/gpt-4o-mini")
+    )
+    
+    # Save pattern for reuse
+    import json
+    with open("price_pattern.json", "w") as f:
+        json.dump(pattern, f)
+    
+    # Use pattern for extraction (no LLM calls)
+    strategy = RegexExtractionStrategy(custom=pattern)
+    result = await crawler.arun(
+        url="https://example.com/products",
+        config=CrawlerRunConfig(extraction_strategy=strategy)
+    )
+    
+    # Process results
+    data = json.loads(result.extracted_content)
+    for item in data:
+        print(f"{item['label']}: {item['value']}")
+```
+
 ### CSS Extraction

 ```python
@@ -220,12 +308,28 @@ result = await crawler.arun(

 ## Best Practices

-1. **Choose the Right Strategy**
-   - Use `LLMExtractionStrategy` for complex, unstructured content
-   - Use `JsonCssExtractionStrategy` for well-structured HTML
+1. **Choose the Right Strategy**
+   - Use `RegexExtractionStrategy` for common data types like emails, phones, URLs, dates
+   - Use `JsonCssExtractionStrategy` for well-structured HTML with consistent patterns
+   - Use `LLMExtractionStrategy` for complex, unstructured content requiring reasoning
   - Use `CosineStrategy` for content similarity and clustering

-2. **Optimize Chunking**
+2. **Strategy Selection Guide**
+   ```
+   Is the target data a common type (email/phone/date/URL)? 
+   → RegexExtractionStrategy
+   
+   Does the page have consistent HTML structure?
+   → JsonCssExtractionStrategy or JsonXPathExtractionStrategy
+   
+   Is the data semantically complex or unstructured?
+   → LLMExtractionStrategy
+   
+   Need to find content similar to a specific topic?
+   → CosineStrategy
+   ```
+
+3. **Optimize Chunking**
   ```python
   # For long documents
   strategy = LLMExtractionStrategy(
@@ -234,7 +338,26 @@ result = await crawler.arun(
   )
   ```

-3. **Handle Errors**
+4. **Combine Strategies for Best Performance**
+   ```python
+   # First pass: Extract structure with CSS
+   css_strategy = JsonCssExtractionStrategy(product_schema)
+   css_result = await crawler.arun(url, config=CrawlerRunConfig(extraction_strategy=css_strategy))
+   product_data = json.loads(css_result.extracted_content)
+   
+   # Second pass: Extract specific fields with regex
+   descriptions = [product["description"] for product in product_data]
+   regex_strategy = RegexExtractionStrategy(
+       pattern=RegexExtractionStrategy.Email | RegexExtractionStrategy.PhoneUS,
+       custom={"dimension": r"\d+x\d+x\d+ (?:cm|in)"}
+   )
+   
+   # Process descriptions with regex
+   for text in descriptions:
+       matches = regex_strategy.extract("", text)  # Direct extraction
+   ```
+
+5. **Handle Errors**
   ```python
   try:
       result = await crawler.arun(
@@ -247,11 +370,31 @@ result = await crawler.arun(
       print(f"Extraction failed: {e}")
   ```

-4. **Monitor Performance**
+6. **Monitor Performance**
   ```python
   strategy = CosineStrategy(
       verbose=True,  # Enable logging
       word_count_threshold=20,  # Filter short content
       top_k=5  # Limit results
   )
+   ```
+
+7. **Cache Generated Patterns**
+   ```python
+   # For RegexExtractionStrategy pattern generation
+   import json
+   from pathlib import Path
+   
+   cache_dir = Path("./pattern_cache")
+   cache_dir.mkdir(exist_ok=True)
+   pattern_file = cache_dir / "product_pattern.json"
+   
+   if pattern_file.exists():
+       with open(pattern_file) as f:
+           pattern = json.load(f)
+   else:
+       # Generate once with LLM
+       pattern = RegexExtractionStrategy.generate_pattern(...)
+       with open(pattern_file, "w") as f:
+           json.dump(pattern, f)
   ```
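
Note on the bit-flag API documented in strategies.md above: the way `pattern` flags and `custom` labels combine into `{label, value}` results can be sketched in plain Python. This is a minimal illustration under stated assumptions, not crawl4ai's implementation. The `Pattern` enum, the `extract` helper, and the three simplified built-in regexes are hypothetical stand-ins; only the `usd_price` regex is copied from the diff.

```python
import re
from enum import IntFlag
from typing import Dict, List, Optional, Tuple


class Pattern(IntFlag):
    """Hypothetical stand-in for the built-in pattern flags."""
    NOTHING = 0
    EMAIL = 1
    URL = 2
    HEX_COLOR = 4
    ALL = EMAIL | URL | HEX_COLOR


# Simplified regexes chosen for this sketch; the library's own patterns may differ.
_BUILTIN: Dict[Pattern, Tuple[str, str]] = {
    Pattern.EMAIL: ("email", r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    Pattern.URL: ("url", r"https?://[^\s\"'<>]+"),
    Pattern.HEX_COLOR: ("hex_color", r"#[0-9A-Fa-f]{6}\b"),
}


def extract(
    text: str,
    pattern: Pattern = Pattern.NOTHING,
    custom: Optional[Dict[str, str]] = None,
) -> List[Dict[str, str]]:
    """Run every selected built-in and custom regex over `text`."""
    # Keep only the built-ins whose bit is set in `pattern`.
    active = {label: rx for flag, (label, rx) in _BUILTIN.items() if pattern & flag}
    # Custom patterns are added last; a custom label overrides a built-in one.
    active.update(custom or {})

    results: List[Dict[str, str]] = []
    for label, rx in active.items():
        for match in re.finditer(rx, text):
            results.append({"label": label, "value": match.group(0)})
    return results


sample = (
    "<p>Contact sales@example.com or see https://example.com/pricing "
    "for the $1,299.00 plan.</p>"
)
matches = extract(
    sample,
    pattern=Pattern.EMAIL | Pattern.URL,
    custom={"usd_price": r"\$\s?\d{1,3}(?:,\d{3})*(?:\.\d{2})?"},
)
for item in matches:
    print(f"{item['label']}: {item['value']}")
```

Combining flags with `|` keeps the selection a single integer, so adding a built-in pattern only requires a new bit and a new table entry. Custom patterns go through the same loop, which is why the results share one `label`/`value` shape.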