crawl4ai/docs/llm.txt/8_content_selection.md
UncleCode d5ed451299 Enhance crawler capabilities and documentation
- Add llm.txt generator
  - Added SSL certificate extraction in AsyncWebCrawler.
  - Introduced new content filters and chunking strategies for more robust data extraction.
  - Updated documentation.
2024-12-25 21:34:31 +08:00


Content Selection in Crawl4AI

Crawl4AI offers flexible and powerful methods to precisely select and filter content from webpages. Whether you're extracting articles, filtering unwanted elements, or using LLMs for structured data extraction, this guide walks you through the essentials and advanced techniques.

Introduction & Quick Start

When crawling websites, you often need to isolate specific parts of a page—such as main article text, product listings, or metadata. Crawl4AI's content selection features help you fine-tune your crawls to grab exactly what you need, while filtering out unnecessary elements.

Quick Start Example: Here's a minimal example that extracts the main article content from a page:

from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import CrawlerRunConfig

async def quick_start():
    config = CrawlerRunConfig(css_selector=".main-article")
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://crawl4ai.com", config=config)
        print(result.markdown)  # markdown generated from the selected region

This snippet sets a simple CSS selector to focus on the main article area of a webpage. You can build from here, adding more advanced strategies as needed.


CSS Selectors

What are they?
CSS selectors let you target specific parts of a webpage's HTML. If you can identify a unique CSS selector (such as .main-article, article h1, or .product-listing > li), you can precisely control what parts of the page are extracted.

How to find selectors:

  1. Open the page in your browser.
  2. Use browser dev tools (e.g., Chrome DevTools: right-click → "Inspect") to locate the elements you want.
  3. Copy the CSS selector for that element.

Example:

from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import CrawlerRunConfig

async def extract_heading_and_content(url):
    config = CrawlerRunConfig(css_selector="article h1, article .content")
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=url, config=config)
        return result.markdown

Tip: If your extracted content is empty, verify that your CSS selectors match existing elements on the page. Using overly generic selectors can also lead to too much content being extracted.
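Before running a full crawl, it can help to sanity-check that the class you plan to target actually appears in the page HTML. Here is a minimal, stdlib-only sketch (the HTML snippet and class name are illustrative, not from a real page):

```python
from html.parser import HTMLParser

class ClassCounter(HTMLParser):
    """Count elements whose class attribute contains a target class name."""

    def __init__(self, target):
        super().__init__()
        self.target = target
        self.count = 0

    def handle_starttag(self, tag, attrs):
        classes = dict(attrs).get("class") or ""
        if self.target in classes.split():
            self.count += 1

html_snippet = '<div class="main-article"><h1>Title</h1></div><div class="sidebar"></div>'
counter = ClassCounter("main-article")
counter.feed(html_snippet)
print(counter.count)  # 1
```

If the count is zero against the page source you fetched, the selector will not match during the crawl either, and you should re-inspect the page in dev tools.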


Video and Audio Content

The library extracts video and audio elements with their metadata:

from crawl4ai.async_configs import CrawlerRunConfig

config = CrawlerRunConfig()
# Assumes an open crawler, e.g. `async with AsyncWebCrawler() as crawler:`
result = await crawler.arun(url="https://example.com", config=config)

# Process videos
for video in result.media["videos"]:
    print(f"Video source: {video['src']}")
    print(f"Type: {video['type']}")
    print(f"Duration: {video.get('duration')}")
    print(f"Thumbnail: {video.get('poster')}")

# Process audio
for audio in result.media["audios"]:
    print(f"Audio source: {audio['src']}")
    print(f"Type: {audio['type']}")
    print(f"Duration: {audio.get('duration')}")
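Scraped media entries sometimes lack a usable source URL. A small post-processing step can drop them before further handling; the sample data below only mimics the result.media shape shown above (the URLs and types are illustrative):

```python
# Sample data shaped like result.media; values are illustrative.
media = {
    "videos": [
        {"src": "https://cdn.example.com/intro.mp4", "type": "video/mp4"},
        {"src": "", "type": "video/mp4"},  # missing source, not usable
    ],
    "audios": [
        {"src": "https://cdn.example.com/talk.mp3", "type": "audio/mpeg"},
    ],
}

def usable_media(media):
    """Drop entries that lack a non-empty source URL."""
    return {kind: [m for m in items if m.get("src")] for kind, items in media.items()}

cleaned = usable_media(media)
print(len(cleaned["videos"]), len(cleaned["audios"]))  # 1 1
```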

Link Analysis

Crawl4AI provides sophisticated link analysis capabilities, helping you understand the relationships between pages and identify important navigation patterns.

The library automatically categorizes links into:

  • Internal links (same domain)
  • External links (different domains)
  • Social media links
  • Navigation links
  • Content links

from crawl4ai.async_configs import CrawlerRunConfig

config = CrawlerRunConfig()
# Assumes an open crawler, e.g. `async with AsyncWebCrawler() as crawler:`
result = await crawler.arun(url="https://example.com", config=config)

# Analyze internal links
for link in result.links["internal"]:
    print(f"Internal: {link['href']}")
    print(f"Link text: {link['text']}")
    print(f"Context: {link['context']}")  # Surrounding text
    print(f"Type: {link['type']}")  # nav, content, etc.

# Analyze external links
for link in result.links["external"]:
    print(f"External: {link['href']}")
    print(f"Domain: {link['domain']}")
    print(f"Type: {link['type']}")
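Once you have the external link records, a common next step is counting which domains a page links to most. A small sketch using only the standard library (the link entries are illustrative samples shaped like the records above):

```python
from collections import Counter
from urllib.parse import urlparse

# Sample entries shaped like the external link records above; values illustrative.
external_links = [
    {"href": "https://docs.python.org/3/", "text": "Python docs"},
    {"href": "https://github.com/unclecode/crawl4ai", "text": "Repository"},
    {"href": "https://docs.python.org/3/library/", "text": "Standard library"},
]

# Tally linked domains to spot the most-referenced external sites.
domain_counts = Counter(urlparse(link["href"]).netloc for link in external_links)
print(domain_counts.most_common(1))  # [('docs.python.org', 2)]
```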

Control which links are included in the results with CrawlerRunConfig:

config = CrawlerRunConfig(
    exclude_external_links=True,          # Remove external links
    exclude_social_media_links=True,      # Remove social media links
    exclude_social_media_domains=[        # Custom social media domains
        "facebook.com", "twitter.com", "instagram.com"
    ],
    exclude_domains=["ads.example.com"]   # Exclude specific domains
)
result = await crawler.arun(url="https://example.com", config=config)

Metadata Extraction

Crawl4AI automatically extracts and processes page metadata, providing valuable information about the content:

from crawl4ai.async_configs import CrawlerRunConfig

config = CrawlerRunConfig()
# Assumes an open crawler, e.g. `async with AsyncWebCrawler() as crawler:`
result = await crawler.arun(url="https://example.com", config=config)

metadata = result.metadata
print(f"Title: {metadata['title']}")
print(f"Description: {metadata['description']}")
print(f"Keywords: {metadata['keywords']}")
print(f"Author: {metadata['author']}")
print(f"Published Date: {metadata['published_date']}")
print(f"Modified Date: {metadata['modified_date']}")
print(f"Language: {metadata['language']}")
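Real pages frequently omit some of these fields, so indexing the metadata dict directly can raise KeyError. A defensive accessor is a safer pattern; the sample dict below just mimics the result.metadata shape (values are illustrative):

```python
# Sample dict shaped like result.metadata; real pages often omit fields.
metadata = {"title": "Example Page", "language": "en"}

def meta(metadata, key, default="(not provided)"):
    """Read a metadata field, falling back when it is missing or empty."""
    value = metadata.get(key)
    return value if value else default

print(meta(metadata, "title"))   # Example Page
print(meta(metadata, "author"))  # (not provided)
```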

Content Filtering

Crawl4AI provides content filtering parameters to exclude unwanted elements and ensure that you only get meaningful data. For instance, you can remove navigation bars, ads, or other non-essential parts of the page.

Key Parameters:

  • word_count_threshold: Minimum word count per extracted block. Helps skip short or irrelevant snippets.
  • excluded_tags: List of HTML tags to omit (e.g., ['form', 'header', 'footer', 'nav']).
  • exclude_external_links: Strips out links pointing to external domains.
  • exclude_social_media_links: Removes common social media links or widgets.
  • exclude_external_images: Filters out images hosted on external domains.

Example:

from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import CrawlerRunConfig

async def filtered_extraction(url):
    config = CrawlerRunConfig(
        word_count_threshold=10,
        excluded_tags=['form', 'header', 'footer', 'nav'],
        exclude_external_links=True,
        exclude_social_media_links=True,
        exclude_external_images=True
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=url, config=config)
        return result.markdown

Best Practice: Start with a minimal set of exclusions and increase them as needed. If you notice no content is extracted, try lowering word_count_threshold or removing certain excluded tags.


Handling Iframe Content

If a page embeds content in iframes (such as videos, maps, or third-party widgets), you may need to enable iframe processing. This ensures that Crawl4AI loads and extracts content displayed inside iframes.

How to enable:

  • Set process_iframes=True in your CrawlerRunConfig to process iframe content.
  • Use remove_overlay_elements=True to discard popups or modals that might block iframe content.

Example:

from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import CrawlerRunConfig

async def extract_iframe_content(url):
    config = CrawlerRunConfig(
        process_iframes=True,
        remove_overlay_elements=True
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=url, config=config)
        return result.markdown

Troubleshooting:

  • If iframe content doesn't load, ensure the iframe's origin is allowed and that you have no network-related issues. Check the logs or consider using a browser-based strategy that supports multi-domain requests.

Structured Content Selection Using LLMs

For more complex extraction tasks (e.g., summarizing content, extracting structured data like titles and key points), you can integrate LLMs. LLM-based extraction strategies let you define a schema and provide instructions to an LLM so it returns structured, JSON-formatted results.

When to use LLM-based strategies:

  • Extracting complex structures not easily captured by simple CSS selectors.
  • Summarizing or transforming data.
  • Handling varied, unpredictable page layouts.

Example with an LLMExtractionStrategy:

from crawl4ai.extraction_strategy import LLMExtractionStrategy
from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import CrawlerRunConfig
from pydantic import BaseModel
from typing import List
import json

class ArticleContent(BaseModel):
    title: str
    main_points: List[str]
    conclusion: str

async def extract_article_with_llm(url):
    strategy = LLMExtractionStrategy(
        provider="ollama/nemotron",
        schema=ArticleContent.schema(),
        instruction="Extract the main article title, key points, and conclusion"
    )

    config = CrawlerRunConfig(extraction_strategy=strategy)

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=url, config=config)
        article = json.loads(result.extracted_content)
        return article

Tips for LLM-based extraction:

  • Refine your prompt in instruction to guide the LLM towards the desired structure.
  • If results are incomplete or incorrect, consider adjusting the schema or adding more context to the instruction.
  • Check for errors and handle edge cases where the LLM might not find certain fields.
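The error handling in the last tip can be as simple as parsing the LLM's JSON and reporting which schema fields are absent. A minimal stdlib sketch (the field names mirror the ArticleContent model above; the sample payloads are illustrative):

```python
import json

# Field names mirror the ArticleContent model above.
REQUIRED_FIELDS = {"title", "main_points", "conclusion"}

def parse_article(raw):
    """Parse LLM output, returning (data, sorted list of missing field names)."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None, sorted(REQUIRED_FIELDS)
    return data, sorted(REQUIRED_FIELDS - data.keys())

data, missing = parse_article('{"title": "T", "main_points": ["a"], "conclusion": "c"}')
print(missing)  # []
```

A non-empty `missing` list is a signal to tighten the instruction prompt or retry the extraction.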

Pattern-Based Selection

When dealing with repetitive, structured patterns (like a list of articles or products), you can use JsonCssExtractionStrategy to define a JSON schema that maps selectors to specific fields.

Use Cases:

  • News article listings, product grids, directory entries.
  • Extract multiple items that follow a similar structure on the same page.

Example JSON Schema Extraction:

from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import CrawlerRunConfig
import json

schema = {
    "name": "News Articles",
    "baseSelector": "article.news-item",
    "fields": [
        {"name": "headline", "selector": "h2", "type": "text"},
        {"name": "summary", "selector": ".summary", "type": "text"},
        {"name": "category", "selector": ".category", "type": "text"},
        {
            "name": "metadata",
            "type": "nested",
            "fields": [
                {"name": "author", "selector": ".author", "type": "text"},
                {"name": "date", "selector": ".date", "type": "text"}
            ]
        }
    ]
}

async def extract_news_items(url):
    strategy = JsonCssExtractionStrategy(schema)
    config = CrawlerRunConfig(extraction_strategy=strategy)

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=url, config=config)
        articles = json.loads(result.extracted_content)
        return articles
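Nested fields like the metadata block above are convenient to flatten before exporting to CSV or a database. A small sketch, using a sample record shaped like what the schema might yield (values are illustrative):

```python
# A sample record shaped like the news schema output above; values illustrative.
articles = [
    {
        "headline": "Headline one",
        "summary": "Short summary",
        "category": "news",
        "metadata": {"author": "A. Writer", "date": "2024-12-25"},
    },
]

def flatten(article):
    """Lift nested metadata fields to the top level for tabular use."""
    flat = {k: v for k, v in article.items() if k != "metadata"}
    flat.update(article.get("metadata", {}))
    return flat

rows = [flatten(a) for a in articles]
print(rows[0]["author"])  # A. Writer
```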

Maintenance Tip: If the site's structure changes, update your schema accordingly. Test small changes to ensure the extracted structure still matches your expectations.


Comprehensive Example: Combining Techniques

Below is a more involved example that demonstrates combining multiple strategies and filtering parameters. Here, we extract structured article content from an article.main section, exclude unnecessary elements, and enforce a word count threshold.

from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import CrawlerRunConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
import json

async def extract_article_content(url: str):
    # Schema for structured extraction
    article_schema = {
        "name": "Article",
        "baseSelector": "article.main",
        "fields": [
            {"name": "title", "selector": "h1", "type": "text"},
            {"name": "content", "selector": ".content", "type": "text"}
        ]
    }

    config = CrawlerRunConfig(
        extraction_strategy=JsonCssExtractionStrategy(article_schema),
        word_count_threshold=10,
        excluded_tags=['nav', 'footer'],
        exclude_external_links=True
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=url, config=config)
        extracted = json.loads(result.extracted_content)
        return extracted

Expanding This Example:

  • Add pagination logic to handle multi-page extractions.
  • Introduce LLM-based extraction for a summary of the article's main points.
  • Adjust filtering parameters to refine what content is included or excluded.
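For the pagination point above, the simplest approach is generating one URL per page and crawling each in turn. The "?page=N" query pattern below is an assumption about the target site and usually needs adjusting:

```python
# The "?page=N" pattern is an assumption about the target site.
def page_urls(base_url, pages):
    """Generate one URL per page, starting from page 1."""
    return [f"{base_url}?page={n}" for n in range(1, pages + 1)]

urls = page_urls("https://example.com/articles", 3)
print(urls[-1])  # https://example.com/articles?page=3
```

Each generated URL can then be passed to crawler.arun with the same config, stopping when a page yields no items.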

Troubleshooting & Best Practices

Common Issues & Fixes:

  • Empty extraction result:

    • Verify CSS selectors and filtering parameters.
    • Lower or remove word_count_threshold to see if overly strict criteria are filtering everything out.
    • Check network requests or iframe settings if content is loaded dynamically.
  • Unintended content included:

    • Add more tags to excluded_tags, or refine your CSS selectors.
    • Use exclude_external_links and other filters to clean up results.
  • LLM extraction errors:

    • Ensure the schema matches the expected JSON structure.
    • Refine the instruction prompt to guide the LLM more clearly.
    • Validate LLM provider configuration and error logs.

Performance Tips:

  • Start with simpler strategies (basic CSS selectors) before moving to advanced LLM-based extraction.
  • Use caching or asynchronous crawling to handle large numbers of pages efficiently.
  • Consider running headless browser extractions in Docker for consistent, reproducible environments.
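Caching is configured through CrawlerRunConfig. In recent crawl4ai releases this is done with the CacheMode enum; verify that your installed version exposes it before relying on this sketch:

```python
from crawl4ai import AsyncWebCrawler, CacheMode
from crawl4ai.async_configs import CrawlerRunConfig

async def cached_crawl(url):
    # CacheMode.ENABLED reads from and writes to the local cache;
    # use CacheMode.BYPASS to force a fresh fetch.
    config = CrawlerRunConfig(cache_mode=CacheMode.ENABLED)
    async with AsyncWebCrawler() as crawler:
        return await crawler.arun(url=url, config=config)
```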

Additional Resources

Use the techniques above as a starting point to refine your crawling strategies. With Crawl4AI's flexible configuration and powerful selection methods, you'll be able to extract exactly the content you need—no more, no less.