# Content Selection

Crawl4AI provides multiple ways to **select**, **filter**, and **refine** the content from your crawls. Whether you need to target a specific CSS region, exclude entire tags, filter out external links, or remove certain domains and images, **`CrawlerRunConfig`** offers a wide range of parameters. Below, we show how to configure these parameters and combine them for precise control.

---

## 1. CSS-Based Selection

There are two ways to select content from a page: using `css_selector` or the more flexible `target_elements`.

### 1.1 Using `css_selector`

A straightforward way to **limit** your crawl results to a certain region of the page is **`css_selector`** in **`CrawlerRunConfig`**:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    config = CrawlerRunConfig(
        # e.g., first 30 items from Hacker News
        css_selector=".athing:nth-child(-n+30)"
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://news.ycombinator.com/newest",
            config=config
        )
        print("Partial HTML length:", len(result.cleaned_html))

if __name__ == "__main__":
    asyncio.run(main())
```

**Result**: Only elements matching that selector remain in `result.cleaned_html`.
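The pattern passed to `css_selector` is standard CSS, so you can test a selector outside the crawler before using it. As a small illustration of what an `:nth-child(-n+N)` selector matches, here is a sketch using BeautifulSoup (a third-party library, separate from Crawl4AI) on a toy document:

```python
from bs4 import BeautifulSoup  # third-party; pip install beautifulsoup4

# A toy list standing in for the Hacker News rows targeted above
html = "<ul>" + "".join(
    f"<li class='athing'>item {i}</li>" for i in range(1, 6)
) + "</ul>"

soup = BeautifulSoup(html, "html.parser")

# :nth-child(-n+3) keeps only the first three children, mirroring
# the :nth-child(-n+30) pattern used for Hacker News above
first_three = soup.select(".athing:nth-child(-n+3)")
print([li.get_text() for li in first_three])
# → ['item 1', 'item 2', 'item 3']
```

Prototyping the selector this way against a saved copy of the page can save repeated crawl runs while you refine it.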
### 1.2 Using `target_elements`

The `target_elements` parameter provides more flexibility by allowing you to target **multiple elements** for content extraction while preserving the entire page context for other features:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    config = CrawlerRunConfig(
        # Target article body and sidebar, but not other content
        target_elements=["article.main-content", "aside.sidebar"]
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com/blog-post",
            config=config
        )
        print("Markdown focused on target elements")
        print("Links from entire page still available:",
              len(result.links.get("internal", [])))

if __name__ == "__main__":
    asyncio.run(main())
```

**Key difference**: With `target_elements`, the markdown generation and structural data extraction focus on those elements, but other page elements (like links, images, and tables) are still extracted from the entire page. This gives you fine-grained control over what appears in your markdown content while preserving full page context for link analysis and media collection.

---

## 2. Content Filtering & Exclusions

### 2.1 Basic Overview

```python
config = CrawlerRunConfig(
    # Content thresholds
    word_count_threshold=10,        # Minimum words per block

    # Tag exclusions
    excluded_tags=['form', 'header', 'footer', 'nav'],

    # Link filtering
    exclude_external_links=True,
    exclude_social_media_links=True,

    # Block entire domains
    exclude_domains=["adtrackers.com", "spammynews.org"],
    exclude_social_media_domains=["facebook.com", "twitter.com"],

    # Media filtering
    exclude_external_images=True
)
```

**Explanation**:

- **`word_count_threshold`**: Ignores text blocks under X words. Helps skip trivial blocks like short nav or disclaimers.
- **`excluded_tags`**: Removes entire tags (`