crawl4ai/docs/llm.txt/7_extraction_strategies.q.md

llm_extraction: Use language models to turn web content into structured JSON | language model extraction, schema extraction, LLM parsing | LLMExtractionStrategy(provider="openai", api_token="token")
schema_based_extraction: Extract data using predefined JSON schemas to structure LLM output | schema extraction, structured output | schema=OpenAIModelFee.model_json_schema()
chunking_config: Configure content chunking with token threshold and overlap rate | content chunks, token limits | chunk_token_threshold=1000, overlap_rate=0.1
provider_config: Specify LLM provider and API credentials for extraction | model provider, API setup | provider="openai", api_token="your_token"
cosine_strategy: Use similarity-based clustering to extract relevant content sections | content clustering, semantic similarity | CosineStrategy(semantic_filter="product reviews")
clustering_params: Configure clustering behavior with similarity thresholds and methods | similarity settings, cluster config | sim_threshold=0.3, linkage_method='ward'
content_filtering: Filter extracted content based on word count and relevance | content filters, extraction rules | word_count_threshold=10, top_k=3
xpath_extraction: Extract data using XPath selectors for stable structural parsing | xpath selectors, HTML parsing | JsonXPathExtractionStrategy(schema)
css_extraction: Extract data using CSS selectors for simple HTML parsing | css selectors, HTML parsing | JsonCssExtractionStrategy(schema)
schema_generation: Generate extraction schemas automatically with a one-time LLM call, then reuse them without further LLM use | schema creation, automation | generate_schema(html, query)
dynamic_content: Handle dynamic webpage content with JavaScript execution and waiting | async content, js execution | js_code="window.scrollTo(0, document.body.scrollHeight)"
extraction_best_practices: Use XPath for stability, avoid unnecessary LLM calls, test selectors | optimization, reliability | baseSelector="//table/tbody/tr"
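
The `chunking_config` entry above names two parameters, `chunk_token_threshold` and `overlap_rate`. The following is a minimal sketch of how a threshold-plus-overlap chunker could behave. It is not crawl4ai's implementation: the function name `chunk_tokens` is hypothetical, and it assumes whitespace-separated words stand in for tokens.

```python
from typing import List


def chunk_tokens(
    text: str,
    chunk_token_threshold: int = 1000,
    overlap_rate: float = 0.1,
) -> List[str]:
    """Split text into overlapping chunks (hypothetical helper).

    Assumption: a "token" is a whitespace-separated word. A real
    tokenizer would count tokens differently.
    """
    if chunk_token_threshold <= 0:
        raise ValueError("chunk_token_threshold must be positive")
    if not 0 <= overlap_rate < 1:
        raise ValueError("overlap_rate must be in [0, 1)")

    tokens = text.split()
    # Number of tokens shared between consecutive chunks.
    overlap = int(chunk_token_threshold * overlap_rate)
    # Advance by at least one token so the loop always terminates.
    step = max(1, chunk_token_threshold - overlap)

    chunks: List[str] = []
    for start in range(0, len(tokens), step):
        end = start + chunk_token_threshold
        chunks.append(" ".join(tokens[start:end]))
        if end >= len(tokens):
            # The current chunk already reaches the end of the text.
            break
    return chunks


if __name__ == "__main__":
    sample = " ".join(str(i) for i in range(25))
    for chunk in chunk_tokens(sample, chunk_token_threshold=10, overlap_rate=0.2):
        print(chunk)
```

With a threshold of 10 and an overlap rate of 0.2, each chunk repeats the last two words of the previous one, so content that straddles a chunk boundary appears whole in at least one chunk.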