- Add llm.txt generator - Added SSL certificate extraction in AsyncWebCrawler. - Introduced new content filters and chunking strategies for more robust data extraction. - Updated documentation.
url_prefix_handling: Crawl4AI supports different URL prefixes for various input types | input handling, url format, crawling types | url="https://example.com" or "file://path" or "raw:html"
web_crawling: Crawl live web pages using http:// or https:// prefixes with AsyncWebCrawler | web scraping, url crawling, web content | AsyncWebCrawler().arun(url="https://example.com")
local_file_crawling: Access local HTML files using file:// prefix for crawling | local html, file crawling, file access | AsyncWebCrawler().arun(url="file:///path/to/file.html")
raw_html_crawling: Process raw HTML content directly using raw: prefix | html string, raw content, direct html | AsyncWebCrawler().arun(url="raw:<html>content</html>")
crawler_config: Configure crawling behavior using CrawlerRunConfig object | crawler settings, configuration, bypass cache | CrawlerRunConfig(bypass_cache=True)
async_context: Use AsyncWebCrawler within an async context manager | async with, context management, async programming | async with AsyncWebCrawler() as crawler
crawl_result: The crawler returns a result object containing the success status, the markdown, and any error message | response handling, crawl output, result parsing | result.success, result.markdown, result.error_message
html_to_markdown: Crawler automatically converts HTML content to markdown format | format conversion, markdown generation, content processing | result.markdown
error_handling: Check crawl success status and handle error messages appropriately | error checking, failure handling, status verification | if result.success: ... else: print(result.error_message)
content_verification: Compare markdown length between different crawling methods for consistency | content validation, length comparison, consistency check | assert web_crawl_length == local_crawl_length
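
The entries above can be combined into one short sketch. It is a minimal illustration, not the library's reference usage: `classify_source`, `crawl_to_markdown`, and `main` are hypothetical helper names, and passing the config as `arun(url=..., config=...)` is an assumption about the installed Crawl4AI version. Only `AsyncWebCrawler`, `CrawlerRunConfig(bypass_cache=True)`, `arun(url=...)`, and the `success`, `markdown`, and `error_message` fields come from the entries themselves.

```python
def classify_source(url: str) -> str:
    """Return which crawl mode a URL string selects, based on its prefix.

    Hypothetical helper: it only mirrors the prefix rules listed above.
    """
    if url.startswith(("http://", "https://")):
        return "web"
    if url.startswith("file://"):
        return "local_file"
    if url.startswith("raw:"):
        return "raw_html"
    raise ValueError(f"unsupported URL prefix: {url!r}")


async def crawl_to_markdown(url: str) -> str:
    """Crawl one source and return its markdown, raising if the crawl fails."""
    # Imported lazily so classify_source works without crawl4ai installed.
    from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

    # Assumption: this Crawl4AI version accepts bypass_cache and a config argument.
    config = CrawlerRunConfig(bypass_cache=True)

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=url, config=config)
        if not result.success:
            raise RuntimeError(f"crawl failed: {result.error_message}")
        return result.markdown


async def main() -> None:
    """Crawl a live page and a raw HTML string, then report markdown lengths."""
    sources = [
        "https://example.com",
        "raw:<html><body><h1>Hello</h1></body></html>",
    ]
    for url in sources:
        markdown = await crawl_to_markdown(url)
        print(classify_source(url), len(markdown))
```

To try it, add `import asyncio` and call `asyncio.run(main())`; the live-page crawl needs network access and an installed `crawl4ai`. Raising on failure keeps the `result.success` check in one place, so callers only ever receive markdown from a successful crawl.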