Automatically detect when crawls are blocked by anti-bot systems (Akamai, Cloudflare, PerimeterX, DataDome, Imperva, etc.) and escalate through configurable retry and fallback strategies. New features on CrawlerRunConfig: - max_retries: retry rounds when blocking is detected - fallback_proxy_configs: list of fallback proxies tried each round - fallback_fetch_function: async last-resort function returning raw HTML New field on ProxyConfig: - is_fallback: skip proxy on first attempt, activate only when blocked Escalation chain per round: main proxy → fallback proxies in order. After all rounds: fallback_fetch_function as last resort. Detection uses tiered heuristics — structural HTML markers (high confidence) trigger on any page, generic patterns only on short error pages to avoid false positives.
124 lines
4.5 KiB
YAML
124 lines
4.5 KiB
YAML
site_name: Crawl4AI Documentation (v0.8.x)
|
|
site_description: 🚀🤖 Crawl4AI, Open-source LLM-Friendly Web Crawler & Scraper
|
|
site_url: https://docs.crawl4ai.com
|
|
repo_url: https://github.com/unclecode/crawl4ai
|
|
repo_name: unclecode/crawl4ai
|
|
docs_dir: docs/md_v2
|
|
|
|
nav:
|
|
- Home: 'index.md'
|
|
- "📚 Complete SDK Reference": "complete-sdk-reference.md"
|
|
- "Ask AI": "core/ask-ai.md"
|
|
- "Quick Start": "core/quickstart.md"
|
|
- "Code Examples": "core/examples.md"
|
|
- Apps:
|
|
- "Demo Apps": "apps/index.md"
|
|
- "C4A-Script Editor": "apps/c4a-script/index.html"
|
|
- "LLM Context Builder": "apps/llmtxt/index.html"
|
|
- "Marketplace": "marketplace/index.html"
|
|
- "Marketplace Admin": "marketplace/admin/index.html"
|
|
- Setup & Installation:
|
|
- "Installation": "core/installation.md"
|
|
- "Self-Hosting Guide": "core/self-hosting.md"
|
|
- "Blog & Changelog":
|
|
- "Blog Home": "blog/index.md"
|
|
- "Changelog": "https://github.com/unclecode/crawl4ai/blob/main/CHANGELOG.md"
|
|
- Core:
|
|
- "Command Line Interface": "core/cli.md"
|
|
- "Simple Crawling": "core/simple-crawling.md"
|
|
- "Deep Crawling": "core/deep-crawling.md"
|
|
- "Adaptive Crawling": "core/adaptive-crawling.md"
|
|
- "URL Seeding": "core/url-seeding.md"
|
|
- "C4A-Script": "core/c4a-script.md"
|
|
- "Crawler Result": "core/crawler-result.md"
|
|
- "Browser, Crawler & LLM Config": "core/browser-crawler-config.md"
|
|
- "Markdown Generation": "core/markdown-generation.md"
|
|
- "Fit Markdown": "core/fit-markdown.md"
|
|
- "Page Interaction": "core/page-interaction.md"
|
|
- "Content Selection": "core/content-selection.md"
|
|
- "Cache Modes": "core/cache-modes.md"
|
|
- "Local Files & Raw HTML": "core/local-files.md"
|
|
- "Link & Media": "core/link-media.md"
|
|
- Advanced:
|
|
- "Overview": "advanced/advanced-features.md"
|
|
- "Adaptive Strategies": "advanced/adaptive-strategies.md"
|
|
- "Virtual Scroll": "advanced/virtual-scroll.md"
|
|
- "File Downloading": "advanced/file-downloading.md"
|
|
- "Lazy Loading": "advanced/lazy-loading.md"
|
|
- "Hooks & Auth": "advanced/hooks-auth.md"
|
|
- "Proxy & Security": "advanced/proxy-security.md"
|
|
- "Anti-Bot & Fallback": "advanced/anti-bot-and-fallback.md"
|
|
- "Undetected Browser": "advanced/undetected-browser.md"
|
|
- "Session Management": "advanced/session-management.md"
|
|
- "Multi-URL Crawling": "advanced/multi-url-crawling.md"
|
|
- "Crawl Dispatcher": "advanced/crawl-dispatcher.md"
|
|
- "Identity Based Crawling": "advanced/identity-based-crawling.md"
|
|
- "SSL Certificate": "advanced/ssl-certificate.md"
|
|
- "Network & Console Capture": "advanced/network-console-capture.md"
|
|
- "PDF Parsing": "advanced/pdf-parsing.md"
|
|
- Extraction:
|
|
- "LLM-Free Strategies": "extraction/no-llm-strategies.md"
|
|
- "LLM Strategies": "extraction/llm-strategies.md"
|
|
- "Clustering Strategies": "extraction/clustring-strategies.md"
|
|
- "Chunking": "extraction/chunking.md"
|
|
- API Reference:
|
|
- "AsyncWebCrawler": "api/async-webcrawler.md"
|
|
- "arun()": "api/arun.md"
|
|
- "arun_many()": "api/arun_many.md"
|
|
- "Browser, Crawler & LLM Config": "api/parameters.md"
|
|
- "CrawlResult": "api/crawl-result.md"
|
|
- "Strategies": "api/strategies.md"
|
|
- "C4A-Script Reference": "api/c4a-script-reference.md"
|
|
- "Brand Book": "branding/index.md"
|
|
- Community:
|
|
- "Contributing Guide": "CONTRIBUTING.md"
|
|
- "Code of Conduct": "https://github.com/unclecode/crawl4ai/blob/main/CODE_OF_CONDUCT.md"
|
|
- "Contributors": "https://github.com/unclecode/crawl4ai/blob/main/CONTRIBUTORS.md"
|
|
|
|
theme:
|
|
name: 'terminal'
|
|
palette: 'dark'
|
|
favicon: favicon.ico
|
|
custom_dir: docs/md_v2/overrides
|
|
color_mode: 'dark'
|
|
icon:
|
|
repo: fontawesome/brands/github
|
|
|
|
plugins:
|
|
- search
|
|
|
|
markdown_extensions:
|
|
- pymdownx.highlight:
|
|
anchor_linenums: true
|
|
- pymdownx.inlinehilite
|
|
- pymdownx.snippets
|
|
- pymdownx.superfences
|
|
- admonition
|
|
- pymdownx.details
|
|
- attr_list
|
|
- tables
|
|
|
|
extra:
|
|
version: !ENV [CRAWL4AI_VERSION, 'development']
|
|
|
|
extra_css:
|
|
- assets/layout.css
|
|
- assets/styles.css
|
|
- assets/highlight.css
|
|
- assets/dmvendor.css
|
|
- assets/feedback-overrides.css
|
|
- assets/page_actions.css
|
|
|
|
extra_javascript:
|
|
- https://www.googletagmanager.com/gtag/js?id=G-58W0K2ZQ25
|
|
- assets/gtag.js
|
|
- assets/highlight.min.js
|
|
- assets/highlight_init.js
|
|
- https://buttons.github.io/buttons.js
|
|
- assets/toc.js
|
|
- assets/github_stats.js
|
|
- assets/selection_ask_ai.js
|
|
- assets/copy_code.js
|
|
- assets/floating_ask_ai_button.js
|
|
- assets/mobile_menu.js
|
|
- assets/page_actions.js?v=20251006 |