crawler_hooks: AsyncWebCrawler supports customizable hooks for modifying crawler behavior | hooks, async functions, crawler customization | crawler.crawler_strategy.set_hook()
browser_creation_hook: on_browser_created hook executes when browser is initialized for authentication and setup | browser setup, login, authentication | async def on_browser_created(browser: Browser, **kwargs)
page_context_hook: on_page_context_created hook handles routing and initial page setup | page context, routing, resource blocking | async def on_page_context_created(context: BrowserContext, page: Page, **kwargs)
navigation_pre_hook: before_goto hook allows adding custom headers before URL navigation | headers, pre-navigation, request modification | async def before_goto(page: Page, context: BrowserContext, **kwargs)
navigation_post_hook: after_goto hook executes after URL navigation for verification | post-navigation, URL logging | async def after_goto(page: Page, context: BrowserContext, **kwargs)
js_execution_hook: on_execution_started hook runs after custom JavaScript execution | JavaScript, script execution | async def on_execution_started(page: Page, context: BrowserContext, **kwargs)
html_processing_hook: before_return_html hook processes HTML content before returning | HTML content, preprocessing | async def before_return_html(page: Page, context: BrowserContext, html: str, **kwargs)
browser_configuration: BrowserConfig allows setting headless mode and viewport dimensions | browser settings, viewport | BrowserConfig(headless=True, viewport_width=1920, viewport_height=1080)
crawler_configuration: CrawlerRunConfig defines JavaScript execution and wait conditions | crawler settings, JS code, wait conditions | CrawlerRunConfig(js_code="window.scrollTo(0)", wait_for="footer")
resource_management: Route handlers can block or modify specific resource types | resource blocking, request handling | if route.request.resource_type == "image": await route.abort()
authentication_flow: Browser authentication handled through login form interaction and cookie setting | login process, cookies | await page.fill("input[name='username']", "testuser")
hook_registration: Hooks are registered using the crawler strategy's set_hook method | hook setup, strategy | crawler.crawler_strategy.set_hook("hook_name", hook_function)
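
The hook entries above can be combined into a small registration sketch. This is a minimal illustration, not the library's reference usage. The hook names and signatures follow the entries above, and `set_hook` is the registration call they name. The helper `register_hooks`, the `HOOKS` mapping, the `"**/*"` route pattern, the header name and value, and the choice to return `page` from each hook are assumptions made for the example. The code relies only on the Playwright-style methods `context.route`, `route.abort`, `route.continue_`, and `page.set_extra_http_headers`, so it does not import the crawler itself.

```python
async def block_images(route):
    """Route handler: abort image requests and let everything else through."""
    if route.request.resource_type == "image":
        await route.abort()
    else:
        await route.continue_()


async def on_page_context_created(context, page, **kwargs):
    # Install the handler for every request in this context.
    # The "**/*" pattern is an assumption chosen to match all URLs.
    await context.route("**/*", block_images)
    return page  # assumption: hooks hand the page back


async def before_goto(page, context, **kwargs):
    # Header name and value are illustrative only.
    await page.set_extra_http_headers({"X-Example-Header": "demo"})
    return page


async def before_return_html(page, context, html, **kwargs):
    # Inspect the HTML just before it is returned to the caller.
    print(f"HTML length: {len(html)}")
    return page


# Hypothetical mapping from hook name to hook function.
HOOKS = {
    "on_page_context_created": on_page_context_created,
    "before_goto": before_goto,
    "before_return_html": before_return_html,
}


def register_hooks(crawler):
    """Register each hook through the crawler strategy's set_hook method."""
    for name, hook in HOOKS.items():
        crawler.crawler_strategy.set_hook(name, hook)
```

In use, `register_hooks(crawler)` would be called on an `AsyncWebCrawler` instance before crawling starts. Keeping the hooks in one mapping makes it easy to see which stages are customized and to switch individual hooks off.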