Implements comprehensive hooks functionality allowing users to provide custom Python functions as strings that execute at specific points in the crawling pipeline. Key Features: - Support for all 8 crawl4ai hook points: • on_browser_created: Initialize browser settings • on_page_context_created: Configure page context • before_goto: Pre-navigation setup • after_goto: Post-navigation processing • on_user_agent_updated: User agent modification handling • on_execution_started: Crawl execution initialization • before_retrieve_html: Pre-extraction processing • before_return_html: Final HTML processing Implementation Details: - Created UserHookManager for validation, compilation, and safe execution - Added IsolatedHookWrapper for error isolation and timeout protection - AST-based validation ensures code structure correctness - Sandboxed execution with restricted builtins for security - Configurable timeout (1-120 seconds) prevents infinite loops - Comprehensive error handling ensures hooks don't crash main process - Execution tracking with detailed statistics and logging API Changes: - Added HookConfig schema with code and timeout fields - Extended CrawlRequest with optional hooks parameter - Added /hooks/info endpoint for hook discovery - Updated /crawl and /crawl/stream endpoints to support hooks Safety Features: - Malformed hooks return clear validation errors - Hook errors are isolated and reported without stopping crawl - Execution statistics track success/failure/timeout rates - All hook results are JSON-serializable Testing: - Comprehensive test suite covering all 8 hooks - Error handling and timeout scenarios validated - Authentication, performance, and content extraction examples - 100% success rate in production testing Documentation: - Added extensive hooks section to docker-deployment.md - Security warnings about user-provided code risks - Real-world examples using httpbin.org, GitHub, BBC - Best practices and troubleshooting guide ref #1377
19 KiB
19 KiB