Features
Current Features
- Async-first architecture for high-performance web crawling
- Built-in anti-bot detection bypass ("magic mode")
- Multiple browser engine support (Chromium, Firefox, WebKit)
- Smart session management with automatic cleanup
- Automatic content cleaning and relevance scoring
- Built-in markdown generation with formatting preservation
- Intelligent image scoring and filtering
- Automatic popup and overlay removal
- Smart wait conditions (CSS/JavaScript based)
- Multi-provider LLM integration (OpenAI, HuggingFace, Ollama)
- Schema-based structured data extraction
- Automated iframe content processing
- Intelligent link categorization (internal/external)
- Multiple chunking strategies for large content
- Real-time HTML cleaning and sanitization
- Automatic screenshot capabilities
- Social media link filtering
- Semantic similarity-based content clustering
- Human behavior simulation for anti-bot bypass
- Proxy support with authentication
- Automatic resource cleanup
- Custom CSS selector-based extraction
- Automatic content relevance scoring ("fit" content)
- Recursive website crawling capabilities
- Flexible hook system for customization
- Built-in caching system
- Domain-based content filtering
- Dynamic content handling with JavaScript execution
- Automatic media content extraction and classification
- Metadata extraction and processing
- Customizable HTML to Markdown conversion
- Token-aware content chunking for LLM processing
- Automatic response header and status code handling
- Browser fingerprint customization
- Multiple extraction strategies (LLM, CSS, Cosine, XPath)
- Automatic error image generation for failed screenshots
- Smart content overlap handling for large texts
- Built-in rate limiting for batch processing
- Automatic cookie handling
- Browser console logging and debugging capabilities
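To make the chunking and overlap items above concrete, here is a minimal sketch of word-based chunking in which consecutive chunks share a fixed number of words. The function name `chunk_words` and its parameters are illustrative assumptions, not the library's API, and whitespace-separated words stand in for real tokenizer tokens.

```python
def chunk_words(text: str, chunk_size: int = 100, overlap: int = 20) -> list[str]:
    """Split text into word chunks; neighbouring chunks share `overlap` words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    words = text.split()  # assumption: whitespace words approximate tokens
    step = chunk_size - overlap
    chunks: list[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks
```

The overlap keeps a sentence that straddles a chunk boundary visible in both neighbouring chunks, at the cost of processing some words twice.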
Technical Features
• Browser Management
- Asynchronous browser control
- Multi-browser support (Chromium, Firefox, WebKit)
- Headless mode support
- Browser cleanup and resource management
- Custom browser arguments and configuration
- Context management with __aenter__ and __aexit__
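The asynchronous context management listed above can be sketched as follows. `ManagedBrowser` is a hypothetical stand-in that launches no real browser; it only shows how `__aenter__` and `__aexit__` pair start-up with guaranteed cleanup.

```python
import asyncio


class ManagedBrowser:
    """Hypothetical stand-in for a browser wrapper; nothing is actually launched."""

    def __init__(self) -> None:
        self.started = False
        self.closed = False

    async def __aenter__(self) -> "ManagedBrowser":
        await asyncio.sleep(0)  # placeholder for launching the browser
        self.started = True
        return self

    async def __aexit__(self, exc_type, exc, tb) -> bool:
        await asyncio.sleep(0)  # placeholder for closing pages and the browser
        self.closed = True
        return False  # do not suppress exceptions raised inside the block


async def demo() -> ManagedBrowser:
    async with ManagedBrowser() as browser:
        # Inside the block the resource is started but not yet closed.
        assert browser.started and not browser.closed
    return browser
```

Because `__aexit__` runs even when the body raises, cleanup does not depend on the caller remembering to close anything.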
• Session Handling
- Session management with TTL (Time To Live)
- Session reuse capabilities
- Session cleanup for expired sessions
- Session-based context preservation
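One way to picture TTL-based session handling is a small store that refreshes a session's timestamp on reuse and drops entries that have been idle too long. This is a sketch under assumptions: `SessionStore`, its methods, and the injectable `clock` are hypothetical names chosen for illustration.

```python
import time
from typing import Any, Callable, Optional


class SessionStore:
    """Keeps per-session contexts and expires them after `ttl_seconds` of inactivity."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl = ttl_seconds
        self._clock = clock
        self._sessions: dict[str, tuple[Any, float]] = {}

    def put(self, session_id: str, context: Any) -> None:
        self._sessions[session_id] = (context, self._clock())

    def get(self, session_id: str) -> Optional[Any]:
        entry = self._sessions.get(session_id)
        if entry is None:
            return None
        context, last_used = entry
        now = self._clock()
        if now - last_used > self.ttl:
            del self._sessions[session_id]  # expired: drop instead of reusing
            return None
        self._sessions[session_id] = (context, now)  # reuse refreshes the TTL
        return context

    def cleanup_expired(self) -> list[str]:
        """Remove all expired sessions and return their ids."""
        now = self._clock()
        expired = [sid for sid, (_, used) in self._sessions.items() if now - used > self.ttl]
        for sid in expired:
            del self._sessions[sid]
        return expired
```

Passing the clock in as a parameter makes expiry testable without sleeping.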
• Stealth Features
- Playwright stealth configuration
- Navigator properties override
- WebDriver detection evasion
- Chrome app simulation
- Plugin simulation
- Language preferences simulation
- Hardware concurrency simulation
- Media codecs simulation
• Network Features
- Proxy support with authentication
- Custom headers management
- Cookie handling
- Response header capture
- Status code tracking
- Network idle detection
• Page Interaction
- Smart wait functionality for multiple conditions
- CSS selector-based waiting
- JavaScript condition waiting
- Custom JavaScript execution
- User interaction simulation (mouse/keyboard)
- Page scrolling
- Timeout management
- Load state monitoring
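Waiting on a condition with a timeout, as listed above, can be approximated by a polling loop. The sketch below is generic Python rather than the library's implementation; `wait_for_condition` and its parameters are assumed names, and the predicate stands in for a CSS-selector or JavaScript check evaluated in the page.

```python
import asyncio
import time
from typing import Callable


async def wait_for_condition(
    predicate: Callable[[], bool],
    timeout: float = 5.0,
    interval: float = 0.05,
) -> None:
    """Poll `predicate` until it returns True or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        await asyncio.sleep(interval)  # yield to other tasks between checks
```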
• Content Processing
- HTML content extraction
- Iframe processing and content extraction
- Delayed content retrieval
- Content caching
- Cache file management
- HTML cleaning and processing
• Image Handling
- Screenshot capabilities (full page)
- Base64 encoding of screenshots
- Image dimension updating
- Image filtering (size/visibility)
- Error image generation
- Natural width/height preservation
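Base64 encoding lets binary screenshot data travel inside text formats such as JSON. A minimal sketch using only the standard library follows; the helper names are hypothetical.

```python
import base64


def encode_screenshot(png_bytes: bytes) -> str:
    """Return the screenshot bytes as an ASCII base64 string."""
    return base64.b64encode(png_bytes).decode("ascii")


def decode_screenshot(encoded: str) -> bytes:
    """Recover the original bytes from a base64 string."""
    return base64.b64decode(encoded)
```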
• Overlay Management
- Popup removal
- Cookie notice removal
- Newsletter dialog removal
- Modal removal
- Fixed position element removal
- Z-index based overlay detection
- Visibility checking
• Hook System
- Browser creation hooks
- User agent update hooks
- Execution start hooks
- Navigation hooks (before/after goto)
- HTML retrieval hooks
- HTML return hooks
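A hook system of this kind can be pictured as a registry of named async callbacks invoked at fixed points in the crawl. The sketch below is an assumption-laden illustration: `HookRegistry`, `fake_crawl`, and the hook names `before_goto` and `after_goto` are chosen to mirror the navigation hooks listed above, not taken from the library's code.

```python
import asyncio
from collections import defaultdict
from typing import Awaitable, Callable

Hook = Callable[..., Awaitable[None]]


class HookRegistry:
    """Maps hook names to async callbacks and runs them in registration order."""

    def __init__(self) -> None:
        self._hooks: dict[str, list[Hook]] = defaultdict(list)

    def set_hook(self, name: str, hook: Hook) -> None:
        self._hooks[name].append(hook)

    async def run(self, name: str, *args) -> None:
        for hook in self._hooks.get(name, []):
            await hook(*args)


async def fake_crawl(url: str, hooks: HookRegistry) -> str:
    """Simulated crawl step that fires hooks around a placeholder navigation."""
    await hooks.run("before_goto", url)
    html = f"<html>{url}</html>"  # placeholder for real page navigation
    await hooks.run("after_goto", url, html)
    return html
```

A caller registers coroutines with `set_hook` before crawling; hooks with no registered callbacks are simply skipped.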
• Error Handling
- Browser error catching
- Network error handling
- Timeout handling
- Screenshot error recovery
- Invalid selector handling
- General exception management
• Performance Features
- Concurrent URL processing
- Semaphore-based rate limiting
- Async gathering of results
- Resource cleanup
- Memory management
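Semaphore-based rate limiting combined with async gathering can be sketched with `asyncio.Semaphore` and `asyncio.gather`. Here `fetch_stub` is a hypothetical placeholder for a real page fetch, and `crawl_many` is an assumed name.

```python
import asyncio


async def fetch_stub(url: str) -> str:
    """Placeholder for a real fetch; sleeps briefly to simulate I/O."""
    await asyncio.sleep(0.01)
    return f"fetched:{url}"


async def crawl_many(urls: list[str], max_concurrent: int = 3) -> list[str]:
    """Process all URLs concurrently, with at most `max_concurrent` in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def bounded(url: str) -> str:
        async with semaphore:  # blocks while max_concurrent fetches are running
            return await fetch_stub(url)

    # gather returns results in the same order as the input URLs.
    return await asyncio.gather(*(bounded(u) for u in urls))
```

The semaphore caps simultaneous work without serializing it, so throughput scales up to the limit while the target site is spared an unbounded burst.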
• Debug Features
- Console logging
- Page error logging
- Verbose mode
- Error message generation
- Warning system
• Security Features
- Certificate error handling
- Sandbox configuration
- GPU handling
- CSP (Content Security Policy) compliant waiting
• Configuration
- User agent customization
- Viewport configuration
- Timeout configuration
- Browser type selection
- Proxy configuration
- Header configuration
• Data Models
- Pydantic model for responses
- Type hints throughout code
- Structured response format
- Optional response fields
• File System Integration
- Cache directory management
- File path handling
- Cache metadata storage
- File read/write operations
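The file-system items above suggest a cache that stores each page next to a small metadata file. The following sketch assumes a layout of one HTML file plus one JSON file per URL, keyed by a SHA-256 hash of the URL; `FileCache` and this layout are illustrative choices, not the library's actual on-disk format.

```python
import hashlib
import json
import tempfile
import time
from pathlib import Path
from typing import Optional


class FileCache:
    """Stores HTML and metadata per URL under a cache directory."""

    def __init__(self, cache_dir: str) -> None:
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)

    def _key(self, url: str) -> str:
        # Hashing gives a fixed-length, filesystem-safe file name for any URL.
        return hashlib.sha256(url.encode("utf-8")).hexdigest()

    def put(self, url: str, html: str, status_code: int) -> None:
        key = self._key(url)
        (self.cache_dir / f"{key}.html").write_text(html, encoding="utf-8")
        meta = {"url": url, "status_code": status_code, "cached_at": time.time()}
        (self.cache_dir / f"{key}.meta.json").write_text(json.dumps(meta), encoding="utf-8")

    def get(self, url: str) -> Optional[tuple[str, dict]]:
        key = self._key(url)
        html_path = self.cache_dir / f"{key}.html"
        meta_path = self.cache_dir / f"{key}.meta.json"
        if not (html_path.exists() and meta_path.exists()):
            return None
        html = html_path.read_text(encoding="utf-8")
        return html, json.loads(meta_path.read_text(encoding="utf-8"))


def demo() -> tuple:
    """Round-trip one page through a temporary cache directory."""
    with tempfile.TemporaryDirectory() as tmp:
        cache = FileCache(tmp)
        cache.put("https://example.com", "<p>hi</p>", 200)
        return cache.get("https://example.com"), cache.get("https://missing.example")
```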
• Metadata Handling
- Response headers capture
- Status code tracking
- Cache metadata
- Session tracking
- Timestamp management