Implement an initial MVP for Docker-based browser management in Crawl4ai, enabling
remote browser execution in containerized environments.
Key Changes:
- Add browser_farm module with Docker support components:
* BrowserFarmService: Manages browser endpoints
* DockerBrowser: Handles Docker browser communication
* Basic health check implementation
* Dockerfile with optimized Chrome/Playwright setup:
- Based on python:3.10-slim for minimal size
- Includes all required system dependencies
- Auto-installs crawl4ai and sets up Playwright
- Configures Chrome with remote debugging
- Uses socat for port forwarding (9223)
- Update core components:
* Rename use_managed_browser to use_remote_browser for clarity
* Modify BrowserManager to support Docker mode
* Add Docker configuration in BrowserConfig
* Update context handling for remote browsers
- Add example:
* hello_world_docker.py demonstrating Docker browser usage
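The basic health check mentioned above is only named in this summary; a minimal sketch of what such a check could look like, assuming Chrome's standard `/json/version` CDP metadata endpoint on the container's debugging port. The helper names (`cdp_version_url`, `is_browser_healthy`) are illustrative, not the actual crawl4ai API:

```python
# Hypothetical sketch of the kind of health check BrowserFarmService might run.
# /json/version is Chrome's standard CDP metadata endpoint; a healthy endpoint
# reports its WebSocket debugger URL there.
import json
from urllib.request import urlopen

def cdp_version_url(host: str = "localhost", port: int = 9223) -> str:
    """Build the CDP metadata URL exposed by Chrome's remote debugging port."""
    return f"http://{host}:{port}/json/version"

def is_browser_healthy(host: str = "localhost", port: int = 9223,
                       timeout: float = 2.0) -> bool:
    """Return True if the containerized Chrome answers on its debugging port."""
    try:
        with urlopen(cdp_version_url(host, port), timeout=timeout) as resp:
            info = json.load(resp)
        return "webSocketDebuggerUrl" in info
    except OSError:
        return False
```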
Technical Details:
- Docker container exposes port 9223 (mapped to host:9333)
- Uses CDP (Chrome DevTools Protocol) for remote connection
- Maintains compatibility with existing managed browser features
- Simplified endpoint management for MVP phase
- Optimized Docker setup:
* Minimal dependencies installation
* Proper Chrome flags for containerized environment
* Headless mode with GPU disabled
* Security considerations (no-sandbox mode)
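The container details above (headless, GPU disabled, no-sandbox, debugging on 9223) can be collected into a launch command. This is an illustrative sketch only; the exact flag set and helper used by the Dockerfile may differ:

```python
# Illustrative Chrome launch flags for a containerized, headless setup,
# mirroring the details above. The helper name is hypothetical.
CONTAINER_CHROME_FLAGS = [
    "--headless=new",
    "--disable-gpu",
    "--no-sandbox",                       # Chrome commonly runs as root in containers
    "--remote-debugging-port=9223",
    "--remote-debugging-address=0.0.0.0", # reachable through socat/port mapping
]

def build_chrome_command(binary: str = "google-chrome", extra_args=None) -> list:
    """Assemble the argv for launching Chrome inside the container."""
    return [binary, *CONTAINER_CHROME_FLAGS, *(extra_args or [])]
```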
Testing:
- Extensive Docker configuration testing and optimization
- Verified with hello_world_docker.py example
- Confirmed remote browser connection and crawling functionality
- Tested basic health checks
This is the first step towards a scalable browser farm solution, setting up
the foundation for future enhancements like resource monitoring, multiple
browser instances, and container lifecycle management.
crawl4ai/models.py
| Type | Name | Signature | Docstring |
|---|---|---|---|
| MODULE | models.py | `` | |
| CLASS | TokenUsage | `class TokenUsage:` | |
| CLASS | UrlModel | `class UrlModel:` | |
| CLASS | MarkdownGenerationResult | `class MarkdownGenerationResult:` | |
| CLASS | CrawlResult | `class CrawlResult:` | |
| CLASS | AsyncCrawlResponse | `class AsyncCrawlResponse:` | |
crawl4ai/async_configs.py
| Type | Name | Signature | Docstring |
|---|---|---|---|
| MODULE | async_configs.py | `` | |
| CLASS | BrowserConfig | `class BrowserConfig:` | Configuration class for setting up a browser instance and its context in AsyncPlaywrightCrawlerStrat... (truncated) |
| METHOD | BrowserConfig.init | `def __init__(self, browser_type='chromium', headless=True, use_remote_browser=False, use_persistent_context=False, user_data_dir=None, chrome_channel='chrome', proxy=None, proxy_config=None, viewport_width=1080, viewport_height=600, accept_downloads=False, downloads_path=None, storage_state=None, ignore_https_errors=True, java_script_enabled=True, sleep_on_close=False, verbose=True, cookies=None, headers=None, user_agent='Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:109.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.5845.187 Safari/604.1 Edg/117.0.2045.47', user_agent_mode=None, user_agent_generator_config=None, text_mode=False, light_mode=False, extra_args=None, debugging_port=9222):` | |
| METHOD | BrowserConfig.from_kwargs | `def from_kwargs(kwargs):` | |
| CLASS | CrawlerRunConfig | `class CrawlerRunConfig:` | Configuration class for controlling how the crawler runs each crawl operation. This includes paramet... (truncated) |
| METHOD | CrawlerRunConfig.init | `def __init__(self, word_count_threshold=MIN_WORD_THRESHOLD, extraction_strategy=None, chunking_strategy=None, markdown_generator=None, content_filter=None, only_text=False, css_selector=None, excluded_tags=None, excluded_selector=None, keep_data_attributes=False, remove_forms=False, prettiify=False, parser_type='lxml', fetch_ssl_certificate=False, cache_mode=None, session_id=None, bypass_cache=False, disable_cache=False, no_cache_read=False, no_cache_write=False, wait_until='domcontentloaded', page_timeout=PAGE_TIMEOUT, wait_for=None, wait_for_images=True, delay_before_return_html=0.1, mean_delay=0.1, max_range=0.3, semaphore_count=5, js_code=None, js_only=False, ignore_body_visibility=True, scan_full_page=False, scroll_delay=0.2, process_iframes=False, remove_overlay_elements=False, simulate_user=False, override_navigator=False, magic=False, adjust_viewport_to_content=False, screenshot=False, screenshot_wait_for=None, screenshot_height_threshold=SCREENSHOT_HEIGHT_TRESHOLD, pdf=False, image_description_min_word_threshold=IMAGE_DESCRIPTION_MIN_WORD_THRESHOLD, image_score_threshold=IMAGE_SCORE_THRESHOLD, exclude_external_images=False, exclude_social_media_domains=None, exclude_external_links=False, exclude_social_media_links=False, exclude_domains=None, verbose=True, log_console=False, url=None):` | |
| METHOD | CrawlerRunConfig.from_kwargs | `def from_kwargs(kwargs):` | |
| METHOD | CrawlerRunConfig.to_dict | `def to_dict(self):` | |
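Both config classes above expose a `from_kwargs` constructor, a common pattern for building a config object from a plain dict. A simplified stand-in (not the real `BrowserConfig`; the ignore-unknown-keys behavior is an assumption) illustrates the idea:

```python
# Minimal sketch of the from_kwargs pattern: populate known fields from a
# dict, keep defaults for the rest, and ignore unrecognized keys.
# MiniBrowserConfig is a stand-in with a small subset of the real fields.
from dataclasses import dataclass, fields

@dataclass
class MiniBrowserConfig:
    browser_type: str = "chromium"
    headless: bool = True
    use_remote_browser: bool = False
    debugging_port: int = 9222

    @classmethod
    def from_kwargs(cls, kwargs: dict) -> "MiniBrowserConfig":
        allowed = {f.name for f in fields(cls)}
        return cls(**{k: v for k, v in kwargs.items() if k in allowed})
```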
crawl4ai/async_webcrawler.py
| Type | Name | Signature | Docstring |
|---|---|---|---|
| MODULE | async_webcrawler.py | `` | |
| CLASS | AsyncWebCrawler | `class AsyncWebCrawler:` | Asynchronous web crawler with flexible caching capabilities. There are two ways to use the crawler:... (truncated) |
| METHOD | AsyncWebCrawler.init | `def __init__(self, crawler_strategy=None, config=None, always_bypass_cache=False, always_by_pass_cache=None, base_directory=str(os.getenv('CRAWL4_AI_BASE_DIRECTORY', Path.home())), thread_safe=False, **kwargs):` | Initialize the AsyncWebCrawler. Args: crawler_strategy: Strategy for crawling web pages. If Non... (truncated) |
| METHOD | AsyncWebCrawler.start | `async def start(self):` | Start the crawler explicitly without using context manager. This is equivalent to using 'async with'... (truncated) |
| METHOD | AsyncWebCrawler.close | `async def close(self):` | Close the crawler explicitly without using context manager. This should be called when you're done w... (truncated) |
| METHOD | AsyncWebCrawler.aenter | `async def __aenter__(self):` | |
| METHOD | AsyncWebCrawler.aexit | `async def __aexit__(self, exc_type, exc_val, exc_tb):` | |
| METHOD | AsyncWebCrawler.awarmup | `async def awarmup(self):` | Initialize the crawler with warm-up sequence. This method: 1. Logs initialization info 2. Sets up b... (truncated) |
| METHOD | AsyncWebCrawler.nullcontext | `async def nullcontext(self):` | Asynchronous null context manager. |
| METHOD | AsyncWebCrawler.arun | `async def arun(self, url, config=None, word_count_threshold=MIN_WORD_THRESHOLD, extraction_strategy=None, chunking_strategy=RegexChunking(), content_filter=None, cache_mode=None, bypass_cache=False, disable_cache=False, no_cache_read=False, no_cache_write=False, css_selector=None, screenshot=False, pdf=False, user_agent=None, verbose=True, **kwargs):` | Runs the crawler for a single source: URL (web, local file, or raw HTML). Migration Guide: Old way ... (truncated) |
| METHOD | AsyncWebCrawler.aprocess_html | `async def aprocess_html(self, url, html, extracted_content, config, screenshot, pdf_data, verbose, **kwargs):` | Process HTML content using the provided configuration. Args: url: The URL being processed h... (truncated) |
| METHOD | AsyncWebCrawler.arun_many | `async def arun_many(self, urls, config=None, word_count_threshold=MIN_WORD_THRESHOLD, extraction_strategy=None, chunking_strategy=RegexChunking(), content_filter=None, cache_mode=None, bypass_cache=False, css_selector=None, screenshot=False, pdf=False, user_agent=None, verbose=True, **kwargs):` | Runs the crawler for multiple URLs concurrently. Migration Guide: Old way (deprecated): results... (truncated) |
| METHOD | AsyncWebCrawler.aclear_cache | `async def aclear_cache(self):` | Clear the cache database. |
| METHOD | AsyncWebCrawler.aflush_cache | `async def aflush_cache(self):` | Flush the cache database. |
| METHOD | AsyncWebCrawler.aget_cache_size | `async def aget_cache_size(self):` | Get the total number of cached items. |
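AsyncWebCrawler's docstring describes two equivalent lifecycles: explicit `start()`/`close()` and `async with`. A minimal stand-in class (not the real crawler, which warms up a browser in `start`) shows how `__aenter__`/`__aexit__` delegate to those methods:

```python
# Stand-in illustrating the dual lifecycle pattern AsyncWebCrawler documents.
import asyncio

class MiniCrawler:
    def __init__(self):
        self.ready = False

    async def start(self):
        self.ready = True   # the real crawler sets up the browser here
        return self

    async def close(self):
        self.ready = False  # the real crawler tears down the browser here

    async def __aenter__(self):
        return await self.start()

    async def __aexit__(self, exc_type, exc_val, exc_tb):
        await self.close()

async def demo() -> bool:
    # "async with" form: start/close are called automatically.
    async with MiniCrawler() as crawler:
        return crawler.ready
```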
crawl4ai/async_crawler_strategy.py
| Type | Name | Signature | Docstring |
|---|---|---|---|
| MODULE | async_crawler_strategy.py | `` | |
| CLASS | RemoteConnector | `class RemoteConnector:` | Manages the browser process and context. This class allows connecting to the browser using the CDP proto... (truncated) |
| METHOD | RemoteConnector.init | `def __init__(self, browser_type='chromium', user_data_dir=None, headless=False, logger=None, host='localhost', debugging_port=9222):` | Initialize the RemoteConnector instance. Args: browser_type (str): The type of browser to launch... (truncated) |
| METHOD | RemoteConnector.start | `async def start(self):` | Starts the browser process and returns the CDP endpoint URL. If user_data_dir is not provided, creat... (truncated) |
| METHOD | RemoteConnector._monitor_browser_process | `async def _monitor_browser_process(self):` | Monitor the browser process for unexpected termination. How it works: 1. Read stdout and stderr fro... (truncated) |
| METHOD | RemoteConnector._get_browser_path | `def _get_browser_path(self):` | Returns the browser executable path based on OS and browser type. |
| METHOD | RemoteConnector._get_browser_args | `def _get_browser_args(self):` | Returns browser-specific command line arguments. |
| METHOD | RemoteConnector.cleanup | `async def cleanup(self):` | Cleanup browser process and temporary directory. |
| CLASS | BrowserManager | `class BrowserManager:` | Manages the browser instance and context. Attributes: config (BrowserConfig): Configuration ob... (truncated) |
| METHOD | BrowserManager.init | `def __init__(self, browser_config, logger=None):` | Initialize the BrowserManager with a browser configuration. Args: browser_config (BrowserConfig... (truncated) |
| METHOD | BrowserManager.start | `async def start(self):` | Start the browser instance and set up the default context. How it works: 1. Check if Playwright is ... (truncated) |
| METHOD | BrowserManager._build_browser_args | `def _build_browser_args(self):` | Build browser launch arguments from config. |
| METHOD | BrowserManager.setup_context | `async def setup_context(self, context, crawlerRunConfig, is_default=False):` | Set up a browser context with the configured options. How it works: 1. Set extra HTTP headers if pr... (truncated) |
| METHOD | BrowserManager.create_browser_context | `async def create_browser_context(self):` | Creates and returns a new browser context with configured settings. Applies text-only mode settings ... (truncated) |
| METHOD | BrowserManager.get_page | `async def get_page(self, crawlerRunConfig):` | Get a page for the given session ID, creating a new one if needed. Args: crawlerRunConfig (Craw... (truncated) |
| METHOD | BrowserManager.kill_session | `async def kill_session(self, session_id):` | Kill a browser session and clean up resources. Args: session_id (str): The session ID to kill... (truncated) |
| METHOD | BrowserManager._cleanup_expired_sessions | `def _cleanup_expired_sessions(self):` | Clean up expired sessions based on TTL. |
| METHOD | BrowserManager.close | `async def close(self):` | Close all browser resources and clean up. |
| CLASS | AsyncCrawlerStrategy | `class AsyncCrawlerStrategy:` | Abstract base class for crawler strategies. Subclasses must implement the crawl method. |
| METHOD | AsyncCrawlerStrategy.crawl | `async def crawl(self, url, **kwargs):` | |
| CLASS | AsyncPlaywrightCrawlerStrategy | `class AsyncPlaywrightCrawlerStrategy:` | Crawler strategy using Playwright. Attributes: browser_config (BrowserConfig): Configuration ob... (truncated) |
| METHOD | AsyncPlaywrightCrawlerStrategy.init | `def __init__(self, browser_config=None, logger=None, **kwargs):` | Initialize the AsyncPlaywrightCrawlerStrategy with a browser configuration. Args: browser_confi... (truncated) |
| METHOD | AsyncPlaywrightCrawlerStrategy.aenter | `async def __aenter__(self):` | |
| METHOD | AsyncPlaywrightCrawlerStrategy.aexit | `async def __aexit__(self, exc_type, exc_val, exc_tb):` | |
| METHOD | AsyncPlaywrightCrawlerStrategy.start | `async def start(self):` | Start the browser and initialize the browser manager. |
| METHOD | AsyncPlaywrightCrawlerStrategy.close | `async def close(self):` | Close the browser and clean up resources. |
| METHOD | AsyncPlaywrightCrawlerStrategy.kill_session | `async def kill_session(self, session_id):` | Kill a browser session and clean up resources. Args: session_id (str): The ID of the session to... (truncated) |
| METHOD | AsyncPlaywrightCrawlerStrategy.set_hook | `def set_hook(self, hook_type, hook):` | Set a hook function for a specific hook type. The hook types include: - on_browser_created... (truncated) |
| METHOD | AsyncPlaywrightCrawlerStrategy.execute_hook | `async def execute_hook(self, hook_type, *args, **kwargs):` | Execute a hook function for a specific hook type. Args: hook_type (str): The type of the hook. ... (truncated) |
| METHOD | AsyncPlaywrightCrawlerStrategy.update_user_agent | `def update_user_agent(self, user_agent):` | Update the user agent for the browser. Args: user_agent (str): The new user agent string. ... (truncated) |
| METHOD | AsyncPlaywrightCrawlerStrategy.set_custom_headers | `def set_custom_headers(self, headers):` | Set custom headers for the browser. Args: headers (Dict[str, str]): A dictionary of headers to... (truncated) |
| METHOD | AsyncPlaywrightCrawlerStrategy.smart_wait | `async def smart_wait(self, page, wait_for, timeout=30000):` | Wait for a condition in a smart way. This function works as follows: 1. If wait_for starts with 'js:... (truncated) |
| METHOD | AsyncPlaywrightCrawlerStrategy.csp_compliant_wait | `async def csp_compliant_wait(self, page, user_wait_function, timeout=30000):` | Wait for a condition in a CSP-compliant way. Args: page: Playwright page object user_wait_f... (truncated) |
| METHOD | AsyncPlaywrightCrawlerStrategy.process_iframes | `async def process_iframes(self, page):` | Process iframes on a page. This function will extract the content of each iframe and replace it with... (truncated) |
| METHOD | AsyncPlaywrightCrawlerStrategy.create_session | `async def create_session(self, **kwargs):` | Creates a new browser session and returns its ID. A browser session is a unique opened page that can be r... (truncated) |
| METHOD | AsyncPlaywrightCrawlerStrategy.crawl | `async def crawl(self, url, config, **kwargs):` | Crawls a given URL or processes raw HTML/local file content based on the URL prefix. Args: url ... (truncated) |
| METHOD | AsyncPlaywrightCrawlerStrategy._crawl_web | `async def _crawl_web(self, url, config):` | Internal method to crawl web URLs with the specified configuration. Args: url (str): The web UR... (truncated) |
| METHOD | AsyncPlaywrightCrawlerStrategy._handle_full_page_scan | `async def _handle_full_page_scan(self, page, scroll_delay):` | Helper method to handle full page scanning. How it works: 1. Get the viewport height. 2. Scroll to... (truncated) |
| METHOD | AsyncPlaywrightCrawlerStrategy._handle_download | `async def _handle_download(self, download):` | Handle file downloads. How it works: 1. Get the suggested filename. 2. Get the download path. 3. Lo... (truncated) |
| METHOD | AsyncPlaywrightCrawlerStrategy.remove_overlay_elements | `async def remove_overlay_elements(self, page):` | Removes popup overlays, modals, cookie notices, and other intrusive elements from the page. Args: ... (truncated) |
| METHOD | AsyncPlaywrightCrawlerStrategy.export_pdf | `async def export_pdf(self, page):` | Exports the current page as a PDF. Args: page (Page): The Playwright page object Returns: ... (truncated) |
| METHOD | AsyncPlaywrightCrawlerStrategy.take_screenshot | `async def take_screenshot(self, page, **kwargs):` | Take a screenshot of the current page. Args: page (Page): The Playwright page object kwargs... (truncated) |
| METHOD | AsyncPlaywrightCrawlerStrategy.take_screenshot_from_pdf | `async def take_screenshot_from_pdf(self, pdf_data):` | Convert the first page of the PDF to a screenshot. Requires pdf2image and poppler. Args: ... (truncated) |
| METHOD | AsyncPlaywrightCrawlerStrategy.take_screenshot_scroller | `async def take_screenshot_scroller(self, page, **kwargs):` | Attempt to set a large viewport and take a full-page screenshot. If still too large, segment the pag... (truncated) |
| METHOD | AsyncPlaywrightCrawlerStrategy.take_screenshot_naive | `async def take_screenshot_naive(self, page):` | Takes a screenshot of the current page. Args: page (Page): The Playwright page instance Return... (truncated) |
| METHOD | AsyncPlaywrightCrawlerStrategy.export_storage_state | `async def export_storage_state(self, path=None):` | Exports the current storage state (cookies, localStorage, sessionStorage) to a JSON file at the spec... (truncated) |
| METHOD | AsyncPlaywrightCrawlerStrategy.robust_execute_user_script | `async def robust_execute_user_script(self, page, js_code):` | Executes user-provided JavaScript code with proper error handling and context, supporting both synch... (truncated) |
| METHOD | AsyncPlaywrightCrawlerStrategy.execute_user_script | `async def execute_user_script(self, page, js_code):` | Executes user-provided JavaScript code with proper error handling and context. Args: page: Play... (truncated) |
| METHOD | AsyncPlaywrightCrawlerStrategy.check_visibility | `async def check_visibility(self, page):` | Checks if an element is visible on the page. Args: page: Playwright page object Returns: ... (truncated) |
| METHOD | AsyncPlaywrightCrawlerStrategy.safe_scroll | `async def safe_scroll(self, page, x, y):` | Safely scroll the page with rendering time. Args: page: Playwright page object x: Horizonta... (truncated) |
| METHOD | AsyncPlaywrightCrawlerStrategy.csp_scroll_to | `async def csp_scroll_to(self, page, x, y):` | Performs a CSP-compliant scroll operation and returns the result status. Args: page: Playwright... (truncated) |
| METHOD | AsyncPlaywrightCrawlerStrategy.get_page_dimensions | `async def get_page_dimensions(self, page):` | Get the dimensions of the page. Args: page: Playwright page object Returns: Dict conta... (truncated) |
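The smart_wait docstring mentions a 'js:' prefix for JavaScript wait conditions; a bare string is presumably treated as a CSS selector, and the 'css:' prefix shown here is an assumption. A tiny sketch of just the dispatch step, without the Playwright calls the real method makes:

```python
# Sketch of smart_wait's condition dispatch. Returns a (kind, condition)
# pair; the real method would then call the matching Playwright wait API.
def classify_wait_condition(wait_for: str) -> tuple:
    if wait_for.startswith("js:"):
        return ("js", wait_for[3:].strip())
    if wait_for.startswith("css:"):          # assumed prefix, for symmetry
        return ("css", wait_for[4:].strip())
    return ("css", wait_for.strip())         # assume a bare CSS selector
```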
crawl4ai/content_scraping_strategy.py
| Type | Name | Signature | Docstring |
|---|---|---|---|
| MODULE | content_scraping_strategy.py | `` | |
| FUNCTION | parse_dimension | `def parse_dimension(dimension):` | |
| FUNCTION | fetch_image_file_size | `def fetch_image_file_size(img, base_url):` | |
| CLASS | ContentScrapingStrategy | `class ContentScrapingStrategy:` | |
| METHOD | ContentScrapingStrategy.scrap | `def scrap(self, url, html, **kwargs):` | |
| METHOD | ContentScrapingStrategy.ascrap | `async def ascrap(self, url, html, **kwargs):` | |
| CLASS | WebScrapingStrategy | `class WebScrapingStrategy:` | Class for web content scraping. Perhaps the most important class. How it works: 1. Extract content... (truncated) |
| METHOD | WebScrapingStrategy.init | `def __init__(self, logger=None):` | |
| METHOD | WebScrapingStrategy._log | `def _log(self, level, message, tag='SCRAPE', **kwargs):` | Helper method to safely use logger. |
| METHOD | WebScrapingStrategy.scrap | `def scrap(self, url, html, **kwargs):` | Main entry point for content scraping. Args: url (str): The URL of the page to scrape. ht... (truncated) |
| METHOD | WebScrapingStrategy.ascrap | `async def ascrap(self, url, html, **kwargs):` | Main entry point for asynchronous content scraping. Args: url (str): The URL of the page to scr... (truncated) |
| METHOD | WebScrapingStrategy._generate_markdown_content | `def _generate_markdown_content(self, cleaned_html, html, url, success, **kwargs):` | Generate markdown content from cleaned HTML. Args: cleaned_html (str): The cleaned HTML content... (truncated) |
| METHOD | WebScrapingStrategy.flatten_nested_elements | `def flatten_nested_elements(self, node):` | Flatten nested elements in an HTML tree. Args: node (Tag): The root node of the HTML tree. Retu... (truncated) |
| METHOD | WebScrapingStrategy.find_closest_parent_with_useful_text | `def find_closest_parent_with_useful_text(self, tag, **kwargs):` | Find the closest parent with useful text. Args: tag (Tag): The starting tag to search from. ... (truncated) |
| METHOD | WebScrapingStrategy.remove_unwanted_attributes | `def remove_unwanted_attributes(self, element, important_attrs, keep_data_attributes=False):` | Remove unwanted attributes from an HTML element. Args: element (Tag): The HTML element to r... (truncated) |
| METHOD | WebScrapingStrategy.process_image | `def process_image(self, img, url, index, total_images, **kwargs):` | Process an image element. How it works: 1. Check if the image has valid display and inside undesire... (truncated) |
| METHOD | WebScrapingStrategy.process_element | `def process_element(self, url, element, **kwargs):` | Process an HTML element. How it works: 1. Check if the element is an image, video, or audio. 2. Ext... (truncated) |
| METHOD | WebScrapingStrategy._process_element | `def _process_element(self, url, element, media, internal_links_dict, external_links_dict, **kwargs):` | Process an HTML element. |
| METHOD | WebScrapingStrategy._scrap | `def _scrap(self, url, html, word_count_threshold=MIN_WORD_THRESHOLD, css_selector=None, **kwargs):` | Extract content from HTML using BeautifulSoup. Args: url (str): The URL of the page to scrape. ... (truncated) |
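Helpers like parse_dimension above are only named in this digest. A plausible sketch of its behavior, assuming it pulls a numeric pixel-style value out of strings such as "100px" (the real implementation may handle more cases differently):

```python
# Hypothetical sketch of a dimension parser: extract the leading numeric
# value from a CSS-ish dimension string, or None if there is none.
import re

def parse_dimension(dimension):
    if dimension is None:
        return None
    match = re.search(r"(\d+(?:\.\d+)?)", str(dimension))
    return float(match.group(1)) if match else None
```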
crawl4ai/markdown_generation_strategy.py
| Type | Name | Signature | Docstring |
|---|---|---|---|
| MODULE | markdown_generation_strategy.py | `` | |
| FUNCTION | fast_urljoin | `def fast_urljoin(base, url):` | Fast URL joining for common cases. |
| CLASS | MarkdownGenerationStrategy | `class MarkdownGenerationStrategy:` | Abstract base class for markdown generation strategies. |
| METHOD | MarkdownGenerationStrategy.init | `def __init__(self, content_filter=None, options=None):` | |
| METHOD | MarkdownGenerationStrategy.generate_markdown | `def generate_markdown(self, cleaned_html, base_url='', html2text_options=None, content_filter=None, citations=True, **kwargs):` | Generate markdown from cleaned HTML. |
| CLASS | DefaultMarkdownGenerator | `class DefaultMarkdownGenerator:` | Default implementation of markdown generation strategy. How it works: 1. Generate raw markdown from... (truncated) |
| METHOD | DefaultMarkdownGenerator.init | `def __init__(self, content_filter=None, options=None):` | |
| METHOD | DefaultMarkdownGenerator.convert_links_to_citations | `def convert_links_to_citations(self, markdown, base_url=''):` | Convert links in markdown to citations. How it works: 1. Find all links in the markdown. 2. Convert... (truncated) |
| METHOD | DefaultMarkdownGenerator.generate_markdown | `def generate_markdown(self, cleaned_html, base_url='', html2text_options=None, options=None, content_filter=None, citations=True, **kwargs):` | Generate markdown with citations from cleaned HTML. How it works: 1. Generate raw markdown from cle... (truncated) |
crawl4ai/content_filter_strategy.py
| Type | Name | Signature | Docstring |
|---|---|---|---|
| MODULE | content_filter_strategy.py | `` | |
| CLASS | RelevantContentFilter | `class RelevantContentFilter:` | Abstract base class for content filtering strategies. |
| METHOD | RelevantContentFilter.init | `def __init__(self, user_query=None):` | |
| METHOD | RelevantContentFilter.filter_content | `def filter_content(self, html):` | Abstract method to be implemented by specific filtering strategies. |
| METHOD | RelevantContentFilter.extract_page_query | `def extract_page_query(self, soup, body):` | Common method to extract page metadata with fallbacks. |
| METHOD | RelevantContentFilter.extract_text_chunks | `def extract_text_chunks(self, body, min_word_threshold=None):` | Extracts text chunks from a BeautifulSoup body element while preserving order. Returns list of tuple... (truncated) |
| METHOD | RelevantContentFilter._deprecated_extract_text_chunks | `def _deprecated_extract_text_chunks(self, soup):` | Common method for extracting text chunks. |
| METHOD | RelevantContentFilter.is_excluded | `def is_excluded(self, tag):` | Common method for exclusion logic. |
| METHOD | RelevantContentFilter.clean_element | `def clean_element(self, tag):` | Common method for cleaning HTML elements with minimal overhead. |
| CLASS | BM25ContentFilter | `class BM25ContentFilter:` | Content filtering using BM25 algorithm with priority tag handling. How it works: 1. Extracts page m... (truncated) |
| METHOD | BM25ContentFilter.init | `def __init__(self, user_query=None, bm25_threshold=1.0, language='english'):` | Initializes the BM25ContentFilter class; if no query is provided, falls back to page metadata. Note: If no ... (truncated) |
| METHOD | BM25ContentFilter.filter_content | `def filter_content(self, html, min_word_threshold=None):` | Implements content filtering using BM25 algorithm with priority tag handling. Note: This method... (truncated) |
| CLASS | PruningContentFilter | `class PruningContentFilter:` | Content filtering using pruning algorithm with dynamic threshold. How it works: 1. Extracts page me... (truncated) |
| METHOD | PruningContentFilter.init | `def __init__(self, user_query=None, min_word_threshold=None, threshold_type='fixed', threshold=0.48):` | Initializes the PruningContentFilter class; if no query is provided, falls back to page metadata. Note: If ... (truncated) |
| METHOD | PruningContentFilter.filter_content | `def filter_content(self, html, min_word_threshold=None):` | Implements content filtering using pruning algorithm with dynamic threshold. Note: This method impl... (truncated) |
| METHOD | PruningContentFilter._remove_comments | `def _remove_comments(self, soup):` | Removes HTML comments. |
| METHOD | PruningContentFilter._remove_unwanted_tags | `def _remove_unwanted_tags(self, soup):` | Removes unwanted tags. |
| METHOD | PruningContentFilter._prune_tree | `def _prune_tree(self, node):` | Prunes the tree starting from the given node. Args: node (Tag): The node from which the pruning... (truncated) |
| METHOD | PruningContentFilter._compute_composite_score | `def _compute_composite_score(self, metrics, text_len, tag_len, link_text_len):` | Computes the composite score. |
| METHOD | PruningContentFilter._compute_class_id_weight | `def _compute_class_id_weight(self, node):` | Computes the class ID weight. |
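The inputs to _compute_composite_score (text length, tag length, link-text length) suggest a text-density/link-density heuristic. The formula below is invented for illustration only and is not the library's actual scoring:

```python
# Illustrative composite score in the spirit of PruningContentFilter:
# favor nodes with lots of text relative to markup, penalize nodes
# whose text is mostly link text. Weights and shape are assumptions.
def composite_score(text_len: int, tag_len: int, link_text_len: int) -> float:
    if text_len == 0:
        return 0.0
    link_density = link_text_len / text_len          # share of text inside links
    text_to_markup = text_len / max(tag_len, 1)      # text vs. markup volume
    return (1.0 - link_density) * min(text_to_markup, 1.0)
```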
crawl4ai/extraction_strategy.py
| Type | Name | Signature | Docstring |
|---|---|---|---|
| MODULE | extraction_strategy.py | `` | |
| CLASS | ExtractionStrategy | `class ExtractionStrategy:` | Abstract base class for all extraction strategies. |
| METHOD | ExtractionStrategy.init | `def __init__(self, input_format='markdown', **kwargs):` | Initialize the extraction strategy. Args: input_format: Content format to use for extraction. ... (truncated) |
| METHOD | ExtractionStrategy.extract | `def extract(self, url, html, *q, **kwargs):` | Extract meaningful blocks or chunks from the given HTML. :param url: The URL of the webpage. :param... (truncated) |
| METHOD | ExtractionStrategy.run | `def run(self, url, sections, *q, **kwargs):` | Process sections of text in parallel by default. :param url: The URL of the webpage. :param section... (truncated) |
| CLASS | NoExtractionStrategy | `class NoExtractionStrategy:` | A strategy that does not extract any meaningful content from the HTML. It simply returns the entire ... (truncated) |
| METHOD | NoExtractionStrategy.extract | `def extract(self, url, html, *q, **kwargs):` | Extract meaningful blocks or chunks from the given HTML. |
| METHOD | NoExtractionStrategy.run | `def run(self, url, sections, *q, **kwargs):` | |
| CLASS | LLMExtractionStrategy | `class LLMExtractionStrategy:` | A strategy that uses an LLM to extract meaningful content from the HTML. Attributes: provider: ... (truncated) |
| METHOD | LLMExtractionStrategy.init | `def __init__(self, provider=DEFAULT_PROVIDER, api_token=None, instruction=None, schema=None, extraction_type='block', **kwargs):` | Initialize the strategy with clustering parameters. Args: provider: The provider to use for ext... (truncated) |
| METHOD | LLMExtractionStrategy.extract | `def extract(self, url, ix, html):` | Extract meaningful blocks or chunks from the given HTML using an LLM. How it works: 1. Construct a ... (truncated) |
| METHOD | LLMExtractionStrategy._merge | `def _merge(self, documents, chunk_token_threshold, overlap):` | Merge documents into sections based on chunk_token_threshold and overlap. |
| METHOD | LLMExtractionStrategy.run | `def run(self, url, sections):` | Process sections sequentially with a delay for rate limiting issues, specifically for LLMExtractionS... (truncated) |
| METHOD | LLMExtractionStrategy.show_usage | `def show_usage(self):` | Print a detailed token usage report showing total and per-request usage. |
| CLASS | CosineStrategy | `class CosineStrategy:` | Extract meaningful blocks or chunks from the given HTML using cosine similarity. How it works: 1. P... (truncated) |
| METHOD | CosineStrategy.init | `def __init__(self, semantic_filter=None, word_count_threshold=10, max_dist=0.2, linkage_method='ward', top_k=3, model_name='sentence-transformers/all-MiniLM-L6-v2', sim_threshold=0.3, **kwargs):` | Initialize the strategy with clustering parameters. Args: semantic_filter (str): A keyword filt... (truncated) |
| METHOD | CosineStrategy.filter_documents_embeddings | `def filter_documents_embeddings(self, documents, semantic_filter, at_least_k=20):` | Filter and sort documents based on the cosine similarity of their embeddings with the semantic_filte... (truncated) |
| METHOD | CosineStrategy.get_embeddings | `def get_embeddings(self, sentences, batch_size=None, bypass_buffer=False):` | Get BERT embeddings for a list of sentences. Args: sentences (List[str]): A list of text chunks... (truncated) |
| METHOD | CosineStrategy.hierarchical_clustering | `def hierarchical_clustering(self, sentences, embeddings=None):` | Perform hierarchical clustering on sentences and return cluster labels. Args: sentences (List[s... (truncated) |
| METHOD | CosineStrategy.filter_clusters_by_word_count | `def filter_clusters_by_word_count(self, clusters):` | Filter clusters to remove those with a word count below the threshold. Args: clusters (Dict[int... (truncated) |
| METHOD | CosineStrategy.extract | `def extract(self, url, html, *q, **kwargs):` | Extract clusters from HTML content using hierarchical clustering. Args: url (str): The URL of t... (truncated) |
| METHOD | CosineStrategy.run | `def run(self, url, sections, *q, **kwargs):` | Process sections using hierarchical clustering. Args: url (str): The URL of the webpage. se... (truncated) |
| CLASS | JsonElementExtractionStrategy | `class JsonElementExtractionStrategy:` | Abstract base class for extracting structured JSON from HTML content. How it works: 1. ... (truncated) |
| METHOD | JsonElementExtractionStrategy.init | `def __init__(self, schema, **kwargs):` | Initialize the JSON element extraction strategy with a schema. Args: schema (Dict[str, Any]): T... (truncated) |
| METHOD | JsonElementExtractionStrategy.extract | `def extract(self, url, html_content, *q, **kwargs):` | Extract structured data from HTML content. How it works: 1. Parses the HTML content using the `_par... (truncated) |
| METHOD | JsonElementExtractionStrategy._parse_html | `def _parse_html(self, html_content):` | Parse HTML content into appropriate format. |
| METHOD | JsonElementExtractionStrategy._get_base_elements | `def _get_base_elements(self, parsed_html, selector):` | Get all base elements using the selector. |
| METHOD | JsonElementExtractionStrategy._get_elements | `def _get_elements(self, element, selector):` | Get child elements using the selector. |
| METHOD | JsonElementExtractionStrategy._extract_field | `def _extract_field(self, element, field):` | |
| METHOD | JsonElementExtractionStrategy._extract_single_field | `def _extract_single_field(self, element, field):` | Extract a single field based on its type. How it works: 1. Selects the target element using the fie... (truncated) |
| METHOD | JsonElementExtractionStrategy._extract_list_item | `def _extract_list_item(self, element, fields):` | |
| METHOD | JsonElementExtractionStrategy._extract_item | `def _extract_item(self, element, fields):` | Extracts fields from a given element. How it works: 1. Iterates through the fields defined in the s... (truncated) |
| METHOD | JsonElementExtractionStrategy._apply_transform | `def _apply_transform(self, value, transform):` | Apply a transformation to a value. How it works: 1. Checks the transformation type (e.g., `lowercas... (truncated) |
| METHOD | JsonElementExtractionStrategy._compute_field | `def _compute_field(self, item, field):` | |
| METHOD | JsonElementExtractionStrategy.run | `def run(self, url, sections, *q, **kwargs):` | Run the extraction strategy on a combined HTML content. How it works: 1. Combines multiple HTML sec... (truncated) |
| METHOD | JsonElementExtractionStrategy._get_element_text | `def _get_element_text(self, element):` | Get text content from element. |
| METHOD | JsonElementExtractionStrategy._get_element_html | `def _get_element_html(self, element):` | Get HTML content from element. |
| METHOD | JsonElementExtractionStrategy._get_element_attribute | `def _get_element_attribute(self, element, attribute):` | Get attribute value from element. |
| CLASS | JsonCssExtractionStrategy | `class JsonCssExtractionStrategy:` | Concrete implementation of JsonElementExtractionStrategy using CSS selectors. How it works: 1. Pa... (truncated) |
| METHOD | JsonCssExtractionStrategy.init | `def __init__(self, schema, **kwargs):` | |
| METHOD | JsonCssExtractionStrategy._parse_html | `def _parse_html(self, html_content):` | |
| METHOD | JsonCssExtractionStrategy._get_base_elements | `def _get_base_elements(self, parsed_html, selector):` | |
| METHOD | JsonCssExtractionStrategy._get_elements | `def _get_elements(self, element, selector):` | |
| METHOD | JsonCssExtractionStrategy._get_element_text | `def _get_element_text(self, element):` | |
| METHOD | JsonCssExtractionStrategy._get_element_html | `def _get_element_html(self, element):` | |
| METHOD | JsonCssExtractionStrategy._get_element_attribute | `def _get_element_attribute(self, element, attribute):` | |
| CLASS | JsonXPathExtractionStrategy | `class JsonXPathExtractionStrategy:` | Concrete implementation of JsonElementExtractionStrategy using XPath selectors. How it works: 1. ... (truncated) |
| METHOD | JsonXPathExtractionStrategy.init | `def __init__(self, schema, **kwargs):` | |
|
| METHOD | JsonXPathExtractionStrategy._parse_html | def _parse_html(self, html_content): |
|
| METHOD | JsonXPathExtractionStrategy._get_base_elements | def _get_base_elements(self, parsed_html, selector): |
|
| METHOD | JsonXPathExtractionStrategy._css_to_xpath | def _css_to_xpath(self, css_selector): |
Convert CSS selector to XPath if needed |
| METHOD | JsonXPathExtractionStrategy._basic_css_to_xpath | def _basic_css_to_xpath(self, css_selector): |
Basic CSS to XPath conversion for common cases |
| METHOD | JsonXPathExtractionStrategy._get_elements | def _get_elements(self, element, selector): |
|
| METHOD | JsonXPathExtractionStrategy._get_element_text | def _get_element_text(self, element): |
|
| METHOD | JsonXPathExtractionStrategy._get_element_html | def _get_element_html(self, element): |
|
| METHOD | JsonXPathExtractionStrategy._get_element_attribute | def _get_element_attribute(self, element, attribute): |
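The schema-driven flow described for `JsonElementExtractionStrategy` (select base elements, extract each field, apply named transforms, emit one JSON object per element) can be sketched without a real DOM. Everything below is illustrative: elements are plain dicts standing in for parsed HTML nodes, and the helper names mirror but do not reproduce crawl4ai's actual methods, which operate on CSS/XPath selections over parsed HTML.

```python
def apply_transform(value, transform):
    # Analogue of _apply_transform: look up a named string transformation.
    return {
        "lowercase": str.lower,
        "uppercase": str.upper,
        "strip": str.strip,
    }[transform](value)

def extract_item(element, fields):
    # Analogue of _extract_item: build one JSON object per base element.
    item = {}
    for field in fields:
        value = element.get(field["selector"])
        if value is None:
            continue  # missing field: skip rather than fail
        for transform in field.get("transform", []):
            value = apply_transform(value, transform)
        item[field["name"]] = value
    return item

def run(base_elements, schema):
    # Analogue of run: one extracted item per base element.
    return [extract_item(el, schema["fields"]) for el in base_elements]

schema = {
    "fields": [
        {"name": "title", "selector": "h2", "transform": ["strip"]},
        {"name": "tag", "selector": "span.tag", "transform": ["lowercase"]},
    ]
}
elements = [{"h2": "  First Post ", "span.tag": "NEWS"}]
print(run(elements, schema))  # [{'title': 'First Post', 'tag': 'news'}]
```

The concrete subclasses differ only in how they resolve selectors; the extraction loop itself stays selector-agnostic, which is why the base class can own it.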
crawl4ai/chunking_strategy.py
| Type | Name | Signature | Docstring |
|---|---|---|---|
| MODULE | chunking_strategy.py | `` | |
| CLASS | ChunkingStrategy | class ChunkingStrategy: | Abstract base class for chunking strategies. |
| METHOD | ChunkingStrategy.chunk | def chunk(self, text): | Abstract method to chunk the given text. Args: text (str): The text to chunk. Returns: lis... (truncated) |
| CLASS | IdentityChunking | class IdentityChunking: | Chunking strategy that returns the input text as a single chunk. |
| METHOD | IdentityChunking.chunk | def chunk(self, text): | |
| CLASS | RegexChunking | class RegexChunking: | Chunking strategy that splits text based on regular expression patterns. |
| METHOD | RegexChunking.__init__ | def __init__(self, patterns=None, **kwargs): | Initialize the RegexChunking object. Args: patterns (list): A list of regular expression patter... (truncated) |
| METHOD | RegexChunking.chunk | def chunk(self, text): | |
| CLASS | NlpSentenceChunking | class NlpSentenceChunking: | Chunking strategy that splits text into sentences using NLTK's sentence tokenizer. |
| METHOD | NlpSentenceChunking.__init__ | def __init__(self, **kwargs): | Initialize the NlpSentenceChunking object. |
| METHOD | NlpSentenceChunking.chunk | def chunk(self, text): | |
| CLASS | TopicSegmentationChunking | class TopicSegmentationChunking: | Chunking strategy that segments text into topics using NLTK's TextTilingTokenizer. How it works: 1.... (truncated) |
| METHOD | TopicSegmentationChunking.__init__ | def __init__(self, num_keywords=3, **kwargs): | Initialize the TopicSegmentationChunking object. Args: num_keywords (int): The number of keywor... (truncated) |
| METHOD | TopicSegmentationChunking.chunk | def chunk(self, text): | |
| METHOD | TopicSegmentationChunking.extract_keywords | def extract_keywords(self, text): | |
| METHOD | TopicSegmentationChunking.chunk_with_topics | def chunk_with_topics(self, text): | |
| CLASS | FixedLengthWordChunking | class FixedLengthWordChunking: | Chunking strategy that splits text into fixed-length word chunks. How it works: 1. Split the text i... (truncated) |
| METHOD | FixedLengthWordChunking.__init__ | def __init__(self, chunk_size=100, **kwargs): | Initialize the fixed-length word chunking strategy with the given chunk size. Args: chunk_size ... (truncated) |
| METHOD | FixedLengthWordChunking.chunk | def chunk(self, text): | |
| CLASS | SlidingWindowChunking | class SlidingWindowChunking: | Chunking strategy that splits text into overlapping word chunks. How it works: 1. Split the text in... (truncated) |
| METHOD | SlidingWindowChunking.__init__ | def __init__(self, window_size=100, step=50, **kwargs): | Initialize the sliding window chunking strategy with the given window size and step size. Args: ... (truncated) |
| METHOD | SlidingWindowChunking.chunk | def chunk(self, text): | |
| CLASS | OverlappingWindowChunking | class OverlappingWindowChunking: | Chunking strategy that splits text into overlapping word chunks. How it works: 1. Split the text in... (truncated) |
| METHOD | OverlappingWindowChunking.__init__ | def __init__(self, window_size=1000, overlap=100, **kwargs): | Initialize the overlapping window chunking strategy with the given window size and overlap size. Ar... (truncated) |
| METHOD | OverlappingWindowChunking.chunk | def chunk(self, text): | |
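The two window-based strategies above differ only in how the window advances: `SlidingWindowChunking` moves by a fixed `step`, while `OverlappingWindowChunking` repeats the last `overlap` words of each chunk. A minimal sketch, assuming whitespace tokenization (the function names are illustrative, not crawl4ai's API):

```python
def sliding_window_chunks(text, window_size=100, step=50):
    # Fixed-size windows advancing by `step` words.
    words = text.split()
    if len(words) <= window_size:
        return [" ".join(words)]
    return [" ".join(words[i:i + window_size])
            for i in range(0, len(words) - window_size + 1, step)]

def overlapping_window_chunks(text, window_size=1000, overlap=100):
    # Each chunk repeats the last `overlap` words of the previous one.
    words = text.split()
    chunks, start = [], 0
    while start < len(words):
        chunks.append(" ".join(words[start:start + window_size]))
        if start + window_size >= len(words):
            break
        start += window_size - overlap
    return chunks

text = " ".join(str(i) for i in range(10))
print(sliding_window_chunks(text, window_size=4, step=2))
# ['0 1 2 3', '2 3 4 5', '4 5 6 7', '6 7 8 9']
```

The overlap matters for downstream LLM extraction: it keeps a sentence that straddles a chunk boundary fully visible in at least one chunk.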
crawl4ai/user_agent_generator.py
| Type | Name | Signature | Docstring |
|---|---|---|---|
| MODULE | user_agent_generator.py | `` | |
| CLASS | UserAgentGenerator | class UserAgentGenerator: | Generate random user agents with specified constraints. Attributes: desktop_platforms (dict): A... (truncated) |
| METHOD | UserAgentGenerator.__init__ | def __init__(self): | |
| METHOD | UserAgentGenerator.get_browser_stack | def get_browser_stack(self, num_browsers=1): | Get a valid combination of browser versions. How it works: 1. Check if the number of browsers is su... (truncated) |
| METHOD | UserAgentGenerator.generate | def generate(self, device_type=None, os_type=None, device_brand=None, browser_type=None, num_browsers=3): | Generate a random user agent with specified constraints. Args: device_type: 'desktop' or 'mobil... (truncated) |
| METHOD | UserAgentGenerator.generate_with_client_hints | def generate_with_client_hints(self, **kwargs): | Generate both user agent and matching client hints |
| METHOD | UserAgentGenerator.get_random_platform | def get_random_platform(self, device_type, os_type, device_brand): | Helper method to get random platform based on constraints |
| METHOD | UserAgentGenerator.parse_user_agent | def parse_user_agent(self, user_agent): | Parse a user agent string to extract browser and version information |
| METHOD | UserAgentGenerator.generate_client_hints | def generate_client_hints(self, user_agent): | Generate Sec-CH-UA header value based on user agent string |
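As a rough illustration of what a `generate_client_hints`-style helper has to do, here is a hypothetical sketch that derives a Sec-CH-UA value from a Chrome user agent string. The brand list and the GREASE-style placeholder brand are assumptions for the example, not crawl4ai's exact output:

```python
import re

def generate_client_hints(user_agent):
    # Pull the Chrome major version out of the UA string; the Sec-CH-UA
    # brand versions must match it or the fingerprint looks inconsistent.
    match = re.search(r"Chrome/(\d+)", user_agent)
    if not match:
        return None  # non-Chromium UA: no Sec-CH-UA to emit
    major = match.group(1)
    brands = [
        ("Chromium", major),
        ("Google Chrome", major),
        ("Not_A Brand", "8"),  # illustrative GREASE-style entry
    ]
    return ", ".join(f'"{name}";v="{version}"' for name, version in brands)

ua = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
      "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36")
print(generate_client_hints(ua))
# "Chromium";v="120", "Google Chrome";v="120", "Not_A Brand";v="8"
```

Keeping the header version in lockstep with the UA string is the whole point of generating them together, which is what `generate_with_client_hints` provides.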
crawl4ai/ssl_certificate.py
| Type | Name | Signature | Docstring |
|---|---|---|---|
| MODULE | ssl_certificate.py | `` | SSL Certificate class for handling certificate operations. |
| CLASS | SSLCertificate | class SSLCertificate: | A class representing an SSL certificate with methods to export in various formats. Attributes: ... (truncated) |
| METHOD | SSLCertificate.__init__ | def __init__(self, cert_info): | |
| METHOD | SSLCertificate.from_url | def from_url(url, timeout=10): | Create SSLCertificate instance from a URL. Args: url (str): URL of the website. timeout (in... (truncated) |
| METHOD | SSLCertificate._decode_cert_data | def _decode_cert_data(data): | Helper method to decode bytes in certificate data. |
| METHOD | SSLCertificate.to_json | def to_json(self, filepath=None): | Export certificate as JSON. Args: filepath (Optional[str]): Path to save the JSON file (default... (truncated) |
| METHOD | SSLCertificate.to_pem | def to_pem(self, filepath=None): | Export certificate as PEM. Args: filepath (Optional[str]): Path to save the PEM file (default: ... (truncated) |
| METHOD | SSLCertificate.to_der | def to_der(self, filepath=None): | Export certificate as DER. Args: filepath (Optional[str]): Path to save the DER file (default: ... (truncated) |
| METHOD | SSLCertificate.issuer | def issuer(self): | Get certificate issuer information. |
| METHOD | SSLCertificate.subject | def subject(self): | Get certificate subject information. |
| METHOD | SSLCertificate.valid_from | def valid_from(self): | Get certificate validity start date. |
| METHOD | SSLCertificate.valid_until | def valid_until(self): | Get certificate validity end date. |
| METHOD | SSLCertificate.fingerprint | def fingerprint(self): | Get certificate fingerprint. |
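For context on what a `from_url`-style constructor involves, here is a standard-library sketch that performs a TLS handshake and summarizes the peer certificate. The `fetch_cert_info` / `summarize_cert` names and the returned dict layout are assumptions for the example; `SSLCertificate`'s actual `cert_info` structure may differ. Field names follow what `ssl.SSLSocket.getpeercert()` returns:

```python
import hashlib
import socket
import ssl
from urllib.parse import urlparse

def summarize_cert(der_bytes, info):
    # Pure helper: condense the decoded getpeercert() dict plus raw DER
    # bytes into a flat summary. Kept network-free so it is testable.
    return {
        "subject": info.get("subject"),
        "issuer": info.get("issuer"),
        "valid_from": info.get("notBefore"),
        "valid_until": info.get("notAfter"),
        # SHA-256 fingerprint over the DER encoding of the certificate
        "fingerprint": hashlib.sha256(der_bytes).hexdigest(),
    }

def fetch_cert_info(url, timeout=10):
    # Network half: open a TLS connection and summarize the peer cert.
    host = urlparse(url).hostname or url
    ctx = ssl.create_default_context()
    with socket.create_connection((host, 443), timeout=timeout) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            return summarize_cert(tls.getpeercert(binary_form=True),
                                  tls.getpeercert())
```

Splitting the handshake from the summarization mirrors why `from_url` is separate from the export methods: once `cert_info` is captured, every export format can be produced offline.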