crawl4ai/plans/docstring.md
UncleCode 7aaaaae461 feat(browser-farm): Add Docker browser support for remote crawling
Implement initial MVP for Docker-based browser management in Crawl4ai, enabling
remote browser execution in containerized environments.

Key Changes:
- Add browser_farm module with Docker support components:
  * BrowserFarmService: Manages browser endpoints
  * DockerBrowser: Handles Docker browser communication
  * Basic health check implementation
  * Dockerfile with optimized Chrome/Playwright setup:
    - Based on python:3.10-slim for minimal size
    - Includes all required system dependencies
    - Auto-installs crawl4ai and sets up Playwright
    - Configures Chrome with remote debugging
    - Uses socat for port forwarding (9223)

- Update core components:
  * Rename use_managed_browser to use_remote_browser for clarity
  * Modify BrowserManager to support Docker mode
  * Add Docker configuration in BrowserConfig
  * Update context handling for remote browsers

- Add example:
  * hello_world_docker.py demonstrating Docker browser usage

Technical Details:
- Docker container exposes port 9223 (mapped to host:9333)
- Uses CDP (Chrome DevTools Protocol) for remote connection
- Maintains compatibility with existing managed browser features
- Simplified endpoint management for MVP phase
- Optimized Docker setup:
  * Minimal dependencies installation
  * Proper Chrome flags for containerized environment
  * Headless mode with GPU disabled
  * Security considerations (no-sandbox mode)

Testing:
- Extensive Docker configuration testing and optimization
- Verified with hello_world_docker.py example
- Confirmed remote browser connection and crawling functionality
- Tested basic health checks

This is the first step towards a scalable browser farm solution, setting up
the foundation for future enhancements like resource monitoring, multiple
browser instances, and container lifecycle management.
2025-01-02 18:41:36 +08:00
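
The CDP connection flow described above can be sketched in plain Python. Chrome's standard `/json/version` HTTP endpoint reports a `webSocketDebuggerUrl`, but it advertises the in-container address, so a client on the host must rewrite it to the mapped port (9333). The helper names below are illustrative, not part of the commit:

```python
import json
from urllib.request import urlopen

def cdp_ws_endpoint(version_payload: dict, host: str, port: int) -> str:
    """Rewrite the advertised webSocketDebuggerUrl so it targets the
    host-mapped port (Chrome reports the in-container address)."""
    ws = version_payload["webSocketDebuggerUrl"]
    # ws looks like ws://127.0.0.1:9223/devtools/browser/<id>
    path = ws.split("/devtools/", 1)[1]
    return f"ws://{host}:{port}/devtools/{path}"

def discover(host: str = "localhost", port: int = 9333) -> str:
    """Query a running container's CDP HTTP endpoint (hypothetical helper)."""
    with urlopen(f"http://{host}:{port}/json/version") as resp:
        return cdp_ws_endpoint(json.load(resp), host, port)
```

A Playwright client would then connect to the returned `ws://` URL over CDP.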


crawl4ai/models.py

Type Name Signature Docstring
MODULE models.py ``
CLASS TokenUsage class TokenUsage:
CLASS UrlModel class UrlModel:
CLASS MarkdownGenerationResult class MarkdownGenerationResult:
CLASS CrawlResult class CrawlResult:
CLASS AsyncCrawlResponse class AsyncCrawlResponse:

crawl4ai/async_configs.py

Type Name Signature Docstring
MODULE async_configs.py ``
CLASS BrowserConfig class BrowserConfig: Configuration class for setting up a browser instance and its context in AsyncPlaywrightCrawlerStrat... (truncated)
METHOD BrowserConfig.init def __init__(self, browser_type='chromium', headless=True, use_remote_browser=False, use_persistent_context=False, user_data_dir=None, chrome_channel='chrome', proxy=None, proxy_config=None, viewport_width=1080, viewport_height=600, accept_downloads=False, downloads_path=None, storage_state=None, ignore_https_errors=True, java_script_enabled=True, sleep_on_close=False, verbose=True, cookies=None, headers=None, user_agent='Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:109.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.5845.187 Safari/604.1 Edg/117.0.2045.47', user_agent_mode=None, user_agent_generator_config=None, text_mode=False, light_mode=False, extra_args=None, debugging_port=9222):
METHOD BrowserConfig.from_kwargs def from_kwargs(kwargs):
CLASS CrawlerRunConfig class CrawlerRunConfig: Configuration class for controlling how the crawler runs each crawl operation. This includes paramet... (truncated)
METHOD CrawlerRunConfig.init def __init__(self, word_count_threshold=MIN_WORD_THRESHOLD, extraction_strategy=None, chunking_strategy=None, markdown_generator=None, content_filter=None, only_text=False, css_selector=None, excluded_tags=None, excluded_selector=None, keep_data_attributes=False, remove_forms=False, prettiify=False, parser_type='lxml', fetch_ssl_certificate=False, cache_mode=None, session_id=None, bypass_cache=False, disable_cache=False, no_cache_read=False, no_cache_write=False, wait_until='domcontentloaded', page_timeout=PAGE_TIMEOUT, wait_for=None, wait_for_images=True, delay_before_return_html=0.1, mean_delay=0.1, max_range=0.3, semaphore_count=5, js_code=None, js_only=False, ignore_body_visibility=True, scan_full_page=False, scroll_delay=0.2, process_iframes=False, remove_overlay_elements=False, simulate_user=False, override_navigator=False, magic=False, adjust_viewport_to_content=False, screenshot=False, screenshot_wait_for=None, screenshot_height_threshold=SCREENSHOT_HEIGHT_TRESHOLD, pdf=False, image_description_min_word_threshold=IMAGE_DESCRIPTION_MIN_WORD_THRESHOLD, image_score_threshold=IMAGE_SCORE_THRESHOLD, exclude_external_images=False, exclude_social_media_domains=None, exclude_external_links=False, exclude_social_media_links=False, exclude_domains=None, verbose=True, log_console=False, url=None):
METHOD CrawlerRunConfig.from_kwargs def from_kwargs(kwargs):
METHOD CrawlerRunConfig.to_dict def to_dict(self):
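
Both config classes expose a `from_kwargs` constructor. One plausible generic implementation (a sketch — the real classes may validate or transform values) simply filters the kwargs down to the constructor's accepted parameters:

```python
import inspect

class ConfigBase:
    @classmethod
    def from_kwargs(cls, kwargs: dict):
        # Keep only keys matching the constructor's parameters,
        # silently dropping anything unrecognized.
        accepted = set(inspect.signature(cls.__init__).parameters) - {"self"}
        return cls(**{k: v for k, v in kwargs.items() if k in accepted})

class DemoBrowserConfig(ConfigBase):
    """Hypothetical stand-in with two of BrowserConfig's parameters."""
    def __init__(self, headless=True, debugging_port=9222):
        self.headless = headless
        self.debugging_port = debugging_port
```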

crawl4ai/async_webcrawler.py

Type Name Signature Docstring
MODULE async_webcrawler.py ``
CLASS AsyncWebCrawler class AsyncWebCrawler: Asynchronous web crawler with flexible caching capabilities. There are two ways to use the crawler:... (truncated)
METHOD AsyncWebCrawler.init def __init__(self, crawler_strategy=None, config=None, always_bypass_cache=False, always_by_pass_cache=None, base_directory=str(os.getenv('CRAWL4_AI_BASE_DIRECTORY', Path.home())), thread_safe=False, **kwargs): Initialize the AsyncWebCrawler. Args: crawler_strategy: Strategy for crawling web pages. If Non... (truncated)
METHOD AsyncWebCrawler.start async def start(self): Start the crawler explicitly without using context manager. This is equivalent to using 'async with'... (truncated)
METHOD AsyncWebCrawler.close async def close(self): Close the crawler explicitly without using context manager. This should be called when you're done w... (truncated)
METHOD AsyncWebCrawler.aenter async def __aenter__(self):
METHOD AsyncWebCrawler.aexit async def __aexit__(self, exc_type, exc_val, exc_tb):
METHOD AsyncWebCrawler.awarmup async def awarmup(self): Initialize the crawler with warm-up sequence. This method: 1. Logs initialization info 2. Sets up b... (truncated)
METHOD AsyncWebCrawler.nullcontext async def nullcontext(self): Async no-op context manager
METHOD AsyncWebCrawler.arun async def arun(self, url, config=None, word_count_threshold=MIN_WORD_THRESHOLD, extraction_strategy=None, chunking_strategy=RegexChunking(), content_filter=None, cache_mode=None, bypass_cache=False, disable_cache=False, no_cache_read=False, no_cache_write=False, css_selector=None, screenshot=False, pdf=False, user_agent=None, verbose=True, **kwargs): Runs the crawler for a single source: URL (web, local file, or raw HTML). Migration Guide: Old way ... (truncated)
METHOD AsyncWebCrawler.aprocess_html async def aprocess_html(self, url, html, extracted_content, config, screenshot, pdf_data, verbose, **kwargs): Process HTML content using the provided configuration. Args: url: The URL being processed h... (truncated)
METHOD AsyncWebCrawler.arun_many async def arun_many(self, urls, config=None, word_count_threshold=MIN_WORD_THRESHOLD, extraction_strategy=None, chunking_strategy=RegexChunking(), content_filter=None, cache_mode=None, bypass_cache=False, css_selector=None, screenshot=False, pdf=False, user_agent=None, verbose=True, **kwargs): Runs the crawler for multiple URLs concurrently. Migration Guide: Old way (deprecated): results... (truncated)
METHOD AsyncWebCrawler.aclear_cache async def aclear_cache(self): Clear the cache database.
METHOD AsyncWebCrawler.aflush_cache async def aflush_cache(self): Flush the cache database.
METHOD AsyncWebCrawler.aget_cache_size async def aget_cache_size(self): Get the total number of cached items.
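
AsyncWebCrawler supports both explicit `start()`/`close()` and `async with`, as the `__aenter__`/`__aexit__` entries above imply. The delegation pattern can be modeled with a minimal stand-in (`DemoCrawler` is hypothetical, not the library class):

```python
import asyncio

class DemoCrawler:
    """Minimal stand-in: __aenter__/__aexit__ delegate to start()/close()."""
    def __init__(self):
        self.ready = False

    async def start(self):
        self.ready = True
        return self

    async def close(self):
        self.ready = False

    async def __aenter__(self):
        return await self.start()

    async def __aexit__(self, exc_type, exc, tb):
        await self.close()

async def demo():
    async with DemoCrawler() as c:
        inside = c.ready
    return inside, c.ready

# asyncio.run(demo()) -> (True, False)
```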

crawl4ai/async_crawler_strategy.py

Type Name Signature Docstring
MODULE async_crawler_strategy.py ``
CLASS RemoteConnector class RemoteConnector: Manages the browser process and context. This class allows connecting to the browser using the CDP proto... (truncated)
METHOD RemoteConnector.init def __init__(self, browser_type='chromium', user_data_dir=None, headless=False, logger=None, host='localhost', debugging_port=9222): Initialize the RemoteConnector instance. Args: browser_type (str): The type of browser to launch... (truncated)
METHOD RemoteConnector.start async def start(self): Starts the browser process and returns the CDP endpoint URL. If user_data_dir is not provided, creat... (truncated)
METHOD RemoteConnector._monitor_browser_process async def _monitor_browser_process(self): Monitor the browser process for unexpected termination. How it works: 1. Read stdout and stderr fro... (truncated)
METHOD RemoteConnector._get_browser_path def _get_browser_path(self): Returns the browser executable path based on OS and browser type
METHOD RemoteConnector._get_browser_args def _get_browser_args(self): Returns browser-specific command line arguments
METHOD RemoteConnector.cleanup async def cleanup(self): Cleanup browser process and temporary directory
CLASS BrowserManager class BrowserManager: Manages the browser instance and context. Attributes: config (BrowserConfig): Configuration ob... (truncated)
METHOD BrowserManager.init def __init__(self, browser_config, logger=None): Initialize the BrowserManager with a browser configuration. Args: browser_config (BrowserConfig... (truncated)
METHOD BrowserManager.start async def start(self): Start the browser instance and set up the default context. How it works: 1. Check if Playwright is ... (truncated)
METHOD BrowserManager._build_browser_args def _build_browser_args(self): Build browser launch arguments from config.
METHOD BrowserManager.setup_context async def setup_context(self, context, crawlerRunConfig, is_default=False): Set up a browser context with the configured options. How it works: 1. Set extra HTTP headers if pr... (truncated)
METHOD BrowserManager.create_browser_context async def create_browser_context(self): Creates and returns a new browser context with configured settings. Applies text-only mode settings ... (truncated)
METHOD BrowserManager.get_page async def get_page(self, crawlerRunConfig): Get a page for the given session ID, creating a new one if needed. Args: crawlerRunConfig (Craw... (truncated)
METHOD BrowserManager.kill_session async def kill_session(self, session_id): Kill a browser session and clean up resources. Args: session_id (str): The session ID to kill... (truncated)
METHOD BrowserManager._cleanup_expired_sessions def _cleanup_expired_sessions(self): Clean up expired sessions based on TTL.
METHOD BrowserManager.close async def close(self): Close all browser resources and clean up.
CLASS AsyncCrawlerStrategy class AsyncCrawlerStrategy: Abstract base class for crawler strategies. Subclasses must implement the crawl method.
METHOD AsyncCrawlerStrategy.crawl async def crawl(self, url, **kwargs):
CLASS AsyncPlaywrightCrawlerStrategy class AsyncPlaywrightCrawlerStrategy: Crawler strategy using Playwright. Attributes: browser_config (BrowserConfig): Configuration ob... (truncated)
METHOD AsyncPlaywrightCrawlerStrategy.init def __init__(self, browser_config=None, logger=None, **kwargs): Initialize the AsyncPlaywrightCrawlerStrategy with a browser configuration. Args: browser_confi... (truncated)
METHOD AsyncPlaywrightCrawlerStrategy.aenter async def __aenter__(self):
METHOD AsyncPlaywrightCrawlerStrategy.aexit async def __aexit__(self, exc_type, exc_val, exc_tb):
METHOD AsyncPlaywrightCrawlerStrategy.start async def start(self): Start the browser and initialize the browser manager.
METHOD AsyncPlaywrightCrawlerStrategy.close async def close(self): Close the browser and clean up resources.
METHOD AsyncPlaywrightCrawlerStrategy.kill_session async def kill_session(self, session_id): Kill a browser session and clean up resources. Args: session_id (str): The ID of the session to... (truncated)
METHOD AsyncPlaywrightCrawlerStrategy.set_hook def set_hook(self, hook_type, hook): Set a hook function for a specific hook type. Following are list of hook types: - on_browser_created... (truncated)
METHOD AsyncPlaywrightCrawlerStrategy.execute_hook async def execute_hook(self, hook_type, *args, **kwargs): Execute a hook function for a specific hook type. Args: hook_type (str): The type of the hook. ... (truncated)
METHOD AsyncPlaywrightCrawlerStrategy.update_user_agent def update_user_agent(self, user_agent): Update the user agent for the browser. Args: user_agent (str): The new user agent string. ... (truncated)
METHOD AsyncPlaywrightCrawlerStrategy.set_custom_headers def set_custom_headers(self, headers): Set custom headers for the browser. Args: headers (Dict[str, str]): A dictionary of headers to... (truncated)
METHOD AsyncPlaywrightCrawlerStrategy.smart_wait async def smart_wait(self, page, wait_for, timeout=30000): Wait for a condition in a smart way. This function works as follows: 1. If wait_for starts with 'js:... (truncated)
METHOD AsyncPlaywrightCrawlerStrategy.csp_compliant_wait async def csp_compliant_wait(self, page, user_wait_function, timeout=30000): Wait for a condition in a CSP-compliant way. Args: page: Playwright page object user_wait_f... (truncated)
METHOD AsyncPlaywrightCrawlerStrategy.process_iframes async def process_iframes(self, page): Process iframes on a page. This function will extract the content of each iframe and replace it with... (truncated)
METHOD AsyncPlaywrightCrawlerStrategy.create_session async def create_session(self, **kwargs): Creates a new browser session and returns its ID. A browser session is a unique opened page that can be r... (truncated)
METHOD AsyncPlaywrightCrawlerStrategy.crawl async def crawl(self, url, config, **kwargs): Crawls a given URL or processes raw HTML/local file content based on the URL prefix. Args: url ... (truncated)
METHOD AsyncPlaywrightCrawlerStrategy._crawl_web async def _crawl_web(self, url, config): Internal method to crawl web URLs with the specified configuration. Args: url (str): The web UR... (truncated)
METHOD AsyncPlaywrightCrawlerStrategy._handle_full_page_scan async def _handle_full_page_scan(self, page, scroll_delay): Helper method to handle full page scanning. How it works: 1. Get the viewport height. 2. Scroll to... (truncated)
METHOD AsyncPlaywrightCrawlerStrategy._handle_download async def _handle_download(self, download): Handle file downloads. How it works: 1. Get the suggested filename. 2. Get the download path. 3. Lo... (truncated)
METHOD AsyncPlaywrightCrawlerStrategy.remove_overlay_elements async def remove_overlay_elements(self, page): Removes popup overlays, modals, cookie notices, and other intrusive elements from the page. Args: ... (truncated)
METHOD AsyncPlaywrightCrawlerStrategy.export_pdf async def export_pdf(self, page): Exports the current page as a PDF. Args: page (Page): The Playwright page object Returns: ... (truncated)
METHOD AsyncPlaywrightCrawlerStrategy.take_screenshot async def take_screenshot(self, page, **kwargs): Take a screenshot of the current page. Args: page (Page): The Playwright page object kwargs... (truncated)
METHOD AsyncPlaywrightCrawlerStrategy.take_screenshot_from_pdf async def take_screenshot_from_pdf(self, pdf_data): Convert the first page of the PDF to a screenshot. Requires pdf2image and poppler. Args: ... (truncated)
METHOD AsyncPlaywrightCrawlerStrategy.take_screenshot_scroller async def take_screenshot_scroller(self, page, **kwargs): Attempt to set a large viewport and take a full-page screenshot. If still too large, segment the pag... (truncated)
METHOD AsyncPlaywrightCrawlerStrategy.take_screenshot_naive async def take_screenshot_naive(self, page): Takes a screenshot of the current page. Args: page (Page): The Playwright page instance Return... (truncated)
METHOD AsyncPlaywrightCrawlerStrategy.export_storage_state async def export_storage_state(self, path=None): Exports the current storage state (cookies, localStorage, sessionStorage) to a JSON file at the spec... (truncated)
METHOD AsyncPlaywrightCrawlerStrategy.robust_execute_user_script async def robust_execute_user_script(self, page, js_code): Executes user-provided JavaScript code with proper error handling and context, supporting both synch... (truncated)
METHOD AsyncPlaywrightCrawlerStrategy.execute_user_script async def execute_user_script(self, page, js_code): Executes user-provided JavaScript code with proper error handling and context. Args: page: Play... (truncated)
METHOD AsyncPlaywrightCrawlerStrategy.check_visibility async def check_visibility(self, page): Checks if an element is visible on the page. Args: page: Playwright page object Returns: ... (truncated)
METHOD AsyncPlaywrightCrawlerStrategy.safe_scroll async def safe_scroll(self, page, x, y): Safely scroll the page with rendering time. Args: page: Playwright page object x: Horizonta... (truncated)
METHOD AsyncPlaywrightCrawlerStrategy.csp_scroll_to async def csp_scroll_to(self, page, x, y): Performs a CSP-compliant scroll operation and returns the result status. Args: page: Playwright... (truncated)
METHOD AsyncPlaywrightCrawlerStrategy.get_page_dimensions async def get_page_dimensions(self, page): Get the dimensions of the page. Args: page: Playwright page object Returns: Dict conta... (truncated)
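
`BrowserManager._cleanup_expired_sessions` is described only as "Clean up expired sessions based on TTL". A minimal sketch of that idea, assuming (as a guess at the internal layout) that sessions map IDs to `(page, last_used_timestamp)` pairs:

```python
import time

def cleanup_expired_sessions(sessions: dict, ttl: float, now: float = None):
    """Drop session entries whose last-used timestamp is older than ttl
    seconds. Returns the IDs removed. `now` is injectable for testing."""
    now = time.time() if now is None else now
    expired = [sid for sid, (_, last_used) in sessions.items()
               if now - last_used > ttl]
    for sid in expired:
        del sessions[sid]
    return expired
```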

crawl4ai/content_scraping_strategy.py

Type Name Signature Docstring
MODULE content_scraping_strategy.py ``
FUNCTION parse_dimension def parse_dimension(dimension):
FUNCTION fetch_image_file_size def fetch_image_file_size(img, base_url):
CLASS ContentScrapingStrategy class ContentScrapingStrategy:
METHOD ContentScrapingStrategy.scrap def scrap(self, url, html, **kwargs):
METHOD ContentScrapingStrategy.ascrap async def ascrap(self, url, html, **kwargs):
CLASS WebScrapingStrategy class WebScrapingStrategy: Class for web content scraping. Perhaps the most important class. How it works: 1. Extract content... (truncated)
METHOD WebScrapingStrategy.init def __init__(self, logger=None):
METHOD WebScrapingStrategy._log def _log(self, level, message, tag='SCRAPE', **kwargs): Helper method to safely use logger.
METHOD WebScrapingStrategy.scrap def scrap(self, url, html, **kwargs): Main entry point for content scraping. Args: url (str): The URL of the page to scrape. ht... (truncated)
METHOD WebScrapingStrategy.ascrap async def ascrap(self, url, html, **kwargs): Main entry point for asynchronous content scraping. Args: url (str): The URL of the page to scr... (truncated)
METHOD WebScrapingStrategy._generate_markdown_content def _generate_markdown_content(self, cleaned_html, html, url, success, **kwargs): Generate markdown content from cleaned HTML. Args: cleaned_html (str): The cleaned HTML content... (truncated)
METHOD WebScrapingStrategy.flatten_nested_elements def flatten_nested_elements(self, node): Flatten nested elements in an HTML tree. Args: node (Tag): The root node of the HTML tree. Retu... (truncated)
METHOD WebScrapingStrategy.find_closest_parent_with_useful_text def find_closest_parent_with_useful_text(self, tag, **kwargs): Find the closest parent with useful text. Args: tag (Tag): The starting tag to search from. ... (truncated)
METHOD WebScrapingStrategy.remove_unwanted_attributes def remove_unwanted_attributes(self, element, important_attrs, keep_data_attributes=False): Remove unwanted attributes from an HTML element. Args: element (Tag): The HTML element to r... (truncated)
METHOD WebScrapingStrategy.process_image def process_image(self, img, url, index, total_images, **kwargs): Process an image element. How it works: 1. Check if the image has valid display and inside undesire... (truncated)
METHOD WebScrapingStrategy.process_element def process_element(self, url, element, **kwargs): Process an HTML element. How it works: 1. Check if the element is an image, video, or audio. 2. Ext... (truncated)
METHOD WebScrapingStrategy._process_element def _process_element(self, url, element, media, internal_links_dict, external_links_dict, **kwargs): Process an HTML element.
METHOD WebScrapingStrategy._scrap def _scrap(self, url, html, word_count_threshold=MIN_WORD_THRESHOLD, css_selector=None, **kwargs): Extract content from HTML using BeautifulSoup. Args: url (str): The URL of the page to scrape. ... (truncated)
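
The module-level `parse_dimension` helper carries no docstring in the listing. A reasonable guess, given it feeds image size handling, is that it strips CSS units and returns a number; the exact treatment of percentages and unparseable values below is assumed:

```python
import re

def parse_dimension(dimension):
    """Parse a CSS-style dimension ('100', '100px', '50%') into a float,
    returning None when it cannot be interpreted. Behavior is a guess at
    what the real helper does, not taken from the source."""
    if dimension is None:
        return None
    m = re.match(r"\s*(\d+(?:\.\d+)?)", str(dimension))
    return float(m.group(1)) if m else None
```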

crawl4ai/markdown_generation_strategy.py

Type Name Signature Docstring
MODULE markdown_generation_strategy.py ``
FUNCTION fast_urljoin def fast_urljoin(base, url): Fast URL joining for common cases.
CLASS MarkdownGenerationStrategy class MarkdownGenerationStrategy: Abstract base class for markdown generation strategies.
METHOD MarkdownGenerationStrategy.init def __init__(self, content_filter=None, options=None):
METHOD MarkdownGenerationStrategy.generate_markdown def generate_markdown(self, cleaned_html, base_url='', html2text_options=None, content_filter=None, citations=True, **kwargs): Generate markdown from cleaned HTML.
CLASS DefaultMarkdownGenerator class DefaultMarkdownGenerator: Default implementation of markdown generation strategy. How it works: 1. Generate raw markdown from... (truncated)
METHOD DefaultMarkdownGenerator.init def __init__(self, content_filter=None, options=None):
METHOD DefaultMarkdownGenerator.convert_links_to_citations def convert_links_to_citations(self, markdown, base_url=''): Convert links in markdown to citations. How it works: 1. Find all links in the markdown. 2. Convert... (truncated)
METHOD DefaultMarkdownGenerator.generate_markdown def generate_markdown(self, cleaned_html, base_url='', html2text_options=None, options=None, content_filter=None, citations=True, **kwargs): Generate markdown with citations from cleaned HTML. How it works: 1. Generate raw markdown from cle... (truncated)
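
The citation conversion that `DefaultMarkdownGenerator.convert_links_to_citations` describes (find links, replace them with numbered markers, collect references) can be approximated with one regex pass. The `[n]` marker style and reference format here are illustrative, not the library's actual output:

```python
import re
from urllib.parse import urljoin

def convert_links_to_citations(markdown: str, base_url: str = ""):
    """Replace inline markdown links with numbered citation markers and
    return (body, references). Duplicate URLs reuse the same number."""
    refs, seen = [], {}

    def repl(m):
        text, url = m.group(1), urljoin(base_url, m.group(2))
        if url not in seen:
            seen[url] = len(refs) + 1
            refs.append(f"[{seen[url]}] {url}")
        return f"{text}[{seen[url]}]"

    body = re.sub(r"\[([^\]]+)\]\(([^)]+)\)", repl, markdown)
    return body, "\n".join(refs)
```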

crawl4ai/content_filter_strategy.py

Type Name Signature Docstring
MODULE content_filter_strategy.py ``
CLASS RelevantContentFilter class RelevantContentFilter: Abstract base class for content filtering strategies
METHOD RelevantContentFilter.init def __init__(self, user_query=None):
METHOD RelevantContentFilter.filter_content def filter_content(self, html): Abstract method to be implemented by specific filtering strategies
METHOD RelevantContentFilter.extract_page_query def extract_page_query(self, soup, body): Common method to extract page metadata with fallbacks
METHOD RelevantContentFilter.extract_text_chunks def extract_text_chunks(self, body, min_word_threshold=None): Extracts text chunks from a BeautifulSoup body element while preserving order. Returns list of tuple... (truncated)
METHOD RelevantContentFilter._deprecated_extract_text_chunks def _deprecated_extract_text_chunks(self, soup): Common method for extracting text chunks
METHOD RelevantContentFilter.is_excluded def is_excluded(self, tag): Common method for exclusion logic
METHOD RelevantContentFilter.clean_element def clean_element(self, tag): Common method for cleaning HTML elements with minimal overhead
CLASS BM25ContentFilter class BM25ContentFilter: Content filtering using BM25 algorithm with priority tag handling. How it works: 1. Extracts page m... (truncated)
METHOD BM25ContentFilter.init def __init__(self, user_query=None, bm25_threshold=1.0, language='english'): Initializes the BM25ContentFilter class; if no query is provided, it falls back to page metadata. Note: If no ... (truncated)
METHOD BM25ContentFilter.filter_content def filter_content(self, html, min_word_threshold=None): Implements content filtering using BM25 algorithm with priority tag handling. Note: This method... (truncated)
CLASS PruningContentFilter class PruningContentFilter: Content filtering using pruning algorithm with dynamic threshold. How it works: 1. Extracts page me... (truncated)
METHOD PruningContentFilter.init def __init__(self, user_query=None, min_word_threshold=None, threshold_type='fixed', threshold=0.48): Initializes the PruningContentFilter class; if no query is provided, it falls back to page metadata. Note: If ... (truncated)
METHOD PruningContentFilter.filter_content def filter_content(self, html, min_word_threshold=None): Implements content filtering using pruning algorithm with dynamic threshold. Note: This method impl... (truncated)
METHOD PruningContentFilter._remove_comments def _remove_comments(self, soup): Removes HTML comments
METHOD PruningContentFilter._remove_unwanted_tags def _remove_unwanted_tags(self, soup): Removes unwanted tags
METHOD PruningContentFilter._prune_tree def _prune_tree(self, node): Prunes the tree starting from the given node. Args: node (Tag): The node from which the pruning... (truncated)
METHOD PruningContentFilter._compute_composite_score def _compute_composite_score(self, metrics, text_len, tag_len, link_text_len): Computes the composite score
METHOD PruningContentFilter._compute_class_id_weight def _compute_class_id_weight(self, node): Computes the class ID weight
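
`BM25ContentFilter` scores extracted text chunks against a query and keeps those above `bm25_threshold`. The scoring itself is classic BM25; a compact pure-Python version is below (the real filter adds priority-tag handling and tokenization details not shown in this listing):

```python
import math

def bm25_scores(query_terms, chunks, k1=1.2, b=0.75):
    """Score each chunk against the query terms using the standard
    BM25 formula over whitespace-tokenized, lowercased text."""
    docs = [c.lower().split() for c in chunks]
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    scores = []
    for d in docs:
        s = 0.0
        for t in query_terms:
            df = sum(1 for doc in docs if t in doc)       # document frequency
            idf = math.log((N - df + 0.5) / (df + 0.5) + 1)
            tf = d.count(t)                               # term frequency
            s += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores
```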

crawl4ai/extraction_strategy.py

Type Name Signature Docstring
MODULE extraction_strategy.py ``
CLASS ExtractionStrategy class ExtractionStrategy: Abstract base class for all extraction strategies.
METHOD ExtractionStrategy.init def __init__(self, input_format='markdown', **kwargs): Initialize the extraction strategy. Args: input_format: Content format to use for extraction. ... (truncated)
METHOD ExtractionStrategy.extract def extract(self, url, html, *q, **kwargs): Extract meaningful blocks or chunks from the given HTML. :param url: The URL of the webpage. :param... (truncated)
METHOD ExtractionStrategy.run def run(self, url, sections, *q, **kwargs): Process sections of text in parallel by default. :param url: The URL of the webpage. :param section... (truncated)
CLASS NoExtractionStrategy class NoExtractionStrategy: A strategy that does not extract any meaningful content from the HTML. It simply returns the entire ... (truncated)
METHOD NoExtractionStrategy.extract def extract(self, url, html, *q, **kwargs): Extract meaningful blocks or chunks from the given HTML.
METHOD NoExtractionStrategy.run def run(self, url, sections, *q, **kwargs):
CLASS LLMExtractionStrategy class LLMExtractionStrategy: A strategy that uses an LLM to extract meaningful content from the HTML. Attributes: provider: ... (truncated)
METHOD LLMExtractionStrategy.init def __init__(self, provider=DEFAULT_PROVIDER, api_token=None, instruction=None, schema=None, extraction_type='block', **kwargs): Initialize the strategy with clustering parameters. Args: provider: The provider to use for ext... (truncated)
METHOD LLMExtractionStrategy.extract def extract(self, url, ix, html): Extract meaningful blocks or chunks from the given HTML using an LLM. How it works: 1. Construct a ... (truncated)
METHOD LLMExtractionStrategy._merge def _merge(self, documents, chunk_token_threshold, overlap): Merge documents into sections based on chunk_token_threshold and overlap.
METHOD LLMExtractionStrategy.run def run(self, url, sections): Process sections sequentially with a delay for rate limiting issues, specifically for LLMExtractionS... (truncated)
METHOD LLMExtractionStrategy.show_usage def show_usage(self): Print a detailed token usage report showing total and per-request usage.
CLASS CosineStrategy class CosineStrategy: Extract meaningful blocks or chunks from the given HTML using cosine similarity. How it works: 1. P... (truncated)
METHOD CosineStrategy.init def __init__(self, semantic_filter=None, word_count_threshold=10, max_dist=0.2, linkage_method='ward', top_k=3, model_name='sentence-transformers/all-MiniLM-L6-v2', sim_threshold=0.3, **kwargs): Initialize the strategy with clustering parameters. Args: semantic_filter (str): A keyword filt... (truncated)
METHOD CosineStrategy.filter_documents_embeddings def filter_documents_embeddings(self, documents, semantic_filter, at_least_k=20): Filter and sort documents based on the cosine similarity of their embeddings with the semantic_filte... (truncated)
METHOD CosineStrategy.get_embeddings def get_embeddings(self, sentences, batch_size=None, bypass_buffer=False): Get BERT embeddings for a list of sentences. Args: sentences (List[str]): A list of text chunks... (truncated)
METHOD CosineStrategy.hierarchical_clustering def hierarchical_clustering(self, sentences, embeddings=None): Perform hierarchical clustering on sentences and return cluster labels. Args: sentences (List[s... (truncated)
METHOD CosineStrategy.filter_clusters_by_word_count def filter_clusters_by_word_count(self, clusters): Filter clusters to remove those with a word count below the threshold. Args: clusters (Dict[int... (truncated)
METHOD CosineStrategy.extract def extract(self, url, html, *q, **kwargs): Extract clusters from HTML content using hierarchical clustering. Args: url (str): The URL of t... (truncated)
METHOD CosineStrategy.run def run(self, url, sections, *q, **kwargs): Process sections using hierarchical clustering. Args: url (str): The URL of the webpage. se... (truncated)
CLASS JsonElementExtractionStrategy class JsonElementExtractionStrategy: Abstract base class for extracting structured JSON from HTML content. How it works: 1. ... (truncated)
METHOD JsonElementExtractionStrategy.init def __init__(self, schema, **kwargs): Initialize the JSON element extraction strategy with a schema. Args: schema (Dict[str, Any]): T... (truncated)
METHOD JsonElementExtractionStrategy.extract def extract(self, url, html_content, *q, **kwargs): Extract structured data from HTML content. How it works: 1. Parses the HTML content using the `_par... (truncated)
METHOD JsonElementExtractionStrategy._parse_html def _parse_html(self, html_content): Parse HTML content into appropriate format
METHOD JsonElementExtractionStrategy._get_base_elements def _get_base_elements(self, parsed_html, selector): Get all base elements using the selector
METHOD JsonElementExtractionStrategy._get_elements def _get_elements(self, element, selector): Get child elements using the selector
METHOD JsonElementExtractionStrategy._extract_field def _extract_field(self, element, field):
METHOD JsonElementExtractionStrategy._extract_single_field def _extract_single_field(self, element, field): Extract a single field based on its type. How it works: 1. Selects the target element using the fie... (truncated)
METHOD JsonElementExtractionStrategy._extract_list_item def _extract_list_item(self, element, fields):
METHOD JsonElementExtractionStrategy._extract_item def _extract_item(self, element, fields): Extracts fields from a given element. How it works: 1. Iterates through the fields defined in the s... (truncated)
METHOD JsonElementExtractionStrategy._apply_transform def _apply_transform(self, value, transform): Apply a transformation to a value. How it works: 1. Checks the transformation type (e.g., `lowercas... (truncated)
METHOD JsonElementExtractionStrategy._compute_field def _compute_field(self, item, field):
METHOD JsonElementExtractionStrategy.run def run(self, url, sections, *q, **kwargs): Run the extraction strategy on a combined HTML content. How it works: 1. Combines multiple HTML sec... (truncated)
METHOD JsonElementExtractionStrategy._get_element_text def _get_element_text(self, element): Get text content from element
METHOD JsonElementExtractionStrategy._get_element_html def _get_element_html(self, element): Get HTML content from element
METHOD JsonElementExtractionStrategy._get_element_attribute def _get_element_attribute(self, element, attribute): Get attribute value from element
CLASS JsonCssExtractionStrategy class JsonCssExtractionStrategy: Concrete implementation of JsonElementExtractionStrategy using CSS selectors. How it works: 1. Pa... (truncated)
METHOD JsonCssExtractionStrategy.init def __init__(self, schema, **kwargs):
METHOD JsonCssExtractionStrategy._parse_html def _parse_html(self, html_content):
METHOD JsonCssExtractionStrategy._get_base_elements def _get_base_elements(self, parsed_html, selector):
METHOD JsonCssExtractionStrategy._get_elements def _get_elements(self, element, selector):
METHOD JsonCssExtractionStrategy._get_element_text def _get_element_text(self, element):
METHOD JsonCssExtractionStrategy._get_element_html def _get_element_html(self, element):
METHOD JsonCssExtractionStrategy._get_element_attribute def _get_element_attribute(self, element, attribute):
CLASS JsonXPathExtractionStrategy class JsonXPathExtractionStrategy: Concrete implementation of JsonElementExtractionStrategy using XPath selectors. How it works: 1. ... (truncated)
METHOD JsonXPathExtractionStrategy.init def __init__(self, schema, **kwargs):
METHOD JsonXPathExtractionStrategy._parse_html def _parse_html(self, html_content):
METHOD JsonXPathExtractionStrategy._get_base_elements def _get_base_elements(self, parsed_html, selector):
METHOD JsonXPathExtractionStrategy._css_to_xpath def _css_to_xpath(self, css_selector): Convert CSS selector to XPath if needed
METHOD JsonXPathExtractionStrategy._basic_css_to_xpath def _basic_css_to_xpath(self, css_selector): Basic CSS to XPath conversion for common cases
METHOD JsonXPathExtractionStrategy._get_elements def _get_elements(self, element, selector):
METHOD JsonXPathExtractionStrategy._get_element_text def _get_element_text(self, element):
METHOD JsonXPathExtractionStrategy._get_element_html def _get_element_html(self, element):
METHOD JsonXPathExtractionStrategy._get_element_attribute def _get_element_attribute(self, element, attribute):
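The `_basic_css_to_xpath` helper above maps common CSS selector forms onto XPath so one schema can drive both strategies. The exact rules in crawl4ai's implementation may differ; this stdlib-only sketch illustrates the kind of mapping involved for a small selector subset (bare tags, `.class`, `#id`, and the descendant combinator):

```python
def basic_css_to_xpath(css_selector: str) -> str:
    """Convert a handful of common CSS selector forms to XPath.

    Illustrative only: this covers a small subset of the grammar a real
    converter like JsonXPathExtractionStrategy._basic_css_to_xpath must
    handle (no attribute selectors, child combinators, or pseudo-classes).
    """
    parts = []
    for token in css_selector.split():
        if token.startswith("#"):
            # "#main" -> any element whose id attribute equals "main"
            parts.append(f"*[@id='{token[1:]}']")
        elif token.startswith("."):
            # ".item" -> any element whose class attribute contains "item"
            parts.append(f"*[contains(@class, '{token[1:]}')]")
        elif "." in token:
            # "div.item" -> a div whose class attribute contains "item"
            tag, cls = token.split(".", 1)
            parts.append(f"{tag}[contains(@class, '{cls}')]")
        else:
            # bare tag name, e.g. "li"
            parts.append(token)
    # CSS whitespace means "any descendant", which XPath spells "//"
    return "//" + "//".join(parts)
```

Note that `contains(@class, ...)` is a deliberate simplification: it also matches substrings of longer class names, which a production converter would guard against with token-boundary checks.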

crawl4ai/chunking_strategy.py

Type Name Signature Docstring
MODULE chunking_strategy.py ``
CLASS ChunkingStrategy class ChunkingStrategy: Abstract base class for chunking strategies.
METHOD ChunkingStrategy.chunk def chunk(self, text): Abstract method to chunk the given text. Args: text (str): The text to chunk. Returns: lis... (truncated)
CLASS IdentityChunking class IdentityChunking: Chunking strategy that returns the input text as a single chunk.
METHOD IdentityChunking.chunk def chunk(self, text):
CLASS RegexChunking class RegexChunking: Chunking strategy that splits text based on regular expression patterns.
METHOD RegexChunking.init def __init__(self, patterns=None, **kwargs): Initialize the RegexChunking object. Args: patterns (list): A list of regular expression patter... (truncated)
METHOD RegexChunking.chunk def chunk(self, text):
CLASS NlpSentenceChunking class NlpSentenceChunking: Chunking strategy that splits text into sentences using NLTK's sentence tokenizer.
METHOD NlpSentenceChunking.init def __init__(self, **kwargs): Initialize the NlpSentenceChunking object.
METHOD NlpSentenceChunking.chunk def chunk(self, text):
CLASS TopicSegmentationChunking class TopicSegmentationChunking: Chunking strategy that segments text into topics using NLTK's TextTilingTokenizer. How it works: 1.... (truncated)
METHOD TopicSegmentationChunking.init def __init__(self, num_keywords=3, **kwargs): Initialize the TopicSegmentationChunking object. Args: num_keywords (int): The number of keywor... (truncated)
METHOD TopicSegmentationChunking.chunk def chunk(self, text):
METHOD TopicSegmentationChunking.extract_keywords def extract_keywords(self, text):
METHOD TopicSegmentationChunking.chunk_with_topics def chunk_with_topics(self, text):
CLASS FixedLengthWordChunking class FixedLengthWordChunking: Chunking strategy that splits text into fixed-length word chunks. How it works: 1. Split the text i... (truncated)
METHOD FixedLengthWordChunking.init def __init__(self, chunk_size=100, **kwargs): Initialize the fixed-length word chunking strategy with the given chunk size. Args: chunk_size ... (truncated)
METHOD FixedLengthWordChunking.chunk def chunk(self, text):
CLASS SlidingWindowChunking class SlidingWindowChunking: Chunking strategy that splits text into overlapping word chunks. How it works: 1. Split the text in... (truncated)
METHOD SlidingWindowChunking.init def __init__(self, window_size=100, step=50, **kwargs): Initialize the sliding window chunking strategy with the given window size and step size. Args: ... (truncated)
METHOD SlidingWindowChunking.chunk def chunk(self, text):
CLASS OverlappingWindowChunking class OverlappingWindowChunking: Chunking strategy that splits text into overlapping word chunks. How it works: 1. Split the text in... (truncated)
METHOD OverlappingWindowChunking.init def __init__(self, window_size=1000, overlap=100, **kwargs): Initialize the overlapping window chunking strategy with the given window size and overlap size. Ar... (truncated)
METHOD OverlappingWindowChunking.chunk def chunk(self, text):
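The window-based strategies above (`SlidingWindowChunking`, `OverlappingWindowChunking`) share one idea: adjacent chunks overlap so that context is not lost at chunk boundaries. A minimal sketch of that pattern, not the library's exact implementation:

```python
def sliding_window_chunk(text: str, window_size: int = 100, step: int = 50) -> list:
    """Split text into overlapping word windows.

    Each chunk holds `window_size` words; consecutive chunks start `step`
    words apart, so neighbours share `window_size - step` words of context.
    Assumed behaviour: a text shorter than one window yields a single chunk,
    and a final window is added so the tail of the text is never dropped.
    """
    words = text.split()
    if len(words) <= window_size:
        return [" ".join(words)]
    chunks = []
    for start in range(0, len(words) - window_size + 1, step):
        chunks.append(" ".join(words[start:start + window_size]))
    # Cover the tail when the last regular window stops short of the end.
    if (len(words) - window_size) % step != 0:
        chunks.append(" ".join(words[-window_size:]))
    return chunks
```

With `window_size=4, step=2`, a 10-word text yields windows starting at words 0, 2, 4, and 6, each sharing two words with its neighbour.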

crawl4ai/user_agent_generator.py

Type Name Signature Docstring
MODULE user_agent_generator.py ``
CLASS UserAgentGenerator class UserAgentGenerator: Generate random user agents with specified constraints. Attributes: desktop_platforms (dict): A... (truncated)
METHOD UserAgentGenerator.init def __init__(self):
METHOD UserAgentGenerator.get_browser_stack def get_browser_stack(self, num_browsers=1): Get a valid combination of browser versions. How it works: 1. Check if the number of browsers is su... (truncated)
METHOD UserAgentGenerator.generate def generate(self, device_type=None, os_type=None, device_brand=None, browser_type=None, num_browsers=3): Generate a random user agent with specified constraints. Args: device_type: 'desktop' or 'mobil... (truncated)
METHOD UserAgentGenerator.generate_with_client_hints def generate_with_client_hints(self, **kwargs): Generate both user agent and matching client hints
METHOD UserAgentGenerator.get_random_platform def get_random_platform(self, device_type, os_type, device_brand): Helper method to get random platform based on constraints
METHOD UserAgentGenerator.parse_user_agent def parse_user_agent(self, user_agent): Parse a user agent string to extract browser and version information
METHOD UserAgentGenerator.generate_client_hints def generate_client_hints(self, user_agent): Generate Sec-CH-UA header value based on user agent string
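`parse_user_agent` and `generate_client_hints` both depend on pulling `Name/Version` product tokens out of a user-agent string so that the Sec-CH-UA header can agree with it. A simplified, regex-based sketch of that extraction step (the real parser likely handles more brands and edge cases):

```python
import re

def parse_browser_versions(user_agent: str) -> dict:
    """Extract browser product/version pairs from a user-agent string.

    Matches `Name/Version` tokens for a few well-known brands; a value like
    Chrome/120.0.0.0 becomes {"Chrome": "120.0.0.0"}. These pairs are the
    raw material for building a matching Sec-CH-UA client-hints value.
    """
    pattern = re.compile(r"(Chrome|Firefox|Safari|Edg|Version)/([\d.]+)")
    return {name: version for name, version in pattern.findall(user_agent)}
```

Note that `Safari/537.36` in a Chrome user agent is a WebKit compatibility token, not the real browser, which is why a generator needs brand-aware logic on top of raw token extraction.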

crawl4ai/ssl_certificate.py

Type Name Signature Docstring
MODULE ssl_certificate.py `` SSL Certificate class for handling certificate operations.
CLASS SSLCertificate class SSLCertificate: A class representing an SSL certificate with methods to export in various formats. Attributes: ... (truncated)
METHOD SSLCertificate.init def __init__(self, cert_info):
METHOD SSLCertificate.from_url def from_url(url, timeout=10): Create SSLCertificate instance from a URL. Args: url (str): URL of the website. timeout (in... (truncated)
METHOD SSLCertificate._decode_cert_data def _decode_cert_data(data): Helper method to decode bytes in certificate data.
METHOD SSLCertificate.to_json def to_json(self, filepath=None): Export certificate as JSON. Args: filepath (Optional[str]): Path to save the JSON file (default... (truncated)
METHOD SSLCertificate.to_pem def to_pem(self, filepath=None): Export certificate as PEM. Args: filepath (Optional[str]): Path to save the PEM file (default: ... (truncated)
METHOD SSLCertificate.to_der def to_der(self, filepath=None): Export certificate as DER. Args: filepath (Optional[str]): Path to save the DER file (default: ... (truncated)
METHOD SSLCertificate.issuer def issuer(self): Get certificate issuer information.
METHOD SSLCertificate.subject def subject(self): Get certificate subject information.
METHOD SSLCertificate.valid_from def valid_from(self): Get certificate validity start date.
METHOD SSLCertificate.valid_until def valid_until(self): Get certificate validity end date.
METHOD SSLCertificate.fingerprint def fingerprint(self): Get certificate fingerprint.
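`from_url` and `fingerprint` can both be expressed with the standard library alone. Certificate fingerprints are conventionally computed over the DER bytes, so a PEM certificate is base64-decoded before hashing; the hash algorithm (SHA-256 here) is an assumption, as is the fetch helper's shape:

```python
import hashlib
import ssl

def cert_fingerprint_from_pem(pem_cert: str) -> str:
    """Return the SHA-256 fingerprint (hex) of a PEM-encoded certificate.

    Strips the PEM armor and base64-decodes to DER, then hashes the DER
    bytes -- SSLCertificate.fingerprint presumably follows the same
    DER-hashing convention, though its hash algorithm may differ.
    """
    der_bytes = ssl.PEM_cert_to_DER_cert(pem_cert)
    return hashlib.sha256(der_bytes).hexdigest()

def fetch_server_cert_pem(host: str, port: int = 443) -> str:
    """Fetch a server's leaf certificate as PEM (network call).

    Roughly what SSLCertificate.from_url must do under the hood:
    ssl.get_server_certificate performs the TLS handshake and returns
    the peer's certificate in PEM form.
    """
    return ssl.get_server_certificate((host, port))
```

Usage: `cert_fingerprint_from_pem(fetch_server_cert_pem("example.com"))` yields a hex digest suitable for pinning or comparison. The decode step also mirrors `to_der`, since `ssl.PEM_cert_to_DER_cert` / `ssl.DER_cert_to_PEM_cert` convert between the two export formats.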