Implement an initial MVP for Docker-based browser management in Crawl4ai, enabling
remote browser execution in containerized environments.
Key Changes:
- Add browser_farm module with Docker support components:
* BrowserFarmService: Manages browser endpoints
* DockerBrowser: Handles Docker browser communication
* Basic health check implementation
* Dockerfile with optimized Chrome/Playwright setup:
- Based on python:3.10-slim for minimal size
- Includes all required system dependencies
- Auto-installs crawl4ai and sets up Playwright
- Configures Chrome with remote debugging
- Uses socat for port forwarding (9223)
- Update core components:
* Rename use_managed_browser to use_remote_browser for clarity
* Modify BrowserManager to support Docker mode
* Add Docker configuration in BrowserConfig
* Update context handling for remote browsers
- Add example:
* hello_world_docker.py demonstrating Docker browser usage
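The basic health check mentioned above is only named in this summary; a minimal sketch of what such a check could look like, assuming Chrome's standard `/json/version` CDP metadata endpoint on the container's debugging port. The helper names (`cdp_version_url`, `is_browser_healthy`) are illustrative, not the actual crawl4ai API:

```python
# Hypothetical sketch of the kind of health check BrowserFarmService might run.
# /json/version is Chrome's standard CDP metadata endpoint; a healthy endpoint
# reports its WebSocket debugger URL there.
import json
from urllib.request import urlopen

def cdp_version_url(host: str = "localhost", port: int = 9223) -> str:
    """Build the CDP metadata URL exposed by Chrome's remote debugging port."""
    return f"http://{host}:{port}/json/version"

def is_browser_healthy(host: str = "localhost", port: int = 9223,
                       timeout: float = 2.0) -> bool:
    """Return True if the containerized Chrome answers on its debugging port."""
    try:
        with urlopen(cdp_version_url(host, port), timeout=timeout) as resp:
            info = json.load(resp)
        return "webSocketDebuggerUrl" in info
    except OSError:
        return False
```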
Technical Details:
- Docker container exposes port 9223 (mapped to host:9333)
- Uses CDP (Chrome DevTools Protocol) for remote connection
- Maintains compatibility with existing managed browser features
- Simplified endpoint management for MVP phase
- Optimized Docker setup:
* Minimal dependencies installation
* Proper Chrome flags for containerized environment
* Headless mode with GPU disabled
* Security considerations (no-sandbox mode)
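The container details above (headless, GPU disabled, no-sandbox, debugging on 9223) can be collected into a launch command. This is an illustrative sketch only; the exact flag set and helper used by the Dockerfile may differ:

```python
# Illustrative Chrome launch flags for a containerized, headless setup,
# mirroring the details above. The helper name is hypothetical.
CONTAINER_CHROME_FLAGS = [
    "--headless=new",
    "--disable-gpu",
    "--no-sandbox",                       # Chrome commonly runs as root in containers
    "--remote-debugging-port=9223",
    "--remote-debugging-address=0.0.0.0", # reachable through socat/port mapping
]

def build_chrome_command(binary: str = "google-chrome", extra_args=None) -> list:
    """Assemble the argv for launching Chrome inside the container."""
    return [binary, *CONTAINER_CHROME_FLAGS, *(extra_args or [])]
```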
Testing:
- Extensive Docker configuration testing and optimization
- Verified with hello_world_docker.py example
- Confirmed remote browser connection and crawling functionality
- Tested basic health checks
This is the first step towards a scalable browser farm solution, setting up
the foundation for future enhancements like resource monitoring, multiple
browser instances, and container lifecycle management.
crawl4ai/models.py
| Type | Name | Signature | Docstring |
|---|---|---|---|
| MODULE | models.py | `` | |
| CLASS | TokenUsage | `class TokenUsage:` | |
| CLASS | UrlModel | `class UrlModel:` | |
| CLASS | MarkdownGenerationResult | `class MarkdownGenerationResult:` | |
| CLASS | CrawlResult | `class CrawlResult:` | |
| CLASS | AsyncCrawlResponse | `class AsyncCrawlResponse:` | |
crawl4ai/async_configs.py
| Type | Name | Signature | Docstring |
|---|---|---|---|
| MODULE | async_configs.py | `` | |
| CLASS | BrowserConfig | `class BrowserConfig:` | Configuration class for setting up a browser instance and its context in AsyncPlaywrightCrawlerStrat... (truncated) |
| METHOD | BrowserConfig.init | `def __init__(self, browser_type='chromium', headless=True, use_remote_browser=False, use_persistent_context=False, user_data_dir=None, chrome_channel='chrome', proxy=None, proxy_config=None, viewport_width=1080, viewport_height=600, accept_downloads=False, downloads_path=None, storage_state=None, ignore_https_errors=True, java_script_enabled=True, sleep_on_close=False, verbose=True, cookies=None, headers=None, user_agent='Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:109.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.5845.187 Safari/604.1 Edg/117.0.2045.47', user_agent_mode=None, user_agent_generator_config=None, text_mode=False, light_mode=False, extra_args=None, debugging_port=9222):` | |
| METHOD | BrowserConfig.from_kwargs | `def from_kwargs(kwargs):` | |
| CLASS | CrawlerRunConfig | `class CrawlerRunConfig:` | Configuration class for controlling how the crawler runs each crawl operation. This includes paramet... (truncated) |
| METHOD | CrawlerRunConfig.init | `def __init__(self, word_count_threshold=MIN_WORD_THRESHOLD, extraction_strategy=None, chunking_strategy=None, markdown_generator=None, content_filter=None, only_text=False, css_selector=None, excluded_tags=None, excluded_selector=None, keep_data_attributes=False, remove_forms=False, prettiify=False, parser_type='lxml', fetch_ssl_certificate=False, cache_mode=None, session_id=None, bypass_cache=False, disable_cache=False, no_cache_read=False, no_cache_write=False, wait_until='domcontentloaded', page_timeout=PAGE_TIMEOUT, wait_for=None, wait_for_images=True, delay_before_return_html=0.1, mean_delay=0.1, max_range=0.3, semaphore_count=5, js_code=None, js_only=False, ignore_body_visibility=True, scan_full_page=False, scroll_delay=0.2, process_iframes=False, remove_overlay_elements=False, simulate_user=False, override_navigator=False, magic=False, adjust_viewport_to_content=False, screenshot=False, screenshot_wait_for=None, screenshot_height_threshold=SCREENSHOT_HEIGHT_TRESHOLD, pdf=False, image_description_min_word_threshold=IMAGE_DESCRIPTION_MIN_WORD_THRESHOLD, image_score_threshold=IMAGE_SCORE_THRESHOLD, exclude_external_images=False, exclude_social_media_domains=None, exclude_external_links=False, exclude_social_media_links=False, exclude_domains=None, verbose=True, log_console=False, url=None):` | |
| METHOD | CrawlerRunConfig.from_kwargs | `def from_kwargs(kwargs):` | |
| METHOD | CrawlerRunConfig.to_dict | `def to_dict(self):` | |
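Both config classes above expose a `from_kwargs` constructor, a common pattern for building a config object from a plain dict. A simplified stand-in (not the real `BrowserConfig`; the ignore-unknown-keys behavior is an assumption) illustrates the idea:

```python
# Minimal sketch of the from_kwargs pattern: populate known fields from a
# dict, keep defaults for the rest, and ignore unrecognized keys.
# MiniBrowserConfig is a stand-in with a small subset of the real fields.
from dataclasses import dataclass, fields

@dataclass
class MiniBrowserConfig:
    browser_type: str = "chromium"
    headless: bool = True
    use_remote_browser: bool = False
    debugging_port: int = 9222

    @classmethod
    def from_kwargs(cls, kwargs: dict) -> "MiniBrowserConfig":
        allowed = {f.name for f in fields(cls)}
        return cls(**{k: v for k, v in kwargs.items() if k in allowed})
```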
crawl4ai/async_webcrawler.py
| Type | Name | Signature | Docstring |
|---|---|---|---|
| MODULE | async_webcrawler.py | `` | |
| CLASS | AsyncWebCrawler | `class AsyncWebCrawler:` | Asynchronous web crawler with flexible caching capabilities. There are two ways to use the crawler:... (truncated) |
| METHOD | AsyncWebCrawler.init | `def __init__(self, crawler_strategy=None, config=None, always_bypass_cache=False, always_by_pass_cache=None, base_directory=str(os.getenv('CRAWL4_AI_BASE_DIRECTORY', Path.home())), thread_safe=False, **kwargs):` | Initialize the AsyncWebCrawler. Args: crawler_strategy: Strategy for crawling web pages. If Non... (truncated) |
| METHOD | AsyncWebCrawler.start | `async def start(self):` | Start the crawler explicitly without using context manager. This is equivalent to using 'async with'... (truncated) |
| METHOD | AsyncWebCrawler.close | `async def close(self):` | Close the crawler explicitly without using context manager. This should be called when you're done w... (truncated) |
| METHOD | AsyncWebCrawler.aenter | `async def __aenter__(self):` | |
| METHOD | AsyncWebCrawler.aexit | `async def __aexit__(self, exc_type, exc_val, exc_tb):` | |
| METHOD | AsyncWebCrawler.awarmup | `async def awarmup(self):` | Initialize the crawler with warm-up sequence. This method: 1. Logs initialization info 2. Sets up b... (truncated) |
| METHOD | AsyncWebCrawler.nullcontext | `async def nullcontext(self):` | Asynchronous null context manager. |
| METHOD | AsyncWebCrawler.arun | `async def arun(self, url, config=None, word_count_threshold=MIN_WORD_THRESHOLD, extraction_strategy=None, chunking_strategy=RegexChunking(), content_filter=None, cache_mode=None, bypass_cache=False, disable_cache=False, no_cache_read=False, no_cache_write=False, css_selector=None, screenshot=False, pdf=False, user_agent=None, verbose=True, **kwargs):` | Runs the crawler for a single source: URL (web, local file, or raw HTML). Migration Guide: Old way ... (truncated) |
| METHOD | AsyncWebCrawler.aprocess_html | `async def aprocess_html(self, url, html, extracted_content, config, screenshot, pdf_data, verbose, **kwargs):` | Process HTML content using the provided configuration. Args: url: The URL being processed h... (truncated) |
| METHOD | AsyncWebCrawler.arun_many | `async def arun_many(self, urls, config=None, word_count_threshold=MIN_WORD_THRESHOLD, extraction_strategy=None, chunking_strategy=RegexChunking(), content_filter=None, cache_mode=None, bypass_cache=False, css_selector=None, screenshot=False, pdf=False, user_agent=None, verbose=True, **kwargs):` | Runs the crawler for multiple URLs concurrently. Migration Guide: Old way (deprecated): results... (truncated) |
| METHOD | AsyncWebCrawler.aclear_cache | `async def aclear_cache(self):` | Clear the cache database. |
| METHOD | AsyncWebCrawler.aflush_cache | `async def aflush_cache(self):` | Flush the cache database. |
| METHOD | AsyncWebCrawler.aget_cache_size | `async def aget_cache_size(self):` | Get the total number of cached items. |
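AsyncWebCrawler's docstring describes two equivalent lifecycles: explicit `start()`/`close()` and `async with`. A minimal stand-in class (not the real crawler, which warms up a browser in `start`) shows how `__aenter__`/`__aexit__` delegate to those methods:

```python
# Stand-in illustrating the dual lifecycle pattern AsyncWebCrawler documents.
import asyncio

class MiniCrawler:
    def __init__(self):
        self.ready = False

    async def start(self):
        self.ready = True   # the real crawler sets up the browser here
        return self

    async def close(self):
        self.ready = False  # the real crawler tears down the browser here

    async def __aenter__(self):
        return await self.start()

    async def __aexit__(self, exc_type, exc_val, exc_tb):
        await self.close()

async def demo() -> bool:
    # "async with" form: start/close are called automatically.
    async with MiniCrawler() as crawler:
        return crawler.ready
```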
crawl4ai/async_crawler_strategy.py
| Type | Name | Signature | Docstring |
|---|---|---|---|
| MODULE | async_crawler_strategy.py | `` | |
| CLASS | RemoteConnector | `class RemoteConnector:` | Manages the browser process and context. This class allows connecting to the browser using the CDP proto... (truncated) |
| METHOD | RemoteConnector.init | `def __init__(self, browser_type='chromium', user_data_dir=None, headless=False, logger=None, host='localhost', debugging_port=9222):` | Initialize the RemoteConnector instance. Args: browser_type (str): The type of browser to launch... (truncated) |
| METHOD | RemoteConnector.start | `async def start(self):` | Starts the browser process and returns the CDP endpoint URL. If user_data_dir is not provided, creat... (truncated) |
| METHOD | RemoteConnector._monitor_browser_process | `async def _monitor_browser_process(self):` | Monitor the browser process for unexpected termination. How it works: 1. Read stdout and stderr fro... (truncated) |
| METHOD | RemoteConnector._get_browser_path | `def _get_browser_path(self):` | Returns the browser executable path based on OS and browser type. |
| METHOD | RemoteConnector._get_browser_args | `def _get_browser_args(self):` | Returns browser-specific command line arguments. |
| METHOD | RemoteConnector.cleanup | `async def cleanup(self):` | Cleanup browser process and temporary directory. |
| CLASS | BrowserManager | `class BrowserManager:` | Manages the browser instance and context. Attributes: config (BrowserConfig): Configuration ob... (truncated) |
| METHOD | BrowserManager.init | `def __init__(self, browser_config, logger=None):` | Initialize the BrowserManager with a browser configuration. Args: browser_config (BrowserConfig... (truncated) |
| METHOD | BrowserManager.start | `async def start(self):` | Start the browser instance and set up the default context. How it works: 1. Check if Playwright is ... (truncated) |
| METHOD | BrowserManager._build_browser_args | `def _build_browser_args(self):` | Build browser launch arguments from config. |
| METHOD | BrowserManager.setup_context | `async def setup_context(self, context, crawlerRunConfig, is_default=False):` | Set up a browser context with the configured options. How it works: 1. Set extra HTTP headers if pr... (truncated) |
| METHOD | BrowserManager.create_browser_context | `async def create_browser_context(self):` | Creates and returns a new browser context with configured settings. Applies text-only mode settings ... (truncated) |
| METHOD | BrowserManager.get_page | `async def get_page(self, crawlerRunConfig):` | Get a page for the given session ID, creating a new one if needed. Args: crawlerRunConfig (Craw... (truncated) |
| METHOD | BrowserManager.kill_session | `async def kill_session(self, session_id):` | Kill a browser session and clean up resources. Args: session_id (str): The session ID to kill... (truncated) |
| METHOD | BrowserManager._cleanup_expired_sessions | `def _cleanup_expired_sessions(self):` | Clean up expired sessions based on TTL. |
| METHOD | BrowserManager.close | `async def close(self):` | Close all browser resources and clean up. |
| CLASS | AsyncCrawlerStrategy | `class AsyncCrawlerStrategy:` | Abstract base class for crawler strategies. Subclasses must implement the crawl method. |
| METHOD | AsyncCrawlerStrategy.crawl | `async def crawl(self, url, **kwargs):` | |
| CLASS | AsyncPlaywrightCrawlerStrategy | `class AsyncPlaywrightCrawlerStrategy:` | Crawler strategy using Playwright. Attributes: browser_config (BrowserConfig): Configuration ob... (truncated) |
| METHOD | AsyncPlaywrightCrawlerStrategy.init | `def __init__(self, browser_config=None, logger=None, **kwargs):` | Initialize the AsyncPlaywrightCrawlerStrategy with a browser configuration. Args: browser_confi... (truncated) |
| METHOD | AsyncPlaywrightCrawlerStrategy.aenter | `async def __aenter__(self):` | |
| METHOD | AsyncPlaywrightCrawlerStrategy.aexit | `async def __aexit__(self, exc_type, exc_val, exc_tb):` | |
| METHOD | AsyncPlaywrightCrawlerStrategy.start | `async def start(self):` | Start the browser and initialize the browser manager. |
| METHOD | AsyncPlaywrightCrawlerStrategy.close | `async def close(self):` | Close the browser and clean up resources. |
| METHOD | AsyncPlaywrightCrawlerStrategy.kill_session | `async def kill_session(self, session_id):` | Kill a browser session and clean up resources. Args: session_id (str): The ID of the session to... (truncated) |
| METHOD | AsyncPlaywrightCrawlerStrategy.set_hook | `def set_hook(self, hook_type, hook):` | Set a hook function for a specific hook type. The hook types include: - on_browser_created... (truncated) |
| METHOD | AsyncPlaywrightCrawlerStrategy.execute_hook | `async def execute_hook(self, hook_type, *args, **kwargs):` | Execute a hook function for a specific hook type. Args: hook_type (str): The type of the hook. ... (truncated) |
| METHOD | AsyncPlaywrightCrawlerStrategy.update_user_agent | `def update_user_agent(self, user_agent):` | Update the user agent for the browser. Args: user_agent (str): The new user agent string. ... (truncated) |
| METHOD | AsyncPlaywrightCrawlerStrategy.set_custom_headers | `def set_custom_headers(self, headers):` | Set custom headers for the browser. Args: headers (Dict[str, str]): A dictionary of headers to... (truncated) |
| METHOD | AsyncPlaywrightCrawlerStrategy.smart_wait | `async def smart_wait(self, page, wait_for, timeout=30000):` | Wait for a condition in a smart way. This function works as follows: 1. If wait_for starts with 'js:... (truncated) |
| METHOD | AsyncPlaywrightCrawlerStrategy.csp_compliant_wait | `async def csp_compliant_wait(self, page, user_wait_function, timeout=30000):` | Wait for a condition in a CSP-compliant way. Args: page: Playwright page object user_wait_f... (truncated) |
| METHOD | AsyncPlaywrightCrawlerStrategy.process_iframes | `async def process_iframes(self, page):` | Process iframes on a page. This function will extract the content of each iframe and replace it with... (truncated) |
| METHOD | AsyncPlaywrightCrawlerStrategy.create_session | `async def create_session(self, **kwargs):` | Creates a new browser session and returns its ID. A browser session is a unique opened page that can be r... (truncated) |
| METHOD | AsyncPlaywrightCrawlerStrategy.crawl | `async def crawl(self, url, config, **kwargs):` | Crawls a given URL or processes raw HTML/local file content based on the URL prefix. Args: url ... (truncated) |
| METHOD | AsyncPlaywrightCrawlerStrategy._crawl_web | `async def _crawl_web(self, url, config):` | Internal method to crawl web URLs with the specified configuration. Args: url (str): The web UR... (truncated) |
| METHOD | AsyncPlaywrightCrawlerStrategy._handle_full_page_scan | `async def _handle_full_page_scan(self, page, scroll_delay):` | Helper method to handle full page scanning. How it works: 1. Get the viewport height. 2. Scroll to... (truncated) |
| METHOD | AsyncPlaywrightCrawlerStrategy._handle_download | `async def _handle_download(self, download):` | Handle file downloads. How it works: 1. Get the suggested filename. 2. Get the download path. 3. Lo... (truncated) |
| METHOD | AsyncPlaywrightCrawlerStrategy.remove_overlay_elements | `async def remove_overlay_elements(self, page):` | Removes popup overlays, modals, cookie notices, and other intrusive elements from the page. Args: ... (truncated) |
| METHOD | AsyncPlaywrightCrawlerStrategy.export_pdf | `async def export_pdf(self, page):` | Exports the current page as a PDF. Args: page (Page): The Playwright page object Returns: ... (truncated) |
| METHOD | AsyncPlaywrightCrawlerStrategy.take_screenshot | `async def take_screenshot(self, page, **kwargs):` | Take a screenshot of the current page. Args: page (Page): The Playwright page object kwargs... (truncated) |
| METHOD | AsyncPlaywrightCrawlerStrategy.take_screenshot_from_pdf | `async def take_screenshot_from_pdf(self, pdf_data):` | Convert the first page of the PDF to a screenshot. Requires pdf2image and poppler. Args: ... (truncated) |
| METHOD | AsyncPlaywrightCrawlerStrategy.take_screenshot_scroller | `async def take_screenshot_scroller(self, page, **kwargs):` | Attempt to set a large viewport and take a full-page screenshot. If still too large, segment the pag... (truncated) |
| METHOD | AsyncPlaywrightCrawlerStrategy.take_screenshot_naive | `async def take_screenshot_naive(self, page):` | Takes a screenshot of the current page. Args: page (Page): The Playwright page instance Return... (truncated) |
| METHOD | AsyncPlaywrightCrawlerStrategy.export_storage_state | `async def export_storage_state(self, path=None):` | Exports the current storage state (cookies, localStorage, sessionStorage) to a JSON file at the spec... (truncated) |
| METHOD | AsyncPlaywrightCrawlerStrategy.robust_execute_user_script | `async def robust_execute_user_script(self, page, js_code):` | Executes user-provided JavaScript code with proper error handling and context, supporting both synch... (truncated) |
| METHOD | AsyncPlaywrightCrawlerStrategy.execute_user_script | `async def execute_user_script(self, page, js_code):` | Executes user-provided JavaScript code with proper error handling and context. Args: page: Play... (truncated) |
| METHOD | AsyncPlaywrightCrawlerStrategy.check_visibility | `async def check_visibility(self, page):` | Checks if an element is visible on the page. Args: page: Playwright page object Returns: ... (truncated) |
| METHOD | AsyncPlaywrightCrawlerStrategy.safe_scroll | `async def safe_scroll(self, page, x, y):` | Safely scroll the page with rendering time. Args: page: Playwright page object x: Horizonta... (truncated) |
| METHOD | AsyncPlaywrightCrawlerStrategy.csp_scroll_to | `async def csp_scroll_to(self, page, x, y):` | Performs a CSP-compliant scroll operation and returns the result status. Args: page: Playwright... (truncated) |
| METHOD | AsyncPlaywrightCrawlerStrategy.get_page_dimensions | `async def get_page_dimensions(self, page):` | Get the dimensions of the page. Args: page: Playwright page object Returns: Dict conta... (truncated) |
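The smart_wait docstring mentions a 'js:' prefix for JavaScript wait conditions; a bare string is presumably treated as a CSS selector, and the 'css:' prefix shown here is an assumption. A tiny sketch of just the dispatch step, without the Playwright calls the real method makes:

```python
# Sketch of smart_wait's condition dispatch. Returns a (kind, condition)
# pair; the real method would then call the matching Playwright wait API.
def classify_wait_condition(wait_for: str) -> tuple:
    if wait_for.startswith("js:"):
        return ("js", wait_for[3:].strip())
    if wait_for.startswith("css:"):          # assumed prefix, for symmetry
        return ("css", wait_for[4:].strip())
    return ("css", wait_for.strip())         # assume a bare CSS selector
```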
crawl4ai/content_scraping_strategy.py
| Type | Name | Signature | Docstring |
|---|---|---|---|
| MODULE | content_scraping_strategy.py | `` | |
| FUNCTION | parse_dimension | `def parse_dimension(dimension):` | |
| FUNCTION | fetch_image_file_size | `def fetch_image_file_size(img, base_url):` | |
| CLASS | ContentScrapingStrategy | `class ContentScrapingStrategy:` | |
| METHOD | ContentScrapingStrategy.scrap | `def scrap(self, url, html, **kwargs):` | |
| METHOD | ContentScrapingStrategy.ascrap | `async def ascrap(self, url, html, **kwargs):` | |
| CLASS | WebScrapingStrategy | `class WebScrapingStrategy:` | Class for web content scraping. Perhaps the most important class. How it works: 1. Extract content... (truncated) |
| METHOD | WebScrapingStrategy.init | `def __init__(self, logger=None):` | |
| METHOD | WebScrapingStrategy._log | `def _log(self, level, message, tag='SCRAPE', **kwargs):` | Helper method to safely use logger. |
| METHOD | WebScrapingStrategy.scrap | `def scrap(self, url, html, **kwargs):` | Main entry point for content scraping. Args: url (str): The URL of the page to scrape. ht... (truncated) |
| METHOD | WebScrapingStrategy.ascrap | `async def ascrap(self, url, html, **kwargs):` | Main entry point for asynchronous content scraping. Args: url (str): The URL of the page to scr... (truncated) |
| METHOD | WebScrapingStrategy._generate_markdown_content | `def _generate_markdown_content(self, cleaned_html, html, url, success, **kwargs):` | Generate markdown content from cleaned HTML. Args: cleaned_html (str): The cleaned HTML content... (truncated) |
| METHOD | WebScrapingStrategy.flatten_nested_elements | `def flatten_nested_elements(self, node):` | Flatten nested elements in an HTML tree. Args: node (Tag): The root node of the HTML tree. Retu... (truncated) |
| METHOD | WebScrapingStrategy.find_closest_parent_with_useful_text | `def find_closest_parent_with_useful_text(self, tag, **kwargs):` | Find the closest parent with useful text. Args: tag (Tag): The starting tag to search from. ... (truncated) |
| METHOD | WebScrapingStrategy.remove_unwanted_attributes | `def remove_unwanted_attributes(self, element, important_attrs, keep_data_attributes=False):` | Remove unwanted attributes from an HTML element. Args: element (Tag): The HTML element to r... (truncated) |
| METHOD | WebScrapingStrategy.process_image | `def process_image(self, img, url, index, total_images, **kwargs):` | Process an image element. How it works: 1. Check if the image has valid display and inside undesire... (truncated) |
| METHOD | WebScrapingStrategy.process_element | `def process_element(self, url, element, **kwargs):` | Process an HTML element. How it works: 1. Check if the element is an image, video, or audio. 2. Ext... (truncated) |
| METHOD | WebScrapingStrategy._process_element | `def _process_element(self, url, element, media, internal_links_dict, external_links_dict, **kwargs):` | Process an HTML element. |
| METHOD | WebScrapingStrategy._scrap | `def _scrap(self, url, html, word_count_threshold=MIN_WORD_THRESHOLD, css_selector=None, **kwargs):` | Extract content from HTML using BeautifulSoup. Args: url (str): The URL of the page to scrape. ... (truncated) |
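Helpers like parse_dimension above are only named in this digest. A plausible sketch of its behavior, assuming it pulls a numeric pixel-style value out of strings such as "100px" (the real implementation may handle more cases differently):

```python
# Hypothetical sketch of a dimension parser: extract the leading numeric
# value from a CSS-ish dimension string, or None if there is none.
import re

def parse_dimension(dimension):
    if dimension is None:
        return None
    match = re.search(r"(\d+(?:\.\d+)?)", str(dimension))
    return float(match.group(1)) if match else None
```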
crawl4ai/markdown_generation_strategy.py
| Type | Name | Signature | Docstring |
|---|---|---|---|
| MODULE | markdown_generation_strategy.py | `` | |
| FUNCTION | fast_urljoin | `def fast_urljoin(base, url):` | Fast URL joining for common cases. |
| CLASS | MarkdownGenerationStrategy | `class MarkdownGenerationStrategy:` | Abstract base class for markdown generation strategies. |
| METHOD | MarkdownGenerationStrategy.init | `def __init__(self, content_filter=None, options=None):` | |
| METHOD | MarkdownGenerationStrategy.generate_markdown | `def generate_markdown(self, cleaned_html, base_url='', html2text_options=None, content_filter=None, citations=True, **kwargs):` | Generate markdown from cleaned HTML. |
| CLASS | DefaultMarkdownGenerator | `class DefaultMarkdownGenerator:` | Default implementation of markdown generation strategy. How it works: 1. Generate raw markdown from... (truncated) |
| METHOD | DefaultMarkdownGenerator.init | `def __init__(self, content_filter=None, options=None):` | |
| METHOD | DefaultMarkdownGenerator.convert_links_to_citations | `def convert_links_to_citations(self, markdown, base_url=''):` | Convert links in markdown to citations. How it works: 1. Find all links in the markdown. 2. Convert... (truncated) |
| METHOD | DefaultMarkdownGenerator.generate_markdown | `def generate_markdown(self, cleaned_html, base_url='', html2text_options=None, options=None, content_filter=None, citations=True, **kwargs):` | Generate markdown with citations from cleaned HTML. How it works: 1. Generate raw markdown from cle... (truncated) |
crawl4ai/content_filter_strategy.py
| Type | Name | Signature | Docstring |
|---|---|---|---|
| MODULE | content_filter_strategy.py | `` | |
| CLASS | RelevantContentFilter | `class RelevantContentFilter:` | Abstract base class for content filtering strategies. |
| METHOD | RelevantContentFilter.init | `def __init__(self, user_query=None):` | |
| METHOD | RelevantContentFilter.filter_content | `def filter_content(self, html):` | Abstract method to be implemented by specific filtering strategies. |
| METHOD | RelevantContentFilter.extract_page_query | `def extract_page_query(self, soup, body):` | Common method to extract page metadata with fallbacks. |
| METHOD | RelevantContentFilter.extract_text_chunks | `def extract_text_chunks(self, body, min_word_threshold=None):` | Extracts text chunks from a BeautifulSoup body element while preserving order. Returns list of tuple... (truncated) |
| METHOD | RelevantContentFilter._deprecated_extract_text_chunks | `def _deprecated_extract_text_chunks(self, soup):` | Common method for extracting text chunks. |
| METHOD | RelevantContentFilter.is_excluded | `def is_excluded(self, tag):` | Common method for exclusion logic. |
| METHOD | RelevantContentFilter.clean_element | `def clean_element(self, tag):` | Common method for cleaning HTML elements with minimal overhead. |
| CLASS | BM25ContentFilter | `class BM25ContentFilter:` | Content filtering using BM25 algorithm with priority tag handling. How it works: 1. Extracts page m... (truncated) |
| METHOD | BM25ContentFilter.init | `def __init__(self, user_query=None, bm25_threshold=1.0, language='english'):` | Initializes the BM25ContentFilter class; if no query is provided, falls back to page metadata. Note: If no ... (truncated) |
| METHOD | BM25ContentFilter.filter_content | `def filter_content(self, html, min_word_threshold=None):` | Implements content filtering using BM25 algorithm with priority tag handling. Note: This method... (truncated) |
| CLASS | PruningContentFilter | `class PruningContentFilter:` | Content filtering using pruning algorithm with dynamic threshold. How it works: 1. Extracts page me... (truncated) |
| METHOD | PruningContentFilter.init | `def __init__(self, user_query=None, min_word_threshold=None, threshold_type='fixed', threshold=0.48):` | Initializes the PruningContentFilter class; if no query is provided, falls back to page metadata. Note: If ... (truncated) |
| METHOD | PruningContentFilter.filter_content | `def filter_content(self, html, min_word_threshold=None):` | Implements content filtering using pruning algorithm with dynamic threshold. Note: This method impl... (truncated) |
| METHOD | PruningContentFilter._remove_comments | `def _remove_comments(self, soup):` | Removes HTML comments. |
| METHOD | PruningContentFilter._remove_unwanted_tags | `def _remove_unwanted_tags(self, soup):` | Removes unwanted tags. |
| METHOD | PruningContentFilter._prune_tree | `def _prune_tree(self, node):` | Prunes the tree starting from the given node. Args: node (Tag): The node from which the pruning... (truncated) |
| METHOD | PruningContentFilter._compute_composite_score | `def _compute_composite_score(self, metrics, text_len, tag_len, link_text_len):` | Computes the composite score. |
| METHOD | PruningContentFilter._compute_class_id_weight | `def _compute_class_id_weight(self, node):` | Computes the class ID weight. |
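The inputs to _compute_composite_score (text length, tag length, link-text length) suggest a text-density/link-density heuristic. The formula below is invented for illustration only and is not the library's actual scoring:

```python
# Illustrative composite score in the spirit of PruningContentFilter:
# favor nodes with lots of text relative to markup, penalize nodes
# whose text is mostly link text. Weights and shape are assumptions.
def composite_score(text_len: int, tag_len: int, link_text_len: int) -> float:
    if text_len == 0:
        return 0.0
    link_density = link_text_len / text_len          # share of text inside links
    text_to_markup = text_len / max(tag_len, 1)      # text vs. markup volume
    return (1.0 - link_density) * min(text_to_markup, 1.0)
```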
crawl4ai/extraction_strategy.py
| Type | Name | Signature | Docstring |
|---|---|---|---|
| MODULE | extraction_strategy.py | `` | |
| CLASS | ExtractionStrategy | `class ExtractionStrategy:` | Abstract base class for all extraction strategies. |
| METHOD | ExtractionStrategy.init | `def __init__(self, input_format='markdown', **kwargs):` | Initialize the extraction strategy. Args: input_format: Content format to use for extraction. ... (truncated) |
| METHOD | ExtractionStrategy.extract | `def extract(self, url, html, *q, **kwargs):` | Extract meaningful blocks or chunks from the given HTML. :param url: The URL of the webpage. :param... (truncated) |
| METHOD | ExtractionStrategy.run | `def run(self, url, sections, *q, **kwargs):` | Process sections of text in parallel by default. :param url: The URL of the webpage. :param section... (truncated) |
| CLASS | NoExtractionStrategy | `class NoExtractionStrategy:` | A strategy that does not extract any meaningful content from the HTML. It simply returns the entire ... (truncated) |
| METHOD | NoExtractionStrategy.extract | `def extract(self, url, html, *q, **kwargs):` | Extract meaningful blocks or chunks from the given HTML. |
| METHOD | NoExtractionStrategy.run | `def run(self, url, sections, *q, **kwargs):` | |
| CLASS | LLMExtractionStrategy | `class LLMExtractionStrategy:` | A strategy that uses an LLM to extract meaningful content from the HTML. Attributes: provider: ... (truncated) |
| METHOD | LLMExtractionStrategy.init | `def __init__(self, provider=DEFAULT_PROVIDER, api_token=None, instruction=None, schema=None, extraction_type='block', **kwargs):` | Initialize the strategy with clustering parameters. Args: provider: The provider to use for ext... (truncated) |
| METHOD | LLMExtractionStrategy.extract | `def extract(self, url, ix, html):` | Extract meaningful blocks or chunks from the given HTML using an LLM. How it works: 1. Construct a ... (truncated) |
| METHOD | LLMExtractionStrategy._merge | `def _merge(self, documents, chunk_token_threshold, overlap):` | Merge documents into sections based on chunk_token_threshold and overlap. |
| METHOD | LLMExtractionStrategy.run | `def run(self, url, sections):` | Process sections sequentially with a delay for rate limiting issues, specifically for LLMExtractionS... (truncated) |
| METHOD | LLMExtractionStrategy.show_usage | `def show_usage(self):` | Print a detailed token usage report showing total and per-request usage. |
| CLASS | CosineStrategy | `class CosineStrategy:` | Extract meaningful blocks or chunks from the given HTML using cosine similarity. How it works: 1. P... (truncated) |
| METHOD | CosineStrategy.init | `def __init__(self, semantic_filter=None, word_count_threshold=10, max_dist=0.2, linkage_method='ward', top_k=3, model_name='sentence-transformers/all-MiniLM-L6-v2', sim_threshold=0.3, **kwargs):` | Initialize the strategy with clustering parameters. Args: semantic_filter (str): A keyword filt... (truncated) |
| METHOD | CosineStrategy.filter_documents_embeddings | `def filter_documents_embeddings(self, documents, semantic_filter, at_least_k=20):` | Filter and sort documents based on the cosine similarity of their embeddings with the semantic_filte... (truncated) |
| METHOD | CosineStrategy.get_embeddings | `def get_embeddings(self, sentences, batch_size=None, bypass_buffer=False):` | Get BERT embeddings for a list of sentences. Args: sentences (List[str]): A list of text chunks... (truncated) |
| METHOD | CosineStrategy.hierarchical_clustering | `def hierarchical_clustering(self, sentences, embeddings=None):` | Perform hierarchical clustering on sentences and return cluster labels. Args: sentences (List[s... (truncated) |
| METHOD | CosineStrategy.filter_clusters_by_word_count | `def filter_clusters_by_word_count(self, clusters):` | Filter clusters to remove those with a word count below the threshold. Args: clusters (Dict[int... (truncated) |
| METHOD | CosineStrategy.extract | `def extract(self, url, html, *q, **kwargs):` | Extract clusters from HTML content using hierarchical clustering. Args: url (str): The URL of t... (truncated) |
| METHOD | CosineStrategy.run | `def run(self, url, sections, *q, **kwargs):` | Process sections using hierarchical clustering. Args: url (str): The URL of the webpage. se... (truncated) |
| CLASS | JsonElementExtractionStrategy | `class JsonElementExtractionStrategy:` | Abstract base class for extracting structured JSON from HTML content. How it works: 1. ... (truncated) |
| METHOD | JsonElementExtractionStrategy.init | `def __init__(self, schema, **kwargs):` | Initialize the JSON element extraction strategy with a schema. Args: schema (Dict[str, Any]): T... (truncated) |
| METHOD | JsonElementExtractionStrategy.extract | `def extract(self, url, html_content, *q, **kwargs):` | Extract structured data from HTML content. How it works: 1. Parses the HTML content using the `_par... (truncated) |
| METHOD | JsonElementExtractionStrategy._parse_html | `def _parse_html(self, html_content):` | Parse HTML content into appropriate format. |
| METHOD | JsonElementExtractionStrategy._get_base_elements | `def _get_base_elements(self, parsed_html, selector):` | Get all base elements using the selector. |
| METHOD | JsonElementExtractionStrategy._get_elements | `def _get_elements(self, element, selector):` | Get child elements using the selector. |
| METHOD | JsonElementExtractionStrategy._extract_field | `def _extract_field(self, element, field):` | |
| METHOD | JsonElementExtractionStrategy._extract_single_field | `def _extract_single_field(self, element, field):` | Extract a single field based on its type. How it works: 1. Selects the target element using the fie... (truncated) |
| METHOD | JsonElementExtractionStrategy._extract_list_item | `def _extract_list_item(self, element, fields):` | |
| METHOD | JsonElementExtractionStrategy._extract_item | `def _extract_item(self, element, fields):` | Extracts fields from a given element. How it works: 1. Iterates through the fields defined in the s... (truncated) |
| METHOD | JsonElementExtractionStrategy._apply_transform | `def _apply_transform(self, value, transform):` | Apply a transformation to a value. How it works: 1. Checks the transformation type (e.g., `lowercas... (truncated) |
| METHOD | JsonElementExtractionStrategy._compute_field | `def _compute_field(self, item, field):` | |
| METHOD | JsonElementExtractionStrategy.run | `def run(self, url, sections, *q, **kwargs):` | Run the extraction strategy on a combined HTML content. How it works: 1. Combines multiple HTML sec... (truncated) |
| METHOD | JsonElementExtractionStrategy._get_element_text | `def _get_element_text(self, element):` | Get text content from element. |
| METHOD | JsonElementExtractionStrategy._get_element_html | `def _get_element_html(self, element):` | Get HTML content from element. |
| METHOD | JsonElementExtractionStrategy._get_element_attribute | `def _get_element_attribute(self, element, attribute):` | Get attribute value from element. |
| CLASS | JsonCssExtractionStrategy | `class JsonCssExtractionStrategy:` | Concrete implementation of JsonElementExtractionStrategy using CSS selectors. How it works: 1. Pa... (truncated) |
| METHOD | JsonCssExtractionStrategy.init | `def __init__(self, schema, **kwargs):` | |
| METHOD | JsonCssExtractionStrategy._parse_html | `def _parse_html(self, html_content):` | |
| METHOD | JsonCssExtractionStrategy._get_base_elements | `def _get_base_elements(self, parsed_html, selector):` | |
| METHOD | JsonCssExtractionStrategy._get_elements | `def _get_elements(self, element, selector):` | |
| METHOD | JsonCssExtractionStrategy._get_element_text | `def _get_element_text(self, element):` | |
| METHOD | JsonCssExtractionStrategy._get_element_html | `def _get_element_html(self, element):` | |
| METHOD | JsonCssExtractionStrategy._get_element_attribute | `def _get_element_attribute(self, element, attribute):` | |
| CLASS | JsonXPathExtractionStrategy | `class JsonXPathExtractionStrategy:` | Concrete implementation of JsonElementExtractionStrategy using XPath selectors. How it works: 1. ... (truncated) |
| METHOD | JsonXPathExtractionStrategy.init | `def __init__(self, schema, **kwargs):` | |
|
| METHOD | JsonXPathExtractionStrategy._parse_html | def _parse_html(self, html_content): |
|
| METHOD | JsonXPathExtractionStrategy._get_base_elements | def _get_base_elements(self, parsed_html, selector): |
|
| METHOD | JsonXPathExtractionStrategy._css_to_xpath | def _css_to_xpath(self, css_selector): |
Convert CSS selector to XPath if needed |
| METHOD | JsonXPathExtractionStrategy._basic_css_to_xpath | def _basic_css_to_xpath(self, css_selector): |
Basic CSS to XPath conversion for common cases |
| METHOD | JsonXPathExtractionStrategy._get_elements | def _get_elements(self, element, selector): |
|
| METHOD | JsonXPathExtractionStrategy._get_element_text | def _get_element_text(self, element): |
|
| METHOD | JsonXPathExtractionStrategy._get_element_html | def _get_element_html(self, element): |
|
| METHOD | JsonXPathExtractionStrategy._get_element_attribute | def _get_element_attribute(self, element, attribute): |
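The schema-driven flow described for `JsonElementExtractionStrategy` (select base elements, extract each field, apply named transforms, emit one JSON object per element) can be sketched without a real DOM. Everything below is illustrative: elements are plain dicts standing in for parsed HTML nodes, and the helper names mirror but do not reproduce crawl4ai's actual methods, which operate on CSS/XPath selections over parsed HTML.

```python
def apply_transform(value, transform):
    # Analogue of _apply_transform: look up a named string transformation.
    return {
        "lowercase": str.lower,
        "uppercase": str.upper,
        "strip": str.strip,
    }[transform](value)

def extract_item(element, fields):
    # Analogue of _extract_item: build one JSON object per base element.
    item = {}
    for field in fields:
        value = element.get(field["selector"])
        if value is None:
            continue  # missing field: skip rather than fail
        for transform in field.get("transform", []):
            value = apply_transform(value, transform)
        item[field["name"]] = value
    return item

def run(base_elements, schema):
    # Analogue of run: one extracted item per base element.
    return [extract_item(el, schema["fields"]) for el in base_elements]

schema = {
    "fields": [
        {"name": "title", "selector": "h2", "transform": ["strip"]},
        {"name": "tag", "selector": "span.tag", "transform": ["lowercase"]},
    ]
}
elements = [{"h2": "  First Post ", "span.tag": "NEWS"}]
print(run(elements, schema))  # [{'title': 'First Post', 'tag': 'news'}]
```

The concrete subclasses differ only in how they resolve selectors; the extraction loop itself stays selector-agnostic, which is why the base class can own it.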
crawl4ai/chunking_strategy.py
| Type | Name | Signature | Docstring |
|---|---|---|---|
| MODULE | chunking_strategy.py | `` | |
| CLASS | ChunkingStrategy | class ChunkingStrategy: | Abstract base class for chunking strategies. |
| METHOD | ChunkingStrategy.chunk | def chunk(self, text): | Abstract method to chunk the given text. Args: text (str): The text to chunk. Returns: lis... (truncated) |
| CLASS | IdentityChunking | class IdentityChunking: | Chunking strategy that returns the input text as a single chunk. |
| METHOD | IdentityChunking.chunk | def chunk(self, text): | |
| CLASS | RegexChunking | class RegexChunking: | Chunking strategy that splits text based on regular expression patterns. |
| METHOD | RegexChunking.__init__ | def __init__(self, patterns=None, **kwargs): | Initialize the RegexChunking object. Args: patterns (list): A list of regular expression patter... (truncated) |
| METHOD | RegexChunking.chunk | def chunk(self, text): | |
| CLASS | NlpSentenceChunking | class NlpSentenceChunking: | Chunking strategy that splits text into sentences using NLTK's sentence tokenizer. |
| METHOD | NlpSentenceChunking.__init__ | def __init__(self, **kwargs): | Initialize the NlpSentenceChunking object. |
| METHOD | NlpSentenceChunking.chunk | def chunk(self, text): | |
| CLASS | TopicSegmentationChunking | class TopicSegmentationChunking: | Chunking strategy that segments text into topics using NLTK's TextTilingTokenizer. How it works: 1.... (truncated) |
| METHOD | TopicSegmentationChunking.__init__ | def __init__(self, num_keywords=3, **kwargs): | Initialize the TopicSegmentationChunking object. Args: num_keywords (int): The number of keywor... (truncated) |
| METHOD | TopicSegmentationChunking.chunk | def chunk(self, text): | |
| METHOD | TopicSegmentationChunking.extract_keywords | def extract_keywords(self, text): | |
| METHOD | TopicSegmentationChunking.chunk_with_topics | def chunk_with_topics(self, text): | |
| CLASS | FixedLengthWordChunking | class FixedLengthWordChunking: | Chunking strategy that splits text into fixed-length word chunks. How it works: 1. Split the text i... (truncated) |
| METHOD | FixedLengthWordChunking.__init__ | def __init__(self, chunk_size=100, **kwargs): | Initialize the fixed-length word chunking strategy with the given chunk size. Args: chunk_size ... (truncated) |
| METHOD | FixedLengthWordChunking.chunk | def chunk(self, text): | |
| CLASS | SlidingWindowChunking | class SlidingWindowChunking: | Chunking strategy that splits text into overlapping word chunks. How it works: 1. Split the text in... (truncated) |
| METHOD | SlidingWindowChunking.__init__ | def __init__(self, window_size=100, step=50, **kwargs): | Initialize the sliding window chunking strategy with the given window size and step size. Args: ... (truncated) |
| METHOD | SlidingWindowChunking.chunk | def chunk(self, text): | |
| CLASS | OverlappingWindowChunking | class OverlappingWindowChunking: | Chunking strategy that splits text into overlapping word chunks. How it works: 1. Split the text in... (truncated) |
| METHOD | OverlappingWindowChunking.__init__ | def __init__(self, window_size=1000, overlap=100, **kwargs): | Initialize the overlapping window chunking strategy with the given window size and overlap size. Ar... (truncated) |
| METHOD | OverlappingWindowChunking.chunk | def chunk(self, text): | |
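The two window-based strategies above differ only in how the window advances: `SlidingWindowChunking` moves by a fixed `step`, while `OverlappingWindowChunking` repeats the last `overlap` words of each chunk. A minimal sketch, assuming whitespace tokenization (the function names are illustrative, not crawl4ai's API):

```python
def sliding_window_chunks(text, window_size=100, step=50):
    # Fixed-size windows advancing by `step` words.
    words = text.split()
    if len(words) <= window_size:
        return [" ".join(words)]
    return [" ".join(words[i:i + window_size])
            for i in range(0, len(words) - window_size + 1, step)]

def overlapping_window_chunks(text, window_size=1000, overlap=100):
    # Each chunk repeats the last `overlap` words of the previous one.
    words = text.split()
    chunks, start = [], 0
    while start < len(words):
        chunks.append(" ".join(words[start:start + window_size]))
        if start + window_size >= len(words):
            break
        start += window_size - overlap
    return chunks

text = " ".join(str(i) for i in range(10))
print(sliding_window_chunks(text, window_size=4, step=2))
# ['0 1 2 3', '2 3 4 5', '4 5 6 7', '6 7 8 9']
```

The overlap matters for downstream LLM extraction: it keeps a sentence that straddles a chunk boundary fully visible in at least one chunk.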
crawl4ai/user_agent_generator.py
| Type | Name | Signature | Docstring |
|---|---|---|---|
| MODULE | user_agent_generator.py | `` | |
| CLASS | UserAgentGenerator | class UserAgentGenerator: | Generate random user agents with specified constraints. Attributes: desktop_platforms (dict): A... (truncated) |
| METHOD | UserAgentGenerator.__init__ | def __init__(self): | |
| METHOD | UserAgentGenerator.get_browser_stack | def get_browser_stack(self, num_browsers=1): | Get a valid combination of browser versions. How it works: 1. Check if the number of browsers is su... (truncated) |
| METHOD | UserAgentGenerator.generate | def generate(self, device_type=None, os_type=None, device_brand=None, browser_type=None, num_browsers=3): | Generate a random user agent with specified constraints. Args: device_type: 'desktop' or 'mobil... (truncated) |
| METHOD | UserAgentGenerator.generate_with_client_hints | def generate_with_client_hints(self, **kwargs): | Generate both user agent and matching client hints |
| METHOD | UserAgentGenerator.get_random_platform | def get_random_platform(self, device_type, os_type, device_brand): | Helper method to get random platform based on constraints |
| METHOD | UserAgentGenerator.parse_user_agent | def parse_user_agent(self, user_agent): | Parse a user agent string to extract browser and version information |
| METHOD | UserAgentGenerator.generate_client_hints | def generate_client_hints(self, user_agent): | Generate Sec-CH-UA header value based on user agent string |
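As a rough illustration of what a `generate_client_hints`-style helper has to do, here is a hypothetical sketch that derives a Sec-CH-UA value from a Chrome user agent string. The brand list and the GREASE-style placeholder brand are assumptions for the example, not crawl4ai's exact output:

```python
import re

def generate_client_hints(user_agent):
    # Pull the Chrome major version out of the UA string; the Sec-CH-UA
    # brand versions must match it or the fingerprint looks inconsistent.
    match = re.search(r"Chrome/(\d+)", user_agent)
    if not match:
        return None  # non-Chromium UA: no Sec-CH-UA to emit
    major = match.group(1)
    brands = [
        ("Chromium", major),
        ("Google Chrome", major),
        ("Not_A Brand", "8"),  # illustrative GREASE-style entry
    ]
    return ", ".join(f'"{name}";v="{version}"' for name, version in brands)

ua = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
      "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36")
print(generate_client_hints(ua))
# "Chromium";v="120", "Google Chrome";v="120", "Not_A Brand";v="8"
```

Keeping the header version in lockstep with the UA string is the whole point of generating them together, which is what `generate_with_client_hints` provides.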
crawl4ai/ssl_certificate.py
| Type | Name | Signature | Docstring |
|---|---|---|---|
| MODULE | ssl_certificate.py | `` | SSL Certificate class for handling certificate operations. |
| CLASS | SSLCertificate | class SSLCertificate: | A class representing an SSL certificate with methods to export in various formats. Attributes: ... (truncated) |
| METHOD | SSLCertificate.__init__ | def __init__(self, cert_info): | |
| METHOD | SSLCertificate.from_url | def from_url(url, timeout=10): | Create SSLCertificate instance from a URL. Args: url (str): URL of the website. timeout (in... (truncated) |
| METHOD | SSLCertificate._decode_cert_data | def _decode_cert_data(data): | Helper method to decode bytes in certificate data. |
| METHOD | SSLCertificate.to_json | def to_json(self, filepath=None): | Export certificate as JSON. Args: filepath (Optional[str]): Path to save the JSON file (default... (truncated) |
| METHOD | SSLCertificate.to_pem | def to_pem(self, filepath=None): | Export certificate as PEM. Args: filepath (Optional[str]): Path to save the PEM file (default: ... (truncated) |
| METHOD | SSLCertificate.to_der | def to_der(self, filepath=None): | Export certificate as DER. Args: filepath (Optional[str]): Path to save the DER file (default: ... (truncated) |
| METHOD | SSLCertificate.issuer | def issuer(self): | Get certificate issuer information. |
| METHOD | SSLCertificate.subject | def subject(self): | Get certificate subject information. |
| METHOD | SSLCertificate.valid_from | def valid_from(self): | Get certificate validity start date. |
| METHOD | SSLCertificate.valid_until | def valid_until(self): | Get certificate validity end date. |
| METHOD | SSLCertificate.fingerprint | def fingerprint(self): | Get certificate fingerprint. |
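For context on what a `from_url`-style constructor involves, here is a standard-library sketch that performs a TLS handshake and summarizes the peer certificate. The `fetch_cert_info` / `summarize_cert` names and the returned dict layout are assumptions for the example; `SSLCertificate`'s actual `cert_info` structure may differ. Field names follow what `ssl.SSLSocket.getpeercert()` returns:

```python
import hashlib
import socket
import ssl
from urllib.parse import urlparse

def summarize_cert(der_bytes, info):
    # Pure helper: condense the decoded getpeercert() dict plus raw DER
    # bytes into a flat summary. Kept network-free so it is testable.
    return {
        "subject": info.get("subject"),
        "issuer": info.get("issuer"),
        "valid_from": info.get("notBefore"),
        "valid_until": info.get("notAfter"),
        # SHA-256 fingerprint over the DER encoding of the certificate
        "fingerprint": hashlib.sha256(der_bytes).hexdigest(),
    }

def fetch_cert_info(url, timeout=10):
    # Network half: open a TLS connection and summarize the peer cert.
    host = urlparse(url).hostname or url
    ctx = ssl.create_default_context()
    with socket.create_connection((host, 443), timeout=timeout) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            return summarize_cert(tls.getpeercert(binary_form=True),
                                  tls.getpeercert())
```

Splitting the handshake from the summarization mirrors why `from_url` is separate from the export methods: once `cert_info` is captured, every export format can be produced offline.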