Implement initial MVP for Docker-based browser management in Crawl4ai, enabling
remote browser execution in containerized environments.
Key Changes:
- Add browser_farm module with Docker support components:
* BrowserFarmService: Manages browser endpoints
* DockerBrowser: Handles Docker browser communication
* Basic health check implementation
* Dockerfile with optimized Chrome/Playwright setup:
- Based on python:3.10-slim for minimal size
- Includes all required system dependencies
- Auto-installs crawl4ai and sets up Playwright
- Configures Chrome with remote debugging
- Uses socat for port forwarding (9223)
- Update core components:
* Rename use_managed_browser to use_remote_browser for clarity
* Modify BrowserManager to support Docker mode
* Add Docker configuration in BrowserConfig
* Update context handling for remote browsers
- Add example:
* hello_world_docker.py demonstrating Docker browser usage
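The Dockerfile bullets above can be sketched as a minimal build file. This is an illustrative reconstruction, not the shipped Dockerfile: the exact package list, browser binary name, and flags are assumptions; only the base image, the 9222→9223 socat forward, and the headless/no-sandbox/GPU-disabled flags come from the notes above. A common reason for the socat hop is that Chrome binds its remote-debugging port to 127.0.0.1 only, so a relay is needed to expose it outside the container.

```dockerfile
FROM python:3.10-slim

# System dependencies for headless Chromium, plus socat for port forwarding
# (package names here are assumptions)
RUN apt-get update && apt-get install -y --no-install-recommends \
        socat ca-certificates fonts-liberation \
    && rm -rf /var/lib/apt/lists/*

# Install crawl4ai and set up its Playwright browser
RUN pip install --no-cache-dir crawl4ai \
    && playwright install --with-deps chromium

# Chrome listens on 127.0.0.1:9222 inside the container;
# socat re-exposes it on 9223 so the host can reach it
EXPOSE 9223
CMD sh -c "chromium --headless --disable-gpu --no-sandbox \
        --remote-debugging-port=9222 & \
    socat TCP-LISTEN:9223,fork,reuseaddr TCP:127.0.0.1:9222"
```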
Technical Details:
- Docker container exposes port 9223 (mapped to host:9333)
- Uses CDP (Chrome DevTools Protocol) for remote connection
- Maintains compatibility with existing managed browser features
- Simplified endpoint management for MVP phase
- Optimized Docker setup:
* Minimal dependencies installation
* Proper Chrome flags for containerized environment
* Headless mode with GPU disabled
* Security considerations (no-sandbox mode)
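The CDP connection described above starts from Chrome's HTTP discovery endpoint, `/json/version`, whose response carries the browser-level WebSocket URL. A minimal stdlib sketch (the `localhost:9333` default follows the host-port mapping noted above; crawl4ai's own connection code may differ):

```python
import json

def cdp_version_url(host: str = "localhost", port: int = 9333) -> str:
    # host:9333 maps to the container's socat-forwarded port 9223
    return f"http://{host}:{port}/json/version"

def extract_ws_endpoint(payload: str) -> str:
    """Pull the browser-level WebSocket URL out of a /json/version response."""
    return json.loads(payload)["webSocketDebuggerUrl"]

# Abbreviated example of a /json/version response body
sample = ('{"Browser": "Chrome/116.0", '
          '"webSocketDebuggerUrl": "ws://localhost:9333/devtools/browser/abc"}')
ws = extract_ws_endpoint(sample)
```

A CDP client (Playwright in this case) then attaches to that WebSocket URL to drive the remote browser.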
Testing:
- Extensive Docker configuration testing and optimization
- Verified with hello_world_docker.py example
- Confirmed remote browser connection and crawling functionality
- Tested basic health checks
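A basic health check of the kind tested above usually amounts to probing the CDP HTTP endpoint. A hedged, stdlib-only sketch (not the BrowserFarmService implementation; the endpoint URL is an assumption):

```python
from urllib.error import URLError
from urllib.request import urlopen

def browser_is_healthy(endpoint: str, timeout: float = 2.0) -> bool:
    """Treat the browser as healthy iff its CDP HTTP endpoint answers 200.

    Example endpoint (assumed): http://localhost:9333/json/version
    """
    try:
        with urlopen(endpoint, timeout=timeout) as resp:
            return resp.status == 200
    except (URLError, OSError, ValueError):
        # Connection refused, DNS failure, timeout, malformed URL, ...
        return False
```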
This is the first step towards a scalable browser farm solution, laying the
foundation for future enhancements such as resource monitoring, multiple
browser instances, and container lifecycle management.
## `crawl4ai/models.py`

| Type | Name | Signature | Docstring |
| ------ | -------------------------- | ---------------------------------- | --------------------------- |
| MODULE | models.py | `` | |
| CLASS | TokenUsage | `class TokenUsage:` | |
| CLASS | UrlModel | `class UrlModel:` | |
| CLASS | MarkdownGenerationResult | `class MarkdownGenerationResult:` | |
| CLASS | CrawlResult | `class CrawlResult:` | |
| CLASS | AsyncCrawlResponse | `class AsyncCrawlResponse:` | |
## `crawl4ai/async_configs.py`

| Type | Name | Signature | Docstring |
| ------ | -------------------------- | ---------------------------------- | --------------------------- |
| MODULE | async_configs.py | `` | |
| CLASS | BrowserConfig | `class BrowserConfig:` | Configuration class for setting up a browser instance and its context in AsyncPlaywrightCrawlerStrat... (truncated) |
| METHOD | BrowserConfig.__init__ | `def __init__(self, browser_type='chromium', headless=True, use_remote_browser=False, use_persistent_context=False, user_data_dir=None, chrome_channel='chrome', proxy=None, proxy_config=None, viewport_width=1080, viewport_height=600, accept_downloads=False, downloads_path=None, storage_state=None, ignore_https_errors=True, java_script_enabled=True, sleep_on_close=False, verbose=True, cookies=None, headers=None, user_agent='Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:109.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.5845.187 Safari/604.1 Edg/117.0.2045.47', user_agent_mode=None, user_agent_generator_config=None, text_mode=False, light_mode=False, extra_args=None, debugging_port=9222):` | |
| METHOD | BrowserConfig.from_kwargs | `def from_kwargs(kwargs):` | |
| CLASS | CrawlerRunConfig | `class CrawlerRunConfig:` | Configuration class for controlling how the crawler runs each crawl operation. This includes paramet... (truncated) |
| METHOD | CrawlerRunConfig.__init__ | `def __init__(self, word_count_threshold=MIN_WORD_THRESHOLD, extraction_strategy=None, chunking_strategy=None, markdown_generator=None, content_filter=None, only_text=False, css_selector=None, excluded_tags=None, excluded_selector=None, keep_data_attributes=False, remove_forms=False, prettiify=False, parser_type='lxml', fetch_ssl_certificate=False, cache_mode=None, session_id=None, bypass_cache=False, disable_cache=False, no_cache_read=False, no_cache_write=False, wait_until='domcontentloaded', page_timeout=PAGE_TIMEOUT, wait_for=None, wait_for_images=True, delay_before_return_html=0.1, mean_delay=0.1, max_range=0.3, semaphore_count=5, js_code=None, js_only=False, ignore_body_visibility=True, scan_full_page=False, scroll_delay=0.2, process_iframes=False, remove_overlay_elements=False, simulate_user=False, override_navigator=False, magic=False, adjust_viewport_to_content=False, screenshot=False, screenshot_wait_for=None, screenshot_height_threshold=SCREENSHOT_HEIGHT_TRESHOLD, pdf=False, image_description_min_word_threshold=IMAGE_DESCRIPTION_MIN_WORD_THRESHOLD, image_score_threshold=IMAGE_SCORE_THRESHOLD, exclude_external_images=False, exclude_social_media_domains=None, exclude_external_links=False, exclude_social_media_links=False, exclude_domains=None, verbose=True, log_console=False, url=None):` | |
| METHOD | CrawlerRunConfig.from_kwargs | `def from_kwargs(kwargs):` | |
| METHOD | CrawlerRunConfig.to_dict | `def to_dict(self):` | |
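Both config classes expose a `from_kwargs` constructor. crawl4ai's actual implementation is not shown here, but the pattern such a classmethod typically follows is "filter the dict to accepted parameters, then forward". A generic sketch (class and defaults are illustrative, not crawl4ai's):

```python
import inspect

class ConfigSketch:
    """Stand-in for a config class with a from_kwargs-style constructor."""

    def __init__(self, headless=True, debugging_port=9222, verbose=True):
        self.headless = headless
        self.debugging_port = debugging_port
        self.verbose = verbose

    @classmethod
    def from_kwargs(cls, kwargs):
        # Keep only keys that __init__ actually accepts; ignore the rest,
        # so callers can pass a mixed bag of options safely.
        accepted = set(inspect.signature(cls.__init__).parameters) - {"self"}
        return cls(**{k: v for k, v in kwargs.items() if k in accepted})

cfg = ConfigSketch.from_kwargs({"headless": False, "unrelated_option": 1})
```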
## `crawl4ai/async_webcrawler.py`

| Type | Name | Signature | Docstring |
| ------ | -------------------------- | ---------------------------------- | --------------------------- |
| MODULE | async_webcrawler.py | `` | |
| CLASS | AsyncWebCrawler | `class AsyncWebCrawler:` | Asynchronous web crawler with flexible caching capabilities. There are two ways to use the crawler:... (truncated) |
| METHOD | AsyncWebCrawler.__init__ | `def __init__(self, crawler_strategy=None, config=None, always_bypass_cache=False, always_by_pass_cache=None, base_directory=str(os.getenv('CRAWL4_AI_BASE_DIRECTORY', Path.home())), thread_safe=False, **kwargs):` | Initialize the AsyncWebCrawler. Args: crawler_strategy: Strategy for crawling web pages. If Non... (truncated) |
| METHOD | AsyncWebCrawler.start | `async def start(self):` | Start the crawler explicitly without using context manager. This is equivalent to using 'async with'... (truncated) |
| METHOD | AsyncWebCrawler.close | `async def close(self):` | Close the crawler explicitly without using context manager. This should be called when you're done w... (truncated) |
| METHOD | AsyncWebCrawler.__aenter__ | `async def __aenter__(self):` | |
| METHOD | AsyncWebCrawler.__aexit__ | `async def __aexit__(self, exc_type, exc_val, exc_tb):` | |
| METHOD | AsyncWebCrawler.awarmup | `async def awarmup(self):` | Initialize the crawler with warm-up sequence. This method: 1. Logs initialization info 2. Sets up b... (truncated) |
| METHOD | AsyncWebCrawler.nullcontext | `async def nullcontext(self):` | Asynchronous null context manager |
| METHOD | AsyncWebCrawler.arun | `async def arun(self, url, config=None, word_count_threshold=MIN_WORD_THRESHOLD, extraction_strategy=None, chunking_strategy=RegexChunking(), content_filter=None, cache_mode=None, bypass_cache=False, disable_cache=False, no_cache_read=False, no_cache_write=False, css_selector=None, screenshot=False, pdf=False, user_agent=None, verbose=True, **kwargs):` | Runs the crawler for a single source: URL (web, local file, or raw HTML). Migration Guide: Old way ... (truncated) |
| METHOD | AsyncWebCrawler.aprocess_html | `async def aprocess_html(self, url, html, extracted_content, config, screenshot, pdf_data, verbose, **kwargs):` | Process HTML content using the provided configuration. Args: url: The URL being processed h... (truncated) |
| METHOD | AsyncWebCrawler.arun_many | `async def arun_many(self, urls, config=None, word_count_threshold=MIN_WORD_THRESHOLD, extraction_strategy=None, chunking_strategy=RegexChunking(), content_filter=None, cache_mode=None, bypass_cache=False, css_selector=None, screenshot=False, pdf=False, user_agent=None, verbose=True, **kwargs):` | Runs the crawler for multiple URLs concurrently. Migration Guide: Old way (deprecated): results... (truncated) |
| METHOD | AsyncWebCrawler.aclear_cache | `async def aclear_cache(self):` | Clear the cache database. |
| METHOD | AsyncWebCrawler.aflush_cache | `async def aflush_cache(self):` | Flush the cache database. |
| METHOD | AsyncWebCrawler.aget_cache_size | `async def aget_cache_size(self):` | Get the total number of cached items. |
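The AsyncWebCrawler docstrings describe two usage styles: `async with`, or explicit `start()`/`close()`. The duality rests on `__aenter__`/`__aexit__` delegating to `start`/`close`, which a generic sketch (not crawl4ai's real class) can illustrate:

```python
import asyncio

class CrawlerLike:
    """Minimal sketch of the start()/close() vs. 'async with' duality."""

    def __init__(self):
        self.ready = False

    async def start(self):
        self.ready = True          # would launch the browser here
        return self

    async def close(self):
        self.ready = False         # would release browser resources here

    async def __aenter__(self):
        return await self.start()  # 'async with' is just start()...

    async def __aexit__(self, exc_type, exc, tb):
        await self.close()         # ...followed by close() on exit

async def main() -> bool:
    async with CrawlerLike() as crawler:   # style 1: context manager
        assert crawler.ready
    explicit = CrawlerLike()               # style 2: explicit lifecycle
    await explicit.start()
    await explicit.close()
    return True

ok = asyncio.run(main())
```

The explicit style suits long-lived services where one crawler instance outlives many requests.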
## `crawl4ai/async_crawler_strategy.py`

| Type | Name | Signature | Docstring |
| ------ | -------------------------- | ---------------------------------- | --------------------------- |
| MODULE | async_crawler_strategy.py | `` | |
| CLASS | RemoteConnector | `class RemoteConnector:` | Manages the browser process and context. This class allows to connect to the browser using CDP proto... (truncated) |
| METHOD | RemoteConnector.__init__ | `def __init__(self, browser_type='chromium', user_data_dir=None, headless=False, logger=None, host='localhost', debugging_port=9222):` | Initialize the RemoteConnector instance. Args: browser_type (str): The type of browser to launch... (truncated) |
| METHOD | RemoteConnector.start | `async def start(self):` | Starts the browser process and returns the CDP endpoint URL. If user_data_dir is not provided, creat... (truncated) |
| METHOD | RemoteConnector._monitor_browser_process | `async def _monitor_browser_process(self):` | Monitor the browser process for unexpected termination. How it works: 1. Read stdout and stderr fro... (truncated) |
| METHOD | RemoteConnector._get_browser_path | `def _get_browser_path(self):` | Returns the browser executable path based on OS and browser type |
| METHOD | RemoteConnector._get_browser_args | `def _get_browser_args(self):` | Returns browser-specific command line arguments |
| METHOD | RemoteConnector.cleanup | `async def cleanup(self):` | Cleanup browser process and temporary directory |
| CLASS | BrowserManager | `class BrowserManager:` | Manages the browser instance and context. Attributes: config (BrowserConfig): Configuration ob... (truncated) |
| METHOD | BrowserManager.__init__ | `def __init__(self, browser_config, logger=None):` | Initialize the BrowserManager with a browser configuration. Args: browser_config (BrowserConfig... (truncated) |
| METHOD | BrowserManager.start | `async def start(self):` | Start the browser instance and set up the default context. How it works: 1. Check if Playwright is ... (truncated) |
| METHOD | BrowserManager._build_browser_args | `def _build_browser_args(self):` | Build browser launch arguments from config. |
| METHOD | BrowserManager.setup_context | `async def setup_context(self, context, crawlerRunConfig, is_default=False):` | Set up a browser context with the configured options. How it works: 1. Set extra HTTP headers if pr... (truncated) |
| METHOD | BrowserManager.create_browser_context | `async def create_browser_context(self):` | Creates and returns a new browser context with configured settings. Applies text-only mode settings ... (truncated) |
| METHOD | BrowserManager.get_page | `async def get_page(self, crawlerRunConfig):` | Get a page for the given session ID, creating a new one if needed. Args: crawlerRunConfig (Craw... (truncated) |
| METHOD | BrowserManager.kill_session | `async def kill_session(self, session_id):` | Kill a browser session and clean up resources. Args: session_id (str): The session ID to kill... (truncated) |
| METHOD | BrowserManager._cleanup_expired_sessions | `def _cleanup_expired_sessions(self):` | Clean up expired sessions based on TTL. |
| METHOD | BrowserManager.close | `async def close(self):` | Close all browser resources and clean up. |
| CLASS | AsyncCrawlerStrategy | `class AsyncCrawlerStrategy:` | Abstract base class for crawler strategies. Subclasses must implement the crawl method. |
| METHOD | AsyncCrawlerStrategy.crawl | `async def crawl(self, url, **kwargs):` | |
| CLASS | AsyncPlaywrightCrawlerStrategy | `class AsyncPlaywrightCrawlerStrategy:` | Crawler strategy using Playwright. Attributes: browser_config (BrowserConfig): Configuration ob... (truncated) |
| METHOD | AsyncPlaywrightCrawlerStrategy.__init__ | `def __init__(self, browser_config=None, logger=None, **kwargs):` | Initialize the AsyncPlaywrightCrawlerStrategy with a browser configuration. Args: browser_confi... (truncated) |
| METHOD | AsyncPlaywrightCrawlerStrategy.__aenter__ | `async def __aenter__(self):` | |
| METHOD | AsyncPlaywrightCrawlerStrategy.__aexit__ | `async def __aexit__(self, exc_type, exc_val, exc_tb):` | |
| METHOD | AsyncPlaywrightCrawlerStrategy.start | `async def start(self):` | Start the browser and initialize the browser manager. |
| METHOD | AsyncPlaywrightCrawlerStrategy.close | `async def close(self):` | Close the browser and clean up resources. |
| METHOD | AsyncPlaywrightCrawlerStrategy.kill_session | `async def kill_session(self, session_id):` | Kill a browser session and clean up resources. Args: session_id (str): The ID of the session to... (truncated) |
| METHOD | AsyncPlaywrightCrawlerStrategy.set_hook | `def set_hook(self, hook_type, hook):` | Set a hook function for a specific hook type. Following are list of hook types: - on_browser_created... (truncated) |
| METHOD | AsyncPlaywrightCrawlerStrategy.execute_hook | `async def execute_hook(self, hook_type, *args, **kwargs):` | Execute a hook function for a specific hook type. Args: hook_type (str): The type of the hook. ... (truncated) |
| METHOD | AsyncPlaywrightCrawlerStrategy.update_user_agent | `def update_user_agent(self, user_agent):` | Update the user agent for the browser. Args: user_agent (str): The new user agent string. ... (truncated) |
| METHOD | AsyncPlaywrightCrawlerStrategy.set_custom_headers | `def set_custom_headers(self, headers):` | Set custom headers for the browser. Args: headers (Dict[str, str]): A dictionary of headers to... (truncated) |
| METHOD | AsyncPlaywrightCrawlerStrategy.smart_wait | `async def smart_wait(self, page, wait_for, timeout=30000):` | Wait for a condition in a smart way. This functions works as below: 1. If wait_for starts with 'js:... (truncated) |
| METHOD | AsyncPlaywrightCrawlerStrategy.csp_compliant_wait | `async def csp_compliant_wait(self, page, user_wait_function, timeout=30000):` | Wait for a condition in a CSP-compliant way. Args: page: Playwright page object user_wait_f... (truncated) |
| METHOD | AsyncPlaywrightCrawlerStrategy.process_iframes | `async def process_iframes(self, page):` | Process iframes on a page. This function will extract the content of each iframe and replace it with... (truncated) |
| METHOD | AsyncPlaywrightCrawlerStrategy.create_session | `async def create_session(self, **kwargs):` | Creates a new browser session and returns its ID. A browse session is a unique openned page can be r... (truncated) |
| METHOD | AsyncPlaywrightCrawlerStrategy.crawl | `async def crawl(self, url, config, **kwargs):` | Crawls a given URL or processes raw HTML/local file content based on the URL prefix. Args: url ... (truncated) |
| METHOD | AsyncPlaywrightCrawlerStrategy._crawl_web | `async def _crawl_web(self, url, config):` | Internal method to crawl web URLs with the specified configuration. Args: url (str): The web UR... (truncated) |
| METHOD | AsyncPlaywrightCrawlerStrategy._handle_full_page_scan | `async def _handle_full_page_scan(self, page, scroll_delay):` | Helper method to handle full page scanning. How it works: 1. Get the viewport height. 2. Scroll to... (truncated) |
| METHOD | AsyncPlaywrightCrawlerStrategy._handle_download | `async def _handle_download(self, download):` | Handle file downloads. How it works: 1. Get the suggested filename. 2. Get the download path. 3. Lo... (truncated) |
| METHOD | AsyncPlaywrightCrawlerStrategy.remove_overlay_elements | `async def remove_overlay_elements(self, page):` | Removes popup overlays, modals, cookie notices, and other intrusive elements from the page. Args: ... (truncated) |
| METHOD | AsyncPlaywrightCrawlerStrategy.export_pdf | `async def export_pdf(self, page):` | Exports the current page as a PDF. Args: page (Page): The Playwright page object Returns: ... (truncated) |
| METHOD | AsyncPlaywrightCrawlerStrategy.take_screenshot | `async def take_screenshot(self, page, **kwargs):` | Take a screenshot of the current page. Args: page (Page): The Playwright page object kwargs... (truncated) |
| METHOD | AsyncPlaywrightCrawlerStrategy.take_screenshot_from_pdf | `async def take_screenshot_from_pdf(self, pdf_data):` | Convert the first page of the PDF to a screenshot. Requires pdf2image and poppler. Args: ... (truncated) |
| METHOD | AsyncPlaywrightCrawlerStrategy.take_screenshot_scroller | `async def take_screenshot_scroller(self, page, **kwargs):` | Attempt to set a large viewport and take a full-page screenshot. If still too large, segment the pag... (truncated) |
| METHOD | AsyncPlaywrightCrawlerStrategy.take_screenshot_naive | `async def take_screenshot_naive(self, page):` | Takes a screenshot of the current page. Args: page (Page): The Playwright page instance Return... (truncated) |
| METHOD | AsyncPlaywrightCrawlerStrategy.export_storage_state | `async def export_storage_state(self, path=None):` | Exports the current storage state (cookies, localStorage, sessionStorage) to a JSON file at the spec... (truncated) |
| METHOD | AsyncPlaywrightCrawlerStrategy.robust_execute_user_script | `async def robust_execute_user_script(self, page, js_code):` | Executes user-provided JavaScript code with proper error handling and context, supporting both synch... (truncated) |
| METHOD | AsyncPlaywrightCrawlerStrategy.execute_user_script | `async def execute_user_script(self, page, js_code):` | Executes user-provided JavaScript code with proper error handling and context. Args: page: Play... (truncated) |
| METHOD | AsyncPlaywrightCrawlerStrategy.check_visibility | `async def check_visibility(self, page):` | Checks if an element is visible on the page. Args: page: Playwright page object Returns: ... (truncated) |
| METHOD | AsyncPlaywrightCrawlerStrategy.safe_scroll | `async def safe_scroll(self, page, x, y):` | Safely scroll the page with rendering time. Args: page: Playwright page object x: Horizonta... (truncated) |
| METHOD | AsyncPlaywrightCrawlerStrategy.csp_scroll_to | `async def csp_scroll_to(self, page, x, y):` | Performs a CSP-compliant scroll operation and returns the result status. Args: page: Playwright... (truncated) |
| METHOD | AsyncPlaywrightCrawlerStrategy.get_page_dimensions | `async def get_page_dimensions(self, page):` | Get the dimensions of the page. Args: page: Playwright page object Returns: Dict conta... (truncated) |
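The `smart_wait` docstring describes a prefix-based dispatch for the `wait_for` condition, starting with `js:`. The branching can be sketched as pure string logic; the `css:` branch and the fallback heuristic below are assumptions drawn from how such conditions are conventionally distinguished, not a copy of crawl4ai's code:

```python
def classify_wait_condition(wait_for: str) -> tuple[str, str]:
    """Classify a wait_for string as a JS predicate or a CSS selector."""
    if wait_for.startswith("js:"):
        return ("js", wait_for[3:].strip())
    if wait_for.startswith("css:"):          # assumed companion prefix
        return ("css", wait_for[4:].strip())
    # No prefix: guess. Function-looking strings are treated as JS,
    # everything else as a CSS selector.
    stripped = wait_for.strip()
    kind = "js" if stripped.startswith(("()", "function")) else "css"
    return (kind, stripped)
```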
## `crawl4ai/content_scraping_strategy.py`

| Type | Name | Signature | Docstring |
| ------ | -------------------------- | ---------------------------------- | --------------------------- |
| MODULE | content_scraping_strategy.py | `` | |
| FUNCTION | parse_dimension | `def parse_dimension(dimension):` | |
| FUNCTION | fetch_image_file_size | `def fetch_image_file_size(img, base_url):` | |
| CLASS | ContentScrapingStrategy | `class ContentScrapingStrategy:` | |
| METHOD | ContentScrapingStrategy.scrap | `def scrap(self, url, html, **kwargs):` | |
| METHOD | ContentScrapingStrategy.ascrap | `async def ascrap(self, url, html, **kwargs):` | |
| CLASS | WebScrapingStrategy | `class WebScrapingStrategy:` | Class for web content scraping. Perhaps the most important class. How it works: 1. Extract content... (truncated) |
| METHOD | WebScrapingStrategy.__init__ | `def __init__(self, logger=None):` | |
| METHOD | WebScrapingStrategy._log | `def _log(self, level, message, tag='SCRAPE', **kwargs):` | Helper method to safely use logger. |
| METHOD | WebScrapingStrategy.scrap | `def scrap(self, url, html, **kwargs):` | Main entry point for content scraping. Args: url (str): The URL of the page to scrape. ht... (truncated) |
| METHOD | WebScrapingStrategy.ascrap | `async def ascrap(self, url, html, **kwargs):` | Main entry point for asynchronous content scraping. Args: url (str): The URL of the page to scr... (truncated) |
| METHOD | WebScrapingStrategy._generate_markdown_content | `def _generate_markdown_content(self, cleaned_html, html, url, success, **kwargs):` | Generate markdown content from cleaned HTML. Args: cleaned_html (str): The cleaned HTML content... (truncated) |
| METHOD | WebScrapingStrategy.flatten_nested_elements | `def flatten_nested_elements(self, node):` | Flatten nested elements in a HTML tree. Args: node (Tag): The root node of the HTML tree. Retu... (truncated) |
| METHOD | WebScrapingStrategy.find_closest_parent_with_useful_text | `def find_closest_parent_with_useful_text(self, tag, **kwargs):` | Find the closest parent with useful text. Args: tag (Tag): The starting tag to search from. ... (truncated) |
| METHOD | WebScrapingStrategy.remove_unwanted_attributes | `def remove_unwanted_attributes(self, element, important_attrs, keep_data_attributes=False):` | Remove unwanted attributes from an HTML element. Args: element (Tag): The HTML element to r... (truncated) |
| METHOD | WebScrapingStrategy.process_image | `def process_image(self, img, url, index, total_images, **kwargs):` | Process an image element. How it works: 1. Check if the image has valid display and inside undesire... (truncated) |
| METHOD | WebScrapingStrategy.process_element | `def process_element(self, url, element, **kwargs):` | Process an HTML element. How it works: 1. Check if the element is an image, video, or audio. 2. Ext... (truncated) |
| METHOD | WebScrapingStrategy._process_element | `def _process_element(self, url, element, media, internal_links_dict, external_links_dict, **kwargs):` | Process an HTML element. |
| METHOD | WebScrapingStrategy._scrap | `def _scrap(self, url, html, word_count_threshold=MIN_WORD_THRESHOLD, css_selector=None, **kwargs):` | Extract content from HTML using BeautifulSoup. Args: url (str): The URL of the page to scrape. ... (truncated) |
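`_process_element` takes separate `internal_links_dict` and `external_links_dict` parameters, implying links are partitioned by whether they share the page's host. A stdlib sketch of that split (illustrative only, not crawl4ai's implementation):

```python
from urllib.parse import urljoin, urlparse

def partition_links(base_url: str, hrefs: list[str]) -> tuple[list[str], list[str]]:
    """Split hrefs into (internal, external) relative to base_url's host."""
    base_host = urlparse(base_url).netloc
    internal: list[str] = []
    external: list[str] = []
    for href in hrefs:
        absolute = urljoin(base_url, href)  # resolve relative links first
        bucket = internal if urlparse(absolute).netloc == base_host else external
        bucket.append(absolute)
    return internal, external

internal, external = partition_links(
    "https://example.com/a", ["/b", "https://other.org/c"]
)
```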
## `crawl4ai/markdown_generation_strategy.py`

| Type | Name | Signature | Docstring |
| ------ | -------------------------- | ---------------------------------- | --------------------------- |
| MODULE | markdown_generation_strategy.py | `` | |
| FUNCTION | fast_urljoin | `def fast_urljoin(base, url):` | Fast URL joining for common cases. |
| CLASS | MarkdownGenerationStrategy | `class MarkdownGenerationStrategy:` | Abstract base class for markdown generation strategies. |
| METHOD | MarkdownGenerationStrategy.__init__ | `def __init__(self, content_filter=None, options=None):` | |
| METHOD | MarkdownGenerationStrategy.generate_markdown | `def generate_markdown(self, cleaned_html, base_url='', html2text_options=None, content_filter=None, citations=True, **kwargs):` | Generate markdown from cleaned HTML. |
| CLASS | DefaultMarkdownGenerator | `class DefaultMarkdownGenerator:` | Default implementation of markdown generation strategy. How it works: 1. Generate raw markdown from... (truncated) |
| METHOD | DefaultMarkdownGenerator.__init__ | `def __init__(self, content_filter=None, options=None):` | |
| METHOD | DefaultMarkdownGenerator.convert_links_to_citations | `def convert_links_to_citations(self, markdown, base_url=''):` | Convert links in markdown to citations. How it works: 1. Find all links in the markdown. 2. Convert... (truncated) |
| METHOD | DefaultMarkdownGenerator.generate_markdown | `def generate_markdown(self, cleaned_html, base_url='', html2text_options=None, options=None, content_filter=None, citations=True, **kwargs):` | Generate markdown with citations from cleaned HTML. How it works: 1. Generate raw markdown from cle... (truncated) |
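`convert_links_to_citations` finds inline links and turns them into numbered citations. An illustrative take on that idea, deduplicating repeated URLs into one reference number (the marker style and regex here are assumptions, not crawl4ai's output format):

```python
import re

def links_to_citations(markdown: str) -> tuple[str, str]:
    """Replace [text](url) links with numbered markers and a reference list."""
    refs: list[str] = []

    def repl(m: re.Match) -> str:
        text, url = m.group(1), m.group(2)
        if url not in refs:
            refs.append(url)          # first sighting gets the next number
        return f"{text}[{refs.index(url) + 1}]"

    body = re.sub(r"\[([^\]]+)\]\(([^)\s]+)\)", repl, markdown)
    ref_block = "\n".join(f"[{i + 1}]: {u}" for i, u in enumerate(refs))
    return body, ref_block

body, ref_block = links_to_citations(
    "See [docs](https://example.com) and [docs again](https://example.com)."
)
```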
## `crawl4ai/content_filter_strategy.py`

| Type | Name | Signature | Docstring |
| ------ | -------------------------- | ---------------------------------- | --------------------------- |
| MODULE | content_filter_strategy.py | `` | |
| CLASS | RelevantContentFilter | `class RelevantContentFilter:` | Abstract base class for content filtering strategies |
| METHOD | RelevantContentFilter.__init__ | `def __init__(self, user_query=None):` | |
| METHOD | RelevantContentFilter.filter_content | `def filter_content(self, html):` | Abstract method to be implemented by specific filtering strategies |
| METHOD | RelevantContentFilter.extract_page_query | `def extract_page_query(self, soup, body):` | Common method to extract page metadata with fallbacks |
| METHOD | RelevantContentFilter.extract_text_chunks | `def extract_text_chunks(self, body, min_word_threshold=None):` | Extracts text chunks from a BeautifulSoup body element while preserving order. Returns list of tuple... (truncated) |
| METHOD | RelevantContentFilter._deprecated_extract_text_chunks | `def _deprecated_extract_text_chunks(self, soup):` | Common method for extracting text chunks |
| METHOD | RelevantContentFilter.is_excluded | `def is_excluded(self, tag):` | Common method for exclusion logic |
| METHOD | RelevantContentFilter.clean_element | `def clean_element(self, tag):` | Common method for cleaning HTML elements with minimal overhead |
| CLASS | BM25ContentFilter | `class BM25ContentFilter:` | Content filtering using BM25 algorithm with priority tag handling. How it works: 1. Extracts page m... (truncated) |
| METHOD | BM25ContentFilter.__init__ | `def __init__(self, user_query=None, bm25_threshold=1.0, language='english'):` | Initializes the BM25ContentFilter class, if not provided, falls back to page metadata. Note: If no ... (truncated) |
| METHOD | BM25ContentFilter.filter_content | `def filter_content(self, html, min_word_threshold=None):` | Implements content filtering using BM25 algorithm with priority tag handling. Note: This method... (truncated) |
| CLASS | PruningContentFilter | `class PruningContentFilter:` | Content filtering using pruning algorithm with dynamic threshold. How it works: 1. Extracts page me... (truncated) |
| METHOD | PruningContentFilter.__init__ | `def __init__(self, user_query=None, min_word_threshold=None, threshold_type='fixed', threshold=0.48):` | Initializes the PruningContentFilter class, if not provided, falls back to page metadata. Note: If ... (truncated) |
| METHOD | PruningContentFilter.filter_content | `def filter_content(self, html, min_word_threshold=None):` | Implements content filtering using pruning algorithm with dynamic threshold. Note: This method impl... (truncated) |
| METHOD | PruningContentFilter._remove_comments | `def _remove_comments(self, soup):` | Removes HTML comments |
| METHOD | PruningContentFilter._remove_unwanted_tags | `def _remove_unwanted_tags(self, soup):` | Removes unwanted tags |
| METHOD | PruningContentFilter._prune_tree | `def _prune_tree(self, node):` | Prunes the tree starting from the given node. Args: node (Tag): The node from which the pruning... (truncated) |
| METHOD | PruningContentFilter._compute_composite_score | `def _compute_composite_score(self, metrics, text_len, tag_len, link_text_len):` | Computes the composite score |
| METHOD | PruningContentFilter._compute_class_id_weight | `def _compute_class_id_weight(self, node):` | Computes the class ID weight |
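`_compute_composite_score` takes `text_len`, `tag_len`, and `link_text_len`, which suggests a score built from text density and link density. The formula below is a hypothetical stand-in in that spirit (crawl4ai's real scoring also consumes a `metrics` dict and class/id weights, omitted here):

```python
def composite_score(text_len: int, tag_len: int, link_text_len: int) -> float:
    """Hypothetical composite score: favour text-dense nodes,
    penalise markup-heavy and link-heavy ones. Range [0, 1]."""
    if text_len == 0:
        return 0.0
    text_density = text_len / max(tag_len + text_len, 1)
    link_density = link_text_len / text_len   # fraction of text inside links
    return text_density * (1.0 - link_density)
```

Nodes scoring below a threshold (fixed or dynamic, per `threshold_type`) would then be pruned from the tree.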
## `crawl4ai/extraction_strategy.py`

| Type | Name | Signature | Docstring |
| ------ | -------------------------- | ---------------------------------- | --------------------------- |
| MODULE | extraction_strategy.py | `` | |
| CLASS | ExtractionStrategy | `class ExtractionStrategy:` | Abstract base class for all extraction strategies. |
| METHOD | ExtractionStrategy.__init__ | `def __init__(self, input_format='markdown', **kwargs):` | Initialize the extraction strategy. Args: input_format: Content format to use for extraction. ... (truncated) |
| METHOD | ExtractionStrategy.extract | `def extract(self, url, html, *q, **kwargs):` | Extract meaningful blocks or chunks from the given HTML. :param url: The URL of the webpage. :param... (truncated) |
| METHOD | ExtractionStrategy.run | `def run(self, url, sections, *q, **kwargs):` | Process sections of text in parallel by default. :param url: The URL of the webpage. :param section... (truncated) |
| CLASS | NoExtractionStrategy | `class NoExtractionStrategy:` | A strategy that does not extract any meaningful content from the HTML. It simply returns the entire ... (truncated) |
| METHOD | NoExtractionStrategy.extract | `def extract(self, url, html, *q, **kwargs):` | Extract meaningful blocks or chunks from the given HTML. |
| METHOD | NoExtractionStrategy.run | `def run(self, url, sections, *q, **kwargs):` | |
| CLASS | LLMExtractionStrategy | `class LLMExtractionStrategy:` | A strategy that uses an LLM to extract meaningful content from the HTML. Attributes: provider: ... (truncated) |
| METHOD | LLMExtractionStrategy.__init__ | `def __init__(self, provider=DEFAULT_PROVIDER, api_token=None, instruction=None, schema=None, extraction_type='block', **kwargs):` | Initialize the strategy with clustering parameters. Args: provider: The provider to use for ext... (truncated) |
| METHOD | LLMExtractionStrategy.extract | `def extract(self, url, ix, html):` | Extract meaningful blocks or chunks from the given HTML using an LLM. How it works: 1. Construct a ... (truncated) |
| METHOD | LLMExtractionStrategy._merge | `def _merge(self, documents, chunk_token_threshold, overlap):` | Merge documents into sections based on chunk_token_threshold and overlap. |
| METHOD | LLMExtractionStrategy.run | `def run(self, url, sections):` | Process sections sequentially with a delay for rate limiting issues, specifically for LLMExtractionS... (truncated) |
| METHOD | LLMExtractionStrategy.show_usage | `def show_usage(self):` | Print a detailed token usage report showing total and per-request usage. |
| CLASS | CosineStrategy | `class CosineStrategy:` | Extract meaningful blocks or chunks from the given HTML using cosine similarity. How it works: 1. P... (truncated) |
| METHOD | CosineStrategy.__init__ | `def __init__(self, semantic_filter=None, word_count_threshold=10, max_dist=0.2, linkage_method='ward', top_k=3, model_name='sentence-transformers/all-MiniLM-L6-v2', sim_threshold=0.3, **kwargs):` | Initialize the strategy with clustering parameters. Args: semantic_filter (str): A keyword filt... (truncated) |
| METHOD | CosineStrategy.filter_documents_embeddings | `def filter_documents_embeddings(self, documents, semantic_filter, at_least_k=20):` | Filter and sort documents based on the cosine similarity of their embeddings with the semantic_filte... (truncated) |
| METHOD | CosineStrategy.get_embeddings | `def get_embeddings(self, sentences, batch_size=None, bypass_buffer=False):` | Get BERT embeddings for a list of sentences. Args: sentences (List[str]): A list of text chunks... (truncated) |
| METHOD | CosineStrategy.hierarchical_clustering | `def hierarchical_clustering(self, sentences, embeddings=None):` | Perform hierarchical clustering on sentences and return cluster labels. Args: sentences (List[s... (truncated) |
| METHOD | CosineStrategy.filter_clusters_by_word_count | `def filter_clusters_by_word_count(self, clusters):` | Filter clusters to remove those with a word count below the threshold. Args: clusters (Dict[int... (truncated) |
| METHOD | CosineStrategy.extract | `def extract(self, url, html, *q, **kwargs):` | Extract clusters from HTML content using hierarchical clustering. Args: url (str): The URL of t... (truncated) |
| METHOD | CosineStrategy.run | `def run(self, url, sections, *q, **kwargs):` | Process sections using hierarchical clustering. Args: url (str): The URL of the webpage. se... (truncated) |
| CLASS | JsonElementExtractionStrategy | `class JsonElementExtractionStrategy:` | Abstract base class for extracting structured JSON from HTML content. How it works: 1. ... (truncated) |
| METHOD | JsonElementExtractionStrategy.__init__ | `def __init__(self, schema, **kwargs):` | Initialize the JSON element extraction strategy with a schema. Args: schema (Dict[str, Any]): T... (truncated) |
| METHOD | JsonElementExtractionStrategy.extract | `def extract(self, url, html_content, *q, **kwargs):` | Extract structured data from HTML content. How it works: 1. Parses the HTML content using the `_par... (truncated) |
| METHOD | JsonElementExtractionStrategy._parse_html | `def _parse_html(self, html_content):` | Parse HTML content into appropriate format |
| METHOD | JsonElementExtractionStrategy._get_base_elements | `def _get_base_elements(self, parsed_html, selector):` | Get all base elements using the selector |
| METHOD | JsonElementExtractionStrategy._get_elements | `def _get_elements(self, element, selector):` | Get child elements using the selector |
| METHOD | JsonElementExtractionStrategy._extract_field | `def _extract_field(self, element, field):` | |
| METHOD | JsonElementExtractionStrategy._extract_single_field | `def _extract_single_field(self, element, field):` | Extract a single field based on its type. How it works: 1. Selects the target element using the fie... (truncated) |
| METHOD | JsonElementExtractionStrategy._extract_list_item | `def _extract_list_item(self, element, fields):` | |
| METHOD | JsonElementExtractionStrategy._extract_item | `def _extract_item(self, element, fields):` | Extracts fields from a given element. How it works: 1. Iterates through the fields defined in the s... (truncated) |
| METHOD | JsonElementExtractionStrategy._apply_transform | `def _apply_transform(self, value, transform):` | Apply a transformation to a value. How it works: 1. Checks the transformation type (e.g., `lowercas... (truncated) |
|
|
| METHOD | JsonElementExtractionStrategy._compute_field | `def _compute_field(self, item, field):` | |
|
|
| METHOD | JsonElementExtractionStrategy.run | `def run(self, url, sections, *q, **kwargs):` | Run the extraction strategy on a combined HTML content. How it works: 1. Combines multiple HTML sec... (truncated) |
|
|
| METHOD | JsonElementExtractionStrategy._get_element_text | `def _get_element_text(self, element):` | Get text content from element |
|
|
| METHOD | JsonElementExtractionStrategy._get_element_html | `def _get_element_html(self, element):` | Get HTML content from element |
|
|
| METHOD | JsonElementExtractionStrategy._get_element_attribute | `def _get_element_attribute(self, element, attribute):` | Get attribute value from element |
|
|
| CLASS | JsonCssExtractionStrategy | `class JsonCssExtractionStrategy:` | Concrete implementation of `JsonElementExtractionStrategy` using CSS selectors. How it works: 1. Pa... (truncated) |
|
|
| METHOD | JsonCssExtractionStrategy.__init__ | `def __init__(self, schema, **kwargs):` | |
|
|
| METHOD | JsonCssExtractionStrategy._parse_html | `def _parse_html(self, html_content):` | |
|
|
| METHOD | JsonCssExtractionStrategy._get_base_elements | `def _get_base_elements(self, parsed_html, selector):` | |
|
|
| METHOD | JsonCssExtractionStrategy._get_elements | `def _get_elements(self, element, selector):` | |
|
|
| METHOD | JsonCssExtractionStrategy._get_element_text | `def _get_element_text(self, element):` | |
|
|
| METHOD | JsonCssExtractionStrategy._get_element_html | `def _get_element_html(self, element):` | |
|
|
| METHOD | JsonCssExtractionStrategy._get_element_attribute | `def _get_element_attribute(self, element, attribute):` | |
|
|
| CLASS | JsonXPathExtractionStrategy | `class JsonXPathExtractionStrategy:` | Concrete implementation of `JsonElementExtractionStrategy` using XPath selectors. How it works: 1. ... (truncated) |
|
|
| METHOD | JsonXPathExtractionStrategy.__init__ | `def __init__(self, schema, **kwargs):` | |
|
|
| METHOD | JsonXPathExtractionStrategy._parse_html | `def _parse_html(self, html_content):` | |
|
|
| METHOD | JsonXPathExtractionStrategy._get_base_elements | `def _get_base_elements(self, parsed_html, selector):` | |
|
|
| METHOD | JsonXPathExtractionStrategy._css_to_xpath | `def _css_to_xpath(self, css_selector):` | Convert CSS selector to XPath if needed |
|
|
| METHOD | JsonXPathExtractionStrategy._basic_css_to_xpath | `def _basic_css_to_xpath(self, css_selector):` | Basic CSS to XPath conversion for common cases |
|
|
| METHOD | JsonXPathExtractionStrategy._get_elements | `def _get_elements(self, element, selector):` | |
|
|
| METHOD | JsonXPathExtractionStrategy._get_element_text | `def _get_element_text(self, element):` | |
|
|
| METHOD | JsonXPathExtractionStrategy._get_element_html | `def _get_element_html(self, element):` | |
|
|
| METHOD | JsonXPathExtractionStrategy._get_element_attribute | `def _get_element_attribute(self, element, attribute):` | |
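The `_basic_css_to_xpath` method above handles simple CSS-to-XPath conversion. As an illustration of what such a conversion involves, here is a simplified standalone sketch (not the library's actual implementation): it handles only bare tags, `#id`, single `.class` parts, and whitespace-separated descendant/child combinators.

```python
def basic_css_to_xpath(css: str) -> str:
    """Convert a small subset of CSS selectors to an XPath expression.

    Supported: tag names, #id, tag.class / .class, and the descendant
    (space) and child (>) combinators, which must be whitespace-separated.
    Anything beyond that needs a real converter such as `cssselect`.
    """
    def convert_part(part: str) -> str:
        if part.startswith("#"):              # "#main" -> *[@id="main"]
            return f'*[@id="{part[1:]}"]'
        if "." in part:                       # "div.item" -> class predicate
            tag, _, cls = part.partition(".")
            tag = tag or "*"
            return (f'{tag}[contains(concat(" ", normalize-space(@class), " "),'
                    f' " {cls} ")]')
        return part                           # bare tag name

    xpath, sep = "", "//"                     # top-level match on descendant axis
    for token in css.strip().split():
        if token == ">":
            sep = "/"                         # next part is a direct child
            continue
        xpath += sep + convert_part(token)
        sep = "//"                            # reset to descendant for next part
    return xpath
```

For example, `basic_css_to_xpath("ul > li")` yields `//ul/li`, and `basic_css_to_xpath("#main")` yields `//*[@id="main"]`.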
## `crawl4ai/chunking_strategy.py`

| Type | Name | Signature | Docstring |
| ------ | -------------------------- | ---------------------------------- | --------------------------- |
| MODULE | chunking_strategy.py | `` | |
| CLASS | ChunkingStrategy | `class ChunkingStrategy:` | Abstract base class for chunking strategies. |
| METHOD | ChunkingStrategy.chunk | `def chunk(self, text):` | Abstract method to chunk the given text. Args: text (str): The text to chunk. Returns: lis... (truncated) |
| CLASS | IdentityChunking | `class IdentityChunking:` | Chunking strategy that returns the input text as a single chunk. |
| METHOD | IdentityChunking.chunk | `def chunk(self, text):` | |
| CLASS | RegexChunking | `class RegexChunking:` | Chunking strategy that splits text based on regular expression patterns. |
| METHOD | RegexChunking.__init__ | `def __init__(self, patterns=None, **kwargs):` | Initialize the RegexChunking object. Args: patterns (list): A list of regular expression patter... (truncated) |
| METHOD | RegexChunking.chunk | `def chunk(self, text):` | |
| CLASS | NlpSentenceChunking | `class NlpSentenceChunking:` | Chunking strategy that splits text into sentences using NLTK's sentence tokenizer. |
| METHOD | NlpSentenceChunking.__init__ | `def __init__(self, **kwargs):` | Initialize the NlpSentenceChunking object. |
| METHOD | NlpSentenceChunking.chunk | `def chunk(self, text):` | |
| CLASS | TopicSegmentationChunking | `class TopicSegmentationChunking:` | Chunking strategy that segments text into topics using NLTK's TextTilingTokenizer. How it works: 1.... (truncated) |
| METHOD | TopicSegmentationChunking.__init__ | `def __init__(self, num_keywords=3, **kwargs):` | Initialize the TopicSegmentationChunking object. Args: num_keywords (int): The number of keywor... (truncated) |
| METHOD | TopicSegmentationChunking.chunk | `def chunk(self, text):` | |
| METHOD | TopicSegmentationChunking.extract_keywords | `def extract_keywords(self, text):` | |
| METHOD | TopicSegmentationChunking.chunk_with_topics | `def chunk_with_topics(self, text):` | |
| CLASS | FixedLengthWordChunking | `class FixedLengthWordChunking:` | Chunking strategy that splits text into fixed-length word chunks. How it works: 1. Split the text i... (truncated) |
| METHOD | FixedLengthWordChunking.__init__ | `def __init__(self, chunk_size=100, **kwargs):` | Initialize the fixed-length word chunking strategy with the given chunk size. Args: chunk_size ... (truncated) |
| METHOD | FixedLengthWordChunking.chunk | `def chunk(self, text):` | |
| CLASS | SlidingWindowChunking | `class SlidingWindowChunking:` | Chunking strategy that splits text into overlapping word chunks. How it works: 1. Split the text in... (truncated) |
| METHOD | SlidingWindowChunking.__init__ | `def __init__(self, window_size=100, step=50, **kwargs):` | Initialize the sliding window chunking strategy with the given window size and step size. Args: ... (truncated) |
| METHOD | SlidingWindowChunking.chunk | `def chunk(self, text):` | |
| CLASS | OverlappingWindowChunking | `class OverlappingWindowChunking:` | Chunking strategy that splits text into overlapping word chunks. How it works: 1. Split the text in... (truncated) |
| METHOD | OverlappingWindowChunking.__init__ | `def __init__(self, window_size=1000, overlap=100, **kwargs):` | Initialize the overlapping window chunking strategy with the given window size and overlap size. Ar... (truncated) |
| METHOD | OverlappingWindowChunking.chunk | `def chunk(self, text):` | |
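The fixed-length and window-based strategies above can be sketched in a few lines of plain Python. This is an illustrative re-implementation, not the library's code; the parameter names mirror the signatures in the table.

```python
def fixed_length_word_chunks(text: str, chunk_size: int = 100) -> list[str]:
    """Split text into consecutive chunks of at most `chunk_size` words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, len(words), chunk_size)]


def sliding_window_chunks(text: str, window_size: int = 100, step: int = 50) -> list[str]:
    """Overlapping chunks: each window starts `step` words after the previous one."""
    words = text.split()
    if len(words) <= window_size:
        return [" ".join(words)]          # text shorter than one window
    return [" ".join(words[i:i + window_size])
            for i in range(0, len(words) - window_size + step, step)]
```

With `window_size=4, step=2`, the text `"a b c d e f"` yields two overlapping chunks, `"a b c d"` and `"c d e f"`; the overlap preserves context across chunk boundaries, which matters when chunks are embedded or summarized independently.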
## `crawl4ai/user_agent_generator.py`

| Type | Name | Signature | Docstring |
| ------ | -------------------------- | ---------------------------------- | --------------------------- |
| MODULE | user_agent_generator.py | `` | |
| CLASS | UserAgentGenerator | `class UserAgentGenerator:` | Generate random user agents with specified constraints. Attributes: desktop_platforms (dict): A... (truncated) |
| METHOD | UserAgentGenerator.__init__ | `def __init__(self):` | |
| METHOD | UserAgentGenerator.get_browser_stack | `def get_browser_stack(self, num_browsers=1):` | Get a valid combination of browser versions. How it works: 1. Check if the number of browsers is su... (truncated) |
| METHOD | UserAgentGenerator.generate | `def generate(self, device_type=None, os_type=None, device_brand=None, browser_type=None, num_browsers=3):` | Generate a random user agent with specified constraints. Args: device_type: 'desktop' or 'mobil... (truncated) |
| METHOD | UserAgentGenerator.generate_with_client_hints | `def generate_with_client_hints(self, **kwargs):` | Generate both user agent and matching client hints |
| METHOD | UserAgentGenerator.get_random_platform | `def get_random_platform(self, device_type, os_type, device_brand):` | Helper method to get random platform based on constraints |
| METHOD | UserAgentGenerator.parse_user_agent | `def parse_user_agent(self, user_agent):` | Parse a user agent string to extract browser and version information |
| METHOD | UserAgentGenerator.generate_client_hints | `def generate_client_hints(self, user_agent):` | Generate Sec-CH-UA header value based on user agent string |
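`generate_client_hints` derives a `Sec-CH-UA` header value from a user-agent string. A simplified sketch of that idea follows; the brand list, GREASE entry, and function name here are assumptions for illustration, not the library's actual output.

```python
import re


def client_hints_from_user_agent(user_agent: str) -> str:
    """Derive a Sec-CH-UA style header value from a UA string (simplified)."""
    match = re.search(r"Chrome/(\d+)", user_agent)
    if match is None:
        # Non-Chromium UAs are out of scope for this sketch
        return '"Not A Brand";v="99"'
    major = match.group(1)          # client hints carry the major version only
    brands = [
        ("Chromium", major),
        ("Google Chrome", major),
        ("Not A Brand", "99"),      # GREASE-style decoy brand
    ]
    return ", ".join(f'"{name}";v="{ver}"' for name, ver in brands)
```

Keeping the `Sec-CH-UA` value consistent with the generated user agent matters for stealth: a mismatch between the two is an easy fingerprinting signal.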
## `crawl4ai/ssl_certificate.py`

| Type | Name | Signature | Docstring |
| ------ | -------------------------- | ---------------------------------- | --------------------------- |
| MODULE | ssl_certificate.py | `` | SSL Certificate class for handling certificate operations. |
| CLASS | SSLCertificate | `class SSLCertificate:` | A class representing an SSL certificate with methods to export in various formats. Attributes: ... (truncated) |
| METHOD | SSLCertificate.__init__ | `def __init__(self, cert_info):` | |
| METHOD | SSLCertificate.from_url | `def from_url(url, timeout=10):` | Create SSLCertificate instance from a URL. Args: url (str): URL of the website. timeout (in... (truncated) |
| METHOD | SSLCertificate._decode_cert_data | `def _decode_cert_data(data):` | Helper method to decode bytes in certificate data. |
| METHOD | SSLCertificate.to_json | `def to_json(self, filepath=None):` | Export certificate as JSON. Args: filepath (Optional[str]): Path to save the JSON file (default... (truncated) |
| METHOD | SSLCertificate.to_pem | `def to_pem(self, filepath=None):` | Export certificate as PEM. Args: filepath (Optional[str]): Path to save the PEM file (default: ... (truncated) |
| METHOD | SSLCertificate.to_der | `def to_der(self, filepath=None):` | Export certificate as DER. Args: filepath (Optional[str]): Path to save the DER file (default: ... (truncated) |
| METHOD | SSLCertificate.issuer | `def issuer(self):` | Get certificate issuer information. |
| METHOD | SSLCertificate.subject | `def subject(self):` | Get certificate subject information. |
| METHOD | SSLCertificate.valid_from | `def valid_from(self):` | Get certificate validity start date. |
| METHOD | SSLCertificate.valid_until | `def valid_until(self):` | Get certificate validity end date. |
| METHOD | SSLCertificate.fingerprint | `def fingerprint(self):` | Get certificate fingerprint. |
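The export and fingerprint operations above map onto standard-library primitives. A minimal sketch, assuming the certificate is already in hand as raw DER bytes (the function names here are illustrative, not the class's API):

```python
import hashlib
import ssl


def der_fingerprint(der_bytes: bytes) -> str:
    """SHA-256 fingerprint of a DER-encoded certificate, as lowercase hex."""
    return hashlib.sha256(der_bytes).hexdigest()


def der_to_pem(der_bytes: bytes) -> str:
    """Wrap DER bytes in a base64 PEM envelope using the stdlib helper."""
    return ssl.DER_cert_to_PEM_cert(der_bytes)
```

DER is the binary wire format; PEM is the same payload base64-encoded between `-----BEGIN CERTIFICATE-----` / `-----END CERTIFICATE-----` markers, which is why the two exports can share one underlying byte string.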