This pull request addresses the issue of handling default context pages when none are open.
- Introduces a conditional check to determine if a page exists in the context.
- If no pages exist, a new page is created via await context.new_page().
Adds a new wait_for_timeout parameter to CrawlerRunConfig that allows specifying
a separate timeout for the wait_for condition, independent of the page_timeout.
This provides more granular control over waiting behaviors in the crawler.
Also removes unused colorama dependency and updates LinkedIn crawler example.
BREAKING CHANGE: LinkedIn crawler example now uses different wait_for_images timing
Resolved a bug where running the crawler on local HTML files with `capture_console_messages=False`
(default) raised `UnboundLocalError` due to `captured_console` being accessed before assignment.
Revised the description for the `f` parameter in the `/mcp/md` tool schema to use lowercase enum values
(`raw`, `fit`, `bm25`, `llm`) for consistency with the actual `enum` definition. This change prevents
LLM-based clients (e.g., Gemini via LibreChat) from generating uppercase values like `"FIT"`, which
caused 422 validation errors due to strict case-sensitive matching.
This patch ensures consistent handling of `response.choices[0].message.content` by avoiding redefinition
of the `response` variable, which caused downstream exceptions during error handling.
Removes automatic page closure in take_screenshot and take_screenshot_naive methods
to prevent premature closure of pages that might still be needed in the calling context.
This allows for more flexible page lifecycle management by the caller.
BREAKING CHANGE: Page objects are no longer automatically closed after taking screenshots.
Callers must explicitly handle page closure when appropriate.
Add session_id feature to allow reusing browser pages across multiple crawls.
Add support for view-source: protocol in URL handling.
Fix browser config reference and string formatting issues.
Update examples to demonstrate new session management features.
BREAKING CHANGE: Browser page handling now persists when using session_id
Add new RegexExtractionStrategy for fast, zero-LLM extraction of common data types:
- Built-in patterns for emails, URLs, phones, dates, and more
- Support for custom regex patterns
- LLM-assisted pattern generation utility
- Optimized HTML preprocessing with fit_html field
- Enhanced network response body capture
Breaking changes: None