feat(crawler): add session management and view-source support

Add session_id feature to allow reusing browser pages across multiple crawls.
Add support for view-source: protocol in URL handling.
Fix browser config reference and string formatting issues.
Update examples to demonstrate new session management features.

BREAKING CHANGE: Browser pages now persist across crawls when session_id is set
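
A minimal sketch of the session behaviour described above, assuming a simple
mapping from session_id to page. FakePage, PagePool, crawl and close_session
are hypothetical names used only for illustration; this is not the crawl4ai
implementation. It only shows why a page outlives a single crawl once a
session_id is supplied.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class FakePage:
    """Hypothetical stand-in for a browser page; records visited URLs."""
    page_id: int
    visited: List[str] = field(default_factory=list)

    async def goto(self, url: str) -> None:
        self.visited.append(url)


class PagePool:
    """Hypothetical pool: one persistent page per session_id, a fresh page otherwise."""

    def __init__(self) -> None:
        self._sessions: Dict[str, FakePage] = {}
        self._next_id = 0

    def _new_page(self) -> FakePage:
        self._next_id += 1
        return FakePage(self._next_id)

    def get_page(self, session_id: Optional[str] = None) -> FakePage:
        # No session_id: every crawl gets its own throwaway page.
        if session_id is None:
            return self._new_page()
        # With a session_id: create the page once, then keep returning it.
        if session_id not in self._sessions:
            self._sessions[session_id] = self._new_page()
        return self._sessions[session_id]

    def close_session(self, session_id: str) -> None:
        # Persistent pages must be released explicitly.
        self._sessions.pop(session_id, None)


async def crawl(pool: PagePool, url: str, session_id: Optional[str] = None) -> FakePage:
    """Navigate the page chosen by the pool and return it."""
    page = pool.get_page(session_id)
    await page.goto(url)
    return page
```

In this sketch, two crawls that share a session_id land on the same page
object, so state from the first crawl is still present during the second.
Because the page is never dropped automatically, the caller is responsible
for closing the session when done.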
UncleCode
2025-05-08 17:13:35 +08:00
parent 9b5ccac76e
commit 206a9dfabd
4 changed files with 49 additions and 9 deletions


@@ -8,17 +8,19 @@ from crawl4ai import (
     CrawlResult
 )
-async def main():
-    browser_config = BrowserConfig(headless=True, verbose=True)
+async def main():
+    browser_config = BrowserConfig(
+        headless=False,
+        verbose=True,
+    )
     async with AsyncWebCrawler(config=browser_config) as crawler:
         crawler_config = CrawlerRunConfig(
             markdown_generator=DefaultMarkdownGenerator(
                 content_filter=PruningContentFilter()
             ),
         )
-        result : CrawlResult = await crawler.arun(
+        result: CrawlResult = await crawler.arun(
             url="https://www.helloworld.org", config=crawler_config
         )
         print(result.markdown.raw_markdown[:500])