diff --git a/CHANGELOG.md b/CHANGELOG.md
index 03a7afb0..58dacf81 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -1,5 +1,91 @@
 # Changelog
 
+## [0.4.1] December 8, 2024
+
+### **File: `crawl4ai/async_crawler_strategy.py`**
+
+#### **New Parameters and Attributes Added**
+- **`text_only` (boolean)**: Enables text-only mode; disables images, JavaScript, and GPU-related features for faster, minimal rendering.
+- **`light_mode` (boolean)**: Optimizes the browser by disabling unnecessary background processes and features for efficiency.
+- **`viewport_width` and `viewport_height`**: Default dynamically based on `text_only` mode (800x600 for `text_only`, 1920x1080 otherwise).
+- **`extra_args`**: Adds browser-specific flags for `text_only` mode.
+- **`adjust_viewport_to_content`**: Dynamically adjusts the viewport to the content size for accurate rendering.
+
+#### **Browser Context Adjustments**
+- Added **`viewport` adjustments**: Dynamically computed based on `text_only` or custom configuration.
+- Enhanced support for `light_mode` and `text_only` by adding specific browser arguments to reduce resource consumption.
+
+#### **Dynamic Content Handling**
+- **Full Page Scan Feature**:
+  - Scrolls through the entire page while dynamically detecting content changes.
+  - Stops scrolling once no new dynamic content is loaded.
+
+#### **Session Management**
+- Added **`create_session`** method:
+  - Creates a new browser session and assigns a unique ID.
+  - Supports persistent and non-persistent contexts with full compatibility for cookies, headers, and proxies.
+
+#### **Improved Content Loading and Adjustment**
+- **`adjust_viewport_to_content`**:
+  - Automatically adjusts the viewport to match content dimensions.
+  - Includes scaling via the Chrome DevTools Protocol (CDP).
+- Enhanced content loading:
+  - Waits for images to load and ensures network activity is idle before proceeding.
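The viewport defaults described above follow a simple rule (800x600 when `text_only` is set, 1920x1080 otherwise, with explicit overrides winning). A minimal standalone sketch of that selection logic — `resolve_viewport` is an illustrative helper name, not a library API:

```python
def resolve_viewport(text_only: bool = False, **kwargs):
    """Mirror the documented defaults: 800x600 in text-only mode,
    1920x1080 otherwise; explicit overrides always take precedence."""
    width = kwargs.get("viewport_width", 800 if text_only else 1920)
    height = kwargs.get("viewport_height", 600 if text_only else 1080)
    return width, height

print(resolve_viewport(text_only=True))       # (800, 600)
print(resolve_viewport())                     # (1920, 1080)
print(resolve_viewport(viewport_width=1280))  # (1280, 1080)
```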
+
+#### **Error Handling and Logging**
+- Improved error handling and detailed logging for:
+  - Viewport adjustment (`adjust_viewport_to_content`).
+  - Full page scanning (`scan_full_page`).
+  - Dynamic content loading.
+
+#### **Refactoring and Cleanup**
+- Removed hardcoded viewport dimensions in multiple places, replacing them with dynamic values (`self.viewport_width`, `self.viewport_height`).
+- Removed commented-out and unused code for better readability.
+- Added a default value for the `delay_before_return_html` parameter.
+
+#### **Optimizations**
+- Reduced resource usage in `light_mode` by disabling unnecessary browser features such as extensions, background timers, and sync.
+- Improved compatibility across browser types (`chrome`, `firefox`, `webkit`).
+
+---
+
+### **File: `docs/examples/quickstart_async.py`**
+
+#### **Schema Adjustment**
+- Changed the schema reference for `LLMExtractionStrategy`:
+  - **Old**: `OpenAIModelFee.schema()`
+  - **New**: `OpenAIModelFee.model_json_schema()`
+  - This aligns with Pydantic v2, where `schema()` is deprecated in favor of `model_json_schema()`.
+
+#### **Documentation Comments Updated**
+- Improved the extraction instruction for schema-based LLM strategies.
+
+---
+
+### **New Features Added**
+1. **Text-Only Mode**:
+   - Minimizes resource usage by disabling non-essential browser features.
+2. **Light Mode**:
+   - Optimizes the browser for performance by disabling background tasks and unnecessary services.
+3. **Full Page Scanning**:
+   - Ensures the entire content of a page is crawled, including dynamic elements loaded during scrolling.
+4. **Dynamic Viewport Adjustment**:
+   - Automatically resizes the viewport to match content dimensions, improving compatibility and rendering accuracy.
+5. **Session Management**:
+   - Simplifies session handling with better support for persistent and non-persistent contexts.
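The schema adjustment above reflects Pydantic's v1 → v2 rename. A small sketch of the difference, assuming Pydantic v2 is installed; the `OpenAIModelFee` field names here are an assumption modeled on the quickstart example, not the library's exact class:

```python
from pydantic import BaseModel, Field

class OpenAIModelFee(BaseModel):
    # Illustrative shape; the real class lives in docs/examples/quickstart_async.py
    model_name: str = Field(..., description="Name of the OpenAI model.")
    input_fee: str = Field(..., description="Fee for input tokens.")
    output_fee: str = Field(..., description="Fee for output tokens.")

# Pydantic v1 style (deprecated in v2): OpenAIModelFee.schema()
# Pydantic v2 style:
schema = OpenAIModelFee.model_json_schema()
print(sorted(schema["properties"]))  # ['input_fee', 'model_name', 'output_fee']
```

Both calls return the same JSON-schema dict; switching to `model_json_schema()` simply avoids the v2 deprecation warning.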
+
+---
+
+### **Bug Fixes**
+- Fixed potential viewport mismatches by ensuring consistent use of `self.viewport_width` and `self.viewport_height` throughout the code.
+- Improved robustness of dynamic content loading to avoid timeouts and failed evaluations.
+
 
 ## [0.3.75] December 1, 2024
 
 ### PruningContentFilter
diff --git a/README.md b/README.md
index cbeb4067..dede4a03 100644
--- a/README.md
+++ b/README.md
@@ -11,10 +11,9 @@
 Crawl4AI is the #1 trending GitHub repository, actively maintained by a vibrant community. It delivers blazing-fast, AI-ready web crawling tailored for LLMs, AI agents, and data pipelines. Open source, flexible, and built for real-time performance, Crawl4AI empowers developers with unmatched speed, precision, and deployment ease.
 
+[✨ Check out latest update v0.4.1](#-recent-updates)
-🎉 **Version 0.4.0 is out!** Introducing our experimental PruningContentFilter - a powerful new algorithm for smarter Markdown generation. Test it out and [share your feedback](https://github.com/unclecode/crawl4ai/issues)! [Read the release notes →](https://github.com/unclecode/crawl4ai/blob/main/docs/md_v2/blog/releases/0.4.0.md)
-
-[✨ Check out latest update v0.4.0](#-recent-updates)
+🎉 **Version 0.4.x is out!** Introducing our experimental PruningContentFilter - a powerful new algorithm for smarter Markdown generation. Test it out and [share your feedback](https://github.com/unclecode/crawl4ai/issues)! [Read the release notes →](https://crawl4ai.com/mkdocs/blog)
 
 ## 🧐 Why Crawl4AI?
 
@@ -80,6 +79,7 @@ if __name__ == "__main__":
 - 🧩 **Proxy Support**: Seamlessly connect to proxies with authentication for secure access.
 - ⚙️ **Full Browser Control**: Modify headers, cookies, user agents, and more for tailored crawling setups.
 - 🌍 **Multi-Browser Support**: Compatible with Chromium, Firefox, and WebKit.
+- 📐 **Dynamic Viewport Adjustment**: Automatically adjusts the browser viewport to match page content, ensuring complete rendering and capturing of all elements.
@@ -95,6 +95,8 @@ if __name__ == "__main__":
 - 💾 **Caching**: Cache data for improved speed and to avoid redundant fetches.
 - 📄 **Metadata Extraction**: Retrieve structured metadata from web pages.
 - 📡 **IFrame Content Extraction**: Seamless extraction from embedded iframe content.
+- 🕵️ **Lazy Load Handling**: Waits for images to fully load, ensuring no content is missed due to lazy loading.
+- 🔄 **Full-Page Scanning**: Simulates scrolling to load and capture all dynamic content, perfect for infinite scroll pages.
@@ -121,8 +123,6 @@ if __name__ == "__main__":
 
-
-
 ## Try it Now! ✨
 
 Play around with this [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1SgRPrByQLzjRfwoRNq1wSGE9nYY_EE8C?usp=sharing)
@@ -626,13 +626,14 @@ async def test_news_crawl():
 
 ## ✨ Recent Updates
 
-- 🔬 **PruningContentFilter**: New unsupervised filtering strategy for intelligent content extraction based on text density and relevance scoring.
-- 🧵 **Enhanced Thread Safety**: Improved multi-threaded environment handling with better locks and parallel processing support.
-- 🤖 **Smart User-Agent Generation**: Advanced user-agent generator with customization options and randomization capabilities.
-- 📝 **New Blog Launch**: Stay updated with our detailed release notes and technical deep dives at [crawl4ai.com/blog](https://crawl4ai.com/blog).
-- 🧪 **Expanded Test Coverage**: Comprehensive test suite for both PruningContentFilter and BM25ContentFilter with edge case handling.
+- 🖼️ **Lazy Load Handling**: Improved support for websites with lazy-loaded images. The crawler now waits for all images to fully load, ensuring no content is missed.
+- ⚡ **Text-Only Mode**: New mode for fast, lightweight crawling. Disables images, JavaScript, and GPU rendering, improving speed by 3-4x for text-focused crawls.
+- 📐 **Dynamic Viewport Adjustment**: Automatically adjusts the browser viewport to fit page content, ensuring accurate rendering and capturing of all elements.
+- 🔄 **Full-Page Scanning**: Added scrolling support for pages with infinite scroll or dynamic content loading. Ensures every part of the page is captured.
+- 🧑‍💻 **Session Reuse**: Introduced `create_session` for efficient crawling by reusing the same browser session across multiple requests.
+- 🌟 **Light Mode**: Optimized browser performance by disabling unnecessary features like extensions, background timers, and sync processes.
 
-Read the full details of this release in our [0.4.0 Release Notes](https://github.com/unclecode/crawl4ai/blob/main/docs/md_v2/blog/releases/0.4.0.md).
+Read the full details of this release in our [0.4.1 Release Notes](https://github.com/unclecode/crawl4ai/blob/main/docs/md_v2/blog/releases/0.4.1.md).
 
 ## 📖 Documentation & Roadmap
diff --git a/crawl4ai/__version__.py b/crawl4ai/__version__.py
index 6f8b06f4..80861132 100644
--- a/crawl4ai/__version__.py
+++ b/crawl4ai/__version__.py
@@ -1,2 +1,2 @@
 # crawl4ai/_version.py
-__version__ = "0.4.0"
+__version__ = "0.4.1"
diff --git a/crawl4ai/async_crawler_strategy.py b/crawl4ai/async_crawler_strategy.py
index 493597ea..5c706239 100644
--- a/crawl4ai/async_crawler_strategy.py
+++ b/crawl4ai/async_crawler_strategy.py
@@ -220,8 +220,22 @@ class AsyncCrawlerStrategy(ABC):
 
 class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
     def __init__(self, use_cached_html=False, js_code=None, logger = None, **kwargs):
+        self.text_only = kwargs.get("text_only", False)
+        self.light_mode = kwargs.get("light_mode", False)
         self.logger = logger
         self.use_cached_html = use_cached_html
+        self.viewport_width = kwargs.get("viewport_width", 800 if self.text_only else 1920)
+        self.viewport_height = kwargs.get("viewport_height", 600 if self.text_only else 1080)
+
+        if self.text_only:
+            self.extra_args = kwargs.get("extra_args", []) + [
+                '--disable-images',
+                '--disable-javascript',
+                '--disable-gpu',
+                '--disable-software-rasterizer',
+                '--disable-dev-shm-usage'
+            ]
+
         self.user_agent = kwargs.get(
             "user_agent",
             # "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:109.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.5845.187 Safari/604.1 Edg/117.0.2045.47"
@@ -300,7 +314,8 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
             else:
                 # If no default context exists, create one
                 self.default_context = await self.browser.new_context(
-                    viewport={"width": 1920, "height": 1080}
+                    # viewport={"width": 1920, "height": 1080}
+                    viewport={"width": self.viewport_width, "height": self.viewport_height}
                 )
 
                 # Set up the default context
@@ -334,10 +349,40 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
                 "--ignore-certificate-errors",
                 "--ignore-certificate-errors-spki-list",
                 "--disable-blink-features=AutomationControlled",
-
+                "--window-position=400,0",
+                f"--window-size={self.viewport_width},{self.viewport_height}",
             ]
         }
 
+        if self.light_mode:
+            browser_args["args"].extend([
+                # "--disable-background-networking",
+                "--disable-background-timer-throttling",
+                "--disable-backgrounding-occluded-windows",
+                "--disable-breakpad",
+                "--disable-client-side-phishing-detection",
+                "--disable-component-extensions-with-background-pages",
+                "--disable-default-apps",
+                "--disable-extensions",
+                "--disable-features=TranslateUI",
+                "--disable-hang-monitor",
+                "--disable-ipc-flooding-protection",
+                "--disable-popup-blocking",
+                "--disable-prompt-on-repost",
+                "--disable-sync",
+                "--force-color-profile=srgb",
+                "--metrics-recording-only",
+                "--no-first-run",
+                "--password-store=basic",
+                "--use-mock-keychain"
+            ])
+
+        if self.text_only:
+            browser_args["args"].extend([
+                '--blink-settings=imagesEnabled=false',
+                '--disable-remote-fonts'
+            ])
+
         # Add channel if specified (try Chrome first)
         if self.chrome_channel:
             browser_args["channel"] = self.chrome_channel
@@ -367,6 +412,8 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
             if self.browser_type == "firefox":
                 self.browser = await self.playwright.firefox.launch(**browser_args)
             elif self.browser_type == "webkit":
+                if "viewport" not in browser_args:
+                    browser_args["viewport"] = {"width": self.viewport_width, "height": self.viewport_height}
                 self.browser = await self.playwright.webkit.launch(**browser_args)
             else:
                 if self.use_persistent_context and self.user_data_dir:
@@ -576,6 +623,38 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
         # Return the page object
         return page
 
+    async def create_session(self, **kwargs) -> str:
+        """Creates a new browser session and returns its ID."""
+        if not self.browser:
+            await self.start()
+
+        session_id = kwargs.get('session_id') or str(uuid.uuid4())
+
+        if self.use_managed_browser:
+            page = await self.default_context.new_page()
+            self.sessions[session_id] = (self.default_context, page, time.time())
+        else:
+            if self.use_persistent_context and self.browser_type in ["chrome", "chromium"]:
+                context = self.browser
+                page = await context.new_page()
+            else:
+                context = await self.browser.new_context(
+                    user_agent=kwargs.get("user_agent", self.user_agent),
+                    viewport={"width": self.viewport_width, "height": self.viewport_height},
+                    proxy={"server": self.proxy} if self.proxy else None,
+                    accept_downloads=self.accept_downloads,
+                    ignore_https_errors=True
+                )
+
+                if self.cookies:
+                    await context.add_cookies(self.cookies)
+                await context.set_extra_http_headers(self.headers)
+                page = await context.new_page()
+
+            self.sessions[session_id] = (context, page, time.time())
+
+        return session_id
+
     async def crawl(self, url: str, **kwargs) -> AsyncCrawlResponse:
         """
         Crawls a given URL or processes raw HTML/local file content based on the URL prefix.
@@ -684,12 +763,11 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
                 if self.use_persistent_context and self.browser_type in ["chrome", "chromium"]:
                     # In persistent context, browser is the context
                     context = self.browser
-                    page = await context.new_page()
                 else:
                     # Normal context creation for non-persistent or non-Chrome browsers
                     context = await self.browser.new_context(
                         user_agent=user_agent,
-                        viewport={"width": 1200, "height": 800},
+                        viewport={"width": self.viewport_width, "height": self.viewport_height},
                         proxy={"server": self.proxy} if self.proxy else None,
                         java_script_enabled=True,
                         accept_downloads=self.accept_downloads,
@@ -699,7 +777,8 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
                     if self.cookies:
                         await context.add_cookies(self.cookies)
                     await context.set_extra_http_headers(self.headers)
-                    page = await context.new_page()
+
+                page = await context.new_page()
                 self.sessions[session_id] = (context, page, time.time())
             else:
                 if self.use_persistent_context and self.browser_type in ["chrome", "chromium"]:
@@ -709,7 +788,8 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
                     # Normal context creation
                     context = await self.browser.new_context(
                         user_agent=user_agent,
-                        viewport={"width": 1920, "height": 1080},
+                        # viewport={"width": 1920, "height": 1080},
+                        viewport={"width": self.viewport_width, "height": self.viewport_height},
                         proxy={"server": self.proxy} if self.proxy else None,
                         accept_downloads=self.accept_downloads,
                         ignore_https_errors=True  # Add this line
@@ -763,9 +843,6 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
             if self.accept_downloads:
                 page.on("download", lambda download: asyncio.create_task(self._handle_download(download)))
 
-            # if self.verbose:
-            #     print(f"[LOG] 🕸️ Crawling {url} using AsyncPlaywrightCrawlerStrategy...")
-
             if self.use_cached_html:
                 cache_file_path = os.path.join(
                     os.getenv("CRAWL4_AI_BASE_DIRECTORY", Path.home()), ".crawl4ai", "cache", hashlib.md5(url.encode()).hexdigest()
@@ -786,7 +863,6 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
             if not kwargs.get("js_only", False):
                 await self.execute_hook('before_goto', page, context = context)
-
                 try:
                     response = await page.goto(
@@ -798,9 +874,6 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
                 except Error as e:
                     raise RuntimeError(f"Failed on navigating ACS-GOTO :\n{str(e)}")
 
-                # response = await page.goto("about:blank")
-                # await page.evaluate(f"window.location.href = '{url}'")
-
                 await self.execute_hook('after_goto', page, context = context)
 
                 # Get status code and headers
@@ -853,7 +926,83 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
                 else:
                     raise Error(f"Body element is hidden: {visibility_info}")
 
-            await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
+            # CONTENT LOADING ASSURANCE
+            if not self.text_only and (kwargs.get("wait_for_images", True) or kwargs.get("adjust_viewport_to_content", False)):
+                # Wait for network idle after initial load and images to load
+                await page.wait_for_load_state("networkidle")
+                await asyncio.sleep(0.1)
+                await page.wait_for_function("Array.from(document.images).every(img => img.complete)")
+
+            # After initial load, adjust viewport to content size
+            if not self.text_only and kwargs.get("adjust_viewport_to_content", False):
+                try:
+                    # Get actual page dimensions
+                    page_width = await page.evaluate("document.documentElement.scrollWidth")
+                    page_height = await page.evaluate("document.documentElement.scrollHeight")
+
+                    target_width = self.viewport_width
+                    target_height = int(target_width * page_width / page_height * 0.95)
+                    await page.set_viewport_size({"width": target_width, "height": target_height})
+
+                    # Compute scale factor
+                    # We want the entire page visible: the scale should make both width and height fit
+                    scale = min(target_width / page_width, target_height / page_height)
+
+                    # Now we call CDP to set metrics.
+                    # We tell Chrome that the "device" is page_width x page_height in size,
+                    # but we scale it down so everything fits within the real viewport.
+                    cdp = await page.context.new_cdp_session(page)
+                    await cdp.send('Emulation.setDeviceMetricsOverride', {
+                        'width': page_width,          # full page width
+                        'height': page_height,        # full page height
+                        'deviceScaleFactor': 1,       # keep normal DPR
+                        'mobile': False,
+                        'scale': scale                # scale the entire rendered content
+                    })
+
+                except Exception as e:
+                    self.logger.warning(
+                        message="Failed to adjust viewport to content: {error}",
+                        tag="VIEWPORT",
+                        params={"error": str(e)}
+                    )
+
+            # After viewport adjustment, handle page scanning if requested
+            if kwargs.get("scan_full_page", False):
+                try:
+                    viewport_height = page.viewport_size.get("height", self.viewport_height)
+                    current_position = viewport_height  # Start with one viewport height
+                    scroll_delay = kwargs.get("scroll_delay", 0.2)
+
+                    # Initial scroll
+                    await page.evaluate(f"window.scrollTo(0, {current_position})")
+                    await asyncio.sleep(scroll_delay)
+
+                    # Get height after first scroll to account for any dynamic content
+                    total_height = await page.evaluate("document.documentElement.scrollHeight")
+
+                    while current_position < total_height:
+                        current_position = min(current_position + viewport_height, total_height)
+                        await page.evaluate(f"window.scrollTo(0, {current_position})")
+                        await asyncio.sleep(scroll_delay)
+
+                        # Check for dynamic content
+                        new_height = await page.evaluate("document.documentElement.scrollHeight")
+                        if new_height > total_height:
+                            total_height = new_height
+
+                    # Scroll back to top
+                    await page.evaluate("window.scrollTo(0, 0)")
+
+                except Exception as e:
+                    self.logger.warning(
+                        message="Failed to perform full page scan: {error}",
+                        tag="PAGE_SCAN",
+                        params={"error": str(e)}
+                    )
+            else:
+                # Scroll to the bottom of the page
+                await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
 
             js_code = kwargs.get("js_code", kwargs.get("js", self.js_code))
             if js_code:
@@ -887,7 +1036,8 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
             # await page.wait_for_load_state('networkidle', timeout=5000)
 
             # Update image dimensions
-            update_image_dimensions_js = """
+            if not self.text_only:
+                update_image_dimensions_js = """
             () => {
                 return new Promise((resolve) => {
                     const filterImage = (img) => {
@@ -944,26 +1094,26 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
                 }
             """
 
-            try:
                 try:
-                    await page.wait_for_load_state(
-                        # state="load",
-                        state="domcontentloaded",
-                        timeout=5
+                    try:
+                        await page.wait_for_load_state(
+                            # state="load",
+                            state="domcontentloaded",
+                            timeout=5
+                        )
+                    except PlaywrightTimeoutError:
+                        pass
+                    await page.evaluate(update_image_dimensions_js)
+                except Exception as e:
+                    self.logger.error(
+                        message="Error updating image dimensions ACS-UPDATE_IMAGE_DIMENSIONS_JS: {error}",
+                        tag="ERROR",
+                        params={"error": str(e)}
                     )
-                except PlaywrightTimeoutError:
-                    pass
-                await page.evaluate(update_image_dimensions_js)
-            except Exception as e:
-                self.logger.error(
-                    message="Error updating image dimensions ACS-UPDATE_IMAGE_DIMENSIONS_JS: {error}",
-                    tag="ERROR",
-                    params={"error": str(e)}
-                )
-                # raise RuntimeError(f"Error updating image dimensions ACS-UPDATE_IMAGE_DIMENSIONS_JS: {str(e)}")
+                    # raise RuntimeError(f"Error updating image dimensions ACS-UPDATE_IMAGE_DIMENSIONS_JS: {str(e)}")
 
             # Wait a bit for any onload events to complete
-            await page.wait_for_timeout(100)
+            # await page.wait_for_timeout(100)
 
             # Process iframes
             if kwargs.get("process_iframes", False):
@@ -971,7 +1121,7 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
 
             await self.execute_hook('before_retrieve_html', page, context = context)
             # Check if delay_before_return_html is set then wait for that time
-            delay_before_return_html = kwargs.get("delay_before_return_html")
+            delay_before_return_html = kwargs.get("delay_before_return_html", 0.1)
             if delay_before_return_html:
                 await asyncio.sleep(delay_before_return_html)
diff --git a/docs/examples/quickstart_async.py b/docs/examples/quickstart_async.py
index 9d97dabd..ac844ed5 100644
--- a/docs/examples/quickstart_async.py
+++ b/docs/examples/quickstart_async.py
@@ -128,7 +128,7 @@ async def extract_structured_data_using_llm(provider: str, api_token: str = None
         extraction_strategy=LLMExtractionStrategy(
             provider=provider,
             api_token=api_token,
-            schema=OpenAIModelFee.schema(),
+            schema=OpenAIModelFee.model_json_schema(),
             extraction_type="schema",
             instruction="""From the crawled content, extract all mentioned model names along with their fees for input and output tokens.
             Do not miss any models in the entire content. One extracted model JSON format should look like this:
diff --git a/docs/md_v2/blog/index.md b/docs/md_v2/blog/index.md
index 054b12f8..28ccfa6b 100644
--- a/docs/md_v2/blog/index.md
+++ b/docs/md_v2/blog/index.md
@@ -1,19 +1,28 @@
 # Crawl4AI Blog
 
-Welcome to the Crawl4AI blog! Here you'll find detailed release notes, technical deep dives, and news about the project.
+Welcome to the Crawl4AI blog! Here you'll find detailed release notes, technical insights, and updates about the project. Whether you're looking for the latest improvements or want to dive deep into web crawling techniques, this is the place.
 
 ## Latest Release
 
+### [0.4.1 - Smarter Crawling with Lazy-Load Handling, Text-Only Mode, and More](releases/0.4.1.md)
+*December 8, 2024*
+
+This release brings major improvements to handling lazy-loaded images, a blazing-fast Text-Only Mode, full-page scanning for infinite scrolls, dynamic viewport adjustments, and session reuse for efficient crawling. If you're looking to improve speed, reliability, or handle dynamic content with ease, this update has you covered.
+
+[Read full release notes →](releases/0.4.1.md)
+
+---
+
 ### [0.4.0 - Major Content Filtering Update](releases/0.4.0.md)
 *December 1, 2024*
 
-Introducing significant improvements to content filtering, multi-threaded environment handling, and user-agent generation. This release features the new PruningContentFilter, enhanced thread safety, and improved test coverage.
+Introduced significant improvements to content filtering, multi-threaded environment handling, and user-agent generation. This release features the new PruningContentFilter, enhanced thread safety, and improved test coverage.
 
 [Read full release notes →](releases/0.4.0.md)
 
 ## Project History
 
-Want to see how we got here? Check out our [complete changelog](https://github.com/unclecode/crawl4ai/blob/main/CHANGELOG.md) covering all previous versions and the evolution of Crawl4AI.
+Curious about how Crawl4AI has evolved? Check out our [complete changelog](https://github.com/unclecode/crawl4ai/blob/main/CHANGELOG.md) for a detailed history of all versions and updates.
 
 ## Categories
diff --git a/docs/md_v2/blog/releases/0.4.1.md b/docs/md_v2/blog/releases/0.4.1.md
new file mode 100644
index 00000000..b02b758d
--- /dev/null
+++ b/docs/md_v2/blog/releases/0.4.1.md
@@ -0,0 +1,145 @@
+# Release Summary for Version 0.4.1 (December 8, 2024): Major Efficiency Boosts with New Features!
+
+_This post was generated with the help of ChatGPT; take everything with a grain of salt. 🧂_
+
+Hi everyone,
+
+I just finished putting together version 0.4.1 of Crawl4AI, and there are a few changes in here that I think you’ll find really helpful. I’ll explain what’s new, why it matters, and exactly how you can use these features (with the code to back it up). Let’s get into it.
+
+---
+
+### Handling Lazy Loading Better (Images Included)
+
+One thing that always bugged me with crawlers is how often they miss lazy-loaded content, especially images. In this version, I made sure Crawl4AI **waits for all images to load** before moving forward. This is useful because many modern websites only load images when they’re in the viewport or after some JavaScript executes.
+
+Here’s how to enable it:
+
+```python
+await crawler.crawl(
+    url="https://example.com",
+    wait_for_images=True  # Add this argument to ensure images are fully loaded
+)
+```
+
+What this does is:
+1. Waits for the page to reach a "network idle" state.
+2. Ensures all images on the page have been completely loaded.
+
+This single change handles the majority of lazy-loading cases you’re likely to encounter.
+
+---
+
+### Text-Only Mode (Fast, Lightweight Crawling)
+
+Sometimes, you don’t need to download images or process JavaScript at all. For example, if you’re crawling to extract text data, you can enable **text-only mode** to speed things up. By disabling images, JavaScript, and other heavy resources, this mode makes crawling **3-4 times faster** in most cases.
+
+Here’s how to turn it on:
+
+```python
+crawler = AsyncPlaywrightCrawlerStrategy(
+    text_only=True  # Set this to True to enable text-only crawling
+)
+```
+
+When `text_only=True`, the crawler automatically:
+- Disables GPU processing.
+- Blocks image and JavaScript resources.
+- Reduces the viewport size to 800x600 (you can override this with `viewport_width` and `viewport_height`).
+
+If you need to crawl thousands of pages where you only care about text, this mode will save you a ton of time and resources.
+
+---
+
+### Adjusting the Viewport Dynamically
+
+Another useful addition is the ability to **dynamically adjust the viewport size** to match the content on the page. This is particularly helpful when you’re working with responsive layouts or want to ensure all parts of the page load properly.
+
+Here’s how it works:
+1. The crawler calculates the page’s width and height after it loads.
+2. It adjusts the viewport to fit the content dimensions.
+3. (Optional) It uses the Chrome DevTools Protocol (CDP) to simulate zooming out so everything fits in the viewport.
+
+To enable this, use:
+
+```python
+await crawler.crawl(
+    url="https://example.com",
+    adjust_viewport_to_content=True  # Dynamically adjusts the viewport
+)
+```
+
+This approach makes sure the entire page gets loaded into the viewport, especially for layouts that load content based on visibility.
+
+---
+
+### Simulating Full-Page Scrolling
+
+Some websites load data dynamically as you scroll down the page. To handle these cases, I added support for **full-page scanning**. It simulates scrolling to the bottom of the page, checking for new content, and capturing it all.
+
+Here’s an example:
+
+```python
+await crawler.crawl(
+    url="https://example.com",
+    scan_full_page=True,  # Enables scrolling
+    scroll_delay=0.2      # Waits 200ms between scrolls (optional)
+)
+```
+
+What happens here:
+1. The crawler scrolls down in increments, waiting for content to load after each scroll.
+2. It stops when no new content appears (i.e., dynamic elements stop loading).
+3. It scrolls back to the top before finishing (if necessary).
+
+If you’ve ever had to deal with infinite scroll pages, this is going to save you a lot of headaches.
+
+---
+
+### Reusing Browser Sessions (Save Time on Setup)
+
+By default, every time you crawl a page, a new browser context (or tab) is created. That’s fine for small crawls, but if you’re working on a large dataset, it’s more efficient to reuse the same session.
+
+I added a method called `create_session` for this:
+
+```python
+session_id = await crawler.create_session()
+
+# Use the same session for multiple crawls
+await crawler.crawl(
+    url="https://example.com/page1",
+    session_id=session_id  # Reuse the session
+)
+await crawler.crawl(
+    url="https://example.com/page2",
+    session_id=session_id
+)
+```
+
+This avoids creating a new tab for every page, speeding up the crawl and reducing memory usage.
+
+---
+
+### Other Updates
+
+Here are a few smaller updates I’ve made:
+- **Light Mode**: Use `light_mode=True` to disable background processes, extensions, and other unnecessary features, making the browser more efficient.
+- **Logging**: Improved logs to make debugging easier.
+- **Defaults**: Added sensible defaults for things like `delay_before_return_html` (now set to 0.1 seconds).
+
+---
+
+### How to Get the Update
+
+You can install or upgrade to version `0.4.1` like this:
+
+```bash
+pip install crawl4ai --upgrade
+```
+
+As always, I’d love to hear your thoughts. If there’s something you think could be improved or if you have suggestions for future versions, let me know!
+
+Enjoy the new features, and happy crawling! 🕷️
diff --git a/mkdocs.yml b/mkdocs.yml
index 4ba7c2a7..6009dddf 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -12,7 +12,7 @@ nav:
     - 'Quick Start': 'basic/quickstart.md'
   - Changelog & Blog:
     - 'Blog Home': 'blog/index.md'
-    - 'Latest (0.4.0)': 'blog/releases/0.4.0.md'
+    - 'Latest (0.4.1)': 'blog/releases/0.4.1.md'
     - 'Changelog': 'https://github.com/unclecode/crawl4ai/blob/main/CHANGELOG.md'
   - Basic: