Update documents, upload new version of quickstart.
@@ -150,4 +150,11 @@ strong,
.tab-content pre {
    margin: 0;
    max-height: 300px; overflow: auto; border:none;
}

ol li::before {
    content: counters(item, ".") ". ";
    counter-increment: item;
    /* float: left; */
    /* padding-right: 5px; */
}
@@ -9,17 +9,19 @@ Here's a condensed outline of the **Installation and Setup** video content:

---

-1. **Introduction to Crawl4AI**:
-   - Briefly explain that Crawl4AI is a powerful tool for web scraping, data extraction, and content processing, with customizable options for various needs.
+1 **Introduction to Crawl4AI**: Briefly explain that Crawl4AI is a powerful tool for web scraping, data extraction, and content processing, with customizable options for various needs.

-2. **Installation Overview**:
+2 **Installation Overview**:

   - **Basic Install**: Run `pip install crawl4ai` and `playwright install` (to set up browser dependencies).

   - **Optional Advanced Installs**:
     - `pip install crawl4ai[torch]` - Adds PyTorch for clustering.
     - `pip install crawl4ai[transformer]` - Adds support for LLM-based extraction.
     - `pip install crawl4ai[all]` - Installs all features for complete functionality.
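Each optional extra changes which Python packages end up in the environment. As a stdlib-only sketch (not part of Crawl4AI itself), you can probe which of the relevant modules are importable before running anything heavier:

```python
import importlib.util

def check_optional_deps():
    # Report True/False for each package the install options above may pull in.
    mods = ["crawl4ai", "playwright", "torch", "transformers"]
    return {m: importlib.util.find_spec(m) is not None for m in mods}

print(check_optional_deps())
```

The module names for the `[torch]` and `[transformer]` extras are assumptions based on the package names mentioned above.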

-3. **Verifying the Installation**:
+3 **Verifying the Installation**:

   - Walk through a simple test script to confirm the setup:
   ```python
   import asyncio
@@ -34,12 +36,13 @@ Here's a condensed outline of the **Installation and Setup** video content:
   ```
   - Explain that this script initializes the crawler and runs it on a test URL, displaying part of the extracted content to verify functionality.

-4. **Important Tips**:
+4 **Important Tips**:

   - **Run** `playwright install` **after installation** to set up dependencies.
   - **For full performance** on text-related tasks, run `crawl4ai-download-models` after installing with `[torch]`, `[transformer]`, or `[all]` options.
   - If you encounter issues, refer to the documentation or GitHub issues.

-5. **Wrap Up**:
+5 **Wrap Up**:
   - Introduce the next topic in the series, which will cover Crawl4AI's browser configuration options (like choosing between `chromium`, `firefox`, and `webkit`).

---
@@ -11,10 +11,12 @@ Here's a condensed outline for an **Overview of Advanced Features** video coveri

### **Overview of Advanced Features**

-1. **Introduction to Advanced Features**:
+1 **Introduction to Advanced Features**:

   - Briefly introduce Crawl4AI’s advanced tools, which let users go beyond basic crawling to customize and fine-tune their scraping workflows.

-2. **Taking Screenshots**:
+2 **Taking Screenshots**:

   - Explain the screenshot capability for capturing page state and verifying content.
   - **Example**:
   ```python
@@ -22,7 +24,8 @@ Here's a condensed outline for an **Overview of Advanced Features** video coveri
   ```
   - Mention that screenshots are saved as a base64 string in `result`, allowing easy decoding and saving.

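Because the screenshot comes back base64-encoded, decoding and saving it needs only the standard library. A minimal sketch (the `result.screenshot` field name in the usage line is an assumption based on the description above):

```python
import base64

def save_screenshot(b64_data: str, path: str) -> int:
    # Decode the base64 screenshot string and write the raw image bytes to disk.
    raw = base64.b64decode(b64_data)
    with open(path, "wb") as f:
        f.write(raw)
    return len(raw)

# Usage (hypothetical field name): save_screenshot(result.screenshot, "page.png")
```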

-3. **Media and Link Extraction**:
+3 **Media and Link Extraction**:

   - Demonstrate how to pull all media (images, videos) and links (internal and external) from a page for deeper analysis or content gathering.
   - **Example**:
   ```python
@@ -31,14 +34,16 @@ Here's a condensed outline for an **Overview of Advanced Features** video coveri
   print("Links:", result.links)
   ```

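Once links are extracted, separating internal from external ones is a plain `urllib.parse` exercise. A hedged sketch, assuming the links arrive as URL strings:

```python
from urllib.parse import urlparse

def split_links(links, base_url):
    # Partition hyperlinks into internal vs. external by comparing hostnames.
    base_host = urlparse(base_url).netloc
    internal, external = [], []
    for link in links:
        host = urlparse(link).netloc
        # Relative links have no hostname and count as internal.
        (internal if host in ("", base_host) else external).append(link)
    return internal, external
```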
-4. **Custom User Agent**:
+4 **Custom User Agent**:

   - Show how to set a custom user agent to disguise the crawler or simulate specific devices/browsers.
   - **Example**:
   ```python
   result = await crawler.arun(url="https://www.example.com", user_agent="Mozilla/5.0 (compatible; MyCrawler/1.0)")
   ```

-5. **Custom Hooks for Enhanced Control**:
+5 **Custom Hooks for Enhanced Control**:

   - Briefly cover how to use hooks, which allow custom actions like setting headers or handling login during the crawl.
   - **Example**: Setting a custom header with the `before_goto` hook.
   ```python
@@ -46,7 +51,8 @@ Here's a condensed outline for an **Overview of Advanced Features** video coveri
   await page.set_extra_http_headers({"X-Test-Header": "test"})
   ```

-6. **CSS Selectors for Targeted Extraction**:
+6 **CSS Selectors for Targeted Extraction**:

   - Explain the use of CSS selectors to extract specific elements, ideal for structured data like articles or product details.
   - **Example**:
   ```python
@@ -54,14 +60,16 @@ Here's a condensed outline for an **Overview of Advanced Features** video coveri
   print("H2 Tags:", result.extracted_content)
   ```

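The selector matching happens inside the crawler, but to make the idea concrete, here is a stdlib-only sketch of what extracting every `<h2>` from returned HTML amounts to (illustrative only, not Crawl4AI code; `result.html` in the usage line is a hypothetical field name):

```python
from html.parser import HTMLParser

class H2Extractor(HTMLParser):
    # Collect the text content of every <h2> element in a document.
    def __init__(self):
        super().__init__()
        self._in_h2 = False
        self.h2_tags = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self._in_h2 = True
            self.h2_tags.append("")

    def handle_endtag(self, tag):
        if tag == "h2":
            self._in_h2 = False

    def handle_data(self, data):
        if self._in_h2:
            self.h2_tags[-1] += data

# Usage (hypothetical field name):
#   parser = H2Extractor(); parser.feed(result.html); print(parser.h2_tags)
```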

-7. **Crawling Inside Iframes**:
+7 **Crawling Inside Iframes**:

   - Mention how enabling `process_iframes=True` allows extracting content within iframes, useful for sites with embedded content or ads.
   - **Example**:
   ```python
   result = await crawler.arun(url="https://www.example.com", process_iframes=True)
   ```

-8. **Wrap-Up**:
+8 **Wrap-Up**:

   - Summarize these advanced features and how they allow users to customize every part of their web scraping experience.
   - Tease upcoming videos where each feature will be explored in detail.

@@ -42,7 +42,7 @@ Here’s a detailed outline for the **Hooks and Custom Workflow with AsyncWebCra
   async def log_browser_creation(browser):
       print("Browser instance created:", browser)

-  crawler.set_hook('on_browser_created', log_browser_creation)
+  crawler.crawler_strategy.set_hook('on_browser_created', log_browser_creation)
   ```
- **Explanation**: This hook logs the browser creation event, useful for tracking when a new browser instance starts.

@@ -57,7 +57,7 @@ Here’s a detailed outline for the **Hooks and Custom Workflow with AsyncWebCra
   def update_user_agent(user_agent):
       print(f"User Agent Updated: {user_agent}")

-  crawler.set_hook('on_user_agent_updated', update_user_agent)
+  crawler.crawler_strategy.set_hook('on_user_agent_updated', update_user_agent)
   crawler.update_user_agent("Mozilla/5.0 (iPhone; CPU iPhone OS 14_0 like Mac OS X)")
   ```
- **Explanation**: This hook provides a callback every time the user agent changes, helpful for debugging or dynamically altering user agent settings based on conditions.
@@ -73,7 +73,7 @@ Here’s a detailed outline for the **Hooks and Custom Workflow with AsyncWebCra
   async def log_execution_start(page):
       print("Execution started on page:", page.url)

-  crawler.set_hook('on_execution_started', log_execution_start)
+  crawler.crawler_strategy.set_hook('on_execution_started', log_execution_start)
   ```
- **Explanation**: Logs the start of any major interaction on the page, ideal for cases where you want to monitor each interaction.

@@ -90,7 +90,7 @@ Here’s a detailed outline for the **Hooks and Custom Workflow with AsyncWebCra
       await page.set_extra_http_headers({"X-Custom-Header": "CustomValue"})
       print("Custom headers set before navigation")

-  crawler.set_hook('before_goto', modify_headers_before_goto)
+  crawler.crawler_strategy.set_hook('before_goto', modify_headers_before_goto)
   ```
- **Explanation**: This hook allows injecting headers or altering settings based on the page’s needs, particularly useful for pages with custom requirements.

@@ -106,7 +106,7 @@ Here’s a detailed outline for the **Hooks and Custom Workflow with AsyncWebCra
       await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
       print("Scrolled to the bottom after navigation")

-  crawler.set_hook('after_goto', post_navigation_scroll)
+  crawler.crawler_strategy.set_hook('after_goto', post_navigation_scroll)
   ```
- **Explanation**: This hook scrolls to the bottom of the page after loading, which can help load dynamically added content like infinite scroll elements.

@@ -122,7 +122,7 @@ Here’s a detailed outline for the **Hooks and Custom Workflow with AsyncWebCra
       await page.evaluate("document.querySelectorAll('.ad-banner').forEach(el => el.remove());")
       print("Advertisements removed before returning HTML")

-  crawler.set_hook('before_return_html', remove_advertisements)
+  crawler.crawler_strategy.set_hook('before_return_html', remove_advertisements)
   ```
- **Explanation**: The hook removes ad banners from the HTML before it’s retrieved, ensuring a cleaner data extraction.

@@ -138,7 +138,7 @@ Here’s a detailed outline for the **Hooks and Custom Workflow with AsyncWebCra
       await page.wait_for_selector('.main-content')
       print("Main content loaded, ready to retrieve HTML")

-  crawler.set_hook('before_retrieve_html', wait_for_content_before_retrieve)
+  crawler.crawler_strategy.set_hook('before_retrieve_html', wait_for_content_before_retrieve)
   ```
- **Explanation**: This hook waits for the main content to load before retrieving the HTML, ensuring that all essential content is captured.

@@ -148,9 +148,9 @@ Here’s a detailed outline for the **Hooks and Custom Workflow with AsyncWebCra
- Each hook function can be asynchronous (useful for actions like waiting or retrieving async data).
- **Example Setup**:
   ```python
-  crawler.set_hook('on_browser_created', log_browser_creation)
-  crawler.set_hook('before_goto', modify_headers_before_goto)
-  crawler.set_hook('after_goto', post_navigation_scroll)
+  crawler.crawler_strategy.set_hook('on_browser_created', log_browser_creation)
+  crawler.crawler_strategy.set_hook('before_goto', modify_headers_before_goto)
+  crawler.crawler_strategy.set_hook('after_goto', post_navigation_scroll)
   ```
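The `set_hook` pattern above boils down to a name-to-callback registry that the crawler strategy fires at fixed points in the crawl. A stdlib-only sketch of the idea (illustrative, not Crawl4AI's actual implementation):

```python
import asyncio

class HookRegistry:
    """Minimal name -> callback registry, sketching the set_hook pattern."""

    def __init__(self):
        self._hooks = {}

    def set_hook(self, name, fn):
        self._hooks[name] = fn

    async def fire(self, name, *args):
        # Invoke the registered callback for this event, awaiting it if async.
        fn = self._hooks.get(name)
        if fn is None:
            return None
        result = fn(*args)
        if asyncio.iscoroutine(result):
            result = await result
        return result
```

Because `fire` awaits coroutine results, one registry accepts both sync callbacks (like `update_user_agent` above) and async ones (like `post_navigation_scroll`).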

#### **5. Complete Example: Using Hooks for a Customized Crawl Workflow**

@@ -160,10 +160,10 @@ Here’s a detailed outline for the **Hooks and Custom Workflow with AsyncWebCra
   async def custom_crawl():
       async with AsyncWebCrawler() as crawler:
           # Set hooks for custom workflow
-          crawler.set_hook('on_browser_created', log_browser_creation)
-          crawler.set_hook('before_goto', modify_headers_before_goto)
-          crawler.set_hook('after_goto', post_navigation_scroll)
-          crawler.set_hook('before_return_html', remove_advertisements)
+          crawler.crawler_strategy.set_hook('on_browser_created', log_browser_creation)
+          crawler.crawler_strategy.set_hook('before_goto', modify_headers_before_goto)
+          crawler.crawler_strategy.set_hook('after_goto', post_navigation_scroll)
+          crawler.crawler_strategy.set_hook('before_return_html', remove_advertisements)

           # Perform the crawl
           url = "https://example.com"

@@ -771,9 +771,11 @@ Here’s a concise outline for the **Custom Headers, Identity Management, and Us
   async with AsyncWebCrawler(
       headers={"Accept-Language": "en-US", "Cache-Control": "no-cache"},
       user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/91.0",
       simulate_user=True
   ) as crawler:
-      result = await crawler.arun(url="https://example.com/secure-page")
+      result = await crawler.arun(
+          url="https://example.com/secure-page",
+          simulate_user=True
+      )
       print(result.markdown[:500])  # Display extracted content
   ```
- This example enables detailed customization for evading detection and accessing protected pages smoothly.
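The example above sets headers crawler-wide while passing other options per request. Assuming per-request values take precedence over crawler-wide defaults (an assumption, not documented behavior), combining the two layers is a simple dict merge:

```python
def merge_headers(crawler_defaults: dict, per_request: dict) -> dict:
    # Per-request headers override crawler-wide defaults (assumed precedence).
    merged = dict(crawler_defaults)
    merged.update(per_request)
    return merged
```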
@@ -1576,7 +1578,7 @@ Here’s a detailed outline for the **Hooks and Custom Workflow with AsyncWebCra
   async def log_browser_creation(browser):
       print("Browser instance created:", browser)

-  crawler.set_hook('on_browser_created', log_browser_creation)
+  crawler.crawler_strategy.set_hook('on_browser_created', log_browser_creation)
   ```
- **Explanation**: This hook logs the browser creation event, useful for tracking when a new browser instance starts.

@@ -1591,7 +1593,7 @@ Here’s a detailed outline for the **Hooks and Custom Workflow with AsyncWebCra
   def update_user_agent(user_agent):
       print(f"User Agent Updated: {user_agent}")

-  crawler.set_hook('on_user_agent_updated', update_user_agent)
+  crawler.crawler_strategy.set_hook('on_user_agent_updated', update_user_agent)
   crawler.update_user_agent("Mozilla/5.0 (iPhone; CPU iPhone OS 14_0 like Mac OS X)")
   ```
- **Explanation**: This hook provides a callback every time the user agent changes, helpful for debugging or dynamically altering user agent settings based on conditions.
@@ -1607,7 +1609,7 @@ Here’s a detailed outline for the **Hooks and Custom Workflow with AsyncWebCra
   async def log_execution_start(page):
       print("Execution started on page:", page.url)

-  crawler.set_hook('on_execution_started', log_execution_start)
+  crawler.crawler_strategy.set_hook('on_execution_started', log_execution_start)
   ```
- **Explanation**: Logs the start of any major interaction on the page, ideal for cases where you want to monitor each interaction.

@@ -1624,7 +1626,7 @@ Here’s a detailed outline for the **Hooks and Custom Workflow with AsyncWebCra
       await page.set_extra_http_headers({"X-Custom-Header": "CustomValue"})
       print("Custom headers set before navigation")

-  crawler.set_hook('before_goto', modify_headers_before_goto)
+  crawler.crawler_strategy.set_hook('before_goto', modify_headers_before_goto)
   ```
- **Explanation**: This hook allows injecting headers or altering settings based on the page’s needs, particularly useful for pages with custom requirements.

@@ -1640,7 +1642,7 @@ Here’s a detailed outline for the **Hooks and Custom Workflow with AsyncWebCra
       await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
       print("Scrolled to the bottom after navigation")

-  crawler.set_hook('after_goto', post_navigation_scroll)
+  crawler.crawler_strategy.set_hook('after_goto', post_navigation_scroll)
   ```
- **Explanation**: This hook scrolls to the bottom of the page after loading, which can help load dynamically added content like infinite scroll elements.

@@ -1656,7 +1658,7 @@ Here’s a detailed outline for the **Hooks and Custom Workflow with AsyncWebCra
       await page.evaluate("document.querySelectorAll('.ad-banner').forEach(el => el.remove());")
       print("Advertisements removed before returning HTML")

-  crawler.set_hook('before_return_html', remove_advertisements)
+  crawler.crawler_strategy.set_hook('before_return_html', remove_advertisements)
   ```
- **Explanation**: The hook removes ad banners from the HTML before it’s retrieved, ensuring a cleaner data extraction.

@@ -1672,7 +1674,7 @@ Here’s a detailed outline for the **Hooks and Custom Workflow with AsyncWebCra
       await page.wait_for_selector('.main-content')
       print("Main content loaded, ready to retrieve HTML")

-  crawler.set_hook('before_retrieve_html', wait_for_content_before_retrieve)
+  crawler.crawler_strategy.set_hook('before_retrieve_html', wait_for_content_before_retrieve)
   ```
- **Explanation**: This hook waits for the main content to load before retrieving the HTML, ensuring that all essential content is captured.

@@ -1682,9 +1684,9 @@ Here’s a detailed outline for the **Hooks and Custom Workflow with AsyncWebCra
- Each hook function can be asynchronous (useful for actions like waiting or retrieving async data).
- **Example Setup**:
   ```python
-  crawler.set_hook('on_browser_created', log_browser_creation)
-  crawler.set_hook('before_goto', modify_headers_before_goto)
-  crawler.set_hook('after_goto', post_navigation_scroll)
+  crawler.crawler_strategy.set_hook('on_browser_created', log_browser_creation)
+  crawler.crawler_strategy.set_hook('before_goto', modify_headers_before_goto)
+  crawler.crawler_strategy.set_hook('after_goto', post_navigation_scroll)
   ```

#### **5. Complete Example: Using Hooks for a Customized Crawl Workflow**

@@ -1694,10 +1696,10 @@ Here’s a detailed outline for the **Hooks and Custom Workflow with AsyncWebCra
   async def custom_crawl():
       async with AsyncWebCrawler() as crawler:
           # Set hooks for custom workflow
-          crawler.set_hook('on_browser_created', log_browser_creation)
-          crawler.set_hook('before_goto', modify_headers_before_goto)
-          crawler.set_hook('after_goto', post_navigation_scroll)
-          crawler.set_hook('before_return_html', remove_advertisements)
+          crawler.crawler_strategy.set_hook('on_browser_created', log_browser_creation)
+          crawler.crawler_strategy.set_hook('before_goto', modify_headers_before_goto)
+          crawler.crawler_strategy.set_hook('after_goto', post_navigation_scroll)
+          crawler.crawler_strategy.set_hook('before_return_html', remove_advertisements)

           # Perform the crawl
           url = "https://example.com"