Update documents, upload new version of quickstart.

UncleCode
2024-10-30 20:39:35 +08:00
parent 3529c2e732
commit 9307c19f35
10 changed files with 1481 additions and 799 deletions


@@ -150,4 +150,11 @@ strong,
.tab-content pre {
margin: 0;
max-height: 300px; overflow: auto; border: none;
}
+ol li::before {
+    content: counters(item, ".") ". ";
+    counter-increment: item;
+    /* float: left; */
+    /* padding-right: 5px; */
+}


@@ -9,17 +9,19 @@ Here's a condensed outline of the **Installation and Setup** video content:
---
1. **Introduction to Crawl4AI**:
- Briefly explain that Crawl4AI is a powerful tool for web scraping, data extraction, and content processing, with customizable options for various needs.
2. **Installation Overview**:
- **Basic Install**: Run `pip install crawl4ai` and `playwright install` (to set up browser dependencies).
- **Optional Advanced Installs**:
- `pip install crawl4ai[torch]` - Adds PyTorch for clustering.
- `pip install crawl4ai[transformer]` - Adds support for LLM-based extraction.
- `pip install crawl4ai[all]` - Installs all features for complete functionality.
3. **Verifying the Installation**:
- Walk through a simple test script to confirm the setup:
```python
import asyncio
@@ -34,12 +36,13 @@ Here's a condensed outline of the **Installation and Setup** video content:
```
- Explain that this script initializes the crawler and runs it on a test URL, displaying part of the extracted content to verify functionality.
4. **Important Tips**:
- **Run** `playwright install` **after installation** to set up dependencies.
- **For full performance** on text-related tasks, run `crawl4ai-download-models` after installing with `[torch]`, `[transformer]`, or `[all]` options.
- If you encounter issues, refer to the documentation or GitHub issues.
5. **Wrap Up**:
- Introduce the next topic in the series, which will cover Crawl4AI's browser configuration options (like choosing between `chromium`, `firefox`, and `webkit`).
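The verification script in step 3 follows the standard `asyncio` pattern: enter the crawler as an async context manager, `await` the crawl, then inspect the result. A minimal sketch of that flow, using a hypothetical `StubCrawler` in place of the real `AsyncWebCrawler` so it runs with no dependencies or network access:

```python
import asyncio
from types import SimpleNamespace

# Hypothetical stub standing in for crawl4ai's AsyncWebCrawler so the
# calling pattern can run without any dependencies or network access.
class StubCrawler:
    async def __aenter__(self):
        return self

    async def __aexit__(self, exc_type, exc, tb):
        return False

    async def arun(self, url):
        # The real crawler fetches and processes the page here; the stub
        # only returns an object exposing a markdown attribute.
        return SimpleNamespace(markdown=f"# Stub content for {url}")

async def main():
    # Same shape as the verification script: context manager, await, print.
    async with StubCrawler() as crawler:
        result = await crawler.arun(url="https://example.com")
        print(result.markdown[:100])

asyncio.run(main())
```

The real script simply swaps `StubCrawler` for `AsyncWebCrawler` imported from `crawl4ai`.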
---


@@ -11,10 +11,12 @@ Here's a condensed outline for an **Overview of Advanced Features** video coveri
### **Overview of Advanced Features**
1. **Introduction to Advanced Features**:
- Briefly introduce Crawl4AI's advanced tools, which let users go beyond basic crawling to customize and fine-tune their scraping workflows.
2. **Taking Screenshots**:
- Explain the screenshot capability for capturing page state and verifying content.
- **Example**:
```python
@@ -22,7 +24,8 @@ Here's a condensed outline for an **Overview of Advanced Features** video coveri
```
- Mention that screenshots are saved as a base64 string in `result`, allowing easy decoding and saving.
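Since the screenshot arrives as a base64 string, saving it is a short job with the standard library. A sketch, assuming the string has been pulled off the result object (the dummy payload below stands in for real image data):

```python
import base64

# Hypothetical base64 payload standing in for the screenshot string
# carried by the crawl result.
screenshot_b64 = base64.b64encode(b"\x89PNG\r\n\x1a\n<image bytes>").decode()

# Decode back to raw bytes and write them out as a PNG file.
png_bytes = base64.b64decode(screenshot_b64)
with open("screenshot.png", "wb") as f:
    f.write(png_bytes)
print(len(png_bytes), "bytes written")
```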
3. **Media and Link Extraction**:
- Demonstrate how to pull all media (images, videos) and links (internal and external) from a page for deeper analysis or content gathering.
- **Example**:
```python
@@ -31,14 +34,16 @@ Here's a condensed outline for an **Overview of Advanced Features** video coveri
print("Links:", result.links)
```
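Under the hood, this kind of extraction walks the HTML and collects `src`/`href` attributes. A rough stdlib illustration of the idea (not Crawl4AI's actual implementation) using `html.parser`:

```python
from html.parser import HTMLParser

# Collects image sources and link targets, roughly what ends up
# structured in result.media and result.links.
class MediaLinkCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.images, self.links = [], []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img" and "src" in attrs:
            self.images.append(attrs["src"])
        elif tag == "a" and "href" in attrs:
            self.links.append(attrs["href"])

collector = MediaLinkCollector()
collector.feed('<a href="/about">About</a><img src="/logo.png">')
print("Images:", collector.images)
print("Links:", collector.links)
```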
4. **Custom User Agent**:
- Show how to set a custom user agent to disguise the crawler or simulate specific devices/browsers.
- **Example**:
```python
result = await crawler.arun(url="https://www.example.com", user_agent="Mozilla/5.0 (compatible; MyCrawler/1.0)")
```
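The `user_agent` parameter ultimately just controls a request header. A minimal stdlib sketch of the same idea with `urllib.request` (the request object is only constructed; nothing is sent):

```python
import urllib.request

# Setting the same header by hand that the user_agent parameter automates.
ua = "Mozilla/5.0 (compatible; MyCrawler/1.0)"
req = urllib.request.Request(
    "https://www.example.com",
    headers={"User-Agent": ua},
)

# urllib normalizes header names via str.capitalize(), hence "User-agent".
print(req.get_header("User-agent"))
```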
5. **Custom Hooks for Enhanced Control**:
- Briefly cover how to use hooks, which allow custom actions like setting headers or handling login during the crawl.
- **Example**: Setting a custom header with `before_get_url` hook.
```python
@@ -46,7 +51,8 @@ Here's a condensed outline for an **Overview of Advanced Features** video coveri
await page.set_extra_http_headers({"X-Test-Header": "test"})
```
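A hook like this is simply an async function the crawler awaits with the page object. A runnable sketch with a hypothetical stub page (the real hook receives Playwright's `Page`):

```python
import asyncio

# Hypothetical stub page; the real hook is called with Playwright's Page.
class StubPage:
    def __init__(self):
        self.extra_headers = {}

    async def set_extra_http_headers(self, headers):
        # Playwright's method applies extra headers to outgoing requests;
        # the stub just records them.
        self.extra_headers.update(headers)

# Hook shaped like the before_get_url example above.
async def before_get_url(page):
    await page.set_extra_http_headers({"X-Test-Header": "test"})

page = StubPage()
asyncio.run(before_get_url(page))
print(page.extra_headers)
```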
6. **CSS Selectors for Targeted Extraction**:
- Explain the use of CSS selectors to extract specific elements, ideal for structured data like articles or product details.
- **Example**:
```python
@@ -54,14 +60,16 @@ Here's a condensed outline for an **Overview of Advanced Features** video coveri
print("H2 Tags:", result.extracted_content)
```
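Selecting `h2` elements boils down to collecting the text of every matching tag, which Crawl4AI handles internally from the selector you pass. A stdlib sketch of that behavior with `html.parser`:

```python
from html.parser import HTMLParser

# Stdlib sketch of what selecting "h2" yields: the text of each match.
class H2Extractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self._in_h2 = False
        self.headings = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self._in_h2 = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self._in_h2 = False

    def handle_data(self, data):
        if self._in_h2:
            self.headings.append(data)

extractor = H2Extractor()
extractor.feed("<h1>Title</h1><h2>First</h2><p>Body</p><h2>Second</h2>")
print("H2 Tags:", extractor.headings)
```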
7. **Crawling Inside Iframes**:
- Mention how enabling `process_iframes=True` allows extracting content within iframes, useful for sites with embedded content or ads.
- **Example**:
```python
result = await crawler.arun(url="https://www.example.com", process_iframes=True)
```
8. **Wrap-Up**:
- Summarize these advanced features and how they allow users to customize every part of their web scraping experience.
- Tease upcoming videos where each feature will be explored in detail.


@@ -42,7 +42,7 @@ Here's a detailed outline for the **Hooks and Custom Workflow with AsyncWebCra
async def log_browser_creation(browser):
print("Browser instance created:", browser)
-crawler.set_hook('on_browser_created', log_browser_creation)
+crawler.crawler_strategy.set_hook('on_browser_created', log_browser_creation)
```
- **Explanation**: This hook logs the browser creation event, useful for tracking when a new browser instance starts.
@@ -57,7 +57,7 @@ Here's a detailed outline for the **Hooks and Custom Workflow with AsyncWebCra
def update_user_agent(user_agent):
print(f"User Agent Updated: {user_agent}")
-crawler.set_hook('on_user_agent_updated', update_user_agent)
+crawler.crawler_strategy.set_hook('on_user_agent_updated', update_user_agent)
crawler.update_user_agent("Mozilla/5.0 (iPhone; CPU iPhone OS 14_0 like Mac OS X)")
```
- **Explanation**: This hook provides a callback every time the user agent changes, helpful for debugging or dynamically altering user agent settings based on conditions.
@@ -73,7 +73,7 @@ Here's a detailed outline for the **Hooks and Custom Workflow with AsyncWebCra
async def log_execution_start(page):
print("Execution started on page:", page.url)
-crawler.set_hook('on_execution_started', log_execution_start)
+crawler.crawler_strategy.set_hook('on_execution_started', log_execution_start)
```
- **Explanation**: Logs the start of any major interaction on the page, ideal for cases where you want to monitor each interaction.
@@ -90,7 +90,7 @@ Here's a detailed outline for the **Hooks and Custom Workflow with AsyncWebCra
await page.set_extra_http_headers({"X-Custom-Header": "CustomValue"})
print("Custom headers set before navigation")
-crawler.set_hook('before_goto', modify_headers_before_goto)
+crawler.crawler_strategy.set_hook('before_goto', modify_headers_before_goto)
```
- **Explanation**: This hook allows injecting headers or altering settings based on the page's needs, particularly useful for pages with custom requirements.
@@ -106,7 +106,7 @@ Here's a detailed outline for the **Hooks and Custom Workflow with AsyncWebCra
await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
print("Scrolled to the bottom after navigation")
-crawler.set_hook('after_goto', post_navigation_scroll)
+crawler.crawler_strategy.set_hook('after_goto', post_navigation_scroll)
```
- **Explanation**: This hook scrolls to the bottom of the page after loading, which can help load dynamically added content like infinite scroll elements.
@@ -122,7 +122,7 @@ Here's a detailed outline for the **Hooks and Custom Workflow with AsyncWebCra
await page.evaluate("document.querySelectorAll('.ad-banner').forEach(el => el.remove());")
print("Advertisements removed before returning HTML")
-crawler.set_hook('before_return_html', remove_advertisements)
+crawler.crawler_strategy.set_hook('before_return_html', remove_advertisements)
```
- **Explanation**: The hook removes ad banners from the HTML before it's retrieved, ensuring a cleaner data extraction.
@@ -138,7 +138,7 @@ Here's a detailed outline for the **Hooks and Custom Workflow with AsyncWebCra
await page.wait_for_selector('.main-content')
print("Main content loaded, ready to retrieve HTML")
-crawler.set_hook('before_retrieve_html', wait_for_content_before_retrieve)
+crawler.crawler_strategy.set_hook('before_retrieve_html', wait_for_content_before_retrieve)
```
- **Explanation**: This hook waits for the main content to load before retrieving the HTML, ensuring that all essential content is captured.
@@ -148,9 +148,9 @@ Here's a detailed outline for the **Hooks and Custom Workflow with AsyncWebCra
- Each hook function can be asynchronous (useful for actions like waiting or retrieving async data).
- **Example Setup**:
```python
-crawler.set_hook('on_browser_created', log_browser_creation)
-crawler.set_hook('before_goto', modify_headers_before_goto)
-crawler.set_hook('after_goto', post_navigation_scroll)
+crawler.crawler_strategy.set_hook('on_browser_created', log_browser_creation)
+crawler.crawler_strategy.set_hook('before_goto', modify_headers_before_goto)
+crawler.crawler_strategy.set_hook('after_goto', post_navigation_scroll)
```
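Behind `set_hook`, a registry of roughly this shape is enough: store one callable per event name and `await` it when it turns out to be a coroutine. A minimal sketch with illustrative names, not Crawl4AI's actual internals:

```python
import asyncio
import inspect

# Minimal registry sketch: hooks are stored by event name and awaited
# when firing them returns an awaitable.
class HookRegistry:
    def __init__(self):
        self._hooks = {}

    def set_hook(self, name, fn):
        self._hooks[name] = fn

    async def fire(self, name, *args):
        fn = self._hooks.get(name)
        if fn is None:
            return
        result = fn(*args)
        if inspect.isawaitable(result):
            await result

events = []

async def on_browser_created(browser):
    events.append(f"created:{browser}")

registry = HookRegistry()
registry.set_hook("on_browser_created", on_browser_created)      # async hook
registry.set_hook("after_goto", lambda url: events.append(url))  # sync hook

async def main():
    await registry.fire("on_browser_created", "chromium")
    await registry.fire("after_goto", "https://example.com")

asyncio.run(main())
print(events)
```

This is why both plain functions and `async def` functions work as hooks: the dispatcher only awaits when there is something to await.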
#### **5. Complete Example: Using Hooks for a Customized Crawl Workflow**
@@ -160,10 +160,10 @@ Here's a detailed outline for the **Hooks and Custom Workflow with AsyncWebCra
async def custom_crawl():
async with AsyncWebCrawler() as crawler:
# Set hooks for custom workflow
-crawler.set_hook('on_browser_created', log_browser_creation)
-crawler.set_hook('before_goto', modify_headers_before_goto)
-crawler.set_hook('after_goto', post_navigation_scroll)
-crawler.set_hook('before_return_html', remove_advertisements)
+crawler.crawler_strategy.set_hook('on_browser_created', log_browser_creation)
+crawler.crawler_strategy.set_hook('before_goto', modify_headers_before_goto)
+crawler.crawler_strategy.set_hook('after_goto', post_navigation_scroll)
+crawler.crawler_strategy.set_hook('before_return_html', remove_advertisements)
# Perform the crawl
url = "https://example.com"


@@ -771,9 +771,11 @@ Here's a concise outline for the **Custom Headers, Identity Management, and Us
async with AsyncWebCrawler(
headers={"Accept-Language": "en-US", "Cache-Control": "no-cache"},
user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/91.0",
-simulate_user=True
) as crawler:
-result = await crawler.arun(url="https://example.com/secure-page")
+result = await crawler.arun(
+    url="https://example.com/secure-page",
+    simulate_user=True
+)
print(result.markdown[:500]) # Display extracted content
```
- This example enables detailed customization for evading detection and accessing protected pages smoothly.
@@ -1576,7 +1578,7 @@ Here's a detailed outline for the **Hooks and Custom Workflow with AsyncWebCra
async def log_browser_creation(browser):
print("Browser instance created:", browser)
-crawler.set_hook('on_browser_created', log_browser_creation)
+crawler.crawler_strategy.set_hook('on_browser_created', log_browser_creation)
```
- **Explanation**: This hook logs the browser creation event, useful for tracking when a new browser instance starts.
@@ -1591,7 +1593,7 @@ Here's a detailed outline for the **Hooks and Custom Workflow with AsyncWebCra
def update_user_agent(user_agent):
print(f"User Agent Updated: {user_agent}")
-crawler.set_hook('on_user_agent_updated', update_user_agent)
+crawler.crawler_strategy.set_hook('on_user_agent_updated', update_user_agent)
crawler.update_user_agent("Mozilla/5.0 (iPhone; CPU iPhone OS 14_0 like Mac OS X)")
```
- **Explanation**: This hook provides a callback every time the user agent changes, helpful for debugging or dynamically altering user agent settings based on conditions.
@@ -1607,7 +1609,7 @@ Here's a detailed outline for the **Hooks and Custom Workflow with AsyncWebCra
async def log_execution_start(page):
print("Execution started on page:", page.url)
-crawler.set_hook('on_execution_started', log_execution_start)
+crawler.crawler_strategy.set_hook('on_execution_started', log_execution_start)
```
- **Explanation**: Logs the start of any major interaction on the page, ideal for cases where you want to monitor each interaction.
@@ -1624,7 +1626,7 @@ Here's a detailed outline for the **Hooks and Custom Workflow with AsyncWebCra
await page.set_extra_http_headers({"X-Custom-Header": "CustomValue"})
print("Custom headers set before navigation")
-crawler.set_hook('before_goto', modify_headers_before_goto)
+crawler.crawler_strategy.set_hook('before_goto', modify_headers_before_goto)
```
- **Explanation**: This hook allows injecting headers or altering settings based on the page's needs, particularly useful for pages with custom requirements.
@@ -1640,7 +1642,7 @@ Here's a detailed outline for the **Hooks and Custom Workflow with AsyncWebCra
await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
print("Scrolled to the bottom after navigation")
-crawler.set_hook('after_goto', post_navigation_scroll)
+crawler.crawler_strategy.set_hook('after_goto', post_navigation_scroll)
```
- **Explanation**: This hook scrolls to the bottom of the page after loading, which can help load dynamically added content like infinite scroll elements.
@@ -1656,7 +1658,7 @@ Here's a detailed outline for the **Hooks and Custom Workflow with AsyncWebCra
await page.evaluate("document.querySelectorAll('.ad-banner').forEach(el => el.remove());")
print("Advertisements removed before returning HTML")
-crawler.set_hook('before_return_html', remove_advertisements)
+crawler.crawler_strategy.set_hook('before_return_html', remove_advertisements)
```
- **Explanation**: The hook removes ad banners from the HTML before it's retrieved, ensuring a cleaner data extraction.
@@ -1672,7 +1674,7 @@ Here's a detailed outline for the **Hooks and Custom Workflow with AsyncWebCra
await page.wait_for_selector('.main-content')
print("Main content loaded, ready to retrieve HTML")
-crawler.set_hook('before_retrieve_html', wait_for_content_before_retrieve)
+crawler.crawler_strategy.set_hook('before_retrieve_html', wait_for_content_before_retrieve)
```
- **Explanation**: This hook waits for the main content to load before retrieving the HTML, ensuring that all essential content is captured.
@@ -1682,9 +1684,9 @@ Here's a detailed outline for the **Hooks and Custom Workflow with AsyncWebCra
- Each hook function can be asynchronous (useful for actions like waiting or retrieving async data).
- **Example Setup**:
```python
-crawler.set_hook('on_browser_created', log_browser_creation)
-crawler.set_hook('before_goto', modify_headers_before_goto)
-crawler.set_hook('after_goto', post_navigation_scroll)
+crawler.crawler_strategy.set_hook('on_browser_created', log_browser_creation)
+crawler.crawler_strategy.set_hook('before_goto', modify_headers_before_goto)
+crawler.crawler_strategy.set_hook('after_goto', post_navigation_scroll)
```
#### **5. Complete Example: Using Hooks for a Customized Crawl Workflow**
@@ -1694,10 +1696,10 @@ Here's a detailed outline for the **Hooks and Custom Workflow with AsyncWebCra
async def custom_crawl():
async with AsyncWebCrawler() as crawler:
# Set hooks for custom workflow
-crawler.set_hook('on_browser_created', log_browser_creation)
-crawler.set_hook('before_goto', modify_headers_before_goto)
-crawler.set_hook('after_goto', post_navigation_scroll)
-crawler.set_hook('before_return_html', remove_advertisements)
+crawler.crawler_strategy.set_hook('on_browser_created', log_browser_creation)
+crawler.crawler_strategy.set_hook('before_goto', modify_headers_before_goto)
+crawler.crawler_strategy.set_hook('after_goto', post_navigation_scroll)
+crawler.crawler_strategy.set_hook('before_return_html', remove_advertisements)
# Perform the crawl
url = "https://example.com"