refactor(docs): reorganize documentation structure and update styles

Reorganize documentation into core/advanced/extraction sections for better navigation.
Update terminal theme styles and add rich library for better CLI output.
Remove redundant tutorial files and consolidate content into core sections.
Add personal story to index page for project context.

BREAKING CHANGE: Documentation structure has been significantly reorganized
UncleCode
2025-01-07 20:49:50 +08:00
parent ae376f15fb
commit ca3e33122e
87 changed files with 4869 additions and 8951 deletions

@@ -1,4 +1,4 @@
### Session Management
# Session Management
Session management in Crawl4AI is a powerful feature that allows you to maintain state across multiple requests, making it particularly suitable for handling complex multi-step crawling tasks. It enables you to reuse the same browser tab (or page object) across sequential actions and crawls, which is beneficial for:
@@ -20,8 +20,12 @@ async with AsyncWebCrawler() as crawler:
session_id = "my_session"
# Define configurations
config1 = CrawlerRunConfig(url="https://example.com/page1", session_id=session_id)
config2 = CrawlerRunConfig(url="https://example.com/page2", session_id=session_id)
config1 = CrawlerRunConfig(
url="https://example.com/page1", session_id=session_id
)
config2 = CrawlerRunConfig(
url="https://example.com/page2", session_id=session_id
)
# First request
result1 = await crawler.arun(config=config1)
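
For orientation, here is a self-contained sketch of the pattern this hunk touches. It is illustrative only: the URLs are placeholders, and the `success` flag on the result object is assumed to be available as in the rest of these docs.

```python
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import CrawlerRunConfig

async def reuse_session_across_pages():
    async with AsyncWebCrawler() as crawler:
        session_id = "my_session"

        # Both requests share the same session_id, so they reuse one browser tab
        config1 = CrawlerRunConfig(url="https://example.com/page1", session_id=session_id)
        config2 = CrawlerRunConfig(url="https://example.com/page2", session_id=session_id)

        result1 = await crawler.arun(config=config1)
        result2 = await crawler.arun(config=config2)
        print(result1.success, result2.success)

        # Free the tab once the workflow is done
        await crawler.crawler_strategy.kill_session(session_id)

asyncio.run(reuse_session_across_pages())
```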
@@ -54,7 +58,9 @@ async def crawl_dynamic_content():
    schema = {
        "name": "Commit Extractor",
        "baseSelector": "li.Box-sc-g0xbh4-0",
        "fields": [{"name": "title", "selector": "h4.markdown-title", "type": "text"}],
        "fields": [{
            "name": "title", "selector": "h4.markdown-title", "type": "text"
        }],
    }

    extraction_strategy = JsonCssExtractionStrategy(schema)
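
For context, here is a sketch of how such a schema is typically attached to a run and how its JSON output can be read back. The `extraction_strategy` parameter on `CrawlerRunConfig` and the GitHub URL are assumptions based on the surrounding example, not part of this diff.

```python
import asyncio
import json
from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import CrawlerRunConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

async def extract_commit_titles():
    schema = {
        "name": "Commit Extractor",
        "baseSelector": "li.Box-sc-g0xbh4-0",
        "fields": [{"name": "title", "selector": "h4.markdown-title", "type": "text"}],
    }
    extraction_strategy = JsonCssExtractionStrategy(schema)

    async with AsyncWebCrawler() as crawler:
        config = CrawlerRunConfig(
            url="https://github.com/example/repo/commits/main",  # placeholder URL
            extraction_strategy=extraction_strategy,
        )
        result = await crawler.arun(config=config)
        # extracted_content is a JSON string; parse it to work with the items
        commits = json.loads(result.extracted_content)
        print(f"Extracted {len(commits)} commit titles")

asyncio.run(extract_commit_titles())
```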
@@ -87,51 +93,146 @@ async def crawl_dynamic_content():
---
#### Session Best Practices

1. **Descriptive Session IDs**:
   Use meaningful names for session IDs to organize workflows:
   ```python
   session_id = "login_flow_session"
   session_id = "product_catalog_session"
   ```

2. **Resource Management**:
   Always ensure sessions are cleaned up to free resources:
   ```python
   try:
       # Your crawling code here
       pass
   finally:
       await crawler.crawler_strategy.kill_session(session_id)
   ```

3. **State Maintenance**:
   Reuse the session for subsequent actions within the same workflow:
   ```python
   # Step 1: Login
   login_config = CrawlerRunConfig(
       url="https://example.com/login",
       session_id=session_id,
       js_code="document.querySelector('form').submit();"
   )
   await crawler.arun(config=login_config)

   # Step 2: Verify login success
   dashboard_config = CrawlerRunConfig(
       url="https://example.com/dashboard",
       session_id=session_id,
       wait_for="css:.user-profile"  # Wait for authenticated content
   )
   result = await crawler.arun(config=dashboard_config)
   ```

## Example 1: Basic Session-Based Crawling

A simple example using session-based crawling:

```python
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig
from crawl4ai.cache_context import CacheMode

async def basic_session_crawl():
    async with AsyncWebCrawler() as crawler:
        session_id = "dynamic_content_session"
        url = "https://example.com/dynamic-content"

        for page in range(3):
            config = CrawlerRunConfig(
                url=url,
                session_id=session_id,
                js_code="document.querySelector('.load-more-button').click();" if page > 0 else None,
                css_selector=".content-item",
                cache_mode=CacheMode.BYPASS
            )
            result = await crawler.arun(config=config)
            print(f"Page {page + 1}: Found {result.extracted_content.count('.content-item')} items")

        await crawler.crawler_strategy.kill_session(session_id)

asyncio.run(basic_session_crawl())
```

This example shows:

1. Reusing the same `session_id` across multiple requests.
2. Executing JavaScript to load more content dynamically.
3. Properly closing the session to free resources.
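
If an explicit item count is preferable to counting substrings in `result.extracted_content`, one variation is to attach the schema-based extraction shown earlier and parse the JSON result while keeping the same session and pagination logic. This is a sketch only: the schema, the `h3` selector, and the page structure are hypothetical.

```python
import asyncio
import json
from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import CrawlerRunConfig
from crawl4ai.cache_context import CacheMode
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

async def session_crawl_with_item_counts():
    # Placeholder schema for the hypothetical ".content-item" markup
    schema = {
        "name": "Content Items",
        "baseSelector": ".content-item",
        "fields": [{"name": "title", "selector": "h3", "type": "text"}],
    }
    strategy = JsonCssExtractionStrategy(schema)

    async with AsyncWebCrawler() as crawler:
        session_id = "dynamic_content_session"
        url = "https://example.com/dynamic-content"

        for page in range(3):
            config = CrawlerRunConfig(
                url=url,
                session_id=session_id,
                js_code="document.querySelector('.load-more-button').click();" if page > 0 else None,
                extraction_strategy=strategy,
                cache_mode=CacheMode.BYPASS,
            )
            result = await crawler.arun(config=config)
            items = json.loads(result.extracted_content)  # JSON string produced by the strategy
            print(f"Page {page + 1}: Found {len(items)} items")

        await crawler.crawler_strategy.kill_session(session_id)

asyncio.run(session_crawl_with_item_counts())
```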
---
## Advanced Technique 1: Custom Execution Hooks
> Warning: The next few examples can be confusing 😅, so make sure you are comfortable with the order of the parts before you start.
Use custom hooks to handle complex scenarios, such as waiting for content to load dynamically:
```python
async def advanced_session_crawl_with_hooks():
    first_commit = ""

    async def on_execution_started(page):
        nonlocal first_commit
        try:
            while True:
                await page.wait_for_selector("li.commit-item h4")
                commit = await page.query_selector("li.commit-item h4")
                commit = (await commit.evaluate("(element) => element.textContent")).strip()
                if commit and commit != first_commit:
                    first_commit = commit
                    break
                await asyncio.sleep(0.5)
        except Exception as e:
            print(f"Warning: New content didn't appear: {e}")

    async with AsyncWebCrawler() as crawler:
        session_id = "commit_session"
        url = "https://github.com/example/repo/commits/main"
        crawler.crawler_strategy.set_hook("on_execution_started", on_execution_started)

        js_next_page = """document.querySelector('a.pagination-next').click();"""

        for page in range(3):
            config = CrawlerRunConfig(
                url=url,
                session_id=session_id,
                js_code=js_next_page if page > 0 else None,
                css_selector="li.commit-item",
                js_only=page > 0,
                cache_mode=CacheMode.BYPASS
            )
            result = await crawler.arun(config=config)
            print(f"Page {page + 1}: Found {len(result.extracted_content)} commits")

        await crawler.crawler_strategy.kill_session(session_id)

asyncio.run(advanced_session_crawl_with_hooks())
```
This technique ensures new content loads before the next action.
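
A lighter-weight alternative to a custom hook is to push the waiting into the `wait_for` parameter used earlier in this document. The sketch below assumes `wait_for` also accepts a `js:` condition (only the `css:` form appears above), and the selectors are for the same hypothetical commits page.

```python
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import CrawlerRunConfig
from crawl4ai.cache_context import CacheMode

async def paginate_with_wait_for():
    async with AsyncWebCrawler() as crawler:
        session_id = "commit_session"
        url = "https://github.com/example/repo/commits/main"

        # Initial load establishes the session
        await crawler.arun(config=CrawlerRunConfig(url=url, session_id=session_id))

        # Remember the first commit, click "next", then wait until it changes
        js_next_page = """
        window.__initialCommit = document.querySelector('li.commit-item h4').textContent.trim();
        document.querySelector('a.pagination-next').click();
        """
        config = CrawlerRunConfig(
            url=url,
            session_id=session_id,
            js_code=js_next_page,
            # Assumed "js:" condition form; only the "css:" form is shown elsewhere in these docs
            wait_for="js:() => document.querySelector('li.commit-item h4').textContent.trim() !== window.__initialCommit",
            js_only=True,
            cache_mode=CacheMode.BYPASS,
        )
        result = await crawler.arun(config=config)
        print("Next page loaded:", result.success)

        await crawler.crawler_strategy.kill_session(session_id)

asyncio.run(paginate_with_wait_for())
```

The idea is the same as the hook: capture the current first commit, trigger the next page, and only return once that value changes.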
---
## Advanced Technique 2: Integrated JavaScript Execution and Waiting
Combine JavaScript execution and waiting logic for concise handling of dynamic content:
```python
async def integrated_js_and_wait_crawl():
    async with AsyncWebCrawler() as crawler:
        session_id = "integrated_session"
        url = "https://github.com/example/repo/commits/main"

        js_next_page_and_wait = """
        (async () => {
            const getCurrentCommit = () => document.querySelector('li.commit-item h4').textContent.trim();
            const initialCommit = getCurrentCommit();
            document.querySelector('a.pagination-next').click();
            while (getCurrentCommit() === initialCommit) {
                await new Promise(resolve => setTimeout(resolve, 100));
            }
        })();
        """

        for page in range(3):
            config = CrawlerRunConfig(
                url=url,
                session_id=session_id,
                js_code=js_next_page_and_wait if page > 0 else None,
                css_selector="li.commit-item",
                js_only=page > 0,
                cache_mode=CacheMode.BYPASS
            )
            result = await crawler.arun(config=config)
            print(f"Page {page + 1}: Found {len(result.extracted_content)} commits")

        await crawler.crawler_strategy.kill_session(session_id)

asyncio.run(integrated_js_and_wait_crawl())
```
---
#### Common Use Cases for Sessions
1. **Authentication Flows**: Login and interact with secured pages.
2. **Pagination Handling**: Navigate through multiple pages.
3. **Form Submissions**: Fill forms, submit, and process results (see the sketch after this list).
4. **Multi-step Processes**: Complete workflows that span multiple actions.
5. **Dynamic Content Navigation**: Handle JavaScript-rendered or event-triggered content.
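
As a concrete illustration of the form-submission and multi-step cases above, here is a minimal sketch that mirrors the login/dashboard pattern from the best practices. The URLs, selectors, and form fields are hypothetical.

```python
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import CrawlerRunConfig

async def form_submission_workflow():
    async with AsyncWebCrawler() as crawler:
        session_id = "search_form_session"

        # Step 1: Open the form page, fill it in, and submit (hypothetical selectors)
        submit_config = CrawlerRunConfig(
            url="https://example.com/search",
            session_id=session_id,
            js_code="""
            document.querySelector('#query').value = 'crawl4ai';
            document.querySelector('form#search-form').submit();
            """,
        )
        await crawler.arun(config=submit_config)

        # Step 2: Fetch the results page in the same session (cookies/state preserved)
        results_config = CrawlerRunConfig(
            url="https://example.com/search/results",
            session_id=session_id,
            wait_for="css:.search-results",  # hypothetical results container
        )
        result = await crawler.arun(config=results_config)
        print("Results captured:", result.success)

        await crawler.crawler_strategy.kill_session(session_id)

asyncio.run(form_submission_workflow())
```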