Implement new async crawler features and stability updates

- Introduced new async crawl strategy with session management.
- Added BrowserManager for improved browser management.
- Enhanced documentation, focusing on storage state and usage examples.
- Improved error handling and logging for sessions.
- Added JavaScript snippets for customizing navigator properties.

docs/examples/storage_state_tutorial.md
@@ -0,0 +1,225 @@

### Using `storage_state` to Pre-Load Cookies and LocalStorage

Crawl4ai’s `AsyncWebCrawler` lets you preserve and reuse session data, including cookies and localStorage, across multiple runs. By providing a `storage_state`, you can start your crawls already “logged in” or with any other necessary session data, so you don't need to repeat the login flow every time.

#### What is `storage_state`?

`storage_state` can be:

- A dictionary containing cookies and localStorage data.
- A path to a JSON file that holds this information.

When you pass `storage_state` to the crawler, it applies these cookies and localStorage entries before loading any pages. This means your crawler effectively starts in a known authenticated or pre-configured state.

#### Example Structure

Here’s an example storage state:

```json
{
  "cookies": [
    {
      "name": "session",
      "value": "abcd1234",
      "domain": "example.com",
      "path": "/",
      "expires": 1675363572.037711,
      "httpOnly": false,
      "secure": false,
      "sameSite": "None"
    }
  ],
  "origins": [
    {
      "origin": "https://example.com",
      "localStorage": [
        { "name": "token", "value": "my_auth_token" },
        { "name": "refreshToken", "value": "my_refresh_token" }
      ]
    }
  ]
}
```

This JSON sets a `session` cookie and two localStorage entries (`token` and `refreshToken`) for `https://example.com`.
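
Before handing a hand-written state to the crawler, it can help to check that it has the shape shown above. The following is a minimal sketch using only the standard library. The helper name `validate_storage_state` and the set of keys it treats as required are assumptions made for illustration; they are not part of Crawl4ai.

```python
# Hypothetical helper (not part of Crawl4ai): checks that a storage state
# dictionary has the same shape as the example above.

# Assumption: these are the cookie keys we insist on for this check.
REQUIRED_COOKIE_KEYS = ("name", "value", "domain", "path")


def validate_storage_state(state):
    """Return a list of problems; an empty list means the shape looks right."""
    if not isinstance(state, dict):
        return ["storage state must be a dictionary"]

    problems = []
    for i, cookie in enumerate(state.get("cookies", [])):
        for key in REQUIRED_COOKIE_KEYS:
            if key not in cookie:
                problems.append(f"cookies[{i}] is missing '{key}'")

    for i, origin in enumerate(state.get("origins", [])):
        if "origin" not in origin:
            problems.append(f"origins[{i}] is missing 'origin'")
        for j, item in enumerate(origin.get("localStorage", [])):
            if "name" not in item or "value" not in item:
                problems.append(
                    f"origins[{i}].localStorage[{j}] needs 'name' and 'value'"
                )
    return problems


example_state = {
    "cookies": [
        {"name": "session", "value": "abcd1234", "domain": "example.com", "path": "/"}
    ],
    "origins": [
        {
            "origin": "https://example.com",
            "localStorage": [{"name": "token", "value": "my_auth_token"}],
        }
    ],
}

print(validate_storage_state(example_state))  # prints []
```

Returning a list of messages rather than raising on the first problem makes it easier to fix several mistakes in one pass.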

---

### Passing `storage_state` as a Dictionary

You can directly provide the data as a dictionary:

```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    storage_dict = {
        "cookies": [
            {
                "name": "session",
                "value": "abcd1234",
                "domain": "example.com",
                "path": "/",
                "expires": 1675363572.037711,
                "httpOnly": False,
                "secure": False,
                "sameSite": "None"
            }
        ],
        "origins": [
            {
                "origin": "https://example.com",
                "localStorage": [
                    {"name": "token", "value": "my_auth_token"},
                    {"name": "refreshToken", "value": "my_refresh_token"}
                ]
            }
        ]
    }

    async with AsyncWebCrawler(
        headless=True,
        storage_state=storage_dict
    ) as crawler:
        result = await crawler.arun(url='https://example.com/protected')
        if result.success:
            print("Crawl succeeded with pre-loaded session data!")
            print("Page HTML length:", len(result.html))

if __name__ == "__main__":
    asyncio.run(main())
```

---

### Passing `storage_state` as a File

If you prefer a file-based approach, save the JSON above to `mystate.json` and reference it:

```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler(
        headless=True,
        storage_state="mystate.json"  # Uses a JSON file instead of a dictionary
    ) as crawler:
        result = await crawler.arun(url='https://example.com/protected')
        if result.success:
            print("Crawl succeeded with pre-loaded session data!")
            print("Page HTML length:", len(result.html))

if __name__ == "__main__":
    asyncio.run(main())
```
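
If the state already exists as a Python dictionary, one way to produce such a file is to serialize it with the standard `json` module. This is a minimal sketch: the helper names `save_storage_state` and `load_storage_state` are hypothetical, and the demo writes into a temporary directory rather than the working directory.

```python
import json
import tempfile
from pathlib import Path


def save_storage_state(state, path):
    """Write a storage state dictionary to a JSON file and return its path."""
    path = Path(path)
    path.write_text(json.dumps(state, indent=2), encoding="utf-8")
    return path


def load_storage_state(path):
    """Read a storage state dictionary back from a JSON file."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


state = {
    "cookies": [
        {"name": "session", "value": "abcd1234", "domain": "example.com", "path": "/"}
    ],
    "origins": [],
}

# Demo: round-trip the dictionary through a file in a temporary directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    saved_path = save_storage_state(state, Path(tmp_dir) / "mystate.json")
    round_tripped = load_storage_state(saved_path)

print(round_tripped == state)  # prints True
```

The resulting file path is what you would pass as `storage_state` in the example above.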

---

### Using `storage_state` to Avoid Repeated Logins (Sign In Once, Use Later)

A common scenario is when you need to log in to a site (entering username/password, etc.) to access protected pages. Doing so on every crawl is cumbersome. Instead, you can:

1. Perform the login once in a hook.
2. After login completes, export the resulting `storage_state` to a file.
3. On subsequent runs, provide that `storage_state` to skip the login step.

**Step-by-Step Example:**

**First Run (Perform Login and Save State):**

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CacheMode
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

async def on_browser_created_hook(browser):
    # Access the default context and create a page
    context = browser.contexts[0]
    page = await context.new_page()

    # Navigate to the login page
    await page.goto("https://example.com/login", wait_until="domcontentloaded")

    # Fill in credentials and submit
    await page.fill("input[name='username']", "myuser")
    await page.fill("input[name='password']", "mypassword")
    await page.click("button[type='submit']")
    await page.wait_for_load_state("networkidle")

    # Now the site sets tokens in localStorage and cookies
    # Export this state to a file so we can reuse it
    await context.storage_state(path="my_storage_state.json")
    await page.close()

async def main():
    # First run: perform login and export the storage_state
    async with AsyncWebCrawler(
        headless=True,
        verbose=True,
        hooks={"on_browser_created": on_browser_created_hook},
        use_persistent_context=True,
        user_data_dir="./my_user_data"
    ) as crawler:

        # After on_browser_created_hook runs, we have storage_state saved to my_storage_state.json
        result = await crawler.arun(
            url='https://example.com/protected-page',
            cache_mode=CacheMode.BYPASS,
            markdown_generator=DefaultMarkdownGenerator(options={"ignore_links": True}),
        )
        print("First run result success:", result.success)
        if result.success:
            print("Protected page HTML length:", len(result.html))

if __name__ == "__main__":
    asyncio.run(main())
```

**Second Run (Reuse Saved State, No Login Needed):**

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CacheMode
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

async def main():
    # Second run: no need to hook on_browser_created this time.
    # Just provide the previously saved storage state.
    async with AsyncWebCrawler(
        headless=True,
        verbose=True,
        use_persistent_context=True,
        user_data_dir="./my_user_data",
        storage_state="my_storage_state.json"  # Reuse previously exported state
    ) as crawler:

        # Now the crawler starts already logged in
        result = await crawler.arun(
            url='https://example.com/protected-page',
            cache_mode=CacheMode.BYPASS,
            markdown_generator=DefaultMarkdownGenerator(options={"ignore_links": True}),
        )
        print("Second run result success:", result.success)
        if result.success:
            print("Protected page HTML length:", len(result.html))

if __name__ == "__main__":
    asyncio.run(main())
```

**What’s Happening Here?**

- During the first run, the `on_browser_created_hook` logs into the site.
- After logging in, the hook exports the current session (cookies, localStorage, etc.) to `my_storage_state.json`.
- On subsequent runs, passing `storage_state="my_storage_state.json"` starts the browser context with these tokens already in place, skipping the login steps.
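
A saved state is only useful while its cookies are still valid. As a rough way to decide between the first-run and second-run scripts, you could inspect the `expires` timestamps in the saved file. The sketch below assumes `expires` is a Unix timestamp in seconds and that `-1` marks a session cookie, which is how Playwright writes exported cookies. The helper names `expired_cookie_names` and `needs_login` are hypothetical, and the check only looks at the file; a site may still invalidate a session earlier on the server side.

```python
import json
import time
from pathlib import Path


def expired_cookie_names(state, now=None):
    """Return the names of cookies whose 'expires' timestamp has passed."""
    now = time.time() if now is None else now
    expired = []
    for cookie in state.get("cookies", []):
        # Assumption: -1 (or a missing value) means a session cookie with no expiry.
        expires = cookie.get("expires", -1)
        if expires != -1 and expires <= now:
            expired.append(cookie.get("name", "<unnamed>"))
    return expired


def needs_login(path, now=None):
    """True if the state file is missing or contains at least one expired cookie."""
    path = Path(path)
    if not path.exists():
        return True
    state = json.loads(path.read_text(encoding="utf-8"))
    return bool(expired_cookie_names(state, now))


sample_state = {
    "cookies": [
        {"name": "session", "value": "abcd1234", "expires": 1675363572.037711},
        {"name": "prefs", "value": "dark", "expires": -1},
    ]
}

# 1700000000 is a fixed reference time (November 2023) used to keep the demo repeatable.
print(expired_cookie_names(sample_state, now=1700000000))  # prints ['session']
```

Note that the `expires` value used in the examples in this tutorial corresponds to early February 2023, so treat it as a placeholder and substitute a future timestamp in real state files.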

**Sign Out Scenario:**

If the website allows you to sign out by clearing tokens or by navigating to a sign-out URL, you can also run a script that uses `on_browser_created_hook` or `arun` to simulate signing out, then export the resulting `storage_state` again. That would give you a baseline “logged out” state to start fresh from next time.

---

### Conclusion

By using `storage_state`, you can skip repetitive actions, like logging in, and jump straight into crawling protected content. Whether you provide a file path or a dictionary, this powerful feature helps maintain state between crawls, simplifying your data extraction pipelines.