Push async version last changes for merge to main branch

This commit is contained in:
unclecode
2024-09-24 20:52:08 +08:00
parent d628bc4034
commit 4d48bd31ca
61 changed files with 6219 additions and 891 deletions

View File

@@ -1,20 +1,22 @@
# Quick Start Guide 🚀
Welcome to the Crawl4AI Quickstart Guide! In this tutorial, we'll walk you through the basic usage of Crawl4AI with a friendly and humorous tone. We'll cover everything from basic usage to advanced features like chunking and extraction strategies. Let's dive in! 🌟
Welcome to the Crawl4AI Quickstart Guide! In this tutorial, we'll walk you through the basic usage of Crawl4AI with a friendly and humorous tone. We'll cover everything from basic usage to advanced features like chunking and extraction strategies, all with the power of asynchronous programming. Let's dive in! 🌟
## Getting Started 🛠️
First, let's create an instance of `WebCrawler` and call the `warmup()` function. This might take a few seconds the first time you run Crawl4AI, as it loads the required model files.
First, let's import the necessary modules and create an instance of `AsyncWebCrawler`. We'll use an async context manager, which handles the setup and teardown of the crawler for us.
```python
from crawl4ai import WebCrawler
import asyncio
from crawl4ai import AsyncWebCrawler
def create_crawler():
crawler = WebCrawler(verbose=True)
crawler.warmup()
return crawler
async def main():
async with AsyncWebCrawler(verbose=True) as crawler:
# We'll add our crawling code here
pass
crawler = create_crawler()
if __name__ == "__main__":
asyncio.run(main())
```
### Basic Usage
@@ -22,8 +24,12 @@ crawler = create_crawler()
Simply provide a URL and let Crawl4AI do the magic!
```python
result = crawler.run(url="https://www.nbcnews.com/business")
print(f"Basic crawl result: {result}")
async def main():
async with AsyncWebCrawler(verbose=True) as crawler:
result = await crawler.arun(url="https://www.nbcnews.com/business")
print(f"Basic crawl result: {result.markdown[:500]}") # Print first 500 characters
asyncio.run(main())
```
### Taking Screenshots 📸
@@ -31,26 +37,34 @@ print(f"Basic crawl result: {result}")
Let's take a screenshot of the page!
```python
result = crawler.run(url="https://www.nbcnews.com/business", screenshot=True)
with open("screenshot.png", "wb") as f:
f.write(base64.b64decode(result.screenshot))
print("Screenshot saved to 'screenshot.png'!")
import base64
async def main():
async with AsyncWebCrawler(verbose=True) as crawler:
result = await crawler.arun(url="https://www.nbcnews.com/business", screenshot=True)
with open("screenshot.png", "wb") as f:
f.write(base64.b64decode(result.screenshot))
print("Screenshot saved to 'screenshot.png'!")
asyncio.run(main())
```
### Understanding Parameters 🧠
By default, Crawl4AI caches the results of your crawls. This means that subsequent crawls of the same URL will be much faster! Let's see this in action.
First crawl (caches the result):
```python
result = crawler.run(url="https://www.nbcnews.com/business")
print(f"First crawl result: {result}")
```
async def main():
async with AsyncWebCrawler(verbose=True) as crawler:
# First crawl (caches the result)
result1 = await crawler.arun(url="https://www.nbcnews.com/business")
print(f"First crawl result: {result1.markdown[:100]}...")
Force to crawl again:
```python
result = crawler.run(url="https://www.nbcnews.com/business", bypass_cache=True)
print(f"Second crawl result: {result}")
# Force to crawl again
result2 = await crawler.arun(url="https://www.nbcnews.com/business", bypass_cache=True)
print(f"Second crawl result: {result2.markdown[:100]}...")
asyncio.run(main())
```
### Adding a Chunking Strategy 🧩
@@ -60,145 +74,212 @@ Let's add a chunking strategy: `RegexChunking`! This strategy splits the text ba
```python
from crawl4ai.chunking_strategy import RegexChunking
result = crawler.run(
url="https://www.nbcnews.com/business",
chunking_strategy=RegexChunking(patterns=["\n\n"])
)
print(f"RegexChunking result: {result}")
```
async def main():
async with AsyncWebCrawler(verbose=True) as crawler:
result = await crawler.arun(
url="https://www.nbcnews.com/business",
chunking_strategy=RegexChunking(patterns=["\n\n"])
)
print(f"RegexChunking result: {result.extracted_content[:200]}...")
You can also use `NlpSentenceChunking` which splits the text into sentences using NLP techniques.
```python
from crawl4ai.chunking_strategy import NlpSentenceChunking
result = crawler.run(
url="https://www.nbcnews.com/business",
chunking_strategy=NlpSentenceChunking()
)
print(f"NlpSentenceChunking result: {result}")
asyncio.run(main())
```
### Adding an Extraction Strategy 🧠
Let's get smarter with an extraction strategy: `CosineStrategy`! This strategy uses cosine similarity to extract semantically similar blocks of text.
Let's get smarter with an extraction strategy: `JsonCssExtractionStrategy`! This strategy extracts structured data from HTML using CSS selectors.
```python
from crawl4ai.extraction_strategy import CosineStrategy
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
import json
result = crawler.run(
url="https://www.nbcnews.com/business",
extraction_strategy=CosineStrategy(
word_count_threshold=10,
max_dist=0.2,
linkage_method="ward",
top_k=3
)
)
print(f"CosineStrategy result: {result}")
```
async def main():
schema = {
"name": "News Articles",
"baseSelector": "article.tease-card",
"fields": [
{
"name": "title",
"selector": "h2",
"type": "text",
},
{
"name": "summary",
"selector": "div.tease-card__info",
"type": "text",
}
],
}
You can also pass other parameters like `semantic_filter` to extract specific content.
async with AsyncWebCrawler(verbose=True) as crawler:
result = await crawler.arun(
url="https://www.nbcnews.com/business",
extraction_strategy=JsonCssExtractionStrategy(schema, verbose=True)
)
extracted_data = json.loads(result.extracted_content)
print(f"Extracted {len(extracted_data)} articles")
print(json.dumps(extracted_data[0], indent=2))
```python
result = crawler.run(
url="https://www.nbcnews.com/business",
extraction_strategy=CosineStrategy(
semantic_filter="inflation rent prices"
)
)
print(f"CosineStrategy result with semantic filter: {result}")
asyncio.run(main())
```
### Using LLMExtractionStrategy 🤖
Time to bring in the big guns: `LLMExtractionStrategy` without instructions! This strategy uses a large language model to extract relevant information from the web page.
Time to bring in the big guns: `LLMExtractionStrategy`! This strategy uses a large language model to extract relevant information from the web page.
```python
from crawl4ai.extraction_strategy import LLMExtractionStrategy
import os
from pydantic import BaseModel, Field
result = crawler.run(
url="https://www.nbcnews.com/business",
extraction_strategy=LLMExtractionStrategy(
provider="openai/gpt-4o",
api_token=os.getenv('OPENAI_API_KEY')
)
)
print(f"LLMExtractionStrategy (no instructions) result: {result}")
```
class OpenAIModelFee(BaseModel):
model_name: str = Field(..., description="Name of the OpenAI model.")
input_fee: str = Field(..., description="Fee for input token for the OpenAI model.")
output_fee: str = Field(..., description="Fee for output token for the OpenAI model.")
You can also provide specific instructions to guide the extraction.
async def main():
if not os.getenv("OPENAI_API_KEY"):
print("OpenAI API key not found. Skipping this example.")
return
```python
result = crawler.run(
url="https://www.nbcnews.com/business",
extraction_strategy=LLMExtractionStrategy(
provider="openai/gpt-4o",
api_token=os.getenv('OPENAI_API_KEY'),
instruction="I am interested in only financial news"
)
)
print(f"LLMExtractionStrategy (with instructions) result: {result}")
```
async with AsyncWebCrawler(verbose=True) as crawler:
result = await crawler.arun(
url="https://openai.com/api/pricing/",
word_count_threshold=1,
extraction_strategy=LLMExtractionStrategy(
provider="openai/gpt-4o",
api_token=os.getenv("OPENAI_API_KEY"),
schema=OpenAIModelFee.schema(),
extraction_type="schema",
instruction="""From the crawled content, extract all mentioned model names along with their fees for input and output tokens.
Do not miss any models in the entire content. One extracted model JSON format should look like this:
{"model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", "output_fee": "US$30.00 / 1M tokens"}.""",
),
bypass_cache=True,
)
print(result.extracted_content)
### Targeted Extraction 🎯
Let's use a CSS selector to extract only H2 tags!
```python
result = crawler.run(
url="https://www.nbcnews.com/business",
css_selector="h2"
)
print(f"CSS Selector (H2 tags) result: {result}")
asyncio.run(main())
```
### Interactive Extraction 🖱️
Passing JavaScript code to click the 'Load More' button!
Let's use JavaScript to interact with the page before extraction!
```python
js_code = """
const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More'));
loadMoreButton && loadMoreButton.click();
"""
async def main():
js_code = """
const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More'));
loadMoreButton && loadMoreButton.click();
"""
result = crawler.run(
url="https://www.nbcnews.com/business",
js=js_code
)
print(f"JavaScript Code (Load More button) result: {result}")
wait_for = """() => {
return Array.from(document.querySelectorAll('article.tease-card')).length > 10;
}"""
async with AsyncWebCrawler(verbose=True) as crawler:
result = await crawler.arun(
url="https://www.nbcnews.com/business",
js_code=js_code,
wait_for=wait_for,
css_selector="article.tease-card",
bypass_cache=True,
)
print(f"JavaScript interaction result: {result.extracted_content[:500]}")
asyncio.run(main())
```
### Using Crawler Hooks 🔗
### Advanced Session-Based Crawling with Dynamic Content 🔄
Let's see how we can customize the crawler using hooks!
In modern web applications, content is often loaded dynamically without changing the URL. This is common in single-page applications (SPAs) or websites using infinite scrolling. Traditional crawling methods that rely on URL changes won't work here. That's where Crawl4AI's advanced session-based crawling comes in handy!
Here's what makes this approach powerful:
1. **Session Preservation**: By using a `session_id`, we can maintain the state of our crawling session across multiple interactions with the page. This is crucial for navigating through dynamically loaded content.
2. **Asynchronous JavaScript Execution**: We can execute custom JavaScript to trigger content loading or navigation. In this example, we'll click a "Load More" button to fetch the next page of commits.
3. **Dynamic Content Waiting**: The `wait_for` parameter allows us to specify a condition that must be met before considering the page load complete. This ensures we don't extract data before the new content is fully loaded.
Let's see how this works with a real-world example: crawling multiple pages of commits on a GitHub repository. The URL doesn't change as we load more commits, so we'll use these advanced techniques to navigate and extract data.
```python
import time
import json
from bs4 import BeautifulSoup
from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
from crawl4ai.web_crawler import WebCrawler
from crawl4ai.crawler_strategy import *
async def main():
async with AsyncWebCrawler(verbose=True) as crawler:
url = "https://github.com/microsoft/TypeScript/commits/main"
session_id = "typescript_commits_session"
all_commits = []
def delay(driver):
print("Delaying for 5 seconds...")
time.sleep(5)
print("Resuming...")
def create_crawler():
crawler_strategy = LocalSeleniumCrawlerStrategy(verbose=True)
crawler_strategy.set_hook('after_get_url', delay)
crawler = WebCrawler(verbose=True, crawler_strategy=crawler_strategy)
crawler.warmup()
return crawler
js_next_page = """
const button = document.querySelector('a[data-testid="pagination-next-button"]');
if (button) button.click();
"""
crawler = create_crawler()
result = crawler.run(url="https://www.nbcnews.com/business", bypass_cache=True)
wait_for = """() => {
const commits = document.querySelectorAll('li.Box-sc-g0xbh4-0 h4');
if (commits.length === 0) return false;
const firstCommit = commits[0].textContent.trim();
return firstCommit !== window.lastCommit;
}"""
schema = {
"name": "Commit Extractor",
"baseSelector": "li.Box-sc-g0xbh4-0",
"fields": [
{
"name": "title",
"selector": "h4.markdown-title",
"type": "text",
"transform": "strip",
},
],
}
extraction_strategy = JsonCssExtractionStrategy(schema, verbose=True)
for page in range(3): # Crawl 3 pages
result = await crawler.arun(
url=url,
session_id=session_id,
css_selector="li.Box-sc-g0xbh4-0",
extraction_strategy=extraction_strategy,
js_code=js_next_page if page > 0 else None,
wait_for=wait_for if page > 0 else None,
js_only=page > 0,
bypass_cache=True,
headless=False,
)
assert result.success, f"Failed to crawl page {page + 1}"
commits = json.loads(result.extracted_content)
all_commits.extend(commits)
print(f"Page {page + 1}: Found {len(commits)} commits")
await crawler.crawler_strategy.kill_session(session_id)
print(f"Successfully crawled {len(all_commits)} commits across 3 pages")
asyncio.run(main())
```
check [Hooks](examples/hooks_auth.md) for more examples.
In this example, we're crawling multiple pages of commits from a GitHub repository. The URL doesn't change as we load more commits, so we use JavaScript to click the "Load More" button and a `wait_for` condition to ensure the new content is loaded before extraction. This powerful combination allows us to navigate and extract data from complex, dynamically-loaded web applications with ease!
## Congratulations! 🎉
You've made it through the Crawl4AI Quickstart Guide! Now go forth and crawl the web like a pro! 🕸️
You've made it through the Crawl4AI Quickstart Guide! Now go forth and crawl the web asynchronously like a pro! 🕸️
Remember, these are just a few examples of what Crawl4AI can do. For more advanced usage, check out our other documentation pages:
- [LLM Extraction](examples/llm_extraction.md)
- [JS Execution & CSS Filtering](examples/js_execution_css_filtering.md)
- [Hooks & Auth](examples/hooks_auth.md)
- [Summarization](examples/summarization.md)
- [Research Assistant](examples/research_assistant.md)
Happy crawling! 🚀