# Quick Start Guide 🚀
Welcome to the Crawl4AI Quickstart Guide! In this tutorial, we'll walk you through the basic usage of Crawl4AI with a friendly and humorous tone. We'll cover everything from basic usage to advanced features like chunking and extraction strategies, all with the power of asynchronous programming. Let's dive in! 🌟
## Getting Started 🛠️
First, let's import the necessary modules and create an instance of `AsyncWebCrawler`. We'll use an async context manager, which handles the setup and teardown of the crawler for us.

```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler(verbose=True) as crawler:
        # We'll add our crawling code here
        pass

if __name__ == "__main__":
    asyncio.run(main())
```

### Basic Usage
Simply provide a URL and let Crawl4AI do the magic!

```python
async def main():
    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(url="https://www.nbcnews.com/business")
        print(f"Basic crawl result: {result.markdown[:500]}")  # Print first 500 characters

asyncio.run(main())
```
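
Every call to `arun` returns a result object. Before using the content, it's a good habit to check its `success` flag (the same flag the session-based example later in this guide asserts on). A minimal sketch, using only fields shown elsewhere in this guide:

```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(url="https://www.nbcnews.com/business")
        # Only read the content if the crawl actually succeeded.
        if result.success:
            print(f"Markdown length: {len(result.markdown)} characters")
        else:
            print("Crawl failed!")

asyncio.run(main())
```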
### Taking Screenshots 📸
Let's take a screenshot of the page!

```python
import base64

async def main():
    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(url="https://www.nbcnews.com/business", screenshot=True)
        with open("screenshot.png", "wb") as f:
            f.write(base64.b64decode(result.screenshot))
        print("Screenshot saved to 'screenshot.png'!")

asyncio.run(main())
```

### Understanding Parameters 🧠
By default, Crawl4AI caches the results of your crawls. This means that subsequent crawls of the same URL will be much faster! Let's see this in action.

```python
async def main():
    async with AsyncWebCrawler(verbose=True) as crawler:
        # First crawl (caches the result)
        result1 = await crawler.arun(url="https://www.nbcnews.com/business")
        print(f"First crawl result: {result1.markdown[:100]}...")

        # Force to crawl again
        result2 = await crawler.arun(url="https://www.nbcnews.com/business", bypass_cache=True)
        print(f"Second crawl result: {result2.markdown[:100]}...")

asyncio.run(main())
```
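
If you want to see the cache at work, time the two calls. A minimal sketch, assuming only the `arun` parameters already shown above plus the standard library's `time.perf_counter`:

```python
import asyncio
import time
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler(verbose=True) as crawler:
        # Cold crawl: fetches the page and populates the cache.
        start = time.perf_counter()
        await crawler.arun(url="https://www.nbcnews.com/business")
        print(f"First crawl took {time.perf_counter() - start:.2f}s")

        # Warm crawl: served from the cache, so it should be much faster.
        start = time.perf_counter()
        await crawler.arun(url="https://www.nbcnews.com/business")
        print(f"Second crawl took {time.perf_counter() - start:.2f}s")

asyncio.run(main())
```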
### Adding a Chunking Strategy 🧩

Let's add a chunking strategy: `RegexChunking`! This strategy splits the text based on a given regex pattern.

```python
from crawl4ai.chunking_strategy import RegexChunking

async def main():
    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(
            url="https://www.nbcnews.com/business",
            chunking_strategy=RegexChunking(patterns=["\n\n"])
        )
        print(f"RegexChunking result: {result.extracted_content[:200]}...")

asyncio.run(main())
```
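
You can also use `NlpSentenceChunking`, which splits the text into sentences using NLP techniques. A minimal sketch, adapted from the earlier synchronous example and assuming it plugs into `arun` the same way:

```python
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.chunking_strategy import NlpSentenceChunking

async def main():
    async with AsyncWebCrawler(verbose=True) as crawler:
        # Same crawl as above, but chunked into sentences instead of regex splits.
        result = await crawler.arun(
            url="https://www.nbcnews.com/business",
            chunking_strategy=NlpSentenceChunking()
        )
        print(f"NlpSentenceChunking result: {result.extracted_content[:200]}...")

asyncio.run(main())
```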
### Adding an Extraction Strategy 🧠
Let's get smarter with an extraction strategy: `JsonCssExtractionStrategy`! This strategy extracts structured data from HTML using CSS selectors.

```python
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
import json

async def main():
    schema = {
        "name": "News Articles",
        "baseSelector": "article.tease-card",
        "fields": [
            {
                "name": "title",
                "selector": "h2",
                "type": "text",
            },
            {
                "name": "summary",
                "selector": "div.tease-card__info",
                "type": "text",
            }
        ],
    }

    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(
            url="https://www.nbcnews.com/business",
            extraction_strategy=JsonCssExtractionStrategy(schema, verbose=True)
        )
        extracted_data = json.loads(result.extracted_content)
        print(f"Extracted {len(extracted_data)} articles")
        print(json.dumps(extracted_data[0], indent=2))

asyncio.run(main())
```
### Using LLMExtractionStrategy 🤖
Time to bring in the big guns: `LLMExtractionStrategy`! This strategy uses a large language model to extract relevant information from the web page.

```python
from crawl4ai.extraction_strategy import LLMExtractionStrategy
import os
from pydantic import BaseModel, Field

class OpenAIModelFee(BaseModel):
    model_name: str = Field(..., description="Name of the OpenAI model.")
    input_fee: str = Field(..., description="Fee for input token for the OpenAI model.")
    output_fee: str = Field(..., description="Fee for output token for the OpenAI model.")

async def main():
    if not os.getenv("OPENAI_API_KEY"):
        print("OpenAI API key not found. Skipping this example.")
        return

    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(
            url="https://openai.com/api/pricing/",
            word_count_threshold=1,
            extraction_strategy=LLMExtractionStrategy(
                provider="openai/gpt-4o",
                api_token=os.getenv("OPENAI_API_KEY"),
                schema=OpenAIModelFee.schema(),
                extraction_type="schema",
                instruction="""From the crawled content, extract all mentioned model names along with their fees for input and output tokens.
                Do not miss any models in the entire content. One extracted model JSON format should look like this:
                {"model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", "output_fee": "US$30.00 / 1M tokens"}.""",
            ),
            bypass_cache=True,
        )
        print(result.extracted_content)

asyncio.run(main())
```
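
If you don't need an LLM or a JSON schema, you can still target specific elements with a plain CSS selector. A minimal sketch, adapted from the earlier synchronous `css_selector` example:

```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler(verbose=True) as crawler:
        # Restrict the crawl output to H2 headings only.
        result = await crawler.arun(
            url="https://www.nbcnews.com/business",
            css_selector="h2"
        )
        print(f"CSS selector (H2 tags) result: {result.markdown[:200]}...")

asyncio.run(main())
```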
### Interactive Extraction 🖱️
Let's use JavaScript to interact with the page before extraction!

```python
async def main():
    js_code = """
    const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More'));
    loadMoreButton && loadMoreButton.click();
    """

    wait_for = """() => {
        return Array.from(document.querySelectorAll('article.tease-card')).length > 10;
    }"""

    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(
            url="https://www.nbcnews.com/business",
            js_code=js_code,
            wait_for=wait_for,
            css_selector="article.tease-card",
            bypass_cache=True,
        )
        print(f"JavaScript interaction result: {result.extracted_content[:500]}")

asyncio.run(main())
```
### Advanced Session-Based Crawling with Dynamic Content 🔄

In modern web applications, content is often loaded dynamically without changing the URL. This is common in single-page applications (SPAs) or websites using infinite scrolling. Traditional crawling methods that rely on URL changes won't work here. That's where Crawl4AI's advanced session-based crawling comes in handy!

Here's what makes this approach powerful:

1. **Session Preservation**: By using a `session_id`, we can maintain the state of our crawling session across multiple interactions with the page. This is crucial for navigating through dynamically loaded content.

2. **Asynchronous JavaScript Execution**: We can execute custom JavaScript to trigger content loading or navigation. In this example, we'll click a "Load More" button to fetch the next page of commits.

3. **Dynamic Content Waiting**: The `wait_for` parameter allows us to specify a condition that must be met before considering the page load complete. This ensures we don't extract data before the new content is fully loaded.

Let's see how this works with a real-world example: crawling multiple pages of commits on a GitHub repository. The URL doesn't change as we load more commits, so we'll use these advanced techniques to navigate and extract data.

```python
import asyncio
import json
from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

async def main():
    async with AsyncWebCrawler(verbose=True) as crawler:
        url = "https://github.com/microsoft/TypeScript/commits/main"
        session_id = "typescript_commits_session"
        all_commits = []

        js_next_page = """
        const button = document.querySelector('a[data-testid="pagination-next-button"]');
        if (button) button.click();
        """

        wait_for = """() => {
            const commits = document.querySelectorAll('li.Box-sc-g0xbh4-0 h4');
            if (commits.length === 0) return false;
            const firstCommit = commits[0].textContent.trim();
            return firstCommit !== window.lastCommit;
        }"""

        schema = {
            "name": "Commit Extractor",
            "baseSelector": "li.Box-sc-g0xbh4-0",
            "fields": [
                {
                    "name": "title",
                    "selector": "h4.markdown-title",
                    "type": "text",
                    "transform": "strip",
                },
            ],
        }
        extraction_strategy = JsonCssExtractionStrategy(schema, verbose=True)

        for page in range(3):  # Crawl 3 pages
            result = await crawler.arun(
                url=url,
                session_id=session_id,
                css_selector="li.Box-sc-g0xbh4-0",
                extraction_strategy=extraction_strategy,
                js_code=js_next_page if page > 0 else None,
                wait_for=wait_for if page > 0 else None,
                js_only=page > 0,
                bypass_cache=True,
                headless=False,
            )

            assert result.success, f"Failed to crawl page {page + 1}"

            commits = json.loads(result.extracted_content)
            all_commits.extend(commits)

            print(f"Page {page + 1}: Found {len(commits)} commits")

        await crawler.crawler_strategy.kill_session(session_id)
        print(f"Successfully crawled {len(all_commits)} commits across 3 pages")

asyncio.run(main())
```
In this example, we're crawling multiple pages of commits from a GitHub repository. The URL doesn't change as we load more commits, so we use JavaScript to click the "Load More" button and a `wait_for` condition to ensure the new content is loaded before extraction. This powerful combination allows us to navigate and extract data from complex, dynamically-loaded web applications with ease!

## Congratulations! 🎉

You've made it through the Crawl4AI Quickstart Guide! Now go forth and crawl the web asynchronously like a pro! 🕸️

Remember, these are just a few examples of what Crawl4AI can do. For more advanced usage, check out our other documentation pages:

- [LLM Extraction](examples/llm_extraction.md)
- [JS Execution & CSS Filtering](examples/js_execution_css_filtering.md)
- [Hooks & Auth](examples/hooks_auth.md)
- [Summarization](examples/summarization.md)
- [Research Assistant](examples/research_assistant.md)

Happy crawling! 🚀