refactor(docs): reorganize documentation structure and update styles
Reorganize documentation into core/advanced/extraction sections for better navigation. Update terminal theme styles and add rich library for better CLI output. Remove redundant tutorial files and consolidate content into core sections. Add personal story to index page for project context. BREAKING CHANGE: Documentation structure has been significantly reorganized
248
docs/md_v2/core/browser-crawler-config.md
Normal file
@@ -0,0 +1,248 @@
# Browser & Crawler Configuration (Quick Overview)

Crawl4AI’s flexibility stems from two key classes:

1. **`BrowserConfig`** – Dictates **how** the browser is launched and behaves (e.g., headless or visible, proxy, user agent).
2. **`CrawlerRunConfig`** – Dictates **how** each **crawl** operates (e.g., caching, extraction, timeouts, JavaScript code to run, etc.).

In most examples, you create **one** `BrowserConfig` for the entire crawler session, then pass a **fresh** or re-used `CrawlerRunConfig` whenever you call `arun()`. This tutorial shows the most commonly used parameters. If you need advanced or rarely used fields, see the [Configuration Parameters](../api/parameters.md).

---
## 1. BrowserConfig Essentials

```python
class BrowserConfig:
    def __init__(
        browser_type="chromium",
        headless=True,
        proxy_config=None,
        viewport_width=1080,
        viewport_height=600,
        verbose=True,
        use_persistent_context=False,
        user_data_dir=None,
        cookies=None,
        headers=None,
        user_agent=None,
        text_mode=False,
        light_mode=False,
        extra_args=None,
        # ... other advanced parameters omitted here
    ):
        ...
```
### Key Fields to Note

1. **`browser_type`**
   - Options: `"chromium"`, `"firefox"`, or `"webkit"`.
   - Defaults to `"chromium"`.
   - If you need a different engine, specify it here.

2. **`headless`**
   - `True`: Runs the browser in headless mode (invisible browser).
   - `False`: Runs the browser in visible mode, which helps with debugging.

3. **`proxy_config`**
   - A dictionary with fields like:
     ```json
     {
         "server": "http://proxy.example.com:8080",
         "username": "...",
         "password": "..."
     }
     ```
   - Leave as `None` if a proxy is not required.

4. **`viewport_width` & `viewport_height`**
   - The initial window size.
   - Some sites behave differently with smaller or bigger viewports.

5. **`verbose`**
   - If `True`, prints extra logs.
   - Handy for debugging.

6. **`use_persistent_context`**
   - If `True`, uses a **persistent** browser profile, storing cookies/local storage across runs.
   - Typically also set `user_data_dir` to point to a folder.

7. **`cookies`** & **`headers`**
   - If you want to start with specific cookies or add universal HTTP headers, set them here.
   - E.g. `cookies=[{"name": "session", "value": "abc123", "domain": "example.com"}]`.

8. **`user_agent`**
   - Custom User-Agent string. If `None`, a default is used.
   - You can also set `user_agent_mode="random"` for randomization (if you want to fight bot detection).

9. **`text_mode`** & **`light_mode`**
   - `text_mode=True` disables images, possibly speeding up text-only crawls.
   - `light_mode=True` turns off certain background features for performance.

10. **`extra_args`**
    - Additional flags for the underlying browser.
    - E.g. `["--disable-extensions"]`.
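
The `proxy_config` shape described above can be sanity-checked before launching a crawl. Below is a minimal validator sketch; it is our own illustration, not part of Crawl4AI, and it assumes the three keys shown in the example (`server` required, `username`/`password` as an optional pair):

```python
def validate_proxy_config(proxy_config):
    """Sanity-check the proxy_config dict shape shown above.

    Hypothetical helper for illustration -- not a Crawl4AI API.
    """
    if proxy_config is None:
        return True  # proxies are optional
    if not isinstance(proxy_config, dict) or "server" not in proxy_config:
        raise ValueError("proxy_config needs at least a 'server' key")
    # username/password are optional, but should come as a pair
    if ("username" in proxy_config) != ("password" in proxy_config):
        raise ValueError("provide both 'username' and 'password', or neither")
    return True
```

Run it once on your config dict before passing it into `BrowserConfig` to fail fast on typos.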
**Minimal Example**:

```python
from crawl4ai import AsyncWebCrawler, BrowserConfig

browser_conf = BrowserConfig(
    browser_type="firefox",
    headless=False,
    text_mode=True
)

async with AsyncWebCrawler(config=browser_conf) as crawler:
    result = await crawler.arun("https://example.com")
    print(result.markdown[:300])
```
---

## 2. CrawlerRunConfig Essentials

```python
class CrawlerRunConfig:
    def __init__(
        word_count_threshold=200,
        extraction_strategy=None,
        markdown_generator=None,
        cache_mode=None,
        js_code=None,
        wait_for=None,
        screenshot=False,
        pdf=False,
        verbose=True,
        # ... other advanced parameters omitted
    ):
        ...
```
### Key Fields to Note

1. **`word_count_threshold`**
   - The minimum word count before a block is considered.
   - If your site has lots of short paragraphs or items, you can lower it.

2. **`extraction_strategy`**
   - Where you plug in JSON-based extraction (CSS, LLM, etc.).
   - If `None`, no structured extraction is done (only raw/cleaned HTML + markdown).

3. **`markdown_generator`**
   - E.g., `DefaultMarkdownGenerator(...)`, controlling how HTML→Markdown conversion is done.
   - If `None`, a default approach is used.

4. **`cache_mode`**
   - Controls caching behavior (`ENABLED`, `BYPASS`, `DISABLED`, etc.).
   - If `None`, a default caching level applies; pass `CacheMode.ENABLED` explicitly for normal read/write caching.

5. **`js_code`**
   - A string or list of JS strings to execute.
   - Great for “Load More” buttons or user interactions.

6. **`wait_for`**
   - A CSS or JS expression to wait for before extracting content.
   - Common usage: `wait_for="css:.main-loaded"` or `wait_for="js:() => window.loaded === true"`.

7. **`screenshot`** & **`pdf`**
   - If `True`, captures a screenshot or PDF after the page is fully loaded.
   - The results go to `result.screenshot` (base64) or `result.pdf` (bytes).

8. **`verbose`**
   - Logs additional runtime details.
   - Overlaps with the browser’s verbosity if also set to `True` in `BrowserConfig`.
**Minimal Example**:

```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

crawl_conf = CrawlerRunConfig(
    js_code="document.querySelector('button#loadMore')?.click()",
    wait_for="css:.loaded-content",
    screenshot=True
)

async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(url="https://example.com", config=crawl_conf)
    print(result.screenshot[:100])  # Base64-encoded PNG snippet
```
---

## 3. Putting It All Together

In a typical scenario, you define **one** `BrowserConfig` for your crawler session, then create **one or more** `CrawlerRunConfig` depending on each call’s needs:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

async def main():
    # 1) Browser config: headless, bigger viewport, no proxy
    browser_conf = BrowserConfig(
        headless=True,
        viewport_width=1280,
        viewport_height=720
    )

    # 2) Example extraction strategy
    schema = {
        "name": "Articles",
        "baseSelector": "div.article",
        "fields": [
            {"name": "title", "selector": "h2", "type": "text"},
            {"name": "link", "selector": "a", "type": "attribute", "attribute": "href"}
        ]
    }
    extraction = JsonCssExtractionStrategy(schema)

    # 3) Crawler run config: skip cache, use extraction
    run_conf = CrawlerRunConfig(
        extraction_strategy=extraction,
        cache_mode=CacheMode.BYPASS
    )

    async with AsyncWebCrawler(config=browser_conf) as crawler:
        # 4) Execute the crawl
        result = await crawler.arun(url="https://example.com/news", config=run_conf)

        if result.success:
            print("Extracted content:", result.extracted_content)
        else:
            print("Error:", result.error_message)

if __name__ == "__main__":
    asyncio.run(main())
```
---

## 4. Next Steps

For a **detailed list** of available parameters (including advanced ones), see:

- [BrowserConfig and CrawlerRunConfig Reference](../api/parameters.md)

You can explore topics like:

- **Custom Hooks & Auth** (Inject JavaScript or handle login forms).
- **Session Management** (Re-use pages, preserve state across multiple calls).
- **Magic Mode** or **Identity-based Crawling** (Fight bot detection by simulating user behavior).
- **Advanced Caching** (Fine-tune read/write cache modes).

---

## 5. Conclusion

**BrowserConfig** and **CrawlerRunConfig** give you straightforward ways to define:

- **Which** browser to launch, how it should run, and any proxy or user agent needs.
- **How** each crawl should behave—caching, timeouts, JavaScript code, extraction strategies, etc.

Use them together for **clear, maintainable** code, and when you need more specialized behavior, check out the advanced parameters in the [reference docs](../api/parameters.md). Happy crawling!
75
docs/md_v2/core/cache-modes.md
Normal file
@@ -0,0 +1,75 @@
# Crawl4AI Cache System and Migration Guide

## Overview

Starting from version 0.5.0, Crawl4AI introduces a new caching system that replaces the old boolean flags with a more intuitive `CacheMode` enum. This change simplifies cache control and makes the behavior more predictable.

## Old vs New Approach

### Old Way (Deprecated)

The old system used multiple boolean flags:

- `bypass_cache`: Skip cache entirely
- `disable_cache`: Disable all caching
- `no_cache_read`: Don't read from cache
- `no_cache_write`: Don't write to cache

### New Way (Recommended)

The new system uses a single `CacheMode` enum:

- `CacheMode.ENABLED`: Normal caching (read/write)
- `CacheMode.DISABLED`: No caching at all
- `CacheMode.READ_ONLY`: Only read from cache
- `CacheMode.WRITE_ONLY`: Only write to cache
- `CacheMode.BYPASS`: Skip cache for this operation
## Migration Example

### Old Code (Deprecated)

```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def use_proxy():
    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(
            url="https://www.nbcnews.com/business",
            bypass_cache=True  # Old way
        )
        print(len(result.markdown))

async def main():
    await use_proxy()

if __name__ == "__main__":
    asyncio.run(main())
```
### New Code (Recommended)

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CacheMode
from crawl4ai.async_configs import CrawlerRunConfig

async def use_proxy():
    # Use CacheMode in CrawlerRunConfig
    config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(
            url="https://www.nbcnews.com/business",
            config=config  # Pass the configuration object
        )
        print(len(result.markdown))

async def main():
    await use_proxy()

if __name__ == "__main__":
    asyncio.run(main())
```
## Common Migration Patterns

| Old Flag                | New Mode                          |
|-------------------------|-----------------------------------|
| `bypass_cache=True`     | `cache_mode=CacheMode.BYPASS`     |
| `disable_cache=True`    | `cache_mode=CacheMode.DISABLED`   |
| `no_cache_read=True`    | `cache_mode=CacheMode.WRITE_ONLY` |
| `no_cache_write=True`   | `cache_mode=CacheMode.READ_ONLY`  |
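
When porting older scripts, the table above can be captured as a small lookup. The helper below is a sketch for illustration only (not part of Crawl4AI); it maps each deprecated flag name to the replacement `CacheMode` member, expressed as a string:

```python
# Hypothetical migration helper -- not part of Crawl4AI itself.
OLD_FLAG_TO_CACHE_MODE = {
    "bypass_cache": "BYPASS",
    "disable_cache": "DISABLED",
    "no_cache_read": "WRITE_ONLY",   # can't read => write-only
    "no_cache_write": "READ_ONLY",   # can't write => read-only
}

def migrate_flag(flag_name: str) -> str:
    """Return the CacheMode member name that replaces a deprecated flag."""
    try:
        return f"CacheMode.{OLD_FLAG_TO_CACHE_MODE[flag_name]}"
    except KeyError:
        raise ValueError(f"unknown legacy cache flag: {flag_name}") from None
```

Note the deliberate inversion: disabling cache *reads* leaves only *writes* (`WRITE_ONLY`), and vice versa.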
332
docs/md_v2/core/content-selection.md
Normal file
@@ -0,0 +1,332 @@
# Content Selection

Crawl4AI provides multiple ways to **select**, **filter**, and **refine** the content from your crawls. Whether you need to target a specific CSS region, exclude entire tags, filter out external links, or remove certain domains and images, **`CrawlerRunConfig`** offers a wide range of parameters.

Below, we show how to configure these parameters and combine them for precise control.

---
## 1. CSS-Based Selection

A straightforward way to **limit** your crawl results to a certain region of the page is **`css_selector`** in **`CrawlerRunConfig`**:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    config = CrawlerRunConfig(
        # e.g., first 30 items from Hacker News
        css_selector=".athing:nth-child(-n+30)"
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://news.ycombinator.com/newest",
            config=config
        )
        print("Partial HTML length:", len(result.cleaned_html))

if __name__ == "__main__":
    asyncio.run(main())
```

**Result**: Only elements matching that selector remain in `result.cleaned_html`.
---

## 2. Content Filtering & Exclusions

### 2.1 Basic Overview

```python
config = CrawlerRunConfig(
    # Content thresholds
    word_count_threshold=10,  # Minimum words per block

    # Tag exclusions
    excluded_tags=['form', 'header', 'footer', 'nav'],

    # Link filtering
    exclude_external_links=True,
    exclude_social_media_links=True,
    # Block entire domains
    exclude_domains=["adtrackers.com", "spammynews.org"],
    exclude_social_media_domains=["facebook.com", "twitter.com"],

    # Media filtering
    exclude_external_images=True
)
```

**Explanation**:

- **`word_count_threshold`**: Ignores text blocks under X words. Helps skip trivial blocks like short nav or disclaimers.
- **`excluded_tags`**: Removes entire tags (`<form>`, `<header>`, `<footer>`, etc.).
- **Link Filtering**:
  - `exclude_external_links`: Strips out external links and may remove them from `result.links`.
  - `exclude_social_media_links`: Removes links pointing to known social media domains.
  - `exclude_domains`: A custom list of domains to block if discovered in links.
  - `exclude_social_media_domains`: A curated list (override or add to it) for social media sites.
- **Media Filtering**:
  - `exclude_external_images`: Discards images not hosted on the same domain as the main page (or its subdomains).

When you set `exclude_social_media_links=True`, the following social media domains are excluded by default:

```python
[
    'facebook.com',
    'twitter.com',
    'x.com',
    'linkedin.com',
    'instagram.com',
    'pinterest.com',
    'tiktok.com',
    'snapchat.com',
    'reddit.com',
]
```
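
To illustrate the kind of matching that domain-based exclusion implies, here is a standalone sketch. It is our own helper, not Crawl4AI internals, and it assumes subdomains of a blocked domain are excluded too:

```python
from urllib.parse import urlparse

def is_excluded(url: str, excluded_domains: list[str]) -> bool:
    """Return True if the URL's host is one of the excluded domains,
    or a subdomain of one. Illustrative only -- Crawl4AI's actual
    matching logic may differ."""
    host = urlparse(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in excluded_domains)
```

For example, `is_excluded("https://ads.adtrackers.com/pixel", ["adtrackers.com"])` matches via the subdomain rule, while `example.com` passes through untouched.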
### 2.2 Example Usage

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode

async def main():
    config = CrawlerRunConfig(
        css_selector="main.content",
        word_count_threshold=10,
        excluded_tags=["nav", "footer"],
        exclude_external_links=True,
        exclude_social_media_links=True,
        exclude_domains=["ads.com", "spammytrackers.net"],
        exclude_external_images=True,
        cache_mode=CacheMode.BYPASS
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://news.ycombinator.com", config=config)
        print("Cleaned HTML length:", len(result.cleaned_html))

if __name__ == "__main__":
    asyncio.run(main())
```

**Note**: If these parameters remove too much, reduce or disable them accordingly.
---

## 3. Handling Iframes

Some sites embed content in `<iframe>` tags. If you want that inline:

```python
config = CrawlerRunConfig(
    # Merge iframe content into the final output
    process_iframes=True,
    remove_overlay_elements=True
)
```

**Usage**:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    config = CrawlerRunConfig(
        process_iframes=True,
        remove_overlay_elements=True
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.org/iframe-demo",
            config=config
        )
        print("Iframe-merged length:", len(result.cleaned_html))

if __name__ == "__main__":
    asyncio.run(main())
```
---

## 4. Structured Extraction Examples

You can combine content selection with a more advanced extraction strategy. For instance, a **CSS-based** or **LLM-based** extraction strategy can run on the filtered HTML.

### 4.1 Pattern-Based with `JsonCssExtractionStrategy`

```python
import asyncio
import json
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

async def main():
    # Minimal schema for repeated items
    schema = {
        "name": "News Items",
        "baseSelector": "tr.athing",
        "fields": [
            {"name": "title", "selector": "a.storylink", "type": "text"},
            {
                "name": "link",
                "selector": "a.storylink",
                "type": "attribute",
                "attribute": "href"
            }
        ]
    }

    config = CrawlerRunConfig(
        # Content filtering
        excluded_tags=["form", "header"],
        exclude_domains=["adsite.com"],

        # CSS selection or entire page
        css_selector="table.itemlist",

        # No caching for demonstration
        cache_mode=CacheMode.BYPASS,

        # Extraction strategy
        extraction_strategy=JsonCssExtractionStrategy(schema)
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://news.ycombinator.com/newest",
            config=config
        )
        data = json.loads(result.extracted_content)
        print("Sample extracted item:", data[:1])  # Show first item

if __name__ == "__main__":
    asyncio.run(main())
```
### 4.2 LLM-Based Extraction

```python
import asyncio
import json
from pydantic import BaseModel
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.extraction_strategy import LLMExtractionStrategy

class ArticleData(BaseModel):
    headline: str
    summary: str

async def main():
    llm_strategy = LLMExtractionStrategy(
        provider="openai/gpt-4",
        api_token="sk-YOUR_API_KEY",
        schema=ArticleData.schema(),
        extraction_type="schema",
        instruction="Extract 'headline' and a short 'summary' from the content."
    )

    config = CrawlerRunConfig(
        exclude_external_links=True,
        word_count_threshold=20,
        extraction_strategy=llm_strategy
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://news.ycombinator.com", config=config)
        article = json.loads(result.extracted_content)
        print(article)

if __name__ == "__main__":
    asyncio.run(main())
```

Here, the crawler:

- Filters out external links (`exclude_external_links=True`).
- Ignores very short text blocks (`word_count_threshold=20`).
- Passes the final HTML to your LLM strategy for an AI-driven parse.
---

## 5. Comprehensive Example

Below is a short function that unifies **CSS selection**, **exclusion** logic, and a pattern-based extraction, demonstrating how you can fine-tune your final data:

```python
import asyncio
import json
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

async def extract_main_articles(url: str):
    schema = {
        "name": "ArticleBlock",
        "baseSelector": "div.article-block",
        "fields": [
            {"name": "headline", "selector": "h2", "type": "text"},
            {"name": "summary", "selector": ".summary", "type": "text"},
            {
                "name": "metadata",
                "type": "nested",
                "fields": [
                    {"name": "author", "selector": ".author", "type": "text"},
                    {"name": "date", "selector": ".date", "type": "text"}
                ]
            }
        ]
    }

    config = CrawlerRunConfig(
        # Keep only #main-content
        css_selector="#main-content",

        # Filtering
        word_count_threshold=10,
        excluded_tags=["nav", "footer"],
        exclude_external_links=True,
        exclude_domains=["somebadsite.com"],
        exclude_external_images=True,

        # Extraction
        extraction_strategy=JsonCssExtractionStrategy(schema),

        cache_mode=CacheMode.BYPASS
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=url, config=config)
        if not result.success:
            print(f"Error: {result.error_message}")
            return None
        return json.loads(result.extracted_content)

async def main():
    articles = await extract_main_articles("https://news.ycombinator.com/newest")
    if articles:
        print("Extracted Articles:", articles[:2])  # Show first 2

if __name__ == "__main__":
    asyncio.run(main())
```
**Why This Works**:

- **CSS** scoping with `#main-content`.
- Multiple **exclude_** parameters to remove domains, external images, etc.
- A **JsonCssExtractionStrategy** to parse repeated article blocks.

---

## 6. Conclusion

By mixing **css_selector** scoping, **content filtering** parameters, and advanced **extraction strategies**, you can precisely **choose** which data to keep. Key parameters in **`CrawlerRunConfig`** for content selection include:

1. **`css_selector`** – Basic scoping to an element or region.
2. **`word_count_threshold`** – Skip short blocks.
3. **`excluded_tags`** – Remove entire HTML tags.
4. **`exclude_external_links`**, **`exclude_social_media_links`**, **`exclude_domains`** – Filter out unwanted links or domains.
5. **`exclude_external_images`** – Remove images from external sources.
6. **`process_iframes`** – Merge iframe content if needed.

Combine these with structured extraction (CSS, LLM-based, or others) to build powerful crawls that yield exactly the content you want, from raw or cleaned HTML up to sophisticated JSON structures. For more detail, see [Configuration Reference](../api/parameters.md). Enjoy curating your data to the max!
246
docs/md_v2/core/crawler-result.md
Normal file
@@ -0,0 +1,246 @@
# Crawl Result and Output

When you call `arun()` on a page, Crawl4AI returns a **`CrawlResult`** object containing everything you might need—raw HTML, a cleaned version, optional screenshots or PDFs, structured extraction results, and more. This document explains those fields and how they map to different output types.

---
## 1. The `CrawlResult` Model

Below is the core schema. Each field captures a different aspect of the crawl’s result:

```python
class MarkdownGenerationResult(BaseModel):
    raw_markdown: str
    markdown_with_citations: str
    references_markdown: str
    fit_markdown: Optional[str] = None
    fit_html: Optional[str] = None

class CrawlResult(BaseModel):
    url: str
    html: str
    success: bool
    cleaned_html: Optional[str] = None
    media: Dict[str, List[Dict]] = {}
    links: Dict[str, List[Dict]] = {}
    downloaded_files: Optional[List[str]] = None
    screenshot: Optional[str] = None
    pdf: Optional[bytes] = None
    markdown: Optional[Union[str, MarkdownGenerationResult]] = None
    markdown_v2: Optional[MarkdownGenerationResult] = None
    extracted_content: Optional[str] = None
    metadata: Optional[dict] = None
    error_message: Optional[str] = None
    session_id: Optional[str] = None
    response_headers: Optional[dict] = None
    status_code: Optional[int] = None
    ssl_certificate: Optional[SSLCertificate] = None

    class Config:
        arbitrary_types_allowed = True
```
### Table: Key Fields in `CrawlResult`

| Field (Name & Type) | Description |
|---------------------|-------------|
| **url (`str`)** | The final or actual URL crawled (in case of redirects). |
| **html (`str`)** | Original, unmodified page HTML. Good for debugging or custom processing. |
| **success (`bool`)** | `True` if the crawl completed without major errors, else `False`. |
| **cleaned_html (`Optional[str]`)** | Sanitized HTML with scripts/styles removed; can exclude tags if configured via `excluded_tags` etc. |
| **media (`Dict[str, List[Dict]]`)** | Extracted media info (images, audio, etc.), each with attributes like `src`, `alt`, `score`, etc. |
| **links (`Dict[str, List[Dict]]`)** | Extracted link data, split by `internal` and `external`. Each link usually has `href`, `text`, etc. |
| **downloaded_files (`Optional[List[str]]`)** | If `accept_downloads=True` in `BrowserConfig`, this lists the filepaths of saved downloads. |
| **screenshot (`Optional[str]`)** | Screenshot of the page (base64-encoded) if `screenshot=True`. |
| **pdf (`Optional[bytes]`)** | PDF of the page if `pdf=True`. |
| **markdown (`Optional[str or MarkdownGenerationResult]`)** | For now, `markdown_v2` holds a `MarkdownGenerationResult`. Over time, this will be consolidated into `markdown`. The generator can provide raw markdown, citations, references, and optionally `fit_markdown`. |
| **markdown_v2 (`Optional[MarkdownGenerationResult]`)** | Legacy field for detailed markdown output. This will be replaced by `markdown` soon. |
| **extracted_content (`Optional[str]`)** | The output of a structured extraction (CSS/LLM-based) stored as a JSON string or other text. |
| **metadata (`Optional[dict]`)** | Additional info about the crawl or extracted data. |
| **error_message (`Optional[str]`)** | If `success=False`, contains a short description of what went wrong. |
| **session_id (`Optional[str]`)** | The ID of the session used for multi-page or persistent crawling. |
| **response_headers (`Optional[dict]`)** | HTTP response headers, if captured. |
| **status_code (`Optional[int]`)** | HTTP status code (e.g., 200 for OK). |
| **ssl_certificate (`Optional[SSLCertificate]`)** | SSL certificate info if `fetch_ssl_certificate=True`. |

---
## 2. HTML Variants

### `html`: Raw HTML

Crawl4AI preserves the exact HTML as `result.html`. Useful for:

- Debugging page issues or checking the original content.
- Performing your own specialized parse if needed.

### `cleaned_html`: Sanitized

If you specify any cleanup or exclusion parameters in `CrawlerRunConfig` (like `excluded_tags`, `remove_forms`, etc.), you’ll see the result here:

```python
config = CrawlerRunConfig(
    excluded_tags=["form", "header", "footer"],
    keep_data_attributes=False
)
result = await crawler.arun("https://example.com", config=config)
print(result.cleaned_html)  # Freed of forms, header, footer, data-* attributes
```

---
## 3. Markdown Generation

### 3.1 `markdown_v2` (Legacy) vs `markdown`

- **`markdown_v2`**: The current location for detailed markdown output, returning a **`MarkdownGenerationResult`** object.
- **`markdown`**: Eventually, we’re merging these fields. For now, you might see `result.markdown_v2` used widely in code examples.

**`MarkdownGenerationResult`** fields:

| Field | Description |
|-------|-------------|
| **raw_markdown** | The basic HTML→Markdown conversion. |
| **markdown_with_citations** | Markdown including inline citations that reference links at the end. |
| **references_markdown** | The references/citations themselves (if `citations=True`). |
| **fit_markdown** | The filtered/“fit” markdown if a content filter was used. |
| **fit_html** | The filtered HTML that generated `fit_markdown`. |

### 3.2 Basic Example with a Markdown Generator

```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

config = CrawlerRunConfig(
    markdown_generator=DefaultMarkdownGenerator(
        options={"citations": True, "body_width": 80}  # e.g. pass html2text style options
    )
)
result = await crawler.arun(url="https://example.com", config=config)

md_res = result.markdown_v2  # or eventually 'result.markdown'
print(md_res.raw_markdown[:500])
print(md_res.markdown_with_citations)
print(md_res.references_markdown)
```

**Note**: If you use a filter like `PruningContentFilter`, you’ll get `fit_markdown` and `fit_html` as well.

---
## 4. Structured Extraction: `extracted_content`
|
||||
|
||||
If you run a JSON-based extraction strategy (CSS, XPath, LLM, etc.), the structured data is **not** stored in `markdown`—it’s placed in **`result.extracted_content`** as a JSON string (or sometimes plain text).
|
||||
|
||||
### Example: CSS Extraction with `raw://` HTML
|
||||
|
||||
```python
import asyncio
import json
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

async def main():
    schema = {
        "name": "Example Items",
        "baseSelector": "div.item",
        "fields": [
            {"name": "title", "selector": "h2", "type": "text"},
            {"name": "link", "selector": "a", "type": "attribute", "attribute": "href"}
        ]
    }
    raw_html = "<div class='item'><h2>Item 1</h2><a href='https://example.com/item1'>Link 1</a></div>"

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="raw://" + raw_html,
            config=CrawlerRunConfig(
                cache_mode=CacheMode.BYPASS,
                extraction_strategy=JsonCssExtractionStrategy(schema)
            )
        )
        data = json.loads(result.extracted_content)
        print(data)

if __name__ == "__main__":
    asyncio.run(main())
```
Here:

- `url="raw://..."` passes the HTML content directly, no network requests.
- The **CSS** extraction strategy populates `result.extracted_content` with the JSON array `[{"title": "...", "link": "..."}]`.

---

## 5. More Fields: Links, Media, and More

### 5.1 `links`

A dictionary, typically with `"internal"` and `"external"` lists. Each entry might have `href`, `text`, `title`, etc. This is automatically captured if you haven’t disabled link extraction.

```python
print(result.links["internal"][:3])  # Show first 3 internal links
```

### 5.2 `media`

Similarly, a dictionary with `"images"`, `"audio"`, `"video"`, etc. Each item could include `src`, `alt`, `score`, and more, if your crawler is set to gather them.

```python
images = result.media.get("images", [])
for img in images:
    print("Image URL:", img["src"], "Alt:", img.get("alt"))
```

### 5.3 `screenshot` and `pdf`

If you set `screenshot=True` or `pdf=True` in **`CrawlerRunConfig`**, then:

- `result.screenshot` contains a base64-encoded PNG string.
- `result.pdf` contains raw PDF bytes (you can write them to a file).

```python
with open("page.pdf", "wb") as f:
    f.write(result.pdf)
```
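Saving the screenshot works the same way once you base64-decode it. A minimal sketch (a stand-in string is used here so the snippet is self-contained; in a real run you would decode `result.screenshot` from a crawl with `screenshot=True`):

```python
import base64

# Stand-in for result.screenshot, which is a base64-encoded PNG string.
screenshot_b64 = base64.b64encode(b"\x89PNG fake image bytes").decode()

with open("page.png", "wb") as f:
    f.write(base64.b64decode(screenshot_b64))
```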
### 5.4 `ssl_certificate`

If `fetch_ssl_certificate=True`, `result.ssl_certificate` holds details about the site’s SSL cert, such as issuer, validity dates, etc.

---

## 6. Accessing These Fields

After you run:

```python
result = await crawler.arun(url="https://example.com", config=some_config)
```

Check any field:

```python
if result.success:
    print(result.status_code, result.response_headers)
    print("Links found:", len(result.links.get("internal", [])))
    if result.markdown_v2:
        print("Markdown snippet:", result.markdown_v2.raw_markdown[:200])
    if result.extracted_content:
        print("Structured JSON:", result.extracted_content)
else:
    print("Error:", result.error_message)
```

**Remember**: Use `result.markdown_v2` for now. It will eventually become `result.markdown`.

---

## 7. Next Steps

- **Markdown Generation**: Dive deeper into how to configure `DefaultMarkdownGenerator` and various filters.
- **Content Filtering**: Learn how to use `BM25ContentFilter` and `PruningContentFilter`.
- **Session & Hooks**: If you want to manipulate the page or preserve state across multiple `arun()` calls, see the hooking or session docs.
- **LLM Extraction**: For complex or unstructured content requiring AI-driven parsing, check the LLM-based strategies doc.

**Enjoy** exploring all that `CrawlResult` offers—whether you need raw HTML, sanitized output, markdown, or fully structured data, Crawl4AI has you covered!
702
docs/md_v2/core/docker-deploymeny.md
Normal file
@@ -0,0 +1,702 @@
# Docker Deployment

Crawl4AI provides official Docker images for easy deployment and scalability. This guide covers installation, configuration, and usage of Crawl4AI in Docker environments.

## Quick Start 🚀

Pull and run the basic version:

```bash
# Basic run without security
docker pull unclecode/crawl4ai:basic
docker run -p 11235:11235 unclecode/crawl4ai:basic

# Run with API security enabled
docker run -p 11235:11235 -e CRAWL4AI_API_TOKEN=your_secret_token unclecode/crawl4ai:basic
```

## Running with Docker Compose 🐳

Crawl4AI provides flexibility to use Docker Compose for managing your containerized services. You can either build the image locally from the provided `Dockerfile` or use the pre-built image from Docker Hub.

### **Option 1: Using Docker Compose to Build Locally**

If you want to build the image locally, use the provided `docker-compose.local.yml` file.

```bash
docker-compose -f docker-compose.local.yml up -d
```

This will:

1. Build the Docker image from the provided `Dockerfile`.
2. Start the container and expose it on `http://localhost:11235`.

---

### **Option 2: Using Docker Compose with Pre-Built Image from Hub**

If you prefer using the pre-built image from Docker Hub, use the `docker-compose.hub.yml` file.

```bash
docker-compose -f docker-compose.hub.yml up -d
```

This will:

1. Pull the pre-built image `unclecode/crawl4ai:basic` (or `all`, depending on your configuration).
2. Start the container and expose it on `http://localhost:11235`.

---

### **Stopping the Running Services**

To stop the services started via Docker Compose, use:

```bash
docker-compose -f docker-compose.local.yml down
# OR
docker-compose -f docker-compose.hub.yml down
```

If the containers don’t stop and the application is still running, check the running containers:

```bash
docker ps
```

Find the `CONTAINER ID` of the running service and stop it forcefully:

```bash
docker stop <CONTAINER_ID>
```

---

### **Debugging with Docker Compose**

- **Check Logs**: To view the container logs:
  ```bash
  docker-compose -f docker-compose.local.yml logs -f
  ```

- **Remove Orphaned Containers**: If the service is still running unexpectedly:
  ```bash
  docker-compose -f docker-compose.local.yml down --remove-orphans
  ```

- **Manually Remove Network**: If the network is still in use:
  ```bash
  docker network ls
  docker network rm crawl4ai_default
  ```

---

### Why Use Docker Compose?

Docker Compose is the recommended way to deploy Crawl4AI because it:

1. Simplifies multi-container setups.
2. Lets you define environment variables, resources, and ports in a single file.
3. Makes it easier to switch between local development and production-ready images.

For example, your `docker-compose.yml` could include API keys, token settings, and memory limits, making deployment quick and consistent.

## API Security 🔒

### Understanding CRAWL4AI_API_TOKEN

The `CRAWL4AI_API_TOKEN` provides optional security for your Crawl4AI instance:

- If `CRAWL4AI_API_TOKEN` is set: all API endpoints (except `/health`) require authentication.
- If `CRAWL4AI_API_TOKEN` is not set: the API is publicly accessible.

```bash
# Secured Instance
docker run -p 11235:11235 -e CRAWL4AI_API_TOKEN=your_secret_token unclecode/crawl4ai:all

# Unsecured Instance
docker run -p 11235:11235 unclecode/crawl4ai:all
```

### Making API Calls

For secured instances, include the token in all requests:

```python
import requests

# Set up headers if a token is being used
api_token = "your_secret_token"  # Same token set in CRAWL4AI_API_TOKEN
headers = {"Authorization": f"Bearer {api_token}"} if api_token else {}

# Make an authenticated request
response = requests.post(
    "http://localhost:11235/crawl",
    headers=headers,
    json={
        "urls": "https://example.com",
        "priority": 10
    }
)

# Check task status
task_id = response.json()["task_id"]
status = requests.get(
    f"http://localhost:11235/task/{task_id}",
    headers=headers
)
```

### Using with Docker Compose

In your `docker-compose.yml`:

```yaml
services:
  crawl4ai:
    image: unclecode/crawl4ai:all
    environment:
      - CRAWL4AI_API_TOKEN=${CRAWL4AI_API_TOKEN:-}  # Optional
      # ... other configuration
```

Then either:

1. Set it in a `.env` file:
   ```env
   CRAWL4AI_API_TOKEN=your_secret_token
   ```

2. Or set it via the command line:
   ```bash
   CRAWL4AI_API_TOKEN=your_secret_token docker-compose up
   ```

> **Security Note**: If you enable the API token, keep it secure and never commit it to version control. The token will be required for all API endpoints except the health check endpoint (`/health`).
## Configuration Options 🔧

### Environment Variables

You can configure the service using environment variables:

```bash
# Basic configuration
docker run -p 11235:11235 \
    -e MAX_CONCURRENT_TASKS=5 \
    unclecode/crawl4ai:all

# With security and LLM support
docker run -p 11235:11235 \
    -e CRAWL4AI_API_TOKEN=your_secret_token \
    -e OPENAI_API_KEY=sk-... \
    -e ANTHROPIC_API_KEY=sk-ant-... \
    unclecode/crawl4ai:all
```

### Using Docker Compose (Recommended) 🐳

Create a `docker-compose.yml`:

```yaml
version: '3.8'

services:
  crawl4ai:
    image: unclecode/crawl4ai:all
    ports:
      - "11235:11235"
    environment:
      - CRAWL4AI_API_TOKEN=${CRAWL4AI_API_TOKEN:-}  # Optional API security
      - MAX_CONCURRENT_TASKS=5
      # LLM Provider Keys
      - OPENAI_API_KEY=${OPENAI_API_KEY:-}
      - ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY:-}
    volumes:
      - /dev/shm:/dev/shm
    deploy:
      resources:
        limits:
          memory: 4G
        reservations:
          memory: 1G
```

You can run it in two ways:

1. Using environment variables directly:
   ```bash
   CRAWL4AI_API_TOKEN=secret123 OPENAI_API_KEY=sk-... docker-compose up
   ```

2. Using a `.env` file (recommended). Create a `.env` file in the same directory:
   ```env
   # API Security (optional)
   CRAWL4AI_API_TOKEN=your_secret_token

   # LLM Provider Keys
   OPENAI_API_KEY=sk-...
   ANTHROPIC_API_KEY=sk-ant-...

   # Other Configuration
   MAX_CONCURRENT_TASKS=5
   ```

   Then simply run:
   ```bash
   docker-compose up
   ```
### Testing the Deployment 🧪

```python
import requests

# For unsecured instances
def test_unsecured():
    # Health check
    health = requests.get("http://localhost:11235/health")
    print("Health check:", health.json())

    # Basic crawl
    response = requests.post(
        "http://localhost:11235/crawl",
        json={
            "urls": "https://www.nbcnews.com/business",
            "priority": 10
        }
    )
    task_id = response.json()["task_id"]
    print("Task ID:", task_id)

# For secured instances
def test_secured(api_token):
    headers = {"Authorization": f"Bearer {api_token}"}

    # Basic crawl with authentication
    response = requests.post(
        "http://localhost:11235/crawl",
        headers=headers,
        json={
            "urls": "https://www.nbcnews.com/business",
            "priority": 10
        }
    )
    task_id = response.json()["task_id"]
    print("Task ID:", task_id)
```

### LLM Extraction Example 🤖

When you've configured your LLM provider keys (via environment variables or `.env`), you can use LLM extraction:

```python
request = {
    "urls": "https://example.com",
    "extraction_config": {
        "type": "llm",
        "params": {
            "provider": "openai/gpt-4",
            "instruction": "Extract main topics from the page"
        }
    }
}

# Make the request (add headers if using API security)
response = requests.post("http://localhost:11235/crawl", json=request)
```

> **Note**: Remember to add `.env` to your `.gitignore` to keep your API keys secure!

## Usage Examples 📝

### Basic Crawling

```python
request = {
    "urls": "https://www.nbcnews.com/business",
    "priority": 10
}

response = requests.post("http://localhost:11235/crawl", json=request)
task_id = response.json()["task_id"]

# Get results
result = requests.get(f"http://localhost:11235/task/{task_id}")
```

### Structured Data Extraction

```python
schema = {
    "name": "Crypto Prices",
    "baseSelector": ".cds-tableRow-t45thuk",
    "fields": [
        {
            "name": "crypto",
            "selector": "td:nth-child(1) h2",
            "type": "text",
        },
        {
            "name": "price",
            "selector": "td:nth-child(2)",
            "type": "text",
        }
    ],
}

request = {
    "urls": "https://www.coinbase.com/explore",
    "extraction_config": {
        "type": "json_css",
        "params": {"schema": schema}
    }
}
```

### Dynamic Content Handling

```python
request = {
    "urls": "https://www.nbcnews.com/business",
    "js_code": [
        "const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More')); loadMoreButton && loadMoreButton.click();"
    ],
    "wait_for": "article.tease-card:nth-child(10)"
}
```

### AI-Powered Extraction (Full Version)

```python
request = {
    "urls": "https://www.nbcnews.com/business",
    "extraction_config": {
        "type": "cosine",
        "params": {
            "semantic_filter": "business finance economy",
            "word_count_threshold": 10,
            "max_dist": 0.2,
            "top_k": 3
        }
    }
}
```

## Platform-Specific Instructions 💻

### macOS

```bash
docker pull unclecode/crawl4ai:basic
docker run -p 11235:11235 unclecode/crawl4ai:basic
```

### Ubuntu

```bash
# Basic version
docker pull unclecode/crawl4ai:basic
docker run -p 11235:11235 unclecode/crawl4ai:basic

# With GPU support
docker pull unclecode/crawl4ai:gpu
docker run --gpus all -p 11235:11235 unclecode/crawl4ai:gpu
```

### Windows (PowerShell)

```powershell
docker pull unclecode/crawl4ai:basic
docker run -p 11235:11235 unclecode/crawl4ai:basic
```

## Testing 🧪

Save this as `test_docker.py`:

```python
import requests
import time

class Crawl4AiTester:
    def __init__(self, base_url: str = "http://localhost:11235"):
        self.base_url = base_url

    def submit_and_wait(self, request_data: dict, timeout: int = 300) -> dict:
        # Submit crawl job
        response = requests.post(f"{self.base_url}/crawl", json=request_data)
        task_id = response.json()["task_id"]
        print(f"Task ID: {task_id}")

        # Poll for result
        start_time = time.time()
        while True:
            if time.time() - start_time > timeout:
                raise TimeoutError(f"Task {task_id} timeout")

            result = requests.get(f"{self.base_url}/task/{task_id}")
            status = result.json()

            if status["status"] == "completed":
                return status

            time.sleep(2)

def test_deployment():
    tester = Crawl4AiTester()

    # Test basic crawl
    request = {
        "urls": "https://www.nbcnews.com/business",
        "priority": 10
    }

    result = tester.submit_and_wait(request)
    print("Basic crawl successful!")
    print(f"Content length: {len(result['result']['markdown'])}")

if __name__ == "__main__":
    test_deployment()
```

## Advanced Configuration ⚙️

### Crawler Parameters

The `crawler_params` field allows you to configure the browser instance and crawling behavior. Here are key parameters you can use:

```python
request = {
    "urls": "https://example.com",
    "crawler_params": {
        # Browser Configuration
        "headless": True,                   # Run in headless mode
        "browser_type": "chromium",         # chromium/firefox/webkit
        "user_agent": "custom-agent",       # Custom user agent
        "proxy": "http://proxy:8080",       # Proxy configuration

        # Performance & Behavior
        "page_timeout": 30000,              # Page load timeout (ms)
        "verbose": True,                    # Enable detailed logging
        "semaphore_count": 5,               # Concurrent request limit

        # Anti-Detection Features
        "simulate_user": True,              # Simulate human behavior
        "magic": True,                      # Advanced anti-detection
        "override_navigator": True,         # Override navigator properties

        # Session Management
        "user_data_dir": "./browser-data",  # Browser profile location
        "use_managed_browser": True,        # Use persistent browser
    }
}
```

### Extra Parameters

The `extra` field allows passing additional parameters directly to the crawler's `arun` function:

```python
request = {
    "urls": "https://example.com",
    "extra": {
        "word_count_threshold": 10,   # Min words per block
        "only_text": True,            # Extract only text
        "bypass_cache": True,         # Force fresh crawl
        "process_iframes": True,      # Include iframe content
    }
}
```

### Complete Examples

1. **Advanced News Crawling**
   ```python
   request = {
       "urls": "https://www.nbcnews.com/business",
       "crawler_params": {
           "headless": True,
           "page_timeout": 30000,
           "remove_overlay_elements": True    # Remove popups
       },
       "extra": {
           "word_count_threshold": 50,        # Longer content blocks
           "bypass_cache": True               # Fresh content
       },
       "css_selector": ".article-body"
   }
   ```

2. **Anti-Detection Configuration**
   ```python
   request = {
       "urls": "https://example.com",
       "crawler_params": {
           "simulate_user": True,
           "magic": True,
           "override_navigator": True,
           "user_agent": "Mozilla/5.0 ...",
           "headers": {
               "Accept-Language": "en-US,en;q=0.9"
           }
       }
   }
   ```

3. **LLM Extraction with Custom Parameters**
   ```python
   request = {
       "urls": "https://openai.com/pricing",
       "extraction_config": {
           "type": "llm",
           "params": {
               "provider": "openai/gpt-4",
               "schema": pricing_schema
           }
       },
       "crawler_params": {
           "verbose": True,
           "page_timeout": 60000
       },
       "extra": {
           "word_count_threshold": 1,
           "only_text": True
       }
   }
   ```

4. **Session-Based Dynamic Content**
   ```python
   request = {
       "urls": "https://example.com",
       "crawler_params": {
           "session_id": "dynamic_session",
           "headless": False,
           "page_timeout": 60000
       },
       "js_code": ["window.scrollTo(0, document.body.scrollHeight);"],
       "wait_for": "js:() => document.querySelectorAll('.item').length > 10",
       "extra": {
           "delay_before_return_html": 2.0
       }
   }
   ```

5. **Screenshot with Custom Timing**
   ```python
   request = {
       "urls": "https://example.com",
       "screenshot": True,
       "crawler_params": {
           "headless": True,
           "screenshot_wait_for": ".main-content"
       },
       "extra": {
           "delay_before_return_html": 3.0
       }
   }
   ```

### Parameter Reference Table

| Category | Parameter | Type | Description |
|----------|-----------|------|-------------|
| Browser | headless | bool | Run browser in headless mode |
| Browser | browser_type | str | Browser engine selection |
| Browser | user_agent | str | Custom user agent string |
| Network | proxy | str | Proxy server URL |
| Network | headers | dict | Custom HTTP headers |
| Timing | page_timeout | int | Page load timeout (ms) |
| Timing | delay_before_return_html | float | Wait before capture |
| Anti-Detection | simulate_user | bool | Human behavior simulation |
| Anti-Detection | magic | bool | Advanced protection |
| Session | session_id | str | Browser session ID |
| Session | user_data_dir | str | Profile directory |
| Content | word_count_threshold | int | Minimum words per block |
| Content | only_text | bool | Text-only extraction |
| Content | process_iframes | bool | Include iframe content |
| Debug | verbose | bool | Detailed logging |
| Debug | log_console | bool | Browser console logs |
## Troubleshooting 🔍

### Common Issues

1. **Connection Refused**
   ```
   Error: Connection refused at localhost:11235
   ```
   Solution: Ensure the container is running and ports are properly mapped.

2. **Resource Limits**
   ```
   Error: No available slots
   ```
   Solution: Increase MAX_CONCURRENT_TASKS or container resources.

3. **GPU Access**
   ```
   Error: GPU not found
   ```
   Solution: Ensure proper NVIDIA drivers are installed and use the `--gpus all` flag.

### Debug Mode

Access the container for debugging:

```bash
docker run -it --entrypoint /bin/bash unclecode/crawl4ai:all
```

View container logs:

```bash
docker logs [container_id]
```

## Best Practices 🌟

1. **Resource Management**
   - Set appropriate memory and CPU limits
   - Monitor resource usage via the health endpoint
   - Use the basic version for simple crawling tasks

2. **Scaling**
   - Use multiple containers for high load
   - Implement proper load balancing
   - Monitor performance metrics

3. **Security**
   - Use environment variables for sensitive data
   - Implement proper network isolation
   - Apply security updates regularly

## API Reference 📚

### Health Check

```http
GET /health
```

### Submit Crawl Task

```http
POST /crawl
Content-Type: application/json

{
    "urls": "string or array",
    "extraction_config": {
        "type": "basic|llm|cosine|json_css",
        "params": {}
    },
    "priority": 1-10,
    "ttl": 3600
}
```

### Get Task Status

```http
GET /task/{task_id}
```

For more details, visit the [official documentation](https://crawl4ai.com/mkdocs/).
248
docs/md_v2/core/fit-markdown.md
Normal file
@@ -0,0 +1,248 @@
# Fit Markdown with Pruning & BM25

**Fit Markdown** is a specialized **filtered** version of your page’s markdown, focusing on the most relevant content. By default, Crawl4AI converts the entire HTML into a broad **raw_markdown**. With fit markdown, we apply a **content filter** algorithm (e.g., **Pruning** or **BM25**) to remove or rank low-value sections—such as repetitive sidebars, shallow text blocks, or irrelevancies—leaving a concise textual “core.”

---

## 1. How “Fit Markdown” Works

### 1.1 The `content_filter`

In **`CrawlerRunConfig`**, you can specify a **`content_filter`** to shape how content is pruned or ranked before final markdown generation. A filter’s logic is applied **before** or **during** the HTML→Markdown process, producing:

- **`result.markdown_v2.raw_markdown`** (unfiltered)
- **`result.markdown_v2.fit_markdown`** (filtered or “fit” version)
- **`result.markdown_v2.fit_html`** (the corresponding HTML snippet that produced `fit_markdown`)

> **Note**: We’re currently storing the result in `markdown_v2`, but eventually we’ll unify it as `result.markdown`.

### 1.2 Common Filters

1. **PruningContentFilter** – Scores each node by text density, link density, and tag importance, discarding those below a threshold.
2. **BM25ContentFilter** – Focuses on textual relevance using BM25 ranking, especially useful if you have a specific user query (e.g., “machine learning” or “food nutrition”).

---
## 2. PruningContentFilter

**Pruning** discards less relevant nodes based on **text density, link density, and tag importance**. It’s a heuristic-based approach—if certain sections appear too “thin” or too “spammy,” they’re pruned.

### 2.1 Usage Example

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.content_filter_strategy import PruningContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

async def main():
    # Step 1: Create a pruning filter
    prune_filter = PruningContentFilter(
        # Lower → more content retained, higher → more content pruned
        threshold=0.45,
        # "fixed" or "dynamic"
        threshold_type="dynamic",
        # Ignore nodes with <5 words
        min_word_threshold=5
    )

    # Step 2: Insert it into a Markdown Generator
    md_generator = DefaultMarkdownGenerator(content_filter=prune_filter)

    # Step 3: Pass it to CrawlerRunConfig
    config = CrawlerRunConfig(
        markdown_generator=md_generator
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://news.ycombinator.com",
            config=config
        )

        if result.success:
            # 'fit_markdown' is your pruned content, focusing on "denser" text
            print("Raw Markdown length:", len(result.markdown_v2.raw_markdown))
            print("Fit Markdown length:", len(result.markdown_v2.fit_markdown))
        else:
            print("Error:", result.error_message)

if __name__ == "__main__":
    asyncio.run(main())
```
### 2.2 Key Parameters

- **`min_word_threshold`** (int): If a block has fewer words than this, it’s pruned.
- **`threshold_type`** (str):
  - `"fixed"` → each node must exceed `threshold` (0–1).
  - `"dynamic"` → node scoring adjusts according to tag type, text/link density, etc.
- **`threshold`** (float, default ~0.48): The base or “anchor” cutoff.

**Algorithmic Factors**:

- **Text density** – Encourages blocks that have a higher ratio of text to overall content.
- **Link density** – Penalizes sections that are mostly links.
- **Tag importance** – e.g., an `<article>` or `<p>` might be more important than a `<div>`.
- **Structural context** – If a node is deeply nested or in a suspected sidebar, it might be deprioritized.
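To build intuition for the first two factors, here is a deliberately naive, self-contained sketch of text density and link density (an illustration only, not Crawl4AI’s actual scoring):

```python
import re

def densities(block_html: str):
    # Naive illustration: text density = visible-text chars / total HTML chars;
    # link density = chars of text inside <a> tags / visible-text chars.
    text = re.sub(r"<[^>]+>", "", block_html)
    link_text = "".join(re.findall(r"<a[^>]*>(.*?)</a>", block_html, re.S))
    text_density = len(text) / max(len(block_html), 1)
    link_density = len(link_text) / max(len(text), 1)
    return text_density, link_density

article = "<p>A long paragraph of meaningful article text goes here.</p>"
sidebar = "<li><a href='/a'>Home</a></li><li><a href='/b'>About</a></li>"

print(densities(article))  # high text density, link density 0.0
print(densities(sidebar))  # low text density, link density 1.0
```

A block like the sidebar above, whose visible text is almost entirely link text, is exactly the kind of node a pruning filter discards first.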

---

## 3. BM25ContentFilter

**BM25** is a classical text ranking algorithm often used in search engines. If you have a **user query** or rely on page metadata to derive a query, BM25 can identify which text chunks best match that query.

### 3.1 Usage Example

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.content_filter_strategy import BM25ContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

async def main():
    # 1) A BM25 filter with a user query
    bm25_filter = BM25ContentFilter(
        user_query="startup fundraising tips",
        # Adjust for stricter or looser results
        bm25_threshold=1.2
    )

    # 2) Insert into a Markdown Generator
    md_generator = DefaultMarkdownGenerator(content_filter=bm25_filter)

    # 3) Pass to crawler config
    config = CrawlerRunConfig(
        markdown_generator=md_generator
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://news.ycombinator.com",
            config=config
        )
        if result.success:
            print("Fit Markdown (BM25 query-based):")
            print(result.markdown_v2.fit_markdown)
        else:
            print("Error:", result.error_message)

if __name__ == "__main__":
    asyncio.run(main())
```

### 3.2 Parameters

- **`user_query`** (str, optional): E.g. `"machine learning"`. If blank, the filter tries to glean a query from page metadata.
- **`bm25_threshold`** (float, default 1.0):
  - Higher → fewer chunks but more relevant.
  - Lower → more inclusive.

> In more advanced scenarios, you might see parameters like `use_stemming`, `case_sensitive`, or `priority_tags` to refine how text is tokenized or weighted.
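For intuition about what `bm25_threshold` cuts on, here is a simplified, self-contained BM25 scorer over toy token lists (an illustration only, not Crawl4AI’s implementation, which operates on text chunks extracted from the page):

```python
import math

def bm25_score(query_terms, doc, corpus, k1=1.5, b=0.75):
    """Score one tokenized document against query terms (simplified BM25)."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)           # document frequency
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)    # rarity weight
        tf = doc.count(term)                               # term frequency
        score += idf * (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * len(doc) / avgdl))
    return score

corpus = [
    "startup fundraising tips for founders".split(),
    "recipe for banana bread".split(),
]
scores = [bm25_score("startup fundraising".split(), doc, corpus) for doc in corpus]
print(scores)  # the fundraising doc scores higher; the recipe scores 0.0
```

A chunk whose score falls below the threshold is the kind of block that gets dropped from `fit_markdown`.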

---

## 4. Accessing the “Fit” Output

After the crawl, your “fit” content is found in **`result.markdown_v2.fit_markdown`**. In future versions, it will be **`result.markdown.fit_markdown`**. Meanwhile:

```python
fit_md = result.markdown_v2.fit_markdown
fit_html = result.markdown_v2.fit_html
```

If the content filter is **BM25**, you might see additional logic or references in `fit_markdown` that highlight relevant segments. If it’s **Pruning**, the text is typically well-cleaned but not necessarily matched to a query.

---
## 5. Code Patterns Recap
|
||||
|
||||
### 5.1 Pruning
|
||||
|
||||
```python
|
||||
prune_filter = PruningContentFilter(
|
||||
threshold=0.5,
|
||||
threshold_type="fixed",
|
||||
min_word_threshold=10
|
||||
)
|
||||
md_generator = DefaultMarkdownGenerator(content_filter=prune_filter)
|
||||
config = CrawlerRunConfig(markdown_generator=md_generator)
|
||||
# => result.markdown_v2.fit_markdown
|
||||
```
|
||||
|
||||
### 5.2 BM25
|
||||
|
||||
```python
|
||||
bm25_filter = BM25ContentFilter(
|
||||
user_query="health benefits fruit",
|
||||
bm25_threshold=1.2
|
||||
)
|
||||
md_generator = DefaultMarkdownGenerator(content_filter=bm25_filter)
|
||||
config = CrawlerRunConfig(markdown_generator=md_generator)
|
||||
# => result.markdown_v2.fit_markdown
|
||||
```
---

## 6. Combining with “word_count_threshold” & Exclusions

Remember you can also specify:

```python
config = CrawlerRunConfig(
    word_count_threshold=10,
    excluded_tags=["nav", "footer", "header"],
    exclude_external_links=True,
    markdown_generator=DefaultMarkdownGenerator(
        content_filter=PruningContentFilter(threshold=0.5)
    )
)
```
Thus, **multi-level** filtering occurs:

1. The crawler’s `excluded_tags` are removed from the HTML first.
2. The content filter (Pruning, BM25, or custom) prunes or ranks the remaining text blocks.
3. The final “fit” content is generated in `result.markdown_v2.fit_markdown`.

---
## 7. Custom Filters

If you need a different approach (like a specialized ML model or site-specific heuristics), you can create a new class inheriting from `RelevantContentFilter` and implement `filter_content(html)`. Then inject it into your **markdown generator**:

```python
from crawl4ai.content_filter_strategy import RelevantContentFilter

class MyCustomFilter(RelevantContentFilter):
    def filter_content(self, html, min_word_threshold=None):
        # Parse the HTML, apply your own relevance logic, and return
        # the list of text blocks you want to keep, e.g.:
        #   return [block for block in blocks if some_condition(block)]
        ...
```

**Steps**:

1. Subclass `RelevantContentFilter`.
2. Implement `filter_content(...)`.
3. Use it in your `DefaultMarkdownGenerator(content_filter=MyCustomFilter(...))`.

---
## 8. Final Thoughts

**Fit Markdown** is a crucial feature for:

- **Summaries**: Quickly get the important text from a cluttered page.
- **Search**: Combine with **BM25** to produce content relevant to a query.
- **AI Pipelines**: Filter out boilerplate so LLM-based extraction or summarization runs on denser text.

**Key Points**:

- **PruningContentFilter**: Great if you just want the “meatiest” text without a user query.
- **BM25ContentFilter**: Perfect for query-based extraction or searching.
- Combine with **`excluded_tags`**, **`exclude_external_links`**, and **`word_count_threshold`** to refine your final “fit” text.
- Fit markdown ends up in **`result.markdown_v2.fit_markdown`**; it will eventually move to **`result.markdown.fit_markdown`** in future versions.

With these tools, you can **zero in** on the text that truly matters, ignoring spammy or boilerplate content, and produce a concise, relevant “fit markdown” for your AI or data pipelines. Happy pruning and searching!

- Last Updated: 2025-01-01
129
docs/md_v2/core/installation.md
Normal file
@@ -0,0 +1,129 @@
# Installation & Setup (2023 Edition)

## 1. Basic Installation

```bash
pip install crawl4ai
```

This installs the **core** Crawl4AI library along with essential dependencies. **No** advanced features (like transformers or PyTorch) are included yet.

## 2. Initial Setup & Diagnostics

### 2.1 Run the Setup Command

After installing, call:

```bash
crawl4ai-setup
```

**What does it do?**

- Installs or updates required Playwright browsers (Chromium, Firefox, etc.)
- Performs OS-level checks (e.g., missing libs on Linux)
- Confirms your environment is ready to crawl
### 2.2 Diagnostics

Optionally, you can run **diagnostics** to confirm everything is functioning:

```bash
crawl4ai-doctor
```

This command attempts to:

- Check Python version compatibility
- Verify Playwright installation
- Inspect environment variables or library conflicts

If any issues arise, follow its suggestions (e.g., installing additional system packages) and re-run `crawl4ai-setup`.

---
## 3. Verifying Installation: A Simple Crawl (Skip this step if you already ran `crawl4ai-doctor`)

Below is a minimal Python script demonstrating a **basic** crawl. It uses the new **`BrowserConfig`** and **`CrawlerRunConfig`** classes for clarity, though no custom settings are passed in this example:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://www.example.com",
        )
        print(result.markdown[:300])  # Show the first 300 characters of extracted text

if __name__ == "__main__":
    asyncio.run(main())
```

**Expected** outcome:

- A headless browser session loads `example.com`
- Crawl4AI returns ~300 characters of markdown

If errors occur, rerun `crawl4ai-doctor` or manually ensure Playwright is installed correctly.

---
## 4. Advanced Installation (Optional)

**Warning**: Only install these **if you truly need them**. They bring in larger dependencies, including big models, which can increase disk usage and memory load significantly.

### 4.1 Torch, Transformers, or All

- **Text Clustering (Torch)**

  ```bash
  pip install crawl4ai[torch]
  crawl4ai-setup
  ```

  Installs PyTorch-based features (e.g., cosine similarity or advanced semantic chunking).

- **Transformers**

  ```bash
  pip install crawl4ai[transformer]
  crawl4ai-setup
  ```

  Adds Hugging Face-based summarization or generation strategies.

- **All Features**

  ```bash
  pip install crawl4ai[all]
  crawl4ai-setup
  ```

#### (Optional) Pre-Fetching Models

```bash
crawl4ai-download-models
```

This step caches large models locally (if needed). **Only do this** if your workflow requires them.

---
## 5. Docker (Experimental)

We provide a **temporary** Docker approach for testing. **It’s not stable and may break** with future releases. We plan a major Docker revamp for a stable release in 2025 Q1. If you still want to try:

```bash
docker pull unclecode/crawl4ai:basic
docker run -p 11235:11235 unclecode/crawl4ai:basic
```

You can then make POST requests to `http://localhost:11235/crawl` to perform crawls. **Production usage** is discouraged until the new Docker approach is ready (planned for January or February 2025).
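As a sketch of such a request: the endpoint path comes from the docs above, but the payload field name (`urls`) is an assumption rather than a documented schema, so check the Docker image’s README for the actual contract before relying on it.

```python
import json
from urllib import request

# Hypothetical payload -- the "urls" field name is an assumption, not a
# documented schema; consult the Docker image's README for the real contract.
payload = {"urls": "https://www.example.com"}

req = request.Request(
    "http://localhost:11235/crawl",            # endpoint from the docs above
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)

# Sending it requires the container started with `docker run -p 11235:11235 ...`:
# with request.urlopen(req) as resp:
#     print(json.loads(resp.read()))
print(req.full_url, req.get_method())  # → http://localhost:11235/crawl POST
```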
---

## 6. Local Server Mode (Legacy)

Some older docs mention running Crawl4AI as a local server. This approach has been **partially replaced** by the new Docker-based prototype and upcoming stable server release. You can experiment, but expect major changes. Official local server instructions will arrive once the new Docker architecture is finalized.

---

## Summary

1. **Install** with `pip install crawl4ai` and run `crawl4ai-setup`.
2. **Diagnose** with `crawl4ai-doctor` if you see errors.
3. **Verify** by crawling `example.com` with minimal `BrowserConfig` + `CrawlerRunConfig`.
4. **Advanced** features (Torch, Transformers) are **optional**—avoid them if you don’t need them (they significantly increase resource usage).
5. **Docker** is **experimental**—use at your own risk until the stable version is released.
6. **Local server** references in older docs are largely deprecated; a new solution is in progress.

**Got questions?** Check [GitHub issues](https://github.com/unclecode/crawl4ai/issues) for updates or ask the community!
276
docs/md_v2/core/link-media.md
Normal file
@@ -0,0 +1,276 @@
# Link & Media

In this tutorial, you’ll learn how to:

1. Extract links (internal, external) from crawled pages
2. Filter or exclude specific domains (e.g., social media or custom domains)
3. Access and manage media data (especially images) in the crawl result
4. Configure your crawler to exclude or prioritize certain images

> **Prerequisites**
> - You have completed or are familiar with the [AsyncWebCrawler Basics](../core/simple-crawling.md) tutorial.
> - You can run Crawl4AI in your environment (Playwright, Python, etc.).

---
## 1. Link Extraction

### 1.1 `result.links`

When you call `arun()` or `arun_many()` on a URL, Crawl4AI automatically extracts links and stores them in the `links` field of `CrawlResult`. By default, the crawler tries to distinguish **internal** links (same domain) from **external** links (different domains).

**Basic Example**:

```python
from crawl4ai import AsyncWebCrawler

async with AsyncWebCrawler() as crawler:
    result = await crawler.arun("https://www.example.com")
    if result.success:
        internal_links = result.links.get("internal", [])
        external_links = result.links.get("external", [])
        print(f"Found {len(internal_links)} internal links.")
        print(f"Found {len(external_links)} external links.")
        print(f"Found {len(result.media)} media items.")

        # Each link is typically a dictionary with fields like:
        # { "href": "...", "text": "...", "title": "...", "base_domain": "..." }
        if internal_links:
            print("Sample Internal Link:", internal_links[0])
    else:
        print("Crawl failed:", result.error_message)
```
**Structure Example**:

```python
result.links = {
    "internal": [
        {
            "href": "https://kidocode.com/",
            "text": "",
            "title": "",
            "base_domain": "kidocode.com"
        },
        {
            "href": "https://kidocode.com/degrees/technology",
            "text": "Technology Degree",
            "title": "KidoCode Tech Program",
            "base_domain": "kidocode.com"
        },
        # ...
    ],
    "external": [
        # possibly other links leading to third-party sites
    ]
}
```

- **`href`**: The raw hyperlink URL.
- **`text`**: The link text (if any) within the `<a>` tag.
- **`title`**: The `title` attribute of the link (if present).
- **`base_domain`**: The domain extracted from `href`. Helpful for filtering or grouping by domain.
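Since each link dictionary carries a `base_domain`, a simple post-processing pass can group or count links per domain. A minimal sketch (the sample data mirrors the structure shown above):

```python
from collections import Counter

def count_links_by_domain(links):
    # Tally how many links point at each base_domain
    return Counter(link.get("base_domain", "") for link in links)

links = [
    {"href": "https://kidocode.com/", "base_domain": "kidocode.com"},
    {"href": "https://kidocode.com/degrees/technology", "base_domain": "kidocode.com"},
    {"href": "https://example.org/about", "base_domain": "example.org"},
]
print(count_links_by_domain(links))  # → Counter({'kidocode.com': 2, 'example.org': 1})
```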
---

## 2. Domain Filtering

Some websites contain hundreds of third-party or affiliate links. You can filter out certain domains at **crawl time** by configuring the crawler. The most relevant parameters in `CrawlerRunConfig` are:

- **`exclude_external_links`**: If `True`, discard any link pointing outside the root domain.
- **`exclude_social_media_domains`**: Provide a list of social media platforms (e.g., `["facebook.com", "twitter.com"]`) to exclude from your crawl.
- **`exclude_social_media_links`**: If `True`, automatically skip known social platforms.
- **`exclude_domains`**: Provide a list of custom domains you want to exclude (e.g., `["spammyads.com", "tracker.net"]`).

### 2.1 Example: Excluding External & Social Media Links
```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig

async def main():
    crawler_cfg = CrawlerRunConfig(
        exclude_external_links=True,     # No links outside primary domain
        exclude_social_media_links=True  # Skip recognized social media domains
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            "https://www.example.com",
            config=crawler_cfg
        )
        if result.success:
            print("[OK] Crawled:", result.url)
            print("Internal links count:", len(result.links.get("internal", [])))
            print("External links count:", len(result.links.get("external", [])))
            # Likely zero external links in this scenario
        else:
            print("[ERROR]", result.error_message)

if __name__ == "__main__":
    asyncio.run(main())
```
### 2.2 Example: Excluding Specific Domains

If you want to allow external links in general but specifically exclude a domain (e.g., `suspiciousads.com`), do this:

```python
crawler_cfg = CrawlerRunConfig(
    exclude_domains=["suspiciousads.com"]
)
```

This approach is handy when you still want external links but need to block certain sites you consider spammy.

---
## 3. Media Extraction

### 3.1 Accessing `result.media`

By default, Crawl4AI collects images, audio, and video URLs it finds on the page. These are stored in `result.media`, a dictionary keyed by media type (e.g., `images`, `videos`, `audio`).

**Basic Example**:

```python
if result.success:
    images_info = result.media.get("images", [])
    print(f"Found {len(images_info)} images in total.")
    for i, img in enumerate(images_info[:5]):  # Inspect just the first 5
        print(f"[Image {i}] URL: {img['src']}")
        print(f"           Alt text: {img.get('alt', '')}")
        print(f"           Score: {img.get('score')}")
        print(f"           Description: {img.get('desc', '')}\n")
```
**Structure Example**:

```python
result.media = {
    "images": [
        {
            "src": "https://cdn.prod.website-files.com/.../Group%2089.svg",
            "alt": "coding school for kids",
            "desc": "Trial Class Degrees degrees All Degrees AI Degree Technology ...",
            "score": 3,
            "type": "image",
            "group_id": 0,
            "format": None,
            "width": None,
            "height": None
        },
        # ...
    ],
    "videos": [
        # Similar structure but with video-specific fields
    ],
    "audio": [
        # Similar structure but with audio-specific fields
    ]
}
```

Depending on your Crawl4AI version or scraping strategy, these dictionaries can include fields like:

- **`src`**: The media URL (e.g., image source)
- **`alt`**: The alt text for images (if present)
- **`desc`**: A snippet of nearby text or a short description (optional)
- **`score`**: A heuristic relevance score if you’re using content-scoring features
- **`width`**, **`height`**: If the crawler detects dimensions for the image/video
- **`type`**: Usually `"image"`, `"video"`, or `"audio"`
- **`group_id`**: If you’re grouping related media items, the crawler might assign an ID

With these details, you can easily filter out or focus on certain images (for instance, ignoring images with very low scores or from a different domain), or gather metadata for analytics.
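A sketch of such post-filtering (the sample items mirror the structure above; the cutoff value is arbitrary, since the score is only a heuristic):

```python
def keep_relevant_images(images, min_score=3):
    # Keep only images whose heuristic score meets the cutoff;
    # images with a missing or null score are treated as score 0.
    return [img for img in images if (img.get("score") or 0) >= min_score]

sample = [
    {"src": "https://example.com/logo.svg", "score": 3},
    {"src": "https://example.com/tracker.gif", "score": 0},
    {"src": "https://example.com/hero.png"},  # no score at all
]
print([img["src"] for img in keep_relevant_images(sample)])
# → ['https://example.com/logo.svg']
```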
### 3.2 Excluding External Images

If you’re dealing with heavy pages or want to skip third-party images (advertisements, for example), you can turn on:

```python
crawler_cfg = CrawlerRunConfig(
    exclude_external_images=True
)
```

This setting attempts to discard images from outside the primary domain, keeping only those from the site you’re crawling.

### 3.3 Additional Media Config

- **`screenshot`**: Set to `True` if you want a full-page screenshot stored as `base64` in `result.screenshot`.
- **`pdf`**: Set to `True` if you want a PDF version of the page in `result.pdf`.
- **`wait_for_images`**: If `True`, attempts to wait until images are fully loaded before final extraction.
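These flags slot into the same run config as the link and image exclusions. A minimal sketch wiring the three of them together (the flag names are exactly the ones listed above):

```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

media_cfg = CrawlerRunConfig(
    screenshot=True,       # base64 screenshot lands in result.screenshot
    pdf=True,              # PDF version lands in result.pdf
    wait_for_images=True,  # let lazy-loaded images finish before extraction
)

# Later, inside an `async with AsyncWebCrawler() as crawler:` block:
# result = await crawler.arun("https://www.example.com", config=media_cfg)
# if result.screenshot:
#     ...  # decode the base64 payload or save it to disk
```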
---

## 4. Putting It All Together: Link & Media Filtering

Here’s a combined example demonstrating how to filter out external links, skip certain domains, and exclude external images:
```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig

async def main():
    # Suppose we want to keep only internal links, remove certain domains,
    # and discard external images from the final crawl data.
    crawler_cfg = CrawlerRunConfig(
        exclude_external_links=True,
        exclude_domains=["spammyads.com"],
        exclude_social_media_links=True,  # skip Twitter, Facebook, etc.
        exclude_external_images=True,     # keep only images from main domain
        wait_for_images=True,             # ensure images are loaded
        verbose=True
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://www.example.com", config=crawler_cfg)

        if result.success:
            print("[OK] Crawled:", result.url)

            # 1. Links
            in_links = result.links.get("internal", [])
            ext_links = result.links.get("external", [])
            print("Internal link count:", len(in_links))
            print("External link count:", len(ext_links))  # should be zero with exclude_external_links=True

            # 2. Images
            images = result.media.get("images", [])
            print("Images found:", len(images))

            # Let's see a snippet of these images
            for i, img in enumerate(images[:3]):
                print(f"  - {img['src']} (alt={img.get('alt','')}, score={img.get('score','N/A')})")
        else:
            print("[ERROR] Failed to crawl. Reason:", result.error_message)

if __name__ == "__main__":
    asyncio.run(main())
```
---

## 5. Common Pitfalls & Tips

1. **Conflicting Flags**:
   - Combining `exclude_external_links=True` with `exclude_social_media_links=True` is typically fine, but note that the first setting already discards *all* external links, making the second somewhat redundant.
   - Want `exclude_external_images=True` but with exceptions for some external images? There is currently no partial, domain-based setting for images, so you might need a custom approach or hook logic.

2. **Relevancy Scores**:
   - If your version of Crawl4AI or your scraping strategy includes an `img["score"]`, it’s typically a heuristic based on size, position, or content analysis. Evaluate carefully if you rely on it.

3. **Performance**:
   - Excluding certain domains or external images can speed up your crawl, especially for large, media-heavy pages.
   - If you want a “full” link map, do *not* exclude them. Instead, you can post-filter in your own code.

4. **Social Media Lists**:
   - `exclude_social_media_links=True` typically references an internal list of known social domains like Facebook, Twitter, LinkedIn, etc. If you need to add or remove from that list, look for library settings or a local config file (depending on your version).

---

**That’s it for Link & Media Analysis!** You’re now equipped to filter out unwanted sites and zero in on the images and videos that matter for your project.
161
docs/md_v2/core/local-files.md
Normal file
@@ -0,0 +1,161 @@
# Prefix-Based Input Handling in Crawl4AI

This guide will walk you through using the Crawl4AI library to crawl web pages, local HTML files, and raw HTML strings. We'll demonstrate these capabilities using a Wikipedia page as an example.

## Crawling a Web URL

To crawl a live web page, provide the URL starting with `http://` or `https://`, using a `CrawlerRunConfig` object:

```python
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import CrawlerRunConfig

async def crawl_web():
    config = CrawlerRunConfig(bypass_cache=True)
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://en.wikipedia.org/wiki/apple",
            config=config
        )
        if result.success:
            print("Markdown Content:")
            print(result.markdown)
        else:
            print(f"Failed to crawl: {result.error_message}")

asyncio.run(crawl_web())
```
## Crawling a Local HTML File

To crawl a local HTML file, prefix the file path with `file://`.

```python
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import CrawlerRunConfig

async def crawl_local_file():
    local_file_path = "/path/to/apple.html"  # Replace with your file path
    file_url = f"file://{local_file_path}"
    config = CrawlerRunConfig(bypass_cache=True)

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=file_url, config=config)
        if result.success:
            print("Markdown Content from Local File:")
            print(result.markdown)
        else:
            print(f"Failed to crawl local file: {result.error_message}")

asyncio.run(crawl_local_file())
```
## Crawling Raw HTML Content

To crawl raw HTML content, prefix the HTML string with `raw:`.

```python
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import CrawlerRunConfig

async def crawl_raw_html():
    raw_html = "<html><body><h1>Hello, World!</h1></body></html>"
    raw_html_url = f"raw:{raw_html}"
    config = CrawlerRunConfig(bypass_cache=True)

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=raw_html_url, config=config)
        if result.success:
            print("Markdown Content from Raw HTML:")
            print(result.markdown)
        else:
            print(f"Failed to crawl raw HTML: {result.error_message}")

asyncio.run(crawl_raw_html())
```
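The three prefixes above follow a simple convention that you can mirror when routing inputs in your own code. A sketch of such a dispatcher (this helper is ours, not part of the library's API):

```python
def classify_input(url: str) -> str:
    # Mirror Crawl4AI's prefix convention for the unified `url` parameter
    if url.startswith(("http://", "https://")):
        return "web"
    if url.startswith("file://"):
        return "local file"
    if url.startswith("raw:"):
        return "raw html"
    return "unknown"

print(classify_input("https://en.wikipedia.org/wiki/apple"))  # → web
print(classify_input("file:///path/to/apple.html"))           # → local file
print(classify_input("raw:<html><body>Hi</body></html>"))     # → raw html
```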
---

# Complete Example

Below is a comprehensive script that:

1. Crawls the Wikipedia page for "Apple."
2. Saves the HTML content to a local file (`apple.html`).
3. Crawls the local HTML file and verifies the markdown length matches the original crawl.
4. Crawls the raw HTML content from the saved file and verifies consistency.
```python
import os
import asyncio
from pathlib import Path
from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import CrawlerRunConfig

async def main():
    wikipedia_url = "https://en.wikipedia.org/wiki/apple"
    script_dir = Path(__file__).parent
    html_file_path = script_dir / "apple.html"

    async with AsyncWebCrawler() as crawler:
        # Step 1: Crawl the Web URL
        print("\n=== Step 1: Crawling the Wikipedia URL ===")
        web_config = CrawlerRunConfig(bypass_cache=True)
        result = await crawler.arun(url=wikipedia_url, config=web_config)

        if not result.success:
            print(f"Failed to crawl {wikipedia_url}: {result.error_message}")
            return

        with open(html_file_path, 'w', encoding='utf-8') as f:
            f.write(result.html)
        web_crawl_length = len(result.markdown)
        print(f"Length of markdown from web crawl: {web_crawl_length}\n")

        # Step 2: Crawl from the Local HTML File
        print("=== Step 2: Crawling from the Local HTML File ===")
        file_url = f"file://{html_file_path.resolve()}"
        file_config = CrawlerRunConfig(bypass_cache=True)
        local_result = await crawler.arun(url=file_url, config=file_config)

        if not local_result.success:
            print(f"Failed to crawl local file {file_url}: {local_result.error_message}")
            return

        local_crawl_length = len(local_result.markdown)
        assert web_crawl_length == local_crawl_length, "Markdown length mismatch"
        print("✅ Markdown length matches between web and local file crawl.\n")

        # Step 3: Crawl Using Raw HTML Content
        print("=== Step 3: Crawling Using Raw HTML Content ===")
        with open(html_file_path, 'r', encoding='utf-8') as f:
            raw_html_content = f.read()
        raw_html_url = f"raw:{raw_html_content}"
        raw_config = CrawlerRunConfig(bypass_cache=True)
        raw_result = await crawler.arun(url=raw_html_url, config=raw_config)

        if not raw_result.success:
            print(f"Failed to crawl raw HTML content: {raw_result.error_message}")
            return

        raw_crawl_length = len(raw_result.markdown)
        assert web_crawl_length == raw_crawl_length, "Markdown length mismatch"
        print("✅ Markdown length matches between web and raw HTML crawl.\n")

        print("All tests passed successfully!")
        if html_file_path.exists():
            os.remove(html_file_path)

if __name__ == "__main__":
    asyncio.run(main())
```
---

# Conclusion

With the unified `url` parameter and prefix-based handling in **Crawl4AI**, you can seamlessly handle web URLs, local HTML files, and raw HTML content. Use `CrawlerRunConfig` for flexible and consistent configuration in all scenarios.
369
docs/md_v2/core/markdown-generation.md
Normal file
@@ -0,0 +1,369 @@
# Markdown Generation Basics

One of Crawl4AI’s core features is generating **clean, structured markdown** from web pages. Originally built to solve the problem of extracting only the “actual” content and discarding boilerplate or noise, Crawl4AI’s markdown system remains one of its biggest draws for AI workflows.

In this tutorial, you’ll learn:

1. How to configure the **Default Markdown Generator**
2. How **content filters** (BM25 or Pruning) help you refine markdown and discard junk
3. The difference between raw markdown (`result.markdown`) and filtered markdown (`fit_markdown`)

> **Prerequisites**
> - You’ve completed or read [AsyncWebCrawler Basics](../core/simple-crawling.md) to understand how to run a simple crawl.
> - You know how to configure `CrawlerRunConfig`.

---
## 1. Quick Example

Here’s a minimal code snippet that uses the **DefaultMarkdownGenerator** with no additional filtering:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

async def main():
    config = CrawlerRunConfig(
        markdown_generator=DefaultMarkdownGenerator()
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com", config=config)

        if result.success:
            print("Raw Markdown Output:\n")
            print(result.markdown)  # The unfiltered markdown from the page
        else:
            print("Crawl failed:", result.error_message)

if __name__ == "__main__":
    asyncio.run(main())
```
**What’s happening?**

- `CrawlerRunConfig(markdown_generator=DefaultMarkdownGenerator())` instructs Crawl4AI to convert the final HTML into markdown at the end of each crawl.
- The resulting markdown is accessible via `result.markdown`.

---

## 2. How Markdown Generation Works

### 2.1 HTML-to-Text Conversion (Forked & Modified)

Under the hood, **DefaultMarkdownGenerator** uses a specialized HTML-to-text approach that:

- Preserves headings, code blocks, bullet points, etc.
- Removes extraneous tags (scripts, styles) that don’t add meaningful content.
- Can optionally generate references for links or skip them altogether.

A set of **options** (passed as a dict) allows you to customize precisely how HTML converts to markdown. These map to standard html2text-like configuration plus your own enhancements (e.g., ignoring internal links, preserving certain tags verbatim, or adjusting line widths).

### 2.2 Link Citations & References

By default, the generator can convert `<a href="...">` elements into `[text][1]` citations, then place the actual links at the bottom of the document. This is handy for research workflows that demand references in a structured manner.

### 2.3 Optional Content Filters

Before or after the HTML-to-Markdown step, you can apply a **content filter** (like BM25 or Pruning) to reduce noise and produce a “fit_markdown”—a heavily pruned version focusing on the page’s main text. We’ll cover these filters shortly.

---
## 3. Configuring the Default Markdown Generator

You can tweak the output by passing an `options` dict to `DefaultMarkdownGenerator`. For example:

```python
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    # Example: ignore all links, don't escape HTML, and wrap text at 80 characters
    md_generator = DefaultMarkdownGenerator(
        options={
            "ignore_links": True,
            "escape_html": False,
            "body_width": 80
        }
    )

    config = CrawlerRunConfig(
        markdown_generator=md_generator
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com/docs", config=config)
        if result.success:
            print("Markdown:\n", result.markdown[:500])  # Just a snippet
        else:
            print("Crawl failed:", result.error_message)

if __name__ == "__main__":
    import asyncio
    asyncio.run(main())
```
Some commonly used `options`:

- **`ignore_links`** (bool): Whether to remove all hyperlinks in the final markdown.
- **`ignore_images`** (bool): Remove all `![image]()` references.
- **`escape_html`** (bool): Turn HTML entities into text (default is often `True`).
- **`body_width`** (int): Wrap text at N characters. `0` or `None` means no wrapping.
- **`skip_internal_links`** (bool): If `True`, omit `#localAnchors` or internal links referencing the same page.
- **`include_sup_sub`** (bool): Attempt to handle `<sup>` / `<sub>` in a more readable way.

---
## 4. Content Filters

**Content filters** selectively remove or rank sections of text before turning them into markdown. This is especially helpful if your page has ads, nav bars, or other clutter you don’t want.

### 4.1 BM25ContentFilter

If you have a **search query**, BM25 is a good choice:

```python
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
from crawl4ai.content_filter_strategy import BM25ContentFilter
from crawl4ai import CrawlerRunConfig

bm25_filter = BM25ContentFilter(
    user_query="machine learning",
    bm25_threshold=1.2,
    use_stemming=True
)

md_generator = DefaultMarkdownGenerator(
    content_filter=bm25_filter,
    options={"ignore_links": True}
)

config = CrawlerRunConfig(markdown_generator=md_generator)
```

- **`user_query`**: The term you want to focus on. BM25 tries to keep only content blocks relevant to that query.
- **`bm25_threshold`**: Raise it to keep fewer blocks; lower it to keep more.
- **`use_stemming`**: If `True`, variations of words match (e.g., “learn,” “learning,” “learnt”).
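
To build intuition for how a BM25 threshold keeps or drops blocks, here is a minimal, self-contained scoring sketch. This is **not** Crawl4AI’s implementation (the real filter also handles HTML parsing and stemming); it just shows the idea of scoring each block against a query and keeping those above a threshold:

```python
import math

def bm25_scores(blocks, query, k1=1.5, b=0.75):
    """Score each text block against the query with classic BM25."""
    docs = [block.lower().split() for block in blocks]
    avg_len = sum(len(d) for d in docs) / len(docs)
    n = len(docs)
    scores = []
    for doc in docs:
        score = 0.0
        for term in query.lower().split():
            df = sum(1 for d in docs if term in d)           # document frequency
            idf = math.log((n - df + 0.5) / (df + 0.5) + 1)  # smoothed idf
            tf = doc.count(term)
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avg_len))
        scores.append(score)
    return scores

blocks = [
    "machine learning models need training data",
    "site navigation home about contact",          # boilerplate: no query terms
    "deep learning is a branch of machine learning",
]
scores = bm25_scores(blocks, "machine learning")
kept = [blk for blk, s in zip(blocks, scores) if s >= 0.5]  # threshold drops the nav block
```

The navigation block contains no query terms, so it scores zero and falls below any positive threshold, which is exactly the filtering effect `bm25_threshold` controls.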

**No query provided?** BM25 tries to infer a context from page metadata; with nothing to go on, it simply discards text with a low generic score. Realistically, you should supply a query for best results.

### 4.2 PruningContentFilter

If you **don’t** have a specific query, or if you just want a robust “junk remover,” use `PruningContentFilter`. It analyzes text density, link density, HTML structure, and known patterns (like “nav,” “footer”) to systematically prune extraneous or repetitive sections.

```python
from crawl4ai.content_filter_strategy import PruningContentFilter

prune_filter = PruningContentFilter(
    threshold=0.5,
    threshold_type="fixed",  # or "dynamic"
    min_word_threshold=50
)
```

- **`threshold`**: Score boundary. Blocks below this score get removed.
- **`threshold_type`**:
  - `"fixed"`: Straight comparison (`score >= threshold` keeps the block).
  - `"dynamic"`: The filter adjusts the threshold in a data-driven manner.
- **`min_word_threshold`**: Discard blocks under N words as likely too short or unhelpful.

**When to Use PruningContentFilter**

- You want a broad cleanup without a user query.
- The page has lots of repeated sidebars, footers, or disclaimers that hamper text extraction.
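
To illustrate the idea behind pruning (a toy heuristic, not the library’s actual algorithm), here is a sketch that scores blocks by length and link density, mirroring how `threshold` and `min_word_threshold` interact:

```python
def prune_blocks(blocks, threshold=0.5, min_words=5):
    """Toy pruning: keep blocks that are long enough and not link-dominated.

    Each block is a (text, link_word_count) pair; the score is the share of
    words that are NOT part of links, so pure navigation scores near zero.
    """
    kept = []
    for text, link_words in blocks:
        words = len(text.split())
        if words < min_words:
            continue  # too short to be useful content
        score = 1 - link_words / words  # low score => mostly navigation links
        if score >= threshold:
            kept.append(text)
    return kept

blocks = [
    ("Home About Contact Login Register", 5),                          # pure nav
    ("This long article paragraph explains the topic in detail", 0),   # real content
    ("Read more", 0),                                                  # too short
]
print(prune_blocks(blocks))
```

Only the article paragraph survives: the nav block scores 0.0 and the “Read more” snippet fails the word-count floor, which is the same two-pronged logic the real filter applies at a much finer granularity.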

---

## 5. Using Fit Markdown

When a content filter is active, the library produces two forms of markdown inside `result.markdown_v2` or (if using the simplified field) `result.markdown`:

1. **`raw_markdown`**: The full unfiltered markdown.
2. **`fit_markdown`**: A “fit” version where the filter has removed or trimmed noisy segments.

**Note**:
> In earlier examples, you may see references to `result.markdown_v2`. Depending on your library version, you might access `result.markdown`, `result.markdown_v2`, or an object named `MarkdownGenerationResult`. The idea is the same: you’ll have a raw version and a filtered (“fit”) version if a filter is used.

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
from crawl4ai.content_filter_strategy import PruningContentFilter

async def main():
    config = CrawlerRunConfig(
        markdown_generator=DefaultMarkdownGenerator(
            content_filter=PruningContentFilter(threshold=0.6),
            options={"ignore_links": True}
        )
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://news.example.com/tech", config=config)
        if result.success:
            print("Raw markdown:\n", result.markdown)

            # If a filter is used, we also have .fit_markdown:
            md_object = result.markdown_v2  # or your equivalent
            print("Filtered markdown:\n", md_object.fit_markdown)
        else:
            print("Crawl failed:", result.error_message)

if __name__ == "__main__":
    asyncio.run(main())
```

---

## 6. The `MarkdownGenerationResult` Object

If your library stores detailed markdown output in an object like `MarkdownGenerationResult`, you’ll see fields such as:

- **`raw_markdown`**: The direct HTML-to-markdown transformation (no filtering).
- **`markdown_with_citations`**: A version that moves links to reference-style footnotes.
- **`references_markdown`**: A separate string or section containing the gathered references.
- **`fit_markdown`**: The filtered markdown if you used a content filter.
- **`fit_html`**: The corresponding HTML snippet used to generate `fit_markdown` (helpful for debugging or advanced usage).

**Example**:

```python
md_obj = result.markdown_v2  # your library’s naming may vary
print("RAW:\n", md_obj.raw_markdown)
print("CITED:\n", md_obj.markdown_with_citations)
print("REFERENCES:\n", md_obj.references_markdown)
print("FIT:\n", md_obj.fit_markdown)
```

**Why Does This Matter?**

- You can supply `raw_markdown` to an LLM if you want the entire text.
- Or feed `fit_markdown` into a vector database to reduce token usage.
- `references_markdown` can help you keep track of link provenance.
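
As a rough illustration of what a citations pass does, here is a sketch that rewrites inline markdown links into reference-style footnotes. The library’s own output format may differ in detail; this just shows the transformation behind `markdown_with_citations` and `references_markdown`:

```python
import re

def to_citations(markdown: str):
    """Rewrite [text](url) links as [text][n] plus a numbered reference list."""
    refs = []

    def repl(match):
        text, url = match.group(1), match.group(2)
        refs.append(url)
        return f"[{text}][{len(refs)}]"

    body = re.sub(r"\[([^\]]+)\]\(([^)]+)\)", repl, markdown)
    references = "\n".join(f"[{i}]: {url}" for i, url in enumerate(refs, 1))
    return body, references

body, references = to_citations(
    "See [docs](https://example.com/docs) and [blog](https://example.com/blog)."
)
print(body)
print(references)
```

The body keeps its readability while every URL is collected once at the end, which is what makes the cited form friendlier for LLM prompts and provenance tracking.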

---

## 7. Combining Filters (BM25 + Pruning) in Two Passes

You might want to **prune out** noisy boilerplate first (with `PruningContentFilter`), and then **rank what’s left** against a user query (with `BM25ContentFilter`). You don’t have to crawl the page twice. Instead:

1. **First pass**: Apply `PruningContentFilter` directly to the raw HTML from `result.html` (the crawler’s downloaded HTML).
2. **Second pass**: Take the pruned HTML (or text) from step 1, and feed it into `BM25ContentFilter`, focusing on a user query.

### Two-Pass Example

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.content_filter_strategy import PruningContentFilter, BM25ContentFilter

async def main():
    # 1. Crawl with minimal or no markdown generator, just get raw HTML
    config = CrawlerRunConfig(
        # If you only want raw HTML, you can skip passing a markdown_generator
        # or provide one but focus on .html in this example
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com/tech-article", config=config)

        if not result.success or not result.html:
            print("Crawl failed or no HTML content.")
            return

        raw_html = result.html

        # 2. First pass: PruningContentFilter on raw HTML
        pruning_filter = PruningContentFilter(threshold=0.5, min_word_threshold=50)

        # filter_content returns a list of "text chunks" or cleaned HTML sections
        pruned_chunks = pruning_filter.filter_content(raw_html)
        # This list is basically pruned content blocks, in HTML or text form

        # For demonstration, let's combine these chunks back into a single
        # HTML-like string. You could also do further processing; it's up to
        # your pipeline design.
        pruned_html = "\n".join(pruned_chunks)

        # 3. Second pass: BM25ContentFilter with a user query
        bm25_filter = BM25ContentFilter(
            user_query="machine learning",
            bm25_threshold=1.2,
            language="english"
        )

        # Returns a list of text chunks
        bm25_chunks = bm25_filter.filter_content(pruned_html)

        if not bm25_chunks:
            print("Nothing matched the BM25 query after pruning.")
            return

        # 4. Combine or display final results
        final_text = "\n---\n".join(bm25_chunks)

        print("==== PRUNED OUTPUT (first pass) ====")
        print(pruned_html[:500], "... (truncated)")  # preview

        print("\n==== BM25 OUTPUT (second pass) ====")
        print(final_text[:500], "... (truncated)")

if __name__ == "__main__":
    asyncio.run(main())
```

### What’s Happening?

1. **Raw HTML**: We crawl once and store the raw HTML in `result.html`.
2. **PruningContentFilter**: Takes HTML plus optional parameters. It extracts blocks of text or partial HTML, removing headings/sections deemed “noise,” and returns a **list of text chunks**.
3. **Combine or Transform**: We join these pruned chunks back into a single HTML-like string. (Alternatively, you could store them in a list for further logic—whatever suits your pipeline.)
4. **BM25ContentFilter**: We feed the pruned string into `BM25ContentFilter` with a user query. This second pass further narrows the content to chunks relevant to “machine learning.”

**No Re-Crawling**: We used `raw_html` from the first pass, so there’s no need to run `arun()` again—**no second network request**.

### Tips & Variations

- **Plain Text vs. HTML**: If your pruned output is mostly text, BM25 can still handle it; just keep in mind it expects a valid string input. If you supply partial HTML (like `"<p>some text</p>"`), it will parse it as HTML.
- **Chaining in a Single Pipeline**: If your code supports it, you can chain multiple filters automatically. Otherwise, manual two-pass filtering (as shown) is straightforward.
- **Adjust Thresholds**: If you see too much or too little text in step one, tweak `threshold=0.5` or `min_word_threshold=50`. Similarly, `bm25_threshold=1.2` can be raised or lowered for fewer or more chunks in step two.

### One-Pass Combination?

If your codebase or pipeline design allows applying multiple filters in one pass, you could do so. But often it’s simpler—and more transparent—to run them sequentially, analyzing each step’s result.

**Bottom Line**: By **manually chaining** your filtering logic in two passes, you get powerful incremental control over the final content. First, remove “global” clutter with Pruning, then refine further with BM25-based query relevance—without incurring a second network crawl.
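
If you chain passes often, a small helper keeps the pipeline readable. This is a generic sketch using plain functions (not a Crawl4AI API): each filter takes a string and returns a list of chunks, and the helper re-joins chunks between passes, exactly like the manual two-pass flow above:

```python
def chain_filters(content, *filters):
    """Apply each filter in order; each takes a string and returns chunk lists."""
    chunks = []
    for f in filters:
        chunks = f(content)
        if not chunks:                    # nothing survived this pass
            return []
        content = "\n".join(chunks)       # re-join for the next pass
    return chunks

# Stand-in filters for demonstration (pruning-like, then query-like):
drop_short = lambda text: [ln for ln in text.splitlines() if len(ln.split()) >= 3]
keep_query = lambda text: [ln for ln in text.splitlines() if "learning" in ln]

doc = "nav bar\nmachine learning is fun\nfooter\ndeep learning with python here"
print(chain_filters(doc, drop_short, keep_query))
```

You could drop `PruningContentFilter.filter_content` and `BM25ContentFilter.filter_content` (wrapped as one-argument callables) into the same helper.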

---

## 8. Common Pitfalls & Tips

1. **No Markdown Output?**
   - Make sure the crawler actually retrieved HTML. If the site is heavily JS-based, you may need to enable dynamic rendering or wait for elements.
   - Check if your content filter is too aggressive. Lower thresholds or disable the filter to see if content reappears.

2. **Performance Considerations**
   - Very large pages with multiple filters can be slower. Consider `cache_mode` to avoid re-downloading.
   - If your final use case is LLM ingestion, consider summarizing further or chunking big texts.

3. **Take Advantage of `fit_markdown`**
   - Great for RAG pipelines, semantic search, or any scenario where extraneous boilerplate is unwanted.
   - Still verify the textual quality—some sites have crucial data in footers or sidebars.

4. **Adjusting `html2text` Options**
   - If you see lots of raw HTML slipping into the text, turn on `escape_html`.
   - If code blocks look messy, experiment with `mark_code` or `handle_code_in_pre`.

---

## 9. Summary & Next Steps

In this **Markdown Generation Basics** tutorial, you learned to:

- Configure the **DefaultMarkdownGenerator** with HTML-to-text options.
- Use **BM25ContentFilter** for query-specific extraction or **PruningContentFilter** for general noise removal.
- Distinguish between raw and filtered markdown (`fit_markdown`).
- Leverage the `MarkdownGenerationResult` object to handle different forms of output (citations, references, etc.).

Now you can produce high-quality Markdown from any website, focusing on exactly the content you need—an essential step for powering AI models, summarization pipelines, or knowledge-base queries.

**Last Updated**: 2025-01-01
343
docs/md_v2/core/page-interaction.md
Normal file
@@ -0,0 +1,343 @@

# Page Interaction

Crawl4AI provides powerful features for interacting with **dynamic** webpages, handling JavaScript execution, waiting for conditions, and managing multi-step flows. By combining **js_code**, **wait_for**, and certain **CrawlerRunConfig** parameters, you can:

1. Click “Load More” buttons
2. Fill forms and submit them
3. Wait for elements or data to appear
4. Reuse sessions across multiple steps

Below is a quick overview of how to do it.

---

## 1. JavaScript Execution

### Basic Execution

**`js_code`** in **`CrawlerRunConfig`** accepts either a single JS string or a list of JS snippets.
**Example**: We’ll scroll to the bottom of the page, then optionally click a “Load More” button.

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    # Single JS command
    config = CrawlerRunConfig(
        js_code="window.scrollTo(0, document.body.scrollHeight);"
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://news.ycombinator.com",  # Example site
            config=config
        )
        print("Crawled length:", len(result.cleaned_html))

    # Multiple commands
    js_commands = [
        "window.scrollTo(0, document.body.scrollHeight);",
        # 'More' link on Hacker News
        "document.querySelector('a.morelink')?.click();",
    ]
    config = CrawlerRunConfig(js_code=js_commands)

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://news.ycombinator.com",  # Another pass
            config=config
        )
        print("After scroll+click, length:", len(result.cleaned_html))

if __name__ == "__main__":
    asyncio.run(main())
```

**Relevant `CrawlerRunConfig` params**:

- **`js_code`**: A string or list of strings with JavaScript to run after the page loads.
- **`js_only`**: If set to `True` on subsequent calls, indicates we’re continuing an existing session without a new full navigation.
- **`session_id`**: If you want to keep the same page across multiple calls, specify an ID.
---

## 2. Wait Conditions

### 2.1 CSS-Based Waiting

Sometimes, you just want to wait for a specific element to appear. For example:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    config = CrawlerRunConfig(
        # Wait for at least 30 items on Hacker News
        wait_for="css:.athing:nth-child(30)"
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://news.ycombinator.com",
            config=config
        )
        print("We have at least 30 items loaded!")
        # Rough check
        print("Total items in HTML:", result.cleaned_html.count("athing"))

if __name__ == "__main__":
    asyncio.run(main())
```

**Key param**:

- **`wait_for="css:..."`**: Tells the crawler to wait until that CSS selector is present.

### 2.2 JavaScript-Based Waiting

For more complex conditions (e.g., waiting for content length to exceed a threshold), prefix the expression with `js:`:

```python
wait_condition = """() => {
    const items = document.querySelectorAll('.athing');
    return items.length > 50;  // Wait for at least 51 items
}"""

config = CrawlerRunConfig(wait_for=f"js:{wait_condition}")
```

**Behind the Scenes**: Crawl4AI keeps polling the JS function until it returns `true` or a timeout occurs.
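
The polling loop itself is conceptually simple. Here is a generic `asyncio` sketch of the pattern (an illustration, not Crawl4AI’s internal code): call the condition, sleep, and repeat until it succeeds or the deadline passes.

```python
import asyncio

async def poll_until(condition, timeout=10.0, interval=0.1):
    """Call `condition()` repeatedly until it returns True or we time out."""
    loop = asyncio.get_running_loop()
    deadline = loop.time() + timeout
    while loop.time() < deadline:
        if condition():
            return True
        await asyncio.sleep(interval)
    raise TimeoutError("condition never became true")

# Demo: a counter that "loads" one more item per poll
state = {"items": 0}
def fake_condition():
    state["items"] += 1
    return state["items"] > 5

print(asyncio.run(poll_until(fake_condition, timeout=2.0, interval=0.01)))
```

In the real crawler, `condition` corresponds to evaluating your `js:` expression inside the page, and `timeout` corresponds to the run config’s page timeout.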

---

## 3. Handling Dynamic Content

Many modern sites require **multiple steps**: scrolling, clicking “Load More,” or updating via JavaScript. Below are typical patterns.

### 3.1 Load More Example (Hacker News “More” Link)

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    # Step 1: Load initial Hacker News page
    config = CrawlerRunConfig(
        wait_for="css:.athing:nth-child(30)"  # Wait for 30 items
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://news.ycombinator.com",
            config=config
        )
        print("Initial items loaded.")

        # Step 2: Let's scroll and click the "More" link
        load_more_js = [
            "window.scrollTo(0, document.body.scrollHeight);",
            # The "More" link at page bottom
            "document.querySelector('a.morelink')?.click();"
        ]

        next_page_conf = CrawlerRunConfig(
            js_code=load_more_js,
            wait_for="""js:() => {
                return document.querySelectorAll('.athing').length > 30;
            }""",
            # Mark that we do not re-navigate, but run JS in the same session:
            js_only=True,
            session_id="hn_session"
        )

        # Re-use the same crawler session
        result2 = await crawler.arun(
            url="https://news.ycombinator.com",  # same URL but continuing session
            config=next_page_conf
        )
        total_items = result2.cleaned_html.count("athing")
        print("Items after load-more:", total_items)

if __name__ == "__main__":
    asyncio.run(main())
```

**Key params**:

- **`session_id="hn_session"`**: Keep the same page across multiple calls to `arun()`.
- **`js_only=True`**: We’re not performing a full reload, just applying JS in the existing page.
- **`wait_for`** with `js:`: Wait for the item count to grow beyond 30.
---

### 3.2 Form Interaction

If the site has a search or login form, you can fill fields and submit them with **`js_code`**. For instance, if GitHub had a local search form:

```python
js_form_interaction = """
document.querySelector('#your-search').value = 'TypeScript commits';
document.querySelector('form').submit();
"""

config = CrawlerRunConfig(
    js_code=js_form_interaction,
    wait_for="css:.commit"
)
result = await crawler.arun(url="https://github.com/search", config=config)
```

**In reality**: Replace the IDs or classes with the real site’s form selectors.

---

## 4. Timing Control

1. **`page_timeout`** (ms): Overall page load or script execution time limit.
2. **`delay_before_return_html`** (seconds): Wait an extra moment before capturing the final HTML.
3. **`mean_delay`** & **`max_range`**: If you call `arun_many()` with multiple URLs, these add a random pause between each request.

**Example**:

```python
config = CrawlerRunConfig(
    page_timeout=60000,  # 60s limit
    delay_before_return_html=2.5
)
```
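
To see what `mean_delay` and `max_range` imply, here is a sketch of how a randomized politeness delay can be computed. This illustrates the idea of jittered pauses between requests, not the library’s exact formula:

```python
import random

def polite_delay(mean_delay=1.0, max_range=0.5):
    """A pause centered on mean_delay, jittered by up to max_range seconds."""
    return mean_delay + random.uniform(0, max_range)

delays = [polite_delay(1.0, 0.5) for _ in range(5)]
assert all(1.0 <= d <= 1.5 for d in delays)  # every pause stays in [1.0, 1.5]
```

Randomized pauses make multi-URL crawls look less mechanical and reduce the load you place on a single host.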

---

## 5. Multi-Step Interaction Example

Below is a simplified script that does multiple “Load More” clicks on GitHub’s TypeScript commits page. It **re-uses** the same session to accumulate new commits each time. The code includes the relevant **`CrawlerRunConfig`** parameters you’d rely on.

```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode

async def multi_page_commits():
    browser_cfg = BrowserConfig(
        headless=False,  # Visible for demonstration
        verbose=True
    )
    session_id = "github_ts_commits"

    base_wait = """js:() => {
        const commits = document.querySelectorAll('li.Box-sc-g0xbh4-0 h4');
        return commits.length > 0;
    }"""

    # Step 1: Load initial commits
    config1 = CrawlerRunConfig(
        wait_for=base_wait,
        session_id=session_id,
        cache_mode=CacheMode.BYPASS,
        # Not using js_only yet since it's our first load
    )

    async with AsyncWebCrawler(config=browser_cfg) as crawler:
        result = await crawler.arun(
            url="https://github.com/microsoft/TypeScript/commits/main",
            config=config1
        )
        print("Initial commits loaded. Count:", result.cleaned_html.count("commit"))

        # Step 2: For subsequent pages, we run JS to click 'Next Page' if it exists
        js_next_page = """
        const selector = 'a[data-testid="pagination-next-button"]';
        const button = document.querySelector(selector);
        if (button) button.click();
        """

        # Wait until new commits appear
        wait_for_more = """js:() => {
            const commits = document.querySelectorAll('li.Box-sc-g0xbh4-0 h4');
            if (!window.firstCommit && commits.length > 0) {
                window.firstCommit = commits[0].textContent;
                return false;
            }
            // If top commit changes, we have new commits
            const topNow = commits[0]?.textContent.trim();
            return topNow && topNow !== window.firstCommit;
        }"""

        for page in range(2):  # let's do 2 more "Next" pages
            config_next = CrawlerRunConfig(
                session_id=session_id,
                js_code=js_next_page,
                wait_for=wait_for_more,
                js_only=True,  # We're continuing from the open tab
                cache_mode=CacheMode.BYPASS
            )
            result2 = await crawler.arun(
                url="https://github.com/microsoft/TypeScript/commits/main",
                config=config_next
            )
            print(f"Page {page + 2} commits count:", result2.cleaned_html.count("commit"))

        # Optionally kill session
        await crawler.crawler_strategy.kill_session(session_id)

async def main():
    await multi_page_commits()

if __name__ == "__main__":
    asyncio.run(main())
```

**Key Points**:

- **`session_id`**: Keep the same page open.
- **`js_code`** + **`wait_for`** + **`js_only=True`**: We do partial refreshes, waiting for new commits to appear.
- **`cache_mode=CacheMode.BYPASS`**: Ensures we always see fresh data each step.
---

## 6. Combine Interaction with Extraction

Once dynamic content is loaded, you can attach an **`extraction_strategy`** (like `JsonCssExtractionStrategy` or `LLMExtractionStrategy`). For example:

```python
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

schema = {
    "name": "Commits",
    "baseSelector": "li.Box-sc-g0xbh4-0",
    "fields": [
        {"name": "title", "selector": "h4.markdown-title", "type": "text"}
    ]
}
config = CrawlerRunConfig(
    session_id="ts_commits_session",
    js_code=js_next_page,
    wait_for=wait_for_more,
    extraction_strategy=JsonCssExtractionStrategy(schema)
)
```

When done, check `result.extracted_content` for the JSON.

---

## 7. Relevant `CrawlerRunConfig` Parameters

Below are the key interaction-related parameters in `CrawlerRunConfig`. For a full list, see [Configuration Parameters](../api/parameters.md).

- **`js_code`**: JavaScript to run after initial load.
- **`js_only`**: If `True`, no new page navigation—only JS in the existing session.
- **`wait_for`**: CSS (`"css:..."`) or JS (`"js:..."`) expression to wait for.
- **`session_id`**: Reuse the same page across calls.
- **`cache_mode`**: Whether to read/write from the cache or bypass it.
- **`remove_overlay_elements`**: Remove certain popups automatically.
- **`simulate_user`, `override_navigator`, `magic`**: Anti-bot or “human-like” interactions.
---

## 8. Conclusion

Crawl4AI’s **page interaction** features let you:

1. **Execute JavaScript** for scrolling, clicks, or form filling.
2. **Wait** for CSS or custom JS conditions before capturing data.
3. **Handle** multi-step flows (like “Load More”) with partial reloads or persistent sessions.
4. Combine with **structured extraction** for dynamic sites.

With these tools, you can scrape modern, interactive webpages confidently. For advanced hooking, user simulation, or in-depth config, check the [API reference](../api/parameters.md) or related advanced docs. Happy scripting!
362
docs/md_v2/core/quickstart.md
Normal file
@@ -0,0 +1,362 @@
# Getting Started with Crawl4AI

Welcome to **Crawl4AI**, an open-source LLM-friendly Web Crawler & Scraper. In this tutorial, you’ll:

1. Run your **first crawl** using minimal configuration.
2. Generate **Markdown** output (and learn how it’s influenced by content filters).
3. Experiment with a simple **CSS-based extraction** strategy.
4. See a glimpse of **LLM-based extraction** (including open-source and closed-source model options).
5. Crawl a **dynamic** page that loads content via JavaScript.

---

## 1. Introduction

Crawl4AI provides:

- An asynchronous crawler, **`AsyncWebCrawler`**.
- Configurable browser and run settings via **`BrowserConfig`** and **`CrawlerRunConfig`**.
- Automatic HTML-to-Markdown conversion via **`DefaultMarkdownGenerator`** (supports optional filters).
- Multiple extraction strategies (LLM-based or “traditional” CSS/XPath-based).

By the end of this guide, you’ll have performed a basic crawl, generated Markdown, tried out two extraction strategies, and crawled a dynamic page that uses “Load More” buttons or JavaScript updates.

---

## 2. Your First Crawl

Here’s a minimal Python script that creates an **`AsyncWebCrawler`**, fetches a webpage, and prints the first 300 characters of its Markdown output:

```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com")
        print(result.markdown[:300])  # Print first 300 chars

if __name__ == "__main__":
    asyncio.run(main())
```

**What’s happening?**

- **`AsyncWebCrawler`** launches a headless browser (Chromium by default).
- It fetches `https://example.com`.
- Crawl4AI automatically converts the HTML into Markdown.

You now have a simple, working crawl!
---

## 3. Basic Configuration (Light Introduction)

Crawl4AI’s crawler can be heavily customized using two main classes:

1. **`BrowserConfig`**: Controls browser behavior (headless or full UI, user agent, JavaScript toggles, etc.).
2. **`CrawlerRunConfig`**: Controls how each crawl runs (caching, extraction, timeouts, hooking, etc.).

Below is an example with minimal usage:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode

async def main():
    browser_conf = BrowserConfig(headless=True)  # or False to see the browser
    run_conf = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS
    )

    async with AsyncWebCrawler(config=browser_conf) as crawler:
        result = await crawler.arun(
            url="https://example.com",
            config=run_conf
        )
        print(result.markdown)

if __name__ == "__main__":
    asyncio.run(main())
```

> IMPORTANT: By default, cache mode is set to `CacheMode.ENABLED`, so to get fresh content you need to set it to `CacheMode.BYPASS`.

We’ll explore more advanced config in later tutorials (like enabling proxies, PDF output, multi-tab sessions, etc.). For now, just note how you pass these objects to manage crawling.
---

## 4. Generating Markdown Output

By default, Crawl4AI automatically generates Markdown from each crawled page. However, the exact output depends on whether you specify a **markdown generator** or **content filter**.

- **`result.markdown`**:
  The direct HTML-to-Markdown conversion.
- **`result.markdown.fit_markdown`**:
  The same content after applying any configured **content filter** (e.g., `PruningContentFilter`).
### Example: Using a Filter with `DefaultMarkdownGenerator`

```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
from crawl4ai.content_filter_strategy import PruningContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

md_generator = DefaultMarkdownGenerator(
    content_filter=PruningContentFilter(threshold=0.4, threshold_type="fixed")
)

config = CrawlerRunConfig(
    cache_mode=CacheMode.BYPASS,
    markdown_generator=md_generator
)

async with AsyncWebCrawler() as crawler:
    result = await crawler.arun("https://news.ycombinator.com", config=config)
    print("Raw Markdown length:", len(result.markdown.raw_markdown))
    print("Fit Markdown length:", len(result.markdown.fit_markdown))
```

**Note**: If you do **not** specify a content filter or markdown generator, you’ll typically see only the raw Markdown. `PruningContentFilter` may add around `50ms` of processing time. We’ll dive deeper into these strategies in a dedicated **Markdown Generation** tutorial.

---

## 5. Simple Data Extraction (CSS-based)

Crawl4AI can also extract structured data (JSON) using CSS or XPath selectors. Below is a minimal CSS-based example:

```python
import asyncio
import json
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

async def main():
    schema = {
        "name": "Example Items",
        "baseSelector": "div.item",
        "fields": [
            {"name": "title", "selector": "h2", "type": "text"},
            {"name": "link", "selector": "a", "type": "attribute", "attribute": "href"}
        ]
    }

    raw_html = "<div class='item'><h2>Item 1</h2><a href='https://example.com/item1'>Link 1</a></div>"

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="raw://" + raw_html,
            config=CrawlerRunConfig(
                cache_mode=CacheMode.BYPASS,
                extraction_strategy=JsonCssExtractionStrategy(schema)
            )
        )
        # The JSON output is stored in 'extracted_content'
        data = json.loads(result.extracted_content)
        print(data)

if __name__ == "__main__":
    asyncio.run(main())
```
|
||||
|
||||
**Why is this helpful?**
|
||||
- Great for repetitive page structures (e.g., item listings, articles).
|
||||
- No AI usage or costs.
|
||||
- The crawler returns a JSON string you can parse or store.
|
||||
|
||||
> Tips: You can pass raw HTML to the crawler instead of a URL. To do so, prefix the HTML with `raw://`.
---

## 6. Simple Data Extraction (LLM-based)

For more complex or irregular pages, a language model can parse text intelligently into a structure you define. Crawl4AI supports **open-source** or **closed-source** providers:

- **Open-Source Models** (e.g., `ollama/llama3.3`, no real token required)
- **OpenAI Models** (e.g., `openai/gpt-4`, requires `api_token`)
- Or any provider supported by the underlying library

Below is an example using **open-source** style (no token) and closed-source:

```python
import os
import json
import asyncio
from typing import Dict
from pydantic import BaseModel, Field
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from crawl4ai.extraction_strategy import LLMExtractionStrategy

class OpenAIModelFee(BaseModel):
    model_name: str = Field(..., description="Name of the OpenAI model.")
    input_fee: str = Field(..., description="Fee for input token for the OpenAI model.")
    output_fee: str = Field(
        ..., description="Fee for output token for the OpenAI model."
    )

async def extract_structured_data_using_llm(
    provider: str, api_token: str = None, extra_headers: Dict[str, str] = None
):
    print(f"\n--- Extracting Structured Data with {provider} ---")

    if api_token is None and provider != "ollama":
        print(f"API token is required for {provider}. Skipping this example.")
        return

    browser_config = BrowserConfig(headless=True)

    extra_args = {"temperature": 0, "top_p": 0.9, "max_tokens": 2000}
    if extra_headers:
        extra_args["extra_headers"] = extra_headers

    crawler_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        word_count_threshold=1,
        page_timeout=80000,
        extraction_strategy=LLMExtractionStrategy(
            provider=provider,
            api_token=api_token,
            schema=OpenAIModelFee.model_json_schema(),
            extraction_type="schema",
            instruction="""From the crawled content, extract all mentioned model names along with their fees for input and output tokens.
            Do not miss any models in the entire content.""",
            extra_args=extra_args,
        ),
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url="https://openai.com/api/pricing/", config=crawler_config
        )
        print(result.extracted_content)

if __name__ == "__main__":
    # Use ollama with llama3.3
    # asyncio.run(
    #     extract_structured_data_using_llm(
    #         provider="ollama/llama3.3", api_token="no-token"
    #     )
    # )

    asyncio.run(
        extract_structured_data_using_llm(
            provider="openai/gpt-4o", api_token=os.getenv("OPENAI_API_KEY")
        )
    )
```

**What’s happening?**

- We define a Pydantic model (`OpenAIModelFee`) describing the fields we want.
- The LLM extraction strategy uses that schema and your instructions to transform raw text into structured JSON.
- Depending on the **provider** and **api_token**, you can use local models or a remote API.
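
Because LLM output can occasionally drop fields, it is worth validating `extracted_content` before using it downstream. Below is a minimal stdlib sketch; in practice you could simply re-validate each record with `OpenAIModelFee.model_validate`, and the sample fee strings here are purely illustrative:

```python
import json

REQUIRED_FIELDS = {"model_name", "input_fee", "output_fee"}  # mirrors OpenAIModelFee

def validate_extraction(extracted_content: str) -> list:
    """Parse the JSON string returned by the extraction strategy and keep
    only records that contain every required field."""
    records = json.loads(extracted_content)
    if isinstance(records, dict):  # some providers return a single object
        records = [records]
    return [r for r in records if REQUIRED_FIELDS <= r.keys()]

# Example with one complete and one incomplete record (illustrative values):
sample = json.dumps([
    {"model_name": "gpt-4o", "input_fee": "$2.50 / 1M tokens", "output_fee": "$10.00 / 1M tokens"},
    {"model_name": "gpt-4o-mini", "input_fee": "$0.15 / 1M tokens"},  # missing output_fee
])
print(validate_extraction(sample))
```

Only the first record survives validation; the incomplete one is filtered out rather than crashing your pipeline later.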
---

## 7. Dynamic Content Example

Some sites require multiple “page clicks” or dynamic JavaScript updates. The example below **clicks** through a series of tabs on a course-listing page, waiting briefly after each click so the tab’s content can render, then extracts everything in one pass, using **`BrowserConfig`** and **`CrawlerRunConfig`**:

```python
import asyncio
import json
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

async def extract_structured_data_using_css_extractor():
    print("\n--- Using JsonCssExtractionStrategy for Fast Structured Output ---")
    schema = {
        "name": "KidoCode Courses",
        "baseSelector": "section.charge-methodology .w-tab-content > div",
        "fields": [
            {
                "name": "section_title",
                "selector": "h3.heading-50",
                "type": "text",
            },
            {
                "name": "section_description",
                "selector": ".charge-content",
                "type": "text",
            },
            {
                "name": "course_name",
                "selector": ".text-block-93",
                "type": "text",
            },
            {
                "name": "course_description",
                "selector": ".course-content-text",
                "type": "text",
            },
            {
                "name": "course_icon",
                "selector": ".image-92",
                "type": "attribute",
                "attribute": "src",
            },
        ],
    }

    browser_config = BrowserConfig(headless=True, java_script_enabled=True)

    js_click_tabs = """
    (async () => {
        const tabs = document.querySelectorAll("section.charge-methodology .tabs-menu-3 > div");
        for(let tab of tabs) {
            tab.scrollIntoView();
            tab.click();
            await new Promise(r => setTimeout(r, 500));
        }
    })();
    """

    crawler_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        extraction_strategy=JsonCssExtractionStrategy(schema),
        js_code=[js_click_tabs],
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url="https://www.kidocode.com/degrees/technology", config=crawler_config
        )

        courses = json.loads(result.extracted_content)
        print(f"Successfully extracted {len(courses)} course records")
        print(json.dumps(courses[0], indent=2))

async def main():
    await extract_structured_data_using_css_extractor()

if __name__ == "__main__":
    asyncio.run(main())
```

**Key Points**:

- **`BrowserConfig(headless=True, java_script_enabled=True)`**: The tabs are driven by JavaScript, so scripting must stay enabled; set `headless=False` if you want to watch the clicks happen.
- **`js_code`**: The snippet scrolls each tab into view, clicks it, and waits 500ms so the tab’s content is present in the final HTML before extraction.
- **`JsonCssExtractionStrategy(schema)`**: Once all tabs have rendered, the CSS schema extracts every section title, description, course name, and icon in a single pass.
- For flows that truly span multiple page loads (e.g., clicking a “Next Page” button repeatedly), you can instead pass a `session_id` plus `js_only=True` on subsequent `arun()` calls to keep driving the same page, and call `kill_session()` to clean up when done.
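
The extraction is flat — one JSON record per matched element — so a common post-processing step is grouping courses under their section. A stdlib sketch (the record values below are made up for illustration; real records follow the schema’s field names):

```python
import json
from collections import defaultdict

# Sample records shaped like the schema's output (values are illustrative):
records = [
    {"section_title": "Programming", "course_name": "Python Basics"},
    {"section_title": "Programming", "course_name": "Web Development"},
    {"section_title": "Design", "course_name": "UI Fundamentals"},
]

# Group course names under their section title.
by_section = defaultdict(list)
for record in records:
    by_section[record.get("section_title", "Unknown")].append(record["course_name"])

print(json.dumps(by_section, indent=2))
```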
---

## 8. Next Steps

Congratulations! You have:

1. Performed a basic crawl and printed Markdown.
2. Used **content filters** with a markdown generator.
3. Extracted JSON via **CSS** or **LLM** strategies.
4. Handled **dynamic** pages with JavaScript triggers.

If you’re ready for more, check out:

- **Installation**: A deeper dive into advanced installs, Docker usage (experimental), or optional dependencies.
- **Hooks & Auth**: Learn how to run custom JavaScript or handle logins with cookies, local storage, etc.
- **Deployment**: Explore ephemeral testing in Docker or plan for the upcoming stable Docker release.
- **Browser Management**: Delve into user simulation, stealth modes, and concurrency best practices.

Crawl4AI is a powerful, flexible tool. Enjoy building out your scrapers, data pipelines, or AI-driven extraction flows. Happy crawling!

145
docs/md_v2/core/simple-crawling.md
Normal file
@@ -0,0 +1,145 @@
# Simple Crawling

This guide covers the basics of web crawling with Crawl4AI. You'll learn how to set up a crawler, make your first request, and understand the response.

## Basic Usage

Set up a simple crawl using `BrowserConfig` and `CrawlerRunConfig`:

```python
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig

async def main():
    browser_config = BrowserConfig()  # Default browser configuration
    run_config = CrawlerRunConfig()   # Default crawl run configuration

    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url="https://example.com",
            config=run_config
        )
        print(result.markdown)  # Print clean markdown content

if __name__ == "__main__":
    asyncio.run(main())
```

## Understanding the Response

The `arun()` method returns a `CrawlResult` object with several useful properties. Here's a quick overview (see [CrawlResult](../api/crawl-result.md) for complete details):

```python
result = await crawler.arun(
    url="https://example.com",
    config=CrawlerRunConfig()
)

# Different content formats
print(result.html)         # Raw HTML
print(result.cleaned_html) # Cleaned HTML
print(result.markdown)     # Markdown version
print(result.fit_markdown) # Most relevant content in markdown (requires a content filter)

# Check success status
print(result.success)      # True if crawl succeeded
print(result.status_code)  # HTTP status code (e.g., 200, 404)

# Access extracted media and links
print(result.media)        # Dictionary of found media (images, videos, audio)
print(result.links)        # Dictionary of internal and external links
```

## Adding Basic Options

Customize your crawl using `CrawlerRunConfig`:

```python
run_config = CrawlerRunConfig(
    word_count_threshold=10,        # Minimum words per content block
    exclude_external_links=True,    # Remove external links
    remove_overlay_elements=True,   # Remove popups/modals
    process_iframes=True            # Process iframe content
)

result = await crawler.arun(
    url="https://example.com",
    config=run_config
)
```

## Handling Errors

Always check if the crawl was successful:

```python
run_config = CrawlerRunConfig()
result = await crawler.arun(url="https://example.com", config=run_config)

if not result.success:
    print(f"Crawl failed: {result.error_message}")
    print(f"Status code: {result.status_code}")
```
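
Transient failures (timeouts, flaky networks, 5xx responses) are common in crawling, so wrapping the call in a small retry loop often helps. A minimal sketch — the attempt count and backoff delay are illustrative choices, and `crawl` stands in for any coroutine (such as a lambda around `crawler.arun(...)`) that returns an object with a `success` flag:

```python
import asyncio

async def crawl_with_retries(crawl, attempts=3, base_delay=1.0):
    """Call the given async crawl function, retrying failures with
    exponential backoff. Returns the last result either way."""
    result = None
    for attempt in range(attempts):
        result = await crawl()
        if result.success:
            return result
        await asyncio.sleep(base_delay * (2 ** attempt))
    return result

# Demo with a stub crawl that fails twice, then succeeds:
class StubResult:
    def __init__(self, success):
        self.success = success

calls = {"n": 0}

async def flaky_crawl():
    calls["n"] += 1
    return StubResult(calls["n"] >= 3)

result = asyncio.run(crawl_with_retries(flaky_crawl, attempts=3, base_delay=0.01))
print(result.success, calls["n"])
```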

## Logging and Debugging

Enable verbose logging in `BrowserConfig`:

```python
browser_config = BrowserConfig(verbose=True)

async with AsyncWebCrawler(config=browser_config) as crawler:
    run_config = CrawlerRunConfig()
    result = await crawler.arun(url="https://example.com", config=run_config)
```

## Complete Example

Here's a more comprehensive example demonstrating common usage patterns:

```python
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig, CacheMode

async def main():
    browser_config = BrowserConfig(verbose=True)
    run_config = CrawlerRunConfig(
        # Content filtering
        word_count_threshold=10,
        excluded_tags=['form', 'header'],
        exclude_external_links=True,

        # Content processing
        process_iframes=True,
        remove_overlay_elements=True,

        # Cache control
        cache_mode=CacheMode.ENABLED  # Use cache if available
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url="https://example.com",
            config=run_config
        )

        if result.success:
            # Print clean content
            print("Content:", result.markdown[:500])  # First 500 chars

            # Process images
            for image in result.media["images"]:
                print(f"Found image: {image['src']}")

            # Process links
            for link in result.links["internal"]:
                print(f"Internal link: {link['href']}")

        else:
            print(f"Crawl failed: {result.error_message}")

if __name__ == "__main__":
    asyncio.run(main())
```
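
Pages often list the same image several times (thumbnails, lazy-load duplicates), so deduplicating `result.media["images"]` before downloading is a cheap win. A stdlib sketch, assuming the list-of-dicts shape shown in the loop above:

```python
def unique_images(images):
    """Deduplicate image entries by 'src', keeping first-occurrence order."""
    seen = set()
    unique = []
    for image in images:
        src = image.get("src")
        if src and src not in seen:
            seen.add(src)
            unique.append(image)
    return unique

# Shape mirrors result.media["images"] from the example above (sample data):
images = [
    {"src": "https://example.com/a.png", "alt": "A"},
    {"src": "https://example.com/a.png", "alt": "A (duplicate)"},
    {"src": "https://example.com/b.png", "alt": "B"},
]
print([img["src"] for img in unique_images(images)])
```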