Merge PR #899 into next, resolve conflicts in server.py and docs/browser-crawler-config.md

This commit is contained in:
unclecode
2025-04-22 14:56:47 +08:00
16 changed files with 132 additions and 140 deletions

View File

@@ -12,9 +12,10 @@ Weve introduced a new feature that effortlessly handles even the biggest page
**Simple Example:**
```python
import os, sys
import os
import sys
import asyncio
from crawl4ai import AsyncWebCrawler, CacheMode
from crawl4ai import AsyncWebCrawler, CacheMode, CrawlerRunConfig
# Adjust paths as needed
parent_dir = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
@@ -26,9 +27,11 @@ async def main():
# Request both PDF and screenshot
result = await crawler.arun(
url='https://en.wikipedia.org/wiki/List_of_common_misconceptions',
cache_mode=CacheMode.BYPASS,
pdf=True,
screenshot=True
config=CrawlerRunConfig(
cache_mode=CacheMode.BYPASS,
pdf=True,
screenshot=True
)
)
if result.success:
@@ -40,9 +43,8 @@ async def main():
# Save PDF
if result.pdf:
pdf_bytes = b64decode(result.pdf)
with open(os.path.join(__location__, "page.pdf"), "wb") as f:
f.write(pdf_bytes)
f.write(result.pdf)
if __name__ == "__main__":
asyncio.run(main())

View File

@@ -232,6 +232,7 @@ async def main():
if __name__ == "__main__":
asyncio.run(main())
```
## 2.4 Compliance & Ethics

View File

@@ -36,8 +36,6 @@ class BrowserConfig:
### Key Fields to Note
1. **`browser_type`**
- Options: `"chromium"`, `"firefox"`, or `"webkit"`.
- Defaults to `"chromium"`.
@@ -215,6 +213,7 @@ class CrawlerRunConfig:
- The display mode for progress information (`DETAILED`, `BRIEF`, etc.).
- Affects how much information is printed during the crawl.
### Helper Methods
The `clone()` method is particularly useful for creating variations of your crawler configuration:
@@ -248,9 +247,6 @@ The `clone()` method:
---
## 3. LLMConfig Essentials
### Key fields to note

View File

@@ -2,7 +2,7 @@
In some cases, you need to extract **complex or unstructured** information from a webpage that a simple CSS/XPath schema cannot easily parse. Or you want **AI**-driven insights, classification, or summarization. For these scenarios, Crawl4AI provides an **LLM-based extraction strategy** that:
1. Works with **any** large language model supported by [LightLLM](https://github.com/LightLLM) (Ollama, OpenAI, Claude, and more).
1. Works with **any** large language model supported by [LiteLLM](https://github.com/BerriAI/litellm) (Ollama, OpenAI, Claude, and more).
2. Automatically splits content into chunks (if desired) to handle token limits, then combines results.
3. Lets you define a **schema** (like a Pydantic model) or a simpler “block” extraction approach.
@@ -18,13 +18,19 @@ In some cases, you need to extract **complex or unstructured** information from
---
## 2. Provider-Agnostic via LightLLM
## 2. Provider-Agnostic via LiteLLM
Crawl4AI uses a “provider string” (e.g., `"openai/gpt-4o"`, `"ollama/llama2.0"`, `"aws/titan"`) to identify your LLM. **Any** model that LightLLM supports is fair game. You just provide:
You can use LlmConfig, to quickly configure multiple variations of LLMs and experiment with them to find the optimal one for your use case. You can read more about LlmConfig [here](/api/parameters).
```python
llmConfig = LlmConfig(provider="openai/gpt-4o-mini", api_token=os.getenv("OPENAI_API_KEY"))
```
Crawl4AI uses a “provider string” (e.g., `"openai/gpt-4o"`, `"ollama/llama2.0"`, `"aws/titan"`) to identify your LLM. **Any** model that LiteLLM supports is fair game. You just provide:
- **`provider`**: The `<provider>/<model_name>` identifier (e.g., `"openai/gpt-4"`, `"ollama/llama2"`, `"huggingface/google-flan"`, etc.).
- **`api_token`**: If needed (for OpenAI, HuggingFace, etc.); local models or Ollama might not require it.
- **`api_base`** (optional): If your provider has a custom endpoint.
- **`base_url`** (optional): If your provider has a custom endpoint.
This means you **arent locked** into a single LLM vendor. Switch or experiment easily.
@@ -52,20 +58,19 @@ For structured data, `"schema"` is recommended. You provide `schema=YourPydantic
Below is an overview of important LLM extraction parameters. All are typically set inside `LLMExtractionStrategy(...)`. You then put that strategy in your `CrawlerRunConfig(..., extraction_strategy=...)`.
1. **`provider`** (str): e.g., `"openai/gpt-4"`, `"ollama/llama2"`.
2. **`api_token`** (str): The API key or token for that model. May not be needed for local models.
3. **`schema`** (dict): A JSON schema describing the fields you want. Usually generated by `YourModel.model_json_schema()`.
4. **`extraction_type`** (str): `"schema"` or `"block"`.
5. **`instruction`** (str): Prompt text telling the LLM what you want extracted. E.g., “Extract these fields as a JSON array.”
6. **`chunk_token_threshold`** (int): Maximum tokens per chunk. If your content is huge, you can break it up for the LLM.
7. **`overlap_rate`** (float): Overlap ratio between adjacent chunks. E.g., `0.1` means 10% of each chunk is repeated to preserve context continuity.
8. **`apply_chunking`** (bool): Set `True` to chunk automatically. If you want a single pass, set `False`.
9. **`input_format`** (str): Determines **which** crawler result is passed to the LLM. Options include:
1. **`llmConfig`** (LlmConfig): e.g., `"openai/gpt-4"`, `"ollama/llama2"`.
2. **`schema`** (dict): A JSON schema describing the fields you want. Usually generated by `YourModel.model_json_schema()`.
3. **`extraction_type`** (str): `"schema"` or `"block"`.
4. **`instruction`** (str): Prompt text telling the LLM what you want extracted. E.g., “Extract these fields as a JSON array.”
5. **`chunk_token_threshold`** (int): Maximum tokens per chunk. If your content is huge, you can break it up for the LLM.
6. **`overlap_rate`** (float): Overlap ratio between adjacent chunks. E.g., `0.1` means 10% of each chunk is repeated to preserve context continuity.
7. **`apply_chunking`** (bool): Set `True` to chunk automatically. If you want a single pass, set `False`.
8. **`input_format`** (str): Determines **which** crawler result is passed to the LLM. Options include:
- `"markdown"`: The raw markdown (default).
- `"fit_markdown"`: The filtered “fit” markdown if you used a content filter.
- `"html"`: The cleaned or raw HTML.
10. **`extra_args`** (dict): Additional LLM parameters like `temperature`, `max_tokens`, `top_p`, etc.
11. **`show_usage()`**: A method you can call to print out usage info (token usage per chunk, total cost if known).
9. **`extra_args`** (dict): Additional LLM parameters like `temperature`, `max_tokens`, `top_p`, etc.
10. **`show_usage()`**: A method you can call to print out usage info (token usage per chunk, total cost if known).
**Example**:
@@ -233,8 +238,7 @@ class KnowledgeGraph(BaseModel):
async def main():
# LLM extraction strategy
llm_strat = LLMExtractionStrategy(
provider="openai/gpt-4",
api_token=os.getenv('OPENAI_API_KEY'),
llmConfig = LlmConfig(provider="openai/gpt-4", api_token=os.getenv('OPENAI_API_KEY')),
schema=KnowledgeGraph.schema_json(),
extraction_type="schema",
instruction="Extract entities and relationships from the content. Return valid JSON.",
@@ -286,7 +290,7 @@ if __name__ == "__main__":
## 11. Conclusion
**LLM-based extraction** in Crawl4AI is **provider-agnostic**, letting you choose from hundreds of models via LightLLM. Its perfect for **semantically complex** tasks or generating advanced structures like knowledge graphs. However, its **slower** and potentially costlier than schema-based approaches. Keep these tips in mind:
**LLM-based extraction** in Crawl4AI is **provider-agnostic**, letting you choose from hundreds of models via LiteLLM. Its perfect for **semantically complex** tasks or generating advanced structures like knowledge graphs. However, its **slower** and potentially costlier than schema-based approaches. Keep these tips in mind:
- Put your LLM strategy **in `CrawlerRunConfig`**.
- Use **`input_format`** to pick which form (markdown, HTML, fit_markdown) the LLM sees.
@@ -317,4 +321,4 @@ If your sites data is consistent or repetitive, consider [`JsonCssExtractionS
---
Thats it for **Extracting JSON (LLM)**—now you can harness AI to parse, classify, or reorganize data on the web. Happy crawling!
Thats it for **Extracting JSON (LLM)**—now you can harness AI to parse, classify, or reorganize data on the web. Happy crawling!