Commit 5b80be956d (parent f6e59157bf)
Author: unclecode
Date: 2024-05-16 17:31:44 +08:00

- Debug
- Refactor code for new version

23 changed files with 3116 additions and 1019 deletions

README.md

Crawl4AI is a powerful, free web crawling service designed to extract useful information from web pages and make it accessible for large language models (LLMs) and AI applications. 🆓🌐
## 🚧 Work in Progress 👷‍♂️
- 🔧 Separate Crawl and Extract Semantic Chunk: Enhancing efficiency in large-scale tasks.
- 🔍 Colab Integration: Exploring integration with Google Colab for easy experimentation.
- 🎯 XPath and CSS Selector Support: Adding support for selective retrieval of specific elements.
- 📷 Image Captioning: Incorporating image captioning capabilities to extract descriptions from images.
- 💾 Embedding Vector Data: Generating and storing embedding data for each crawled website.
- 🔍 Semantic Search Engine: Building a semantic search engine that fetches content, performs vector similarity search, and generates labeled chunk data based on user queries and URLs.
## Recent Changes
- 🚀 10x faster!
- 📜 Execute custom JavaScript before crawling!
- 🤝 Colab friendly!
- 📚 Chunking strategies: topic-based, regex, sentence, and more!
- 🧠 Extraction strategies: cosine clustering, LLM, and more!
- 🎯 CSS selector support
- 📝 Pass instructions/keywords to refine extraction
## Power and Simplicity of Crawl4AI 🚀
Crawl4AI makes even complex web crawling tasks simple and intuitive. Below is an example of how you can execute JavaScript, filter data using keywords, and use a CSS selector to extract specific content—all in one go!
**Example Task:**
1. Execute custom JavaScript to click a "Load More" button.
2. Filter the data to include only content related to "technology".
3. Use a CSS selector to extract only paragraphs (`<p>` tags).
**Example Code:**
```python
# Import necessary modules
import os
from crawl4ai import WebCrawler
from crawl4ai.chunking_strategy import *
from crawl4ai.extraction_strategy import *
from crawl4ai.crawler_strategy import *

# Define the JavaScript code to click the "Load More" button
js_code = """
const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More'));
loadMoreButton && loadMoreButton.click();
"""

# Define the crawling strategy that executes the JavaScript before extraction
crawler_strategy = LocalSeleniumCrawlerStrategy(js_code=js_code)

# Create the WebCrawler instance with the defined strategy and warm it up
crawler = WebCrawler(crawler_strategy=crawler_strategy)
crawler.warmup()

# Run the crawler with keyword filtering (cosine similarity to "technology")
result = crawler.run(
    url="https://www.example.com",
    extraction_strategy=CosineStrategy(
        semantic_filter="technology",
    ),
)

# Or run the crawler with an LLM extraction strategy, restricted to <p> tags
result = crawler.run(
    url="https://www.example.com",
    extraction_strategy=LLMExtractionStrategy(
        provider="openai/gpt-4o",
        api_token=os.getenv('OPENAI_API_KEY'),
        instruction="Extract only content related to technology"
    ),
    css_selector="p"
)

# Display the extracted result
print(result)
```
With Crawl4AI, you can perform advanced web crawling and data extraction tasks with just a few lines of code. This example demonstrates how you can harness the power of Crawl4AI to simplify your workflow and get the data you need efficiently.
---
*Continue reading to learn more about the features, installation process, usage, and more.*
## Table of Contents
1. [Features](#features)
2. [Installation](#installation)
3. [REST API/Local Server](#using-the-local-server-or-rest-api)
4. [Python Library Usage](#usage)
5. [Parameters](#parameters)
6. [Chunking Strategies](#chunking-strategies)
7. [Extraction Strategies](#extraction-strategies)
8. [Contributing](#contributing)
9. [License](#license)
10. [Contact](#contact)
For more details, refer to the [CHANGELOG.md](https://github.com/unclecode/crawl4ai/blob/main/CHANGELOG.md) file.
## Features ✨
- 🌍 Supports crawling multiple URLs simultaneously
- 🌃 Replace media tags with their ALT text.
- 🆓 Completely free to use and open-source
- 📜 Execute custom JavaScript before crawling
- 📚 Chunking strategies: topic-based, regex, sentence, and more
- 🧠 Extraction strategies: cosine clustering, LLM, and more
- 🎯 CSS selector support
- 📝 Pass instructions/keywords to refine extraction
## Getting Started 🚀
To get started with Crawl4AI, simply visit our web application at [https://crawl4ai.uccode.io](https://crawl4ai.uccode.io) (available now!) and enter the URL(s) you want to crawl. The application will process the URLs and provide you with the extracted data in various formats.
## Installation 💻
There are three ways to use Crawl4AI:
1. As a library (recommended)
2. As a local server (Docker) or via the REST API
3. As a Google Colab notebook. [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1wz8u30rvbq6Scodye9AGCw8Qg_Z8QGsk)
### Using Crawl4AI as a Library 📚
To install Crawl4AI as a library, follow these steps:
1. Install the package from GitHub:
```bash
pip install git+https://github.com/unclecode/crawl4ai.git
```
2. Alternatively, clone the repository and install the package locally:
```bash
virtualenv venv
source venv/bin/activate
git clone https://github.com/unclecode/crawl4ai.git
cd crawl4ai
pip install -e .
```
3. Import the necessary modules in your Python script:
```python
from crawl4ai.web_crawler import WebCrawler
from crawl4ai.chunking_strategy import *
from crawl4ai.extraction_strategy import *
import os

crawler = WebCrawler()
crawler.warmup()  # IMPORTANT: warm up the engine before running the first crawl

# Single page crawl
result = crawler.run(
    url='https://www.nbcnews.com/business',
    word_count_threshold=5,  # Minimum word count for an HTML tag to be considered a worthy block
    chunking_strategy=RegexChunking(patterns=["\n\n"]),  # Default is RegexChunking
    extraction_strategy=CosineStrategy(word_count_threshold=20, max_dist=0.2, linkage_method='ward', top_k=3),  # Default is CosineStrategy
    # extraction_strategy=LLMExtractionStrategy(provider="openai/gpt-4o", api_token=os.getenv('OPENAI_API_KEY')),
    bypass_cache=False,
    extract_blocks=True,  # Whether to extract semantic blocks of text from the HTML
    css_selector="",  # E.g. "div.article-body"
    verbose=True,
    include_raw_html=True,  # Whether to include the raw HTML content in the response
)
print(result.model_dump())
```
Running for the first time downloads the Chrome driver for Selenium and creates a SQLite database file `crawler_data.db` in the current directory, which stores the crawled data for future reference.
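If you're curious about what has been cached (a minimal sketch, assuming the default `crawler_data.db` location; the table layout is not documented here), you can peek at the database with the standard library:
```python
import sqlite3

# Minimal sketch: list the tables in the cache database created on first run
conn = sqlite3.connect("crawler_data.db")
tables = conn.execute("SELECT name FROM sqlite_master WHERE type='table'").fetchall()
print(tables)
conn.close()
```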
The response is a `CrawlResult` object that contains the following attributes:
```python
class CrawlResult(BaseModel):
url: str
html: str
success: bool
cleaned_html: str = None
markdown: str = None
parsed_json: str = None
error_message: str = None
```
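As a quick illustration (a minimal sketch; which fields are populated depends on the options passed to `run`), you can branch on `success` and read the attributes directly:
```python
# Minimal sketch: consume a CrawlResult returned by crawler.run(...)
if result.success:
    print(result.url)
    print(result.markdown[:300] if result.markdown else "No markdown generated")
else:
    print("Crawl failed:", result.error_message)
```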
### Running Crawl4AI as a Local Server 🚀
To run Crawl4AI as a standalone local server, follow these steps:
1. Clone the repository:
```bash
git clone https://github.com/unclecode/crawl4ai.git
```
2. Navigate to the project directory:
```bash
cd crawl4ai
```
3. Open `crawler/config.py` and set your favorite LLM provider and API token.
4. Use Docker to build and run the local server:
```bash
docker build -t crawl4ai .
# For Mac users
# docker build --platform linux/amd64 -t crawl4ai .
docker run -d -p 8000:80 crawl4ai
```
5. Access the application at `http://localhost:8000`.
For more information about how to run Crawl4AI as a local server, please refer to the [GitHub repository](https://github.com/unclecode/crawl4ai).
- CURL Example:
Set `api_token` to your OpenAI API key, or to the key of whichever provider you are using.
```sh
curl -X POST -H "Content-Type: application/json" -d '{"urls":["https://techcrunch.com/"],"provider_model":"openai/gpt-3.5-turbo","api_token":"your_api_token","include_raw_html":true,"forced":false,"extract_blocks_flag":false,"word_count_threshold":10}' http://localhost:8000/crawl
```
Set `extract_blocks_flag` to `true` to have the LLM generate semantically clustered chunks and return them as JSON. Depending on the model and data size, this may take up to a minute; without it, requests typically take 5 to 20 seconds.
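For example (a hedged sketch that mirrors the curl payload above and assumes a locally running server), the same request with the flag enabled looks like this in Python:
```python
import requests

# Hedged sketch: the curl payload above, with extract_blocks_flag enabled
payload = {
    "urls": ["https://techcrunch.com/"],
    "provider_model": "openai/gpt-3.5-turbo",
    "api_token": "your_api_token",
    "include_raw_html": True,
    "forced": False,
    "extract_blocks_flag": True,  # LLM clustering; may take up to a minute
    "word_count_threshold": 10,
}
response = requests.post("http://localhost:8000/crawl", json=payload)
print(response.json())
```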
## Using the Local Server or REST API 🌐
You can also use Crawl4AI through the REST API. This method allows you to send HTTP requests to the Crawl4AI server and receive structured data in response. The base URL for the API is `https://crawl4ai.com/crawl`. If you run the local server, you can use `http://localhost:8000/crawl`. (The port depends on your Docker configuration.)
### Example Usage
To use the REST API, send a POST request to `https://crawl4ai.com/crawl` with the following parameters in the request body.
**Example Request:**
```json
{
    "urls": ["https://www.example.com"],
    "include_raw_html": false,
    "bypass_cache": true,
    "word_count_threshold": 5,
    "extraction_strategy": "CosineStrategy",
    "chunking_strategy": "RegexChunking",
    "css_selector": "p",
    "verbose": true,
    "extraction_strategy_args": {
        "semantic_filter": "finance economy and stock market",
        "word_count_threshold": 20,
        "max_dist": 0.2,
        "linkage_method": "ward",
        "top_k": 3
    },
    "chunking_strategy_args": {
        "patterns": ["\n\n"]
    }
}
```
- Python Example:
```python
import requests

data = {
    "urls": [
        "https://www.nbcnews.com/business"
    ],
    "provider_model": "groq/llama3-70b-8192",
    "include_raw_html": True,
    "bypass_cache": False,
    "extract_blocks": True,
    "word_count_threshold": 10,
    "extraction_strategy": "CosineStrategy",
    "chunking_strategy": "RegexChunking",
    "css_selector": "",
    "verbose": True
}

response = requests.post("http://crawl4ai.uccode.io/crawl", json=data)  # Or http://localhost:8000/crawl if you run locally
if response.status_code == 200:
    result = response.json()["results"][0]
    print("Parsed JSON:")
    print(result["parsed_json"])
    print("\nCleaned HTML:")
    print(result["cleaned_html"])
    print("\nMarkdown:")
    print(result["markdown"])
else:
    print("Error:", response.status_code, response.text)
```
This code sends a POST request to the Crawl4AI server (here `http://crawl4ai.uccode.io/crawl`; use `http://localhost:8000/crawl` if you run locally) with the desired options. The server processes the request and returns the crawled data in JSON format.
**Example Response:**
```json
{
"status": "success",
"data": [
{
"url": "https://www.example.com",
"extracted_content": "...",
"html": "...",
"markdown": "...",
"metadata": {...}
}
]
}
```
The response from the server includes the semantic clusters, cleaned HTML, and markdown representations of the crawled webpage. You can access and use this data in your Python application as needed.
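As a minimal sketch (assuming the response shape shown above; the earlier Python example used a `results` key instead, so check which shape your server version returns), iterating over the items might look like:
```python
# Hedged sketch: walk the items in the example response shape above
for item in response.json().get("data", []):
    print(item["url"])
    print(item["markdown"][:200])
```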
For more information about the available parameters and their descriptions, refer to the [Parameters](#parameters) section.
Make sure to replace `"http://localhost:8000/crawl"` with the appropriate server URL if your Crawl4AI server is running on a different host or port.
Choose the approach that best suits your needs. If you want to integrate Crawl4AI into your existing Python projects, installing it as a library is the way to go. If you prefer to run Crawl4AI as a standalone service and interact with it via API endpoints, running it as a local server using Docker is the recommended approach.
## Python Library Usage 🚀
**Make sure to check `config.py` to set the required environment variables.**
### Quickstart Guide
Create an instance of WebCrawler and call the `warmup()` function.
```python
crawler = WebCrawler()
crawler.warmup()
```
That's it! You can now integrate Crawl4AI into your Python projects and leverage its web crawling capabilities. 🎉
### Understanding 'bypass_cache' and 'include_raw_html' parameters
First crawl (caches the result):
```python
result = crawler.run(url="https://www.nbcnews.com/business")
```
Second crawl (force the page to be crawled again):
```python
result = crawler.run(url="https://www.nbcnews.com/business", bypass_cache=True)
```
💡 Don't forget to set `bypass_cache` to `True` if you want to try different strategies for the same URL; otherwise, the cached result will be returned. You can also set `always_by_pass_cache=True` in the constructor to always bypass the cache.
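For example (a one-line sketch using the constructor flag mentioned above):
```python
# Always bypass the cache, so every run re-crawls the page
crawler = WebCrawler(always_by_pass_cache=True)
```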
Crawl result without raw HTML content:
```python
result = crawler.run(url="https://www.nbcnews.com/business", include_raw_html=False)
```
### Adding a chunking strategy: RegexChunking
Using RegexChunking:
```python
result = crawler.run(
url="https://www.nbcnews.com/business",
chunking_strategy=RegexChunking(patterns=["\n\n"])
)
```
Using NlpSentenceChunking:
```python
result = crawler.run(
url="https://www.nbcnews.com/business",
chunking_strategy=NlpSentenceChunking()
)
```
### Extraction strategy: CosineStrategy
Using CosineStrategy:
```python
result = crawler.run(
url="https://www.nbcnews.com/business",
extraction_strategy=CosineStrategy(
semantic_filter="",
word_count_threshold=10,
max_dist=0.2,
linkage_method="ward",
top_k=3
)
)
```
You can set `semantic_filter` to filter relevant documents before clustering. Documents are filtered based on their cosine similarity to the keyword filter embedding.
```python
result = crawler.run(
url="https://www.nbcnews.com/business",
extraction_strategy=CosineStrategy(
semantic_filter="finance economy and stock market",
word_count_threshold=10,
max_dist=0.2,
linkage_method="ward",
top_k=3
)
)
```
### Using LLMExtractionStrategy
Without instructions:
```python
result = crawler.run(
url="https://www.nbcnews.com/business",
extraction_strategy=LLMExtractionStrategy(
provider="openai/gpt-4o",
api_token=os.getenv('OPENAI_API_KEY')
)
)
```
With instructions:
```python
result = crawler.run(
url="https://www.nbcnews.com/business",
extraction_strategy=LLMExtractionStrategy(
provider="openai/gpt-4o",
api_token=os.getenv('OPENAI_API_KEY'),
instruction="I am interested in only financial news"
)
)
```
### Targeted extraction using CSS selector
Extract only H2 tags:
```python
result = crawler.run(
url="https://www.nbcnews.com/business",
css_selector="h2"
)
```
### Passing JavaScript code to click 'Load More' button
Using JavaScript to click 'Load More' button:
```python
js_code = """
const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More'));
loadMoreButton && loadMoreButton.click();
"""
crawler_strategy = LocalSeleniumCrawlerStrategy(js_code=js_code)
crawler = WebCrawler(crawler_strategy=crawler_strategy, always_by_pass_cache=True)
result = crawler.run(url="https://www.nbcnews.com/business")
```
## Parameters 📖
| Parameter | Description | Required | Default Value |
|-----------------------|-------------------------------------------------------------------------------------------------------|----------|---------------------|
| `css_selector` | The CSS selector to target specific parts of the HTML for extraction. | No | `None` |
| `verbose` | Whether to enable verbose logging. | No | `true` |
## 🛠️ Configuration
Crawl4AI allows you to configure various parameters and settings in the `crawler/config.py` file. Here's an example of how you can adjust the parameters:
```python
import os
from dotenv import load_dotenv
load_dotenv()  # Load environment variables from .env file

# Default provider, ONLY used when the extraction strategy is LLMExtractionStrategy
DEFAULT_PROVIDER = "openai/gpt-4-turbo"

# Provider-model dictionary, ONLY used when the extraction strategy is LLMExtractionStrategy
PROVIDER_MODELS = {
    "ollama/llama3": "no-token-needed",  # Ollama models need no API token
    "groq/llama3-70b-8192": os.getenv("GROQ_API_KEY"),
    "groq/llama3-8b-8192": os.getenv("GROQ_API_KEY"),
    "openai/gpt-3.5-turbo": os.getenv("OPENAI_API_KEY"),
    "openai/gpt-4-turbo": os.getenv("OPENAI_API_KEY"),
    "openai/gpt-4o": os.getenv("OPENAI_API_KEY"),
    "anthropic/claude-3-haiku-20240307": os.getenv("ANTHROPIC_API_KEY"),
    "anthropic/claude-3-opus-20240229": os.getenv("ANTHROPIC_API_KEY"),
    "anthropic/claude-3-sonnet-20240229": os.getenv("ANTHROPIC_API_KEY"),
}

# Chunk token threshold
CHUNK_TOKEN_THRESHOLD = 1000

# Threshold for the minimum number of words in an HTML tag to be considered
MIN_WORD_THRESHOLD = 5
```
In the `crawler/config.py` file, you can:
- Set the default provider using the `DEFAULT_PROVIDER` variable.
- Add or modify the provider-model dictionary (`PROVIDER_MODELS`) to include your desired providers and their corresponding API keys. Crawl4AI supports various providers such as Groq, OpenAI, Anthropic, and more. You can add any provider supported by LiteLLM, as well as Ollama.
- Adjust the `CHUNK_TOKEN_THRESHOLD` value to control the splitting of web content into chunks for parallel processing. A higher value means fewer chunks and faster processing, but it may cause issues with weaker LLMs during extraction.
- Modify the `MIN_WORD_THRESHOLD` value to set the minimum number of words an HTML tag must contain to be considered a meaningful block.

REMEMBER: You only need to set API keys for providers if you choose LLMExtractionStrategy as the extraction strategy. If you choose CosineStrategy, you don't need to set any API keys.

Make sure to set the appropriate API key for each provider in the `PROVIDER_MODELS` dictionary. You can either provide the API key directly or store it securely in an environment variable.

Remember to update the `crawler/config.py` file based on your specific requirements and the providers you want to use with Crawl4AI.

## Chunking Strategies 📚
### RegexChunking
`RegexChunking` is a text chunking strategy that splits a given text into smaller parts using regular expressions. This is useful for preparing large texts for processing by language models, ensuring they are divided into manageable segments.
**Constructor Parameters:**
- `patterns` (list, optional): A list of regular expression patterns used to split the text. Default is to split by double newlines (`['\n\n']`).
**Example usage:**
```python
chunker = RegexChunking(patterns=[r'\n\n', r'\. '])
chunks = chunker.chunk("This is a sample text. It will be split into chunks.")
```
### NlpSentenceChunking
`NlpSentenceChunking` uses a natural language processing model to chunk a given text into sentences. This approach leverages SpaCy to accurately split text based on sentence boundaries.
**Constructor Parameters:**
- `model` (str, optional): The SpaCy model to use for sentence detection. Default is `'en_core_web_sm'`.
**Example usage:**
```python
chunker = NlpSentenceChunking(model='en_core_web_sm')
chunks = chunker.chunk("This is a sample text. It will be split into sentences.")
```
### TopicSegmentationChunking
`TopicSegmentationChunking` uses the TextTiling algorithm to segment a given text into topic-based chunks. This method identifies thematic boundaries in the text.
**Constructor Parameters:**
- `num_keywords` (int, optional): The number of keywords to extract for each topic segment. Default is `3`.
**Example usage:**
```python
chunker = TopicSegmentationChunking(num_keywords=3)
chunks = chunker.chunk("This is a sample text. It will be split into topic-based segments.")
```
### FixedLengthWordChunking
`FixedLengthWordChunking` splits a given text into chunks of fixed length, based on the number of words.
**Constructor Parameters:**
- `chunk_size` (int, optional): The number of words in each chunk. Default is `100`.
**Example usage:**
```python
chunker = FixedLengthWordChunking(chunk_size=100)
chunks = chunker.chunk("This is a sample text. It will be split into fixed-length word chunks.")
```
### SlidingWindowChunking
`SlidingWindowChunking` uses a sliding window approach to chunk a given text. Each chunk has a fixed length, and the window slides by a specified step size.
**Constructor Parameters:**
- `window_size` (int, optional): The number of words in each chunk. Default is `100`.
- `step` (int, optional): The number of words to slide the window. Default is `50`.
**Example usage:**
```python
chunker = SlidingWindowChunking(window_size=100, step=50)
chunks = chunker.chunk("This is a sample text. It will be split using a sliding window approach.")
```
## Extraction Strategies 🧠
### NoExtractionStrategy
`NoExtractionStrategy` is a basic extraction strategy that returns the entire HTML content without any modification. It is useful for cases where no specific extraction is required.
**Constructor Parameters:**
None.
**Example usage:**
```python
extractor = NoExtractionStrategy()
extracted_content = extractor.extract(url, html)
```
### LLMExtractionStrategy
`LLMExtractionStrategy` uses a Language Model (LLM) to extract meaningful blocks or chunks from the given HTML content. This strategy leverages an external provider for language model completions.
**Constructor Parameters:**
- `provider` (str, optional): The provider to use for the language model completions. Default is `DEFAULT_PROVIDER` (e.g., openai/gpt-4).
- `api_token` (str, optional): The API token for the provider. If not provided, it will try to load from the environment variable `OPENAI_API_KEY`.
- `instruction` (str, optional): An instruction to guide the LLM on how to perform the extraction. This allows users to specify the type of data they are interested in or set the tone of the response. Default is `None`.
**Example usage:**
```python
extractor = LLMExtractionStrategy(provider='openai/gpt-4o', api_token='your_api_token', instruction='Extract only news about AI.')
extracted_content = extractor.extract(url, html)
```
### CosineStrategy
`CosineStrategy` uses hierarchical clustering based on cosine similarity to extract clusters of text from the given HTML content. This strategy is suitable for identifying related content sections.
**Constructor Parameters:**
- `semantic_filter` (str, optional): A string containing keywords for filtering relevant documents before clustering. If provided, documents are filtered based on their cosine similarity to the keyword filter embedding. Default is `None`.
- `word_count_threshold` (int, optional): Minimum number of words per cluster. Default is `20`.
- `max_dist` (float, optional): The maximum cophenetic distance on the dendrogram to form clusters. Default is `0.2`.
- `linkage_method` (str, optional): The linkage method for hierarchical clustering. Default is `'ward'`.
- `top_k` (int, optional): Number of top categories to extract. Default is `3`.
- `model_name` (str, optional): The model name for embedding generation. Default is `'BAAI/bge-small-en-v1.5'`.
**Example usage:**
```python
extractor = CosineStrategy(semantic_filter='artificial intelligence', word_count_threshold=10, max_dist=0.2, linkage_method='ward', top_k=3, model_name='BAAI/bge-small-en-v1.5')
extracted_content = extractor.extract(url, html)
```
### TopicExtractionStrategy
`TopicExtractionStrategy` uses the TextTiling algorithm to segment the HTML content into topics and extracts keywords for each segment. This strategy is useful for identifying and summarizing thematic content.
**Constructor Parameters:**
- `num_keywords` (int, optional): Number of keywords to represent each topic segment. Default is `3`.
**Example usage:**
```python
extractor = TopicExtractionStrategy(num_keywords=3)
extracted_content = extractor.extract(url, html)
```
## Contributing 🤝
## Contact 📧
If you have any questions, suggestions, or feedback, please feel free to reach out to us:
- GitHub: [unclecode](https://github.com/unclecode)
- Twitter: [@unclecode](https://twitter.com/unclecode)
- Website: [crawl4ai.com](https://crawl4ai.com)
Let's work together to make the web more accessible and useful for AI applications! 💪🌐🤖