diff --git a/.gitignore b/.gitignore index d8a63c25..fd1fd196 100644 --- a/.gitignore +++ b/.gitignore @@ -171,4 +171,4 @@ test_pad*.py Crawl4AI.egg-info/ requirements0.txt -a.txt \ No newline at end of file +a.txt diff --git a/README.md b/README.md index d9297e5e..cce04a99 100644 --- a/README.md +++ b/README.md @@ -6,8 +6,9 @@ [](https://github.com/unclecode/crawl4ai/pulls) [](https://github.com/unclecode/crawl4ai/blob/main/LICENSE) -Crawl4AI is a powerful, free web crawling service designed to extract useful information from web pages and make it accessible for large language models (LLMs) and AI applications. ππ +Crawl4AI has one clear task: to simplify crawling and extract useful information from web pages, making it accessible for large language models (LLMs) and AI applications. ππ +<<<<<<< HEAD ## π New Changes Will be Released Soon - π 10x faster!! @@ -23,8 +24,104 @@ Crawl4AI is a powerful, free web crawling service designed to extract useful inf - π· Image Captioning: Incorporating image captioning capabilities to extract descriptions from images. - πΎ Embedding Vector Data: Generate and store embedding data for each crawled website. - π Semantic Search Engine: Building a semantic search engine that fetches content, performs vector search similarity, and generates labeled chunk data based on user queries and URLs. +======= +[](https://colab.research.google.com/drive/1wz8u30rvbq6Scodye9AGCw8Qg_Z8QGsk) + +## Recent Changes + +- π 10x faster!! +- π Execute custom JavaScript before crawling! +- π€ Colab friendly! +- π Chunking strategies: topic-based, regex, sentence, and more! +- π§ Extraction strategies: cosine clustering, LLM, and more! +- π― CSS selector support +- π Pass instructions/keywords to refine extraction + +## Power and Simplicity of Crawl4AI π + +To show the simplicity take a look at the first example: + +```python +from crawl4ai import WebCrawler + +# Create the WebCrawler instance +crawler = WebCrawler() + +# Run the crawler with keyword filtering and CSS selector +result = crawler.run(url="https://www.nbcnews.com/business") +print(result) # {url, html, markdown, extracted_content, metadata} +``` + +Now let's try a complex task. Below is an example of how you can execute JavaScript, filter data using keywords, and use a CSS selector to extract specific contentβall in one go! + +1. Instantiate a WebCrawler object. +2. Execute custom JavaScript to click a "Load More" button. +3. Extract semantical chunks of content and filter the data to include only content related to technology. +4. Use a CSS selector to extract only paragraphs (`
<p>` tags).
+
+```python
+# Import necessary modules
+from crawl4ai import WebCrawler
+from crawl4ai.chunking_strategy import *
+from crawl4ai.extraction_strategy import *
+from crawl4ai.crawler_strategy import *
+
+# Define the JavaScript code to click the "Load More" button
+js_code = """
+const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More'));
+loadMoreButton && loadMoreButton.click();
+"""
+
+# Define the crawling strategy
+crawler_strategy = LocalSeleniumCrawlerStrategy(js_code=js_code)
+
+# Create the WebCrawler instance with the defined strategy
+crawler = WebCrawler(crawler_strategy=crawler_strategy)
+
+# Run the crawler with keyword filtering and CSS selector
+result = crawler.run(
+ url="https://www.nbcnews.com/business",
+ extraction_strategy=CosineStrategy(
+ semantic_filter="technology",
+ ),
+)
+
+# Run the crawler with LLM extraction strategy
+result = crawler.run(
+ url="https://www.nbcnews.com/business",
+ extraction_strategy=LLMExtractionStrategy(
+ provider="openai/gpt-4o",
+ api_token=os.getenv('OPENAI_API_KEY'),
+ instruction="Extract only content related to technology"
+ ),
+ css_selector="p"
+)
+
+# Display the extracted result
+print(result)
+```
+
+With Crawl4AI, you can perform advanced web crawling and data extraction tasks with just a few lines of code. This example demonstrates how you can harness the power of Crawl4AI to simplify your workflow and get the data you need efficiently.
+
+---
+
+*Continue reading to learn more about the features, installation process, usage, and more.*
+
+
+## Table of Contents
+
+1. [Features](#features-)
+2. [Installation](#installation-)
+3. [REST API/Local Server](#using-the-local-server-or-rest-api-)
+4. [Python Library Usage](#python-library-usage-)
+5. [Parameters](#parameters-)
+6. [Chunking Strategies](#chunking-strategies-)
+7. [Extraction Strategies](#extraction-strategies-)
+8. [Contributing](#contributing-)
+9. [License](#license-)
+10. [Contact](#contact-)
+>>>>>>> new-release-0.0.2-no-spacy
-For more details, refer to the [CHANGELOG.md](https://github.com/unclecode/crawl4ai/edit/main/CHANGELOG.md) file.
## Features β¨
@@ -33,223 +130,372 @@ For more details, refer to the [CHANGELOG.md](https://github.com/unclecode/crawl
- π Supports crawling multiple URLs simultaneously
- π Replace media tags with ALT.
- π Completely free to use and open-source
-
-## Getting Started π
-
-To get started with Crawl4AI, simply visit our web application at [https://crawl4ai.uccode.io](https://crawl4ai.uccode.io) (Available now!) and enter the URL(s) you want to crawl. The application will process the URLs and provide you with the extracted data in various formats.
+- π Execute custom JavaScript before crawling
+- π Chunking strategies: topic-based, regex, sentence, and more
+- π§ Extraction strategies: cosine clustering, LLM, and more
+- π― CSS selector support
+- π Pass instructions/keywords to refine extraction
## Installation π»
-There are two ways to use Crawl4AI: as a library in your Python projects or as a standalone local server.
-
-### Using Crawl4AI as a Library π
+There are three ways to use Crawl4AI:
+1. As a library (Recommended)
+2. As a local server (Docker) or using the REST API
+3. As a Google Colab notebook. [](https://colab.research.google.com/drive/1wz8u30rvbq6Scodye9AGCw8Qg_Z8QGsk)
To install Crawl4AI as a library, follow these steps:
1. Install the package from GitHub:
-```sh
-pip install git+https://github.com/unclecode/crawl4ai.git
+```bash
+virtualenv venv
+source venv/bin/activate
+pip install "crawl4ai[all] @ git+https://github.com/unclecode/crawl4ai.git"
```
-Alternatively, you can clone the repository and install the package locally:
-```sh
+ π‘ It is recommended to run the following CLI command once to download the required models. This step is optional, but it boosts the crawler's performance and speed.
+
+ crawl4ai-download-models
+
+2. Alternatively, you can clone the repository and install the package locally:
+```bash
virtualenv venv
source venv/bin/activate
git clone https://github.com/unclecode/crawl4ai.git
cd crawl4ai
-pip install -e .
+pip install -e .[all]
```
-2. Import the necessary modules in your Python script:
-```python
-from crawl4ai.web_crawler import WebCrawler
-from crawl4ai.models import UrlModel
-import os
-
-crawler = WebCrawler(db_path='crawler_data.db')
-
-# Single page crawl
-single_url = UrlModel(url='https://kidocode.com', forced=False)
-result = crawl4ai.fetch_page(
- single_url,
- provider= "openai/gpt-3.5-turbo",
- api_token = os.getenv('OPENAI_API_KEY'),
- # Set `extract_blocks_flag` to True to enable the LLM to generate semantically clustered chunks
- # and return them as JSON. Depending on the model and data size, this may take up to 1 minute.
- # Without this setting, it will take between 5 to 20 seconds.
- extract_blocks_flag=False
- word_count_threshold=5 # Minimum word count for a HTML tag to be considered as a worthy block
-)
-print(result.model_dump())
-
-# Multiple page crawl
-urls = [
- UrlModel(url='http://example.com', forced=False),
- UrlModel(url='http://example.org', forced=False)
-]
-results = crawl4ai.fetch_pages(
- urls,
- provider= "openai/gpt-3.5-turbo",
- api_token = os.getenv('OPENAI_API_KEY'),
- extract_blocks_flag=True,
- word_count_threshold=5
-)
-
-for res in results:
- print(res.model_dump())
-```
-
-Running for the first time will download the chrome driver for selenium. Also creates a SQLite database file `crawler_data.db` in the current directory. This file will store the crawled data for future reference.
-
-The response model is a `CrawlResponse` object that contains the following attributes:
-```python
-class CrawlResult(BaseModel):
- url: str
- html: str
- success: bool
- cleaned_html: str = None
- markdown: str = None
- parsed_json: str = None
- error_message: str = None
-```
-
-### Running Crawl4AI as a Local Server π
-
-To run Crawl4AI as a standalone local server, follow these steps:
-
-1. Clone the repository:
-```sh
-git clone https://github.com/unclecode/crawl4ai.git
-```
-
-2. Navigate to the project directory:
-```sh
-cd crawl4ai
-```
-
-3. Open `crawler/config.py` and set your favorite LLM provider and API token.
-
-4. Build the Docker image:
-```sh
+3. Use Docker to run the local server:
+```bash
docker build -t crawl4ai .
-```
- For Mac users, use the following command instead:
-```sh
-docker build --platform linux/amd64 -t crawl4ai .
-```
-
-5. Run the Docker container:
-```sh
+# For Mac users
+# docker build --platform linux/amd64 -t crawl4ai .
docker run -d -p 8000:80 crawl4ai
```
-6. Access the application at `http://localhost:8000`.
+For more information about how to run Crawl4AI as a local server, please refer to the [GitHub repository](https://github.com/unclecode/crawl4ai).
-- CURL Example:
-Set the api_token to your OpenAI API key or any other provider you are using.
-```sh
-curl -X POST -H "Content-Type: application/json" -d '{"urls":["https://techcrunch.com/"],"provider_model":"openai/gpt-3.5-turbo","api_token":"your_api_token","include_raw_html":true,"forced":false,"extract_blocks_flag":false,"word_count_threshold":10}' http://localhost:8000/crawl
-```
-Set `extract_blocks_flag` to True to enable the LLM to generate semantically clustered chunks and return them as JSON. Depending on the model and data size, this may take up to 1 minute. Without this setting, it will take between 5 to 20 seconds.
+## Using the Local Server or REST API π
-- Python Example:
-```python
-import requests
-import os
+You can also use Crawl4AI through the REST API. This method lets you send HTTP requests to the Crawl4AI server and receive structured data in response. The base URL for the API is `https://crawl4ai.com/crawl`; if you run the local server, use `http://localhost:8000/crawl` instead (the port depends on your Docker configuration).
-url = "http://localhost:8000/crawl" # Replace with the appropriate server URL
-data = {
- "urls": [
- "https://example.com"
- ],
- "provider_model": "groq/llama3-70b-8192",
- "api_token": "your_api_token",
- "include_raw_html": true,
- "forced": false,
- # Set `extract_blocks_flag` to True to enable the LLM to generate semantically clustered chunks
- # and return them as JSON. Depending on the model and data size, this may take up to 1 minute.
- # Without this setting, it will take between 5 to 20 seconds.
- "extract_blocks_flag": False,
- "word_count_threshold": 5
+### Example Usage
+
+To use the REST API, send a POST request to `https://crawl4ai.com/crawl` with the following parameters in the request body.
+
+**Example Request:**
+```json
+{
+ "urls": ["https://www.nbcnews.com/business"],
+ "include_raw_html": false,
+ "bypass_cache": true,
+ "word_count_threshold": 5,
+ "extraction_strategy": "CosineStrategy",
+ "chunking_strategy": "RegexChunking",
+ "css_selector": "p",
+ "verbose": true,
+ "extraction_strategy_args": {
+ "semantic_filter": "finance economy and stock market",
+ "word_count_threshold": 20,
+ "max_dist": 0.2,
+ "linkage_method": "ward",
+ "top_k": 3
+ },
+ "chunking_strategy_args": {
+ "patterns": ["\n\n"]
+ }
}
-
-response = requests.post(url, json=data)
-
-if response.status_code == 200:
- result = response.json()["results"][0]
- print("Parsed JSON:")
- print(result["parsed_json"])
- print("\nCleaned HTML:")
- print(result["cleaned_html"])
- print("\nMarkdown:")
- print(result["markdown"])
-else:
- print("Error:", response.status_code, response.text)
```
-This code sends a POST request to the Crawl4AI server running on localhost, specifying the target URL (`https://example.com`) and the desired options (`grq_api_token`, `include_raw_html`, and `forced`). The server processes the request and returns the crawled data in JSON format.
+**Example Response:**
+```json
+{
+ "status": "success",
+ "data": [
+ {
+ "url": "https://www.nbcnews.com/business",
+ "extracted_content": "...",
+ "html": "...",
+ "markdown": "...",
+ "metadata": {...}
+ }
+ ]
+}
+```
-The response from the server includes the parsed JSON, cleaned HTML, and markdown representations of the crawled webpage. You can access and use this data in your Python application as needed.
+For more information about the available parameters and their descriptions, refer to the [Parameters](#parameters-) section.
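+
+As a quick sketch of a client, you can send the same request to the local server from Python with the `requests` package (assuming the Docker setup above is listening on port 8000):
+
+```python
+import requests
+
+# Request body mirroring the example above (not an exhaustive list of parameters)
+data = {
+ "urls": ["https://www.nbcnews.com/business"],
+ "word_count_threshold": 5,
+ "extraction_strategy": "CosineStrategy",
+ "chunking_strategy": "RegexChunking",
+ "css_selector": "p",
+ "extraction_strategy_args": {"semantic_filter": "finance economy and stock market"}
+}
+
+response = requests.post("http://localhost:8000/crawl", json=data)
+response.raise_for_status()
+
+# The response mirrors the example above: a list of result objects under "data"
+for item in response.json()["data"]:
+ print(item["url"])
+ print(item["markdown"][:300]) # preview the first 300 characters of the Markdown
+```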
-Make sure to replace `"http://localhost:8000/crawl"` with the appropriate server URL if your Crawl4AI server is running on a different host or port.
-Choose the approach that best suits your needs. If you want to integrate Crawl4AI into your existing Python projects, installing it as a library is the way to go. If you prefer to run Crawl4AI as a standalone service and interact with it via API endpoints, running it as a local server using Docker is the recommended approach.
+## Python Library Usage π
-**Make sure to check the config.py tp set required environment variables.**
+π₯ A great way to try out Crawl4AI is to run `quickstart.py` in the `docs/examples` directory. This script demonstrates how to use Crawl4AI to crawl a website and extract content from it.
-That's it! You can now integrate Crawl4AI into your Python projects and leverage its web crawling capabilities. π
+### Quickstart Guide
-## π Parameters
+Create an instance of WebCrawler and call the `warmup()` function.
+```python
+crawler = WebCrawler()
+crawler.warmup()
+```
-| Parameter | Description | Required | Default Value |
-|----------------------|-------------------------------------------------------------------------------------------------|----------|---------------|
-| `urls` | A list of URLs to crawl and extract data from. | Yes | - |
-| `provider_model` | The provider and model to use for extracting relevant information (e.g., "groq/llama3-70b-8192"). | Yes | - |
-| `api_token` | Your API token for the specified provider. | Yes | - |
-| `include_raw_html` | Whether to include the raw HTML content in the response. | No | `false` |
-| `forced` | Whether to force a fresh crawl even if the URL has been previously crawled. | No | `false` |
-| `extract_blocks_flag`| Whether to extract semantical blocks of text from the HTML. | No | `false` |
-| `word_count_threshold` | The minimum number of words a block must contain to be considered meaningful (minimum value is 5). | No | `5` |
+### Understanding 'bypass_cache' and 'include_raw_html' parameters
-## π οΈ Configuration
-Crawl4AI allows you to configure various parameters and settings in the `crawler/config.py` file. Here's an example of how you can adjust the parameters:
+First crawl (caches the result):
+```python
+result = crawler.run(url="https://www.nbcnews.com/business")
+```
+
+Second crawl (force a fresh crawl by bypassing the cache):
+```python
+result = crawler.run(url="https://www.nbcnews.com/business", bypass_cache=True)
+```
+ π‘ Don't forget to set `bypass_cache` to True if you want to try different strategies for the same URL; otherwise, the cached result will be returned. You can also set `always_by_pass_cache=True` in the constructor to always bypass the cache.
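+
+For instance, to always bypass the cache you can pass the flag when constructing the crawler (a small sketch using the same `WebCrawler` shown above):
+```python
+# Every subsequent call to run() will bypass the cache
+crawler = WebCrawler(always_by_pass_cache=True)
+crawler.warmup()
+result = crawler.run(url="https://www.nbcnews.com/business")
+```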
+
+Crawl result without raw HTML content:
+```python
+result = crawler.run(url="https://www.nbcnews.com/business", include_raw_html=False)
+```
+
+### Adding a chunking strategy: RegexChunking
+
+Using RegexChunking:
+```python
+result = crawler.run(
+ url="https://www.nbcnews.com/business",
+ chunking_strategy=RegexChunking(patterns=["\n\n"])
+)
+```
+
+Using NlpSentenceChunking:
+```python
+result = crawler.run(
+ url="https://www.nbcnews.com/business",
+ chunking_strategy=NlpSentenceChunking()
+)
+```
+
+### Extraction strategy: CosineStrategy
+
+So far, the extracted content is simply the result of chunking. To extract meaningful content, you can use extraction strategies. These strategies cluster consecutive chunks into meaningful blocks while preserving their order in the HTML, which makes the output well suited to RAG applications and semantic search.
+
+Using CosineStrategy:
+```python
+result = crawler.run(
+ url="https://www.nbcnews.com/business",
+ extraction_strategy=CosineStrategy(
+ semantic_filter="",
+ word_count_threshold=10,
+ max_dist=0.2,
+ linkage_method="ward",
+ top_k=3
+ )
+)
+```
+
+You can set `semantic_filter` to filter relevant documents before clustering. Documents are filtered based on their cosine similarity to the keyword filter embedding.
```python
-import os
-from dotenv import load_dotenv
-
-load_dotenv() # Load environment variables from .env file
-
-# Default provider
-DEFAULT_PROVIDER = "openai/gpt-4-turbo"
-
-# Provider-model dictionary
-PROVIDER_MODELS = {
- "groq/llama3-70b-8192": os.getenv("GROQ_API_KEY"),
- "groq/llama3-8b-8192": os.getenv("GROQ_API_KEY"),
- "openai/gpt-3.5-turbo": os.getenv("OPENAI_API_KEY"),
- "openai/gpt-4-turbo": os.getenv("OPENAI_API_KEY"),
- "anthropic/claude-3-haiku-20240307": os.getenv("ANTHROPIC_API_KEY"),
- "anthropic/claude-3-opus-20240229": os.getenv("ANTHROPIC_API_KEY"),
- "anthropic/claude-3-sonnet-20240229": os.getenv("ANTHROPIC_API_KEY"),
-}
-
-# Chunk token threshold
-CHUNK_TOKEN_THRESHOLD = 1000
-
-# Threshold for the minimum number of words in an HTML tag to be considered
-MIN_WORD_THRESHOLD = 5
+result = crawler.run(
+ url="https://www.nbcnews.com/business",
+ extraction_strategy=CosineStrategy(
+ semantic_filter="finance economy and stock market",
+ word_count_threshold=10,
+ max_dist=0.2,
+ linkage_method="ward",
+ top_k=3
+ )
+)
```
-In the `crawler/config.py` file, you can:
-- Set the default provider using the `DEFAULT_PROVIDER` variable.
-- Add or modify the provider-model dictionary (`PROVIDER_MODELS`) to include your desired providers and their corresponding API keys. Crawl4AI supports various providers such as Groq, OpenAI, Anthropic, and more. You can add any provider supported by LiteLLM, as well as Ollama.
-- Adjust the `CHUNK_TOKEN_THRESHOLD` value to control the splitting of web content into chunks for parallel processing. A higher value means fewer chunks and faster processing, but it may cause issues with weaker LLMs during extraction.
-- Modify the `MIN_WORD_THRESHOLD` value to set the minimum number of words an HTML tag must contain to be considered a meaningful block.
+### Using LLMExtractionStrategy
-Make sure to set the appropriate API keys for each provider in the `PROVIDER_MODELS` dictionary. You can either directly provide the API key or use environment variables to store them securely.
+Without instructions:
+```python
+result = crawler.run(
+ url="https://www.nbcnews.com/business",
+ extraction_strategy=LLMExtractionStrategy(
+ provider="openai/gpt-4o",
+ api_token=os.getenv('OPENAI_API_KEY')
+ )
+)
+```
-Remember to update the `crawler/config.py` file based on your specific requirements and the providers you want to use with Crawl4AI.
+With instructions:
+```python
+result = crawler.run(
+ url="https://www.nbcnews.com/business",
+ extraction_strategy=LLMExtractionStrategy(
+ provider="openai/gpt-4o",
+ api_token=os.getenv('OPENAI_API_KEY'),
+ instruction="I am interested in only financial news"
+ )
+)
+```
+
+### Targeted extraction using CSS selector
+
+Extract only H2 tags:
+```python
+result = crawler.run(
+ url="https://www.nbcnews.com/business",
+ css_selector="h2"
+)
+```
+
+### Passing JavaScript code to click 'Load More' button
+
+Using JavaScript to click 'Load More' button:
+```python
+js_code = """
+const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More'));
+loadMoreButton && loadMoreButton.click();
+"""
+crawler_strategy = LocalSeleniumCrawlerStrategy(js_code=js_code)
+crawler = WebCrawler(crawler_strategy=crawler_strategy, always_by_pass_cache=True)
+result = crawler.run(url="https://www.nbcnews.com/business")
+```
+
+## Parameters π
+
+| Parameter | Description | Required | Default Value |
+|-----------------------|-------------------------------------------------------------------------------------------------------|----------|---------------------|
+| `urls` | A list of URLs to crawl and extract data from. | Yes | - |
+| `include_raw_html` | Whether to include the raw HTML content in the response. | No | `false` |
+| `bypass_cache` | Whether to force a fresh crawl even if the URL has been previously crawled. | No | `false` |
+| `word_count_threshold`| The minimum number of words a block must contain to be considered meaningful (minimum value is 5). | No | `5` |
+| `extraction_strategy` | The strategy to use for extracting content from the HTML (e.g., "CosineStrategy"). | No | `NoExtractionStrategy` |
+| `chunking_strategy` | The strategy to use for chunking the text before processing (e.g., "RegexChunking"). | No | `RegexChunking` |
+| `css_selector` | The CSS selector to target specific parts of the HTML for extraction. | No | `None` |
+| `verbose` | Whether to enable verbose logging. | No | `true` |
+
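+When using the Python library, most of these parameters map onto keyword arguments of `crawler.run()` (which takes a single `url` rather than a `urls` list). As a rough sketch, assuming `word_count_threshold` and `verbose` are accepted by `run()` as they are by the REST API:
+
+```python
+result = crawler.run(
+ url="https://www.nbcnews.com/business",
+ word_count_threshold=10, # skip very short blocks
+ extraction_strategy=CosineStrategy(semantic_filter="technology"),
+ chunking_strategy=RegexChunking(patterns=["\n\n"]),
+ css_selector="p",
+ bypass_cache=True,
+ verbose=True # assumed keyword, mirroring the REST API parameter
+)
+```
+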
+## Chunking Strategies π
+
+### RegexChunking
+
+`RegexChunking` is a text chunking strategy that splits a given text into smaller parts using regular expressions. This is useful for preparing large texts for processing by language models, ensuring they are divided into manageable segments.
+
+**Constructor Parameters:**
+- `patterns` (list, optional): A list of regular expression patterns used to split the text. Default is to split by double newlines (`['\n\n']`).
+
+**Example usage:**
+```python
+chunker = RegexChunking(patterns=[r'\n\n', r'\. '])
+chunks = chunker.chunk("This is a sample text. It will be split into chunks.")
+```
+
+### NlpSentenceChunking
+
+`NlpSentenceChunking` uses a natural language processing tokenizer to chunk a given text into sentences. This approach leverages NLTK's Punkt tokenizer to split text accurately on sentence boundaries.
+
+**Constructor Parameters:**
+- None.
+
+**Example usage:**
+```python
+chunker = NlpSentenceChunking()
+chunks = chunker.chunk("This is a sample text. It will be split into sentences.")
+```
+
+### TopicSegmentationChunking
+
+`TopicSegmentationChunking` uses the TextTiling algorithm to segment a given text into topic-based chunks. This method identifies thematic boundaries in the text.
+
+**Constructor Parameters:**
+- `num_keywords` (int, optional): The number of keywords to extract for each topic segment. Default is `3`.
+
+**Example usage:**
+```python
+chunker = TopicSegmentationChunking(num_keywords=3)
+chunks = chunker.chunk("This is a sample text. It will be split into topic-based segments.")
+```
+
+### FixedLengthWordChunking
+
+`FixedLengthWordChunking` splits a given text into chunks of fixed length, based on the number of words.
+
+**Constructor Parameters:**
+- `chunk_size` (int, optional): The number of words in each chunk. Default is `100`.
+
+**Example usage:**
+```python
+chunker = FixedLengthWordChunking(chunk_size=100)
+chunks = chunker.chunk("This is a sample text. It will be split into fixed-length word chunks.")
+```
+
+### SlidingWindowChunking
+
+`SlidingWindowChunking` uses a sliding window approach to chunk a given text. Each chunk has a fixed length, and the window slides by a specified step size.
+
+**Constructor Parameters:**
+- `window_size` (int, optional): The number of words in each chunk. Default is `100`.
+- `step` (int, optional): The number of words to slide the window. Default is `50`.
+
+**Example usage:**
+```python
+chunker = SlidingWindowChunking(window_size=100, step=50)
+chunks = chunker.chunk("This is a sample text. It will be split using a sliding window approach.")
+```
+
+## Extraction Strategies π§
+
+### NoExtractionStrategy
+
+`NoExtractionStrategy` is a basic extraction strategy that returns the entire HTML content without any modification. It is useful for cases where no specific extraction is required.
+
+**Constructor Parameters:**
+None.
+
+**Example usage:**
+```python
+extractor = NoExtractionStrategy()
+extracted_content = extractor.extract(url, html)
+```
+
+### LLMExtractionStrategy
+
+`LLMExtractionStrategy` uses a Language Model (LLM) to extract meaningful blocks or chunks from the given HTML content. This strategy leverages an external provider for language model completions.
+
+**Constructor Parameters:**
+- `provider` (str, optional): The provider to use for the language model completions. Default is `DEFAULT_PROVIDER` (e.g., `openai/gpt-4-turbo`).
+- `api_token` (str, optional): The API token for the provider. If not provided, it will try to load from the environment variable `OPENAI_API_KEY`.
+- `instruction` (str, optional): An instruction to guide the LLM on how to perform the extraction. This allows users to specify the type of data they are interested in or set the tone of the response. Default is `None`.
+
+**Example usage:**
+```python
+extractor = LLMExtractionStrategy(provider='openai', api_token='your_api_token', instruction='Extract only news about AI.')
+extracted_content = extractor.extract(url, html)
+```
+
+### CosineStrategy
+
+`CosineStrategy` uses hierarchical clustering based on cosine similarity to extract clusters of text from the given HTML content. This strategy is suitable for identifying related content sections.
+
+**Constructor Parameters:**
+- `semantic_filter` (str, optional): A string containing keywords for filtering relevant documents before clustering. If provided, documents are filtered based on their cosine similarity to the keyword filter embedding. Default is `None`.
+- `word_count_threshold` (int, optional): Minimum number of words per cluster. Default is `20`.
+- `max_dist` (float, optional): The maximum cophenetic distance on the dendrogram to form clusters. Default is `0.2`.
+- `linkage_method` (str, optional): The linkage method for hierarchical clustering. Default is `'ward'`.
+- `top_k` (int, optional): Number of top categories to extract. Default is `3`.
+- `model_name` (str, optional): The model name for embedding generation. Default is `'BAAI/bge-small-en-v1.5'`.
+
+**Example usage:**
+```python
+extractor = CosineStrategy(semantic_filter='finance rental prices', word_count_threshold=10, max_dist=0.2, linkage_method='ward', top_k=3, model_name='BAAI/bge-small-en-v1.5')
+extracted_content = extractor.extract(url, html)
+```
+
+### TopicExtractionStrategy
+
+`TopicExtractionStrategy` uses the TextTiling algorithm to segment the HTML content into topics and extracts keywords for each segment. This strategy is useful for identifying and summarizing thematic content.
+
+**Constructor Parameters:**
+- `num_keywords` (int, optional): Number of keywords to represent each topic segment. Default is `3`.
+
+**Example usage:**
+```python
+extractor = TopicExtractionStrategy(num_keywords=3)
+extracted_content = extractor.extract(url, html)
+```
## Contributing π€
@@ -273,5 +519,6 @@ If you have any questions, suggestions, or feedback, please feel free to reach o
- GitHub: [unclecode](https://github.com/unclecode)
- Twitter: [@unclecode](https://twitter.com/unclecode)
+- Website: [crawl4ai.com](https://crawl4ai.com)
Let's work together to make the web more accessible and useful for AI applications! πͺππ€
diff --git a/crawl4ai/chunking_strategy.py b/crawl4ai/chunking_strategy.py
new file mode 100644
index 00000000..6ece75e3
--- /dev/null
+++ b/crawl4ai/chunking_strategy.py
@@ -0,0 +1,105 @@
+from abc import ABC, abstractmethod
+import re
+from collections import Counter
+import string
+from .model_loader import load_nltk_punkt
+
+# Define the abstract base class for chunking strategies
+class ChunkingStrategy(ABC):
+
+ @abstractmethod
+ def chunk(self, text: str) -> list:
+ """
+ Abstract method to chunk the given text.
+ """
+ pass
+
+# Regex-based chunking
+class RegexChunking(ChunkingStrategy):
+ def __init__(self, patterns=None):
+ if patterns is None:
+ patterns = [r'\n\n'] # Default split pattern
+ self.patterns = patterns
+
+ def chunk(self, text: str) -> list:
+ paragraphs = [text]
+ for pattern in self.patterns:
+ new_paragraphs = []
+ for paragraph in paragraphs:
+ new_paragraphs.extend(re.split(pattern, paragraph))
+ paragraphs = new_paragraphs
+ return paragraphs
+
+# NLP-based sentence chunking
+class NlpSentenceChunking(ChunkingStrategy):
+ def __init__(self):
+ load_nltk_punkt()
+ pass
+
+ def chunk(self, text: str) -> list:
+ # Split the text into sentences using the NLTK Punkt tokenizer
+ from nltk.tokenize import sent_tokenize
+ sentences = sent_tokenize(text)
+ return [sentence.strip() for sentence in sentences]
+
+# Topic-based segmentation using the TextTiling algorithm
+class TopicSegmentationChunking(ChunkingStrategy):
+ def __init__(self, num_keywords=3):
+ import nltk
+ self.num_keywords = num_keywords
+ # TextTilingTokenizer segments the text into topically coherent blocks
+ self.tokenizer = nltk.TextTilingTokenizer()
+
+ def chunk(self, text: str) -> list:
+ # Use the TextTilingTokenizer to segment the text
+ segmented_topics = self.tokenizer.tokenize(text)
+ return segmented_topics
+
+ def extract_keywords(self, text: str) -> list:
+ # Tokenize and remove stopwords and punctuation
+ import nltk as nl
+ tokens = nl.tokenize.word_tokenize(text)
+ tokens = [token.lower() for token in tokens if token not in nl.corpus.stopwords.words('english') and token not in string.punctuation]
+
+ # Calculate frequency distribution
+ freq_dist = Counter(tokens)
+ keywords = [word for word, freq in freq_dist.most_common(self.num_keywords)]
+ return keywords
+
+ def chunk_with_topics(self, text: str) -> list:
+ # Segment the text into topics
+ segments = self.chunk(text)
+ # Extract keywords for each topic segment
+ segments_with_topics = [(segment, self.extract_keywords(segment)) for segment in segments]
+ return segments_with_topics
+
+# Fixed-length word chunks
+class FixedLengthWordChunking(ChunkingStrategy):
+ def __init__(self, chunk_size=100):
+ self.chunk_size = chunk_size
+
+ def chunk(self, text: str) -> list:
+ words = text.split()
+ return [' '.join(words[i:i + self.chunk_size]) for i in range(0, len(words), self.chunk_size)]
+
+# Sliding window chunking
+class SlidingWindowChunking(ChunkingStrategy):
+ def __init__(self, window_size=100, step=50):
+ self.window_size = window_size
+ self.step = step
+
+ def chunk(self, text: str) -> list:
+ words = text.split()
+ chunks = []
+ for i in range(0, len(words), self.step):
+ chunks.append(' '.join(words[i:i + self.window_size]))
+ return chunks
+
+
diff --git a/crawl4ai/config.py b/crawl4ai/config.py
index b29325f1..a20eb547 100644
--- a/crawl4ai/config.py
+++ b/crawl4ai/config.py
@@ -3,15 +3,17 @@ from dotenv import load_dotenv
load_dotenv() # Load environment variables from .env file
-# Default provider
+# Default provider, ONLY used when the extraction strategy is LLMExtractionStrategy
DEFAULT_PROVIDER = "openai/gpt-4-turbo"
-
-# Provider-model dictionary
+MODEL_REPO_BRANCH = "new-release-0.0.2"
+# Provider-model dictionary, ONLY used when the extraction strategy is LLMExtractionStrategy
PROVIDER_MODELS = {
+ "ollama/llama3": "no-token-needed", # Any model from Ollama no need for API token
"groq/llama3-70b-8192": os.getenv("GROQ_API_KEY"),
"groq/llama3-8b-8192": os.getenv("GROQ_API_KEY"),
"openai/gpt-3.5-turbo": os.getenv("OPENAI_API_KEY"),
"openai/gpt-4-turbo": os.getenv("OPENAI_API_KEY"),
+ "openai/gpt-4o": os.getenv("OPENAI_API_KEY"),
"anthropic/claude-3-haiku-20240307": os.getenv("ANTHROPIC_API_KEY"),
"anthropic/claude-3-opus-20240229": os.getenv("ANTHROPIC_API_KEY"),
"anthropic/claude-3-sonnet-20240229": os.getenv("ANTHROPIC_API_KEY"),
diff --git a/crawl4ai/crawler_strategy.py b/crawl4ai/crawler_strategy.py
new file mode 100644
index 00000000..24add103
--- /dev/null
+++ b/crawl4ai/crawler_strategy.py
@@ -0,0 +1,92 @@
+from abc import ABC, abstractmethod
+from selenium import webdriver
+from selenium.webdriver.chrome.service import Service
+from selenium.webdriver.common.by import By
+from selenium.webdriver.support.ui import WebDriverWait
+from selenium.webdriver.support import expected_conditions as EC
+from selenium.webdriver.chrome.options import Options
+from selenium.common.exceptions import InvalidArgumentException
+
+from typing import List
+import requests
+import os
+from pathlib import Path
+
+class CrawlerStrategy(ABC):
+ @abstractmethod
+ def crawl(self, url: str, **kwargs) -> str:
+ pass
+
+class CloudCrawlerStrategy(CrawlerStrategy):
+ def __init__(self, use_cached_html = False):
+ super().__init__()
+ self.use_cached_html = use_cached_html
+
+ def crawl(self, url: str) -> str:
+ data = {
+ "urls": [url],
+ "include_raw_html": True,
+ "forced": True,
+ "extract_blocks": False,
+ }
+
+ response = requests.post("http://crawl4ai.uccode.io/crawl", json=data)
+ response = response.json()
+ html = response["results"][0]["html"]
+ return html
+
+class LocalSeleniumCrawlerStrategy(CrawlerStrategy):
+ def __init__(self, use_cached_html=False, js_code=None):
+ super().__init__()
+ print("[LOG] π Initializing LocalSeleniumCrawlerStrategy")
+ self.options = Options()
+ self.options.headless = True
+ self.options.add_argument("--no-sandbox")
+ self.options.add_argument("--disable-dev-shm-usage")
+ self.options.add_argument("--disable-gpu")
+ self.options.add_argument("--disable-extensions")
+ self.options.add_argument("--headless")
+ self.use_cached_html = use_cached_html
+ self.js_code = js_code
+
+ # chromedriver_autoinstaller.install()
+ import chromedriver_autoinstaller
+ self.service = Service(chromedriver_autoinstaller.install())
+ self.driver = webdriver.Chrome(service=self.service, options=self.options)
+
+ def crawl(self, url: str) -> str:
+ if self.use_cached_html:
+ cache_file_path = os.path.join(Path.home(), ".crawl4ai", "cache", url.replace("/", "_"))
+ if os.path.exists(cache_file_path):
+ with open(cache_file_path, "r") as f:
+ return f.read()
+
+ try:
+ self.driver.get(url)
+ WebDriverWait(self.driver, 10).until(
+ EC.presence_of_all_elements_located((By.TAG_NAME, "html"))
+ )
+
+ # Execute JS code if provided
+ if self.js_code:
+ self.driver.execute_script(self.js_code)
+ # Optionally, wait for some condition after executing the JS code
+ WebDriverWait(self.driver, 10).until(
+ lambda driver: driver.execute_script("return document.readyState") == "complete"
+ )
+
+ html = self.driver.page_source
+
+ # Store in cache
+ cache_file_path = os.path.join(Path.home(), ".crawl4ai", "cache", url.replace("/", "_"))
+ with open(cache_file_path, "w") as f:
+ f.write(html)
+
+ return html
+ except InvalidArgumentException:
+ raise InvalidArgumentException(f"Invalid URL {url}")
+ except Exception as e:
+ raise Exception(f"Failed to crawl {url}: {str(e)}")
+
+ def quit(self):
+ self.driver.quit()
\ No newline at end of file
diff --git a/crawl4ai/database.py b/crawl4ai/database.py
index 89048d05..391d3f4f 100644
--- a/crawl4ai/database.py
+++ b/crawl4ai/database.py
@@ -1,8 +1,16 @@
+import os
+from pathlib import Path
import sqlite3
from typing import Optional
+from typing import Optional, Tuple
-def init_db(db_path: str):
- conn = sqlite3.connect(db_path)
+DB_PATH = os.path.join(Path.home(), ".crawl4ai")
+os.makedirs(DB_PATH, exist_ok=True)
+DB_PATH = os.path.join(DB_PATH, "crawl4ai.db")
+
+def init_db():
+ global DB_PATH
+ conn = sqlite3.connect(DB_PATH)
cursor = conn.cursor()
cursor.execute('''
CREATE TABLE IF NOT EXISTS crawled_data (
@@ -10,52 +18,81 @@ def init_db(db_path: str):
html TEXT,
cleaned_html TEXT,
markdown TEXT,
- parsed_json TEXT,
+ extracted_content TEXT,
success BOOLEAN
)
''')
conn.commit()
conn.close()
-def get_cached_url(db_path: str, url: str) -> Optional[tuple]:
- conn = sqlite3.connect(db_path)
- cursor = conn.cursor()
- cursor.execute('SELECT url, html, cleaned_html, markdown, parsed_json, success FROM crawled_data WHERE url = ?', (url,))
- result = cursor.fetchone()
- conn.close()
- return result
+def check_db_path():
+ if not DB_PATH:
+ raise ValueError("Database path is not set or is empty.")
-def cache_url(db_path: str, url: str, html: str, cleaned_html: str, markdown: str, parsed_json: str, success: bool):
- conn = sqlite3.connect(db_path)
- cursor = conn.cursor()
- cursor.execute('''
- INSERT INTO crawled_data (url, html, cleaned_html, markdown, parsed_json, success)
- VALUES (?, ?, ?, ?, ?, ?)
- ON CONFLICT(url) DO UPDATE SET
- html = excluded.html,
- cleaned_html = excluded.cleaned_html,
- markdown = excluded.markdown,
- parsed_json = excluded.parsed_json,
- success = excluded.success
- ''', (str(url), html, cleaned_html, markdown, parsed_json, success))
- conn.commit()
- conn.close()
-
-def get_total_count(db_path: str) -> int:
+def get_cached_url(url: str) -> Optional[Tuple[str, str, str, str, str, bool]]:
+ check_db_path()
try:
- conn = sqlite3.connect(db_path)
+ conn = sqlite3.connect(DB_PATH)
+ cursor = conn.cursor()
+ cursor.execute('SELECT url, html, cleaned_html, markdown, extracted_content, success FROM crawled_data WHERE url = ?', (url,))
+ result = cursor.fetchone()
+ conn.close()
+ return result
+ except Exception as e:
+ print(f"Error retrieving cached URL: {e}")
+ return None
+
+def cache_url(url: str, html: str, cleaned_html: str, markdown: str, extracted_content: str, success: bool):
+ check_db_path()
+ try:
+ conn = sqlite3.connect(DB_PATH)
+ cursor = conn.cursor()
+ cursor.execute('''
+ INSERT INTO crawled_data (url, html, cleaned_html, markdown, extracted_content, success)
+ VALUES (?, ?, ?, ?, ?, ?)
+ ON CONFLICT(url) DO UPDATE SET
+ html = excluded.html,
+ cleaned_html = excluded.cleaned_html,
+ markdown = excluded.markdown,
+ extracted_content = excluded.extracted_content,
+ success = excluded.success
+ ''', (url, html, cleaned_html, markdown, extracted_content, success))
+ conn.commit()
+ conn.close()
+ except Exception as e:
+ print(f"Error caching URL: {e}")
+
+def get_total_count() -> int:
+ check_db_path()
+ try:
+ conn = sqlite3.connect(DB_PATH)
cursor = conn.cursor()
cursor.execute('SELECT COUNT(*) FROM crawled_data')
result = cursor.fetchone()
conn.close()
return result[0]
except Exception as e:
+ print(f"Error getting total count: {e}")
return 0
-
-# Crete function to cler the database
-def clear_db(db_path: str):
- conn = sqlite3.connect(db_path)
- cursor = conn.cursor()
- cursor.execute('DELETE FROM crawled_data')
- conn.commit()
- conn.close()
\ No newline at end of file
+
+def clear_db():
+ check_db_path()
+ try:
+ conn = sqlite3.connect(DB_PATH)
+ cursor = conn.cursor()
+ cursor.execute('DELETE FROM crawled_data')
+ conn.commit()
+ conn.close()
+ except Exception as e:
+ print(f"Error clearing database: {e}")
+
+def flush_db():
+ check_db_path()
+ try:
+ conn = sqlite3.connect(DB_PATH)
+ cursor = conn.cursor()
+ cursor.execute('DROP TABLE crawled_data')
+ conn.commit()
+ conn.close()
+ except Exception as e:
+ print(f"Error flushing database: {e}")
\ No newline at end of file
diff --git a/crawl4ai/extraction_strategy.py b/crawl4ai/extraction_strategy.py
new file mode 100644
index 00000000..8567ea6b
--- /dev/null
+++ b/crawl4ai/extraction_strategy.py
@@ -0,0 +1,466 @@
+from abc import ABC, abstractmethod
+from typing import Any, List, Dict, Optional, Union
+from concurrent.futures import ThreadPoolExecutor, as_completed
+import json, time
+# from optimum.intel import IPEXModel
+from .prompts import PROMPT_EXTRACT_BLOCKS, PROMPT_EXTRACT_BLOCKS_WITH_INSTRUCTION
+from .config import *
+from .utils import *
+from functools import partial
+from .model_loader import *
+
+
+import numpy as np
+class ExtractionStrategy(ABC):
+ """
+ Abstract base class for all extraction strategies.
+ """
+
+ def __init__(self, **kwargs):
+ self.DEL = "<|DEL|>"
+ self.name = self.__class__.__name__
+ self.verbose = kwargs.get("verbose", False)
+
+ @abstractmethod
+ def extract(self, url: str, html: str, *q, **kwargs) -> List[Dict[str, Any]]:
+ """
+ Extract meaningful blocks or chunks from the given HTML.
+
+ :param url: The URL of the webpage.
+ :param html: The HTML content of the webpage.
+ :return: A list of extracted blocks or chunks.
+ """
+ pass
+
+ def run(self, url: str, sections: List[str], *q, **kwargs) -> List[Dict[str, Any]]:
+ """
+ Process sections of text in parallel by default.
+
+ :param url: The URL of the webpage.
+ :param sections: List of sections (strings) to process.
+ :return: A list of processed JSON blocks.
+ """
+ extracted_content = []
+ with ThreadPoolExecutor() as executor:
+ futures = [executor.submit(self.extract, url, section, **kwargs) for section in sections]
+ for future in as_completed(futures):
+ extracted_content.extend(future.result())
+ return extracted_content
+class NoExtractionStrategy(ExtractionStrategy):
+ def extract(self, url: str, html: str, *q, **kwargs) -> List[Dict[str, Any]]:
+ return [{"index": 0, "content": html}]
+
+ def run(self, url: str, sections: List[str], *q, **kwargs) -> List[Dict[str, Any]]:
+ return [{"index": i, "tags": [], "content": section} for i, section in enumerate(sections)]
+
+class LLMExtractionStrategy(ExtractionStrategy):
+ def __init__(self, provider: str = DEFAULT_PROVIDER, api_token: Optional[str] = None, instruction:str = None, **kwargs):
+ """
+ Initialize the strategy with clustering parameters.
+
+ :param provider: The provider to use for extraction.
+ :param api_token: The API token for the provider.
+ :param instruction: The instruction to use for the LLM model.
+ """
+ super().__init__()
+ self.provider = provider
+ self.api_token = api_token or PROVIDER_MODELS.get(provider, None) or os.getenv("OPENAI_API_KEY")
+ self.instruction = instruction
+ self.verbose = kwargs.get("verbose", False)
+
+ if not self.api_token:
+ raise ValueError("API token must be provided for LLMExtractionStrategy. Update the config.py or set OPENAI_API_KEY environment variable.")
+
+
+ def extract(self, url: str, ix:int, html: str) -> List[Dict[str, Any]]:
+ # print("[LOG] Extracting blocks from URL:", url)
+ print(f"[LOG] Call LLM for {url} - block index: {ix}")
+ variable_values = {
+ "URL": url,
+ "HTML": escape_json_string(sanitize_html(html)),
+ }
+
+ if self.instruction:
+ variable_values["REQUEST"] = self.instruction
+
+ prompt_with_variables = PROMPT_EXTRACT_BLOCKS if not self.instruction else PROMPT_EXTRACT_BLOCKS_WITH_INSTRUCTION
+ for variable in variable_values:
+ prompt_with_variables = prompt_with_variables.replace(
+ "{" + variable + "}", variable_values[variable]
+ )
+
+ response = perform_completion_with_backoff(self.provider, prompt_with_variables, self.api_token)
+ try:
+ blocks = extract_xml_data(["blocks"], response.choices[0].message.content)['blocks']
+ blocks = json.loads(blocks)
+ for block in blocks:
+ block['error'] = False
+ except Exception as e:
+ print("Error extracting blocks:", str(e))
+ parsed, unparsed = split_and_parse_json_objects(response.choices[0].message.content)
+ blocks = parsed
+ if unparsed:
+ blocks.append({
+ "index": 0,
+ "error": True,
+ "tags": ["error"],
+ "content": unparsed
+ })
+
+ if self.verbose:
+ print("[LOG] Extracted", len(blocks), "blocks from URL:", url, "block index:", ix)
+ return blocks
+
+ def _merge(self, documents):
+ chunks = []
+ sections = []
+ total_token_so_far = 0
+
+ for document in documents:
+ if total_token_so_far < CHUNK_TOKEN_THRESHOLD:
+ chunk = document.split(' ')
+ total_token_so_far += len(chunk) * 1.3
+ chunks.append(document)
+ else:
+ sections.append('\n\n'.join(chunks))
+ chunks = [document]
+ total_token_so_far = len(document.split(' ')) * 1.3
+
+ if chunks:
+ sections.append('\n\n'.join(chunks))
+
+ return sections
+
+ def run(self, url: str, sections: List[str]) -> List[Dict[str, Any]]:
+ """
+ Process sections sequentially with a delay for rate limiting issues, specifically for LLMExtractionStrategy.
+ """
+
+ merged_sections = self._merge(sections)
+ extracted_content = []
+ if self.provider.startswith("groq/"):
+ # Sequential processing with a delay
+ for ix, section in enumerate(merged_sections):
+ extracted_content.extend(self.extract(url, ix, section))
+ time.sleep(0.5) # 500 ms delay between each processing
+ else:
+ # Parallel processing using ThreadPoolExecutor
+ with ThreadPoolExecutor(max_workers=4) as executor:
+ extract_func = partial(self.extract, url)
+ futures = [executor.submit(extract_func, ix, section) for ix, section in enumerate(merged_sections)]
+
+ for future in as_completed(futures):
+ extracted_content.extend(future.result())
+
+
+ return extracted_content
+
+class CosineStrategy(ExtractionStrategy):
+ def __init__(self, semantic_filter = None, word_count_threshold=10, max_dist=0.2, linkage_method='ward', top_k=3, model_name = 'BAAI/bge-small-en-v1.5', **kwargs):
+ """
+ Initialize the strategy with clustering parameters.
+
+ :param semantic_filter: A keyword filter for document filtering.
+ :param word_count_threshold: Minimum number of words per cluster.
+ :param max_dist: The maximum cophenetic distance on the dendrogram to form clusters.
+ :param linkage_method: The linkage method for hierarchical clustering.
+ :param top_k: Number of top categories to extract.
+ """
+ super().__init__()
+
+ self.semantic_filter = semantic_filter
+ self.word_count_threshold = word_count_threshold
+ self.max_dist = max_dist
+ self.linkage_method = linkage_method
+ self.top_k = top_k
+ self.timer = time.time()
+ self.verbose = kwargs.get("verbose", False)
+
+ self.buffer_embeddings = np.array([])
+
+ if model_name == "bert-base-uncased":
+ self.tokenizer, self.model = load_bert_base_uncased()
+ elif model_name == "BAAI/bge-small-en-v1.5":
+ self.tokenizer, self.model = load_bge_small_en_v1_5()
+
+ self.nlp = load_text_multilabel_classifier()
+
+ if self.verbose:
+ print(f"[LOG] Model loaded {model_name}, models/reuters, took " + str(time.time() - self.timer) + " seconds")
+
+ def filter_documents_embeddings(self, documents: List[str], semantic_filter: str, threshold: float = 0.5) -> List[str]:
+ """
+ Filter documents based on the cosine similarity of their embeddings with the semantic_filter embedding.
+
+ :param documents: List of text chunks (documents).
+ :param semantic_filter: A string containing the keywords for filtering.
+ :param threshold: Cosine similarity threshold for filtering documents.
+ :return: Filtered list of documents.
+ """
+ from sklearn.metrics.pairwise import cosine_similarity
+ if not semantic_filter:
+ return documents
+ # Compute embedding for the keyword filter
+ query_embedding = self.get_embeddings([semantic_filter])[0]
+
+ # Compute embeddings for the documents
+ document_embeddings = self.get_embeddings(documents)
+
+ # Calculate cosine similarity between the query embedding and document embeddings
+ similarities = cosine_similarity([query_embedding], document_embeddings).flatten()
+
+ # Filter documents based on the similarity threshold
+ filtered_docs = [doc for doc, sim in zip(documents, similarities) if sim >= threshold]
+
+ return filtered_docs
+
+ def get_embeddings(self, sentences: List[str], bypass_buffer=True):
+ """
+ Get BERT embeddings for a list of sentences.
+
+ :param sentences: List of text chunks (sentences).
+ :return: NumPy array of embeddings.
+ """
+ # if self.buffer_embeddings.any() and not bypass_buffer:
+ # return self.buffer_embeddings
+
+ import torch
+ # Tokenize sentences and convert to tensor
+ encoded_input = self.tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
+ # Compute token embeddings
+ with torch.no_grad():
+ model_output = self.model(**encoded_input)
+
+ # Get embeddings from the last hidden state (mean pooling)
+ embeddings = model_output.last_hidden_state.mean(1)
+ self.buffer_embeddings = embeddings.numpy()
+ return embeddings.numpy()
+
+ def hierarchical_clustering(self, sentences: List[str]):
+ """
+ Perform hierarchical clustering on sentences and return cluster labels.
+
+ :param sentences: List of text chunks (sentences).
+ :return: NumPy array of cluster labels.
+ """
+ # Get embeddings
+ from scipy.cluster.hierarchy import linkage, fcluster
+ from scipy.spatial.distance import pdist
+ self.timer = time.time()
+ embeddings = self.get_embeddings(sentences, bypass_buffer=False)
+ # print(f"[LOG] π Embeddings computed in {time.time() - self.timer:.2f} seconds")
+ # Compute pairwise cosine distances
+ distance_matrix = pdist(embeddings, 'cosine')
+ # Perform agglomerative clustering respecting order
+ linked = linkage(distance_matrix, method=self.linkage_method)
+ # Form flat clusters
+ labels = fcluster(linked, self.max_dist, criterion='distance')
+ return labels
+
+ def filter_clusters_by_word_count(self, clusters: Dict[int, List[str]]):
+ """
+ Filter clusters to remove those with a word count below the threshold.
+
+ :param clusters: Dictionary of clusters.
+ :return: Filtered dictionary of clusters.
+ """
+ filtered_clusters = {}
+ for cluster_id, texts in clusters.items():
+ # Concatenate texts for analysis
+ full_text = " ".join(texts)
+ # Count words
+ word_count = len(full_text.split())
+
+ # Keep clusters with word count above the threshold
+ if word_count >= self.word_count_threshold:
+ filtered_clusters[cluster_id] = texts
+
+ return filtered_clusters
+
+ def extract(self, url: str, html: str, *q, **kwargs) -> List[Dict[str, Any]]:
+ """
+ Extract clusters from HTML content using hierarchical clustering.
+
+ :param url: The URL of the webpage.
+ :param html: The HTML content of the webpage.
+ :return: A list of dictionaries representing the clusters.
+ """
+ # Assume `html` is a list of text chunks for this strategy
+ t = time.time()
+ text_chunks = html.split(self.DEL) # Split by lines or paragraphs as needed
+
+ # Pre-filter documents using embeddings and semantic_filter
+ text_chunks = self.filter_documents_embeddings(text_chunks, self.semantic_filter)
+
+ if not text_chunks:
+ return []
+
+ # Perform clustering
+ labels = self.hierarchical_clustering(text_chunks)
+ # print(f"[LOG] π Clustering done in {time.time() - t:.2f} seconds")
+
+ # Organize texts by their cluster labels, retaining order
+ t = time.time()
+ clusters = {}
+ for index, label in enumerate(labels):
+ clusters.setdefault(label, []).append(text_chunks[index])
+
+ # Filter clusters by word count
+ filtered_clusters = self.filter_clusters_by_word_count(clusters)
+
+ # Convert filtered clusters to a sorted list of dictionaries
+ cluster_list = [{"index": int(idx), "tags" : [], "content": " ".join(filtered_clusters[idx])} for idx in sorted(filtered_clusters)]
+
+ labels = self.nlp([cluster['content'] for cluster in cluster_list])
+
+ for cluster, label in zip(cluster_list, labels):
+ cluster['tags'] = label
+
+ # Process the text with the loaded model
+ # for cluster in cluster_list:
+ # cluster['tags'] = self.nlp(cluster['content'])[0]['label']
+ # doc = self.nlp(cluster['content'])
+ # tok_k = self.top_k
+ # top_categories = sorted(doc.cats.items(), key=lambda x: x[1], reverse=True)[:tok_k]
+ # cluster['tags'] = [cat for cat, _ in top_categories]
+
+ # print(f"[LOG] π Categorization done in {time.time() - t:.2f} seconds")
+
+ return cluster_list
+
+ def run(self, url: str, sections: List[str], *q, **kwargs) -> List[Dict[str, Any]]:
+ """
+ Process sections using hierarchical clustering.
+
+ :param url: The URL of the webpage.
+ :param sections: List of sections (strings) to process.
+ :param provider: The provider to be used for extraction (not used here).
+ :param api_token: Optional API token for the provider (not used here).
+ :return: A list of processed JSON blocks.
+ """
+ # This strategy processes all sections together
+
+ return self.extract(url, self.DEL.join(sections), **kwargs)
+
+class TopicExtractionStrategy(ExtractionStrategy):
+ def __init__(self, num_keywords: int = 3, **kwargs):
+ """
+ Initialize the topic extraction strategy with parameters for topic segmentation.
+
+ :param num_keywords: Number of keywords to represent each topic segment.
+ """
+ import nltk
+ super().__init__()
+ self.num_keywords = num_keywords
+ self.tokenizer = nltk.TextTilingTokenizer()
+
+ def extract_keywords(self, text: str) -> List[str]:
+ """
+ Extract keywords from a given text segment using simple frequency analysis.
+
+ :param text: The text segment from which to extract keywords.
+ :return: A list of keyword strings.
+ """
+ import nltk
+ # Tokenize the text and compute word frequency
+ words = nltk.word_tokenize(text)
+ freq_dist = nltk.FreqDist(words)
+ # Get the most common words as keywords
+ keywords = [word for (word, _) in freq_dist.most_common(self.num_keywords)]
+ return keywords
+
+ def extract(self, url: str, html: str, *q, **kwargs) -> List[Dict[str, Any]]:
+ """
+ Extract topics from HTML content using TextTiling for segmentation and keyword extraction.
+
+ :param url: The URL of the webpage.
+ :param html: The HTML content of the webpage.
+ :param provider: The provider to be used for extraction (not used here).
+ :param api_token: Optional API token for the provider (not used here).
+ :return: A list of dictionaries representing the topics.
+ """
+ # The sections are already joined by the DEL marker; treat each part as a topic segment
+ segmented_topics = html.split(self.DEL) # Split by lines or paragraphs as needed
+
+ # Prepare the output as a list of dictionaries
+ topic_list = []
+ for i, segment in enumerate(segmented_topics):
+ # Extract keywords for each segment
+ keywords = self.extract_keywords(segment)
+ topic_list.append({
+ "index": i,
+ "content": segment,
+ "keywords": keywords
+ })
+
+ return topic_list
+
+ def run(self, url: str, sections: List[str], *q, **kwargs) -> List[Dict[str, Any]]:
+ """
+ Process sections using topic segmentation and keyword extraction.
+
+ :param url: The URL of the webpage.
+ :param sections: List of sections (strings) to process.
+ :param provider: The provider to be used for extraction (not used here).
+ :param api_token: Optional API token for the provider (not used here).
+ :return: A list of processed JSON blocks.
+ """
+ # Concatenate sections into a single text for coherent topic segmentation
+
+
+ return self.extract(url, self.DEL.join(sections), **kwargs)
+
+class ContentSummarizationStrategy(ExtractionStrategy):
+ def __init__(self, model_name: str = "sshleifer/distilbart-cnn-12-6", **kwargs):
+ """
+ Initialize the content summarization strategy with a specific model.
+
+ :param model_name: The model to use for summarization.
+ """
+ from transformers import pipeline
+ self.summarizer = pipeline("summarization", model=model_name)
+
+ def extract(self, url: str, text: str, provider: str = None, api_token: Optional[str] = None) -> List[Dict[str, Any]]:
+ """
+ Summarize a single section of text.
+
+ :param url: The URL of the webpage.
+ :param text: A section of text to summarize.
+ :param provider: The provider to be used for extraction (not used here).
+ :param api_token: Optional API token for the provider (not used here).
+ :return: A dictionary with the summary.
+ """
+ try:
+ summary = self.summarizer(text, max_length=130, min_length=30, do_sample=False)
+ return {"summary": summary[0]['summary_text']}
+ except Exception as e:
+ print(f"Error summarizing text: {e}")
+ return {"summary": text} # Fallback to original text if summarization fails
+
+ def run(self, url: str, sections: List[str], provider: str = None, api_token: Optional[str] = None) -> List[Dict[str, Any]]:
+ """
+ Process each section in parallel to produce summaries.
+
+ :param url: The URL of the webpage.
+ :param sections: List of sections (strings) to summarize.
+ :param provider: The provider to be used for extraction (not used here).
+ :param api_token: Optional API token for the provider (not used here).
+ :return: A list of dictionaries with summaries for each section.
+ """
+ # Use a ThreadPoolExecutor to summarize in parallel
+ summaries = []
+ with ThreadPoolExecutor() as executor:
+ # Create a future for each section's summarization
+ future_to_section = {executor.submit(self.extract, url, section, provider, api_token): i for i, section in enumerate(sections)}
+ for future in as_completed(future_to_section):
+ section_index = future_to_section[future]
+ try:
+ summary_result = future.result()
+ summaries.append((section_index, summary_result))
+ except Exception as e:
+ print(f"Error processing section {section_index}: {e}")
+ summaries.append((section_index, {"summary": sections[section_index]})) # Fallback to original text
+
+ # Sort summaries by the original section index to maintain order
+ summaries.sort(key=lambda x: x[0])
+ return [summary for _, summary in summaries]
\ No newline at end of file
diff --git a/crawl4ai/model_loader.py b/crawl4ai/model_loader.py
new file mode 100644
index 00000000..3a2b8695
--- /dev/null
+++ b/crawl4ai/model_loader.py
@@ -0,0 +1,127 @@
+from functools import lru_cache
+from pathlib import Path
+import subprocess, os
+import shutil
+from crawl4ai.config import MODEL_REPO_BRANCH
+import argparse
+
+def get_home_folder():
+    home_folder = os.path.join(Path.home(), ".crawl4ai")
+    os.makedirs(home_folder, exist_ok=True)
+    os.makedirs(f"{home_folder}/cache", exist_ok=True)
+    os.makedirs(f"{home_folder}/models", exist_ok=True)
+    return home_folder
+
+@lru_cache()
+def load_bert_base_uncased():
+    from transformers import BertTokenizer, BertModel
+    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', resume_download=None)
+    model = BertModel.from_pretrained('bert-base-uncased', resume_download=None)
+    return tokenizer, model
+
+@lru_cache()
+def load_bge_small_en_v1_5():
+    from transformers import AutoTokenizer, AutoModel
+    tokenizer = AutoTokenizer.from_pretrained('BAAI/bge-small-en-v1.5', resume_download=None)
+    model = AutoModel.from_pretrained('BAAI/bge-small-en-v1.5', resume_download=None)
+    model.eval()
+    return tokenizer, model
+
+@lru_cache()
+def load_text_classifier():
+    from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline
+
+    tokenizer = AutoTokenizer.from_pretrained("dstefa/roberta-base_topic_classification_nyt_news")
+    model = AutoModelForSequenceClassification.from_pretrained("dstefa/roberta-base_topic_classification_nyt_news")
+    pipe = pipeline("text-classification", model=model, tokenizer=tokenizer)
+
+    return pipe
+
+@lru_cache()
+def load_text_multilabel_classifier():
+    from transformers import AutoModelForSequenceClassification, AutoTokenizer
+    from scipy.special import expit
+    import torch
+
+    MODEL = "cardiffnlp/tweet-topic-21-multi"
+    tokenizer = AutoTokenizer.from_pretrained(MODEL, resume_download=None)
+    model = AutoModelForSequenceClassification.from_pretrained(MODEL, resume_download=None)
+    class_mapping = model.config.id2label
+
+    # Check for available device: CUDA, MPS (for Apple Silicon), or CPU
+    if torch.cuda.is_available():
+        device = torch.device("cuda")
+    elif torch.backends.mps.is_available():
+        device = torch.device("mps")
+    else:
+        device = torch.device("cpu")
+
+    model.to(device)
+
+    def _classifier(texts, threshold=0.5, max_length=64):
+        tokens = tokenizer(texts, return_tensors='pt', padding=True, truncation=True, max_length=max_length)
+        tokens = {key: val.to(device) for key, val in tokens.items()}  # Move tokens to the selected device
+
+        with torch.no_grad():
+            output = model(**tokens)
+
+        scores = output.logits.detach().cpu().numpy()
+        scores = expit(scores)  # Sigmoid turns logits into independent per-label probabilities
+        predictions = (scores >= threshold) * 1
+
+        batch_labels = []
+        for prediction in predictions:
+            labels = [class_mapping[i] for i, value in enumerate(prediction) if value == 1]
+            batch_labels.append(labels)
+
+        return batch_labels
+
+    return _classifier
+
+@lru_cache()
+def load_nltk_punkt():
+    import nltk
+    try:
+        nltk.data.find('tokenizers/punkt')
+    except LookupError:
+        nltk.download('punkt')
+    return nltk.data.find('tokenizers/punkt')
+
+def download_all_models(remove_existing=False):
+    """Download all models required for Crawl4AI."""
+    if remove_existing:
+        print("[LOG] Removing existing models...")
+        home_folder = get_home_folder()
+        model_folders = [
+            os.path.join(home_folder, "models/reuters"),
+            os.path.join(home_folder, "models"),
+        ]
+        for folder in model_folders:
+            if Path(folder).exists():
+                shutil.rmtree(folder)
+        print("[LOG] Existing models removed.")
+
+    # Load each model to trigger its download
+    print("[LOG] Downloading BERT Base Uncased...")
+    load_bert_base_uncased()
+    print("[LOG] Downloading BGE Small EN v1.5...")
+    load_bge_small_en_v1_5()
+    print("[LOG] Downloading text classifier...")
+    load_text_multilabel_classifier()
+    print("[LOG] Downloading custom NLTK Punkt model...")
+    load_nltk_punkt()
+    print("[LOG] All models downloaded successfully.")
+
+def main():
+    print("[LOG] Welcome to the Crawl4AI Model Downloader!")
+    print("[LOG] This script will download all the models required for Crawl4AI.")
+    parser = argparse.ArgumentParser(description="Crawl4AI Model Downloader")
+    parser.add_argument('--remove-existing', action='store_true', help="Remove existing models before downloading")
+    args = parser.parse_args()
+
+    download_all_models(remove_existing=args.remove_existing)
+
+if __name__ == "__main__":
+    main()
\ No newline at end of file
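For reference, a brief sketch of calling the cached loaders directly from Python. The sample sentence is illustrative, and mean pooling is just one simple way to turn the BGE hidden states into an embedding:

```python
import torch
from crawl4ai.model_loader import load_bge_small_en_v1_5, load_nltk_punkt

# Loaders are wrapped in lru_cache, so repeated calls reuse the same tokenizer/model pair
tokenizer, model = load_bge_small_en_v1_5()

with torch.no_grad():
    inputs = tokenizer("Crawl4AI turns web pages into LLM-friendly data.", return_tensors="pt")
    embedding = model(**inputs).last_hidden_state.mean(dim=1)  # simple mean pooling over tokens

load_nltk_punkt()  # ensure the NLTK punkt tokenizer is available
```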
diff --git a/crawl4ai/models.py b/crawl4ai/models.py
index b9373f78..c2c2d61e 100644
--- a/crawl4ai/models.py
+++ b/crawl4ai/models.py
@@ -11,5 +11,6 @@ class CrawlResult(BaseModel):
     success: bool
     cleaned_html: str = None
     markdown: str = None
-    parsed_json: str = None
+    extracted_content: str = None
+    metadata: dict = None
     error_message: str = None
\ No newline at end of file
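With this change, callers read extracted data from `extracted_content` (formerly `parsed_json`) and page metadata from `metadata`. A minimal sketch, assuming `extracted_content` holds the JSON-serialized blocks produced by the extraction strategy:

```python
import json
from crawl4ai import WebCrawler

crawler = WebCrawler()
crawler.warmup()
result = crawler.run(url="https://www.nbcnews.com/business")

if result.success:
    blocks = json.loads(result.extracted_content)  # assumption: a JSON string of extracted blocks
    print(result.metadata, len(blocks))
else:
    print(result.error_message)
```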
diff --git a/crawl4ai/prompts.py b/crawl4ai/prompts.py
index be7091bc..e0498ccc 100644
--- a/crawl4ai/prompts.py
+++ b/crawl4ai/prompts.py
@@ -59,7 +59,7 @@ Please provide your output within Loading... Please wait.
+ There are two ways to use Crawl4AI: as a library in your Python projects or as a standalone local server. You can also try Crawl4AI in a Google Colab notebook.
+ To install Crawl4AI as a library, follow the steps below. For more information about how to run Crawl4AI as a local server, please refer to the GitHub repository.
+
+ In recent times, we've witnessed a surge of startups emerging, riding the AI hype wave and charging for services that should rightfully be accessible to everyone. One such example is scraping and crawling web pages and transforming them into a format suitable for Large Language Models (LLMs). We believe that building a business around this is not the right approach; instead, it should definitely be open-source. So, if you possess the skills to build such tools and share our philosophy, we invite you to join our "Robinhood" band and help set these products free for the benefit of all.
+
+ To install and run Crawl4AI as a library or a local server, please refer to the GitHub repository.
+ How to Guide
+
+ First, create an instance of WebCrawler and call the warmup() function:
+
+ crawler = WebCrawler()
+ crawler.warmup()
+
+ Run the crawler on a URL:
+
+ result = crawler.run(url="https://www.nbcnews.com/business")
+
+ Set `bypass_cache` to True if you want to try different strategies for the same URL; otherwise, the cached result will be returned. You can also set `always_by_pass_cache` in the constructor to True to always bypass the cache:
+
+ result = crawler.run(url="https://www.nbcnews.com/business", bypass_cache=True)
+ crawler.always_by_pass_cache = True
+
+ Exclude the raw HTML from the response:
+
+ result = crawler.run(url="https://www.nbcnews.com/business", include_raw_html=False)
+
+ Use a chunking strategy:
+
+ result = crawler.run(
+     url="https://www.nbcnews.com/business",
+     chunking_strategy=RegexChunking(patterns=["\n\n"])
+ )
+ result = crawler.run(
+     url="https://www.nbcnews.com/business",
+     chunking_strategy=NlpSentenceChunking()
+ )
+
+ Use an extraction strategy:
+
+ result = crawler.run(
+     url="https://www.nbcnews.com/business",
+     extraction_strategy=CosineStrategy(word_count_threshold=10, max_dist=0.2, linkage_method="ward", top_k=3)
+ )
+ result = crawler.run(
+     url="https://www.nbcnews.com/business",
+     extraction_strategy=LLMExtractionStrategy(provider="openai/gpt-4o", api_token=os.getenv('OPENAI_API_KEY'))
+ )
+ result = crawler.run(
+     url="https://www.nbcnews.com/business",
+     extraction_strategy=LLMExtractionStrategy(
+         provider="openai/gpt-4o",
+         api_token=os.getenv('OPENAI_API_KEY'),
+         instruction="I am interested in only financial news"
+     )
+ )
+
+ Use a CSS selector to extract only specific elements:
+
+ result = crawler.run(
+     url="https://www.nbcnews.com/business",
+     css_selector="h2"
+ )
+
+ Execute custom JavaScript before crawling:
+
+ js_code = """
+ const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More'));
+ loadMoreButton && loadMoreButton.click();
+ """
+ crawler_strategy = LocalSeleniumCrawlerStrategy(js_code=js_code)
+ crawler = WebCrawler(crawler_strategy=crawler_strategy, always_by_pass_cache=True)
+ result = crawler.run(url="https://www.nbcnews.com/business")
+
+ Using Crawl4AI as a Library
+
+ To install Crawl4AI as a library, follow these steps:
+
+ pip install git+https://github.com/unclecode/crawl4ai.git
+
+ or install from source:
+
+ virtualenv venv
+ source venv/bin/activate
+ git clone https://github.com/unclecode/crawl4ai.git
+ cd crawl4ai
+ pip install -e .
+
+ from crawl4ai.web_crawler import WebCrawler
+ from crawl4ai.chunking_strategy import *
+ from crawl4ai.extraction_strategy import *
+ import os
+
+ crawler = WebCrawler()
+
+ # Crawl a single page
+ result = crawler.run(
+     url='https://www.nbcnews.com/business',
+     word_count_threshold=5,  # Minimum word count for an HTML tag to be considered a worthy block
+     chunking_strategy=RegexChunking(patterns=["\\n\\n"]),  # Default is RegexChunking
+     extraction_strategy=CosineStrategy(word_count_threshold=10, max_dist=0.2, linkage_method='ward', top_k=3),  # Default is CosineStrategy
+     # extraction_strategy=LLMExtractionStrategy(provider="openai/gpt-4o", api_token=os.getenv('OPENAI_API_KEY')),
+     bypass_cache=False,
+     extract_blocks=True,  # Whether to extract semantic blocks of text from the HTML
+     css_selector="",  # E.g.: "div.article-body"
+     verbose=True,
+     include_raw_html=True,  # Whether to include the raw HTML content in the response
+ )
+ print(result.model_dump())
+
+ Parameters
+
+ | Parameter | Description | Required | Default Value |
+ |-----------|-------------|----------|---------------|
+ | urls | A list of URLs to crawl and extract data from. | Yes | - |
+ | include_raw_html | Whether to include the raw HTML content in the response. | No | false |
+ | bypass_cache | Whether to force a fresh crawl even if the URL has been previously crawled. | No | false |
+ | extract_blocks | Whether to extract semantic blocks of text from the HTML. | No | true |
+ | word_count_threshold | The minimum number of words a block must contain to be considered meaningful (minimum value is 5). | No | 5 |
+ | extraction_strategy | The strategy to use for extracting content from the HTML (e.g., "CosineStrategy"). | No | CosineStrategy |
+ | chunking_strategy | The strategy to use for chunking the text before processing (e.g., "RegexChunking"). | No | RegexChunking |
+ | css_selector | The CSS selector to target specific parts of the HTML for extraction. | No | None |
+ | verbose | Whether to enable verbose logging. | No | true |
+ Installation
+
+ To install Crawl4AI with all optional dependencies:
+
+ virtualenv venv
+ source venv/bin/activate
+ pip install "crawl4ai[all] @ git+https://github.com/unclecode/crawl4ai.git"
+
+ Then download the required models:
+
+ crawl4ai-download-models
+
+ Or install from source:
+
+ virtualenv venv
+ source venv/bin/activate
+ git clone https://github.com/unclecode/crawl4ai.git
+ cd crawl4ai
+ pip install -e .[all]
+
+ To run Crawl4AI as a local server using Docker:
+
+ docker build -t crawl4ai .
+ # For Mac users: docker build --platform linux/amd64 -t crawl4ai .
+ docker run -d -p 8000:80 crawl4ai
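If you prefer to trigger the model download from Python rather than the console script, the new crawl4ai/model_loader.py module added in this PR exposes the same routine; the assumption that crawl4ai-download-models wraps it is mine, so treat this as a sketch:

```python
from crawl4ai.model_loader import download_all_models

# Download every model Crawl4AI needs; pass remove_existing=True to wipe ~/.crawl4ai/models first
download_all_models(remove_existing=False)
```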
+
+ Chunking Strategies
+
+ RegexChunking
+
+ RegexChunking is a text chunking strategy that splits a given text into smaller parts using regular expressions. This is useful for preparing large texts for processing by language models, ensuring they are divided into manageable segments.
+
+ Constructor Parameters:
+ - patterns (list, optional): A list of regular expression patterns used to split the text. Default is to split by double newlines (['\n\n']).
+
+ Example usage:
+
+ chunker = RegexChunking(patterns=[r'\n\n', r'\. '])
+ chunks = chunker.chunk("This is a sample text. It will be split into chunks.")
+
+ NlpSentenceChunking
+
+ NlpSentenceChunking uses a natural language processing model to chunk a given text into sentences. This approach leverages SpaCy to accurately split text based on sentence boundaries.
+
+ Constructor Parameters:
+ - None.
+
+ Example usage:
+
+ chunker = NlpSentenceChunking()
+ chunks = chunker.chunk("This is a sample text. It will be split into sentences.")
+
+ TopicSegmentationChunking
+
+ TopicSegmentationChunking uses the TextTiling algorithm to segment a given text into topic-based chunks. This method identifies thematic boundaries in the text.
+
+ Constructor Parameters:
+ - num_keywords (int, optional): The number of keywords to extract for each topic segment. Default is 3.
+
+ Example usage:
+
+ chunker = TopicSegmentationChunking(num_keywords=3)
+ chunks = chunker.chunk("This is a sample text. It will be split into topic-based segments.")
+
+ FixedLengthWordChunking
+
+ FixedLengthWordChunking splits a given text into chunks of fixed length, based on the number of words.
+
+ Constructor Parameters:
+ - chunk_size (int, optional): The number of words in each chunk. Default is 100.
+
+ Example usage:
+
+ chunker = FixedLengthWordChunking(chunk_size=100)
+ chunks = chunker.chunk("This is a sample text. It will be split into fixed-length word chunks.")
+
+ SlidingWindowChunking
+
+ SlidingWindowChunking uses a sliding window approach to chunk a given text. Each chunk has a fixed length, and the window slides by a specified step size.
+
+ Constructor Parameters:
+ - window_size (int, optional): The number of words in each chunk. Default is 100.
+ - step (int, optional): The number of words to slide the window. Default is 50.
+
+ Example usage:
+
+ chunker = SlidingWindowChunking(window_size=100, step=50)
+ chunks = chunker.chunk("This is a sample text. It will be split using a sliding window approach.")
+ Extraction Strategies
+
+ NoExtractionStrategy
+
+ NoExtractionStrategy is a basic extraction strategy that returns the entire HTML content without any modification. It is useful for cases where no specific extraction is required and only the cleaned HTML and markdown are needed.
+
+ Constructor Parameters: None.
+
+ Example usage:
+
+ extractor = NoExtractionStrategy()
+ extracted_content = extractor.extract(url, html)
+ LLMExtractionStrategy
+
+ LLMExtractionStrategy uses a Language Model (LLM) to extract meaningful blocks or chunks from the given HTML content. This strategy leverages an external provider for language model completions.
+
+ Constructor Parameters:
+ - provider (str, optional): The provider to use for the language model completions. Default is DEFAULT_PROVIDER (e.g., openai/gpt-4).
+ - api_token (str, optional): The API token for the provider. If not provided, it will try to load it from the environment variable OPENAI_API_KEY.
+ - instruction (str, optional): An instruction to guide the LLM on how to perform the extraction. This allows users to specify the type of data they are interested in or set the tone of the response. Default is None. By providing clear instructions, users can tailor the extraction process to their specific needs, enhancing the relevance and utility of the extracted content.
+
+ Example usage:
+
+ extractor = LLMExtractionStrategy(provider='openai', api_token='your_api_token', instruction='Extract only news about AI.')
+ extracted_content = extractor.extract(url, html)
+ CosineStrategy
+
+ CosineStrategy uses hierarchical clustering based on cosine similarity to extract clusters of text from the given HTML content. This strategy is suitable for identifying related content sections.
+
+ Constructor Parameters:
+ - semantic_filter (str, optional): A string containing keywords for filtering relevant documents before clustering. If provided, documents are filtered based on their cosine similarity to the keyword filter embedding. Default is None.
+ - word_count_threshold (int, optional): Minimum number of words per cluster. Default is 20.
+ - max_dist (float, optional): The maximum cophenetic distance on the dendrogram to form clusters. Default is 0.2.
+ - linkage_method (str, optional): The linkage method for hierarchical clustering. Default is 'ward'.
+ - top_k (int, optional): Number of top categories to extract. Default is 3.
+ - model_name (str, optional): The model name for embedding generation. Default is 'BAAI/bge-small-en-v1.5'.
+
+ Example usage:
+
+ extractor = CosineStrategy(semantic_filter='artificial intelligence', word_count_threshold=10, max_dist=0.2, linkage_method='ward', top_k=3, model_name='BAAI/bge-small-en-v1.5')
+ extracted_content = extractor.extract(url, html)
+ Cosine Similarity Filtering
+
+ When a semantic_filter is provided, the CosineStrategy applies an embedding-based filtering process to select relevant documents before performing hierarchical clustering.
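To make the filtering step concrete, here is a conceptual sketch of embedding-based selection. The helper name, the 0.3 threshold, and the use of plain NumPy are illustrative assumptions, not the strategy's actual internals:

```python
import numpy as np

def cosine_prefilter(filter_embedding: np.ndarray, doc_embeddings: np.ndarray, threshold: float = 0.3) -> list[int]:
    """Return indices of documents whose cosine similarity to the filter embedding meets the threshold."""
    # Normalize so that the dot product equals cosine similarity
    f = filter_embedding / np.linalg.norm(filter_embedding)
    d = doc_embeddings / np.linalg.norm(doc_embeddings, axis=1, keepdims=True)
    sims = d @ f
    return [i for i, s in enumerate(sims) if s >= threshold]
```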
+ TopicExtractionStrategy
+
+ TopicExtractionStrategy uses the TextTiling algorithm to segment the HTML content into topics and extracts keywords for each segment. This strategy is useful for identifying and summarizing thematic content.
+
+ Constructor Parameters:
+ - num_keywords (int, optional): Number of keywords to represent each topic segment. Default is 3.
+
+ Example usage:
+
+ extractor = TopicExtractionStrategy(num_keywords=3)
+ extracted_content = extractor.extract(url, html)
+