diff --git a/README.md b/README.md index 90922299..6740b4e2 100644 --- a/README.md +++ b/README.md @@ -8,16 +8,90 @@
Crawl4AI is a powerful, free web crawling service designed to extract useful information from web pages and make it accessible for large language models (LLMs) and AI applications. πŸ†“πŸŒ
-## 🚧 Work in Progress πŸ‘·β€β™‚οΈ
+## Recent Changes
-- πŸ”§ Separate Crawl and Extract Semantic Chunk: Enhancing efficiency in large-scale tasks.
-- πŸ” Colab Integration: Exploring integration with Google Colab for easy experimentation.
-- 🎯 XPath and CSS Selector Support: Adding support for selective retrieval of specific elements.
-- πŸ“· Image Captioning: Incorporating image captioning capabilities to extract descriptions from images.
-- πŸ’Ύ Embedding Vector Data: Generate and store embedding data for each crawled website.
-- πŸ” Semantic Search Engine: Building a semantic search engine that fetches content, performs vector search similarity, and generates labeled chunk data based on user queries and URLs.
+- πŸš€ 10x faster!
+- πŸ“œ Execute custom JavaScript before crawling!
+- 🀝 Colab friendly!
+- πŸ“š Chunking strategies: topic-based, regex, sentence, and more!
+- 🧠 Extraction strategies: cosine clustering, LLM, and more!
+- 🎯 CSS selector support
+- πŸ“ Pass instructions/keywords to refine extraction
+
+## Power and Simplicity of Crawl4AI πŸš€
+
+Crawl4AI makes even complex web crawling tasks simple and intuitive. Below is an example of how you can execute JavaScript, filter data using keywords, and use a CSS selector to extract specific content, all in one go!
+
+**Example Task:**
+
+1. Execute custom JavaScript to click a "Load More" button.
+2. Filter the data to include only content related to "technology".
+3. Use a CSS selector to extract only paragraphs (`<p>` tags).
+
+**Example Code:**
+
+```python
+# Import necessary modules
+import os
+from crawl4ai import WebCrawler
+from crawl4ai.chunking_strategy import *
+from crawl4ai.extraction_strategy import *
+from crawl4ai.crawler_strategy import *
+
+# Define the JavaScript code to click the "Load More" button
+js_code = """
+const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More'));
+loadMoreButton && loadMoreButton.click();
+"""
+
+# Define the crawling strategy
+crawler_strategy = LocalSeleniumCrawlerStrategy(js_code=js_code)
+
+# Create the WebCrawler instance with the defined strategy
+crawler = WebCrawler(crawler_strategy=crawler_strategy)
+
+# Run the crawler with keyword filtering via the cosine strategy
+result = crawler.run(
+    url="https://www.example.com",
+    extraction_strategy=CosineStrategy(
+        semantic_filter="technology",
+    ),
+)
+
+# Run the crawler with the LLM extraction strategy and a CSS selector
+result = crawler.run(
+    url="https://www.example.com",
+    extraction_strategy=LLMExtractionStrategy(
+        provider="openai/gpt-4o",
+        api_token=os.getenv('OPENAI_API_KEY'),
+        instruction="Extract only content related to technology"
+    ),
+    css_selector="p"
+)
+
+# Display the extracted result
+print(result)
+```
+
+With Crawl4AI, you can perform advanced web crawling and data extraction tasks with just a few lines of code. This example demonstrates how you can harness the power of Crawl4AI to simplify your workflow and get the data you need efficiently.
+
+---
+
+*Continue reading to learn more about the features, installation process, usage, and more.*
+
+
+## Table of Contents
+
+1. [Features](#features)
+2. [Installation](#installation)
+3. [REST API/Local Server](#using-the-local-server-or-rest-api)
+4. [Python Library Usage](#python-library-usage)
+5. [Parameters](#parameters)
+6. [Chunking Strategies](#chunking-strategies)
+7. [Extraction Strategies](#extraction-strategies)
+8. [Contributing](#contributing)
+9. [License](#license)
+10.
[Contact](#contact)
-For more details, refer to the [CHANGELOG.md](https://github.com/unclecode/crawl4ai/edit/main/CHANGELOG.md) file.
## Features ✨
@@ -26,26 +100,28 @@ For more details, refer to the [CHANGELOG.md](https://github.com/unclecode/crawl
- 🌍 Supports crawling multiple URLs simultaneously
- πŸŒƒ Replace media tags with ALT.
- πŸ†“ Completely free to use and open-source
-
-## Getting Started πŸš€
-
-To get started with Crawl4AI, simply visit our web application at [https://crawl4ai.uccode.io](https://crawl4ai.uccode.io) (Available now!) and enter the URL(s) you want to crawl. The application will process the URLs and provide you with the extracted data in various formats.
+- πŸ“œ Execute custom JavaScript before crawling
+- πŸ“š Chunking strategies: topic-based, regex, sentence, and more
+- 🧠 Extraction strategies: cosine clustering, LLM, and more
+- 🎯 CSS selector support
+- πŸ“ Pass instructions/keywords to refine extraction
## Installation πŸ’»
-There are two ways to use Crawl4AI: as a library in your Python projects or as a standalone local server.
-
-### Using Crawl4AI as a Library πŸ“š
+There are three ways to use Crawl4AI:
+1. As a library (Recommended)
+2. As a local server (Docker) or using the REST API
+3. As a Google Colab notebook. [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1wz8u30rvbq6Scodye9AGCw8Qg_Z8QGsk)
To install Crawl4AI as a library, follow these steps:
1. Install the package from GitHub:
-```sh
+```bash
pip install git+https://github.com/unclecode/crawl4ai.git
```
-Alternatively, you can clone the repository and install the package locally:
-```sh
+2. Alternatively, you can clone the repository and install the package locally:
+```bash
virtualenv venv
source venv/bin/activate
git clone https://github.com/unclecode/crawl4ai.git
@@ -53,133 +129,193 @@ cd crawl4ai
pip install -e .
```
-2.
Import the necessary modules in your Python script: -```python -from crawl4ai.web_crawler import WebCrawler -from crawl4ai.chunking_strategy import * -from crawl4ai.extraction_strategy import * -import os - -crawler = WebCrawler() -crawler.warmup() # IMPORTANT: Warmup the engine before running the first crawl - -# Single page crawl -result = crawler.run( - url='https://www.nbcnews.com/business', - word_count_threshold=5, # Minimum word count for a HTML tag to be considered as a worthy block - chunking_strategy= RegexChunking( patterns = ["\n\n"]), # Default is RegexChunking - extraction_strategy= CosineStrategy(word_count_threshold=20, max_dist=0.2, linkage_method='ward', top_k=3) # Default is CosineStrategy - # extraction_strategy= LLMExtractionStrategy(provider= "openai/gpt-4o", api_token = os.getenv('OPENAI_API_KEY')), - bypass_cache=False, - extract_blocks =True, # Whether to extract semantical blocks of text from the HTML - css_selector = "", # Eg: "div.article-body" - verbose=True, - include_raw_html=True, # Whether to include the raw HTML content in the response -) - -print(result.model_dump()) -``` - -Running for the first time will download the chrome driver for selenium. Also creates a SQLite database file `crawler_data.db` in the current directory. This file will store the crawled data for future reference. - -The response model is a `CrawlResponse` object that contains the following attributes: -```python -class CrawlResult(BaseModel): - url: str - html: str - success: bool - cleaned_html: str = None - markdown: str = None - parsed_json: str = None - error_message: str = None -``` - -### Running Crawl4AI as a Local Server πŸš€ - -To run Crawl4AI as a standalone local server, follow these steps: - -1. Clone the repository: -```sh -git clone https://github.com/unclecode/crawl4ai.git -``` - -2. Navigate to the project directory: -```sh -cd crawl4ai -``` - -3. Open `crawler/config.py` and set your favorite LLM provider and API token. - -4. 
Build the Docker image:
-```sh
-docker build -t crawl4ai .
-```
- For Mac users, use the following command instead:
-```sh
-docker build --platform linux/amd64 -t crawl4ai .
-```
-
-5. Run the Docker container:
-```sh
+3. Use Docker to run the local server:
+```bash
+docker build -t crawl4ai .
+# For Mac users
+# docker build --platform linux/amd64 -t crawl4ai .
docker run -d -p 8000:80 crawl4ai
```
-6. Access the application at `http://localhost:8000`.
+For more information about how to run Crawl4AI as a local server, please refer to the [GitHub repository](https://github.com/unclecode/crawl4ai).
-- CURL Example:
-Set the api_token to your OpenAI API key or any other provider you are using.
-```sh
-curl -X POST -H "Content-Type: application/json" -d '{"urls":["https://techcrunch.com/"],"provider_model":"openai/gpt-3.5-turbo","api_token":"your_api_token","include_raw_html":true,"forced":false,"extract_blocks_flag":false,"word_count_threshold":10}' http://localhost:8000/crawl
-```
-Set `extract_blocks_flag` to True to enable the LLM to generate semantically clustered chunks and return them as JSON. Depending on the model and data size, this may take up to 1 minute. Without this setting, it will take between 5 to 20 seconds.
+## Using the Local Server or REST API 🌐
-- Python Example:
-```python
-import requests
-import os
+You can also use Crawl4AI through the REST API. This method allows you to send HTTP requests to the Crawl4AI server and receive structured data in response. The base URL for the API is `https://crawl4ai.com/crawl`. If you run the local server, you can use `http://localhost:8000/crawl`.
(The port depends on your Docker configuration.)
-data = {
- "urls": [
- "https://www.nbcnews.com/business"
- ],
- "provider_model": "groq/llama3-70b-8192",
- "include_raw_html": true,
- "bypass_cache": false,
- "extract_blocks": true,
- "word_count_threshold": 10,
- "extraction_strategy": "CosineStrategy",
- "chunking_strategy": "RegexChunking",
- "css_selector": "",
- "verbose": true
+### Example Usage
+
+To use the REST API, send a POST request to `https://crawl4ai.com/crawl` with the following parameters in the request body.
+
+**Example Request:**
+```json
+{
+  "urls": ["https://www.example.com"],
+  "include_raw_html": false,
+  "bypass_cache": true,
+  "word_count_threshold": 5,
+  "extraction_strategy": "CosineStrategy",
+  "chunking_strategy": "RegexChunking",
+  "css_selector": "p",
+  "verbose": true,
+  "extraction_strategy_args": {
+    "semantic_filter": "finance economy and stock market",
+    "word_count_threshold": 20,
+    "max_dist": 0.2,
+    "linkage_method": "ward",
+    "top_k": 3
+  },
+  "chunking_strategy_args": {
+    "patterns": ["\n\n"]
+  }
}
-
-response = requests.post("http://crawl4ai.uccode.io/crawl", json=data) # OR http://localhost:8000 if your run locally
-
-if response.status_code == 200:
- result = response.json()["results"][0]
- print("Parsed JSON:")
- print(result["parsed_json"])
- print("\nCleaned HTML:")
- print(result["cleaned_html"])
- print("\nMarkdown:")
- print(result["markdown"])
-else:
- print("Error:", response.status_code, response.text)
```
-This code sends a POST request to the Crawl4AI server running on localhost, specifying the target URL (`http://crawl4ai.uccode.io/crawl`) and the desired options. The server processes the request and returns the crawled data in JSON format.
+**Example Response:**
+```json
+{
+  "status": "success",
+  "data": [
+    {
+      "url": "https://www.example.com",
+      "extracted_content": "...",
+      "html": "...",
+      "markdown": "...",
+      "metadata": {...}
+    }
+  ]
+}
+```
-The response from the server includes the semantical clusters, cleaned HTML, and markdown representations of the crawled webpage. You can access and use this data in your Python application as needed.
+For more information about the available parameters and their descriptions, refer to the [Parameters](#parameters) section.
-Make sure to replace `"http://localhost:8000/crawl"` with the appropriate server URL if your Crawl4AI server is running on a different host or port.
-Choose the approach that best suits your needs. If you want to integrate Crawl4AI into your existing Python projects, installing it as a library is the way to go. If you prefer to run Crawl4AI as a standalone service and interact with it via API endpoints, running it as a local server using Docker is the recommended approach.
+## Python Library Usage πŸš€
-**Make sure to check the config.py tp set required environment variables.**
+### Quickstart Guide
-That's it! You can now integrate Crawl4AI into your Python projects and leverage its web crawling capabilities. πŸŽ‰
+Create an instance of WebCrawler and call the `warmup()` function.
-## πŸ“– Parameters
+```python
+from crawl4ai import WebCrawler
+
+crawler = WebCrawler()
+crawler.warmup()
+```
+
+### Understanding 'bypass_cache' and 'include_raw_html' parameters
+
+First crawl (caches the result):
+```python
+result = crawler.run(url="https://www.nbcnews.com/business")
+```
+
+Second crawl (forces a fresh crawl):
+```python
+result = crawler.run(url="https://www.nbcnews.com/business", bypass_cache=True)
+```
+
+πŸ’‘ Don't forget to set `bypass_cache` to True if you want to try different strategies for the same URL. Otherwise, the cached result will be returned. You can also set `always_by_pass_cache` to True in the constructor to always bypass the cache.
+
+Crawl result without raw HTML content:
+```python
+result = crawler.run(url="https://www.nbcnews.com/business", include_raw_html=False)
+```
+
+### Adding a chunking strategy
+
+Using RegexChunking:
+```python
+result = crawler.run(
+    url="https://www.nbcnews.com/business",
+    chunking_strategy=RegexChunking(patterns=["\n\n"])
+)
+```
+
+Using NlpSentenceChunking:
+```python
+result = crawler.run(
+    url="https://www.nbcnews.com/business",
+    chunking_strategy=NlpSentenceChunking()
+)
+```
+
+### Extraction strategy: CosineStrategy
+
+Using CosineStrategy:
+```python
+result = crawler.run(
+    url="https://www.nbcnews.com/business",
+    extraction_strategy=CosineStrategy(
+        semantic_filter="",
+        word_count_threshold=10,
+        max_dist=0.2,
+        linkage_method="ward",
+        top_k=3
+    )
+)
+```
+
+You can set `semantic_filter` to filter relevant documents before clustering. Documents are filtered based on their cosine similarity to the keyword filter embedding.
+
+```python
+result = crawler.run(
+    url="https://www.nbcnews.com/business",
+    extraction_strategy=CosineStrategy(
+        semantic_filter="finance economy and stock market",
+        word_count_threshold=10,
+        max_dist=0.2,
+        linkage_method="ward",
+        top_k=3
+    )
+)
+```
+
+### Using LLMExtractionStrategy
+
+Without instructions:
+```python
+result = crawler.run(
+    url="https://www.nbcnews.com/business",
+    extraction_strategy=LLMExtractionStrategy(
+        provider="openai/gpt-4o",
+        api_token=os.getenv('OPENAI_API_KEY')
+    )
+)
+```
+
+With instructions:
+```python
+result = crawler.run(
+    url="https://www.nbcnews.com/business",
+    extraction_strategy=LLMExtractionStrategy(
+        provider="openai/gpt-4o",
+        api_token=os.getenv('OPENAI_API_KEY'),
+        instruction="I am interested only in financial news"
+    )
+)
+```
+
+### Targeted extraction using CSS selector
+
+Extract only H2 tags:
+```python
+result = crawler.run(
+    url="https://www.nbcnews.com/business",
+    css_selector="h2"
+)
+```
+
+### Passing JavaScript code to click 'Load
More' button + +Using JavaScript to click 'Load More' button: +```python +js_code = """ +const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More')); +loadMoreButton && loadMoreButton.click(); +""" +crawler_strategy = LocalSeleniumCrawlerStrategy(js_code=js_code) +crawler = WebCrawler(crawler_strategy=crawler_strategy, always_by_pass_cache=True) +result = crawler.run(url="https://www.nbcnews.com/business") +``` + +## Parameters πŸ“– | Parameter | Description | Required | Default Value | |-----------------------|-------------------------------------------------------------------------------------------------------|----------|---------------------| @@ -193,49 +329,134 @@ That's it! You can now integrate Crawl4AI into your Python projects and leverage | `css_selector` | The CSS selector to target specific parts of the HTML for extraction. | No | `None` | | `verbose` | Whether to enable verbose logging. | No | `true` | -## πŸ› οΈ Configuration -Crawl4AI allows you to configure various parameters and settings in the `crawler/config.py` file. Here's an example of how you can adjust the parameters: +## Chunking Strategies πŸ“š +### RegexChunking + +`RegexChunking` is a text chunking strategy that splits a given text into smaller parts using regular expressions. This is useful for preparing large texts for processing by language models, ensuring they are divided into manageable segments. + +**Constructor Parameters:** +- `patterns` (list, optional): A list of regular expression patterns used to split the text. Default is to split by double newlines (`['\n\n']`). 
+ +**Example usage:** ```python -import os -from dotenv import load_dotenv - -load_dotenv() # Load environment variables from .env file - -# Default provider, ONLY used when the extraction strategy is LLMExtractionStrategy -DEFAULT_PROVIDER = "openai/gpt-4-turbo" - -# Provider-model dictionary, ONLY used when the extraction strategy is LLMExtractionStrategy -PROVIDER_MODELS = { - "ollama/llama3": "no-token-needed", # Any model from Ollama no need for API token - "groq/llama3-70b-8192": os.getenv("GROQ_API_KEY"), - "groq/llama3-8b-8192": os.getenv("GROQ_API_KEY"), - "openai/gpt-3.5-turbo": os.getenv("OPENAI_API_KEY"), - "openai/gpt-4-turbo": os.getenv("OPENAI_API_KEY"), - "openai/gpt-4o": os.getenv("OPENAI_API_KEY"), - "anthropic/claude-3-haiku-20240307": os.getenv("ANTHROPIC_API_KEY"), - "anthropic/claude-3-opus-20240229": os.getenv("ANTHROPIC_API_KEY"), - "anthropic/claude-3-sonnet-20240229": os.getenv("ANTHROPIC_API_KEY"), -} - -# Chunk token threshold -CHUNK_TOKEN_THRESHOLD = 1000 -# Threshold for the minimum number of words in an HTML tag to be considered -MIN_WORD_THRESHOLD = 5 +chunker = RegexChunking(patterns=[r'\n\n', r'\. ']) +chunks = chunker.chunk("This is a sample text. It will be split into chunks.") ``` -In the `crawler/config.py` file, you can: +### NlpSentenceChunking -REMEBER: You only need to set the API keys for the providers in case you choose LLMExtractStrategy as the extraction strategy. If you choose CosineStrategy, you don't need to set the API keys. +`NlpSentenceChunking` uses a natural language processing model to chunk a given text into sentences. This approach leverages SpaCy to accurately split text based on sentence boundaries. -- Set the default provider using the `DEFAULT_PROVIDER` variable. -- Add or modify the provider-model dictionary (`PROVIDER_MODELS`) to include your desired providers and their corresponding API keys. Crawl4AI supports various providers such as Groq, OpenAI, Anthropic, and more. 
You can add any provider supported by LiteLLM, as well as Ollama. -- Adjust the `CHUNK_TOKEN_THRESHOLD` value to control the splitting of web content into chunks for parallel processing. A higher value means fewer chunks and faster processing, but it may cause issues with weaker LLMs during extraction. -- Modify the `MIN_WORD_THRESHOLD` value to set the minimum number of words an HTML tag must contain to be considered a meaningful block. +**Constructor Parameters:** +- `model` (str, optional): The SpaCy model to use for sentence detection. Default is `'en_core_web_sm'`. -Make sure to set the appropriate API keys for each provider in the `PROVIDER_MODELS` dictionary. You can either directly provide the API key or use environment variables to store them securely. +**Example usage:** +```python +chunker = NlpSentenceChunking(model='en_core_web_sm') +chunks = chunker.chunk("This is a sample text. It will be split into sentences.") +``` -Remember to update the `crawler/config.py` file based on your specific requirements and the providers you want to use with Crawl4AI. +### TopicSegmentationChunking + +`TopicSegmentationChunking` uses the TextTiling algorithm to segment a given text into topic-based chunks. This method identifies thematic boundaries in the text. + +**Constructor Parameters:** +- `num_keywords` (int, optional): The number of keywords to extract for each topic segment. Default is `3`. + +**Example usage:** +```python +chunker = TopicSegmentationChunking(num_keywords=3) +chunks = chunker.chunk("This is a sample text. It will be split into topic-based segments.") +``` + +### FixedLengthWordChunking + +`FixedLengthWordChunking` splits a given text into chunks of fixed length, based on the number of words. + +**Constructor Parameters:** +- `chunk_size` (int, optional): The number of words in each chunk. Default is `100`. + +**Example usage:** +```python +chunker = FixedLengthWordChunking(chunk_size=100) +chunks = chunker.chunk("This is a sample text. 
It will be split into fixed-length word chunks.") +``` + +### SlidingWindowChunking + +`SlidingWindowChunking` uses a sliding window approach to chunk a given text. Each chunk has a fixed length, and the window slides by a specified step size. + +**Constructor Parameters:** +- `window_size` (int, optional): The number of words in each chunk. Default is `100`. +- `step` (int, optional): The number of words to slide the window. Default is `50`. + +**Example usage:** +```python +chunker = SlidingWindowChunking(window_size=100, step=50) +chunks = chunker.chunk("This is a sample text. It will be split using a sliding window approach.") +``` + +## Extraction Strategies 🧠 + +### NoExtractionStrategy + +`NoExtractionStrategy` is a basic extraction strategy that returns the entire HTML content without any modification. It is useful for cases where no specific extraction is required. + +**Constructor Parameters:** +None. + +**Example usage:** +```python +extractor = NoExtractionStrategy() +extracted_content = extractor.extract(url, html) +``` + +### LLMExtractionStrategy + +`LLMExtractionStrategy` uses a Language Model (LLM) to extract meaningful blocks or chunks from the given HTML content. This strategy leverages an external provider for language model completions. + +**Constructor Parameters:** +- `provider` (str, optional): The provider to use for the language model completions. Default is `DEFAULT_PROVIDER` (e.g., openai/gpt-4). +- `api_token` (str, optional): The API token for the provider. If not provided, it will try to load from the environment variable `OPENAI_API_KEY`. +- `instruction` (str, optional): An instruction to guide the LLM on how to perform the extraction. This allows users to specify the type of data they are interested in or set the tone of the response. Default is `None`. 
+
+**Example usage:**
+```python
+extractor = LLMExtractionStrategy(provider='openai/gpt-4o', api_token='your_api_token', instruction='Extract only news about AI.')
+extracted_content = extractor.extract(url, 0, html)
+```
+
+### CosineStrategy
+
+`CosineStrategy` uses hierarchical clustering based on cosine similarity to extract clusters of text from the given HTML content. This strategy is suitable for identifying related content sections.
+
+**Constructor Parameters:**
+- `semantic_filter` (str, optional): A string containing keywords for filtering relevant documents before clustering. If provided, documents are filtered based on their cosine similarity to the keyword filter embedding. Default is `None`.
+- `word_count_threshold` (int, optional): Minimum number of words per cluster. Default is `10`.
+- `max_dist` (float, optional): The maximum cophenetic distance on the dendrogram to form clusters. Default is `0.2`.
+- `linkage_method` (str, optional): The linkage method for hierarchical clustering. Default is `'ward'`.
+- `top_k` (int, optional): Number of top categories to extract. Default is `3`.
+- `model_name` (str, optional): The model name for embedding generation. Default is `'BAAI/bge-small-en-v1.5'`.
+
+**Example usage:**
+```python
+extractor = CosineStrategy(semantic_filter='artificial intelligence', word_count_threshold=10, max_dist=0.2, linkage_method='ward', top_k=3, model_name='BAAI/bge-small-en-v1.5')
+extracted_content = extractor.extract(url, html)
+```
+
+### TopicExtractionStrategy
+
+`TopicExtractionStrategy` uses the TextTiling algorithm to segment the HTML content into topics and extracts keywords for each segment. This strategy is useful for identifying and summarizing thematic content.
+
+**Constructor Parameters:**
+- `num_keywords` (int, optional): Number of keywords to represent each topic segment. Default is `3`.
+
+**Example usage:**
+```python
+extractor = TopicExtractionStrategy(num_keywords=3)
+extracted_content = extractor.extract(url, html)
+```
## Contributing 🀝
@@ -259,5 +480,6 @@ If you have any questions, suggestions, or feedback, please feel free to reach o
- GitHub: [unclecode](https://github.com/unclecode)
- Twitter: [@unclecode](https://twitter.com/unclecode)
+- Website: [crawl4ai.com](https://crawl4ai.com)
Let's work together to make the web more accessible and useful for AI applications! πŸ’ͺπŸŒπŸ€–
diff --git a/crawl4ai/chunking_strategy.py b/crawl4ai/chunking_strategy.py
index d6f0e5d5..53e48c68 100644
--- a/crawl4ai/chunking_strategy.py
+++ b/crawl4ai/chunking_strategy.py
@@ -38,7 +38,12 @@ class RegexChunking(ChunkingStrategy):
class NlpSentenceChunking(ChunkingStrategy):
    def __init__(self, model='en_core_web_sm'):
        import spacy
-        self.nlp = spacy.load(model)
+        try:
+            self.nlp = spacy.load(model)
+        except IOError:
+            spacy.cli.download(model)
+            self.nlp = spacy.load(model)
+        # raise ImportError(f"Spacy model '{model}' not found.
Please download the model using 'python -m spacy download {model}'") def chunk(self, text: str) -> list: doc = self.nlp(text) diff --git a/crawl4ai/crawler_strategy.py b/crawl4ai/crawler_strategy.py index 8d183e38..c1a06072 100644 --- a/crawl4ai/crawler_strategy.py +++ b/crawl4ai/crawler_strategy.py @@ -18,15 +18,16 @@ class CrawlerStrategy(ABC): pass class CloudCrawlerStrategy(CrawlerStrategy): - def crawl(self, url: str, use_cached_html = False, css_selector = None) -> str: + def __init__(self, use_cached_html = False): + super().__init__() + self.use_cached_html = use_cached_html + + def crawl(self, url: str) -> str: data = { "urls": [url], - "provider_model": "", - "api_token": "token", "include_raw_html": True, "forced": True, "extract_blocks": False, - "word_count_threshold": 10 } response = requests.post("http://crawl4ai.uccode.io/crawl", json=data) @@ -35,19 +36,24 @@ class CloudCrawlerStrategy(CrawlerStrategy): return html class LocalSeleniumCrawlerStrategy(CrawlerStrategy): - def __init__(self): + def __init__(self, use_cached_html=False, js_code=None): + super().__init__() self.options = Options() self.options.headless = True self.options.add_argument("--no-sandbox") self.options.add_argument("--disable-dev-shm-usage") + self.options.add_argument("--disable-gpu") + self.options.add_argument("--disable-extensions") self.options.add_argument("--headless") + self.use_cached_html = use_cached_html + self.js_code = js_code # chromedriver_autoinstaller.install() self.service = Service(chromedriver_autoinstaller.install()) self.driver = webdriver.Chrome(service=self.service, options=self.options) - def crawl(self, url: str, use_cached_html = False, css_selector = None) -> str: - if use_cached_html: + def crawl(self, url: str) -> str: + if self.use_cached_html: cache_file_path = os.path.join(Path.home(), ".crawl4ai", "cache", url.replace("/", "_")) if os.path.exists(cache_file_path): with open(cache_file_path, "r") as f: @@ -58,6 +64,15 @@ class 
LocalSeleniumCrawlerStrategy(CrawlerStrategy): WebDriverWait(self.driver, 10).until( EC.presence_of_all_elements_located((By.TAG_NAME, "html")) ) + + # Execute JS code if provided + if self.js_code: + self.driver.execute_script(self.js_code) + # Optionally, wait for some condition after executing the JS code + WebDriverWait(self.driver, 10).until( + lambda driver: driver.execute_script("return document.readyState") == "complete" + ) + html = self.driver.page_source # Store in cache diff --git a/crawl4ai/database.py b/crawl4ai/database.py index b2169c84..391d3f4f 100644 --- a/crawl4ai/database.py +++ b/crawl4ai/database.py @@ -8,9 +8,9 @@ DB_PATH = os.path.join(Path.home(), ".crawl4ai") os.makedirs(DB_PATH, exist_ok=True) DB_PATH = os.path.join(DB_PATH, "crawl4ai.db") -def init_db(db_path: str): +def init_db(): global DB_PATH - conn = sqlite3.connect(db_path) + conn = sqlite3.connect(DB_PATH) cursor = conn.cursor() cursor.execute(''' CREATE TABLE IF NOT EXISTS crawled_data ( @@ -18,13 +18,12 @@ def init_db(db_path: str): html TEXT, cleaned_html TEXT, markdown TEXT, - parsed_json TEXT, + extracted_content TEXT, success BOOLEAN ) ''') conn.commit() conn.close() - DB_PATH = db_path def check_db_path(): if not DB_PATH: @@ -35,7 +34,7 @@ def get_cached_url(url: str) -> Optional[Tuple[str, str, str, str, str, bool]]: try: conn = sqlite3.connect(DB_PATH) cursor = conn.cursor() - cursor.execute('SELECT url, html, cleaned_html, markdown, parsed_json, success FROM crawled_data WHERE url = ?', (url,)) + cursor.execute('SELECT url, html, cleaned_html, markdown, extracted_content, success FROM crawled_data WHERE url = ?', (url,)) result = cursor.fetchone() conn.close() return result @@ -43,21 +42,21 @@ def get_cached_url(url: str) -> Optional[Tuple[str, str, str, str, str, bool]]: print(f"Error retrieving cached URL: {e}") return None -def cache_url(url: str, html: str, cleaned_html: str, markdown: str, parsed_json: str, success: bool): +def cache_url(url: str, html: str, 
cleaned_html: str, markdown: str, extracted_content: str, success: bool): check_db_path() try: conn = sqlite3.connect(DB_PATH) cursor = conn.cursor() cursor.execute(''' - INSERT INTO crawled_data (url, html, cleaned_html, markdown, parsed_json, success) + INSERT INTO crawled_data (url, html, cleaned_html, markdown, extracted_content, success) VALUES (?, ?, ?, ?, ?, ?) ON CONFLICT(url) DO UPDATE SET html = excluded.html, cleaned_html = excluded.cleaned_html, markdown = excluded.markdown, - parsed_json = excluded.parsed_json, + extracted_content = excluded.extracted_content, success = excluded.success - ''', (url, html, cleaned_html, markdown, parsed_json, success)) + ''', (url, html, cleaned_html, markdown, extracted_content, success)) conn.commit() conn.close() except Exception as e: @@ -85,4 +84,15 @@ def clear_db(): conn.commit() conn.close() except Exception as e: - print(f"Error clearing database: {e}") \ No newline at end of file + print(f"Error clearing database: {e}") + +def flush_db(): + check_db_path() + try: + conn = sqlite3.connect(DB_PATH) + cursor = conn.cursor() + cursor.execute('DROP TABLE crawled_data') + conn.commit() + conn.close() + except Exception as e: + print(f"Error flushing database: {e}") \ No newline at end of file diff --git a/crawl4ai/extraction_strategy.py b/crawl4ai/extraction_strategy.py index 91e44e3f..c9074eb2 100644 --- a/crawl4ai/extraction_strategy.py +++ b/crawl4ai/extraction_strategy.py @@ -3,19 +3,20 @@ from typing import Any, List, Dict, Optional, Union from concurrent.futures import ThreadPoolExecutor, as_completed import json, time # from optimum.intel import IPEXModel -from .prompts import PROMPT_EXTRACT_BLOCKS +from .prompts import PROMPT_EXTRACT_BLOCKS, PROMPT_EXTRACT_BLOCKS_WITH_INSTRUCTION from .config import * from .utils import * from functools import partial from .model_loader import load_bert_base_uncased, load_bge_small_en_v1_5, load_spacy_model - - +from transformers import pipeline +from 
sklearn.metrics.pairwise import cosine_similarity +import numpy as np class ExtractionStrategy(ABC): """ Abstract base class for all extraction strategies. """ - def __init__(self): + def __init__(self, **kwargs): self.DEL = "<|DEL|>" self.name = self.__class__.__name__ @@ -38,12 +39,12 @@ class ExtractionStrategy(ABC): :param sections: List of sections (strings) to process. :return: A list of processed JSON blocks. """ - parsed_json = [] + extracted_content = [] with ThreadPoolExecutor() as executor: futures = [executor.submit(self.extract, url, section, **kwargs) for section in sections] for future in as_completed(futures): - parsed_json.extend(future.result()) - return parsed_json + extracted_content.extend(future.result()) + return extracted_content class NoExtractionStrategy(ExtractionStrategy): def extract(self, url: str, html: str, *q, **kwargs) -> List[Dict[str, Any]]: @@ -53,37 +54,41 @@ class NoExtractionStrategy(ExtractionStrategy): return [{"index": i, "tags": [], "content": section} for i, section in enumerate(sections)] class LLMExtractionStrategy(ExtractionStrategy): - def __init__(self, provider: str = DEFAULT_PROVIDER, api_token: Optional[str] = None): + def __init__(self, provider: str = DEFAULT_PROVIDER, api_token: Optional[str] = None, instruction:str = None, **kwargs): """ Initialize the strategy with clustering parameters. - :param word_count_threshold: Minimum number of words per cluster. - :param max_dist: The maximum cophenetic distance on the dendrogram to form clusters. - :param linkage_method: The linkage method for hierarchical clustering. + :param provider: The provider to use for extraction. + :param api_token: The API token for the provider. + :param instruction: The instruction to use for the LLM model. 
""" super().__init__() self.provider = provider self.api_token = api_token or PROVIDER_MODELS.get(provider, None) or os.getenv("OPENAI_API_KEY") + self.instruction = instruction if not self.api_token: raise ValueError("API token must be provided for LLMExtractionStrategy. Update the config.py or set OPENAI_API_KEY environment variable.") - def extract(self, url: str, html: str) -> List[Dict[str, Any]]: - print("[LOG] Extracting blocks from URL:", url) + def extract(self, url: str, ix:int, html: str) -> List[Dict[str, Any]]: + # print("[LOG] Extracting blocks from URL:", url) + print(f"[LOG] Call LLM for {url} - block index: {ix}") variable_values = { "URL": url, "HTML": escape_json_string(sanitize_html(html)), } + + if self.instruction: + variable_values["REQUEST"] = self.instruction - prompt_with_variables = PROMPT_EXTRACT_BLOCKS + prompt_with_variables = PROMPT_EXTRACT_BLOCKS if not self.instruction else PROMPT_EXTRACT_BLOCKS_WITH_INSTRUCTION for variable in variable_values: prompt_with_variables = prompt_with_variables.replace( "{" + variable + "}", variable_values[variable] ) response = perform_completion_with_backoff(self.provider, prompt_with_variables, self.api_token) - try: blocks = extract_xml_data(["blocks"], response.choices[0].message.content)['blocks'] blocks = json.loads(blocks) @@ -101,7 +106,7 @@ class LLMExtractionStrategy(ExtractionStrategy): "content": unparsed }) - print("[LOG] Extracted", len(blocks), "blocks from URL:", url) + print("[LOG] Extracted", len(blocks), "blocks from URL:", url, "block index:", ix) return blocks def _merge(self, documents): @@ -130,29 +135,30 @@ class LLMExtractionStrategy(ExtractionStrategy): """ merged_sections = self._merge(sections) - parsed_json = [] + extracted_content = [] if self.provider.startswith("groq/"): # Sequential processing with a delay - for section in merged_sections: - parsed_json.extend(self.extract(url, section)) + for ix, section in enumerate(merged_sections): + 
extracted_content.extend(self.extract(url, ix, section)) time.sleep(0.5) # 500 ms delay between each processing else: # Parallel processing using ThreadPoolExecutor with ThreadPoolExecutor(max_workers=4) as executor: extract_func = partial(self.extract, url) - futures = [executor.submit(extract_func, section) for section in merged_sections] + futures = [executor.submit(extract_func, ix, section) for ix, section in enumerate(merged_sections)] for future in as_completed(futures): - parsed_json.extend(future.result()) + extracted_content.extend(future.result()) - return parsed_json + return extracted_content class CosineStrategy(ExtractionStrategy): - def __init__(self, word_count_threshold=20, max_dist=0.2, linkage_method='ward', top_k=3, model_name = 'BAAI/bge-small-en-v1.5'): + def __init__(self, semantic_filter = None, word_count_threshold=10, max_dist=0.2, linkage_method='ward', top_k=3, model_name = 'BAAI/bge-small-en-v1.5', **kwargs): """ Initialize the strategy with clustering parameters. + :param semantic_filter: A keyword filter for document filtering. :param word_count_threshold: Minimum number of words per cluster. :param max_dist: The maximum cophenetic distance on the dendrogram to form clusters. :param linkage_method: The linkage method for hierarchical clustering.
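The `semantic_filter` parameter added to `CosineStrategy` above is backed by an embedding-similarity pre-filter (the `filter_documents_embeddings` hunk that follows). Here is a minimal, self-contained sketch of that filtering step; the `toy_embed` function is a hand-rolled stand-in for real model embeddings, not part of Crawl4AI:

```python
import numpy as np

def cosine_sim(a, b):
    # Cosine similarity between two 1-D vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def filter_documents(documents, embed, semantic_filter, threshold=0.5):
    # Keep only documents whose embedding is close enough to the filter's.
    if not semantic_filter:
        return documents
    query = embed(semantic_filter)
    return [doc for doc in documents if cosine_sim(embed(doc), query) >= threshold]

# Toy embedding: 2-D vectors keyed on keyword presence (illustration only).
def toy_embed(text):
    return np.array([1.0 if "tech" in text else 0.0,
                     1.0 if "sport" in text else 0.0]) + 1e-6

docs = ["tech news about chips", "sport scores", "tech and sport mix"]
print(filter_documents(docs, toy_embed, "tech stories"))
```

With real model embeddings the same threshold logic applies; 0.5 matches the default used in the diff.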
@@ -163,11 +169,14 @@ class CosineStrategy(ExtractionStrategy): from transformers import AutoTokenizer, AutoModel import spacy + self.semantic_filter = semantic_filter self.word_count_threshold = word_count_threshold self.max_dist = max_dist self.linkage_method = linkage_method self.top_k = top_k self.timer = time.time() + + self.buffer_embeddings = np.array([]) if model_name == "bert-base-uncased": self.tokenizer, self.model = load_bert_base_uncased() @@ -177,13 +186,42 @@ class CosineStrategy(ExtractionStrategy): self.nlp = load_spacy_model() print(f"[LOG] Model loaded {model_name}, models/reuters, took " + str(time.time() - self.timer) + " seconds") - def get_embeddings(self, sentences: List[str]): + + def filter_documents_embeddings(self, documents: List[str], semantic_filter: str, threshold: float = 0.5) -> List[str]: + """ + Filter documents based on the cosine similarity of their embeddings with the semantic_filter embedding. + + :param documents: List of text chunks (documents). + :param semantic_filter: A string containing the keywords for filtering. + :param threshold: Cosine similarity threshold for filtering documents. + :return: Filtered list of documents. + """ + if not semantic_filter: + return documents + # Compute embedding for the keyword filter + query_embedding = self.get_embeddings([semantic_filter])[0] + + # Compute embeddings for the documents + document_embeddings = self.get_embeddings(documents) + + # Calculate cosine similarity between the query embedding and document embeddings + similarities = cosine_similarity([query_embedding], document_embeddings).flatten() + + # Filter documents based on the similarity threshold + filtered_docs = [doc for doc, sim in zip(documents, similarities) if sim >= threshold] + + return filtered_docs + + def get_embeddings(self, sentences: List[str], bypass_buffer=True): """ Get BERT embeddings for a list of sentences. :param sentences: List of text chunks (sentences). :return: NumPy array of embeddings.
""" + # if self.buffer_embeddings.any() and not bypass_buffer: + # return self.buffer_embeddings + import torch # Tokenize sentences and convert to tensor encoded_input = self.tokenizer(sentences, padding=True, truncation=True, return_tensors='pt') @@ -193,6 +231,7 @@ class CosineStrategy(ExtractionStrategy): # Get embeddings from the last hidden state (mean pooling) embeddings = model_output.last_hidden_state.mean(1) + self.buffer_embeddings = embeddings.numpy() return embeddings.numpy() def hierarchical_clustering(self, sentences: List[str]): @@ -206,7 +245,7 @@ class CosineStrategy(ExtractionStrategy): from scipy.cluster.hierarchy import linkage, fcluster from scipy.spatial.distance import pdist self.timer = time.time() - embeddings = self.get_embeddings(sentences) + embeddings = self.get_embeddings(sentences, bypass_buffer=False) # print(f"[LOG] πŸš€ Embeddings computed in {time.time() - self.timer:.2f} seconds") # Compute pairwise cosine distances distance_matrix = pdist(embeddings, 'cosine') @@ -247,6 +286,12 @@ class CosineStrategy(ExtractionStrategy): # Assume `html` is a list of text chunks for this strategy t = time.time() text_chunks = html.split(self.DEL) # Split by lines or paragraphs as needed + + # Pre-filter documents using embeddings and semantic_filter + text_chunks = self.filter_documents_embeddings(text_chunks, self.semantic_filter) + + if not text_chunks: + return [] # Perform clustering labels = self.hierarchical_clustering(text_chunks) @@ -290,7 +335,7 @@ class CosineStrategy(ExtractionStrategy): return self.extract(url, self.DEL.join(sections), **kwargs) class TopicExtractionStrategy(ExtractionStrategy): - def __init__(self, num_keywords: int = 3): + def __init__(self, num_keywords: int = 3, **kwargs): """ Initialize the topic extraction strategy with parameters for topic segmentation. 
@@ -358,7 +403,7 @@ class TopicExtractionStrategy(ExtractionStrategy): return self.extract(url, self.DEL.join(sections), **kwargs) class ContentSummarizationStrategy(ExtractionStrategy): - def __init__(self, model_name: str = "sshleifer/distilbart-cnn-12-6"): + def __init__(self, model_name: str = "sshleifer/distilbart-cnn-12-6", **kwargs): """ Initialize the content summarization strategy with a specific model. diff --git a/crawl4ai/models.py b/crawl4ai/models.py index b9373f78..c2c2d61e 100644 --- a/crawl4ai/models.py +++ b/crawl4ai/models.py @@ -11,5 +11,6 @@ class CrawlResult(BaseModel): success: bool cleaned_html: str = None markdown: str = None - parsed_json: str = None + extracted_content: str = None + metadata: dict = None error_message: str = None \ No newline at end of file diff --git a/crawl4ai/prompts.py b/crawl4ai/prompts.py index be7091bc..e0498ccc 100644 --- a/crawl4ai/prompts.py +++ b/crawl4ai/prompts.py @@ -59,7 +59,7 @@ Please provide your output within tags, like this: Remember, the output should be a complete, parsable JSON wrapped in tags, with no omissions or errors. The JSON objects should semantically break down the content into relevant blocks, maintaining the original order.""" -PROMPT_EXTRACT_BLOCKS = """YHere is the URL of the webpage: +PROMPT_EXTRACT_BLOCKS = """Here is the URL of the webpage: {URL} And here is the cleaned HTML content of that webpage: @@ -107,4 +107,61 @@ Please provide your output within tags, like this: }] +Remember, the output should be a complete, parsable JSON wrapped in tags, with no omissions or errors. 
The JSON objects should semantically break down the content into relevant blocks, maintaining the original order.""" + +PROMPT_EXTRACT_BLOCKS_WITH_INSTRUCTION = """Here is the URL of the webpage: +{URL} + +And here is the cleaned HTML content of that webpage: + +{HTML} + + +Your task is to break down this HTML content into semantically relevant blocks, following the provided user's REQUEST, and for each block, generate a JSON object with the following keys: + +- index: an integer representing the index of the block in the content +- tags: a list of semantic tags describing what the block is about +- content: a list of strings containing the text content of the block + +This is the user's REQUEST, pay attention to it: + +{REQUEST} + + +To generate the JSON objects: + +1. Carefully read through the HTML content and identify logical breaks or shifts in the content that would warrant splitting it into separate blocks. + +2. For each block: + a. Assign it an index based on its order in the content. + b. Analyze the content and generate ONE semantic tag that describes what the block is about. + c. Extract the text content, EXACTLY THE SAME AS THE GIVEN DATA, clean it up if needed, and store it as a list of strings in the "content" field. + +3. Ensure that the order of the JSON objects matches the order of the blocks as they appear in the original HTML content. + +4. Double-check that each JSON object includes all required keys (index, tags, content) and that the values are in the expected format (integer, list of strings, etc.). + +5. Make sure the generated JSON is complete and parsable, with no errors or omissions. + +6. Make sure to escape any special characters in the HTML content, including single and double quotes, to avoid JSON parsing issues. + +7. Never alter the extracted content, just copy and paste it as it is.
+ +Please provide your output within tags, like this: + + +[{ + "index": 0, + "tags": ["introduction"], + "content": ["This is the first paragraph of the article, which provides an introduction and overview of the main topic."] +}, +{ + "index": 1, + "tags": ["background"], + "content": ["This is the second paragraph, which delves into the history and background of the topic.", + "It provides context and sets the stage for the rest of the article."] +}] + + +**Make sure to follow the user's instruction and extract blocks that align with it.** + Remember, the output should be a complete, parsable JSON wrapped in tags, with no omissions or errors. The JSON objects should semantically break down the content into relevant blocks, maintaining the original order.""" \ No newline at end of file diff --git a/crawl4ai/utils.py b/crawl4ai/utils.py index 37729656..31ef2695 100644 --- a/crawl4ai/utils.py +++ b/crawl4ai/utils.py @@ -461,17 +461,17 @@ def merge_chunks_based_on_token_threshold(chunks, token_threshold): return merged_sections def process_sections(url: str, sections: list, provider: str, api_token: str) -> list: - parsed_json = [] + extracted_content = [] if provider.startswith("groq/"): # Sequential processing with a delay for section in sections: - parsed_json.extend(extract_blocks(url, section, provider, api_token)) + extracted_content.extend(extract_blocks(url, section, provider, api_token)) time.sleep(0.5) # 500 ms delay between each processing else: # Parallel processing using ThreadPoolExecutor with ThreadPoolExecutor() as executor: futures = [executor.submit(extract_blocks, url, section, provider, api_token) for section in sections] for future in as_completed(futures): - parsed_json.extend(future.result()) + extracted_content.extend(future.result()) - return parsed_json \ No newline at end of file + return extracted_content \ No newline at end of file diff --git a/crawl4ai/web_crawler.py b/crawl4ai/web_crawler.py index 361c06dd..753cee86 100644 ---
a/crawl4ai/web_crawler.py +++ b/crawl4ai/web_crawler.py @@ -1,8 +1,9 @@ import os, time +os.environ["TOKENIZERS_PARALLELISM"] = "false" from pathlib import Path from .models import UrlModel, CrawlResult -from .database import init_db, get_cached_url, cache_url, DB_PATH +from .database import init_db, get_cached_url, cache_url, DB_PATH, flush_db from .utils import * from .chunking_strategy import * from .extraction_strategy import * @@ -16,11 +17,13 @@ from .config import * class WebCrawler: def __init__( self, - db_path: str = None, + # db_path: str = None, crawler_strategy: CrawlerStrategy = LocalSeleniumCrawlerStrategy(), + always_by_pass_cache: bool = False, ): - self.db_path = db_path + # self.db_path = db_path self.crawler_strategy = crawler_strategy + self.always_by_pass_cache = always_by_pass_cache # Create the .crawl4ai folder in the user's home directory if it doesn't exist self.crawl4ai_folder = os.path.join(Path.home(), ".crawl4ai") @@ -28,10 +31,11 @@ class WebCrawler: os.makedirs(f"{self.crawl4ai_folder}/cache", exist_ok=True) # If db_path is not provided, use the default path - if not db_path: - self.db_path = f"{self.crawl4ai_folder}/crawl4ai.db" + # if not db_path: + # self.db_path = f"{self.crawl4ai_folder}/crawl4ai.db" - init_db(self.db_path) + flush_db() + init_db() self.ready = False @@ -93,7 +97,7 @@ class WebCrawler: word_count_threshold = MIN_WORD_THRESHOLD # Check cache first - if not bypass_cache: + if not bypass_cache and not self.always_by_pass_cache: cached = get_cached_url(url) if cached: return CrawlResult( @@ -102,7 +106,7 @@ class WebCrawler: "html": cached[1], "cleaned_html": cached[2], "markdown": cached[3], - "parsed_json": cached[4], + "extracted_content": cached[4], "success": cached[5], "error_message": "", } @@ -130,7 +134,7 @@ class WebCrawler: f"[LOG] πŸš€ Crawling done for {url}, success: {success}, time taken: {time.time() - t} seconds" ) - parsed_json = [] + extracted_content = [] if verbose: print(f"[LOG] πŸ”₯ Extracting 
semantic blocks for {url}, Strategy: {extraction_strategy.name}") t = time.time() @@ -138,10 +142,10 @@ class WebCrawler: sections = chunking_strategy.chunk(markdown) # sections = merge_chunks_based_on_token_threshold(sections, CHUNK_TOKEN_THRESHOLD) - parsed_json = extraction_strategy.run( + extracted_content = extraction_strategy.run( url, sections, ) - parsed_json = json.dumps(parsed_json) + extracted_content = json.dumps(extracted_content) if verbose: print( @@ -155,7 +159,7 @@ class WebCrawler: html, cleaned_html, markdown, - parsed_json, + extracted_content, success, ) @@ -164,7 +168,7 @@ class WebCrawler: html=html, cleaned_html=cleaned_html, markdown=markdown, - parsed_json=parsed_json, + extracted_content=extracted_content, success=success, error_message=error_message, ) diff --git a/docs/extraction_strategies.json b/docs/extraction_strategies.json index 207ab981..570e1e32 100644 --- a/docs/extraction_strategies.json +++ b/docs/extraction_strategies.json @@ -1,9 +1,9 @@ { "NoExtractionStrategy": "### NoExtractionStrategy\n\n`NoExtractionStrategy` is a basic extraction strategy that returns the entire HTML content without any modification. It is useful for cases where no specific extraction is required. Only clean html, and amrkdown.\n\n#### Constructor Parameters:\nNone.\n\n#### Example usage:\n```python\nextractor = NoExtractionStrategy()\nextracted_content = extractor.extract(url, html)\n```", - "LLMExtractionStrategy": "### LLMExtractionStrategy\n\n`LLMExtractionStrategy` uses a Language Model (LLM) to extract meaningful blocks or chunks from the given HTML content. This strategy leverages an external provider for language model completions.\n\n#### Constructor Parameters:\n- `provider` (str, optional): The provider to use for the language model completions. Default is `DEFAULT_PROVIDER` (following provider/model eg. openai/gpt-4o).\n- `api_token` (str, optional): The API token for the provider. 
If not provided, it will try to load from the environment variable `OPENAI_API_KEY`.\n\n#### Example usage:\n```python\nextractor = LLMExtractionStrategy(provider='openai', api_token='your_api_token')\nextracted_content = extractor.extract(url, html)\n```", + "LLMExtractionStrategy": "### LLMExtractionStrategy\n\n`LLMExtractionStrategy` uses a Language Model (LLM) to extract meaningful blocks or chunks from the given HTML content. This strategy leverages an external provider for language model completions.\n\n#### Constructor Parameters:\n- `provider` (str, optional): The provider to use for the language model completions. Default is `DEFAULT_PROVIDER` (e.g., openai/gpt-4).\n- `api_token` (str, optional): The API token for the provider. If not provided, it will try to load from the environment variable `OPENAI_API_KEY`.\n- `instruction` (str, optional): An instruction to guide the LLM on how to perform the extraction. This allows users to specify the type of data they are interested in or set the tone of the response. Default is `None`.\n\n#### Example usage:\n```python\nextractor = LLMExtractionStrategy(provider='openai', api_token='your_api_token', instruction='Extract only news about AI.')\nextracted_content = extractor.extract(url, html)\n```\n\nBy providing clear instructions, users can tailor the extraction process to their specific needs, enhancing the relevance and utility of the extracted content.", - "CosineStrategy": "### CosineStrategy\n\n`CosineStrategy` uses hierarchical clustering based on cosine similarity to extract clusters of text from the given HTML content. This strategy is suitable for identifying related content sections.\n\n#### Constructor Parameters:\n- `word_count_threshold` (int, optional): Minimum number of words per cluster. Default is `20`.\n- `max_dist` (float, optional): The maximum cophenetic distance on the dendrogram to form clusters. 
Default is `0.2`.\n- `linkage_method` (str, optional): The linkage method for hierarchical clustering. Default is `'ward'`.\n- `top_k` (int, optional): Number of top categories to extract. Default is `3`.\n- `model_name` (str, optional): The model name for embedding generation. Default is `'BAAI/bge-small-en-v1.5'`.\n\n#### Example usage:\n```python\nextractor = CosineStrategy(word_count_threshold=20, max_dist=0.2, linkage_method='ward', top_k=3, model_name='BAAI/bge-small-en-v1.5')\nextracted_content = extractor.extract(url, html)\n```", + "CosineStrategy": "### CosineStrategy\n\n`CosineStrategy` uses hierarchical clustering based on cosine similarity to extract clusters of text from the given HTML content. This strategy is suitable for identifying related content sections.\n\n#### Constructor Parameters:\n- `semantic_filter` (str, optional): A string containing keywords for filtering relevant documents before clustering. If provided, documents are filtered based on their cosine similarity to the keyword filter embedding. Default is `None`.\n- `word_count_threshold` (int, optional): Minimum number of words per cluster. Default is `20`.\n- `max_dist` (float, optional): The maximum cophenetic distance on the dendrogram to form clusters. Default is `0.2`.\n- `linkage_method` (str, optional): The linkage method for hierarchical clustering. Default is `'ward'`.\n- `top_k` (int, optional): Number of top categories to extract. Default is `3`.\n- `model_name` (str, optional): The model name for embedding generation. 
Default is `'BAAI/bge-small-en-v1.5'`.\n\n#### Example usage:\n```python\nextractor = CosineStrategy(semantic_filter='artificial intelligence', word_count_threshold=10, max_dist=0.2, linkage_method='ward', top_k=3, model_name='BAAI/bge-small-en-v1.5')\nextracted_content = extractor.extract(url, html)\n```\n\n#### Cosine Similarity Filtering\n\nWhen a `semantic_filter` is provided, the `CosineStrategy` applies an embedding-based filtering process to select relevant documents before performing hierarchical clustering.", "TopicExtractionStrategy": "### TopicExtractionStrategy\n\n`TopicExtractionStrategy` uses the TextTiling algorithm to segment the HTML content into topics and extracts keywords for each segment. This strategy is useful for identifying and summarizing thematic content.\n\n#### Constructor Parameters:\n- `num_keywords` (int, optional): Number of keywords to represent each topic segment. Default is `3`.\n\n#### Example usage:\n```python\nextractor = TopicExtractionStrategy(num_keywords=3)\nextracted_content = extractor.extract(url, html)\n```" } diff --git a/docs/quickstart.py b/docs/quickstart.py index cbdfbe0d..cdcef7e4 100644 --- a/docs/quickstart.py +++ b/docs/quickstart.py @@ -1,22 +1,195 @@ -import os +import os, time from crawl4ai.web_crawler import WebCrawler from crawl4ai.chunking_strategy import * from crawl4ai.extraction_strategy import * +from crawl4ai.crawler_strategy import * +from rich import print +from rich.console import Console +console = Console() + +def print_result(result): + # Print each key in one line and just the first 20 characters of each one's value and three dots + console.print(f"\t[bold]Result:[/bold]") + for key, value in result.model_dump().items(): + if type(value) == str and value: + console.print(f"\t{key}: [green]{value[:20]}...[/green]") + +def cprint(message, press_any_key=False): + console.print(message) + if press_any_key: + console.print("Press any key to continue...", style="") + input() def main(): + # πŸš€
Let's get started with the basics! + cprint("🌟 [bold green]Welcome to the Crawl4ai Quickstart Guide! Let's dive into some web crawling fun! 🌐[/bold green]") + + # Basic usage: Just provide the URL + cprint("⛳️ [bold cyan]First Step: Create an instance of WebCrawler and call the `warmup()` function.[/bold cyan]") + cprint("If this is the first time you're running Crawl4ai, this might take a few seconds to load the required model files.", True) + crawler = WebCrawler() crawler.warmup() + cprint("πŸ› οΈ [bold cyan]Basic Usage: Simply provide a URL and let Crawl4ai do the magic![/bold cyan]") + result = crawler.run(url="https://www.nbcnews.com/business") + cprint("[LOG] πŸ“¦ [bold yellow]Basic crawl result:[/bold yellow]") + print_result(result) + + # Explanation of bypass_cache and include_raw_html + cprint("\n🧠 [bold cyan]Understanding 'bypass_cache' and 'include_raw_html' parameters:[/bold cyan]") + cprint("By default, Crawl4ai caches the results of your crawls. This means that subsequent crawls of the same URL will be much faster! Let's see this in action. Because we already crawled this URL, the result will be fetched from the cache.
Let's try it out!") + # Reads from cache + cprint("1️⃣ First crawl (caches the result):", True) + start_time = time.time() + result = crawler.run(url="https://www.nbcnews.com/business") + end_time = time.time() + cprint(f"[LOG] πŸ“¦ [bold yellow]First crawl took {end_time - start_time} seconds and result (from cache):[/bold yellow]") + print_result(result) + + # Force to crawl again + cprint("2️⃣ Second crawl (Force to crawl again):", True) + start_time = time.time() + result = crawler.run(url="https://www.nbcnews.com/business", bypass_cache=True) + end_time = time.time() + cprint(f"[LOG] πŸ“¦ [bold yellow]Second crawl took {end_time - start_time} seconds and result (forced to crawl):[/bold yellow]") + print_result(result) + + # Retrieve raw HTML content + cprint("\nπŸ”„ [bold cyan]By default 'include_raw_html' is set to True, which includes the raw HTML content in the response.[/bold cyan]", True) + result = crawler.run(url="https://www.nbcnews.com/business", include_raw_html=False) + cprint("[LOG] πŸ“¦ [bold yellow]Crawl result (without raw HTML content):[/bold yellow]") + print_result(result) + + cprint("\nπŸ“„ The 'include_raw_html' parameter, when set to True, includes the raw HTML content in the response. By default it is set to True. Let's move on to exploring different chunking strategies now!") + + cprint("For the rest of this guide, I set crawler.always_by_pass_cache to True to force the crawler to bypass the cache. This is to ensure that we get fresh results for each run.", True) + crawler.always_by_pass_cache = True + + # Adding a chunking strategy: RegexChunking + cprint("\n🧩 [bold cyan]Let's add a chunking strategy: RegexChunking![/bold cyan]", True) + cprint("RegexChunking is a simple chunking strategy that splits the text based on a given regex pattern.
Let's see it in action!") + result = crawler.run( + url="https://www.nbcnews.com/business", + chunking_strategy=RegexChunking(patterns=["\n\n"]) + ) + cprint("[LOG] πŸ“¦ [bold yellow]RegexChunking result:[/bold yellow]") + print_result(result) + + # Adding another chunking strategy: NlpSentenceChunking + cprint("\nπŸ” [bold cyan]Time to explore another chunking strategy: NlpSentenceChunking![/bold cyan]", True) + cprint("NlpSentenceChunking uses NLP techniques to split the text into sentences. Let's see how it performs!") + result = crawler.run( + url="https://www.nbcnews.com/business", + chunking_strategy=NlpSentenceChunking() + ) + cprint("[LOG] πŸ“¦ [bold yellow]NlpSentenceChunking result:[/bold yellow]") + print_result(result) + + cprint("There are more chunking strategies to explore, make sure to check the documentation, but let's move on to extraction strategies now!") + + # Adding an extraction strategy: CosineStrategy + cprint("\n🧠 [bold cyan]Let's get smarter with an extraction strategy: CosineStrategy![/bold cyan]", True) + cprint("CosineStrategy uses cosine similarity to extract semantically similar blocks of text. Let's see it in action!") + result = crawler.run( + url="https://www.nbcnews.com/business", + extraction_strategy=CosineStrategy(word_count_threshold=10, max_dist=0.2, linkage_method="ward", top_k=3) + ) + cprint("[LOG] πŸ“¦ [bold yellow]CosineStrategy result:[/bold yellow]") + print_result(result) + + cprint("You can pass other parameters like 'semantic_filter' to the CosineStrategy to extract semantically similar blocks of text.
Let's see it in action!") + result = crawler.run( + url="https://www.nbcnews.com/business", + extraction_strategy=CosineStrategy( + semantic_filter="inflation rent prices", + ) + ) + + cprint("[LOG] πŸ“¦ [bold yellow]CosineStrategy result with semantic filter:[/bold yellow]") + print_result(result) + + # Adding an LLM extraction strategy without instructions + cprint("\nπŸ€– [bold cyan]Time to bring in the big guns: LLMExtractionStrategy without instructions![/bold cyan]", True) + cprint("LLMExtractionStrategy uses a large language model to extract relevant information from the web page. Let's see it in action!") + result = crawler.run( + url="https://www.nbcnews.com/business", + extraction_strategy=LLMExtractionStrategy(provider="openai/gpt-4o", api_token=os.getenv('OPENAI_API_KEY')) + ) + cprint("[LOG] πŸ“¦ [bold yellow]LLMExtractionStrategy (no instructions) result:[/bold yellow]") + print_result(result) + + cprint("You can pass other providers like 'groq/llama3-70b-8192' or 'ollama/llama3' to the LLMExtractionStrategy.") + + # Adding an LLM extraction strategy with instructions + cprint("\nπŸ“œ [bold cyan]Let's make it even more interesting: LLMExtractionStrategy with instructions![/bold cyan]", True) + cprint("Let's say we are only interested in financial news. 
Let's see how LLMExtractionStrategy performs with instructions!") + result = crawler.run( + url="https://www.nbcnews.com/business", + extraction_strategy=LLMExtractionStrategy( + provider="openai/gpt-4o", + api_token=os.getenv('OPENAI_API_KEY'), + instruction="I am interested in only financial news" + ) + ) + cprint("[LOG] πŸ“¦ [bold yellow]LLMExtractionStrategy (with instructions) result:[/bold yellow]") + print_result(result) + + result = crawler.run( + url="https://www.example.com", + extraction_strategy=LLMExtractionStrategy( + provider="openai/gpt-4o", + api_token=os.getenv('OPENAI_API_KEY'), + instruction="Extract only content related to technology" + ) + ) + + cprint("You can pass other instructions like 'Extract only content related to technology' to the LLMExtractionStrategy.") + + cprint("There are more extraction strategies to explore, make sure to check the documentation!") + + # Using a CSS selector to extract only H2 tags + cprint("\n🎯 [bold cyan]Targeted extraction: Let's use a CSS selector to extract only H2 tags![/bold cyan]", True) + result = crawler.run( + url="https://www.nbcnews.com/business", + css_selector="h2" + ) + cprint("[LOG] πŸ“¦ [bold yellow]CSS Selector (H2 tags) result:[/bold yellow]") + print_result(result) + + # Passing JavaScript code to interact with the page + cprint("\nπŸ–±οΈ [bold cyan]Let's get interactive: Passing JavaScript code to click 'Load More' button![/bold cyan]", True) + cprint("In this example we try to click the 'Load More' button on the page using JavaScript code.") + js_code = """ + const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More')); + loadMoreButton && loadMoreButton.click(); + """ + crawler_strategy = LocalSeleniumCrawlerStrategy(js_code=js_code) + crawler = WebCrawler(crawler_strategy=crawler_strategy, always_by_pass_cache=True) + result = crawler.run( + url="https://www.nbcnews.com/business", + ) + cprint("[LOG] πŸ“¦ [bold 
yellow]JavaScript Code (Load More button) result:[/bold yellow]") + print_result(result) + + cprint("\nπŸŽ‰ [bold green]Congratulations! You've made it through the Crawl4ai Quickstart Guide! Now go forth and crawl the web like a pro! πŸ•ΈοΈ[/bold green]") + +if __name__ == "__main__": + main() + +def old_main(): + js_code = """const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More')); loadMoreButton && loadMoreButton.click();""" + # js_code = None + crawler = WebCrawler( crawler_strategy=LocalSeleniumCrawlerStrategy(use_cached_html=False, js_code=js_code)) + crawler.warmup() # Single page crawl result = crawler.run( url="https://www.nbcnews.com/business", word_count_threshold=5, # Minimum word count for a HTML tag to be considered as a worthy block chunking_strategy=RegexChunking(patterns=["\n\n"]), # Default is RegexChunking - extraction_strategy=CosineStrategy( - word_count_threshold=20, max_dist=0.2, linkage_method="ward", top_k=3 - ), # Default is CosineStrategy - # extraction_strategy= LLMExtractionStrategy(provider= "openai/gpt-4o", api_token = os.getenv('OPENAI_API_KEY')), + # extraction_strategy=CosineStrategy(word_count_threshold=10, max_dist=0.2, linkage_method="ward", top_k=3), # Default is CosineStrategy + extraction_strategy= LLMExtractionStrategy(provider= "openai/gpt-4o", api_token = os.getenv('OPENAI_API_KEY'), instruction = "I am interested in only financial news"), bypass_cache=True, extract_blocks=True, # Whether to extract semantical blocks of text from the HTML css_selector="", # Eg: "div.article-body" or all H2 tags liek "h2" @@ -28,6 +201,3 @@ def main(): print("[LOG] πŸ“¦ Crawl result:") print(result.model_dump()) - -if __name__ == "__main__": - main() diff --git a/main.py b/main.py index 3cc141b7..5fc01a75 100644 --- a/main.py +++ b/main.py @@ -7,6 +7,7 @@ from fastapi import FastAPI, HTTPException, Request from fastapi.responses import HTMLResponse, JSONResponse from
fastapi.staticfiles import StaticFiles from fastapi.middleware.cors import CORSMiddleware +from fastapi.templating import Jinja2Templates from pydantic import BaseModel, HttpUrl from concurrent.futures import ThreadPoolExecutor, as_completed @@ -35,7 +36,7 @@ app.add_middleware( # Mount the pages directory as a static directory app.mount("/pages", StaticFiles(directory=__location__ + "/pages"), name="pages") - +templates = Jinja2Templates(directory=__location__ + "/pages") # chromedriver_autoinstaller.install() # Ensure chromedriver is installed @lru_cache() def get_crawler(): @@ -51,16 +52,24 @@ class CrawlRequest(BaseModel): extract_blocks: bool = True word_count_threshold: Optional[int] = 5 extraction_strategy: Optional[str] = "CosineStrategy" + extraction_strategy_args: Optional[dict] = {} chunking_strategy: Optional[str] = "RegexChunking" + chunking_strategy_args: Optional[dict] = {} css_selector: Optional[str] = None verbose: Optional[bool] = True @app.get("/", response_class=HTMLResponse) -async def read_index(): - with open(f"{__location__}/pages/index.html", "r") as file: - html_content = file.read() - return HTMLResponse(content=html_content, status_code=200) +async def read_index(request: Request): + partials_dir = os.path.join(__location__, "pages", "partial") + partials = {} + + for filename in os.listdir(partials_dir): + if filename.endswith(".html"): + with open(os.path.join(partials_dir, filename), "r") as file: + partials[filename[:-5]] = file.read() + + return templates.TemplateResponse("index.html", {"request": request, **partials}) @app.get("/total-count") async def get_total_url_count(): @@ -73,11 +82,11 @@ async def clear_database(): clear_db() return JSONResponse(content={"message": "Database cleared."}) -def import_strategy(module_name: str, class_name: str): +def import_strategy(module_name: str, class_name: str, *args, **kwargs): try: module = importlib.import_module(module_name) strategy_class = getattr(module, class_name) - return 
strategy_class() + return strategy_class(*args, **kwargs) except ImportError: raise HTTPException(status_code=400, detail=f"Module {module_name} not found.") except AttributeError: @@ -95,8 +104,8 @@ async def crawl_urls(crawl_request: CrawlRequest, request: Request): current_requests += 1 try: - extraction_strategy = import_strategy("crawl4ai.extraction_strategy", crawl_request.extraction_strategy) - chunking_strategy = import_strategy("crawl4ai.chunking_strategy", crawl_request.chunking_strategy) + extraction_strategy = import_strategy("crawl4ai.extraction_strategy", crawl_request.extraction_strategy, **crawl_request.extraction_strategy_args) + chunking_strategy = import_strategy("crawl4ai.chunking_strategy", crawl_request.chunking_strategy, **crawl_request.chunking_strategy_args) # Use ThreadPoolExecutor to run the synchronous WebCrawler in async manner with ThreadPoolExecutor() as executor: diff --git a/pages/app.css b/pages/app.css new file mode 100644 index 00000000..0e94a2e5 --- /dev/null +++ b/pages/app.css @@ -0,0 +1,131 @@ +:root { + --ifm-font-size-base: 100%; + --ifm-line-height-base: 1.65; + --ifm-font-family-base: system-ui, -apple-system, Segoe UI, Roboto, Ubuntu, Cantarell, Noto Sans, sans-serif, + BlinkMacSystemFont, "Segoe UI", Helvetica, Arial, sans-serif, "Apple Color Emoji", "Segoe UI Emoji", + "Segoe UI Symbol"; +} +html { + -webkit-font-smoothing: antialiased; + -webkit-text-size-adjust: 100%; + text-size-adjust: 100%; + font: var(--ifm-font-size-base) / var(--ifm-line-height-base) var(--ifm-font-family-base); +} +body { + background-color: #1a202c; + color: #fff; +} +.tab-content { + max-height: 400px; + overflow: auto; +} +pre { + white-space: pre-wrap; + font-size: 14px; +} +pre code { + width: 100%; +} + +/* Custom styling for docs-item class and Markdown generated elements */ +.docs-item { + background-color: #2d3748; /* bg-gray-800 */ + padding: 1rem; /* p-4 */ + border-radius: 0.375rem; /* rounded */ + box-shadow: 0 4px 6px rgba(0, 0, 
0, 0.1); /* shadow-md */ + margin-bottom: 1rem; /* space between items */ + line-height: 1.5; /* leading-normal */ +} + +.docs-item h3, +.docs-item h4 { + color: #ffffff; /* text-white */ + font-size: 1.25rem; /* text-xl */ + font-weight: 700; /* font-bold */ + margin-bottom: 0.5rem; /* mb-2 */ +} +.docs-item h4 { + font-size: 1rem; /* text-xl */ +} + +.docs-item p { + color: #e2e8f0; /* text-gray-300 */ + margin-bottom: 0.5rem; /* mb-2 */ +} + +.docs-item code { + background-color: #1a202c; /* bg-gray-900 */ + color: #e2e8f0; /* text-gray-300 */ + padding: 0.25rem 0.5rem; /* px-2 py-1 */ + border-radius: 0.25rem; /* rounded */ + font-size: 0.875rem; /* text-sm */ +} + +.docs-item pre { + background-color: #1a202c; /* bg-gray-900 */ + color: #e2e8f0; /* text-gray-300 */ + padding: 0.5rem; /* p-2 */ + border-radius: 0.375rem; /* rounded */ + overflow: auto; /* overflow-auto */ + margin-bottom: 0.5rem; /* mb-2 */ +} + +.docs-item div { + color: #e2e8f0; /* text-gray-300 */ + font-size: 1rem; /* prose prose-sm */ + line-height: 1.25rem; /* line-height for readability */ +} + +/* Adjustments to make prose class more suitable for dark mode */ +.prose { + max-width: none; /* max-w-none */ +} + +.prose p, +.prose ul { + margin-bottom: 1rem; /* mb-4 */ +} + +.prose code { + /* background-color: #4a5568; */ /* bg-gray-700 */ + color: #65a30d; /* text-white */ + padding: 0.25rem 0.5rem; /* px-1 py-0.5 */ + border-radius: 0.25rem; /* rounded */ + display: inline-block; /* inline-block */ +} + +.prose pre { + background-color: #1a202c; /* bg-gray-900 */ + color: #ffffff; /* text-white */ + padding: 0.5rem; /* p-2 */ + border-radius: 0.375rem; /* rounded */ +} + +.prose h3 { + color: #65a30d; /* text-white */ + font-size: 1.25rem; /* text-xl */ + font-weight: 700; /* font-bold */ + margin-bottom: 0.5rem; /* mb-2 */ +} + +body { + background-color: #1a1a1a; + color: #b3ff00; +} +.sidebar { + color: #b3ff00; + border-right: 1px solid #333; +} +.sidebar a { + color: #b3ff00; + 
text-decoration: none; +} +.sidebar a:hover { + background-color: #555; +} +.content-section { + display: none; +} +.content-section.active { + display: block; +} diff --git a/pages/app.js b/pages/app.js new file mode 100644 index 00000000..a30581a5 --- /dev/null +++ b/pages/app.js @@ -0,0 +1,303 @@ +// JavaScript to manage dynamic form changes and logic +document.getElementById("extraction-strategy-select").addEventListener("change", function () { + const strategy = this.value; + const providerModelSelect = document.getElementById("provider-model-select"); + const tokenInput = document.getElementById("token-input"); + const instruction = document.getElementById("instruction"); + const semantic_filter = document.getElementById("semantic_filter"); + const instruction_div = document.getElementById("instruction_div"); + const semantic_filter_div = document.getElementById("semantic_filter_div"); + const llm_settings = document.getElementById("llm_settings"); + + if (strategy === "LLMExtractionStrategy") { + // providerModelSelect.disabled = false; + // tokenInput.disabled = false; + // semantic_filter.disabled = true; + // instruction.disabled = false; + llm_settings.classList.remove("hidden"); + instruction_div.classList.remove("hidden"); + semantic_filter_div.classList.add("hidden"); + } else if (strategy === "NoExtractionStrategy") { + semantic_filter_div.classList.add("hidden"); + instruction_div.classList.add("hidden"); + llm_settings.classList.add("hidden"); + } else { + // providerModelSelect.disabled = true; + // tokenInput.disabled = true; + // semantic_filter.disabled = false; + // instruction.disabled = true; + llm_settings.classList.add("hidden"); + instruction_div.classList.add("hidden"); + semantic_filter_div.classList.remove("hidden"); + } + + +}); + +// Get the selected provider model and token from local storage +const storedProviderModel = localStorage.getItem("provider_model"); +const storedToken = localStorage.getItem(storedProviderModel); + +if 
(storedProviderModel) { + document.getElementById("provider-model-select").value = storedProviderModel; +} + +if (storedToken) { + document.getElementById("token-input").value = storedToken; +} + +// Handle provider model dropdown change +document.getElementById("provider-model-select").addEventListener("change", () => { + const selectedProviderModel = document.getElementById("provider-model-select").value; + const storedToken = localStorage.getItem(selectedProviderModel); + + if (storedToken) { + document.getElementById("token-input").value = storedToken; + } else { + document.getElementById("token-input").value = ""; + } +}); + +// Fetch total count from the database +axios + .get("/total-count") + .then((response) => { + document.getElementById("total-count").textContent = response.data.count; + }) + .catch((error) => console.error(error)); + +// Handle crawl button click +document.getElementById("crawl-btn").addEventListener("click", () => { + // validate input to have both URL and API token + if (!document.getElementById("url-input").value || !document.getElementById("token-input").value) { + alert("Please enter both URL(s) and API token."); + return; + } + + const selectedProviderModel = document.getElementById("provider-model-select").value; + const apiToken = document.getElementById("token-input").value; + const extractBlocks = document.getElementById("extract-blocks-checkbox").checked; + const bypassCache = document.getElementById("bypass-cache-checkbox").checked; + + // Save the selected provider model and token to local storage + localStorage.setItem("provider_model", selectedProviderModel); + localStorage.setItem(selectedProviderModel, apiToken); + + const urlsInput = document.getElementById("url-input").value; + const urls = urlsInput.split(",").map((url) => url.trim()); + const data = { + urls: urls, + provider_model: selectedProviderModel, + api_token: apiToken, + include_raw_html: true, + bypass_cache: bypassCache, + extract_blocks: extractBlocks, + 
word_count_threshold: parseInt(document.getElementById("threshold").value), + extraction_strategy: document.getElementById("extraction-strategy-select").value, + extraction_strategy_args: { + provider: selectedProviderModel, + api_token: apiToken, + instruction: document.getElementById("instruction").value, + semantic_filter: document.getElementById("semantic_filter").value, + }, + chunking_strategy: document.getElementById("chunking-strategy-select").value, + chunking_strategy_args: {}, + css_selector: document.getElementById("css-selector").value, + // instruction: document.getElementById("instruction").value, + // semantic_filter: document.getElementById("semantic_filter").value, + verbose: true, + }; + + // save api token to local storage + localStorage.setItem("api_token", document.getElementById("token-input").value); + + document.getElementById("loading").classList.remove("hidden"); + document.getElementById("result").classList.add("hidden"); + document.getElementById("code_help").classList.add("hidden"); + + axios + .post("/crawl", data) + .then((response) => { + const result = response.data.results[0]; + const parsedJson = JSON.parse(result.extracted_content); + document.getElementById("json-result").textContent = JSON.stringify(parsedJson, null, 2); + document.getElementById("cleaned-html-result").textContent = result.cleaned_html; + document.getElementById("markdown-result").textContent = result.markdown; + + // Update code examples dynamically + const extractionStrategy = data.extraction_strategy; + const isLLMExtraction = extractionStrategy === "LLMExtractionStrategy"; + + document.getElementById( + "curl-code" + ).textContent = `curl -X POST -H "Content-Type: application/json" -d '${JSON.stringify({ + ...data, + api_token: isLLMExtraction ? "your_api_token" : undefined, + })}' http://crawl4ai.uccode.io/crawl`; + + document.getElementById("python-code").textContent = `import requests\n\ndata = ${JSON.stringify( + { ...data, api_token: isLLMExtraction ? 
"your_api_token" : undefined }, + null, + 2 + )}\n\nresponse = requests.post("http://crawl4ai.uccode.io/crawl", json=data) # or localhost if you run locally \nprint(response.json())`; + + document.getElementById( + "nodejs-code" + ).textContent = `const axios = require('axios');\n\nconst data = ${JSON.stringify( + { ...data, api_token: isLLMExtraction ? "your_api_token" : undefined }, + null, + 2 + )};\n\naxios.post("http://crawl4ai.uccode.io/crawl", data) // or localhost if you run locally \n .then(response => console.log(response.data))\n .catch(error => console.error(error));`; + + document.getElementById( + "library-code" + ).textContent = `from crawl4ai.web_crawler import WebCrawler\nfrom crawl4ai.extraction_strategy import *\nfrom crawl4ai.chunking_strategy import *\n\ncrawler = WebCrawler()\ncrawler.warmup()\n\nresult = crawler.run(\n url='${ + urls[0] + }',\n word_count_threshold=${data.word_count_threshold},\n extraction_strategy=${ + isLLMExtraction + ? `${extractionStrategy}(provider="${data.provider_model}", api_token="${data.api_token}")` + : extractionStrategy + "()" + },\n chunking_strategy=${data.chunking_strategy}(),\n bypass_cache=${ + data.bypass_cache + },\n css_selector="${data.css_selector}"\n)\nprint(result)`; + + // Highlight code syntax + hljs.highlightAll(); + + // Select JSON tab by default + document.querySelector('.tab-btn[data-tab="json"]').click(); + + document.getElementById("loading").classList.add("hidden"); + + document.getElementById("result").classList.remove("hidden"); + document.getElementById("code_help").classList.remove("hidden"); + + // increment the total count + document.getElementById("total-count").textContent = + parseInt(document.getElementById("total-count").textContent) + 1; + }) + .catch((error) => { + console.error(error); + document.getElementById("loading").classList.add("hidden"); + }); +}); + +// Handle tab clicks +document.querySelectorAll(".tab-btn").forEach((btn) => { + btn.addEventListener("click", () 
=> { + const tab = btn.dataset.tab; + document.querySelectorAll(".tab-btn").forEach((b) => b.classList.remove("bg-lime-700", "text-white")); + btn.classList.add("bg-lime-700", "text-white"); + document.querySelectorAll(".tab-content.code pre").forEach((el) => el.classList.add("hidden")); + document.getElementById(`${tab}-result`).parentElement.classList.remove("hidden"); + }); +}); + +// Handle code tab clicks +document.querySelectorAll(".code-tab-btn").forEach((btn) => { + btn.addEventListener("click", () => { + const tab = btn.dataset.tab; + document.querySelectorAll(".code-tab-btn").forEach((b) => b.classList.remove("bg-lime-700", "text-white")); + btn.classList.add("bg-lime-700", "text-white"); + document.querySelectorAll(".tab-content.result pre").forEach((el) => el.classList.add("hidden")); + document.getElementById(`${tab}-code`).parentElement.classList.remove("hidden"); + }); +}); + +// Handle copy to clipboard button clicks + +async function copyToClipboard(text) { + if (navigator.clipboard && navigator.clipboard.writeText) { + return navigator.clipboard.writeText(text); + } else { + return fallbackCopyTextToClipboard(text); + } +} + +function fallbackCopyTextToClipboard(text) { + return new Promise((resolve, reject) => { + const textArea = document.createElement("textarea"); + textArea.value = text; + + // Avoid scrolling to bottom + textArea.style.top = "0"; + textArea.style.left = "0"; + textArea.style.position = "fixed"; + + document.body.appendChild(textArea); + textArea.focus(); + textArea.select(); + + try { + const successful = document.execCommand("copy"); + if (successful) { + resolve(); + } else { + reject(); + } + } catch (err) { + reject(err); + } + + document.body.removeChild(textArea); + }); +} + +document.querySelectorAll(".copy-btn").forEach((btn) => { + btn.addEventListener("click", () => { + const target = btn.dataset.target; + const code = document.getElementById(target).textContent; + //navigator.clipboard.writeText(code).then(() => { 
+ copyToClipboard(code).then(() => { + btn.textContent = "Copied!"; + setTimeout(() => { + btn.textContent = "Copy"; + }, 2000); + }); + }); +}); + +document.addEventListener("DOMContentLoaded", async () => { + try { + const extractionResponse = await fetch("/strategies/extraction"); + const extractionStrategies = await extractionResponse.json(); + + const chunkingResponse = await fetch("/strategies/chunking"); + const chunkingStrategies = await chunkingResponse.json(); + + renderStrategies("extraction-strategies", extractionStrategies); + renderStrategies("chunking-strategies", chunkingStrategies); + } catch (error) { + console.error("Error fetching strategies:", error); + } +}); + +function renderStrategies(containerId, strategies) { + const container = document.getElementById(containerId); + container.innerHTML = ""; // Clear any existing content + strategies = JSON.parse(strategies); + Object.entries(strategies).forEach(([strategy, description]) => { + const strategyElement = document.createElement("div"); + strategyElement.classList.add("bg-zinc-800", "p-4", "rounded", "shadow-md", "docs-item"); + + const strategyDescription = document.createElement("div"); + strategyDescription.classList.add("text-gray-300", "prose", "prose-sm"); + strategyDescription.innerHTML = marked.parse(description); + + strategyElement.appendChild(strategyDescription); + + container.appendChild(strategyElement); + }); +} +document.querySelectorAll(".sidebar a").forEach((link) => { + link.addEventListener("click", function (event) { + event.preventDefault(); + document.querySelectorAll(".content-section").forEach((section) => { + section.classList.remove("active"); + }); + const target = event.target.getAttribute("data-target"); + document.getElementById(target).classList.add("active"); + }); +}); +// Highlight code syntax +hljs.highlightAll(); diff --git a/pages/index copy.html b/pages/index copy.html new file mode 100644 index 00000000..b61b7298 --- /dev/null +++ b/pages/index 
copy.html @@ -0,0 +1,971 @@ + + + + + + Crawl4AI + + + + + + + + + + + + + + + + +

+
+

πŸ”₯πŸ•·οΈ Crawl4AI: Web Data for your Thoughts

+
+
πŸ“Š Total Websites Processed + 2
+
+ +
+
+

Try It Now

+
+
+
+ + +
+
+ + +
+
+ + +
+
+ + +
+
+ + +
+
+ + +
+
+ + +
+
+
+ + +
+
+ + +
+ +
+
+ +
+ +
+ + + +
+
+
+ + +
+
+ +
+
+ + + + +
+
+
+                                
+                                
+                            
+ + + +
+
+
+
+
+
+
+ +
+ 🌟 Welcome to the Crawl4ai Quickstart Guide! Let's dive into some web crawling fun! +
+
+ First Step: Create an instance of WebCrawler and call the warmup() function. +
+
+
crawler = WebCrawler()
+            crawler.warmup()
+
+ + +
+ 🧠 Understanding 'bypass_cache' and 'include_raw_html' parameters: +
+
First crawl (caches the result):
+
+
result = crawler.run(url="https://www.nbcnews.com/business")
+
+
Second crawl (force a fresh crawl):
+
+
result = crawler.run(url="https://www.nbcnews.com/business", bypass_cache=True)
+
+
Crawl result without raw HTML content:
+
+
result = crawler.run(url="https://www.nbcnews.com/business", include_raw_html=False)
+
+ + +
+ πŸ“„ + The 'include_raw_html' parameter, when set to True, includes the raw HTML content in the + response. By default, it is set to True. +
+
Set always_by_pass_cache to True:
+
+
crawler.always_by_pass_cache = True
+
+ + +
+ 🧩 Let's add a chunking strategy: RegexChunking! +
+
Using RegexChunking:
+
+
result = crawler.run(
+                url="https://www.nbcnews.com/business",
+                chunking_strategy=RegexChunking(patterns=["\n\n"])
+            )
+
+
Using NlpSentenceChunking:
+
+
result = crawler.run(
+                url="https://www.nbcnews.com/business",
+                chunking_strategy=NlpSentenceChunking()
+            )
+
+ + +
+ 🧠 Let's get smarter with an extraction strategy: CosineStrategy! +
+
Using CosineStrategy:
+
+
result = crawler.run(
+                url="https://www.nbcnews.com/business",
+                extraction_strategy=CosineStrategy(word_count_threshold=10, max_dist=0.2, linkage_method="ward", top_k=3)
+            )
+
+ + +
+ πŸ€– Time to bring in the big guns: LLMExtractionStrategy without instructions! +
+
Using LLMExtractionStrategy without instructions:
+
+
result = crawler.run(
+                url="https://www.nbcnews.com/business",
+                extraction_strategy=LLMExtractionStrategy(provider="openai/gpt-4o", api_token=os.getenv('OPENAI_API_KEY'))
+            )
+
+ + +
+ πŸ“œ Let's make it even more interesting: LLMExtractionStrategy with instructions! +
+
Using LLMExtractionStrategy with instructions:
+
+
result = crawler.run(
+                url="https://www.nbcnews.com/business",
+                extraction_strategy=LLMExtractionStrategy(
+                    provider="openai/gpt-4o",
+                    api_token=os.getenv('OPENAI_API_KEY'),
+                    instruction="I am interested in only financial news"
+                )
+            )
+
+ + +
+ 🎯 Targeted extraction: Let's use a CSS selector to extract only H2 tags! +
+
Using CSS selector to extract H2 tags:
+
+
result = crawler.run(
+                url="https://www.nbcnews.com/business",
+                css_selector="h2"
+            )
+
+ + +
+ πŸ–±οΈ Let's get interactive: Passing JavaScript code to click 'Load More' button! +
+
Using JavaScript to click 'Load More' button:
+
+
js_code = """
+            const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More'));
+            loadMoreButton && loadMoreButton.click();
+            """
+            crawler_strategy = LocalSeleniumCrawlerStrategy(js_code=js_code)
+            crawler = WebCrawler(crawler_strategy=crawler_strategy, always_by_pass_cache=True)
+            result = crawler.run(url="https://www.nbcnews.com/business")
+
+ + +
+ πŸŽ‰ + Congratulations! You've made it through the Crawl4ai Quickstart Guide! Now go forth and crawl + the web like a pro! πŸ•ΈοΈ +
+
+
+
+

Installation πŸ’»

+

+ There are two ways to use Crawl4AI: as a library in your Python projects or as a standalone local + server. +

+ +

+ You can also try Crawl4AI in a Google Colab + Open In Colab +

+ +

Using Crawl4AI as a Library πŸ“š

+

To install Crawl4AI as a library, follow these steps:

+ +
    +
  1. + Install the package from GitHub: +
    pip install git+https://github.com/unclecode/crawl4ai.git
    +
  2. +
  3. + Alternatively, you can clone the repository and install the package locally: +
    virtualenv venv
    +source venv/bin/activate
    +git clone https://github.com/unclecode/crawl4ai.git
    +cd crawl4ai
    +pip install -e .
    +        
    +
  4. +
  5. + Import the necessary modules in your Python script: +
    from crawl4ai.web_crawler import WebCrawler
    +from crawl4ai.chunking_strategy import *
    +from crawl4ai.extraction_strategy import *
    +import os
    +
    +crawler = WebCrawler()
    +
    +# Single page crawl
    +result = crawler.run(
    +    url='https://www.nbcnews.com/business',
    +    word_count_threshold=5, # Minimum word count for an HTML tag to be considered a worthy block
    +    chunking_strategy=RegexChunking(patterns=["\\n\\n"]), # Default is RegexChunking
    +    extraction_strategy=CosineStrategy(word_count_threshold=10, max_dist=0.2, linkage_method='ward', top_k=3), # Default is CosineStrategy
    +    # extraction_strategy= LLMExtractionStrategy(provider= "openai/gpt-4o", api_token = os.getenv('OPENAI_API_KEY')),
    +    bypass_cache=False,
    +    extract_blocks=True, # Whether to extract semantic blocks of text from the HTML
    +    css_selector = "", # Eg: "div.article-body"
    +    verbose=True,
    +    include_raw_html=True, # Whether to include the raw HTML content in the response
    +)
    +print(result.model_dump())
    +        
    +
  6. +
+

+ For more information about how to run Crawl4AI as a local server, please refer to the + GitHub repository. +

+ +
+ +
+

πŸ“– Parameters

+
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
Parameter | Description | Required | Default Value
urls | A list of URLs to crawl and extract data from. | Yes | -
include_raw_html | Whether to include the raw HTML content in the response. | No | false
bypass_cache | Whether to force a fresh crawl even if the URL has been previously crawled. | No | false
extract_blocks | Whether to extract semantic blocks of text from the HTML. | No | true
word_count_threshold | The minimum number of words a block must contain to be considered meaningful (minimum value is 5). | No | 5
extraction_strategy | The strategy to use for extracting content from the HTML (e.g., "CosineStrategy"). | No | CosineStrategy
chunking_strategy | The strategy to use for chunking the text before processing (e.g., "RegexChunking"). | No | RegexChunking
css_selector | The CSS selector to target specific parts of the HTML for extraction. | No | None
verbose | Whether to enable verbose logging. | No | true
+
+
+ +
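The parameters above map one-to-one onto the JSON body of a POST to the server's `/crawl` endpoint (the `CrawlRequest` model), and the new `extraction_strategy_args`/`chunking_strategy_args` fields are forwarded to the strategy constructors. A minimal sketch of building such a request body; the host/port and the filter value are assumptions, adapt them to your deployment:

```python
import json

# Request body for POST /crawl; field names mirror the CrawlRequest model.
# The *_strategy_args dicts are passed to the strategy constructors server-side.
payload = {
    "urls": ["https://www.nbcnews.com/business"],
    "include_raw_html": False,
    "bypass_cache": True,
    "extract_blocks": True,
    "word_count_threshold": 5,
    "extraction_strategy": "CosineStrategy",
    "extraction_strategy_args": {"semantic_filter": "finance"},
    "chunking_strategy": "RegexChunking",
    "chunking_strategy_args": {},
    "css_selector": "",
    "verbose": True,
}

body = json.dumps(payload)

# To actually send it (assuming a local server, e.g. on port 8000):
#   import requests
#   resp = requests.post("http://localhost:8000/crawl", json=payload)
#   print(resp.json()["results"][0]["extracted_content"])
```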
+
+

Extraction Strategies

+
+
+
+ +
+
+

Chunking Strategies

+
+
+
+ +
+
+

πŸ€” Why build this?

+

+ In recent times, we've witnessed a surge of startups emerging, riding the AI hype wave and charging + for services that should rightfully be accessible to everyone. πŸŒπŸ’Έ One such example is scraping and + crawling web pages and transforming them into a format suitable for Large Language Models (LLMs). + πŸ•ΈοΈπŸ€– We believe that building a business around this is not the right approach; instead, it should + definitely be open-source. πŸ†“πŸŒŸ So, if you possess the skills to build such tools and share our + philosophy, we invite you to join our "Robinhood" band and help set these products free for the + benefit of all. 🀝πŸ’ͺ +

+
+
+ +
+
+

βš™οΈ Installation

+

+ To install and run Crawl4AI as a library or a local server, please refer to the πŸ“š + GitHub repository. +

+
+
+ + + + + + diff --git a/pages/index.html b/pages/index.html index e354ae3e..2947c34a 100644 --- a/pages/index.html +++ b/pages/index.html @@ -12,6 +12,7 @@ + - - -
+
+

πŸ”₯πŸ•·οΈ Crawl4AI: Web Data for your Thoughts

@@ -137,675 +32,42 @@ 2
+ + {{ try_it | safe }} -
-
-

Try It Now

-
-
-
- - -
-
- - -
-
- - -
-
- - -
-
- - -
-
- - -
-
- - -
-
-
- - -
-
- - -
- -
+
+
+
+ -
- -
- - - -
-
-
- - -
-
+ +
+ {{installation | safe}} {{how_to_guide | safe}} -
-
- - - - -
-
-
-                                
-                                
-                            
- - - -
+
+

Chunking Strategies

+

Content for chunking strategies...

+
+
+

Extraction Strategies

+

Content for extraction strategies...

+
-
-
-

Installation πŸ’»

-

There are two ways to use Crawl4AI: as a library in your Python projects or as a standalone local server.

- -

You can also try Crawl4AI in a Google Colab Open In Colab

+ -

Using Crawl4AI as a Library πŸ“š

-

To install Crawl4AI as a library, follow these steps:

- -
    -
  1. - Install the package from GitHub: -
    pip install git+https://github.com/unclecode/crawl4ai.git
    -
  2. -
  3. - Alternatively, you can clone the repository and install the package locally: -
    virtualenv venv
    -source venv/bin/activate
    -git clone https://github.com/unclecode/crawl4ai.git
    -cd crawl4ai
    -pip install -e .
    -        
    -
  4. -
  5. - Import the necessary modules in your Python script: -
    from crawl4ai.web_crawler import WebCrawler
    -from crawl4ai.chunking_strategy import *
    -from crawl4ai.extraction_strategy import *
    -import os
    -
    -crawler = WebCrawler()
    -
    -# Single page crawl
    -single_url = UrlModel(url='https://www.nbcnews.com/business', forced=False)
    -result = crawl4ai.fetch_page(
    -    url='https://www.nbcnews.com/business',
    -    word_count_threshold=5, # Minimum word count for a HTML tag to be considered as a worthy block
    -    chunking_strategy= RegexChunking( patterns = ["\\n\\n"]), # Default is RegexChunking
    -    extraction_strategy= CosineStrategy(word_count_threshold=20, max_dist=0.2, linkage_method='ward', top_k=3) # Default is CosineStrategy
    -    # extraction_strategy= LLMExtractionStrategy(provider= "openai/gpt-4o", api_token = os.getenv('OPENAI_API_KEY')),
    -    bypass_cache=False,
    -    extract_blocks =True, # Whether to extract semantical blocks of text from the HTML
    -    css_selector = "", # Eg: "div.article-body"
    -    verbose=True,
    -    include_raw_html=True, # Whether to include the raw HTML content in the response
    -)
    -print(result.model_dump())
    -        
    -
  6. -
-

For more information about how to run Crawl4AI as a local server, please refer to the GitHub repository.

- -

πŸ“– Parameters

-
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
ParameterDescriptionRequiredDefault Value
urls - A list of URLs to crawl and extract data from. - Yes-
include_raw_html - Whether to include the raw HTML content in the response. - Nofalse
bypass_cache - Whether to force a fresh crawl even if the URL has been previously crawled. - Nofalse
extract_blocks - Whether to extract semantical blocks of text from the HTML. - Notrue
word_count_threshold - The minimum number of words a block must contain to be considered meaningful (minimum - value is 5). - No5
extraction_strategy - The strategy to use for extracting content from the HTML (e.g., "CosineStrategy"). - NoCosineStrategy
chunking_strategy - The strategy to use for chunking the text before processing (e.g., "RegexChunking"). - NoRegexChunking
css_selector - The CSS selector to target specific parts of the HTML for extraction. - NoNone
verboseWhether to enable verbose logging.Notrue
-
-
- -
-
-

Extraction Strategies

-
-
-
- -
-
-

Chunking Strategies

-
-
-
- -
-
-

πŸ€” Why building this?

-

- In recent times, we've witnessed a surge of startups emerging, riding the AI hype wave and charging - for services that should rightfully be accessible to everyone. πŸŒπŸ’Έ One such example is scraping and - crawling web pages and transforming them into a format suitable for Large Language Models (LLMs). - πŸ•ΈοΈπŸ€– We believe that building a business around this is not the right approach; instead, it should - definitely be open-source. πŸ†“πŸŒŸ So, if you possess the skills to build such tools and share our - philosophy, we invite you to join our "Robinhood" band and help set these products free for the - benefit of all. 🀝πŸ’ͺ -

-
-
- -
- -
- - - - + {{ footer | safe }} + diff --git a/pages/index_pooling.html b/pages/index_pooling.html index 50e57f01..920801d1 100644 --- a/pages/index_pooling.html +++ b/pages/index_pooling.html @@ -283,7 +283,7 @@ .post("/crawl", data) .then((response) => { const result = response.data.results[0]; - const parsedJson = JSON.parse(result.parsed_json); + const parsedJson = JSON.parse(result.extracted_content); document.getElementById("json-result").textContent = JSON.stringify(parsedJson, null, 2); document.getElementById("cleaned-html-result").textContent = result.cleaned_html; document.getElementById("markdown-result").textContent = result.markdown; diff --git a/pages/partial/footer.html b/pages/partial/footer.html new file mode 100644 index 00000000..3ab189e1 --- /dev/null +++ b/pages/partial/footer.html @@ -0,0 +1,36 @@ +
+
+

πŸ€” Why build this?

+

+ In recent times, we've witnessed a surge of startups emerging, riding the AI hype wave and charging + for services that should rightfully be accessible to everyone. πŸŒπŸ’Έ One such example is scraping and + crawling web pages and transforming them into a format suitable for Large Language Models (LLMs). + πŸ•ΈοΈπŸ€– We believe that building a business around this is not the right approach; instead, it should + definitely be open-source. πŸ†“πŸŒŸ So, if you possess the skills to build such tools and share our + philosophy, we invite you to join our "Robinhood" band and help set these products free for the + benefit of all. 🀝πŸ’ͺ +

+
+
+ + \ No newline at end of file diff --git a/pages/partial/how_to_guide.html b/pages/partial/how_to_guide.html new file mode 100644 index 00000000..b8f85ed6 --- /dev/null +++ b/pages/partial/how_to_guide.html @@ -0,0 +1,160 @@ +
+

How to Guide

+
+ +
+ 🌟 + Welcome to the Crawl4ai Quickstart Guide! Let's dive into some web crawling + fun! +
+
+ First Step: Create an instance of WebCrawler and call the + warmup() function. +
+
+
crawler = WebCrawler()
+crawler.warmup()
+
+ + +
+ 🧠 Understanding 'bypass_cache' and 'include_raw_html' parameters: +
+
First crawl (caches the result):
+
+
result = crawler.run(url="https://www.nbcnews.com/business")
+
+
Second crawl (force a fresh crawl):
+
+
result = crawler.run(url="https://www.nbcnews.com/business", bypass_cache=True)
+
⚠️ Don't forget to set `bypass_cache` to True if you want to try different strategies for the same URL; otherwise, the cached result will be returned. You can also set `always_by_pass_cache` to True in the constructor to always bypass the cache.
+
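The cache is keyed by URL only, which is why re-running with a different strategy still returns the old result unless you bypass it. A toy sketch of that behavior (not the library's actual cache code):

```python
cache = {}

def run(url, strategy, bypass_cache=False):
    # Return the cached result unless told to bypass it.
    # Note: the strategy is NOT part of the cache key.
    if url in cache and not bypass_cache:
        return cache[url]
    result = f"{strategy} result for {url}"
    cache[url] = result
    return result

first = run("https://example.com", "CosineStrategy")
second = run("https://example.com", "LLMExtractionStrategy")  # stale: served from cache
fresh = run("https://example.com", "LLMExtractionStrategy", bypass_cache=True)
```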
+
Crawl result without raw HTML content:
+
+
result = crawler.run(url="https://www.nbcnews.com/business", include_raw_html=False)
+
+ + +
+ πŸ“„ + The 'include_raw_html' parameter, when set to True, includes the raw HTML content + in the response. By default, it is set to True. +
+
Set always_by_pass_cache to True:
+
+
crawler.always_by_pass_cache = True
+
+ + +
+ 🧩 Let's add a chunking strategy: RegexChunking! +
+
Using RegexChunking:
+
+
result = crawler.run(
+    url="https://www.nbcnews.com/business",
+    chunking_strategy=RegexChunking(patterns=["\n\n"])
+)
+
+
Using NlpSentenceChunking:
+
+
result = crawler.run(
+    url="https://www.nbcnews.com/business",
+    chunking_strategy=NlpSentenceChunking()
+)
+
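Under the hood, a regex-based chunker simply splits the page text on the given patterns. A rough, hypothetical pure-Python equivalent of what `RegexChunking(patterns=["\n\n"])` does, not the library's actual implementation:

```python
import re

def regex_chunk(text, patterns=None):
    # Split text into chunks on each pattern (default: blank lines)
    patterns = patterns or [r"\n\n"]
    chunks = [text]
    for pattern in patterns:
        chunks = [piece for chunk in chunks for piece in re.split(pattern, chunk)]
    return [c.strip() for c in chunks if c.strip()]

chunks = regex_chunk("First paragraph.\n\nSecond paragraph.")
# β†’ ['First paragraph.', 'Second paragraph.']
```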
+ + +
+ 🧠 Let's get smarter with an extraction strategy: CosineStrategy! +
+
Using CosineStrategy:
+
+
result = crawler.run(
+    url="https://www.nbcnews.com/business",
+    extraction_strategy=CosineStrategy(word_count_threshold=10, max_dist=0.2, linkage_method="ward", top_k=3)
+)
+
+ + +
+ πŸ€– + Time to bring in the big guns: LLMExtractionStrategy without instructions! +
+
Using LLMExtractionStrategy without instructions:
+
+
result = crawler.run(
+    url="https://www.nbcnews.com/business",
+    extraction_strategy=LLMExtractionStrategy(provider="openai/gpt-4o", api_token=os.getenv('OPENAI_API_KEY'))
+)
+
+ + +
+ πŸ“œ + Let's make it even more interesting: LLMExtractionStrategy with + instructions! +
+
Using LLMExtractionStrategy with instructions:
+
+
result = crawler.run(
+    url="https://www.nbcnews.com/business",
+    extraction_strategy=LLMExtractionStrategy(
+        provider="openai/gpt-4o",
+        api_token=os.getenv('OPENAI_API_KEY'),
+        instruction="I am interested in only financial news"
+    )
+)
+
+ + +
+ 🎯 + Targeted extraction: Let's use a CSS selector to extract only H2 tags! +
+
Using CSS selector to extract H2 tags:
+
+
result = crawler.run(
+    url="https://www.nbcnews.com/business",
+    css_selector="h2"
+)
+
+ + +
+ πŸ–±οΈ + Let's get interactive: Passing JavaScript code to click 'Load More' button! +
+
Using JavaScript to click 'Load More' button:
+
+
js_code = """
+const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More'));
+loadMoreButton && loadMoreButton.click();
+"""
+crawler_strategy = LocalSeleniumCrawlerStrategy(js_code=js_code)
+crawler = WebCrawler(crawler_strategy=crawler_strategy, always_by_pass_cache=True)
+result = crawler.run(url="https://www.nbcnews.com/business")
+
+ + +
+ 🎉 + Congratulations! You've made it through the Crawl4AI Quickstart Guide! Now go forth + and crawl the web like a pro! 🕸️ +
+
+
\ No newline at end of file diff --git a/pages/partial/installation.html b/pages/partial/installation.html new file mode 100644 index 00000000..919db240 --- /dev/null +++ b/pages/partial/installation.html @@ -0,0 +1,56 @@ +
+

Installation πŸ’»

+

+ There are three ways to use Crawl4AI: +

    +
  1. + As a library +
  2. +
  3. + As a local server (Docker) +
  4. +
  5. + As a Google Colab notebook. Open In Colab +
  6. +

    + + +

    To install Crawl4AI as a library, follow these steps:

    + +
      +
    1. + Install the package from GitHub: +
      pip install git+https://github.com/unclecode/crawl4ai.git
      +
    2. +
    3. + Alternatively, you can clone the repository and install the package locally: +
      virtualenv venv
      +source venv/bin/activate
      +git clone https://github.com/unclecode/crawl4ai.git
      +cd crawl4ai
      +pip install -e .
      +
      +
    4. +
    5. + Use Docker to run the local server: +
      docker build -t crawl4ai . 
      +# For Mac users (Apple Silicon): docker build --platform linux/amd64 -t crawl4ai .
      +docker run -d -p 8000:80 crawl4ai
      +
    6. +
    +

    + For more information about how to run Crawl4AI as a local server, please refer to the + GitHub repository. +

    +
\ No newline at end of file diff --git a/pages/partial/try_it.html b/pages/partial/try_it.html new file mode 100644 index 00000000..56f85062 --- /dev/null +++ b/pages/partial/try_it.html @@ -0,0 +1,204 @@ +
+
+

Try It Now

+
+
+
+ + +
+
+
+ + +
+
+ + +
+
+
+
+ + +
+
+ + +
+
+ +
+ +
+ + +
+ +
+
+
+ + +
+
+ + +
+ +
+
+ +
+ +
+ + + +
+
+
+ + +
+
+ +
+
+ + + + +
+
+
+                        
+                        
+                    
+ + + +
+
+
+
+
diff --git a/pages/tmp.html b/pages/tmp.html new file mode 100644 index 00000000..190afd98 --- /dev/null +++ b/pages/tmp.html @@ -0,0 +1,435 @@ +
+
+

Installation πŸ’»

+

There are three ways to use Crawl4AI:

+
    +
  1. As a library
  2. +
  3. As a local server (Docker)
  4. +
  5. + As a Google Colab notebook. + Open In Colab +
  6. +

    + +

    To install Crawl4AI as a library, follow these steps:

    + +
      +
    1. + Install the package from GitHub: +
      pip install git+https://github.com/unclecode/crawl4ai.git
      +
    2. +
    3. + Alternatively, you can clone the repository and install the package locally: +
      virtualenv venv
      +source venv/bin/activate
      +git clone https://github.com/unclecode/crawl4ai.git
      +cd crawl4ai
      +pip install -e .
      +
      +
    4. +
    5. + Use Docker to run the local server: +
      docker build -t crawl4ai . 
      +# For Mac users (Apple Silicon): docker build --platform linux/amd64 -t crawl4ai .
      +docker run -d -p 8000:80 crawl4ai
      +
    6. +
    +

    + For more information about how to run Crawl4AI as a local server, please refer to the + GitHub repository. +

    +
+
+
+

How to Guide

+
+ +
+ 🌟 + Welcome to the Crawl4AI Quickstart Guide! Let's dive into some web crawling fun! +
+
+ First Step: Create an instance of WebCrawler and call the + warmup() function. +
+
+
crawler = WebCrawler()
+crawler.warmup()
+
+ + +
+ 🧠 Understanding 'bypass_cache' and 'include_raw_html' parameters: +
+
First crawl (caches the result):
+
+
result = crawler.run(url="https://www.nbcnews.com/business")
+
+
Second crawl (force a fresh crawl):
+
+
result = crawler.run(url="https://www.nbcnews.com/business", bypass_cache=True)
+
+ ⚠️ Don't forget to set `bypass_cache` to True if you want to try different strategies + for the same URL; otherwise, the cached result will be returned. You can also set + `always_by_pass_cache` to True in the constructor to always bypass the cache. +
+
+
Crawl result without raw HTML content:
+
+
result = crawler.run(url="https://www.nbcnews.com/business", include_raw_html=False)
+
+ + +
+ πŸ“„ + The 'include_raw_html' parameter, when set to True, includes the raw HTML content in the response. + By default, it is set to True. +
+
Set always_by_pass_cache to True:
+
+
crawler.always_by_pass_cache = True
+
+ + +
+ 🧩 Let's add a chunking strategy: RegexChunking! +
+
Using RegexChunking:
+
+
result = crawler.run(
+url="https://www.nbcnews.com/business",
+chunking_strategy=RegexChunking(patterns=["\n\n"])
+)
+
+
Using NlpSentenceChunking:
+
+
result = crawler.run(
+url="https://www.nbcnews.com/business",
+chunking_strategy=NlpSentenceChunking()
+)
+
+ + +
+ 🧠 Let's get smarter with an extraction strategy: CosineStrategy! +
+
Using CosineStrategy:
+
+
result = crawler.run(
+url="https://www.nbcnews.com/business",
+extraction_strategy=CosineStrategy(word_count_threshold=20, max_dist=0.2, linkage_method="ward", top_k=3)
+)
+
+ + +
+ πŸ€– + Time to bring in the big guns: LLMExtractionStrategy without instructions! +
+
Using LLMExtractionStrategy without instructions:
+
+
result = crawler.run(
+url="https://www.nbcnews.com/business",
+extraction_strategy=LLMExtractionStrategy(provider="openai/gpt-4o", api_token=os.getenv('OPENAI_API_KEY'))
+)
+
+ + +
+ πŸ“œ + Let's make it even more interesting: LLMExtractionStrategy with instructions! +
+
Using LLMExtractionStrategy with instructions:
+
+
result = crawler.run(
+url="https://www.nbcnews.com/business",
+extraction_strategy=LLMExtractionStrategy(
+    provider="openai/gpt-4o",
+    api_token=os.getenv('OPENAI_API_KEY'),
+    instruction="I am interested in only financial news"
+    )
+)
+
+ + +
+ 🎯 + Targeted extraction: Let's use a CSS selector to extract only H2 tags! +
+
Using CSS selector to extract H2 tags:
+
+
result = crawler.run(
+url="https://www.nbcnews.com/business",
+css_selector="h2"
+)
+
+ + +
+ πŸ–±οΈ + Let's get interactive: Passing JavaScript code to click 'Load More' button! +
+
Using JavaScript to click 'Load More' button:
+
+
js_code = """
+const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More'));
+loadMoreButton && loadMoreButton.click();
+"""
+crawler_strategy = LocalSeleniumCrawlerStrategy(js_code=js_code)
+crawler = WebCrawler(crawler_strategy=crawler_strategy, always_by_pass_cache=True)
+result = crawler.run(url="https://www.nbcnews.com/business")
+
+ + +
+ 🎉 + Congratulations! You've made it through the Crawl4AI Quickstart Guide! Now go forth and crawl the + web like a pro! 🕸️ +
+
+
+ +
+
+
+

RegexChunking

+

+ RegexChunking is a text chunking strategy that splits a given text into smaller parts + using regular expressions. This is useful for preparing large texts for processing by language + models, ensuring they are divided into manageable segments. +

+

Constructor Parameters:

+
    +
  • + patterns (list, optional): A list of regular expression patterns used to split the + text. Default is to split by double newlines (['\n\n']). +
  • +
+

Example usage:

+
chunker = RegexChunking(patterns=[r'\n\n', r'\. '])
+chunks = chunker.chunk("This is a sample text. It will be split into chunks.")
+
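Conceptually, regex chunking amounts to repeated `re.split` calls over the configured patterns. The following is a minimal stand-alone sketch, not the library's exact code; `regex_chunk` is a hypothetical helper:

```python
import re

def regex_chunk(text, patterns=(r"\n\n",)):
    """Split text on each pattern in turn, keeping only non-empty chunks."""
    chunks = [text]
    for pattern in patterns:
        # Split every current chunk on this pattern and flatten the result.
        chunks = [piece for chunk in chunks for piece in re.split(pattern, chunk)]
    return [c.strip() for c in chunks if c.strip()]

chunks = regex_chunk(
    "Intro paragraph.\n\nSecond paragraph. Third sentence.",
    patterns=[r"\n\n", r"\. "],
)
# → ['Intro paragraph.', 'Second paragraph', 'Third sentence.']
```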
+
+
+
+
+

NlpSentenceChunking

+

+ NlpSentenceChunking uses a natural language processing model to chunk a given text into + sentences. This approach leverages SpaCy to accurately split text based on sentence boundaries. +

+

Constructor Parameters:

+
    +
  • + model (str, optional): The SpaCy model to use for sentence detection. Default is + 'en_core_web_sm'. +
  • +
+

Example usage:

+
chunker = NlpSentenceChunking(model='en_core_web_sm')
+chunks = chunker.chunk("This is a sample text. It will be split into sentences.")
+
+
+
+
+
+

TopicSegmentationChunking

+

+ TopicSegmentationChunking uses the TextTiling algorithm to segment a given text into + topic-based chunks. This method identifies thematic boundaries in the text. +

+

Constructor Parameters:

+
    +
  • + num_keywords (int, optional): The number of keywords to extract for each topic + segment. Default is 3. +
  • +
+

Example usage:

+
chunker = TopicSegmentationChunking(num_keywords=3)
+chunks = chunker.chunk("This is a sample text. It will be split into topic-based segments.")
+
+
+
+
+
+

FixedLengthWordChunking

+

+ FixedLengthWordChunking splits a given text into chunks of fixed length, based on the + number of words. +

+

Constructor Parameters:

+
    +
  • + chunk_size (int, optional): The number of words in each chunk. Default is + 100. +
  • +
+

Example usage:

+
chunker = FixedLengthWordChunking(chunk_size=100)
+chunks = chunker.chunk("This is a sample text. It will be split into fixed-length word chunks.")
+
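The mechanics can be sketched in a few lines of plain Python. `fixed_length_chunks` is a hypothetical helper for illustration, not the library's implementation:

```python
def fixed_length_chunks(text, chunk_size=100):
    """Group the words of `text` into chunks of at most `chunk_size` words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]

chunks = fixed_length_chunks("one two three four five", chunk_size=2)
# → ['one two', 'three four', 'five']  (the last chunk may be shorter)
```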
+
+
+
+
+

SlidingWindowChunking

+

+ SlidingWindowChunking uses a sliding window approach to chunk a given text. Each chunk + has a fixed length, and the window slides by a specified step size. +

+

Constructor Parameters:

+
    +
  • + window_size (int, optional): The number of words in each chunk. Default is + 100. +
  • +
  • + step (int, optional): The number of words to slide the window. Default is + 50. +
  • +
+

Example usage:

+
chunker = SlidingWindowChunking(window_size=100, step=50)
+chunks = chunker.chunk("This is a sample text. It will be split using a sliding window approach.")
+
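The window/step mechanics can be sketched as follows. `sliding_window_chunks` is a hypothetical helper, not the library's implementation, and this sketch drops a final partial window:

```python
def sliding_window_chunks(text, window_size=100, step=50):
    """Return overlapping chunks of `window_size` words, advancing `step` words each time."""
    words = text.split()
    if len(words) <= window_size:
        return [" ".join(words)]          # text fits in a single window
    return [" ".join(words[i:i + window_size])
            for i in range(0, len(words) - window_size + 1, step)]

chunks = sliding_window_chunks("a b c d e f g", window_size=4, step=2)
# → ['a b c d', 'c d e f']  (windows overlap by window_size - step words)
```

Setting `step` equal to `window_size` removes the overlap and degenerates to fixed-length chunking.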
+
+
+
+
+
+
+

NoExtractionStrategy

+

+ NoExtractionStrategy is a basic extraction strategy that returns the entire HTML + content without any modification. It is useful for cases where no specific extraction is required; + the result contains only the cleaned HTML and generated markdown. +

+

Constructor Parameters:

+

None.

+

Example usage:

+
extractor = NoExtractionStrategy()
+extracted_content = extractor.extract(url, html)
+
+
+
+
+
+

LLMExtractionStrategy

+

+ LLMExtractionStrategy uses a Language Model (LLM) to extract meaningful blocks or + chunks from the given HTML content. This strategy leverages an external provider for language model + completions. +

+

Constructor Parameters:

+
    +
  • + provider (str, optional): The provider to use for the language model completions. + Default is DEFAULT_PROVIDER (e.g., openai/gpt-4). +
  • +
  • + api_token (str, optional): The API token for the provider. If not provided, it will + try to load from the environment variable OPENAI_API_KEY. +
  • +
  • + instruction (str, optional): An instruction to guide the LLM on how to perform the + extraction. This allows users to specify the type of data they are interested in or set the tone + of the response. Default is None. +
  • +
+

Example usage:

+
extractor = LLMExtractionStrategy(provider='openai/gpt-4o', api_token='your_api_token', instruction='Extract only news about AI.')
+extracted_content = extractor.extract(url, html)
+
+

+ By providing clear instructions, users can tailor the extraction process to their specific needs, + enhancing the relevance and utility of the extracted content. +

+
+
+
+
+

CosineStrategy

+

+ CosineStrategy uses hierarchical clustering based on cosine similarity to extract + clusters of text from the given HTML content. This strategy is suitable for identifying related + content sections. +

+

Constructor Parameters:

+
    +
  • + semantic_filter (str, optional): A string containing keywords for filtering relevant + documents before clustering. If provided, documents are filtered based on their cosine + similarity to the keyword filter embedding. Default is None. +
  • +
  • + word_count_threshold (int, optional): Minimum number of words per cluster. Default + is 20. +
  • +
  • + max_dist (float, optional): The maximum cophenetic distance on the dendrogram to + form clusters. Default is 0.2. +
  • +
  • + linkage_method (str, optional): The linkage method for hierarchical clustering. + Default is 'ward'. +
  • +
  • + top_k (int, optional): Number of top categories to extract. Default is + 3. +
  • +
  • + model_name (str, optional): The model name for embedding generation. Default is + 'BAAI/bge-small-en-v1.5'. +
  • +
+

Example usage:

+
extractor = CosineStrategy(semantic_filter='artificial intelligence', word_count_threshold=10, max_dist=0.2, linkage_method='ward', top_k=3, model_name='BAAI/bge-small-en-v1.5')
+extracted_content = extractor.extract(url, html)
+
+

Cosine Similarity Filtering

+

+ When a semantic_filter is provided, the CosineStrategy applies an + embedding-based filtering process to select relevant documents before performing hierarchical + clustering. +

+
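The filtering step can be sketched with plain cosine similarity. The toy 2-d vectors below stand in for real sentence embeddings (such as those produced by BAAI/bge-small-en-v1.5), and the helper names are hypothetical, not the library's code:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def filter_by_similarity(doc_embeddings, filter_embedding, threshold=0.5):
    """Keep indices of documents whose embedding is close enough to the filter embedding."""
    return [i for i, emb in enumerate(doc_embeddings)
            if cosine_similarity(emb, filter_embedding) >= threshold]

# Toy "embeddings": docs 0 and 2 point roughly the same way as the filter vector.
docs = [(1.0, 0.1), (0.0, 1.0), (0.9, 0.2)]
kept = filter_by_similarity(docs, (1.0, 0.0), threshold=0.5)
# → [0, 2]
```

Only the surviving documents would then be passed on to hierarchical clustering.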
+
+
+
+

TopicExtractionStrategy

+

+ TopicExtractionStrategy uses the TextTiling algorithm to segment the HTML content into + topics and extracts keywords for each segment. This strategy is useful for identifying and + summarizing thematic content. +

+

Constructor Parameters:

+
    +
  • + num_keywords (int, optional): Number of keywords to represent each topic segment. + Default is 3. +
  • +
+

Example usage:

+
extractor = TopicExtractionStrategy(num_keywords=3)
+extracted_content = extractor.extract(url, html)
+
+
+
+
+
diff --git a/requirements.txt b/requirements.txt index 7995d648..add8619d 100644 --- a/requirements.txt +++ b/requirements.txt @@ -13,4 +13,5 @@ litellm python-dotenv nltk lazy_import +rich # spacy \ No newline at end of file