Compare commits

...

16 Commits

Author SHA1 Message Date
unclecode
6f96dcd649 chore: Update README 2024-05-17 18:12:50 +08:00
unclecode
957a2458b1 chore: Update web crawler URLs to use NBC News business section 2024-05-17 18:11:13 +08:00
unclecode
36e46be23d chore: Add verbose option to ExtractionStrategy classes
This commit adds a new `verbose` option to the `ExtractionStrategy` classes. The `verbose` option allows for logging of extraction details, such as the number of extracted blocks and the URL being processed. This improves the debugging and monitoring capabilities of the code.
2024-05-17 18:06:10 +08:00
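The opt-in logging pattern this commit describes can be sketched as follows. This is a standalone illustration only, not the library's full `ExtractionStrategy` API; the class body and log format are assumptions based on the commit message:

```python
# Minimal sketch of the opt-in verbose logging pattern from this commit.
# Standalone illustration; not the library's actual ExtractionStrategy class.
class ExtractionStrategy:
    def __init__(self, **kwargs):
        # Logging stays off unless the caller explicitly opts in
        self.verbose = kwargs.get("verbose", False)

    def extract(self, url, blocks):
        # Log extraction details only when verbose mode is enabled
        if self.verbose:
            print(f"[LOG] Extracted {len(blocks)} blocks from URL: {url}")
        return blocks

quiet = ExtractionStrategy()
chatty = ExtractionStrategy(verbose=True)
result = chatty.extract("https://example.com", ["block-1", "block-2"])
```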
unclecode
32c87f0388 chore: Update NlpSentenceChunking constructor parameters to None
The NlpSentenceChunking constructor parameters have been updated to None in order to simplify the usage of the class. This change removes the need for specifying the SpaCy model for sentence detection, making the code more concise and easier to understand.
2024-05-17 17:00:43 +08:00
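A minimal sketch of the simplified, argument-free constructor. The real class (shown in the diff below) delegates sentence splitting to NLTK's `sent_tokenize`; the regex splitter here is a dependency-free stand-in for illustration, and unlike the real implementation it preserves sentence order:

```python
import re

# Sketch of the simplified NlpSentenceChunking: no model parameter.
# The regex splitter is an illustrative stand-in for NLTK's sent_tokenize.
class NlpSentenceChunking:
    def __init__(self):
        # No SpaCy model argument any more; nothing to configure
        pass

    def chunk(self, text: str) -> list:
        # Split after sentence-ending punctuation followed by whitespace
        sentences = re.split(r'(?<=[.!?])\s+', text)
        return [s.strip() for s in sentences if s.strip()]

chunker = NlpSentenceChunking()  # no arguments needed
chunks = chunker.chunk("This is a sample text. It will be split into sentences.")
```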
unclecode
647cfda225 chore: Update Crawl4AI quickstart script in README.md
This commit updates the Crawl4AI quickstart script in the README.md file. The script is now properly formatted and aligned, making it easier to read and understand. The unnecessary indentation has been removed, and the script is now more concise and efficient.
2024-05-17 16:55:34 +08:00
unclecode
1cc67df301 chore: Update pip installation command and requirements, add new dependencies 2024-05-17 16:53:03 +08:00
unclecode
d7b37e849d chore: Update CrawlRequest model to use NoExtractionStrategy as default 2024-05-17 16:50:38 +08:00
unclecode
f52f526002 chore: Update web_crawler.py to use NoExtractionStrategy as default 2024-05-17 16:03:35 +08:00
unclecode
3593f017d7 chore: Update setup.py to exclude torch, transformers, and nltk dependencies
This commit updates the setup.py file to exclude the torch, transformers, and nltk dependencies from the install_requires section. Instead, it creates separate extras_require sections for different environments, including all requirements, excluding torch for Colab, and excluding torch, transformers, and nltk for the crawl environment.
2024-05-17 16:01:04 +08:00
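The dependency split described in this commit might look like the sketch below. The package lists are hypothetical, not copied from the repo's setup.py; only the three-way split (all / no torch for Colab / lean crawl environment) comes from the commit message:

```python
# Hypothetical sketch of the extras_require split described above; the
# actual package lists in setup.py are assumptions, not the real ones.
base_requires = ["requests", "beautifulsoup4"]       # always installed
heavy_requires = ["torch", "transformers", "nltk"]   # heavyweight ML extras

# This mapping would be passed as extras_require= to setuptools.setup()
extras_require = {
    "all": base_requires + heavy_requires,             # everything
    "colab": base_requires + ["transformers", "nltk"], # exclude torch for Colab
    "crawl": base_requires,                            # lean crawl-only env
}
```

Users would then pick an environment at install time, e.g. `pip install "crawl4ai[all]"`.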
unclecode
e7bb76f19b chore: Update torch dependency to version 2.3.0 2024-05-17 15:52:39 +08:00
unclecode
593b928967 Update requirements.txt to include latest versions of dependencies 2024-05-17 15:48:14 +08:00
unclecode
bb3d37face chore: Update requirements.txt to include latest versions of dependencies 2024-05-17 15:32:37 +08:00
unclecode
3f8576f870 chore: Update model_loader.py to use pretrained models without resume_download 2024-05-17 15:26:15 +08:00
unclecode
bf3b040f10 chore: Update pip installation command and requirements, add new dependencies 2024-05-17 15:21:45 +08:00
unclecode
a317dc5e1d Load CosineStrategy in the function 2024-05-17 15:13:06 +08:00
unclecode
a5f9d07dbf Remove dependency on Spacy model. 2024-05-17 15:08:03 +08:00
26 changed files with 215 additions and 84065 deletions

View File

@@ -22,32 +22,26 @@ Crawl4AI has one clear task: to simplify crawling and extract useful information
## Power and Simplicity of Crawl4AI 🚀
Crawl4AI makes even complex web crawling tasks simple and intuitive.
To show the simplicity, take a look at the first example:
**Example Task:**
```python
from crawl4ai import WebCrawler
# Create the WebCrawler instance
crawler = WebCrawler()
# Run the crawler with keyword filtering and CSS selector
result = crawler.run(url="https://www.nbcnews.com/business")
print(result) # {url, html, markdown, extracted_content, metadata}
```
Now let's try a complex task. Below is an example of how you can execute JavaScript, filter data using keywords, and use a CSS selector to extract specific content—all in one go!
1. Instantiate a WebCrawler object.
2. Execute custom JavaScript to click a "Load More" button.
3. Filter the data to include only content related to "technology".
3. Extract semantic chunks of content and filter the data to include only content related to technology.
4. Use a CSS selector to extract only paragraphs (`<p>` tags).
**Example Code:**
First, simply install the package:
```bash
virtualenv venv
source venv/bin/activate
# Install Crawl4AI
pip install git+https://github.com/unclecode/crawl4ai.git
```
Run the following command to load the required models. This is optional, but it will boost the performance and speed of the crawler. You need to do this only once.
```bash
crawl4ai-download-models
```
Now, you can run the following code:
```python
# Import necessary modules
from crawl4ai import WebCrawler
@@ -69,7 +63,7 @@ crawler = WebCrawler(crawler_strategy=crawler_strategy)
# Run the crawler with keyword filtering and CSS selector
result = crawler.run(
url="https://www.example.com",
url="https://www.nbcnews.com/business",
extraction_strategy=CosineStrategy(
semantic_filter="technology",
),
@@ -77,7 +71,7 @@ result = crawler.run(
# Run the crawler with LLM extraction strategy
result = crawler.run(
url="https://www.example.com",
url="https://www.nbcnews.com/business",
extraction_strategy=LLMExtractionStrategy(
provider="openai/gpt-4o",
api_token=os.getenv('OPENAI_API_KEY'),
@@ -99,16 +93,16 @@ With Crawl4AI, you can perform advanced web crawling and data extraction tasks w
## Table of Contents
1. [Features](#features)
2. [Installation](#installation)
3. [REST API/Local Server](#using-the-local-server-ot-rest-api)
4. [Python Library Usage](#usage)
5. [Parameters](#parameters)
6. [Chunking Strategies](#chunking-strategies)
7. [Extraction Strategies](#extraction-strategies)
8. [Contributing](#contributing)
9. [License](#license)
10. [Contact](#contact)
1. [Features](#features-)
2. [Installation](#installation-)
3. [REST API/Local Server](#using-the-local-server-ot-rest-api-)
4. [Python Library Usage](#python-library-usage-)
5. [Parameters](#parameters-)
6. [Chunking Strategies](#chunking-strategies-)
7. [Extraction Strategies](#extraction-strategies-)
8. [Contributing](#contributing-)
9. [License](#license-)
10. [Contact](#contact-)
## Features ✨
@@ -137,7 +131,7 @@ To install Crawl4AI as a library, follow these steps:
```bash
virtualenv venv
source venv/bin/activate
pip install git+https://github.com/unclecode/crawl4ai.git
pip install "crawl4ai[all] @ git+https://github.com/unclecode/crawl4ai.git"
```
💡 It's better to run the following CLI command to load the required models. This is optional, but it will boost the performance and speed of the crawler. You only need to do this once.
@@ -150,12 +144,12 @@ virtualenv venv
source venv/bin/activate
git clone https://github.com/unclecode/crawl4ai.git
cd crawl4ai
pip install -e .
pip install -e .[all]
```
3. Use docker to run the local server:
```bash
docker build -t crawl4ai .
docker build -t crawl4ai .
# For Mac users
# docker build --platform linux/amd64 -t crawl4ai .
docker run -d -p 8000:80 crawl4ai
@@ -174,7 +168,7 @@ To use the REST API, send a POST request to `https://crawl4ai.com/crawl` with th
**Example Request:**
```json
{
"urls": ["https://www.example.com"],
"urls": ["https://www.nbcnews.com/business"],
"include_raw_html": false,
"bypass_cache": true,
"word_count_threshold": 5,
@@ -201,7 +195,7 @@ To use the REST API, send a POST request to `https://crawl4ai.com/crawl` with th
"status": "success",
"data": [
{
"url": "https://www.example.com",
"url": "https://www.nbcnews.com/business",
"extracted_content": "...",
"html": "...",
"markdown": "...",
@@ -216,7 +210,7 @@ For more information about the available parameters and their descriptions, refe
## Python Library Usage 🚀
🔥 A great way to try out Crawl4AI is to run `quickstart.py` in the `docs/examples` directory. This script demonstrates how to use Crawl4AI to crawl a website and extract content from it.
🔥 A great way to try out Crawl4AI is to run `quickstart.py` in the `docs/examples` directory. This script demonstrates how to use Crawl4AI to crawl a website and extract content from it.
### Quickstart Guide
@@ -264,6 +258,8 @@ result = crawler.run(
### Extraction strategy: CosineStrategy
So far, the extracted content is just the result of chunking. To extract meaningful content, you can use extraction strategies. These strategies cluster consecutive chunks into meaningful blocks, keeping the same order as the text in the HTML. This approach is perfect for use in RAG applications and semantic search queries.
Using CosineStrategy:
```python
result = crawler.run(
@@ -349,7 +345,7 @@ result = crawler.run(url="https://www.nbcnews.com/business")
| `include_raw_html` | Whether to include the raw HTML content in the response. | No | `false` |
| `bypass_cache` | Whether to force a fresh crawl even if the URL has been previously crawled. | No | `false` |
| `word_count_threshold`| The minimum number of words a block must contain to be considered meaningful (minimum value is 5). | No | `5` |
| `extraction_strategy` | The strategy to use for extracting content from the HTML (e.g., "CosineStrategy"). | No | `CosineStrategy` |
| `extraction_strategy` | The strategy to use for extracting content from the HTML (e.g., "CosineStrategy"). | No | `NoExtractionStrategy` |
| `chunking_strategy` | The strategy to use for chunking the text before processing (e.g., "RegexChunking"). | No | `RegexChunking` |
| `css_selector` | The CSS selector to target specific parts of the HTML for extraction. | No | `None` |
| `verbose` | Whether to enable verbose logging. | No | `true` |
@@ -374,11 +370,11 @@ chunks = chunker.chunk("This is a sample text. It will be split into chunks.")
`NlpSentenceChunking` uses a natural language processing model to chunk a given text into sentences. This approach leverages SpaCy to accurately split text based on sentence boundaries.
**Constructor Parameters:**
- `model` (str, optional): The SpaCy model to use for sentence detection. Default is `'en_core_web_sm'`.
- None.
**Example usage:**
```python
chunker = NlpSentenceChunking(model='en_core_web_sm')
chunker = NlpSentenceChunking()
chunks = chunker.chunk("This is a sample text. It will be split into sentences.")
```
@@ -466,7 +462,7 @@ extracted_content = extractor.extract(url, html)
**Example usage:**
```python
extractor = CosineStrategy(semantic_filter='artificial intelligence', word_count_threshold=10, max_dist=0.2, linkage_method='ward', top_k=3, model_name='BAAI/bge-small-en-v1.5')
extractor = CosineStrategy(semantic_filter='finance rental prices', word_count_threshold=10, max_dist=0.2, linkage_method='ward', top_k=3, model_name='BAAI/bge-small-en-v1.5')
extracted_content = extractor.extract(url, html)
```

View File

@@ -1,12 +1,8 @@
from abc import ABC, abstractmethod
import re
# spacy = lazy_import.lazy_module('spacy')
# nl = lazy_import.lazy_module('nltk')
# from nltk.corpus import stopwords
# from nltk.tokenize import word_tokenize, TextTilingTokenizer
from collections import Counter
import string
from .model_loader import load_spacy_en_core_web_sm
from .model_loader import load_nltk_punkt
# Define the abstract base class for chunking strategies
class ChunkingStrategy(ABC):
@@ -34,15 +30,24 @@ class RegexChunking(ChunkingStrategy):
paragraphs = new_paragraphs
return paragraphs
# NLP-based sentence chunking using spaCy
# NLP-based sentence chunking
class NlpSentenceChunking(ChunkingStrategy):
def __init__(self, model='en_core_web_sm'):
self.nlp = load_spacy_en_core_web_sm()
def __init__(self):
load_nltk_punkt()
def chunk(self, text: str) -> list:
doc = self.nlp(text)
return [sent.text.strip() for sent in doc.sents]
# Improved regex for sentence splitting
# sentence_endings = re.compile(
# r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<![A-Z][A-Z]\.)(?<![A-Za-z]\.)(?<=\.|\?|\!|\n)\s'
# )
# sentences = sentence_endings.split(text)
# sens = [sent.strip() for sent in sentences if sent]
from nltk.tokenize import sent_tokenize
sentences = sent_tokenize(text)
sens = [sent.strip() for sent in sentences]
return list(set(sens))
# Topic-based segmentation using TextTiling
class TopicSegmentationChunking(ChunkingStrategy):

View File

@@ -7,7 +7,7 @@ from .prompts import PROMPT_EXTRACT_BLOCKS, PROMPT_EXTRACT_BLOCKS_WITH_INSTRUCTI
from .config import *
from .utils import *
from functools import partial
from .model_loader import load_bert_base_uncased, load_bge_small_en_v1_5, load_spacy_model
from .model_loader import *
import numpy as np
@@ -19,6 +19,7 @@ class ExtractionStrategy(ABC):
def __init__(self, **kwargs):
self.DEL = "<|DEL|>"
self.name = self.__class__.__name__
self.verbose = kwargs.get("verbose", False)
@abstractmethod
def extract(self, url: str, html: str, *q, **kwargs) -> List[Dict[str, Any]]:
@@ -45,14 +46,13 @@ class ExtractionStrategy(ABC):
for future in as_completed(futures):
extracted_content.extend(future.result())
return extracted_content
class NoExtractionStrategy(ExtractionStrategy):
def extract(self, url: str, html: str, *q, **kwargs) -> List[Dict[str, Any]]:
return [{"index": 0, "content": html}]
def run(self, url: str, sections: List[str], *q, **kwargs) -> List[Dict[str, Any]]:
return [{"index": i, "tags": [], "content": section} for i, section in enumerate(sections)]
class LLMExtractionStrategy(ExtractionStrategy):
def __init__(self, provider: str = DEFAULT_PROVIDER, api_token: Optional[str] = None, instruction:str = None, **kwargs):
"""
@@ -62,10 +62,11 @@ class LLMExtractionStrategy(ExtractionStrategy):
:param api_token: The API token for the provider.
:param instruction: The instruction to use for the LLM model.
"""
super().__init__()
super().__init__()
self.provider = provider
self.api_token = api_token or PROVIDER_MODELS.get(provider, None) or os.getenv("OPENAI_API_KEY")
self.instruction = instruction
self.verbose = kwargs.get("verbose", False)
if not self.api_token:
raise ValueError("API token must be provided for LLMExtractionStrategy. Update the config.py or set OPENAI_API_KEY environment variable.")
@@ -106,7 +107,8 @@ class LLMExtractionStrategy(ExtractionStrategy):
"content": unparsed
})
print("[LOG] Extracted", len(blocks), "blocks from URL:", url, "block index:", ix)
if self.verbose:
print("[LOG] Extracted", len(blocks), "blocks from URL:", url, "block index:", ix)
return blocks
def _merge(self, documents):
@@ -166,16 +168,13 @@ class CosineStrategy(ExtractionStrategy):
"""
super().__init__()
from transformers import BertTokenizer, BertModel, pipeline
from transformers import AutoTokenizer, AutoModel
import spacy
self.semantic_filter = semantic_filter
self.word_count_threshold = word_count_threshold
self.max_dist = max_dist
self.linkage_method = linkage_method
self.top_k = top_k
self.timer = time.time()
self.verbose = kwargs.get("verbose", False)
self.buffer_embeddings = np.array([])
@@ -184,9 +183,10 @@ class CosineStrategy(ExtractionStrategy):
elif model_name == "BAAI/bge-small-en-v1.5":
self.tokenizer, self.model = load_bge_small_en_v1_5()
self.nlp = load_spacy_model()
print(f"[LOG] Model loaded {model_name}, models/reuters, took " + str(time.time() - self.timer) + " seconds")
self.nlp = load_text_multilabel_classifier()
if self.verbose:
print(f"[LOG] Model loaded {model_name}, models/reuters, took " + str(time.time() - self.timer) + " seconds")
def filter_documents_embeddings(self, documents: List[str], semantic_filter: str, threshold: float = 0.5) -> List[str]:
"""
@@ -310,13 +310,19 @@ class CosineStrategy(ExtractionStrategy):
# Convert filtered clusters to a sorted list of dictionaries
cluster_list = [{"index": int(idx), "tags" : [], "content": " ".join(filtered_clusters[idx])} for idx in sorted(filtered_clusters)]
labels = self.nlp([cluster['content'] for cluster in cluster_list])
for cluster, label in zip(cluster_list, labels):
cluster['tags'] = label
# Process the text with the loaded model
for cluster in cluster_list:
doc = self.nlp(cluster['content'])
tok_k = self.top_k
top_categories = sorted(doc.cats.items(), key=lambda x: x[1], reverse=True)[:tok_k]
cluster['tags'] = [cat for cat, _ in top_categories]
# for cluster in cluster_list:
# cluster['tags'] = self.nlp(cluster['content'])[0]['label']
# doc = self.nlp(cluster['content'])
# tok_k = self.top_k
# top_categories = sorted(doc.cats.items(), key=lambda x: x[1], reverse=True)[:tok_k]
# cluster['tags'] = [cat for cat, _ in top_categories]
# print(f"[LOG] 🚀 Categorization done in {time.time() - t:.2f} seconds")

View File

@@ -28,68 +28,66 @@ def load_bge_small_en_v1_5():
return tokenizer, model
@lru_cache()
def load_spacy_en_core_web_sm():
import spacy
try:
print("[LOG] Loading spaCy model")
nlp = spacy.load("en_core_web_sm")
except IOError:
print("[LOG] ⏬ Downloading spaCy model for the first time")
spacy.cli.download("en_core_web_sm")
nlp = spacy.load("en_core_web_sm")
print("[LOG] ✅ spaCy model loaded successfully")
return nlp
def load_text_classifier():
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from transformers import pipeline
tokenizer = AutoTokenizer.from_pretrained("dstefa/roberta-base_topic_classification_nyt_news")
model = AutoModelForSequenceClassification.from_pretrained("dstefa/roberta-base_topic_classification_nyt_news")
pipe = pipeline("text-classification", model=model, tokenizer=tokenizer)
return pipe
@lru_cache()
def load_spacy_model():
import spacy
name = "models/reuters"
home_folder = get_home_folder()
model_folder = os.path.join(home_folder, name)
# Check if the model directory already exists
if not (Path(model_folder).exists() and any(Path(model_folder).iterdir())):
repo_url = "https://github.com/unclecode/crawl4ai.git"
# branch = "main"
branch = MODEL_REPO_BRANCH
repo_folder = os.path.join(home_folder, "crawl4ai")
model_folder = os.path.join(home_folder, name)
def load_text_multilabel_classifier():
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import numpy as np
from scipy.special import expit
import torch
print("[LOG] ⏬ Downloading model for the first time...")
MODEL = "cardiffnlp/tweet-topic-21-multi"
tokenizer = AutoTokenizer.from_pretrained(MODEL, resume_download=None)
model = AutoModelForSequenceClassification.from_pretrained(MODEL, resume_download=None)
class_mapping = model.config.id2label
# Remove existing repo folder if it exists
if Path(repo_folder).exists():
shutil.rmtree(repo_folder)
shutil.rmtree(model_folder)
# Check for available device: CUDA, MPS (for Apple Silicon), or CPU
if torch.cuda.is_available():
device = torch.device("cuda")
elif torch.backends.mps.is_available():
device = torch.device("mps")
else:
device = torch.device("cpu")
try:
# Clone the repository
subprocess.run(
["git", "clone", "-b", branch, repo_url, repo_folder],
stdout=subprocess.DEVNULL,
stderr=subprocess.DEVNULL,
check=True
)
model.to(device)
# Create the models directory if it doesn't exist
models_folder = os.path.join(home_folder, "models")
os.makedirs(models_folder, exist_ok=True)
def _classifier(texts, threshold=0.5, max_length=64):
tokens = tokenizer(texts, return_tensors='pt', padding=True, truncation=True, max_length=max_length)
tokens = {key: val.to(device) for key, val in tokens.items()} # Move tokens to the selected device
# Copy the reuters model folder to the models directory
source_folder = os.path.join(repo_folder, "models/reuters")
shutil.copytree(source_folder, model_folder)
with torch.no_grad():
output = model(**tokens)
# Remove the cloned repository
shutil.rmtree(repo_folder)
scores = output.logits.detach().cpu().numpy()
scores = expit(scores)
predictions = (scores >= threshold) * 1
# Print completion message
print("[LOG] ✅ Model downloaded successfully")
except subprocess.CalledProcessError as e:
print(f"An error occurred while cloning the repository: {e}")
except Exception as e:
print(f"An error occurred: {e}")
batch_labels = []
for prediction in predictions:
labels = [class_mapping[i] for i, value in enumerate(prediction) if value == 1]
batch_labels.append(labels)
return spacy.load(model_folder)
return batch_labels
return _classifier
@lru_cache()
def load_nltk_punkt():
import nltk
try:
nltk.data.find('tokenizers/punkt')
except LookupError:
nltk.download('punkt')
return nltk.data.find('tokenizers/punkt')
def download_all_models(remove_existing=False):
"""Download all models required for Crawl4AI."""
@@ -110,10 +108,10 @@ def download_all_models(remove_existing=False):
load_bert_base_uncased()
print("[LOG] Downloading BGE Small EN v1.5...")
load_bge_small_en_v1_5()
print("[LOG] Downloading spaCy EN Core Web SM...")
load_spacy_en_core_web_sm()
print("[LOG] Downloading custom spaCy model...")
load_spacy_model()
print("[LOG] Downloading text classifier...")
load_text_multilabel_classifier()
print("[LOG] Downloading custom NLTK Punkt model...")
load_nltk_punkt()
print("[LOG] ✅ All models downloaded successfully.")
def main():

View File

@@ -3,6 +3,33 @@ from spacy.training import Example
import random
import nltk
from nltk.corpus import reuters
import torch
def save_spacy_model_as_torch(nlp, model_dir="models/reuters"):
# Extract the TextCategorizer component
textcat = nlp.get_pipe("textcat_multilabel")
# Convert the weights to a PyTorch state dictionary
state_dict = {name: torch.tensor(param.data) for name, param in textcat.model.named_parameters()}
# Save the state dictionary
torch.save(state_dict, f"{model_dir}/model_weights.pth")
# Extract and save the vocabulary
vocab = extract_vocab(nlp)
with open(f"{model_dir}/vocab.txt", "w") as vocab_file:
for word, idx in vocab.items():
vocab_file.write(f"{word}\t{idx}\n")
print(f"Model weights and vocabulary saved to: {model_dir}")
def extract_vocab(nlp):
# Extract vocabulary from the SpaCy model
vocab = {word: i for i, word in enumerate(nlp.vocab.strings)}
return vocab
nlp = spacy.load("models/reuters")
save_spacy_model_as_torch(nlp, model_dir="models")
def train_and_save_reuters_model(model_dir="models/reuters"):
# Ensure the Reuters corpus is downloaded
@@ -96,8 +123,6 @@ def train_model(model_dir, additional_epochs=0):
nlp.to_disk(model_dir)
print(f"Model saved to: {model_dir}")
def load_model_and_predict(model_dir, text, tok_k = 3):
# Load the trained model from the specified directory
nlp = spacy.load(model_dir)
@@ -111,7 +136,6 @@ def load_model_and_predict(model_dir, text, tok_k = 3):
return top_categories
if __name__ == "__main__":
train_and_save_reuters_model()
train_model("models/reuters", additional_epochs=5)
@@ -119,4 +143,4 @@ if __name__ == "__main__":
print(reuters.categories())
example_text = "Apple Inc. is reportedly buying a startup for $1 billion"
r =load_model_and_predict(model_directory, example_text)
print(r)
print(r)

View File

@@ -11,7 +11,6 @@ from .crawler_strategy import *
from typing import List
from concurrent.futures import ThreadPoolExecutor
from .config import *
# from .model_loader import load_bert_base_uncased, load_bge_small_en_v1_5, load_spacy_model
class WebCrawler:
@@ -40,14 +39,11 @@ class WebCrawler:
self.ready = False
def warmup(self):
print("[LOG] 🌤️ Warming up the WebCrawler")
result = self.run(
url='https://crawl4ai.uccode.io/',
word_count_threshold=5,
extraction_strategy= CosineStrategy(),
extraction_strategy= NoExtractionStrategy(),
bypass_cache=False,
verbose = False
)
@@ -63,14 +59,14 @@ class WebCrawler:
extract_blocks_flag: bool = True,
word_count_threshold=MIN_WORD_THRESHOLD,
use_cached_html: bool = False,
extraction_strategy: ExtractionStrategy = CosineStrategy(),
extraction_strategy: ExtractionStrategy = None,
chunking_strategy: ChunkingStrategy = RegexChunking(),
**kwargs,
) -> CrawlResult:
return self.run(
url_model.url,
word_count_threshold,
extraction_strategy,
extraction_strategy or NoExtractionStrategy(),
chunking_strategy,
bypass_cache=url_model.forced,
**kwargs,
@@ -82,13 +78,15 @@ class WebCrawler:
self,
url: str,
word_count_threshold=MIN_WORD_THRESHOLD,
extraction_strategy: ExtractionStrategy = CosineStrategy(),
extraction_strategy: ExtractionStrategy = None,
chunking_strategy: ChunkingStrategy = RegexChunking(),
bypass_cache: bool = False,
css_selector: str = None,
verbose=True,
**kwargs,
) -> CrawlResult:
extraction_strategy = extraction_strategy or NoExtractionStrategy()
extraction_strategy.verbose = verbose
# Check if extraction strategy is an instance of ExtractionStrategy if not raise an error
if not isinstance(extraction_strategy, ExtractionStrategy):
raise ValueError("Unsupported extraction strategy")
@@ -184,11 +182,11 @@ class WebCrawler:
extract_blocks_flag: bool = True,
word_count_threshold=MIN_WORD_THRESHOLD,
use_cached_html: bool = False,
extraction_strategy: ExtractionStrategy = CosineStrategy(),
extraction_strategy: ExtractionStrategy = None,
chunking_strategy: ChunkingStrategy = RegexChunking(),
**kwargs,
) -> List[CrawlResult]:
extraction_strategy = extraction_strategy or NoExtractionStrategy()
def fetch_page_wrapper(url_model, *args, **kwargs):
return self.fetch_page(url_model, *args, **kwargs)

View File

@@ -1,7 +1,7 @@
{
"RegexChunking": "### RegexChunking\n\n`RegexChunking` is a text chunking strategy that splits a given text into smaller parts using regular expressions.\nThis is useful for preparing large texts for processing by language models, ensuring they are divided into manageable segments.\n\n#### Constructor Parameters:\n- `patterns` (list, optional): A list of regular expression patterns used to split the text. Default is to split by double newlines (`['\\n\\n']`).\n\n#### Example usage:\n```python\nchunker = RegexChunking(patterns=[r'\\n\\n', r'\\. '])\nchunks = chunker.chunk(\"This is a sample text. It will be split into chunks.\")\n```",
"NlpSentenceChunking": "### NlpSentenceChunking\n\n`NlpSentenceChunking` uses a natural language processing model to chunk a given text into sentences. This approach leverages SpaCy to accurately split text based on sentence boundaries.\n\n#### Constructor Parameters:\n- `model` (str, optional): The SpaCy model to use for sentence detection. Default is `'en_core_web_sm'`.\n\n#### Example usage:\n```python\nchunker = NlpSentenceChunking(model='en_core_web_sm')\nchunks = chunker.chunk(\"This is a sample text. It will be split into sentences.\")\n```",
"NlpSentenceChunking": "### NlpSentenceChunking\n\n`NlpSentenceChunking` uses a natural language processing model to chunk a given text into sentences. This approach leverages SpaCy to accurately split text based on sentence boundaries.\n\n#### Constructor Parameters:\n- None.\n\n#### Example usage:\n```python\nchunker = NlpSentenceChunking()\nchunks = chunker.chunk(\"This is a sample text. It will be split into sentences.\")\n```",
"TopicSegmentationChunking": "### TopicSegmentationChunking\n\n`TopicSegmentationChunking` uses the TextTiling algorithm to segment a given text into topic-based chunks. This method identifies thematic boundaries in the text.\n\n#### Constructor Parameters:\n- `num_keywords` (int, optional): The number of keywords to extract for each topic segment. Default is `3`.\n\n#### Example usage:\n```python\nchunker = TopicSegmentationChunking(num_keywords=3)\nchunks = chunker.chunk(\"This is a sample text. It will be split into topic-based segments.\")\n```",

View File

@@ -59,12 +59,6 @@ def understanding_parameters(crawler):
cprint(f"[LOG] 📦 [bold yellow]Second crawl took {end_time - start_time} seconds and result (forced to crawl):[/bold yellow]")
print_result(result)
# Retrieve raw HTML content
cprint("\n🔄 [bold cyan]'include_raw_html' parameter example:[/bold cyan]", True)
result = crawler.run(url="https://www.nbcnews.com/business", include_raw_html=False)
cprint("[LOG] 📦 [bold yellow]Crawl result (without raw HTML content):[/bold yellow]")
print_result(result)
def add_chunking_strategy(crawler):
# Adding a chunking strategy: RegexChunking
cprint("\n🧩 [bold cyan]Let's add a chunking strategy: RegexChunking![/bold cyan]", True)
@@ -134,7 +128,7 @@ def add_llm_extraction_strategy(crawler):
print_result(result)
result = crawler.run(
url="https://www.example.com",
url="https://www.nbcnews.com/business",
extraction_strategy=LLMExtractionStrategy(
provider="openai/gpt-4o",
api_token=os.getenv('OPENAI_API_KEY'),
@@ -176,12 +170,11 @@ def main():
cprint("If this is the first time you're running Crawl4ai, this might take a few seconds to load required model files.")
crawler = create_crawler()
cprint("For the rest of this guide, I set crawler.always_by_pass_cache to True to force the crawler to bypass the cache. This is to ensure that we get fresh results for each run.", True)
crawler.always_by_pass_cache = True
basic_usage(crawler)
understanding_parameters(crawler)
crawler.always_by_pass_cache = True
add_chunking_strategy(crawler)
add_extraction_strategy(crawler)
add_llm_extraction_strategy(crawler)

View File

@@ -44,14 +44,12 @@ def get_crawler():
return WebCrawler()
class CrawlRequest(BaseModel):
urls: List[HttpUrl]
provider_model: str
api_token: str
urls: List[str]
include_raw_html: Optional[bool] = False
bypass_cache: bool = False
extract_blocks: bool = True
word_count_threshold: Optional[int] = 5
extraction_strategy: Optional[str] = "CosineStrategy"
extraction_strategy: Optional[str] = "NoExtractionStrategy"
extraction_strategy_args: Optional[dict] = {}
chunking_strategy: Optional[str] = "RegexChunking"
chunking_strategy_args: Optional[dict] = {}
@@ -95,9 +93,6 @@ def import_strategy(module_name: str, class_name: str, *args, **kwargs):
@app.post("/crawl")
async def crawl_urls(crawl_request: CrawlRequest, request: Request):
global current_requests
# Raise error if api_token is not provided
if not crawl_request.api_token:
raise HTTPException(status_code=401, detail="API token is required.")
async with lock:
if current_requests >= MAX_CONCURRENT_REQUESTS:
raise HTTPException(status_code=429, detail="Too many requests - please try again later.")

View File

@@ -1,144 +0,0 @@
[paths]
train = null
dev = null
vectors = null
init_tok2vec = null
[system]
seed = 0
gpu_allocator = null
[nlp]
lang = "en"
pipeline = ["textcat_multilabel"]
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null
batch_size = 1000
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}
vectors = {"@vectors":"spacy.Vectors.v1"}

[components]

[components.textcat_multilabel]
factory = "textcat_multilabel"
scorer = {"@scorers":"spacy.textcat_multilabel_scorer.v2"}
threshold = 0.5

[components.textcat_multilabel.model]
@architectures = "spacy.TextCatEnsemble.v2"
nO = null

[components.textcat_multilabel.model.linear_model]
@architectures = "spacy.TextCatBOW.v3"
exclusive_classes = false
length = 262144
ngram_size = 1
no_output_layer = false
nO = null

[components.textcat_multilabel.model.tok2vec]
@architectures = "spacy.Tok2Vec.v2"

[components.textcat_multilabel.model.tok2vec.embed]
@architectures = "spacy.MultiHashEmbed.v2"
width = 64
rows = [2000,2000,500,1000,500]
attrs = ["NORM","LOWER","PREFIX","SUFFIX","SHAPE"]
include_static_vectors = false

[components.textcat_multilabel.model.tok2vec.encode]
@architectures = "spacy.MaxoutWindowEncoder.v2"
width = 64
window_size = 1
maxout_pieces = 3
depth = 2

[corpora]

[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths.dev}
gold_preproc = false
max_length = 0
limit = 0
augmenter = null

[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
gold_preproc = false
max_length = 0
limit = 0
augmenter = null

[training]
seed = ${system.seed}
gpu_allocator = ${system.gpu_allocator}
dropout = 0.1
accumulate_gradient = 1
patience = 1600
max_epochs = 0
max_steps = 20000
eval_frequency = 200
frozen_components = []
annotating_components = []
dev_corpus = "corpora.dev"
train_corpus = "corpora.train"
before_to_disk = null
before_update = null

[training.batcher]
@batchers = "spacy.batch_by_words.v1"
discard_oversize = false
tolerance = 0.2
get_length = null

[training.batcher.size]
@schedules = "compounding.v1"
start = 100
stop = 1000
compound = 1.001
t = 0.0

[training.logger]
@loggers = "spacy.ConsoleLogger.v1"
progress_bar = false

[training.optimizer]
@optimizers = "Adam.v1"
beta1 = 0.9
beta2 = 0.999
L2_is_weight_decay = true
L2 = 0.01
grad_clip = 1.0
use_averages = false
eps = 0.00000001
learn_rate = 0.001

[training.score_weights]
cats_score = 1.0
cats_score_desc = null
cats_micro_p = null
cats_micro_r = null
cats_micro_f = null
cats_macro_p = null
cats_macro_r = null
cats_macro_f = null
cats_macro_auc = null
cats_f_per_type = null

[pretraining]

[initialize]
vectors = ${paths.vectors}
init_tok2vec = ${paths.init_tok2vec}
vocab_data = null
lookups = null
before_init = null
after_init = null

[initialize.components]

[initialize.tokenizer]
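The `[training.batcher.size]` block above uses the `compounding.v1` schedule: the batch size starts at `start` and is multiplied by `compound` each step, saturating at `stop`. A minimal Python sketch of that behaviour (reimplemented here for illustration; this is not spaCy's or thinc's own code):

```python
def compounding(start, stop, compound):
    """Yield values that begin at `start` and are multiplied by
    `compound` each step, capped at `stop` (mirrors the intent of
    thinc's compounding.v1 schedule)."""
    value = start
    while True:
        yield min(value, stop)
        value *= compound

# With the config's values (100, 1000, 1.001) the batch size grows
# very slowly from 100 words toward the 1000-word ceiling.
sched = compounding(100.0, 1000.0, 1.001)
first = next(sched)
```

Because `compound` is only 1.001, reaching the 1000-word ceiling takes on the order of a few thousand steps, which smooths early training.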

View File

@@ -1,122 +0,0 @@
{
"lang":"en",
"name":"pipeline",
"version":"0.0.0",
"spacy_version":">=3.7.4,<3.8.0",
"description":"",
"author":"",
"email":"",
"url":"",
"license":"",
"spacy_git_version":"bff8725f4",
"vectors":{
"width":0,
"vectors":0,
"keys":0,
"name":null,
"mode":"default"
},
"labels":{
"textcat_multilabel":[
"acq",
"alum",
"barley",
"bop",
"carcass",
"castor-oil",
"cocoa",
"coconut",
"coconut-oil",
"coffee",
"copper",
"copra-cake",
"corn",
"cotton",
"cotton-oil",
"cpi",
"cpu",
"crude",
"dfl",
"dlr",
"dmk",
"earn",
"fuel",
"gas",
"gnp",
"gold",
"grain",
"groundnut",
"groundnut-oil",
"heat",
"hog",
"housing",
"income",
"instal-debt",
"interest",
"ipi",
"iron-steel",
"jet",
"jobs",
"l-cattle",
"lead",
"lei",
"lin-oil",
"livestock",
"lumber",
"meal-feed",
"money-fx",
"money-supply",
"naphtha",
"nat-gas",
"nickel",
"nkr",
"nzdlr",
"oat",
"oilseed",
"orange",
"palladium",
"palm-oil",
"palmkernel",
"pet-chem",
"platinum",
"potato",
"propane",
"rand",
"rape-oil",
"rapeseed",
"reserves",
"retail",
"rice",
"rubber",
"rye",
"ship",
"silver",
"sorghum",
"soy-meal",
"soy-oil",
"soybean",
"strategic-metal",
"sugar",
"sun-meal",
"sun-oil",
"sunseed",
"tea",
"tin",
"trade",
"veg-oil",
"wheat",
"wpi",
"yen",
"zinc"
]
},
"pipeline":[
"textcat_multilabel"
],
"components":[
"textcat_multilabel"
],
"disabled":[
]
}

View File

@@ -1,95 +0,0 @@
{
"labels":[
"acq",
"alum",
"barley",
"bop",
"carcass",
"castor-oil",
"cocoa",
"coconut",
"coconut-oil",
"coffee",
"copper",
"copra-cake",
"corn",
"cotton",
"cotton-oil",
"cpi",
"cpu",
"crude",
"dfl",
"dlr",
"dmk",
"earn",
"fuel",
"gas",
"gnp",
"gold",
"grain",
"groundnut",
"groundnut-oil",
"heat",
"hog",
"housing",
"income",
"instal-debt",
"interest",
"ipi",
"iron-steel",
"jet",
"jobs",
"l-cattle",
"lead",
"lei",
"lin-oil",
"livestock",
"lumber",
"meal-feed",
"money-fx",
"money-supply",
"naphtha",
"nat-gas",
"nickel",
"nkr",
"nzdlr",
"oat",
"oilseed",
"orange",
"palladium",
"palm-oil",
"palmkernel",
"pet-chem",
"platinum",
"potato",
"propane",
"rand",
"rape-oil",
"rapeseed",
"reserves",
"retail",
"rice",
"rubber",
"rye",
"ship",
"silver",
"sorghum",
"soy-meal",
"soy-oil",
"soybean",
"strategic-metal",
"sugar",
"sun-meal",
"sun-oil",
"sunseed",
"tea",
"tin",
"trade",
"veg-oil",
"wheat",
"wpi",
"yen",
"zinc"
],
"threshold":0.5
}
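The stored `threshold` of 0.5 is what turns the categorizer's per-label scores into predicted labels in a multilabel setup. A small illustrative sketch (plain Python, not spaCy internals) of applying such a cutoff to a `doc.cats`-style score dict; the scores here are invented:

```python
def predicted_labels(cats, threshold=0.5):
    """Return every label whose score meets the multilabel threshold.

    In a multilabel pipeline, any number of labels (including none)
    can clear the cutoff simultaneously."""
    return [label for label, score in cats.items() if score >= threshold]

# Hypothetical scores for a news snippet; the label names come from the
# Reuters-style label set stored alongside the model.
scores = {"grain": 0.91, "wheat": 0.55, "gold": 0.07}
print(predicted_labels(scores))  # ['grain', 'wheat']
```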

File diff suppressed because one or more lines are too long

View File

@@ -1 +0,0 @@

View File

@@ -1 +0,0 @@

File diff suppressed because it is too large Load Diff

Binary file not shown.

View File

@@ -1,3 +0,0 @@
{
"mode":"default"
}

View File

@@ -69,9 +69,12 @@ axios
// Handle crawl button click
document.getElementById("crawl-btn").addEventListener("click", () => {
// validate input to have both URL and API token
if (!document.getElementById("url-input").value || !document.getElementById("token-input").value) {
alert("Please enter both URL(s) and API token.");
return;
// if selected extraction strategy is LLMExtractionStrategy, then API token is required
if (document.getElementById("extraction-strategy-select").value === "LLMExtractionStrategy") {
if (!document.getElementById("url-input").value || !document.getElementById("token-input").value) {
alert("Please enter both URL(s) and API token.");
return;
}
}
const selectedProviderModel = document.getElementById("provider-model-select").value;
@@ -87,8 +90,6 @@ document.getElementById("crawl-btn").addEventListener("click", () => {
const urls = urlsInput.split(",").map((url) => url.trim());
const data = {
urls: urls,
provider_model: selectedProviderModel,
api_token: apiToken,
include_raw_html: true,
bypass_cache: bypassCache,
extract_blocks: extractBlocks,
@@ -112,8 +113,8 @@ document.getElementById("crawl-btn").addEventListener("click", () => {
localStorage.setItem("api_token", document.getElementById("token-input").value);
document.getElementById("loading").classList.remove("hidden");
document.getElementById("result").classList.add("hidden");
document.getElementById("code_help").classList.add("hidden");
document.getElementById("result").style.visibility = "hidden";
document.getElementById("code_help").style.visibility = "hidden";
axios
.post("/crawl", data)
@@ -128,18 +129,20 @@ document.getElementById("crawl-btn").addEventListener("click", () => {
const extractionStrategy = data.extraction_strategy;
const isLLMExtraction = extractionStrategy === "LLMExtractionStrategy";
// REMOVE API TOKEN FROM CODE EXAMPLES
data.extraction_strategy_args.api_token = "your_api_token";
document.getElementById(
"curl-code"
).textContent = `curl -X POST -H "Content-Type: application/json" -d '${JSON.stringify({
...data,
api_token: isLLMExtraction ? "your_api_token" : undefined,
})}' http://crawl4ai.uccode.io/crawl`;
}, null, 2)}' http://crawl4ai.com/crawl`;
document.getElementById("python-code").textContent = `import requests\n\ndata = ${JSON.stringify(
{ ...data, api_token: isLLMExtraction ? "your_api_token" : undefined },
null,
2
)}\n\nresponse = requests.post("http://crawl4ai.com/crawl", json=data) # or localhost if you run locally \nprint(response.json())`;
)}\n\nresponse = requests.post("http://crawl4ai.com/crawl", json=data) # OR local host if your run locally \nprint(response.json())`;
document.getElementById(
"nodejs-code"
@@ -147,7 +150,7 @@ document.getElementById("crawl-btn").addEventListener("click", () => {
{ ...data, api_token: isLLMExtraction ? "your_api_token" : undefined },
null,
2
)};\n\naxios.post("http://crawl4ai.com/crawl", data) // or localhost if you run locally \n .then(response => console.log(response.data))\n .catch(error => console.error(error));`;
)};\n\naxios.post("http://crawl4ai.com/crawl", data) // OR local host if your run locally \n .then(response => console.log(response.data))\n .catch(error => console.error(error));`;
document.getElementById(
"library-code"
@@ -169,8 +172,8 @@ document.getElementById("crawl-btn").addEventListener("click", () => {
document.getElementById("loading").classList.add("hidden");
document.getElementById("result").classList.remove("hidden");
document.getElementById("code_help").classList.remove("hidden");
document.getElementById("result").style.visibility = "visible";
document.getElementById("code_help").style.visibility = "visible";
// increment the total count
document.getElementById("total-count").textContent =
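The payload assembled by the click handler above can be reproduced outside the browser. A hedged Python sketch (field names taken from the snippet; the helper name and defaults are invented for illustration, and the API token is attached only when the LLM strategy is selected, matching the new validation logic):

```python
def build_crawl_payload(urls_input, extraction_strategy, api_token=None):
    """Mirror the browser-side payload: split the comma-separated URL
    field and attach the API token only for LLMExtractionStrategy."""
    payload = {
        "urls": [u.strip() for u in urls_input.split(",")],
        "include_raw_html": True,
        "extraction_strategy": extraction_strategy,
    }
    if extraction_strategy == "LLMExtractionStrategy":
        payload["api_token"] = api_token
    return payload

data = build_crawl_payload("https://www.nbcnews.com/business", "NoExtractionStrategy")
# POST it with e.g.: requests.post("http://crawl4ai.com/crawl", json=data)
```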

View File

@@ -29,7 +29,7 @@
class="bg-zinc-800 p-4 rounded mt-2 text-zinc-100"
><code>virtualenv venv
source venv/bin/activate
pip install git+https://github.com/unclecode/crawl4ai.git
pip install "crawl4ai[all] @ git+https://github.com/unclecode/crawl4ai.git"
</code></pre>
</li>
<li class="mb-4">
@@ -46,7 +46,7 @@ pip install git+https://github.com/unclecode/crawl4ai.git
source venv/bin/activate
git clone https://github.com/unclecode/crawl4ai.git
cd crawl4ai
pip install -e .
pip install -e .[all]
</code></pre>
</li>
<li class="">

View File

@@ -46,9 +46,9 @@
id="extraction-strategy-select"
class="border border-zinc-700 rounded px-4 py-1 bg-zinc-900 text-zinc-300"
>
<option value="NoExtractionStrategy" selected>NoExtractionStrategy</option>
<option value="CosineStrategy">CosineStrategy</option>
<option value="LLMExtractionStrategy">LLMExtractionStrategy</option>
<option value="NoExtractionStrategy">NoExtractionStrategy</option>
</select>
</div>
<div class="flex flex-col">
@@ -99,7 +99,7 @@
</div>
<div class="flex gap-2">
<!-- Add two textarea one for getting Keyword Filter and another one Instruction, make both grow whole with-->
<div id = "semantic_filter_div" class="flex flex-col flex-1">
<div id = "semantic_filter_div" class="flex flex-col flex-1 hidden">
<label for="keyword-filter" class="text-lime-500 font-bold text-xs">Keyword Filter</label>
<textarea
id="semantic_filter"
@@ -131,10 +131,10 @@
</div>
</div>
<div id="loading" class="hidden">
<p class="text-white">Loading... Please wait.</p>
</div>
<div id="result" class="flex-1">
<div id="loading" class="hidden">
<p class="text-white">Loading... Please wait.</p>
</div>
<div class="tab-buttons flex gap-2">
<button class="tab-btn px-4 py-1 text-sm bg-zinc-700 rounded-t text-lime-500" data-tab="json">
JSON
@@ -181,19 +181,19 @@
</button> -->
</div>
<div class="tab-content result bg-zinc-900 p-2 rounded h-full border border-zinc-700 text-sm">
<pre class="h-full flex relative">
<pre class="h-full flex relative overflow-x-auto">
<code id="curl-code" class="language-bash"></code>
<button class="absolute top-2 right-2 bg-zinc-700 text-white px-2 py-1 rounded copy-btn" data-target="curl-code">Copy</button>
</pre>
<pre class="hidden h-full flex relative">
<pre class="hidden h-full flex relative overflow-x-auto">
<code id="python-code" class="language-python"></code>
<button class="absolute top-2 right-2 bg-zinc-700 text-white px-2 py-1 rounded copy-btn" data-target="python-code">Copy</button>
</pre>
<pre class="hidden h-full flex relative">
<pre class="hidden h-full flex relative overflow-x-auto">
<code id="nodejs-code" class="language-javascript"></code>
<button class="absolute top-2 right-2 bg-zinc-700 text-white px-2 py-1 rounded copy-btn" data-target="nodejs-code">Copy</button>
</pre>
<pre class="hidden h-full flex relative">
<pre class="hidden h-full flex relative overflow-x-auto">
<code id="library-code" class="language-python"></code>
<button class="absolute top-2 right-2 bg-zinc-700 text-white px-2 py-1 rounded copy-btn" data-target="library-code">Copy</button>
</pre>

View File

@@ -236,12 +236,11 @@ chunks = chunker.chunk("This is a sample text. It will be split into chunks.")
<h4>Constructor Parameters:</h4>
<ul>
<li>
<code>model</code> (str, optional): The SpaCy model to use for sentence detection. Default is
<code>'en_core_web_sm'</code>.
None.
</li>
</ul>
<h4>Example usage:</h4>
<pre><code class="language-python">chunker = NlpSentenceChunking(model='en_core_web_sm')
<pre><code class="language-python">chunker = NlpSentenceChunking()
chunks = chunker.chunk("This is a sample text. It will be split into sentences.")
</code></pre>
</div>

View File

@@ -2,7 +2,6 @@ aiohttp==3.9.5
aiosqlite==0.20.0
bs4==0.0.2
fastapi==0.111.0
typer==0.9.0
html2text==2024.2.26
httpx==0.27.0
lazy_import==0.2.2
@@ -14,7 +13,6 @@ requests==2.31.0
rich==13.7.1
scikit-learn==1.4.2
selenium==4.20.0
spacy==3.7.4
uvicorn==0.29.0
transformers==4.40.2
chromedriver-autoinstaller==0.6.4

View File

@@ -1,24 +1,18 @@
from setuptools import setup, find_packages
from setuptools.command.install import install as _install
import subprocess
import sys
class InstallCommand(_install):
def run(self):
# Run the standard install first
_install.run(self)
# Now handle the dependencies manually
self.manual_dependencies_install()
# Read the requirements from requirements.txt
with open("requirements.txt") as f:
requirements = f.read().splitlines()
def manual_dependencies_install(self):
with open('requirements.txt') as f:
dependencies = f.read().splitlines()
for dependency in dependencies:
subprocess.check_call([sys.executable, '-m', 'pip', 'install', dependency])
# Define the requirements for different environments
requirements_without_torch = [req for req in requirements if not req.startswith("torch")]
requirements_without_transformers = [req for req in requirements if not req.startswith("transformers")]
requirements_without_nltk = [req for req in requirements if not req.startswith("nltk")]
requirements_without_torch_transformers_nlkt = [req for req in requirements if not req.startswith("torch") and not req.startswith("transformers") and not req.startswith("nltk")]
setup(
name="Crawl4AI",
version="0.1.0",
version="0.2.0",
description="🔥🕷️ Crawl4AI: Open-source LLM Friendly Web Crawler & Scraper",
long_description=open("README.md").read(),
long_description_content_type="text/markdown",
@@ -27,9 +21,11 @@ setup(
author_email="unclecode@kidocode.com",
license="MIT",
packages=find_packages(),
install_requires=[], # Leave this empty to avoid default dependency resolution
cmdclass={
'install': InstallCommand,
install_requires=requirements_without_torch_transformers_nlkt,
extras_require={
"all": requirements, # Include all requirements
"colab": requirements_without_torch, # Exclude torch for Colab
"crawl": requirements_without_torch_transformers_nlkt
},
entry_points={
'console_scripts': [
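The extras logic in the new setup.py boils down to filtering requirement lines by package-name prefix. A self-contained sketch of that filtering (the requirements list here is invented for illustration; the real one is read from requirements.txt):

```python
requirements = ["torch==2.3.0", "transformers==4.40.2", "nltk==3.8.1",
                "requests==2.31.0", "selenium==4.20.0"]

def without(reqs, *prefixes):
    """Drop any requirement whose name starts with one of the given
    prefixes, mirroring the list comprehensions in setup.py."""
    return [r for r in reqs if not r.startswith(prefixes)]

extras = {
    "all": requirements,                                    # everything
    "colab": without(requirements, "torch"),                # Colab already ships torch
    "crawl": without(requirements, "torch", "transformers", "nltk"),
}
print(extras["crawl"])  # ['requests==2.31.0', 'selenium==4.20.0']
```

Installing with `pip install "crawl4ai[all]"` then pulls the full set, while the bare install only needs the lightweight `crawl` subset.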

View File

@@ -1,35 +0,0 @@
import os
def install_crawl4ai():
print("Installing Crawl4AI and its dependencies...")
# Install dependencies
!pip install -U 'spacy[cuda12x]'
!apt-get update -y
!apt install chromium-chromedriver -y
!pip install chromedriver_autoinstaller
!pip install git+https://github.com/unclecode/crawl4ai.git@new-release-0.0.2
# Install ChromeDriver
import chromedriver_autoinstaller
chromedriver_autoinstaller.install()
# Download the reuters model
repo_url = "https://github.com/unclecode/crawl4ai.git"
branch = "new-release-0.0.2"
folder_path = "models/reuters"
!git clone -b {branch} {repo_url}
!mkdir -p models
repo_folder = "crawl4ai"
source_folder = os.path.join(repo_folder, folder_path)
destination_folder = "models"
!mv "{source_folder}" "{destination_folder}"
!rm -rf "{repo_folder}"
print("Installation and model download completed successfully!")
# Run the installer
install_crawl4ai()