Compare commits
16 Commits
new-releas
...
new-releas
| Author | SHA1 | Date |
|---|---|---|
| | 6f96dcd649 | |
| | 957a2458b1 | |
| | 36e46be23d | |
| | 32c87f0388 | |
| | 647cfda225 | |
| | 1cc67df301 | |
| | d7b37e849d | |
| | f52f526002 | |
| | 3593f017d7 | |
| | e7bb76f19b | |
| | 593b928967 | |
| | bb3d37face | |
| | 3f8576f870 | |
| | bf3b040f10 | |
| | a317dc5e1d | |
| | a5f9d07dbf | |
80
README.md
@@ -22,32 +22,26 @@ Crawl4AI has one clear task: to simplify crawling and extract useful information
|
||||
|
||||
## Power and Simplicity of Crawl4AI 🚀
|
||||
|
||||
Crawl4AI makes even complex web crawling tasks simple and intuitive. Below is an example of how you can execute JavaScript, filter data using keywords, and use a CSS selector to extract specific content—all in one go!
|
||||
To see how simple it is, take a look at this first example:
|
||||
|
||||
**Example Task:**
|
||||
```python
|
||||
from crawl4ai import WebCrawler
|
||||
|
||||
# Create the WebCrawler instance
|
||||
crawler = WebCrawler()
|
||||
|
||||
# Run the crawler on a single URL
|
||||
result = crawler.run(url="https://www.nbcnews.com/business")
|
||||
print(result) # {url, html, markdown, extracted_content, metadata}
|
||||
```
|
||||
|
||||
Now let's try a complex task. Below is an example of how you can execute JavaScript, filter data using keywords, and use a CSS selector to extract specific content—all in one go!
|
||||
|
||||
1. Instantiate a WebCrawler object.
|
||||
2. Execute custom JavaScript to click a "Load More" button.
|
||||
3. Filter the data to include only content related to "technology".
|
||||
3. Extract semantic chunks of content and keep only those related to technology.
|
||||
4. Use a CSS selector to extract only paragraphs (`<p>` tags).
|
||||
|
||||
**Example Code:**
|
||||
|
||||
First, install the package:
|
||||
```bash
|
||||
virtualenv venv
|
||||
source venv/bin/activate
|
||||
# Install Crawl4AI
|
||||
pip install git+https://github.com/unclecode/crawl4ai.git
|
||||
```
|
||||
|
||||
Run the following command to load the required models. This is optional, but it will boost the performance and speed of the crawler. You need to do this only once.
|
||||
```bash
|
||||
crawl4ai-download-models
|
||||
```
|
||||
|
||||
Now, you can run the following code:
|
||||
|
||||
```python
|
||||
# Import necessary modules
|
||||
from crawl4ai import WebCrawler
|
||||
@@ -69,7 +63,7 @@ crawler = WebCrawler(crawler_strategy=crawler_strategy)
|
||||
|
||||
# Run the crawler with keyword filtering and CSS selector
|
||||
result = crawler.run(
|
||||
url="https://www.example.com",
|
||||
url="https://www.nbcnews.com/business",
|
||||
extraction_strategy=CosineStrategy(
|
||||
semantic_filter="technology",
|
||||
),
|
||||
@@ -77,7 +71,7 @@ result = crawler.run(
|
||||
|
||||
# Run the crawler with LLM extraction strategy
|
||||
result = crawler.run(
|
||||
url="https://www.example.com",
|
||||
url="https://www.nbcnews.com/business",
|
||||
extraction_strategy=LLMExtractionStrategy(
|
||||
provider="openai/gpt-4o",
|
||||
api_token=os.getenv('OPENAI_API_KEY'),
|
||||
@@ -99,16 +93,16 @@ With Crawl4AI, you can perform advanced web crawling and data extraction tasks w
|
||||
|
||||
## Table of Contents
|
||||
|
||||
1. [Features](#features)
|
||||
2. [Installation](#installation)
|
||||
3. [REST API/Local Server](#using-the-local-server-ot-rest-api)
|
||||
4. [Python Library Usage](#usage)
|
||||
5. [Parameters](#parameters)
|
||||
6. [Chunking Strategies](#chunking-strategies)
|
||||
7. [Extraction Strategies](#extraction-strategies)
|
||||
8. [Contributing](#contributing)
|
||||
9. [License](#license)
|
||||
10. [Contact](#contact)
|
||||
1. [Features](#features-)
|
||||
2. [Installation](#installation-)
|
||||
3. [REST API/Local Server](#using-the-local-server-ot-rest-api-)
|
||||
4. [Python Library Usage](#python-library-usage-)
|
||||
5. [Parameters](#parameters-)
|
||||
6. [Chunking Strategies](#chunking-strategies-)
|
||||
7. [Extraction Strategies](#extraction-strategies-)
|
||||
8. [Contributing](#contributing-)
|
||||
9. [License](#license-)
|
||||
10. [Contact](#contact-)
|
||||
|
||||
|
||||
## Features ✨
|
||||
@@ -137,7 +131,7 @@ To install Crawl4AI as a library, follow these steps:
|
||||
```bash
|
||||
virtualenv venv
|
||||
source venv/bin/activate
|
||||
pip install git+https://github.com/unclecode/crawl4ai.git
|
||||
pip install "crawl4ai[all] @ git+https://github.com/unclecode/crawl4ai.git"
|
||||
```
|
||||
|
||||
💡 We recommend running the following CLI command to load the required models. This is optional, but it will boost the performance and speed of the crawler. You only need to do this once.
|
||||
@@ -150,12 +144,12 @@ virtualenv venv
|
||||
source venv/bin/activate
|
||||
git clone https://github.com/unclecode/crawl4ai.git
|
||||
cd crawl4ai
|
||||
pip install -e .
|
||||
pip install -e .[all]
|
||||
```
|
||||
|
||||
3. Use docker to run the local server:
|
||||
```bash
|
||||
docker build -t crawl4ai .
|
||||
docker build -t crawl4ai .
|
||||
# For Mac users
|
||||
# docker build --platform linux/amd64 -t crawl4ai .
|
||||
docker run -d -p 8000:80 crawl4ai
|
||||
@@ -174,7 +168,7 @@ To use the REST API, send a POST request to `https://crawl4ai.com/crawl` with th
|
||||
**Example Request:**
|
||||
```json
|
||||
{
|
||||
"urls": ["https://www.example.com"],
|
||||
"urls": ["https://www.nbcnews.com/business"],
|
||||
"include_raw_html": false,
|
||||
"bypass_cache": true,
|
||||
"word_count_threshold": 5,
|
||||
@@ -201,7 +195,7 @@ To use the REST API, send a POST request to `https://crawl4ai.com/crawl` with th
|
||||
"status": "success",
|
||||
"data": [
|
||||
{
|
||||
"url": "https://www.example.com",
|
||||
"url": "https://www.nbcnews.com/business",
|
||||
"extracted_content": "...",
|
||||
"html": "...",
|
||||
"markdown": "...",
|
||||
@@ -216,7 +210,7 @@ For more information about the available parameters and their descriptions, refe
|
||||
|
||||
## Python Library Usage 🚀
|
||||
|
||||
🔥 A great way to try out Crawl4AI is to run `quickstart.py` in the `docs/examples` directory. This script demonstrates how to use Crawl4AI to crawl a website and extract content from it.
|
||||
🔥 A great way to try out Crawl4AI is to run `quickstart.py` in the `docs/examples` directory. This script demonstrates how to use Crawl4AI to crawl a website and extract content from it.
|
||||
|
||||
### Quickstart Guide
|
||||
|
||||
@@ -264,6 +258,8 @@ result = crawler.run(
|
||||
|
||||
### Extraction strategy: CosineStrategy
|
||||
|
||||
So far, the extracted content is just the result of chunking. To extract meaningful content, you can use extraction strategies. These strategies cluster consecutive chunks into meaningful blocks, keeping the same order as the text in the HTML. This approach works well for RAG applications and semantic search queries.
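To make the idea of clustering consecutive chunks concrete, here is a minimal sketch. It is not the library's implementation: it assumes bag-of-words vectors in place of real embeddings and a simple similarity threshold in place of hierarchical clustering. The names `bow_vector`, `cluster_consecutive`, and `min_similarity` are hypothetical.

```python
import math
import re
from collections import Counter


def bow_vector(text: str) -> Counter:
    """Lower-cased bag-of-words counts; a crude stand-in for a real embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def cluster_consecutive(chunks: list[str], min_similarity: float = 0.2) -> list[str]:
    """Merge a chunk into the current block when it resembles the block's last chunk."""
    blocks: list[list[str]] = []
    for chunk in chunks:
        if blocks and cosine_similarity(bow_vector(blocks[-1][-1]), bow_vector(chunk)) >= min_similarity:
            blocks[-1].append(chunk)  # similar enough: extend the current block
        else:
            blocks.append([chunk])  # otherwise start a new block, preserving order
    return [" ".join(block) for block in blocks]


# Illustrative input only
chunks = [
    "Rental prices rose sharply in major cities.",
    "Analysts expect rental prices to keep rising.",
    "The new phone ships with a faster chip.",
]
blocks = cluster_consecutive(chunks)
```

Because only neighbouring chunks are compared, the blocks come out in the same order as the source text, which is the property the paragraph above relies on.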
|
||||
|
||||
Using CosineStrategy:
|
||||
```python
|
||||
result = crawler.run(
|
||||
@@ -349,7 +345,7 @@ result = crawler.run(url="https://www.nbcnews.com/business")
|
||||
| `include_raw_html` | Whether to include the raw HTML content in the response. | No | `false` |
|
||||
| `bypass_cache` | Whether to force a fresh crawl even if the URL has been previously crawled. | No | `false` |
|
||||
| `word_count_threshold`| The minimum number of words a block must contain to be considered meaningful (minimum value is 5). | No | `5` |
|
||||
| `extraction_strategy` | The strategy to use for extracting content from the HTML (e.g., "CosineStrategy"). | No | `CosineStrategy` |
|
||||
| `extraction_strategy` | The strategy to use for extracting content from the HTML (e.g., "CosineStrategy"). | No | `NoExtractionStrategy` |
|
||||
| `chunking_strategy` | The strategy to use for chunking the text before processing (e.g., "RegexChunking"). | No | `RegexChunking` |
|
||||
| `css_selector` | The CSS selector to target specific parts of the HTML for extraction. | No | `None` |
|
||||
| `verbose` | Whether to enable verbose logging. | No | `true` |
|
||||
@@ -374,11 +370,11 @@ chunks = chunker.chunk("This is a sample text. It will be split into chunks.")
|
||||
`NlpSentenceChunking` uses a natural language processing model to chunk a given text into sentences. This approach leverages NLTK's Punkt sentence tokenizer to split text based on sentence boundaries.
|
||||
|
||||
**Constructor Parameters:**
|
||||
- `model` (str, optional): The SpaCy model to use for sentence detection. Default is `'en_core_web_sm'`.
|
||||
- None.
|
||||
|
||||
**Example usage:**
|
||||
```python
|
||||
chunker = NlpSentenceChunking(model='en_core_web_sm')
|
||||
chunker = NlpSentenceChunking()
|
||||
chunks = chunker.chunk("This is a sample text. It will be split into sentences.")
|
||||
```
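For readers who want to see the `chunk(text) -> list` contract without installing any model, the following is a rough sketch using only the standard library. `RegexSentenceChunking` is a hypothetical stand-in, not a class from Crawl4AI, and a plain regex will mis-split abbreviations that a trained tokenizer handles correctly.

```python
import re


class RegexSentenceChunking:
    """Hypothetical stand-in: splits on '.', '!' or '?' followed by whitespace."""

    _boundary = re.compile(r"(?<=[.!?])\s+")

    def chunk(self, text: str) -> list:
        sentences = (s.strip() for s in self._boundary.split(text))
        # dict.fromkeys drops duplicate sentences but keeps first-seen order
        return list(dict.fromkeys(s for s in sentences if s))


chunker = RegexSentenceChunking()
chunks = chunker.chunk(
    "This is a sample text. It will be split into sentences. This is a sample text."
)
```

The sketch de-duplicates with `dict.fromkeys` rather than `set` so that the sentences stay in document order.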
|
||||
|
||||
@@ -466,7 +462,7 @@ extracted_content = extractor.extract(url, html)
|
||||
|
||||
**Example usage:**
|
||||
```python
|
||||
extractor = CosineStrategy(semantic_filter='artificial intelligence', word_count_threshold=10, max_dist=0.2, linkage_method='ward', top_k=3, model_name='BAAI/bge-small-en-v1.5')
|
||||
extractor = CosineStrategy(semantic_filter='finance rental prices', word_count_threshold=10, max_dist=0.2, linkage_method='ward', top_k=3, model_name='BAAI/bge-small-en-v1.5')
|
||||
extracted_content = extractor.extract(url, html)
|
||||
```
|
||||
|
||||
|
||||
@@ -1,12 +1,8 @@
|
||||
from abc import ABC, abstractmethod
|
||||
import re
|
||||
# spacy = lazy_import.lazy_module('spacy')
|
||||
# nl = lazy_import.lazy_module('nltk')
|
||||
# from nltk.corpus import stopwords
|
||||
# from nltk.tokenize import word_tokenize, TextTilingTokenizer
|
||||
from collections import Counter
|
||||
import string
|
||||
from .model_loader import load_spacy_en_core_web_sm
|
||||
from .model_loader import load_nltk_punkt
|
||||
|
||||
# Define the abstract base class for chunking strategies
|
||||
class ChunkingStrategy(ABC):
|
||||
@@ -34,15 +30,24 @@ class RegexChunking(ChunkingStrategy):
|
||||
paragraphs = new_paragraphs
|
||||
return paragraphs
|
||||
|
||||
# NLP-based sentence chunking using spaCy
|
||||
|
||||
# NLP-based sentence chunking
|
||||
class NlpSentenceChunking(ChunkingStrategy):
|
||||
def __init__(self, model='en_core_web_sm'):
|
||||
self.nlp = load_spacy_en_core_web_sm()
|
||||
def __init__(self):
|
||||
load_nltk_punkt()
|
||||
pass
|
||||
|
||||
def chunk(self, text: str) -> list:
|
||||
doc = self.nlp(text)
|
||||
return [sent.text.strip() for sent in doc.sents]
|
||||
# Improved regex for sentence splitting
|
||||
# sentence_endings = re.compile(
|
||||
# r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<![A-Z][A-Z]\.)(?<![A-Za-z]\.)(?<=\.|\?|\!|\n)\s'
|
||||
# )
|
||||
# sentences = sentence_endings.split(text)
|
||||
# sens = [sent.strip() for sent in sentences if sent]
|
||||
from nltk.tokenize import sent_tokenize
|
||||
sentences = sent_tokenize(text)
|
||||
sens = [sent.strip() for sent in sentences]
|
||||
|
||||
return list(dict.fromkeys(sens))  # drop duplicates while keeping sentence order
|
||||
|
||||
# Topic-based segmentation using TextTiling
|
||||
class TopicSegmentationChunking(ChunkingStrategy):
|
||||
|
||||
@@ -7,7 +7,7 @@ from .prompts import PROMPT_EXTRACT_BLOCKS, PROMPT_EXTRACT_BLOCKS_WITH_INSTRUCTI
|
||||
from .config import *
|
||||
from .utils import *
|
||||
from functools import partial
|
||||
from .model_loader import load_bert_base_uncased, load_bge_small_en_v1_5, load_spacy_model
|
||||
from .model_loader import *
|
||||
|
||||
|
||||
import numpy as np
|
||||
@@ -19,6 +19,7 @@ class ExtractionStrategy(ABC):
|
||||
def __init__(self, **kwargs):
|
||||
self.DEL = "<|DEL|>"
|
||||
self.name = self.__class__.__name__
|
||||
self.verbose = kwargs.get("verbose", False)
|
||||
|
||||
@abstractmethod
|
||||
def extract(self, url: str, html: str, *q, **kwargs) -> List[Dict[str, Any]]:
|
||||
@@ -45,14 +46,13 @@ class ExtractionStrategy(ABC):
|
||||
for future in as_completed(futures):
|
||||
extracted_content.extend(future.result())
|
||||
return extracted_content
|
||||
|
||||
class NoExtractionStrategy(ExtractionStrategy):
|
||||
def extract(self, url: str, html: str, *q, **kwargs) -> List[Dict[str, Any]]:
|
||||
return [{"index": 0, "content": html}]
|
||||
|
||||
def run(self, url: str, sections: List[str], *q, **kwargs) -> List[Dict[str, Any]]:
|
||||
return [{"index": i, "tags": [], "content": section} for i, section in enumerate(sections)]
|
||||
|
||||
|
||||
class LLMExtractionStrategy(ExtractionStrategy):
|
||||
def __init__(self, provider: str = DEFAULT_PROVIDER, api_token: Optional[str] = None, instruction:str = None, **kwargs):
|
||||
"""
|
||||
@@ -62,10 +62,11 @@ class LLMExtractionStrategy(ExtractionStrategy):
|
||||
:param api_token: The API token for the provider.
|
||||
:param instruction: The instruction to use for the LLM model.
|
||||
"""
|
||||
super().__init__()
|
||||
super().__init__()
|
||||
self.provider = provider
|
||||
self.api_token = api_token or PROVIDER_MODELS.get(provider, None) or os.getenv("OPENAI_API_KEY")
|
||||
self.instruction = instruction
|
||||
self.verbose = kwargs.get("verbose", False)
|
||||
|
||||
if not self.api_token:
|
||||
raise ValueError("API token must be provided for LLMExtractionStrategy. Update the config.py or set OPENAI_API_KEY environment variable.")
|
||||
@@ -106,7 +107,8 @@ class LLMExtractionStrategy(ExtractionStrategy):
|
||||
"content": unparsed
|
||||
})
|
||||
|
||||
print("[LOG] Extracted", len(blocks), "blocks from URL:", url, "block index:", ix)
|
||||
if self.verbose:
|
||||
print("[LOG] Extracted", len(blocks), "blocks from URL:", url, "block index:", ix)
|
||||
return blocks
|
||||
|
||||
def _merge(self, documents):
|
||||
@@ -166,16 +168,13 @@ class CosineStrategy(ExtractionStrategy):
|
||||
"""
|
||||
super().__init__()
|
||||
|
||||
from transformers import BertTokenizer, BertModel, pipeline
|
||||
from transformers import AutoTokenizer, AutoModel
|
||||
import spacy
|
||||
|
||||
self.semantic_filter = semantic_filter
|
||||
self.word_count_threshold = word_count_threshold
|
||||
self.max_dist = max_dist
|
||||
self.linkage_method = linkage_method
|
||||
self.top_k = top_k
|
||||
self.timer = time.time()
|
||||
self.verbose = kwargs.get("verbose", False)
|
||||
|
||||
self.buffer_embeddings = np.array([])
|
||||
|
||||
@@ -184,9 +183,10 @@ class CosineStrategy(ExtractionStrategy):
|
||||
elif model_name == "BAAI/bge-small-en-v1.5":
|
||||
self.tokenizer, self.model = load_bge_small_en_v1_5()
|
||||
|
||||
self.nlp = load_spacy_model()
|
||||
print(f"[LOG] Model loaded {model_name}, models/reuters, took " + str(time.time() - self.timer) + " seconds")
|
||||
|
||||
self.nlp = load_text_multilabel_classifier()
|
||||
|
||||
if self.verbose:
|
||||
print(f"[LOG] Model loaded {model_name}, models/reuters, took " + str(time.time() - self.timer) + " seconds")
|
||||
|
||||
def filter_documents_embeddings(self, documents: List[str], semantic_filter: str, threshold: float = 0.5) -> List[str]:
|
||||
"""
|
||||
@@ -310,13 +310,19 @@ class CosineStrategy(ExtractionStrategy):
|
||||
|
||||
# Convert filtered clusters to a sorted list of dictionaries
|
||||
cluster_list = [{"index": int(idx), "tags" : [], "content": " ".join(filtered_clusters[idx])} for idx in sorted(filtered_clusters)]
|
||||
|
||||
labels = self.nlp([cluster['content'] for cluster in cluster_list])
|
||||
|
||||
for cluster, label in zip(cluster_list, labels):
|
||||
cluster['tags'] = label
|
||||
|
||||
# Process the text with the loaded model
|
||||
for cluster in cluster_list:
|
||||
doc = self.nlp(cluster['content'])
|
||||
tok_k = self.top_k
|
||||
top_categories = sorted(doc.cats.items(), key=lambda x: x[1], reverse=True)[:tok_k]
|
||||
cluster['tags'] = [cat for cat, _ in top_categories]
|
||||
# for cluster in cluster_list:
|
||||
# cluster['tags'] = self.nlp(cluster['content'])[0]['label']
|
||||
# doc = self.nlp(cluster['content'])
|
||||
# tok_k = self.top_k
|
||||
# top_categories = sorted(doc.cats.items(), key=lambda x: x[1], reverse=True)[:tok_k]
|
||||
# cluster['tags'] = [cat for cat, _ in top_categories]
|
||||
|
||||
# print(f"[LOG] 🚀 Categorization done in {time.time() - t:.2f} seconds")
|
||||
|
||||
|
||||
@@ -28,68 +28,66 @@ def load_bge_small_en_v1_5():
|
||||
return tokenizer, model
|
||||
|
||||
@lru_cache()
|
||||
def load_spacy_en_core_web_sm():
|
||||
import spacy
|
||||
try:
|
||||
print("[LOG] Loading spaCy model")
|
||||
nlp = spacy.load("en_core_web_sm")
|
||||
except IOError:
|
||||
print("[LOG] ⏬ Downloading spaCy model for the first time")
|
||||
spacy.cli.download("en_core_web_sm")
|
||||
nlp = spacy.load("en_core_web_sm")
|
||||
print("[LOG] ✅ spaCy model loaded successfully")
|
||||
return nlp
|
||||
def load_text_classifier():
|
||||
from transformers import AutoTokenizer, AutoModelForSequenceClassification
|
||||
from transformers import pipeline
|
||||
|
||||
tokenizer = AutoTokenizer.from_pretrained("dstefa/roberta-base_topic_classification_nyt_news")
|
||||
model = AutoModelForSequenceClassification.from_pretrained("dstefa/roberta-base_topic_classification_nyt_news")
|
||||
pipe = pipeline("text-classification", model=model, tokenizer=tokenizer)
|
||||
|
||||
return pipe
|
||||
|
||||
@lru_cache()
|
||||
def load_spacy_model():
|
||||
import spacy
|
||||
name = "models/reuters"
|
||||
home_folder = get_home_folder()
|
||||
model_folder = os.path.join(home_folder, name)
|
||||
|
||||
# Check if the model directory already exists
|
||||
if not (Path(model_folder).exists() and any(Path(model_folder).iterdir())):
|
||||
repo_url = "https://github.com/unclecode/crawl4ai.git"
|
||||
# branch = "main"
|
||||
branch = MODEL_REPO_BRANCH
|
||||
repo_folder = os.path.join(home_folder, "crawl4ai")
|
||||
model_folder = os.path.join(home_folder, name)
|
||||
def load_text_multilabel_classifier():
|
||||
from transformers import AutoModelForSequenceClassification, AutoTokenizer
|
||||
import numpy as np
|
||||
from scipy.special import expit
|
||||
import torch
|
||||
|
||||
print("[LOG] ⏬ Downloading model for the first time...")
|
||||
MODEL = "cardiffnlp/tweet-topic-21-multi"
|
||||
tokenizer = AutoTokenizer.from_pretrained(MODEL, resume_download=None)
|
||||
model = AutoModelForSequenceClassification.from_pretrained(MODEL, resume_download=None)
|
||||
class_mapping = model.config.id2label
|
||||
|
||||
# Remove existing repo folder if it exists
|
||||
if Path(repo_folder).exists():
|
||||
shutil.rmtree(repo_folder)
|
||||
shutil.rmtree(model_folder)
|
||||
# Check for available device: CUDA, MPS (for Apple Silicon), or CPU
|
||||
if torch.cuda.is_available():
|
||||
device = torch.device("cuda")
|
||||
elif torch.backends.mps.is_available():
|
||||
device = torch.device("mps")
|
||||
else:
|
||||
device = torch.device("cpu")
|
||||
|
||||
try:
|
||||
# Clone the repository
|
||||
subprocess.run(
|
||||
["git", "clone", "-b", branch, repo_url, repo_folder],
|
||||
stdout=subprocess.DEVNULL,
|
||||
stderr=subprocess.DEVNULL,
|
||||
check=True
|
||||
)
|
||||
model.to(device)
|
||||
|
||||
# Create the models directory if it doesn't exist
|
||||
models_folder = os.path.join(home_folder, "models")
|
||||
os.makedirs(models_folder, exist_ok=True)
|
||||
def _classifier(texts, threshold=0.5, max_length=64):
|
||||
tokens = tokenizer(texts, return_tensors='pt', padding=True, truncation=True, max_length=max_length)
|
||||
tokens = {key: val.to(device) for key, val in tokens.items()} # Move tokens to the selected device
|
||||
|
||||
# Copy the reuters model folder to the models directory
|
||||
source_folder = os.path.join(repo_folder, "models/reuters")
|
||||
shutil.copytree(source_folder, model_folder)
|
||||
with torch.no_grad():
|
||||
output = model(**tokens)
|
||||
|
||||
# Remove the cloned repository
|
||||
shutil.rmtree(repo_folder)
|
||||
scores = output.logits.detach().cpu().numpy()
|
||||
scores = expit(scores)
|
||||
predictions = (scores >= threshold) * 1
|
||||
|
||||
# Print completion message
|
||||
print("[LOG] ✅ Model downloaded successfully")
|
||||
except subprocess.CalledProcessError as e:
|
||||
print(f"An error occurred while cloning the repository: {e}")
|
||||
except Exception as e:
|
||||
print(f"An error occurred: {e}")
|
||||
batch_labels = []
|
||||
for prediction in predictions:
|
||||
labels = [class_mapping[i] for i, value in enumerate(prediction) if value == 1]
|
||||
batch_labels.append(labels)
|
||||
|
||||
return spacy.load(model_folder)
|
||||
return batch_labels
|
||||
|
||||
return _classifier
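The thresholding step inside `_classifier` can be illustrated without torch or a downloaded model. The sketch below assumes the logits are already available as plain lists; `multilabel_from_logits`, the label map, and the numbers are hypothetical and only mirror the sigmoid-then-threshold logic shown in the diff.

```python
import math


def multilabel_from_logits(logits, id2label, threshold=0.5):
    """Return, for each input row, the labels whose sigmoid score reaches the threshold."""
    batch_labels = []
    for row in logits:
        # Logistic sigmoid, the same function scipy.special.expit computes
        scores = [1.0 / (1.0 + math.exp(-x)) for x in row]
        batch_labels.append([id2label[i] for i, s in enumerate(scores) if s >= threshold])
    return batch_labels


# Hypothetical label map and logits, for illustration only
id2label = {0: "business", 1: "sports", 2: "technology"}
logits = [[2.0, -1.5, 0.3], [-3.0, -2.0, -0.1]]
labels = multilabel_from_logits(logits, id2label)
```

Each label is scored independently, so a text can receive several tags or none at all, unlike a softmax classifier that always picks exactly one.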
|
||||
|
||||
@lru_cache()
|
||||
def load_nltk_punkt():
|
||||
import nltk
|
||||
try:
|
||||
nltk.data.find('tokenizers/punkt')
|
||||
except LookupError:
|
||||
nltk.download('punkt')
|
||||
return nltk.data.find('tokenizers/punkt')
|
||||
|
||||
def download_all_models(remove_existing=False):
|
||||
"""Download all models required for Crawl4AI."""
|
||||
@@ -110,10 +108,10 @@ def download_all_models(remove_existing=False):
|
||||
load_bert_base_uncased()
|
||||
print("[LOG] Downloading BGE Small EN v1.5...")
|
||||
load_bge_small_en_v1_5()
|
||||
print("[LOG] Downloading spaCy EN Core Web SM...")
|
||||
load_spacy_en_core_web_sm()
|
||||
print("[LOG] Downloading custom spaCy model...")
|
||||
load_spacy_model()
|
||||
print("[LOG] Downloading text classifier...")
|
||||
load_text_multilabel_classifier()
|
||||
print("[LOG] Downloading custom NLTK Punkt model...")
|
||||
load_nltk_punkt()
|
||||
print("[LOG] ✅ All models downloaded successfully.")
|
||||
|
||||
def main():
|
||||
|
||||
@@ -3,6 +3,33 @@ from spacy.training import Example
|
||||
import random
|
||||
import nltk
|
||||
from nltk.corpus import reuters
|
||||
import torch
|
||||
|
||||
def save_spacy_model_as_torch(nlp, model_dir="models/reuters"):
|
||||
# Extract the TextCategorizer component
|
||||
textcat = nlp.get_pipe("textcat_multilabel")
|
||||
|
||||
# Convert the weights to a PyTorch state dictionary
|
||||
state_dict = {name: torch.tensor(param.data) for name, param in textcat.model.named_parameters()}
|
||||
|
||||
# Save the state dictionary
|
||||
torch.save(state_dict, f"{model_dir}/model_weights.pth")
|
||||
|
||||
# Extract and save the vocabulary
|
||||
vocab = extract_vocab(nlp)
|
||||
with open(f"{model_dir}/vocab.txt", "w") as vocab_file:
|
||||
for word, idx in vocab.items():
|
||||
vocab_file.write(f"{word}\t{idx}\n")
|
||||
|
||||
print(f"Model weights and vocabulary saved to: {model_dir}")
|
||||
|
||||
def extract_vocab(nlp):
|
||||
# Extract vocabulary from the SpaCy model
|
||||
vocab = {word: i for i, word in enumerate(nlp.vocab.strings)}
|
||||
return vocab
|
||||
|
||||
nlp = spacy.load("models/reuters")
|
||||
save_spacy_model_as_torch(nlp, model_dir="models")
|
||||
|
||||
def train_and_save_reuters_model(model_dir="models/reuters"):
|
||||
# Ensure the Reuters corpus is downloaded
|
||||
@@ -96,8 +123,6 @@ def train_model(model_dir, additional_epochs=0):
|
||||
nlp.to_disk(model_dir)
|
||||
print(f"Model saved to: {model_dir}")
|
||||
|
||||
|
||||
|
||||
def load_model_and_predict(model_dir, text, tok_k = 3):
|
||||
# Load the trained model from the specified directory
|
||||
nlp = spacy.load(model_dir)
|
||||
@@ -111,7 +136,6 @@ def load_model_and_predict(model_dir, text, tok_k = 3):
|
||||
|
||||
return top_categories
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
train_and_save_reuters_model()
|
||||
train_model("models/reuters", additional_epochs=5)
|
||||
@@ -119,4 +143,4 @@ if __name__ == "__main__":
|
||||
print(reuters.categories())
|
||||
example_text = "Apple Inc. is reportedly buying a startup for $1 billion"
|
||||
r = load_model_and_predict(model_directory, example_text)
|
||||
print(r)
|
||||
print(r)
|
||||
@@ -11,7 +11,6 @@ from .crawler_strategy import *
|
||||
from typing import List
|
||||
from concurrent.futures import ThreadPoolExecutor
|
||||
from .config import *
|
||||
# from .model_loader import load_bert_base_uncased, load_bge_small_en_v1_5, load_spacy_model
|
||||
|
||||
|
||||
class WebCrawler:
|
||||
@@ -40,14 +39,11 @@ class WebCrawler:
|
||||
self.ready = False
|
||||
|
||||
def warmup(self):
|
||||
|
||||
|
||||
|
||||
print("[LOG] 🌤️ Warming up the WebCrawler")
|
||||
result = self.run(
|
||||
url='https://crawl4ai.uccode.io/',
|
||||
word_count_threshold=5,
|
||||
extraction_strategy= CosineStrategy(),
|
||||
extraction_strategy= NoExtractionStrategy(),
|
||||
bypass_cache=False,
|
||||
verbose = False
|
||||
)
|
||||
@@ -63,14 +59,14 @@ class WebCrawler:
|
||||
extract_blocks_flag: bool = True,
|
||||
word_count_threshold=MIN_WORD_THRESHOLD,
|
||||
use_cached_html: bool = False,
|
||||
extraction_strategy: ExtractionStrategy = CosineStrategy(),
|
||||
extraction_strategy: ExtractionStrategy = None,
|
||||
chunking_strategy: ChunkingStrategy = RegexChunking(),
|
||||
**kwargs,
|
||||
) -> CrawlResult:
|
||||
return self.run(
|
||||
url_model.url,
|
||||
word_count_threshold,
|
||||
extraction_strategy,
|
||||
extraction_strategy or NoExtractionStrategy(),
|
||||
chunking_strategy,
|
||||
bypass_cache=url_model.forced,
|
||||
**kwargs,
|
||||
@@ -82,13 +78,15 @@ class WebCrawler:
|
||||
self,
|
||||
url: str,
|
||||
word_count_threshold=MIN_WORD_THRESHOLD,
|
||||
extraction_strategy: ExtractionStrategy = CosineStrategy(),
|
||||
extraction_strategy: ExtractionStrategy = None,
|
||||
chunking_strategy: ChunkingStrategy = RegexChunking(),
|
||||
bypass_cache: bool = False,
|
||||
css_selector: str = None,
|
||||
verbose=True,
|
||||
**kwargs,
|
||||
) -> CrawlResult:
|
||||
extraction_strategy = extraction_strategy or NoExtractionStrategy()
|
||||
extraction_strategy.verbose = verbose
|
||||
# Raise an error if extraction_strategy is not an ExtractionStrategy instance
|
||||
if not isinstance(extraction_strategy, ExtractionStrategy):
|
||||
raise ValueError("Unsupported extraction strategy")
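One plausible motivation for switching the default from a constructed strategy to `None` is that Python evaluates default arguments once, when the function is defined, so an expensive default is built even if no caller ever uses it. The sketch below demonstrates this with hypothetical stand-in classes (`ExpensiveStrategy`, `NoOpStrategy`); it does not use Crawl4AI itself.

```python
class ExpensiveStrategy:
    """Hypothetical stand-in for a strategy that loads models in __init__."""

    instances = 0

    def __init__(self):
        ExpensiveStrategy.instances += 1


class NoOpStrategy:
    """Hypothetical stand-in for a cheap pass-through strategy."""


def run_eager(strategy=ExpensiveStrategy()):  # default is built once, at definition time
    return strategy


def run_lazy(strategy=None):  # nothing is built until the function is called
    return strategy or NoOpStrategy()


built_at_definition = ExpensiveStrategy.instances  # already 1 before any call
result = run_lazy()
```

Note that `strategy or NoOpStrategy()` also replaces any strategy object that evaluates as falsy; `if strategy is None` is the stricter check when that matters.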
|
||||
@@ -184,11 +182,11 @@ class WebCrawler:
|
||||
extract_blocks_flag: bool = True,
|
||||
word_count_threshold=MIN_WORD_THRESHOLD,
|
||||
use_cached_html: bool = False,
|
||||
extraction_strategy: ExtractionStrategy = CosineStrategy(),
|
||||
extraction_strategy: ExtractionStrategy = None,
|
||||
chunking_strategy: ChunkingStrategy = RegexChunking(),
|
||||
**kwargs,
|
||||
) -> List[CrawlResult]:
|
||||
|
||||
extraction_strategy = extraction_strategy or NoExtractionStrategy()
|
||||
def fetch_page_wrapper(url_model, *args, **kwargs):
|
||||
return self.fetch_page(url_model, *args, **kwargs)
|
||||
|
||||
|
||||
@@ -1,7 +1,7 @@
|
||||
{
|
||||
"RegexChunking": "### RegexChunking\n\n`RegexChunking` is a text chunking strategy that splits a given text into smaller parts using regular expressions.\nThis is useful for preparing large texts for processing by language models, ensuring they are divided into manageable segments.\n\n#### Constructor Parameters:\n- `patterns` (list, optional): A list of regular expression patterns used to split the text. Default is to split by double newlines (`['\\n\\n']`).\n\n#### Example usage:\n```python\nchunker = RegexChunking(patterns=[r'\\n\\n', r'\\. '])\nchunks = chunker.chunk(\"This is a sample text. It will be split into chunks.\")\n```",
|
||||
|
||||
"NlpSentenceChunking": "### NlpSentenceChunking\n\n`NlpSentenceChunking` uses a natural language processing model to chunk a given text into sentences. This approach leverages SpaCy to accurately split text based on sentence boundaries.\n\n#### Constructor Parameters:\n- `model` (str, optional): The SpaCy model to use for sentence detection. Default is `'en_core_web_sm'`.\n\n#### Example usage:\n```python\nchunker = NlpSentenceChunking(model='en_core_web_sm')\nchunks = chunker.chunk(\"This is a sample text. It will be split into sentences.\")\n```",
|
||||
"NlpSentenceChunking": "### NlpSentenceChunking\n\n`NlpSentenceChunking` uses a natural language processing model to chunk a given text into sentences. This approach leverages SpaCy to accurately split text based on sentence boundaries.\n\n#### Constructor Parameters:\n- None.\n\n#### Example usage:\n```python\nchunker = NlpSentenceChunking()\nchunks = chunker.chunk(\"This is a sample text. It will be split into sentences.\")\n```",
|
||||
|
||||
"TopicSegmentationChunking": "### TopicSegmentationChunking\n\n`TopicSegmentationChunking` uses the TextTiling algorithm to segment a given text into topic-based chunks. This method identifies thematic boundaries in the text.\n\n#### Constructor Parameters:\n- `num_keywords` (int, optional): The number of keywords to extract for each topic segment. Default is `3`.\n\n#### Example usage:\n```python\nchunker = TopicSegmentationChunking(num_keywords=3)\nchunks = chunker.chunk(\"This is a sample text. It will be split into topic-based segments.\")\n```",
|
||||
|
||||
|
||||
@@ -59,12 +59,6 @@ def understanding_parameters(crawler):
|
||||
cprint(f"[LOG] 📦 [bold yellow]Second crawl took {end_time - start_time} seconds and result (forced to crawl):[/bold yellow]")
|
||||
print_result(result)
|
||||
|
||||
# Retrieve raw HTML content
|
||||
cprint("\n🔄 [bold cyan]'include_raw_html' parameter example:[/bold cyan]", True)
|
||||
result = crawler.run(url="https://www.nbcnews.com/business", include_raw_html=False)
|
||||
cprint("[LOG] 📦 [bold yellow]Crawl result (without raw HTML content):[/bold yellow]")
|
||||
print_result(result)
|
||||
|
||||
def add_chunking_strategy(crawler):
|
||||
# Adding a chunking strategy: RegexChunking
|
||||
cprint("\n🧩 [bold cyan]Let's add a chunking strategy: RegexChunking![/bold cyan]", True)
|
||||
@@ -134,7 +128,7 @@ def add_llm_extraction_strategy(crawler):
|
||||
print_result(result)
|
||||
|
||||
result = crawler.run(
|
||||
url="https://www.example.com",
|
||||
url="https://www.nbcnews.com/business",
|
||||
extraction_strategy=LLMExtractionStrategy(
|
||||
provider="openai/gpt-4o",
|
||||
api_token=os.getenv('OPENAI_API_KEY'),
|
||||
@@ -176,12 +170,11 @@ def main():
|
||||
cprint("If this is the first time you're running Crawl4ai, this might take a few seconds to load required model files.")
|
||||
|
||||
crawler = create_crawler()
|
||||
|
||||
cprint("For the rest of this guide, I set crawler.always_by_pass_cache to True to force the crawler to bypass the cache. This is to ensure that we get fresh results for each run.", True)
|
||||
crawler.always_by_pass_cache = True
|
||||
|
||||
basic_usage(crawler)
|
||||
understanding_parameters(crawler)
|
||||
|
||||
crawler.always_by_pass_cache = True
|
||||
add_chunking_strategy(crawler)
|
||||
add_extraction_strategy(crawler)
|
||||
add_llm_extraction_strategy(crawler)
|
||||
|
||||
9
main.py
@@ -44,14 +44,12 @@ def get_crawler():
|
||||
return WebCrawler()
|
||||
|
||||
class CrawlRequest(BaseModel):
|
||||
urls: List[HttpUrl]
|
||||
provider_model: str
|
||||
api_token: str
|
||||
urls: List[str]
|
||||
include_raw_html: Optional[bool] = False
|
||||
bypass_cache: bool = False
|
||||
extract_blocks: bool = True
|
||||
word_count_threshold: Optional[int] = 5
|
||||
extraction_strategy: Optional[str] = "CosineStrategy"
|
||||
extraction_strategy: Optional[str] = "NoExtractionStrategy"
|
||||
extraction_strategy_args: Optional[dict] = {}
|
||||
chunking_strategy: Optional[str] = "RegexChunking"
|
||||
chunking_strategy_args: Optional[dict] = {}
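As a rough illustration of what a request body for this model might look like, the sketch below builds a JSON payload using only the field names visible in `CrawlRequest`. It assumes the later line of each changed pair is the new version (plain string URLs, no `api_token`, `NoExtractionStrategy` as the default); the values are examples, not required settings.

```python
import json

# Field names follow the CrawlRequest model; the values are illustrative.
payload = {
    "urls": ["https://www.nbcnews.com/business"],
    "include_raw_html": False,
    "bypass_cache": True,
    "word_count_threshold": 5,
    "extraction_strategy": "NoExtractionStrategy",
    "chunking_strategy": "RegexChunking",
}

# Serialize to the JSON string that would be sent as the POST body
body = json.dumps(payload)
```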
|
||||
@@ -95,9 +93,6 @@ def import_strategy(module_name: str, class_name: str, *args, **kwargs):
|
||||
@app.post("/crawl")
|
||||
async def crawl_urls(crawl_request: CrawlRequest, request: Request):
|
||||
global current_requests
|
||||
# Raise error if api_token is not provided
|
||||
if not crawl_request.api_token:
|
||||
raise HTTPException(status_code=401, detail="API token is required.")
|
||||
async with lock:
|
||||
if current_requests >= MAX_CONCURRENT_REQUESTS:
|
||||
raise HTTPException(status_code=429, detail="Too many requests - please try again later.")
|
||||
|
||||
@@ -1,144 +0,0 @@
|
||||
[paths]
|
||||
train = null
|
||||
dev = null
|
||||
vectors = null
|
||||
init_tok2vec = null
|
||||
|
||||
[system]
|
||||
seed = 0
|
||||
gpu_allocator = null
|
||||
|
||||
[nlp]
|
||||
lang = "en"
|
||||
pipeline = ["textcat_multilabel"]
|
||||
disabled = []
|
||||
before_creation = null
|
||||
after_creation = null
|
||||
after_pipeline_creation = null
|
||||
batch_size = 1000
|
||||
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}
|
||||
vectors = {"@vectors":"spacy.Vectors.v1"}
|
||||
|
||||
[components]
|
||||
|
||||
[components.textcat_multilabel]
|
||||
factory = "textcat_multilabel"
|
||||
scorer = {"@scorers":"spacy.textcat_multilabel_scorer.v2"}
|
||||
threshold = 0.5
|
||||
|
||||
[components.textcat_multilabel.model]
|
||||
@architectures = "spacy.TextCatEnsemble.v2"
|
||||
nO = null
|
||||
|
||||
[components.textcat_multilabel.model.linear_model]
|
||||
@architectures = "spacy.TextCatBOW.v3"
|
||||
exclusive_classes = false
|
||||
length = 262144
|
||||
ngram_size = 1
|
||||
no_output_layer = false
|
||||
nO = null
|
||||
|
||||
[components.textcat_multilabel.model.tok2vec]
|
||||
@architectures = "spacy.Tok2Vec.v2"
|
||||
|
||||
[components.textcat_multilabel.model.tok2vec.embed]
|
||||
@architectures = "spacy.MultiHashEmbed.v2"
|
||||
width = 64
|
||||
rows = [2000,2000,500,1000,500]
|
||||
attrs = ["NORM","LOWER","PREFIX","SUFFIX","SHAPE"]
|
||||
include_static_vectors = false
|
||||
|
||||
[components.textcat_multilabel.model.tok2vec.encode]
|
||||
@architectures = "spacy.MaxoutWindowEncoder.v2"
|
||||
width = 64
|
||||
window_size = 1
|
||||
maxout_pieces = 3
|
||||
depth = 2
|
||||
|
||||
[corpora]
|
||||
|
||||
[corpora.dev]
|
||||
@readers = "spacy.Corpus.v1"
|
||||
path = ${paths.dev}
|
||||
gold_preproc = false
|
||||
max_length = 0
|
||||
limit = 0
|
||||
augmenter = null
|
||||
|
||||
[corpora.train]
|
||||
@readers = "spacy.Corpus.v1"
|
||||
path = ${paths.train}
|
||||
gold_preproc = false
|
||||
max_length = 0
|
||||
limit = 0
|
||||
augmenter = null
|
||||
|
||||
[training]
|
||||
seed = ${system.seed}
|
||||
gpu_allocator = ${system.gpu_allocator}
|
||||
dropout = 0.1
|
||||
accumulate_gradient = 1
|
||||
patience = 1600
|
||||
max_epochs = 0
|
||||
max_steps = 20000
|
||||
eval_frequency = 200
|
||||
frozen_components = []
|
||||
annotating_components = []
|
||||
dev_corpus = "corpora.dev"
|
||||
train_corpus = "corpora.train"
|
||||
before_to_disk = null
|
||||
before_update = null
|
||||
|
||||
[training.batcher]
|
||||
@batchers = "spacy.batch_by_words.v1"
|
||||
discard_oversize = false
|
||||
tolerance = 0.2
|
||||
get_length = null
|
||||
|
||||
[training.batcher.size]
|
||||
@schedules = "compounding.v1"
|
||||
start = 100
|
||||
stop = 1000
|
||||
compound = 1.001
|
||||
t = 0.0
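To make the batch-size settings above concrete, here is a minimal sketch of a compounding schedule with `start = 100`, `stop = 1000`, and `compound = 1.001`: each step multiplies the previous value by the compound rate and clips it at `stop`. It reimplements the idea in plain Python rather than calling Thinc's `compounding.v1`, so treat it as an approximation of the behavior, not the library's code.

```python
from itertools import islice


def compounding(start, stop, compound):
    """Yield start, start*compound, start*compound**2, ... clipped at stop."""

    def clip(value):
        # Clip in whichever direction the schedule is moving.
        return max(value, stop) if start > stop else min(value, stop)

    value = start
    while True:
        yield clip(value)
        value *= compound


# First 3000 batch sizes (in words) for the values in this config.
sizes = list(islice(compounding(100.0, 1000.0, 1.001), 3000))
print(sizes[0], sizes[-1])
```

The batch size therefore starts small and grows by 0.1% per step until it stays at the `stop` value.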
|
||||
|
||||
[training.logger]
|
||||
@loggers = "spacy.ConsoleLogger.v1"
|
||||
progress_bar = false
|
||||
|
||||
[training.optimizer]
|
||||
@optimizers = "Adam.v1"
|
||||
beta1 = 0.9
|
||||
beta2 = 0.999
|
||||
L2_is_weight_decay = true
|
||||
L2 = 0.01
|
||||
grad_clip = 1.0
|
||||
use_averages = false
|
||||
eps = 0.00000001
|
||||
learn_rate = 0.001
|
||||
|
||||
[training.score_weights]
|
||||
cats_score = 1.0
|
||||
cats_score_desc = null
|
||||
cats_micro_p = null
|
||||
cats_micro_r = null
|
||||
cats_micro_f = null
|
||||
cats_macro_p = null
|
||||
cats_macro_r = null
|
||||
cats_macro_f = null
|
||||
cats_macro_auc = null
|
||||
cats_f_per_type = null
|
||||
|
||||
[pretraining]
|
||||
|
||||
[initialize]
|
||||
vectors = ${paths.vectors}
|
||||
init_tok2vec = ${paths.init_tok2vec}
|
||||
vocab_data = null
|
||||
lookups = null
|
||||
before_init = null
|
||||
after_init = null
|
||||
|
||||
[initialize.components]
|
||||
|
||||
[initialize.tokenizer]
|
||||
@@ -1,122 +0,0 @@
|
||||
{
|
||||
"lang":"en",
|
||||
"name":"pipeline",
|
||||
"version":"0.0.0",
|
||||
"spacy_version":">=3.7.4,<3.8.0",
|
||||
"description":"",
|
||||
"author":"",
|
||||
"email":"",
|
||||
"url":"",
|
||||
"license":"",
|
||||
"spacy_git_version":"bff8725f4",
|
||||
"vectors":{
|
||||
"width":0,
|
||||
"vectors":0,
|
||||
"keys":0,
|
||||
"name":null,
|
||||
"mode":"default"
|
||||
},
|
||||
"labels":{
|
||||
"textcat_multilabel":[
|
||||
"acq",
|
||||
"alum",
|
||||
"barley",
|
||||
"bop",
|
||||
"carcass",
|
||||
"castor-oil",
|
||||
"cocoa",
|
||||
"coconut",
|
||||
"coconut-oil",
|
||||
"coffee",
|
||||
"copper",
|
||||
"copra-cake",
|
||||
"corn",
|
||||
"cotton",
|
||||
"cotton-oil",
|
||||
"cpi",
|
||||
"cpu",
|
||||
"crude",
|
||||
"dfl",
|
||||
"dlr",
|
||||
"dmk",
|
||||
"earn",
|
||||
"fuel",
|
||||
"gas",
|
||||
"gnp",
|
||||
"gold",
|
||||
"grain",
|
||||
"groundnut",
|
||||
"groundnut-oil",
|
||||
"heat",
|
||||
"hog",
|
||||
"housing",
|
||||
"income",
|
||||
"instal-debt",
|
||||
"interest",
|
||||
"ipi",
|
||||
"iron-steel",
|
||||
"jet",
|
||||
"jobs",
|
||||
"l-cattle",
|
||||
"lead",
|
||||
"lei",
|
||||
"lin-oil",
|
||||
"livestock",
|
||||
"lumber",
|
||||
"meal-feed",
|
||||
"money-fx",
|
||||
"money-supply",
|
||||
"naphtha",
|
||||
"nat-gas",
|
||||
"nickel",
|
||||
"nkr",
|
||||
"nzdlr",
|
||||
"oat",
|
||||
"oilseed",
|
||||
"orange",
|
||||
"palladium",
|
||||
"palm-oil",
|
||||
"palmkernel",
|
||||
"pet-chem",
|
||||
"platinum",
|
||||
"potato",
|
||||
"propane",
|
||||
"rand",
|
||||
"rape-oil",
|
||||
"rapeseed",
|
||||
"reserves",
|
||||
"retail",
|
||||
"rice",
|
||||
"rubber",
|
||||
"rye",
|
||||
"ship",
|
||||
"silver",
|
||||
"sorghum",
|
||||
"soy-meal",
|
||||
"soy-oil",
|
||||
"soybean",
|
||||
"strategic-metal",
|
||||
"sugar",
|
||||
"sun-meal",
|
||||
"sun-oil",
|
||||
"sunseed",
|
||||
"tea",
|
||||
"tin",
|
||||
"trade",
|
||||
"veg-oil",
|
||||
"wheat",
|
||||
"wpi",
|
||||
"yen",
|
||||
"zinc"
|
||||
]
|
||||
},
|
||||
"pipeline":[
|
||||
"textcat_multilabel"
|
||||
],
|
||||
"components":[
|
||||
"textcat_multilabel"
|
||||
],
|
||||
"disabled":[
|
||||
|
||||
]
|
||||
}
|
||||
@@ -1,95 +0,0 @@
|
||||
{
|
||||
"labels":[
|
||||
"acq",
|
||||
"alum",
|
||||
"barley",
|
||||
"bop",
|
||||
"carcass",
|
||||
"castor-oil",
|
||||
"cocoa",
|
||||
"coconut",
|
||||
"coconut-oil",
|
||||
"coffee",
|
||||
"copper",
|
||||
"copra-cake",
|
||||
"corn",
|
||||
"cotton",
|
||||
"cotton-oil",
|
||||
"cpi",
|
||||
"cpu",
|
||||
"crude",
|
||||
"dfl",
|
||||
"dlr",
|
||||
"dmk",
|
||||
"earn",
|
||||
"fuel",
|
||||
"gas",
|
||||
"gnp",
|
||||
"gold",
|
||||
"grain",
|
||||
"groundnut",
|
||||
"groundnut-oil",
|
||||
"heat",
|
||||
"hog",
|
||||
"housing",
|
||||
"income",
|
||||
"instal-debt",
|
||||
"interest",
|
||||
"ipi",
|
||||
"iron-steel",
|
||||
"jet",
|
||||
"jobs",
|
||||
"l-cattle",
|
||||
"lead",
|
||||
"lei",
|
||||
"lin-oil",
|
||||
"livestock",
|
||||
"lumber",
|
||||
"meal-feed",
|
||||
"money-fx",
|
||||
"money-supply",
|
||||
"naphtha",
|
||||
"nat-gas",
|
||||
"nickel",
|
||||
"nkr",
|
||||
"nzdlr",
|
||||
"oat",
|
||||
"oilseed",
|
||||
"orange",
|
||||
"palladium",
|
||||
"palm-oil",
|
||||
"palmkernel",
|
||||
"pet-chem",
|
||||
"platinum",
|
||||
"potato",
|
||||
"propane",
|
||||
"rand",
|
||||
"rape-oil",
|
||||
"rapeseed",
|
||||
"reserves",
|
||||
"retail",
|
||||
"rice",
|
||||
"rubber",
|
||||
"rye",
|
||||
"ship",
|
||||
"silver",
|
||||
"sorghum",
|
||||
"soy-meal",
|
||||
"soy-oil",
|
||||
"soybean",
|
||||
"strategic-metal",
|
||||
"sugar",
|
||||
"sun-meal",
|
||||
"sun-oil",
|
||||
"sunseed",
|
||||
"tea",
|
||||
"tin",
|
||||
"trade",
|
||||
"veg-oil",
|
||||
"wheat",
|
||||
"wpi",
|
||||
"yen",
|
||||
"zinc"
|
||||
],
|
||||
"threshold":0.5
|
||||
}
|
||||
Binary file not shown.
File diff suppressed because one or more lines are too long
File diff suppressed because it is too large
Binary file not shown.
@@ -1,3 +0,0 @@
|
||||
{
|
||||
"mode":"default"
|
||||
}
|
||||
27
pages/app.js
@@ -69,9 +69,12 @@ axios
|
||||
// Handle crawl button click
|
||||
document.getElementById("crawl-btn").addEventListener("click", () => {
|
||||
// validate input to have both URL and API token
|
||||
if (!document.getElementById("url-input").value || !document.getElementById("token-input").value) {
|
||||
alert("Please enter both URL(s) and API token.");
|
||||
return;
|
||||
// if selected extraction strategy is LLMExtractionStrategy, then API token is required
|
||||
if (document.getElementById("extraction-strategy-select").value === "LLMExtractionStrategy") {
|
||||
if (!document.getElementById("url-input").value || !document.getElementById("token-input").value) {
|
||||
alert("Please enter both URL(s) and API token.");
|
||||
return;
|
||||
}
|
||||
}
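The new click-handler check only asks for an API token when the LLM strategy is selected. The same rule is sketched here in Python (for consistency with the other examples on this page, not because the page script is Python); `missing_input_error` is a hypothetical helper name, and the message text is copied from the hunk.

```python
LLM_STRATEGY = "LLMExtractionStrategy"


def missing_input_error(url_input, token_input, strategy):
    """Return an error message, or None when the request may proceed."""
    if strategy == LLM_STRATEGY:
        # Mirrors the nested check: both fields are required on this path only.
        if not url_input or not token_input:
            return "Please enter both URL(s) and API token."
    return None


print(missing_input_error("https://www.nbcnews.com/business", "", "CosineStrategy"))
print(missing_input_error("https://www.nbcnews.com/business", "", LLM_STRATEGY))
```

Note that, as written in the hunk, the URL field is checked only on the LLM path, so an empty URL with any other strategy passes this particular check.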
|
||||
|
||||
const selectedProviderModel = document.getElementById("provider-model-select").value;
|
||||
@@ -87,8 +90,6 @@ document.getElementById("crawl-btn").addEventListener("click", () => {
|
||||
const urls = urlsInput.split(",").map((url) => url.trim());
|
||||
const data = {
|
||||
urls: urls,
|
||||
provider_model: selectedProviderModel,
|
||||
api_token: apiToken,
|
||||
include_raw_html: true,
|
||||
bypass_cache: bypassCache,
|
||||
extract_blocks: extractBlocks,
|
||||
@@ -112,8 +113,8 @@ document.getElementById("crawl-btn").addEventListener("click", () => {
|
||||
localStorage.setItem("api_token", document.getElementById("token-input").value);
|
||||
|
||||
document.getElementById("loading").classList.remove("hidden");
|
||||
document.getElementById("result").classList.add("hidden");
|
||||
document.getElementById("code_help").classList.add("hidden");
|
||||
document.getElementById("result").style.visibility = "hidden";
|
||||
document.getElementById("code_help").style.visibility = "hidden";
|
||||
|
||||
axios
|
||||
.post("/crawl", data)
|
||||
@@ -128,18 +129,20 @@ document.getElementById("crawl-btn").addEventListener("click", () => {
|
||||
const extractionStrategy = data.extraction_strategy;
|
||||
const isLLMExtraction = extractionStrategy === "LLMExtractionStrategy";
|
||||
|
||||
// REMOVE API TOKEN FROM CODE EXAMPLES
|
||||
data.extraction_strategy_args.api_token = "your_api_token";
|
||||
document.getElementById(
|
||||
"curl-code"
|
||||
).textContent = `curl -X POST -H "Content-Type: application/json" -d '${JSON.stringify({
|
||||
...data,
|
||||
api_token: isLLMExtraction ? "your_api_token" : undefined,
|
||||
})}' http://crawl4ai.uccode.io/crawl`;
|
||||
}, null, 2)}' http://crawl4ai.com/crawl`;
|
||||
|
||||
document.getElementById("python-code").textContent = `import requests\n\ndata = ${JSON.stringify(
|
||||
{ ...data, api_token: isLLMExtraction ? "your_api_token" : undefined },
|
||||
null,
|
||||
2
|
||||
)}\n\nresponse = requests.post("http://crawl4ai.uccode.io/crawl", json=data) # or localhost if you run locally\nprint(response.json())`;
|
||||
)}\n\nresponse = requests.post("http://crawl4ai.com/crawl", json=data) # or localhost if you run locally\nprint(response.json())`;
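The generated snippets hide the real token before they are displayed. A minimal Python sketch of that idea follows, using only the standard library. Assumptions: the token travels in `extraction_strategy_args.api_token` (as the "REMOVE API TOKEN" line suggests), `redact_for_display` and `build_request` are hypothetical helper names, and the token value is a placeholder. Unlike the page script, which overwrites the field on `data` directly, this sketch works on a copy and only touches the field when it is present.

```python
import copy
import json
import urllib.request

CRAWL_ENDPOINT = "http://crawl4ai.com/crawl"  # URL taken from the generated snippets


def redact_for_display(payload):
    """Return a copy with the token masked; the original payload is untouched."""
    shown = copy.deepcopy(payload)
    args = shown.setdefault("extraction_strategy_args", {})
    if "api_token" in args:
        args["api_token"] = "your_api_token"
    return shown


def build_request(payload):
    """Build (but do not send) the JSON POST request for the /crawl endpoint."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        CRAWL_ENDPOINT,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


payload = {
    "urls": ["https://www.nbcnews.com/business"],
    "extraction_strategy": "LLMExtractionStrategy",
    "extraction_strategy_args": {"api_token": "secret-token"},  # placeholder value
}
print(json.dumps(redact_for_display(payload), indent=2))
request = build_request(payload)  # pass to urllib.request.urlopen() to actually send it
```

Working on a copy keeps the payload that is sent and the payload that is shown separate, so the displayed example can never leak into a later request.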
|
||||
|
||||
document.getElementById(
|
||||
"nodejs-code"
|
||||
@@ -147,7 +150,7 @@ document.getElementById("crawl-btn").addEventListener("click", () => {
|
||||
{ ...data, api_token: isLLMExtraction ? "your_api_token" : undefined },
|
||||
null,
|
||||
2
|
||||
)};\n\naxios.post("http://crawl4ai.uccode.io/crawl", data) // or localhost if you run locally\n .then(response => console.log(response.data))\n .catch(error => console.error(error));`;
|
||||
)};\n\naxios.post("http://crawl4ai.com/crawl", data) // or localhost if you run locally\n .then(response => console.log(response.data))\n .catch(error => console.error(error));`;
|
||||
|
||||
document.getElementById(
|
||||
"library-code"
|
||||
@@ -169,8 +172,8 @@ document.getElementById("crawl-btn").addEventListener("click", () => {
|
||||
|
||||
document.getElementById("loading").classList.add("hidden");
|
||||
|
||||
document.getElementById("result").classList.remove("hidden");
|
||||
document.getElementById("code_help").classList.remove("hidden");
|
||||
document.getElementById("result").style.visibility = "visible";
|
||||
document.getElementById("code_help").style.visibility = "visible";
|
||||
|
||||
// increment the total count
|
||||
document.getElementById("total-count").textContent =
|
||||
|
||||
@@ -29,7 +29,7 @@
|
||||
class="bg-zinc-800 p-4 rounded mt-2 text-zinc-100"
|
||||
><code>virtualenv venv
|
||||
source venv/bin/activate
|
||||
pip install git+https://github.com/unclecode/crawl4ai.git
|
||||
pip install "crawl4ai[all] @ git+https://github.com/unclecode/crawl4ai.git"
|
||||
</code></pre>
|
||||
</li>
|
||||
<li class="mb-4">
|
||||
@@ -46,7 +46,7 @@ pip install git+https://github.com/unclecode/crawl4ai.git
|
||||
source venv/bin/activate
|
||||
git clone https://github.com/unclecode/crawl4ai.git
|
||||
cd crawl4ai
|
||||
pip install -e .
|
||||
pip install -e .[all]
|
||||
</code></pre>
|
||||
</li>
|
||||
<li class="">
|
||||
|
||||
@@ -46,9 +46,9 @@
|
||||
id="extraction-strategy-select"
|
||||
class="border border-zinc-700 rounded px-4 py-1 bg-zinc-900 text-zinc-300"
|
||||
>
|
||||
<option value="NoExtractionStrategy" selected>NoExtractionStrategy</option>
|
||||
<option value="CosineStrategy">CosineStrategy</option>
|
||||
<option value="LLMExtractionStrategy">LLMExtractionStrategy</option>
|
||||
<option value="NoExtractionStrategy">NoExtractionStrategy</option>
|
||||
</select>
|
||||
</div>
|
||||
<div class="flex flex-col">
|
||||
@@ -99,7 +99,7 @@
|
||||
</div>
|
||||
<div class="flex gap-2">
|
||||
<!-- Two textareas: one for the keyword filter and one for the instruction; both grow to fill the available width -->
|
||||
<div id = "semantic_filter_div" class="flex flex-col flex-1">
|
||||
<div id = "semantic_filter_div" class="flex flex-col flex-1 hidden">
|
||||
<label for="keyword-filter" class="text-lime-500 font-bold text-xs">Keyword Filter</label>
|
||||
<textarea
|
||||
id="semantic_filter"
|
||||
@@ -131,10 +131,10 @@
|
||||
</div>
|
||||
</div>
|
||||
|
||||
<div id="loading" class="hidden">
|
||||
<p class="text-white">Loading... Please wait.</p>
|
||||
</div>
|
||||
<div id="result" class="flex-1">
|
||||
<div id="loading" class="hidden">
|
||||
<p class="text-white">Loading... Please wait.</p>
|
||||
</div>
|
||||
<div class="tab-buttons flex gap-2">
|
||||
<button class="tab-btn px-4 py-1 text-sm bg-zinc-700 rounded-t text-lime-500" data-tab="json">
|
||||
JSON
|
||||
@@ -181,19 +181,19 @@
|
||||
</button> -->
|
||||
</div>
|
||||
<div class="tab-content result bg-zinc-900 p-2 rounded h-full border border-zinc-700 text-sm">
|
||||
<pre class="h-full flex relative">
|
||||
<pre class="h-full flex relative overflow-x-auto">
|
||||
<code id="curl-code" class="language-bash"></code>
|
||||
<button class="absolute top-2 right-2 bg-zinc-700 text-white px-2 py-1 rounded copy-btn" data-target="curl-code">Copy</button>
|
||||
</pre>
|
||||
<pre class="hidden h-full flex relative">
|
||||
<pre class="hidden h-full flex relative overflow-x-auto">
|
||||
<code id="python-code" class="language-python"></code>
|
||||
<button class="absolute top-2 right-2 bg-zinc-700 text-white px-2 py-1 rounded copy-btn" data-target="python-code">Copy</button>
|
||||
</pre>
|
||||
<pre class="hidden h-full flex relative">
|
||||
<pre class="hidden h-full flex relative overflow-x-auto">
|
||||
<code id="nodejs-code" class="language-javascript"></code>
|
||||
<button class="absolute top-2 right-2 bg-zinc-700 text-white px-2 py-1 rounded copy-btn" data-target="nodejs-code">Copy</button>
|
||||
</pre>
|
||||
<pre class="hidden h-full flex relative">
|
||||
<pre class="hidden h-full flex relative overflow-x-auto">
|
||||
<code id="library-code" class="language-python"></code>
|
||||
<button class="absolute top-2 right-2 bg-zinc-700 text-white px-2 py-1 rounded copy-btn" data-target="library-code">Copy</button>
|
||||
</pre>
|
||||
|
||||
@@ -236,12 +236,11 @@ chunks = chunker.chunk("This is a sample text. It will be split into chunks.")
|
||||
<h4>Constructor Parameters:</h4>
|
||||
<ul>
|
||||
<li>
|
||||
<code>model</code> (str, optional): The SpaCy model to use for sentence detection. Default is
|
||||
<code>'en_core_web_sm'</code>.
|
||||
None.
|
||||
</li>
|
||||
</ul>
|
||||
<h4>Example usage:</h4>
|
||||
<pre><code class="language-python">chunker = NlpSentenceChunking(model='en_core_web_sm')
|
||||
<pre><code class="language-python">chunker = NlpSentenceChunking()
|
||||
chunks = chunker.chunk("This is a sample text. It will be split into sentences.")
|
||||
</code></pre>
|
||||
</div>
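For readers who want to try the `chunk()` interface shown above without installing the library, here is a stand-in that splits on sentence-ending punctuation with a regular expression. It is not the library's `NlpSentenceChunking` and does not use an NLP model; the class name `RegexSentenceChunker` and its `pattern` parameter are hypothetical.

```python
import re


class RegexSentenceChunker:
    """Hypothetical stand-in: regex sentence splitting, no NLP model involved."""

    def __init__(self, pattern=r"(?<=[.!?])\s+"):
        # Split on whitespace that follows '.', '!' or '?'.
        self._splitter = re.compile(pattern)

    def chunk(self, text):
        return [part for part in self._splitter.split(text.strip()) if part]


chunker = RegexSentenceChunker()
chunks = chunker.chunk("This is a sample text. It will be split into sentences.")
print(chunks)  # ['This is a sample text.', 'It will be split into sentences.']
```

A model-based chunker handles abbreviations and quotations that this simple pattern will split incorrectly.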
|
||||
|
||||
@@ -2,7 +2,6 @@ aiohttp==3.9.5
|
||||
aiosqlite==0.20.0
|
||||
bs4==0.0.2
|
||||
fastapi==0.111.0
|
||||
typer==0.9.0
|
||||
html2text==2024.2.26
|
||||
httpx==0.27.0
|
||||
lazy_import==0.2.2
|
||||
@@ -14,7 +13,6 @@ requests==2.31.0
|
||||
rich==13.7.1
|
||||
scikit-learn==1.4.2
|
||||
selenium==4.20.0
|
||||
spacy==3.7.4
|
||||
uvicorn==0.29.0
|
||||
transformers==4.40.2
|
||||
chromedriver-autoinstaller==0.6.4
|
||||
|
||||
32
setup.py
@@ -1,24 +1,18 @@
|
||||
from setuptools import setup, find_packages
|
||||
from setuptools.command.install import install as _install
|
||||
import subprocess
|
||||
import sys
|
||||
|
||||
class InstallCommand(_install):
|
||||
def run(self):
|
||||
# Run the standard install first
|
||||
_install.run(self)
|
||||
# Now handle the dependencies manually
|
||||
self.manual_dependencies_install()
|
||||
# Read the requirements from requirements.txt
|
||||
with open("requirements.txt") as f:
|
||||
requirements = f.read().splitlines()
|
||||
|
||||
def manual_dependencies_install(self):
|
||||
with open('requirements.txt') as f:
|
||||
dependencies = f.read().splitlines()
|
||||
for dependency in dependencies:
|
||||
subprocess.check_call([sys.executable, '-m', 'pip', 'install', dependency])
|
||||
# Define the requirements for different environments
|
||||
requirements_without_torch = [req for req in requirements if not req.startswith("torch")]
|
||||
requirements_without_transformers = [req for req in requirements if not req.startswith("transformers")]
|
||||
requirements_without_nltk = [req for req in requirements if not req.startswith("nltk")]
|
||||
requirements_without_torch_transformers_nlkt = [req for req in requirements if not req.startswith("torch") and not req.startswith("transformers") and not req.startswith("nltk")]
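The prefix filtering above, and the extras it feeds later in this hunk, can be sketched in isolation. This is a minimal sketch under assumptions: `without` is a hypothetical helper, the requirement list is a small sample (the `torch` and `nltk` pins are made up because they do not appear in this diff), and no call to `setup()` is made.

```python
def without(requirements, *prefixes):
    """Keep requirement lines that do not start with any of the given prefixes."""
    # str.startswith accepts a tuple, so one call covers all prefixes.
    return [req for req in requirements if not req.startswith(prefixes)]


requirements = [
    "aiohttp==3.9.5",
    "selenium==4.20.0",
    "transformers==4.40.2",
    "torch==2.3.0",  # hypothetical pin, not shown in this diff
    "nltk==3.8.1",  # hypothetical pin, not shown in this diff
]

extras_require = {
    "all": requirements,  # everything
    "colab": without(requirements, "torch"),  # torch left out
    "crawl": without(requirements, "torch", "transformers", "nltk"),
}
install_requires = extras_require["crawl"]  # the default install is the light set
print(install_requires)
```

Because the test is a plain prefix match, a line such as `torchvision==...` would also be dropped by the `torch` filter, which may or may not be intended. The heavier set is then requested with the extras syntax shown elsewhere in this diff, for example `pip install "crawl4ai[all] @ git+https://github.com/unclecode/crawl4ai.git"`.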
|
||||
|
||||
setup(
|
||||
name="Crawl4AI",
|
||||
version="0.1.0",
|
||||
version="0.2.0",
|
||||
description="🔥🕷️ Crawl4AI: Open-source LLM Friendly Web Crawler & Scraper",
|
||||
long_description=open("README.md").read(),
|
||||
long_description_content_type="text/markdown",
|
||||
@@ -27,9 +21,11 @@ setup(
|
||||
author_email="unclecode@kidocode.com",
|
||||
license="MIT",
|
||||
packages=find_packages(),
|
||||
install_requires=[], # Leave this empty to avoid default dependency resolution
|
||||
cmdclass={
|
||||
'install': InstallCommand,
|
||||
install_requires=requirements_without_torch_transformers_nlkt,
|
||||
extras_require={
|
||||
"all": requirements, # Include all requirements
|
||||
"colab": requirements_without_torch, # Exclude torch for Colab
|
||||
"crawl": requirements_without_torch_transformers_nlkt
|
||||
},
|
||||
entry_points={
|
||||
'console_scripts': [
|
||||
|
||||
@@ -1,35 +0,0 @@
|
||||
import os
|
||||
|
||||
def install_crawl4ai():
|
||||
print("Installing Crawl4AI and its dependencies...")
|
||||
|
||||
# Install dependencies
|
||||
!pip install -U 'spacy[cuda12x]'
|
||||
!apt-get update -y
|
||||
!apt install chromium-chromedriver -y
|
||||
!pip install chromedriver_autoinstaller
|
||||
!pip install git+https://github.com/unclecode/crawl4ai.git@new-release-0.0.2
|
||||
|
||||
# Install ChromeDriver
|
||||
import chromedriver_autoinstaller
|
||||
chromedriver_autoinstaller.install()
|
||||
|
||||
# Download the reuters model
|
||||
repo_url = "https://github.com/unclecode/crawl4ai.git"
|
||||
branch = "new-release-0.0.2"
|
||||
folder_path = "models/reuters"
|
||||
|
||||
!git clone -b {branch} {repo_url}
|
||||
!mkdir -p models
|
||||
|
||||
repo_folder = "crawl4ai"
|
||||
source_folder = os.path.join(repo_folder, folder_path)
|
||||
destination_folder = "models"
|
||||
|
||||
!mv "{source_folder}" "{destination_folder}"
|
||||
!rm -rf "{repo_folder}"
|
||||
|
||||
print("Installation and model download completed successfully!")
|
||||
|
||||
# Run the installer
|
||||
install_crawl4ai()
|
||||