Compare commits


5 Commits

unclecode · 7afa11a02f · 2024-09-28 00:12:58 +08:00
Update .gitignore to include test_env/ and tmp/ directories

unclecode · dec3d44224 · 2024-08-19 15:37:07 +08:00
refactor: Update extraction strategy to handle schema extraction with non-empty schema

This change updates the `LLMExtractionStrategy` class to handle schema extraction only when a schema is actually supplied. Previously, schema extraction was triggered whenever `extract_type` was set to "schema", regardless of whether a schema was provided. With this update, schema extraction is performed only if `extract_type` is "schema" *and* a non-empty schema is provided. This ensures the extraction strategy behaves correctly and avoids unnecessary schema extraction when it is not needed. The `numpy` dependency was also removed from the default installation mode.

unclecode · e5e6a34e80 · 2024-08-04 14:54:18 +08:00
## [v0.2.77] - 2024-08-04
Significant improvements in text processing and performance:

- 🚀 **Dependency reduction**: Removed dependency on spaCy model for text chunk labeling in cosine extraction strategy.
- 🤖 **Transformer upgrade**: Implemented text sequence classification using a transformer model for labeling text chunks.
- **Performance enhancement**: Improved model loading speed due to removal of spaCy dependency.
- 🔧 **Future-proofing**: Laid groundwork for potential complete removal of spaCy dependency in future versions.

These changes address issue #68 and provide a foundation for faster, more efficient text processing in Crawl4AI.

unclecode · 897e766728 · 2024-08-02 16:04:14 +08:00
Update README

unclecode · 9200a6731d · 2024-08-02 16:02:42 +08:00
## [v0.2.76] - 2024-08-02
Major improvements in functionality, performance, and cross-platform compatibility! 🚀

- 🐳 **Docker enhancements**: Significantly improved Dockerfile for easy installation on Linux, Mac, and Windows.
- 🌐 **Official Docker Hub image**: Launched our first official image on Docker Hub for streamlined deployment (unclecode/crawl4ai).
- 🔧 **Selenium upgrade**: Removed dependency on ChromeDriver, now using Selenium's built-in capabilities for better compatibility.
- 🖼️ **Image description**: Implemented ability to generate textual descriptions for extracted images from web pages.
- **Performance boost**: Various improvements to enhance overall speed and performance.
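The schema guard described in commit dec3d44224 can be sketched in isolation. This is a minimal illustration, not the actual class: `choose_prompt` is a hypothetical helper, and the two prompt constants stand in for the real prompt templates named in the diff below.

```python
import json

# Stand-ins for the real prompt templates referenced in the diff.
PROMPT_EXTRACT_BLOCKS_WITH_INSTRUCTION = "blocks prompt"
PROMPT_EXTRACT_SCHEMA_WITH_INSTRUCTION = "schema prompt"

def choose_prompt(extract_type, schema, variable_values):
    """Pick the extraction prompt; only switch to schema extraction
    when a non-empty schema is actually provided."""
    prompt = PROMPT_EXTRACT_BLOCKS_WITH_INSTRUCTION
    if extract_type == "schema" and schema:  # empty dict/None falls through
        variable_values["SCHEMA"] = json.dumps(schema, indent=2)
        prompt = PROMPT_EXTRACT_SCHEMA_WITH_INSTRUCTION
    return prompt
```

With an empty schema the blocks prompt is kept, which is exactly the behavior change the commit message describes.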
9 changed files with 116 additions and 47 deletions

.gitignore

@@ -189,4 +189,6 @@ a.txt
 .lambda_function.py
 ec2*
 update_changelog.sh
+test_env/
+tmp/


@@ -1,5 +1,33 @@
 # Changelog
+## [v0.2.77] - 2024-08-04
+Significant improvements in text processing and performance:
+- 🚀 **Dependency reduction**: Removed dependency on spaCy model for text chunk labeling in cosine extraction strategy.
+- 🤖 **Transformer upgrade**: Implemented text sequence classification using a transformer model for labeling text chunks.
+- **Performance enhancement**: Improved model loading speed due to removal of spaCy dependency.
+- 🔧 **Future-proofing**: Laid groundwork for potential complete removal of spaCy dependency in future versions.
+These changes address issue #68 and provide a foundation for faster, more efficient text processing in Crawl4AI.
+## [v0.2.76] - 2024-08-02
+Major improvements in functionality, performance, and cross-platform compatibility! 🚀
+- 🐳 **Docker enhancements**: Significantly improved Dockerfile for easy installation on Linux, Mac, and Windows.
+- 🌐 **Official Docker Hub image**: Launched our first official image on Docker Hub for streamlined deployment.
+- 🔧 **Selenium upgrade**: Removed dependency on ChromeDriver, now using Selenium's built-in capabilities for better compatibility.
+- 🖼️ **Image description**: Implemented ability to generate textual descriptions for extracted images from web pages.
+- **Performance boost**: Various improvements to enhance overall speed and performance.
+A big shoutout to our amazing community contributors:
+- [@aravindkarnam](https://github.com/aravindkarnam) for developing the textual description extraction feature.
+- [@FractalMind](https://github.com/FractalMind) for creating the first official Docker Hub image and fixing Dockerfile errors.
+- [@ketonkss4](https://github.com/ketonkss4) for identifying Selenium's new capabilities, helping us reduce dependencies.
+Your contributions are driving Crawl4AI forward! 🙌
 ## [v0.2.75] - 2024-07-19
 Minor improvements for a more maintainable codebase:


@@ -1,4 +1,4 @@
-# Crawl4AI v0.2.76 🕷️🤖
+# Crawl4AI v0.2.77 🕷️🤖
 [![GitHub Stars](https://img.shields.io/github/stars/unclecode/crawl4ai?style=social)](https://github.com/unclecode/crawl4ai/stargazers)
 [![GitHub Forks](https://img.shields.io/github/forks/unclecode/crawl4ai?style=social)](https://github.com/unclecode/crawl4ai/network/members)
@@ -8,6 +8,21 @@
 Crawl4AI simplifies web crawling and data extraction, making it accessible for large language models (LLMs) and AI applications. 🆓🌐
+#### [v0.2.77] - 2024-08-02
+Major improvements in functionality, performance, and cross-platform compatibility! 🚀
+- 🐳 **Docker enhancements**:
+  - Significantly improved Dockerfile for easy installation on Linux, Mac, and Windows.
+- 🌐 **Official Docker Hub image**:
+  - Launched our first official image on Docker Hub for streamlined deployment (unclecode/crawl4ai).
+- 🔧 **Selenium upgrade**:
+  - Removed dependency on ChromeDriver, now using Selenium's built-in capabilities for better compatibility.
+- 🖼️ **Image description**:
+  - Implemented ability to generate textual descriptions for extracted images from web pages.
+- **Performance boost**:
+  - Various improvements to enhance overall speed and performance.
 ## Try it Now!
 ✨ Play around with this [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1sJPAmeLj5PMrg2VgOwMJ2ubGIcK0cJeX?usp=sharing)
@@ -35,7 +50,7 @@ Crawl4AI simplifies web crawling and data extraction, making it accessible for l
 # Crawl4AI
-## 🌟 Shoutout to Contributors of v0.2.76!
+## 🌟 Shoutout to Contributors of v0.2.77!
 A big thank you to the amazing contributors who've made this release possible:


@@ -9,6 +9,7 @@ from .utils import *
 from functools import partial
 from .model_loader import *
 import math
+import numpy as np
 class ExtractionStrategy(ABC):
@@ -100,7 +101,7 @@ class LLMExtractionStrategy(ExtractionStrategy):
         variable_values["REQUEST"] = self.instruction
         prompt_with_variables = PROMPT_EXTRACT_BLOCKS_WITH_INSTRUCTION
-        if self.extract_type == "schema":
+        if self.extract_type == "schema" and self.schema:
             variable_values["SCHEMA"] = json.dumps(self.schema, indent=2)
             prompt_with_variables = PROMPT_EXTRACT_SCHEMA_WITH_INSTRUCTION
@@ -248,6 +249,9 @@ class CosineStrategy(ExtractionStrategy):
         self.get_embedding_method = "direct"
         self.device = get_device()
+        import torch
+        self.device = torch.device('cpu')
         self.default_batch_size = calculate_batch_size(self.device)
         if self.verbose:
@@ -260,7 +264,9 @@ class CosineStrategy(ExtractionStrategy):
         # else:
         self.tokenizer, self.model = load_bge_small_en_v1_5()
+        self.model.to(self.device)
         self.model.eval()
         self.get_embedding_method = "batch"
         self.buffer_embeddings = np.array([])
@@ -282,7 +288,7 @@ class CosineStrategy(ExtractionStrategy):
         if self.verbose:
             print(f"[LOG] Loading Multilabel Classifier for {self.device.type} device.")
-        self.nlp, self.device = load_text_multilabel_classifier()
+        self.nlp, _ = load_text_multilabel_classifier()
         # self.default_batch_size = 16 if self.device.type == 'cpu' else 64
         if self.verbose:
@@ -453,21 +459,21 @@ class CosineStrategy(ExtractionStrategy):
         if self.verbose:
             print(f"[LOG] 🚀 Assign tags using {self.device}")
-        if self.device.type in ["gpu", "cuda", "mps"]:
+        if self.device.type in ["gpu", "cuda", "mps", "cpu"]:
             labels = self.nlp([cluster['content'] for cluster in cluster_list])
             for cluster, label in zip(cluster_list, labels):
                 cluster['tags'] = label
-        elif self.device == "cpu":
-            # Process the text with the loaded model
-            texts = [cluster['content'] for cluster in cluster_list]
-            # Batch process texts
-            docs = self.nlp.pipe(texts, disable=["tagger", "parser", "ner", "lemmatizer"])
-            for doc, cluster in zip(docs, cluster_list):
-                tok_k = self.top_k
-                top_categories = sorted(doc.cats.items(), key=lambda x: x[1], reverse=True)[:tok_k]
-                cluster['tags'] = [cat for cat, _ in top_categories]
+        # elif self.device.type == "cpu":
+        #     # Process the text with the loaded model
+        #     texts = [cluster['content'] for cluster in cluster_list]
+        #     # Batch process texts
+        #     docs = self.nlp.pipe(texts, disable=["tagger", "parser", "ner", "lemmatizer"])
+        #     for doc, cluster in zip(docs, cluster_list):
+        #         tok_k = self.top_k
+        #         top_categories = sorted(doc.cats.items(), key=lambda x: x[1], reverse=True)[:tok_k]
+        #         cluster['tags'] = [cat for cat, _ in top_categories]
         # for cluster in cluster_list:
         #     doc = self.nlp(cluster['content'])
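The top-k tag selection in the (now commented-out) spaCy branch above is a small, self-contained pattern. A minimal sketch, with `top_k_tags` as a hypothetical helper and the score dictionary standing in for a classifier's per-category output:

```python
def top_k_tags(category_scores, top_k=3):
    """Keep the top_k highest-scoring category labels,
    mirroring the sorted(...)[:top_k] pattern in the diff above."""
    ranked = sorted(category_scores.items(), key=lambda x: x[1], reverse=True)
    return [cat for cat, _ in ranked[:top_k]]

cluster = {"content": "GPU prices drop ahead of the playoffs"}
cluster["tags"] = top_k_tags({"tech": 0.91, "sports": 0.05, "news": 0.40}, top_k=2)
# → ["tech", "news"]
```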


@@ -6,6 +6,7 @@ import tarfile
 from .model_loader import *
 import argparse
 import urllib.request
+from crawl4ai.config import MODEL_REPO_BRANCH
 __location__ = os.path.realpath(os.path.join(os.getcwd(), os.path.dirname(__file__)))
 @lru_cache()
@@ -141,14 +142,15 @@ def load_text_multilabel_classifier():
     from scipy.special import expit
     import torch
-    # Check for available device: CUDA, MPS (for Apple Silicon), or CPU
-    if torch.cuda.is_available():
-        device = torch.device("cuda")
-    elif torch.backends.mps.is_available():
-        device = torch.device("mps")
-    else:
-        return load_spacy_model(), torch.device("cpu")
+    # # Check for available device: CUDA, MPS (for Apple Silicon), or CPU
+    # if torch.cuda.is_available():
+    #     device = torch.device("cuda")
+    # elif torch.backends.mps.is_available():
+    #     device = torch.device("mps")
+    # else:
+    #     device = torch.device("cpu")
+    # # return load_spacy_model(), torch.device("cpu")
     MODEL = "cardiffnlp/tweet-topic-21-multi"
     tokenizer = AutoTokenizer.from_pretrained(MODEL, resume_download=None)
@@ -192,51 +194,61 @@ def load_spacy_model():
     import spacy
     name = "models/reuters"
     home_folder = get_home_folder()
-    model_folder = os.path.join(home_folder, name)
+    model_folder = Path(home_folder) / name
     # Check if the model directory already exists
-    if not (Path(model_folder).exists() and any(Path(model_folder).iterdir())):
+    if not (model_folder.exists() and any(model_folder.iterdir())):
         repo_url = "https://github.com/unclecode/crawl4ai.git"
+        # branch = "main"
         branch = MODEL_REPO_BRANCH
-        repo_folder = os.path.join(home_folder, "crawl4ai")
-        model_folder = os.path.join(home_folder, name)
-        print("[LOG] ⏬ Downloading Spacy model for the first time...")
+        repo_folder = Path(home_folder) / "crawl4ai"
+        # print("[LOG] ⏬ Downloading Spacy model for the first time...")
         # Remove existing repo folder if it exists
-        if Path(repo_folder).exists():
-            shutil.rmtree(repo_folder)
-            shutil.rmtree(model_folder)
+        if repo_folder.exists():
+            try:
+                shutil.rmtree(repo_folder)
+                if model_folder.exists():
+                    shutil.rmtree(model_folder)
+            except PermissionError:
+                print("[WARNING] Unable to remove existing folders. Please manually delete the following folders and try again:")
+                print(f"- {repo_folder}")
+                print(f"- {model_folder}")
+                return None
         try:
             # Clone the repository
             subprocess.run(
-                ["git", "clone", "-b", branch, repo_url, repo_folder],
+                ["git", "clone", "-b", branch, repo_url, str(repo_folder)],
                 stdout=subprocess.DEVNULL,
                 stderr=subprocess.DEVNULL,
                 check=True
             )
             # Create the models directory if it doesn't exist
-            models_folder = os.path.join(home_folder, "models")
-            os.makedirs(models_folder, exist_ok=True)
+            models_folder = Path(home_folder) / "models"
+            models_folder.mkdir(parents=True, exist_ok=True)
             # Copy the reuters model folder to the models directory
-            source_folder = os.path.join(repo_folder, "models/reuters")
+            source_folder = repo_folder / "models" / "reuters"
             shutil.copytree(source_folder, model_folder)
             # Remove the cloned repository
             shutil.rmtree(repo_folder)
-            # Print completion message
-            # print("[LOG] ✅ Spacy Model downloaded successfully")
+            print("[LOG] ✅ Spacy Model downloaded successfully")
         except subprocess.CalledProcessError as e:
             print(f"An error occurred while cloning the repository: {e}")
+            return None
         except Exception as e:
             print(f"An error occurred: {e}")
+            return None
-    return spacy.load(model_folder)
+    try:
+        return spacy.load(str(model_folder))
+    except Exception as e:
+        print(f"Error loading spacy model: {e}")
+        return None
 def download_all_models(remove_existing=False):
     """Download all models required for Crawl4AI."""


@@ -834,7 +834,6 @@ def extract_blocks_batch(batch_data, provider = "groq/llama3-70b-8192", api_toke
     return sum(all_blocks, [])
-
 def merge_chunks_based_on_token_threshold(chunks, token_threshold):
     """
     Merges small chunks into larger ones based on the total token threshold.
@@ -880,7 +879,6 @@ def process_sections(url: str, sections: list, provider: str, api_token: str) ->
     return extracted_content
-
 def wrap_text(draw, text, font, max_width):
     # Wrap the text to fit within the specified width
     lines = []
@@ -892,7 +890,6 @@ def wrap_text(draw, text, font, max_width):
         lines.append(line)
     return '\n'.join(lines)
-
 def format_html(html_string):
     soup = BeautifulSoup(html_string, 'html.parser')
     return soup.prettify()
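The diff above only shows the signature of `merge_chunks_based_on_token_threshold`; its body is not part of this change. As an illustration of what such a merger typically does (an assumption, not the library's actual implementation, with whitespace splitting standing in for real tokenization):

```python
def merge_chunks_by_token_threshold(chunks, token_threshold,
                                    count_tokens=lambda s: len(s.split())):
    """Greedily merge small chunks into larger ones, starting a new merged
    chunk whenever adding the next piece would exceed token_threshold."""
    merged, current, current_tokens = [], [], 0
    for chunk in chunks:
        n = count_tokens(chunk)
        if current and current_tokens + n > token_threshold:
            merged.append(" ".join(current))
            current, current_tokens = [], 0
        current.append(chunk)
        current_tokens += n
    if current:
        merged.append(" ".join(current))
    return merged

print(merge_chunks_by_token_threshold(["a b", "c d", "e"], 4))
# → ['a b c d', 'e']
```

An oversized single chunk still becomes its own merged chunk, since the flush only happens when `current` is non-empty.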


@@ -1,6 +1,15 @@
 # Changelog
-# Changelog
+## [v0.2.77] - 2024-08-04
+Significant improvements in text processing and performance:
+- 🚀 **Dependency reduction**: Removed dependency on spaCy model for text chunk labeling in cosine extraction strategy.
+- 🤖 **Transformer upgrade**: Implemented text sequence classification using a transformer model for labeling text chunks.
+- **Performance enhancement**: Improved model loading speed due to removal of spaCy dependency.
+- 🔧 **Future-proofing**: Laid groundwork for potential complete removal of spaCy dependency in future versions.
+These changes address issue #68 and provide a foundation for faster, more efficient text processing in Crawl4AI.
 ## [v0.2.76] - 2024-08-02


@@ -1,4 +1,4 @@
-# Crawl4AI v0.2.76
+# Crawl4AI v0.2.77
 Welcome to the official documentation for Crawl4AI! 🕷️🤖 Crawl4AI is an open-source Python library designed to simplify web crawling and extract useful information from web pages. This documentation will guide you through the features, usage, and customization of Crawl4AI.


@@ -19,13 +19,13 @@ with open("requirements.txt") as f:
     requirements = f.read().splitlines()
 # Define the requirements for different environments
-default_requirements = [req for req in requirements if not req.startswith(("torch", "transformers", "onnxruntime", "nltk", "spacy", "tokenizers", "scikit-learn", "numpy"))]
+default_requirements = [req for req in requirements if not req.startswith(("torch", "transformers", "onnxruntime", "nltk", "spacy", "tokenizers", "scikit-learn"))]
 torch_requirements = [req for req in requirements if req.startswith(("torch", "nltk", "spacy", "scikit-learn", "numpy"))]
 transformer_requirements = [req for req in requirements if req.startswith(("transformers", "tokenizers", "onnxruntime"))]
 setup(
     name="Crawl4AI",
-    version="0.2.76",
+    version="0.2.77",
     description="🔥🕷️ Crawl4AI: Open-source LLM Friendly Web Crawler & Scrapper",
     long_description=open("README.md", encoding="utf-8").read(),
     long_description_content_type="text/markdown",
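The mechanical effect of dropping `"numpy"` from the exclusion tuple on the new side of the setup.py diff can be seen with a toy requirements list (the entries below are hypothetical examples, not the project's actual requirements.txt):

```python
# Hypothetical requirements list for illustration.
requirements = ["torch>=2.0", "numpy>=1.24", "requests", "transformers", "beautifulsoup4"]

# Exclusion tuple as on the new side of the diff: "numpy" is no longer listed,
# so requirements starting with "numpy" are no longer filtered out.
heavy_prefixes = ("torch", "transformers", "onnxruntime", "nltk", "spacy",
                  "tokenizers", "scikit-learn")
default_requirements = [req for req in requirements if not req.startswith(heavy_prefixes)]

print(default_requirements)
# → ['numpy>=1.24', 'requests', 'beautifulsoup4']
```

Note that `str.startswith` accepts a tuple of prefixes, which is what makes this one-line filter work.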