Compare commits

...

3 Commits

Author SHA1 Message Date
unclecode
e5e6a34e80 ## [v0.2.77] - 2024-08-04
Significant improvements in text processing and performance:

- 🚀 **Dependency reduction**: Removed dependency on spaCy model for text chunk labeling in cosine extraction strategy.
- 🤖 **Transformer upgrade**: Implemented text sequence classification using a transformer model for labeling text chunks.
- **Performance enhancement**: Improved model loading speed due to removal of spaCy dependency.
- 🔧 **Future-proofing**: Laid groundwork for potential complete removal of spaCy dependency in future versions.

These changes address issue #68 and provide a foundation for faster, more efficient text processing in Crawl4AI.
2024-08-04 14:54:18 +08:00
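As a rough sketch of what multilabel chunk labeling with a transformer classifier involves: the model emits one logit per topic label, and a sigmoid (the `expit` imported elsewhere in this changeset) turns each logit into an independent probability; every label clearing a threshold becomes a tag. The label names and threshold below are illustrative only — the actual model per this diff is `cardiffnlp/tweet-topic-21-multi`.

```python
import math

def logits_to_tags(logits: dict, threshold: float = 0.5) -> list:
    # Multilabel head: sigmoid each logit independently and keep
    # every label whose probability clears the threshold.
    sigmoid = lambda x: 1.0 / (1.0 + math.exp(-x))
    return [label for label, z in logits.items() if sigmoid(z) >= threshold]

# Hypothetical logits for one text chunk.
chunk_logits = {"news": 2.1, "sports": -3.0, "science_&_technology": 0.4}
print(logits_to_tags(chunk_logits))  # → ['news', 'science_&_technology']
```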
unclecode
897e766728 Update README 2024-08-02 16:04:14 +08:00
unclecode
9200a6731d ## [v0.2.76] - 2024-08-02
Major improvements in functionality, performance, and cross-platform compatibility! 🚀

- 🐳 **Docker enhancements**: Significantly improved Dockerfile for easy installation on Linux, Mac, and Windows.
- 🌐 **Official Docker Hub image**: Launched our first official image on Docker Hub for streamlined deployment (unclecode/crawl4ai).
- 🔧 **Selenium upgrade**: Removed dependency on ChromeDriver, now using Selenium's built-in capabilities for better compatibility.
- 🖼️ **Image description**: Implemented ability to generate textual descriptions for extracted images from web pages.
- **Performance boost**: Various improvements to enhance overall speed and performance.
2024-08-02 16:02:42 +08:00
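Using the official Docker Hub image named in the release notes above, deployment might look like the following; the port mapping is an assumption for illustration, not taken from this diff.

```shell
# Pull the official image (name from the v0.2.76 release notes).
docker pull unclecode/crawl4ai

# Run it detached; the 8000:8000 port mapping is illustrative only.
docker run -d -p 8000:8000 unclecode/crawl4ai
```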
7 changed files with 111 additions and 41 deletions

View File

@@ -1,5 +1,33 @@
 # Changelog
+## [v0.2.77] - 2024-08-04
+Significant improvements in text processing and performance:
+- 🚀 **Dependency reduction**: Removed dependency on spaCy model for text chunk labeling in cosine extraction strategy.
+- 🤖 **Transformer upgrade**: Implemented text sequence classification using a transformer model for labeling text chunks.
+- **Performance enhancement**: Improved model loading speed due to removal of spaCy dependency.
+- 🔧 **Future-proofing**: Laid groundwork for potential complete removal of spaCy dependency in future versions.
+These changes address issue #68 and provide a foundation for faster, more efficient text processing in Crawl4AI.
+## [v0.2.76] - 2024-08-02
+Major improvements in functionality, performance, and cross-platform compatibility! 🚀
+- 🐳 **Docker enhancements**: Significantly improved Dockerfile for easy installation on Linux, Mac, and Windows.
+- 🌐 **Official Docker Hub image**: Launched our first official image on Docker Hub for streamlined deployment.
+- 🔧 **Selenium upgrade**: Removed dependency on ChromeDriver, now using Selenium's built-in capabilities for better compatibility.
+- 🖼️ **Image description**: Implemented ability to generate textual descriptions for extracted images from web pages.
+- **Performance boost**: Various improvements to enhance overall speed and performance.
+A big shoutout to our amazing community contributors:
+- [@aravindkarnam](https://github.com/aravindkarnam) for developing the textual description extraction feature.
+- [@FractalMind](https://github.com/FractalMind) for creating the first official Docker Hub image and fixing Dockerfile errors.
+- [@ketonkss4](https://github.com/ketonkss4) for identifying Selenium's new capabilities, helping us reduce dependencies.
+Your contributions are driving Crawl4AI forward! 🙌
 ## [v0.2.75] - 2024-07-19
 Minor improvements for a more maintainable codebase:

View File

@@ -1,4 +1,4 @@
-# Crawl4AI v0.2.76 🕷️🤖
+# Crawl4AI v0.2.77 🕷️🤖
 [![GitHub Stars](https://img.shields.io/github/stars/unclecode/crawl4ai?style=social)](https://github.com/unclecode/crawl4ai/stargazers)
 [![GitHub Forks](https://img.shields.io/github/forks/unclecode/crawl4ai?style=social)](https://github.com/unclecode/crawl4ai/network/members)
@@ -8,6 +8,21 @@
 Crawl4AI simplifies web crawling and data extraction, making it accessible for large language models (LLMs) and AI applications. 🆓🌐
+#### [v0.2.77] - 2024-08-02
+Major improvements in functionality, performance, and cross-platform compatibility! 🚀
+- 🐳 **Docker enhancements**:
+  - Significantly improved Dockerfile for easy installation on Linux, Mac, and Windows.
+- 🌐 **Official Docker Hub image**:
+  - Launched our first official image on Docker Hub for streamlined deployment (unclecode/crawl4ai).
+- 🔧 **Selenium upgrade**:
+  - Removed dependency on ChromeDriver, now using Selenium's built-in capabilities for better compatibility.
+- 🖼️ **Image description**:
+  - Implemented ability to generate textual descriptions for extracted images from web pages.
+- **Performance boost**:
+  - Various improvements to enhance overall speed and performance.
 ## Try it Now!
 ✨ Play around with this [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1sJPAmeLj5PMrg2VgOwMJ2ubGIcK0cJeX?usp=sharing)
@@ -35,7 +50,7 @@ Crawl4AI simplifies web crawling and data extraction, making it accessible for l
 # Crawl4AI
-## 🌟 Shoutout to Contributors of v0.2.76!
+## 🌟 Shoutout to Contributors of v0.2.77!
 A big thank you to the amazing contributors who've made this release possible:

View File

@@ -9,6 +9,7 @@ from .utils import *
 from functools import partial
 from .model_loader import *
 import math
+import numpy as np
 class ExtractionStrategy(ABC):
@@ -248,6 +249,9 @@ class CosineStrategy(ExtractionStrategy):
         self.get_embedding_method = "direct"
         self.device = get_device()
+        import torch
+        self.device = torch.device('cpu')
         self.default_batch_size = calculate_batch_size(self.device)
         if self.verbose:
@@ -260,7 +264,9 @@ class CosineStrategy(ExtractionStrategy):
         # else:
         self.tokenizer, self.model = load_bge_small_en_v1_5()
+        self.model.to(self.device)
         self.model.eval()
         self.get_embedding_method = "batch"
         self.buffer_embeddings = np.array([])
@@ -282,7 +288,7 @@ class CosineStrategy(ExtractionStrategy):
         if self.verbose:
             print(f"[LOG] Loading Multilabel Classifier for {self.device.type} device.")
-        self.nlp, self.device = load_text_multilabel_classifier()
+        self.nlp, _ = load_text_multilabel_classifier()
         # self.default_batch_size = 16 if self.device.type == 'cpu' else 64
         if self.verbose:
@@ -453,21 +459,21 @@ class CosineStrategy(ExtractionStrategy):
         if self.verbose:
             print(f"[LOG] 🚀 Assign tags using {self.device}")
-        if self.device.type in ["gpu", "cuda", "mps"]:
+        if self.device.type in ["gpu", "cuda", "mps", "cpu"]:
             labels = self.nlp([cluster['content'] for cluster in cluster_list])
             for cluster, label in zip(cluster_list, labels):
                 cluster['tags'] = label
-        elif self.device == "cpu":
-            # Process the text with the loaded model
-            texts = [cluster['content'] for cluster in cluster_list]
-            # Batch process texts
-            docs = self.nlp.pipe(texts, disable=["tagger", "parser", "ner", "lemmatizer"])
-            for doc, cluster in zip(docs, cluster_list):
-                tok_k = self.top_k
-                top_categories = sorted(doc.cats.items(), key=lambda x: x[1], reverse=True)[:tok_k]
-                cluster['tags'] = [cat for cat, _ in top_categories]
+        # elif self.device.type == "cpu":
+        #     # Process the text with the loaded model
+        #     texts = [cluster['content'] for cluster in cluster_list]
+        #     # Batch process texts
+        #     docs = self.nlp.pipe(texts, disable=["tagger", "parser", "ner", "lemmatizer"])
+        #     for doc, cluster in zip(docs, cluster_list):
+        #         tok_k = self.top_k
+        #         top_categories = sorted(doc.cats.items(), key=lambda x: x[1], reverse=True)[:tok_k]
+        #         cluster['tags'] = [cat for cat, _ in top_categories]
         # for cluster in cluster_list:
        #     doc = self.nlp(cluster['content'])
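For reference, the tag assignment in the commented-out spaCy path reduces to a top-k selection over a dict of per-chunk category scores. A minimal standalone sketch (the category names and scores are illustrative):

```python
def top_k_tags(cats: dict, top_k: int = 3) -> list:
    # Sort categories by score, highest first, and keep the top_k labels,
    # mirroring the sorted(...)[:tok_k] line in the commented-out branch.
    ranked = sorted(cats.items(), key=lambda x: x[1], reverse=True)[:top_k]
    return [cat for cat, _ in ranked]

# Hypothetical per-chunk category scores, shaped like spaCy's doc.cats.
scores = {"news": 0.91, "sports": 0.12, "tech": 0.77, "arts": 0.05}
print(top_k_tags(scores, top_k=2))  # → ['news', 'tech']
```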

View File

@@ -6,6 +6,7 @@ import tarfile
 from .model_loader import *
 import argparse
 import urllib.request
+from crawl4ai.config import MODEL_REPO_BRANCH
 __location__ = os.path.realpath(os.path.join(os.getcwd(), os.path.dirname(__file__)))
 @lru_cache()
@@ -141,14 +142,15 @@ def load_text_multilabel_classifier():
     from scipy.special import expit
     import torch
-    # Check for available device: CUDA, MPS (for Apple Silicon), or CPU
-    if torch.cuda.is_available():
-        device = torch.device("cuda")
-    elif torch.backends.mps.is_available():
-        device = torch.device("mps")
-    else:
-        return load_spacy_model(), torch.device("cpu")
+    # # Check for available device: CUDA, MPS (for Apple Silicon), or CPU
+    # if torch.cuda.is_available():
+    #     device = torch.device("cuda")
+    # elif torch.backends.mps.is_available():
+    #     device = torch.device("mps")
+    # else:
+    #     device = torch.device("cpu")
+    # # return load_spacy_model(), torch.device("cpu")
     MODEL = "cardiffnlp/tweet-topic-21-multi"
     tokenizer = AutoTokenizer.from_pretrained(MODEL, resume_download=None)
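The device-detection block commented out above follows the usual torch preference order: CUDA if available, then Apple Silicon's MPS backend, with CPU as the fallback. Factored as a pure function for illustration, with booleans standing in for `torch.cuda.is_available()` and `torch.backends.mps.is_available()`:

```python
def pick_device(cuda_available: bool, mps_available: bool) -> str:
    # Preference order mirrored from the commented-out block:
    # CUDA first, then MPS, then CPU as the fallback.
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"

print(pick_device(cuda_available=False, mps_available=True))  # → mps
```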
@@ -192,51 +194,61 @@ def load_spacy_model():
     import spacy
     name = "models/reuters"
     home_folder = get_home_folder()
-    model_folder = os.path.join(home_folder, name)
+    model_folder = Path(home_folder) / name
     # Check if the model directory already exists
-    if not (Path(model_folder).exists() and any(Path(model_folder).iterdir())):
+    if not (model_folder.exists() and any(model_folder.iterdir())):
         repo_url = "https://github.com/unclecode/crawl4ai.git"
-        # branch = "main"
         branch = MODEL_REPO_BRANCH
-        repo_folder = os.path.join(home_folder, "crawl4ai")
-        model_folder = os.path.join(home_folder, name)
-        print("[LOG] ⏬ Downloading Spacy model for the first time...")
+        repo_folder = Path(home_folder) / "crawl4ai"
+        # print("[LOG] ⏬ Downloading Spacy model for the first time...")
         # Remove existing repo folder if it exists
-        if Path(repo_folder).exists():
-            shutil.rmtree(repo_folder)
-            shutil.rmtree(model_folder)
+        if repo_folder.exists():
+            try:
+                shutil.rmtree(repo_folder)
+                if model_folder.exists():
+                    shutil.rmtree(model_folder)
+            except PermissionError:
+                print("[WARNING] Unable to remove existing folders. Please manually delete the following folders and try again:")
+                print(f"- {repo_folder}")
+                print(f"- {model_folder}")
+                return None
         try:
             # Clone the repository
             subprocess.run(
-                ["git", "clone", "-b", branch, repo_url, repo_folder],
+                ["git", "clone", "-b", branch, repo_url, str(repo_folder)],
                 stdout=subprocess.DEVNULL,
                 stderr=subprocess.DEVNULL,
                 check=True
             )
             # Create the models directory if it doesn't exist
-            models_folder = os.path.join(home_folder, "models")
-            os.makedirs(models_folder, exist_ok=True)
+            models_folder = Path(home_folder) / "models"
+            models_folder.mkdir(parents=True, exist_ok=True)
             # Copy the reuters model folder to the models directory
-            source_folder = os.path.join(repo_folder, "models/reuters")
+            source_folder = repo_folder / "models" / "reuters"
             shutil.copytree(source_folder, model_folder)
             # Remove the cloned repository
             shutil.rmtree(repo_folder)
-            # Print completion message
-            # print("[LOG] ✅ Spacy Model downloaded successfully")
+            print("[LOG] ✅ Spacy Model downloaded successfully")
         except subprocess.CalledProcessError as e:
             print(f"An error occurred while cloning the repository: {e}")
+            return None
         except Exception as e:
             print(f"An error occurred: {e}")
+            return None
-    return spacy.load(model_folder)
+    try:
+        return spacy.load(str(model_folder))
+    except Exception as e:
+        print(f"Error loading spacy model: {e}")
+        return None
 def download_all_models(remove_existing=False):
     """Download all models required for Crawl4AI."""

View File

@@ -1,6 +1,15 @@
 # Changelog
-# Changelog
+## [v0.2.77] - 2024-08-04
+Significant improvements in text processing and performance:
+- 🚀 **Dependency reduction**: Removed dependency on spaCy model for text chunk labeling in cosine extraction strategy.
+- 🤖 **Transformer upgrade**: Implemented text sequence classification using a transformer model for labeling text chunks.
+- **Performance enhancement**: Improved model loading speed due to removal of spaCy dependency.
+- 🔧 **Future-proofing**: Laid groundwork for potential complete removal of spaCy dependency in future versions.
+These changes address issue #68 and provide a foundation for faster, more efficient text processing in Crawl4AI.
 ## [v0.2.76] - 2024-08-02

View File

@@ -1,4 +1,4 @@
-# Crawl4AI v0.2.76
+# Crawl4AI v0.2.77
 Welcome to the official documentation for Crawl4AI! 🕷️🤖 Crawl4AI is an open-source Python library designed to simplify web crawling and extract useful information from web pages. This documentation will guide you through the features, usage, and customization of Crawl4AI.

View File

@@ -25,7 +25,7 @@ transformer_requirements = [req for req in requirements if req.startswith(("tran
 setup(
     name="Crawl4AI",
-    version="0.2.76",
+    version="0.2.77",
     description="🔥🕷️ Crawl4AI: Open-source LLM Friendly Web Crawler & Scrapper",
     long_description=open("README.md", encoding="utf-8").read(),
     long_description_content_type="text/markdown",