Compare commits


5 Commits

unclecode · 7afa11a02f · 2024-09-28 00:12:58 +08:00
Update .gitignore to include test_env/ and tmp/ directories

unclecode · dec3d44224 · 2024-08-19 15:37:07 +08:00
refactor: Update extraction strategy to handle schema extraction with non-empty schema

This change updates the `LLMExtractionStrategy` class to handle schema extraction only when a schema is actually supplied. Previously, schema extraction was triggered whenever `extract_type` was set to "schema", regardless of whether a schema was provided. With this update, schema extraction is performed only if `extract_type` is "schema" *and* a non-empty schema is provided. This ensures the extraction strategy behaves correctly and avoids unnecessary schema extraction when it is not needed. The `numpy` dependency was also removed from the default installation mode.

unclecode · e5e6a34e80 · 2024-08-04 14:54:18 +08:00
## [v0.2.77] - 2024-08-04
Significant improvements in text processing and performance:

- 🚀 **Dependency reduction**: Removed dependency on spaCy model for text chunk labeling in cosine extraction strategy.
- 🤖 **Transformer upgrade**: Implemented text sequence classification using a transformer model for labeling text chunks.
- **Performance enhancement**: Improved model loading speed due to removal of spaCy dependency.
- 🔧 **Future-proofing**: Laid groundwork for potential complete removal of spaCy dependency in future versions.

These changes address issue #68 and provide a foundation for faster, more efficient text processing in Crawl4AI.

unclecode · 897e766728 · 2024-08-02 16:04:14 +08:00
Update README

unclecode · 9200a6731d · 2024-08-02 16:02:42 +08:00
## [v0.2.76] - 2024-08-02
Major improvements in functionality, performance, and cross-platform compatibility! 🚀

- 🐳 **Docker enhancements**: Significantly improved Dockerfile for easy installation on Linux, Mac, and Windows.
- 🌐 **Official Docker Hub image**: Launched our first official image on Docker Hub for streamlined deployment (unclecode/crawl4ai).
- 🔧 **Selenium upgrade**: Removed dependency on ChromeDriver, now using Selenium's built-in capabilities for better compatibility.
- 🖼️ **Image description**: Implemented ability to generate textual descriptions for extracted images from web pages.
- **Performance boost**: Various improvements to enhance overall speed and performance.
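The schema guard described in commit dec3d44224 can be sketched in isolation. This is a minimal illustration, not the actual class: `choose_prompt` is a hypothetical helper, and the two prompt constants stand in for the real prompt templates named in the diff below.

```python
import json

# Stand-ins for the real prompt templates referenced in the diff.
PROMPT_EXTRACT_BLOCKS_WITH_INSTRUCTION = "blocks prompt"
PROMPT_EXTRACT_SCHEMA_WITH_INSTRUCTION = "schema prompt"

def choose_prompt(extract_type, schema, variable_values):
    """Pick the extraction prompt; only switch to schema extraction
    when a non-empty schema is actually provided."""
    prompt = PROMPT_EXTRACT_BLOCKS_WITH_INSTRUCTION
    if extract_type == "schema" and schema:  # empty dict/None falls through
        variable_values["SCHEMA"] = json.dumps(schema, indent=2)
        prompt = PROMPT_EXTRACT_SCHEMA_WITH_INSTRUCTION
    return prompt
```

With an empty schema the blocks prompt is kept, which is exactly the behavior change the commit message describes.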
9 changed files with 116 additions and 47 deletions

.gitignore

@@ -189,4 +189,6 @@ a.txt
 .lambda_function.py
 ec2*
 update_changelog.sh
+test_env/
+tmp/


@@ -1,5 +1,33 @@
 # Changelog
+## [v0.2.77] - 2024-08-04
+Significant improvements in text processing and performance:
+- 🚀 **Dependency reduction**: Removed dependency on spaCy model for text chunk labeling in cosine extraction strategy.
+- 🤖 **Transformer upgrade**: Implemented text sequence classification using a transformer model for labeling text chunks.
+- **Performance enhancement**: Improved model loading speed due to removal of spaCy dependency.
+- 🔧 **Future-proofing**: Laid groundwork for potential complete removal of spaCy dependency in future versions.
+These changes address issue #68 and provide a foundation for faster, more efficient text processing in Crawl4AI.
+## [v0.2.76] - 2024-08-02
+Major improvements in functionality, performance, and cross-platform compatibility! 🚀
+- 🐳 **Docker enhancements**: Significantly improved Dockerfile for easy installation on Linux, Mac, and Windows.
+- 🌐 **Official Docker Hub image**: Launched our first official image on Docker Hub for streamlined deployment.
+- 🔧 **Selenium upgrade**: Removed dependency on ChromeDriver, now using Selenium's built-in capabilities for better compatibility.
+- 🖼️ **Image description**: Implemented ability to generate textual descriptions for extracted images from web pages.
+- **Performance boost**: Various improvements to enhance overall speed and performance.
+A big shoutout to our amazing community contributors:
+- [@aravindkarnam](https://github.com/aravindkarnam) for developing the textual description extraction feature.
+- [@FractalMind](https://github.com/FractalMind) for creating the first official Docker Hub image and fixing Dockerfile errors.
+- [@ketonkss4](https://github.com/ketonkss4) for identifying Selenium's new capabilities, helping us reduce dependencies.
+Your contributions are driving Crawl4AI forward! 🙌
 ## [v0.2.75] - 2024-07-19
 Minor improvements for a more maintainable codebase:


@@ -1,4 +1,4 @@
-# Crawl4AI v0.2.76 🕷️🤖
+# Crawl4AI v0.2.77 🕷️🤖
 [![GitHub Stars](https://img.shields.io/github/stars/unclecode/crawl4ai?style=social)](https://github.com/unclecode/crawl4ai/stargazers)
 [![GitHub Forks](https://img.shields.io/github/forks/unclecode/crawl4ai?style=social)](https://github.com/unclecode/crawl4ai/network/members)
@@ -8,6 +8,21 @@
 Crawl4AI simplifies web crawling and data extraction, making it accessible for large language models (LLMs) and AI applications. 🆓🌐
+#### [v0.2.77] - 2024-08-02
+Major improvements in functionality, performance, and cross-platform compatibility! 🚀
+- 🐳 **Docker enhancements**:
+  - Significantly improved Dockerfile for easy installation on Linux, Mac, and Windows.
+- 🌐 **Official Docker Hub image**:
+  - Launched our first official image on Docker Hub for streamlined deployment (unclecode/crawl4ai).
+- 🔧 **Selenium upgrade**:
+  - Removed dependency on ChromeDriver, now using Selenium's built-in capabilities for better compatibility.
+- 🖼️ **Image description**:
+  - Implemented ability to generate textual descriptions for extracted images from web pages.
+- **Performance boost**:
+  - Various improvements to enhance overall speed and performance.
 ## Try it Now!
 ✨ Play around with this [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1sJPAmeLj5PMrg2VgOwMJ2ubGIcK0cJeX?usp=sharing)
@@ -35,7 +50,7 @@ Crawl4AI simplifies web crawling and data extraction, making it accessible for l
 # Crawl4AI
-## 🌟 Shoutout to Contributors of v0.2.76!
+## 🌟 Shoutout to Contributors of v0.2.77!
 A big thank you to the amazing contributors who've made this release possible:


@@ -9,6 +9,7 @@ from .utils import *
 from functools import partial
 from .model_loader import *
 import math
+import numpy as np
 class ExtractionStrategy(ABC):
@@ -100,7 +101,7 @@ class LLMExtractionStrategy(ExtractionStrategy):
         variable_values["REQUEST"] = self.instruction
         prompt_with_variables = PROMPT_EXTRACT_BLOCKS_WITH_INSTRUCTION
-        if self.extract_type == "schema":
+        if self.extract_type == "schema" and self.schema:
             variable_values["SCHEMA"] = json.dumps(self.schema, indent=2)
             prompt_with_variables = PROMPT_EXTRACT_SCHEMA_WITH_INSTRUCTION
@@ -248,6 +249,9 @@ class CosineStrategy(ExtractionStrategy):
         self.get_embedding_method = "direct"
         self.device = get_device()
+        import torch
+        self.device = torch.device('cpu')
         self.default_batch_size = calculate_batch_size(self.device)
         if self.verbose:
@@ -260,7 +264,9 @@ class CosineStrategy(ExtractionStrategy):
         # else:
         self.tokenizer, self.model = load_bge_small_en_v1_5()
+        self.model.to(self.device)
         self.model.eval()
         self.get_embedding_method = "batch"
         self.buffer_embeddings = np.array([])
@@ -282,7 +288,7 @@ class CosineStrategy(ExtractionStrategy):
         if self.verbose:
             print(f"[LOG] Loading Multilabel Classifier for {self.device.type} device.")
-        self.nlp, self.device = load_text_multilabel_classifier()
+        self.nlp, _ = load_text_multilabel_classifier()
         # self.default_batch_size = 16 if self.device.type == 'cpu' else 64
         if self.verbose:
@@ -453,21 +459,21 @@ class CosineStrategy(ExtractionStrategy):
         if self.verbose:
             print(f"[LOG] 🚀 Assign tags using {self.device}")
-        if self.device.type in ["gpu", "cuda", "mps"]:
+        if self.device.type in ["gpu", "cuda", "mps", "cpu"]:
             labels = self.nlp([cluster['content'] for cluster in cluster_list])
             for cluster, label in zip(cluster_list, labels):
                 cluster['tags'] = label
-        elif self.device == "cpu":
-            # Process the text with the loaded model
-            texts = [cluster['content'] for cluster in cluster_list]
-            # Batch process texts
-            docs = self.nlp.pipe(texts, disable=["tagger", "parser", "ner", "lemmatizer"])
-            for doc, cluster in zip(docs, cluster_list):
-                tok_k = self.top_k
-                top_categories = sorted(doc.cats.items(), key=lambda x: x[1], reverse=True)[:tok_k]
-                cluster['tags'] = [cat for cat, _ in top_categories]
+        # elif self.device.type == "cpu":
+        #     # Process the text with the loaded model
+        #     texts = [cluster['content'] for cluster in cluster_list]
+        #     # Batch process texts
+        #     docs = self.nlp.pipe(texts, disable=["tagger", "parser", "ner", "lemmatizer"])
+        #     for doc, cluster in zip(docs, cluster_list):
+        #         tok_k = self.top_k
+        #         top_categories = sorted(doc.cats.items(), key=lambda x: x[1], reverse=True)[:tok_k]
+        #         cluster['tags'] = [cat for cat, _ in top_categories]
         # for cluster in cluster_list:
         #     doc = self.nlp(cluster['content'])
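The top-k tag selection in the (now commented-out) spaCy branch above is a small, self-contained pattern. A minimal sketch, with `top_k_tags` as a hypothetical helper and the score dictionary standing in for a classifier's per-category output:

```python
def top_k_tags(category_scores, top_k=3):
    """Keep the top_k highest-scoring category labels,
    mirroring the sorted(...)[:top_k] pattern in the diff above."""
    ranked = sorted(category_scores.items(), key=lambda x: x[1], reverse=True)
    return [cat for cat, _ in ranked[:top_k]]

cluster = {"content": "GPU prices drop ahead of the playoffs"}
cluster["tags"] = top_k_tags({"tech": 0.91, "sports": 0.05, "news": 0.40}, top_k=2)
# → ["tech", "news"]
```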


@@ -6,6 +6,7 @@ import tarfile
 from .model_loader import *
 import argparse
 import urllib.request
+from crawl4ai.config import MODEL_REPO_BRANCH
 __location__ = os.path.realpath(os.path.join(os.getcwd(), os.path.dirname(__file__)))
 @lru_cache()
@@ -141,14 +142,15 @@ def load_text_multilabel_classifier():
     from scipy.special import expit
     import torch
-    # Check for available device: CUDA, MPS (for Apple Silicon), or CPU
-    if torch.cuda.is_available():
-        device = torch.device("cuda")
-    elif torch.backends.mps.is_available():
-        device = torch.device("mps")
-    else:
-        return load_spacy_model(), torch.device("cpu")
+    # # Check for available device: CUDA, MPS (for Apple Silicon), or CPU
+    # if torch.cuda.is_available():
+    #     device = torch.device("cuda")
+    # elif torch.backends.mps.is_available():
+    #     device = torch.device("mps")
+    # else:
+    #     device = torch.device("cpu")
+    # # return load_spacy_model(), torch.device("cpu")
     MODEL = "cardiffnlp/tweet-topic-21-multi"
     tokenizer = AutoTokenizer.from_pretrained(MODEL, resume_download=None)
@@ -192,51 +194,61 @@ def load_spacy_model():
     import spacy
     name = "models/reuters"
     home_folder = get_home_folder()
-    model_folder = os.path.join(home_folder, name)
+    model_folder = Path(home_folder) / name
     # Check if the model directory already exists
-    if not (Path(model_folder).exists() and any(Path(model_folder).iterdir())):
+    if not (model_folder.exists() and any(model_folder.iterdir())):
         repo_url = "https://github.com/unclecode/crawl4ai.git"
+        # branch = "main"
         branch = MODEL_REPO_BRANCH
-        repo_folder = os.path.join(home_folder, "crawl4ai")
-        model_folder = os.path.join(home_folder, name)
-        print("[LOG] ⏬ Downloading Spacy model for the first time...")
+        repo_folder = Path(home_folder) / "crawl4ai"
+        # print("[LOG] ⏬ Downloading Spacy model for the first time...")
         # Remove existing repo folder if it exists
-        if Path(repo_folder).exists():
-            shutil.rmtree(repo_folder)
-            shutil.rmtree(model_folder)
+        if repo_folder.exists():
+            try:
+                shutil.rmtree(repo_folder)
+                if model_folder.exists():
+                    shutil.rmtree(model_folder)
+            except PermissionError:
+                print("[WARNING] Unable to remove existing folders. Please manually delete the following folders and try again:")
+                print(f"- {repo_folder}")
+                print(f"- {model_folder}")
+                return None
         try:
             # Clone the repository
             subprocess.run(
-                ["git", "clone", "-b", branch, repo_url, repo_folder],
+                ["git", "clone", "-b", branch, repo_url, str(repo_folder)],
                 stdout=subprocess.DEVNULL,
                 stderr=subprocess.DEVNULL,
                 check=True
             )
             # Create the models directory if it doesn't exist
-            models_folder = os.path.join(home_folder, "models")
-            os.makedirs(models_folder, exist_ok=True)
+            models_folder = Path(home_folder) / "models"
+            models_folder.mkdir(parents=True, exist_ok=True)
             # Copy the reuters model folder to the models directory
-            source_folder = os.path.join(repo_folder, "models/reuters")
+            source_folder = repo_folder / "models" / "reuters"
             shutil.copytree(source_folder, model_folder)
             # Remove the cloned repository
             shutil.rmtree(repo_folder)
-            # Print completion message
-            # print("[LOG] ✅ Spacy Model downloaded successfully")
+            print("[LOG] ✅ Spacy Model downloaded successfully")
         except subprocess.CalledProcessError as e:
             print(f"An error occurred while cloning the repository: {e}")
+            return None
         except Exception as e:
             print(f"An error occurred: {e}")
+            return None
-    return spacy.load(model_folder)
+    try:
+        return spacy.load(str(model_folder))
+    except Exception as e:
+        print(f"Error loading spacy model: {e}")
+        return None
 def download_all_models(remove_existing=False):
     """Download all models required for Crawl4AI."""


@@ -834,7 +834,6 @@ def extract_blocks_batch(batch_data, provider = "groq/llama3-70b-8192", api_toke
     return sum(all_blocks, [])
-
 def merge_chunks_based_on_token_threshold(chunks, token_threshold):
     """
     Merges small chunks into larger ones based on the total token threshold.
@@ -880,7 +879,6 @@ def process_sections(url: str, sections: list, provider: str, api_token: str) ->
     return extracted_content
-
 def wrap_text(draw, text, font, max_width):
     # Wrap the text to fit within the specified width
     lines = []
@@ -892,7 +890,6 @@ def wrap_text(draw, text, font, max_width):
         lines.append(line)
     return '\n'.join(lines)
-
 def format_html(html_string):
     soup = BeautifulSoup(html_string, 'html.parser')
     return soup.prettify()
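The diff above only shows the signature of `merge_chunks_based_on_token_threshold`; its body is not part of this change. As an illustration of what such a merger typically does (an assumption, not the library's actual implementation, with whitespace splitting standing in for real tokenization):

```python
def merge_chunks_by_token_threshold(chunks, token_threshold,
                                    count_tokens=lambda s: len(s.split())):
    """Greedily merge small chunks into larger ones, starting a new merged
    chunk whenever adding the next piece would exceed token_threshold."""
    merged, current, current_tokens = [], [], 0
    for chunk in chunks:
        n = count_tokens(chunk)
        if current and current_tokens + n > token_threshold:
            merged.append(" ".join(current))
            current, current_tokens = [], 0
        current.append(chunk)
        current_tokens += n
    if current:
        merged.append(" ".join(current))
    return merged

print(merge_chunks_by_token_threshold(["a b", "c d", "e"], 4))
# → ['a b c d', 'e']
```

An oversized single chunk still becomes its own merged chunk, since the flush only happens when `current` is non-empty.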


@@ -1,6 +1,15 @@
 # Changelog
-# Changelog
+## [v0.2.77] - 2024-08-04
+Significant improvements in text processing and performance:
+- 🚀 **Dependency reduction**: Removed dependency on spaCy model for text chunk labeling in cosine extraction strategy.
+- 🤖 **Transformer upgrade**: Implemented text sequence classification using a transformer model for labeling text chunks.
+- **Performance enhancement**: Improved model loading speed due to removal of spaCy dependency.
+- 🔧 **Future-proofing**: Laid groundwork for potential complete removal of spaCy dependency in future versions.
+These changes address issue #68 and provide a foundation for faster, more efficient text processing in Crawl4AI.
 ## [v0.2.76] - 2024-08-02


@@ -1,4 +1,4 @@
-# Crawl4AI v0.2.76
+# Crawl4AI v0.2.77
 Welcome to the official documentation for Crawl4AI! 🕷️🤖 Crawl4AI is an open-source Python library designed to simplify web crawling and extract useful information from web pages. This documentation will guide you through the features, usage, and customization of Crawl4AI.


@@ -19,13 +19,13 @@ with open("requirements.txt") as f:
     requirements = f.read().splitlines()
 # Define the requirements for different environments
-default_requirements = [req for req in requirements if not req.startswith(("torch", "transformers", "onnxruntime", "nltk", "spacy", "tokenizers", "scikit-learn", "numpy"))]
+default_requirements = [req for req in requirements if not req.startswith(("torch", "transformers", "onnxruntime", "nltk", "spacy", "tokenizers", "scikit-learn"))]
 torch_requirements = [req for req in requirements if req.startswith(("torch", "nltk", "spacy", "scikit-learn", "numpy"))]
 transformer_requirements = [req for req in requirements if req.startswith(("transformers", "tokenizers", "onnxruntime"))]
 setup(
     name="Crawl4AI",
-    version="0.2.76",
+    version="0.2.77",
     description="🔥🕷️ Crawl4AI: Open-source LLM Friendly Web Crawler & Scrapper",
     long_description=open("README.md", encoding="utf-8").read(),
     long_description_content_type="text/markdown",
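The mechanical effect of dropping `"numpy"` from the exclusion tuple on the new side of the setup.py diff can be seen with a toy requirements list (the entries below are hypothetical examples, not the project's actual requirements.txt):

```python
# Hypothetical requirements list for illustration.
requirements = ["torch>=2.0", "numpy>=1.24", "requests", "transformers", "beautifulsoup4"]

# Exclusion tuple as on the new side of the diff: "numpy" is no longer listed,
# so requirements starting with "numpy" are no longer filtered out.
heavy_prefixes = ("torch", "transformers", "onnxruntime", "nltk", "spacy",
                  "tokenizers", "scikit-learn")
default_requirements = [req for req in requirements if not req.startswith(heavy_prefixes)]

print(default_requirements)
# → ['numpy>=1.24', 'requests', 'beautifulsoup4']
```

Note that `str.startswith` accepts a tuple of prefixes, which is what makes this one-line filter work.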