Compare commits

...

16 Commits

Author SHA1 Message Date
unclecode
6f96dcd649 chore: Update README 2024-05-17 18:12:50 +08:00
unclecode
957a2458b1 chore: Update web crawler URLs to use NBC News business section 2024-05-17 18:11:13 +08:00
unclecode
36e46be23d chore: Add verbose option to ExtractionStrategy classes
This commit adds a new `verbose` option to the `ExtractionStrategy` classes. The `verbose` option allows for logging of extraction details, such as the number of extracted blocks and the URL being processed. This improves the debugging and monitoring capabilities of the code.
2024-05-17 18:06:10 +08:00
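The opt-in logging pattern this commit describes can be sketched as follows. This is a standalone illustration only, not the library's full `ExtractionStrategy` API; the class body and log format are assumptions based on the commit message:

```python
# Minimal sketch of the opt-in verbose logging pattern from this commit.
# Standalone illustration; not the library's actual ExtractionStrategy class.
class ExtractionStrategy:
    def __init__(self, **kwargs):
        # Logging stays off unless the caller explicitly opts in
        self.verbose = kwargs.get("verbose", False)

    def extract(self, url, blocks):
        # Log extraction details only when verbose mode is enabled
        if self.verbose:
            print(f"[LOG] Extracted {len(blocks)} blocks from URL: {url}")
        return blocks

quiet = ExtractionStrategy()
chatty = ExtractionStrategy(verbose=True)
result = chatty.extract("https://example.com", ["block-1", "block-2"])
```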
unclecode
32c87f0388 chore: Update NlpSentenceChunking constructor parameters to None
The NlpSentenceChunking constructor parameters have been updated to None in order to simplify the usage of the class. This change removes the need for specifying the SpaCy model for sentence detection, making the code more concise and easier to understand.
2024-05-17 17:00:43 +08:00
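A minimal sketch of the simplified, argument-free constructor. The real class (shown in the diff below) delegates sentence splitting to NLTK's `sent_tokenize`; the regex splitter here is a dependency-free stand-in for illustration, and unlike the real implementation it preserves sentence order:

```python
import re

# Sketch of the simplified NlpSentenceChunking: no model parameter.
# The regex splitter is an illustrative stand-in for NLTK's sent_tokenize.
class NlpSentenceChunking:
    def __init__(self):
        # No SpaCy model argument any more; nothing to configure
        pass

    def chunk(self, text: str) -> list:
        # Split after sentence-ending punctuation followed by whitespace
        sentences = re.split(r'(?<=[.!?])\s+', text)
        return [s.strip() for s in sentences if s.strip()]

chunker = NlpSentenceChunking()  # no arguments needed
chunks = chunker.chunk("This is a sample text. It will be split into sentences.")
```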
unclecode
647cfda225 chore: Update Crawl4AI quickstart script in README.md
This commit updates the Crawl4AI quickstart script in the README.md file. The script is now properly formatted and aligned, making it easier to read and understand. The unnecessary indentation has been removed, and the script is now more concise and efficient.
2024-05-17 16:55:34 +08:00
unclecode
1cc67df301 chore: Update pip installation command and requirements, add new dependencies 2024-05-17 16:53:03 +08:00
unclecode
d7b37e849d chore: Update CrawlRequest model to use NoExtractionStrategy as default 2024-05-17 16:50:38 +08:00
unclecode
f52f526002 chore: Update web_crawler.py to use NoExtractionStrategy as default 2024-05-17 16:03:35 +08:00
unclecode
3593f017d7 chore: Update setup.py to exclude torch, transformers, and nltk dependencies
This commit updates the setup.py file to exclude the torch, transformers, and nltk dependencies from the install_requires section. Instead, it creates separate extras_require sections for different environments, including all requirements, excluding torch for Colab, and excluding torch, transformers, and nltk for the crawl environment.
2024-05-17 16:01:04 +08:00
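The dependency split described in this commit might look like the sketch below. The package lists are hypothetical, not copied from the repo's setup.py; only the three-way split (all / no torch for Colab / lean crawl environment) comes from the commit message:

```python
# Hypothetical sketch of the extras_require split described above; the
# actual package lists in setup.py are assumptions, not the real ones.
base_requires = ["requests", "beautifulsoup4"]       # always installed
heavy_requires = ["torch", "transformers", "nltk"]   # heavyweight ML extras

# This mapping would be passed as extras_require= to setuptools.setup()
extras_require = {
    "all": base_requires + heavy_requires,             # everything
    "colab": base_requires + ["transformers", "nltk"], # exclude torch for Colab
    "crawl": base_requires,                            # lean crawl-only env
}
```

Users would then pick an environment at install time, e.g. `pip install "crawl4ai[all]"`.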
unclecode
e7bb76f19b chore: Update torch dependency to version 2.3.0 2024-05-17 15:52:39 +08:00
unclecode
593b928967 Update requirements.txt to include latest versions of dependencies 2024-05-17 15:48:14 +08:00
unclecode
bb3d37face chore: Update requirements.txt to include latest versions of dependencies 2024-05-17 15:32:37 +08:00
unclecode
3f8576f870 chore: Update model_loader.py to use pretrained models without resume_download 2024-05-17 15:26:15 +08:00
unclecode
bf3b040f10 chore: Update pip installation command and requirements, add new dependencies 2024-05-17 15:21:45 +08:00
unclecode
a317dc5e1d Load CosineStrategy in the function 2024-05-17 15:13:06 +08:00
unclecode
a5f9d07dbf Remove dependency on Spacy model. 2024-05-17 15:08:03 +08:00
26 changed files with 215 additions and 84065 deletions

View File

@@ -22,32 +22,26 @@ Crawl4AI has one clear task: to simplify crawling and extract useful information
## Power and Simplicity of Crawl4AI 🚀
Crawl4AI makes even complex web crawling tasks simple and intuitive.
To show the simplicity, take a look at the first example:
**Example Task:**
```python
from crawl4ai import WebCrawler
# Create the WebCrawler instance
crawler = WebCrawler()
# Run the crawler with keyword filtering and CSS selector
result = crawler.run(url="https://www.nbcnews.com/business")
print(result) # {url, html, markdown, extracted_content, metadata}
```
Now let's try a complex task. Below is an example of how you can execute JavaScript, filter data using keywords, and use a CSS selector to extract specific content—all in one go!
1. Instantiate a WebCrawler object.
2. Execute custom JavaScript to click a "Load More" button.
3. Filter the data to include only content related to "technology".
3. Extract semantic chunks of content and filter the data to include only content related to technology.
4. Use a CSS selector to extract only paragraphs (`<p>` tags).
**Example Code:**
First, simply install the package:
```bash
virtualenv venv
source venv/bin/activate
# Install Crawl4AI
pip install git+https://github.com/unclecode/crawl4ai.git
```
Run the following command to load the required models. This is optional, but it will boost the performance and speed of the crawler. You need to do this only once.
```bash
crawl4ai-download-models
```
Now, you can run the following code:
```python
# Import necessary modules
from crawl4ai import WebCrawler
@@ -69,7 +63,7 @@ crawler = WebCrawler(crawler_strategy=crawler_strategy)
# Run the crawler with keyword filtering and CSS selector
result = crawler.run(
url="https://www.example.com",
url="https://www.nbcnews.com/business",
extraction_strategy=CosineStrategy(
semantic_filter="technology",
),
@@ -77,7 +71,7 @@ result = crawler.run(
# Run the crawler with LLM extraction strategy
result = crawler.run(
url="https://www.example.com",
url="https://www.nbcnews.com/business",
extraction_strategy=LLMExtractionStrategy(
provider="openai/gpt-4o",
api_token=os.getenv('OPENAI_API_KEY'),
@@ -99,16 +93,16 @@ With Crawl4AI, you can perform advanced web crawling and data extraction tasks w
## Table of Contents
1. [Features](#features)
2. [Installation](#installation)
3. [REST API/Local Server](#using-the-local-server-ot-rest-api)
4. [Python Library Usage](#usage)
5. [Parameters](#parameters)
6. [Chunking Strategies](#chunking-strategies)
7. [Extraction Strategies](#extraction-strategies)
8. [Contributing](#contributing)
9. [License](#license)
10. [Contact](#contact)
1. [Features](#features-)
2. [Installation](#installation-)
3. [REST API/Local Server](#using-the-local-server-ot-rest-api-)
4. [Python Library Usage](#python-library-usage-)
5. [Parameters](#parameters-)
6. [Chunking Strategies](#chunking-strategies-)
7. [Extraction Strategies](#extraction-strategies-)
8. [Contributing](#contributing-)
9. [License](#license-)
10. [Contact](#contact-)
## Features ✨
@@ -137,7 +131,7 @@ To install Crawl4AI as a library, follow these steps:
```bash
virtualenv venv
source venv/bin/activate
pip install git+https://github.com/unclecode/crawl4ai.git
pip install "crawl4ai[all] @ git+https://github.com/unclecode/crawl4ai.git"
```
💡 It's better to run the following CLI command to load the required models. This is optional, but it will boost the performance and speed of the crawler. You only need to do this once.
@@ -150,12 +144,12 @@ virtualenv venv
source venv/bin/activate
git clone https://github.com/unclecode/crawl4ai.git
cd crawl4ai
pip install -e .
pip install -e .[all]
```
3. Use docker to run the local server:
```bash
docker build -t crawl4ai .
docker build -t crawl4ai .
# For Mac users
# docker build --platform linux/amd64 -t crawl4ai .
docker run -d -p 8000:80 crawl4ai
@@ -174,7 +168,7 @@ To use the REST API, send a POST request to `https://crawl4ai.com/crawl` with th
**Example Request:**
```json
{
"urls": ["https://www.example.com"],
"urls": ["https://www.nbcnews.com/business"],
"include_raw_html": false,
"bypass_cache": true,
"word_count_threshold": 5,
@@ -201,7 +195,7 @@ To use the REST API, send a POST request to `https://crawl4ai.com/crawl` with th
"status": "success",
"data": [
{
"url": "https://www.example.com",
"url": "https://www.nbcnews.com/business",
"extracted_content": "...",
"html": "...",
"markdown": "...",
@@ -216,7 +210,7 @@ For more information about the available parameters and their descriptions, refe
## Python Library Usage 🚀
🔥 A great way to try out Crawl4AI is to run `quickstart.py` in the `docs/examples` directory. This script demonstrates how to use Crawl4AI to crawl a website and extract content from it.
🔥 A great way to try out Crawl4AI is to run `quickstart.py` in the `docs/examples` directory. This script demonstrates how to use Crawl4AI to crawl a website and extract content from it.
### Quickstart Guide
@@ -264,6 +258,8 @@ result = crawler.run(
### Extraction strategy: CosineStrategy
So far, the extracted content is just the result of chunking. To extract meaningful content, you can use extraction strategies. These strategies cluster consecutive chunks into meaningful blocks, keeping the same order as the text in the HTML. This approach is perfect for use in RAG applications and semantic search queries.
Using CosineStrategy:
```python
result = crawler.run(
@@ -349,7 +345,7 @@ result = crawler.run(url="https://www.nbcnews.com/business")
| `include_raw_html` | Whether to include the raw HTML content in the response. | No | `false` |
| `bypass_cache` | Whether to force a fresh crawl even if the URL has been previously crawled. | No | `false` |
| `word_count_threshold`| The minimum number of words a block must contain to be considered meaningful (minimum value is 5). | No | `5` |
| `extraction_strategy` | The strategy to use for extracting content from the HTML (e.g., "CosineStrategy"). | No | `CosineStrategy` |
| `extraction_strategy` | The strategy to use for extracting content from the HTML (e.g., "CosineStrategy"). | No | `NoExtractionStrategy` |
| `chunking_strategy` | The strategy to use for chunking the text before processing (e.g., "RegexChunking"). | No | `RegexChunking` |
| `css_selector` | The CSS selector to target specific parts of the HTML for extraction. | No | `None` |
| `verbose` | Whether to enable verbose logging. | No | `true` |
@@ -374,11 +370,11 @@ chunks = chunker.chunk("This is a sample text. It will be split into chunks.")
`NlpSentenceChunking` uses a natural language processing model to chunk a given text into sentences. This approach leverages SpaCy to accurately split text based on sentence boundaries.
**Constructor Parameters:**
- `model` (str, optional): The SpaCy model to use for sentence detection. Default is `'en_core_web_sm'`.
- None.
**Example usage:**
```python
chunker = NlpSentenceChunking(model='en_core_web_sm')
chunker = NlpSentenceChunking()
chunks = chunker.chunk("This is a sample text. It will be split into sentences.")
```
@@ -466,7 +462,7 @@ extracted_content = extractor.extract(url, html)
**Example usage:**
```python
extractor = CosineStrategy(semantic_filter='artificial intelligence', word_count_threshold=10, max_dist=0.2, linkage_method='ward', top_k=3, model_name='BAAI/bge-small-en-v1.5')
extractor = CosineStrategy(semantic_filter='finance rental prices', word_count_threshold=10, max_dist=0.2, linkage_method='ward', top_k=3, model_name='BAAI/bge-small-en-v1.5')
extracted_content = extractor.extract(url, html)
```

View File

@@ -1,12 +1,8 @@
from abc import ABC, abstractmethod
import re
# spacy = lazy_import.lazy_module('spacy')
# nl = lazy_import.lazy_module('nltk')
# from nltk.corpus import stopwords
# from nltk.tokenize import word_tokenize, TextTilingTokenizer
from collections import Counter
import string
from .model_loader import load_spacy_en_core_web_sm
from .model_loader import load_nltk_punkt
# Define the abstract base class for chunking strategies
class ChunkingStrategy(ABC):
@@ -34,15 +30,24 @@ class RegexChunking(ChunkingStrategy):
paragraphs = new_paragraphs
return paragraphs
# NLP-based sentence chunking using spaCy
# NLP-based sentence chunking
class NlpSentenceChunking(ChunkingStrategy):
def __init__(self, model='en_core_web_sm'):
self.nlp = load_spacy_en_core_web_sm()
def __init__(self):
load_nltk_punkt()
def chunk(self, text: str) -> list:
doc = self.nlp(text)
return [sent.text.strip() for sent in doc.sents]
# Improved regex for sentence splitting
# sentence_endings = re.compile(
# r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<![A-Z][A-Z]\.)(?<![A-Za-z]\.)(?<=\.|\?|\!|\n)\s'
# )
# sentences = sentence_endings.split(text)
# sens = [sent.strip() for sent in sentences if sent]
from nltk.tokenize import sent_tokenize
sentences = sent_tokenize(text)
sens = [sent.strip() for sent in sentences]
return list(set(sens))
# Topic-based segmentation using TextTiling
class TopicSegmentationChunking(ChunkingStrategy):

View File

@@ -7,7 +7,7 @@ from .prompts import PROMPT_EXTRACT_BLOCKS, PROMPT_EXTRACT_BLOCKS_WITH_INSTRUCTI
from .config import *
from .utils import *
from functools import partial
from .model_loader import load_bert_base_uncased, load_bge_small_en_v1_5, load_spacy_model
from .model_loader import *
import numpy as np
@@ -19,6 +19,7 @@ class ExtractionStrategy(ABC):
def __init__(self, **kwargs):
self.DEL = "<|DEL|>"
self.name = self.__class__.__name__
self.verbose = kwargs.get("verbose", False)
@abstractmethod
def extract(self, url: str, html: str, *q, **kwargs) -> List[Dict[str, Any]]:
@@ -45,14 +46,13 @@ class ExtractionStrategy(ABC):
for future in as_completed(futures):
extracted_content.extend(future.result())
return extracted_content
class NoExtractionStrategy(ExtractionStrategy):
def extract(self, url: str, html: str, *q, **kwargs) -> List[Dict[str, Any]]:
return [{"index": 0, "content": html}]
def run(self, url: str, sections: List[str], *q, **kwargs) -> List[Dict[str, Any]]:
return [{"index": i, "tags": [], "content": section} for i, section in enumerate(sections)]
class LLMExtractionStrategy(ExtractionStrategy):
def __init__(self, provider: str = DEFAULT_PROVIDER, api_token: Optional[str] = None, instruction:str = None, **kwargs):
"""
@@ -62,10 +62,11 @@ class LLMExtractionStrategy(ExtractionStrategy):
:param api_token: The API token for the provider.
:param instruction: The instruction to use for the LLM model.
"""
super().__init__()
super().__init__()
self.provider = provider
self.api_token = api_token or PROVIDER_MODELS.get(provider, None) or os.getenv("OPENAI_API_KEY")
self.instruction = instruction
self.verbose = kwargs.get("verbose", False)
if not self.api_token:
raise ValueError("API token must be provided for LLMExtractionStrategy. Update the config.py or set OPENAI_API_KEY environment variable.")
@@ -106,7 +107,8 @@ class LLMExtractionStrategy(ExtractionStrategy):
"content": unparsed
})
print("[LOG] Extracted", len(blocks), "blocks from URL:", url, "block index:", ix)
if self.verbose:
print("[LOG] Extracted", len(blocks), "blocks from URL:", url, "block index:", ix)
return blocks
def _merge(self, documents):
@@ -166,16 +168,13 @@ class CosineStrategy(ExtractionStrategy):
"""
super().__init__()
from transformers import BertTokenizer, BertModel, pipeline
from transformers import AutoTokenizer, AutoModel
import spacy
self.semantic_filter = semantic_filter
self.word_count_threshold = word_count_threshold
self.max_dist = max_dist
self.linkage_method = linkage_method
self.top_k = top_k
self.timer = time.time()
self.verbose = kwargs.get("verbose", False)
self.buffer_embeddings = np.array([])
@@ -184,9 +183,10 @@ class CosineStrategy(ExtractionStrategy):
elif model_name == "BAAI/bge-small-en-v1.5":
self.tokenizer, self.model = load_bge_small_en_v1_5()
self.nlp = load_spacy_model()
print(f"[LOG] Model loaded {model_name}, models/reuters, took " + str(time.time() - self.timer) + " seconds")
self.nlp = load_text_multilabel_classifier()
if self.verbose:
print(f"[LOG] Model loaded {model_name}, models/reuters, took " + str(time.time() - self.timer) + " seconds")
def filter_documents_embeddings(self, documents: List[str], semantic_filter: str, threshold: float = 0.5) -> List[str]:
"""
@@ -310,13 +310,19 @@ class CosineStrategy(ExtractionStrategy):
# Convert filtered clusters to a sorted list of dictionaries
cluster_list = [{"index": int(idx), "tags" : [], "content": " ".join(filtered_clusters[idx])} for idx in sorted(filtered_clusters)]
labels = self.nlp([cluster['content'] for cluster in cluster_list])
for cluster, label in zip(cluster_list, labels):
cluster['tags'] = label
# Process the text with the loaded model
for cluster in cluster_list:
doc = self.nlp(cluster['content'])
tok_k = self.top_k
top_categories = sorted(doc.cats.items(), key=lambda x: x[1], reverse=True)[:tok_k]
cluster['tags'] = [cat for cat, _ in top_categories]
# for cluster in cluster_list:
# cluster['tags'] = self.nlp(cluster['content'])[0]['label']
# doc = self.nlp(cluster['content'])
# tok_k = self.top_k
# top_categories = sorted(doc.cats.items(), key=lambda x: x[1], reverse=True)[:tok_k]
# cluster['tags'] = [cat for cat, _ in top_categories]
# print(f"[LOG] 🚀 Categorization done in {time.time() - t:.2f} seconds")

View File

@@ -28,68 +28,66 @@ def load_bge_small_en_v1_5():
return tokenizer, model
@lru_cache()
def load_spacy_en_core_web_sm():
import spacy
try:
print("[LOG] Loading spaCy model")
nlp = spacy.load("en_core_web_sm")
except IOError:
print("[LOG] ⏬ Downloading spaCy model for the first time")
spacy.cli.download("en_core_web_sm")
nlp = spacy.load("en_core_web_sm")
print("[LOG] ✅ spaCy model loaded successfully")
return nlp
def load_text_classifier():
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from transformers import pipeline
tokenizer = AutoTokenizer.from_pretrained("dstefa/roberta-base_topic_classification_nyt_news")
model = AutoModelForSequenceClassification.from_pretrained("dstefa/roberta-base_topic_classification_nyt_news")
pipe = pipeline("text-classification", model=model, tokenizer=tokenizer)
return pipe
@lru_cache()
def load_spacy_model():
import spacy
name = "models/reuters"
home_folder = get_home_folder()
model_folder = os.path.join(home_folder, name)
# Check if the model directory already exists
if not (Path(model_folder).exists() and any(Path(model_folder).iterdir())):
repo_url = "https://github.com/unclecode/crawl4ai.git"
# branch = "main"
branch = MODEL_REPO_BRANCH
repo_folder = os.path.join(home_folder, "crawl4ai")
model_folder = os.path.join(home_folder, name)
def load_text_multilabel_classifier():
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import numpy as np
from scipy.special import expit
import torch
print("[LOG] ⏬ Downloading model for the first time...")
MODEL = "cardiffnlp/tweet-topic-21-multi"
tokenizer = AutoTokenizer.from_pretrained(MODEL, resume_download=None)
model = AutoModelForSequenceClassification.from_pretrained(MODEL, resume_download=None)
class_mapping = model.config.id2label
# Remove existing repo folder if it exists
if Path(repo_folder).exists():
shutil.rmtree(repo_folder)
shutil.rmtree(model_folder)
# Check for available device: CUDA, MPS (for Apple Silicon), or CPU
if torch.cuda.is_available():
device = torch.device("cuda")
elif torch.backends.mps.is_available():
device = torch.device("mps")
else:
device = torch.device("cpu")
try:
# Clone the repository
subprocess.run(
["git", "clone", "-b", branch, repo_url, repo_folder],
stdout=subprocess.DEVNULL,
stderr=subprocess.DEVNULL,
check=True
)
model.to(device)
# Create the models directory if it doesn't exist
models_folder = os.path.join(home_folder, "models")
os.makedirs(models_folder, exist_ok=True)
def _classifier(texts, threshold=0.5, max_length=64):
tokens = tokenizer(texts, return_tensors='pt', padding=True, truncation=True, max_length=max_length)
tokens = {key: val.to(device) for key, val in tokens.items()} # Move tokens to the selected device
# Copy the reuters model folder to the models directory
source_folder = os.path.join(repo_folder, "models/reuters")
shutil.copytree(source_folder, model_folder)
with torch.no_grad():
output = model(**tokens)
# Remove the cloned repository
shutil.rmtree(repo_folder)
scores = output.logits.detach().cpu().numpy()
scores = expit(scores)
predictions = (scores >= threshold) * 1
# Print completion message
print("[LOG] ✅ Model downloaded successfully")
except subprocess.CalledProcessError as e:
print(f"An error occurred while cloning the repository: {e}")
except Exception as e:
print(f"An error occurred: {e}")
batch_labels = []
for prediction in predictions:
labels = [class_mapping[i] for i, value in enumerate(prediction) if value == 1]
batch_labels.append(labels)
return spacy.load(model_folder)
return batch_labels
return _classifier
@lru_cache()
def load_nltk_punkt():
import nltk
try:
nltk.data.find('tokenizers/punkt')
except LookupError:
nltk.download('punkt')
return nltk.data.find('tokenizers/punkt')
def download_all_models(remove_existing=False):
"""Download all models required for Crawl4AI."""
@@ -110,10 +108,10 @@ def download_all_models(remove_existing=False):
load_bert_base_uncased()
print("[LOG] Downloading BGE Small EN v1.5...")
load_bge_small_en_v1_5()
print("[LOG] Downloading spaCy EN Core Web SM...")
load_spacy_en_core_web_sm()
print("[LOG] Downloading custom spaCy model...")
load_spacy_model()
print("[LOG] Downloading text classifier...")
load_text_multilabel_classifier()
print("[LOG] Downloading custom NLTK Punkt model...")
load_nltk_punkt()
print("[LOG] ✅ All models downloaded successfully.")
def main():

View File

@@ -3,6 +3,33 @@ from spacy.training import Example
import random
import nltk
from nltk.corpus import reuters
import torch
def save_spacy_model_as_torch(nlp, model_dir="models/reuters"):
# Extract the TextCategorizer component
textcat = nlp.get_pipe("textcat_multilabel")
# Convert the weights to a PyTorch state dictionary
state_dict = {name: torch.tensor(param.data) for name, param in textcat.model.named_parameters()}
# Save the state dictionary
torch.save(state_dict, f"{model_dir}/model_weights.pth")
# Extract and save the vocabulary
vocab = extract_vocab(nlp)
with open(f"{model_dir}/vocab.txt", "w") as vocab_file:
for word, idx in vocab.items():
vocab_file.write(f"{word}\t{idx}\n")
print(f"Model weights and vocabulary saved to: {model_dir}")
def extract_vocab(nlp):
# Extract vocabulary from the SpaCy model
vocab = {word: i for i, word in enumerate(nlp.vocab.strings)}
return vocab
nlp = spacy.load("models/reuters")
save_spacy_model_as_torch(nlp, model_dir="models")
def train_and_save_reuters_model(model_dir="models/reuters"):
# Ensure the Reuters corpus is downloaded
@@ -96,8 +123,6 @@ def train_model(model_dir, additional_epochs=0):
nlp.to_disk(model_dir)
print(f"Model saved to: {model_dir}")
def load_model_and_predict(model_dir, text, tok_k = 3):
# Load the trained model from the specified directory
nlp = spacy.load(model_dir)
@@ -111,7 +136,6 @@ def load_model_and_predict(model_dir, text, tok_k = 3):
return top_categories
if __name__ == "__main__":
train_and_save_reuters_model()
train_model("models/reuters", additional_epochs=5)
@@ -119,4 +143,4 @@ if __name__ == "__main__":
print(reuters.categories())
example_text = "Apple Inc. is reportedly buying a startup for $1 billion"
r =load_model_and_predict(model_directory, example_text)
print(r)
print(r)

View File

@@ -11,7 +11,6 @@ from .crawler_strategy import *
from typing import List
from concurrent.futures import ThreadPoolExecutor
from .config import *
# from .model_loader import load_bert_base_uncased, load_bge_small_en_v1_5, load_spacy_model
class WebCrawler:
@@ -40,14 +39,11 @@ class WebCrawler:
self.ready = False
def warmup(self):
print("[LOG] 🌤️ Warming up the WebCrawler")
result = self.run(
url='https://crawl4ai.uccode.io/',
word_count_threshold=5,
extraction_strategy= CosineStrategy(),
extraction_strategy= NoExtractionStrategy(),
bypass_cache=False,
verbose = False
)
@@ -63,14 +59,14 @@ class WebCrawler:
extract_blocks_flag: bool = True,
word_count_threshold=MIN_WORD_THRESHOLD,
use_cached_html: bool = False,
extraction_strategy: ExtractionStrategy = CosineStrategy(),
extraction_strategy: ExtractionStrategy = None,
chunking_strategy: ChunkingStrategy = RegexChunking(),
**kwargs,
) -> CrawlResult:
return self.run(
url_model.url,
word_count_threshold,
extraction_strategy,
extraction_strategy or NoExtractionStrategy(),
chunking_strategy,
bypass_cache=url_model.forced,
**kwargs,
@@ -82,13 +78,15 @@ class WebCrawler:
self,
url: str,
word_count_threshold=MIN_WORD_THRESHOLD,
extraction_strategy: ExtractionStrategy = CosineStrategy(),
extraction_strategy: ExtractionStrategy = None,
chunking_strategy: ChunkingStrategy = RegexChunking(),
bypass_cache: bool = False,
css_selector: str = None,
verbose=True,
**kwargs,
) -> CrawlResult:
extraction_strategy = extraction_strategy or NoExtractionStrategy()
extraction_strategy.verbose = verbose
# Check if extraction strategy is an instance of ExtractionStrategy if not raise an error
if not isinstance(extraction_strategy, ExtractionStrategy):
raise ValueError("Unsupported extraction strategy")
@@ -184,11 +182,11 @@ class WebCrawler:
extract_blocks_flag: bool = True,
word_count_threshold=MIN_WORD_THRESHOLD,
use_cached_html: bool = False,
extraction_strategy: ExtractionStrategy = CosineStrategy(),
extraction_strategy: ExtractionStrategy = None,
chunking_strategy: ChunkingStrategy = RegexChunking(),
**kwargs,
) -> List[CrawlResult]:
extraction_strategy = extraction_strategy or NoExtractionStrategy()
def fetch_page_wrapper(url_model, *args, **kwargs):
return self.fetch_page(url_model, *args, **kwargs)

View File

@@ -1,7 +1,7 @@
{
"RegexChunking": "### RegexChunking\n\n`RegexChunking` is a text chunking strategy that splits a given text into smaller parts using regular expressions.\nThis is useful for preparing large texts for processing by language models, ensuring they are divided into manageable segments.\n\n#### Constructor Parameters:\n- `patterns` (list, optional): A list of regular expression patterns used to split the text. Default is to split by double newlines (`['\\n\\n']`).\n\n#### Example usage:\n```python\nchunker = RegexChunking(patterns=[r'\\n\\n', r'\\. '])\nchunks = chunker.chunk(\"This is a sample text. It will be split into chunks.\")\n```",
"NlpSentenceChunking": "### NlpSentenceChunking\n\n`NlpSentenceChunking` uses a natural language processing model to chunk a given text into sentences. This approach leverages SpaCy to accurately split text based on sentence boundaries.\n\n#### Constructor Parameters:\n- `model` (str, optional): The SpaCy model to use for sentence detection. Default is `'en_core_web_sm'`.\n\n#### Example usage:\n```python\nchunker = NlpSentenceChunking(model='en_core_web_sm')\nchunks = chunker.chunk(\"This is a sample text. It will be split into sentences.\")\n```",
"NlpSentenceChunking": "### NlpSentenceChunking\n\n`NlpSentenceChunking` uses a natural language processing model to chunk a given text into sentences. This approach leverages SpaCy to accurately split text based on sentence boundaries.\n\n#### Constructor Parameters:\n- None.\n\n#### Example usage:\n```python\nchunker = NlpSentenceChunking()\nchunks = chunker.chunk(\"This is a sample text. It will be split into sentences.\")\n```",
"TopicSegmentationChunking": "### TopicSegmentationChunking\n\n`TopicSegmentationChunking` uses the TextTiling algorithm to segment a given text into topic-based chunks. This method identifies thematic boundaries in the text.\n\n#### Constructor Parameters:\n- `num_keywords` (int, optional): The number of keywords to extract for each topic segment. Default is `3`.\n\n#### Example usage:\n```python\nchunker = TopicSegmentationChunking(num_keywords=3)\nchunks = chunker.chunk(\"This is a sample text. It will be split into topic-based segments.\")\n```",

View File

@@ -59,12 +59,6 @@ def understanding_parameters(crawler):
cprint(f"[LOG] 📦 [bold yellow]Second crawl took {end_time - start_time} seconds and result (forced to crawl):[/bold yellow]")
print_result(result)
# Retrieve raw HTML content
cprint("\n🔄 [bold cyan]'include_raw_html' parameter example:[/bold cyan]", True)
result = crawler.run(url="https://www.nbcnews.com/business", include_raw_html=False)
cprint("[LOG] 📦 [bold yellow]Crawl result (without raw HTML content):[/bold yellow]")
print_result(result)
def add_chunking_strategy(crawler):
# Adding a chunking strategy: RegexChunking
cprint("\n🧩 [bold cyan]Let's add a chunking strategy: RegexChunking![/bold cyan]", True)
@@ -134,7 +128,7 @@ def add_llm_extraction_strategy(crawler):
print_result(result)
result = crawler.run(
url="https://www.example.com",
url="https://www.nbcnews.com/business",
extraction_strategy=LLMExtractionStrategy(
provider="openai/gpt-4o",
api_token=os.getenv('OPENAI_API_KEY'),
@@ -176,12 +170,11 @@ def main():
cprint("If this is the first time you're running Crawl4ai, this might take a few seconds to load required model files.")
crawler = create_crawler()
cprint("For the rest of this guide, I set crawler.always_by_pass_cache to True to force the crawler to bypass the cache. This is to ensure that we get fresh results for each run.", True)
crawler.always_by_pass_cache = True
basic_usage(crawler)
understanding_parameters(crawler)
crawler.always_by_pass_cache = True
add_chunking_strategy(crawler)
add_extraction_strategy(crawler)
add_llm_extraction_strategy(crawler)

View File

@@ -44,14 +44,12 @@ def get_crawler():
return WebCrawler()
class CrawlRequest(BaseModel):
urls: List[HttpUrl]
provider_model: str
api_token: str
urls: List[str]
include_raw_html: Optional[bool] = False
bypass_cache: bool = False
extract_blocks: bool = True
word_count_threshold: Optional[int] = 5
extraction_strategy: Optional[str] = "CosineStrategy"
extraction_strategy: Optional[str] = "NoExtractionStrategy"
extraction_strategy_args: Optional[dict] = {}
chunking_strategy: Optional[str] = "RegexChunking"
chunking_strategy_args: Optional[dict] = {}
@@ -95,9 +93,6 @@ def import_strategy(module_name: str, class_name: str, *args, **kwargs):
@app.post("/crawl")
async def crawl_urls(crawl_request: CrawlRequest, request: Request):
global current_requests
# Raise error if api_token is not provided
if not crawl_request.api_token:
raise HTTPException(status_code=401, detail="API token is required.")
async with lock:
if current_requests >= MAX_CONCURRENT_REQUESTS:
raise HTTPException(status_code=429, detail="Too many requests - please try again later.")

View File

@@ -1,144 +0,0 @@
[paths]
train = null
dev = null
vectors = null
init_tok2vec = null
[system]
seed = 0
gpu_allocator = null
[nlp]
lang = "en"
pipeline = ["textcat_multilabel"]
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null
batch_size = 1000
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}
vectors = {"@vectors":"spacy.Vectors.v1"}

[components]

[components.textcat_multilabel]
factory = "textcat_multilabel"
scorer = {"@scorers":"spacy.textcat_multilabel_scorer.v2"}
threshold = 0.5

[components.textcat_multilabel.model]
@architectures = "spacy.TextCatEnsemble.v2"
nO = null

[components.textcat_multilabel.model.linear_model]
@architectures = "spacy.TextCatBOW.v3"
exclusive_classes = false
length = 262144
ngram_size = 1
no_output_layer = false
nO = null

[components.textcat_multilabel.model.tok2vec]
@architectures = "spacy.Tok2Vec.v2"

[components.textcat_multilabel.model.tok2vec.embed]
@architectures = "spacy.MultiHashEmbed.v2"
width = 64
rows = [2000,2000,500,1000,500]
attrs = ["NORM","LOWER","PREFIX","SUFFIX","SHAPE"]
include_static_vectors = false

[components.textcat_multilabel.model.tok2vec.encode]
@architectures = "spacy.MaxoutWindowEncoder.v2"
width = 64
window_size = 1
maxout_pieces = 3
depth = 2

[corpora]

[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths.dev}
gold_preproc = false
max_length = 0
limit = 0
augmenter = null

[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
gold_preproc = false
max_length = 0
limit = 0
augmenter = null

[training]
seed = ${system.seed}
gpu_allocator = ${system.gpu_allocator}
dropout = 0.1
accumulate_gradient = 1
patience = 1600
max_epochs = 0
max_steps = 20000
eval_frequency = 200
frozen_components = []
annotating_components = []
dev_corpus = "corpora.dev"
train_corpus = "corpora.train"
before_to_disk = null
before_update = null

[training.batcher]
@batchers = "spacy.batch_by_words.v1"
discard_oversize = false
tolerance = 0.2
get_length = null

[training.batcher.size]
@schedules = "compounding.v1"
start = 100
stop = 1000
compound = 1.001
t = 0.0

[training.logger]
@loggers = "spacy.ConsoleLogger.v1"
progress_bar = false

[training.optimizer]
@optimizers = "Adam.v1"
beta1 = 0.9
beta2 = 0.999
L2_is_weight_decay = true
L2 = 0.01
grad_clip = 1.0
use_averages = false
eps = 0.00000001
learn_rate = 0.001

[training.score_weights]
cats_score = 1.0
cats_score_desc = null
cats_micro_p = null
cats_micro_r = null
cats_micro_f = null
cats_macro_p = null
cats_macro_r = null
cats_macro_f = null
cats_macro_auc = null
cats_f_per_type = null

[pretraining]

[initialize]
vectors = ${paths.vectors}
init_tok2vec = ${paths.init_tok2vec}
vocab_data = null
lookups = null
before_init = null
after_init = null

[initialize.components]

[initialize.tokenizer]
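The `[training.batcher.size]` block above uses the `compounding.v1` schedule: the batch size starts at `start` and is multiplied by `compound` each step, saturating at `stop`. A minimal Python sketch of that behaviour (reimplemented here for illustration; this is not spaCy's or thinc's own code):

```python
def compounding(start, stop, compound):
    """Yield values that begin at `start` and are multiplied by
    `compound` each step, capped at `stop` (mirrors the intent of
    thinc's compounding.v1 schedule)."""
    value = start
    while True:
        yield min(value, stop)
        value *= compound

# With the config's values (100, 1000, 1.001) the batch size grows
# very slowly from 100 words toward the 1000-word ceiling.
sched = compounding(100.0, 1000.0, 1.001)
first = next(sched)
```

Because `compound` is only 1.001, reaching the 1000-word ceiling takes on the order of a few thousand steps, which smooths early training.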

View File

@@ -1,122 +0,0 @@
{
"lang":"en",
"name":"pipeline",
"version":"0.0.0",
"spacy_version":">=3.7.4,<3.8.0",
"description":"",
"author":"",
"email":"",
"url":"",
"license":"",
"spacy_git_version":"bff8725f4",
"vectors":{
"width":0,
"vectors":0,
"keys":0,
"name":null,
"mode":"default"
},
"labels":{
"textcat_multilabel":[
"acq",
"alum",
"barley",
"bop",
"carcass",
"castor-oil",
"cocoa",
"coconut",
"coconut-oil",
"coffee",
"copper",
"copra-cake",
"corn",
"cotton",
"cotton-oil",
"cpi",
"cpu",
"crude",
"dfl",
"dlr",
"dmk",
"earn",
"fuel",
"gas",
"gnp",
"gold",
"grain",
"groundnut",
"groundnut-oil",
"heat",
"hog",
"housing",
"income",
"instal-debt",
"interest",
"ipi",
"iron-steel",
"jet",
"jobs",
"l-cattle",
"lead",
"lei",
"lin-oil",
"livestock",
"lumber",
"meal-feed",
"money-fx",
"money-supply",
"naphtha",
"nat-gas",
"nickel",
"nkr",
"nzdlr",
"oat",
"oilseed",
"orange",
"palladium",
"palm-oil",
"palmkernel",
"pet-chem",
"platinum",
"potato",
"propane",
"rand",
"rape-oil",
"rapeseed",
"reserves",
"retail",
"rice",
"rubber",
"rye",
"ship",
"silver",
"sorghum",
"soy-meal",
"soy-oil",
"soybean",
"strategic-metal",
"sugar",
"sun-meal",
"sun-oil",
"sunseed",
"tea",
"tin",
"trade",
"veg-oil",
"wheat",
"wpi",
"yen",
"zinc"
]
},
"pipeline":[
"textcat_multilabel"
],
"components":[
"textcat_multilabel"
],
"disabled":[
]
}

View File

@@ -1,95 +0,0 @@
{
"labels":[
"acq",
"alum",
"barley",
"bop",
"carcass",
"castor-oil",
"cocoa",
"coconut",
"coconut-oil",
"coffee",
"copper",
"copra-cake",
"corn",
"cotton",
"cotton-oil",
"cpi",
"cpu",
"crude",
"dfl",
"dlr",
"dmk",
"earn",
"fuel",
"gas",
"gnp",
"gold",
"grain",
"groundnut",
"groundnut-oil",
"heat",
"hog",
"housing",
"income",
"instal-debt",
"interest",
"ipi",
"iron-steel",
"jet",
"jobs",
"l-cattle",
"lead",
"lei",
"lin-oil",
"livestock",
"lumber",
"meal-feed",
"money-fx",
"money-supply",
"naphtha",
"nat-gas",
"nickel",
"nkr",
"nzdlr",
"oat",
"oilseed",
"orange",
"palladium",
"palm-oil",
"palmkernel",
"pet-chem",
"platinum",
"potato",
"propane",
"rand",
"rape-oil",
"rapeseed",
"reserves",
"retail",
"rice",
"rubber",
"rye",
"ship",
"silver",
"sorghum",
"soy-meal",
"soy-oil",
"soybean",
"strategic-metal",
"sugar",
"sun-meal",
"sun-oil",
"sunseed",
"tea",
"tin",
"trade",
"veg-oil",
"wheat",
"wpi",
"yen",
"zinc"
],
"threshold":0.5
}
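The stored `threshold` of 0.5 is what turns the categorizer's per-label scores into predicted labels in a multilabel setup. A small illustrative sketch (plain Python, not spaCy internals) of applying such a cutoff to a `doc.cats`-style score dict; the scores here are invented:

```python
def predicted_labels(cats, threshold=0.5):
    """Return every label whose score meets the multilabel threshold.

    In a multilabel pipeline, any number of labels (including none)
    can clear the cutoff simultaneously."""
    return [label for label, score in cats.items() if score >= threshold]

# Hypothetical scores for a news snippet; the label names come from the
# Reuters-style label set stored alongside the model.
scores = {"grain": 0.91, "wheat": 0.55, "gold": 0.07}
print(predicted_labels(scores))  # ['grain', 'wheat']
```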

File diff suppressed because one or more lines are too long

View File

@@ -1 +0,0 @@

View File

@@ -1 +0,0 @@

File diff suppressed because it is too large Load Diff

Binary file not shown.

View File

@@ -1,3 +0,0 @@
{
"mode":"default"
}

View File

@@ -69,9 +69,12 @@ axios
// Handle crawl button click
document.getElementById("crawl-btn").addEventListener("click", () => {
// validate input to have both URL and API token
if (!document.getElementById("url-input").value || !document.getElementById("token-input").value) {
alert("Please enter both URL(s) and API token.");
return;
// if selected extraction strategy is LLMExtractionStrategy, then API token is required
if (document.getElementById("extraction-strategy-select").value === "LLMExtractionStrategy") {
if (!document.getElementById("url-input").value || !document.getElementById("token-input").value) {
alert("Please enter both URL(s) and API token.");
return;
}
}
const selectedProviderModel = document.getElementById("provider-model-select").value;
@@ -87,8 +90,6 @@ document.getElementById("crawl-btn").addEventListener("click", () => {
const urls = urlsInput.split(",").map((url) => url.trim());
const data = {
urls: urls,
provider_model: selectedProviderModel,
api_token: apiToken,
include_raw_html: true,
bypass_cache: bypassCache,
extract_blocks: extractBlocks,
@@ -112,8 +113,8 @@ document.getElementById("crawl-btn").addEventListener("click", () => {
localStorage.setItem("api_token", document.getElementById("token-input").value);
document.getElementById("loading").classList.remove("hidden");
document.getElementById("result").classList.add("hidden");
document.getElementById("code_help").classList.add("hidden");
document.getElementById("result").style.visibility = "hidden";
document.getElementById("code_help").style.visibility = "hidden";
axios
.post("/crawl", data)
@@ -128,18 +129,20 @@ document.getElementById("crawl-btn").addEventListener("click", () => {
const extractionStrategy = data.extraction_strategy;
const isLLMExtraction = extractionStrategy === "LLMExtractionStrategy";
// REMOVE API TOKEN FROM CODE EXAMPLES
data.extraction_strategy_args.api_token = "your_api_token";
document.getElementById(
"curl-code"
).textContent = `curl -X POST -H "Content-Type: application/json" -d '${JSON.stringify({
...data,
api_token: isLLMExtraction ? "your_api_token" : undefined,
})}' http://crawl4ai.uccode.io/crawl`;
}, null, 2)}' http://crawl4ai.com/crawl`;
document.getElementById("python-code").textContent = `import requests\n\ndata = ${JSON.stringify(
{ ...data, api_token: isLLMExtraction ? "your_api_token" : undefined },
null,
2
)}\n\nresponse = requests.post("http://crawl4ai.com/crawl", json=data) # or localhost if you run locally \nprint(response.json())`;
)}\n\nresponse = requests.post("http://crawl4ai.com/crawl", json=data) # OR local host if your run locally \nprint(response.json())`;
document.getElementById(
"nodejs-code"
@@ -147,7 +150,7 @@ document.getElementById("crawl-btn").addEventListener("click", () => {
{ ...data, api_token: isLLMExtraction ? "your_api_token" : undefined },
null,
2
)};\n\naxios.post("http://crawl4ai.com/crawl", data) // or localhost if you run locally \n .then(response => console.log(response.data))\n .catch(error => console.error(error));`;
)};\n\naxios.post("http://crawl4ai.com/crawl", data) // OR local host if your run locally \n .then(response => console.log(response.data))\n .catch(error => console.error(error));`;
document.getElementById(
"library-code"
@@ -169,8 +172,8 @@ document.getElementById("crawl-btn").addEventListener("click", () => {
document.getElementById("loading").classList.add("hidden");
document.getElementById("result").classList.remove("hidden");
document.getElementById("code_help").classList.remove("hidden");
document.getElementById("result").style.visibility = "visible";
document.getElementById("code_help").style.visibility = "visible";
// increment the total count
document.getElementById("total-count").textContent =
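The payload assembled by the click handler above can be reproduced outside the browser. A hedged Python sketch (field names taken from the snippet; the helper name and defaults are invented for illustration, and the API token is attached only when the LLM strategy is selected, matching the new validation logic):

```python
def build_crawl_payload(urls_input, extraction_strategy, api_token=None):
    """Mirror the browser-side payload: split the comma-separated URL
    field and attach the API token only for LLMExtractionStrategy."""
    payload = {
        "urls": [u.strip() for u in urls_input.split(",")],
        "include_raw_html": True,
        "extraction_strategy": extraction_strategy,
    }
    if extraction_strategy == "LLMExtractionStrategy":
        payload["api_token"] = api_token
    return payload

data = build_crawl_payload("https://www.nbcnews.com/business", "NoExtractionStrategy")
# POST it with e.g.: requests.post("http://crawl4ai.com/crawl", json=data)
```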

View File

@@ -29,7 +29,7 @@
class="bg-zinc-800 p-4 rounded mt-2 text-zinc-100"
><code>virtualenv venv
source venv/bin/activate
pip install git+https://github.com/unclecode/crawl4ai.git
pip install "crawl4ai[all] @ git+https://github.com/unclecode/crawl4ai.git"
</code></pre>
</li>
<li class="mb-4">
@@ -46,7 +46,7 @@ pip install git+https://github.com/unclecode/crawl4ai.git
source venv/bin/activate
git clone https://github.com/unclecode/crawl4ai.git
cd crawl4ai
pip install -e .
pip install -e .[all]
</code></pre>
</li>
<li class="">

View File

@@ -46,9 +46,9 @@
id="extraction-strategy-select"
class="border border-zinc-700 rounded px-4 py-1 bg-zinc-900 text-zinc-300"
>
<option value="NoExtractionStrategy" selected>NoExtractionStrategy</option>
<option value="CosineStrategy">CosineStrategy</option>
<option value="LLMExtractionStrategy">LLMExtractionStrategy</option>
<option value="NoExtractionStrategy">NoExtractionStrategy</option>
</select>
</div>
<div class="flex flex-col">
@@ -99,7 +99,7 @@
</div>
<div class="flex gap-2">
<!-- Add two textarea one for getting Keyword Filter and another one Instruction, make both grow whole with-->
<div id = "semantic_filter_div" class="flex flex-col flex-1">
<div id = "semantic_filter_div" class="flex flex-col flex-1 hidden">
<label for="keyword-filter" class="text-lime-500 font-bold text-xs">Keyword Filter</label>
<textarea
id="semantic_filter"
@@ -131,10 +131,10 @@
</div>
</div>
<div id="loading" class="hidden">
<p class="text-white">Loading... Please wait.</p>
</div>
<div id="result" class="flex-1">
<div id="loading" class="hidden">
<p class="text-white">Loading... Please wait.</p>
</div>
<div class="tab-buttons flex gap-2">
<button class="tab-btn px-4 py-1 text-sm bg-zinc-700 rounded-t text-lime-500" data-tab="json">
JSON
@@ -181,19 +181,19 @@
</button> -->
</div>
<div class="tab-content result bg-zinc-900 p-2 rounded h-full border border-zinc-700 text-sm">
<pre class="h-full flex relative">
<pre class="h-full flex relative overflow-x-auto">
<code id="curl-code" class="language-bash"></code>
<button class="absolute top-2 right-2 bg-zinc-700 text-white px-2 py-1 rounded copy-btn" data-target="curl-code">Copy</button>
</pre>
<pre class="hidden h-full flex relative">
<pre class="hidden h-full flex relative overflow-x-auto">
<code id="python-code" class="language-python"></code>
<button class="absolute top-2 right-2 bg-zinc-700 text-white px-2 py-1 rounded copy-btn" data-target="python-code">Copy</button>
</pre>
<pre class="hidden h-full flex relative">
<pre class="hidden h-full flex relative overflow-x-auto">
<code id="nodejs-code" class="language-javascript"></code>
<button class="absolute top-2 right-2 bg-zinc-700 text-white px-2 py-1 rounded copy-btn" data-target="nodejs-code">Copy</button>
</pre>
<pre class="hidden h-full flex relative">
<pre class="hidden h-full flex relative overflow-x-auto">
<code id="library-code" class="language-python"></code>
<button class="absolute top-2 right-2 bg-zinc-700 text-white px-2 py-1 rounded copy-btn" data-target="library-code">Copy</button>
</pre>

View File

@@ -236,12 +236,11 @@ chunks = chunker.chunk("This is a sample text. It will be split into chunks.")
<h4>Constructor Parameters:</h4>
<ul>
<li>
<code>model</code> (str, optional): The SpaCy model to use for sentence detection. Default is
<code>'en_core_web_sm'</code>.
None.
</li>
</ul>
<h4>Example usage:</h4>
<pre><code class="language-python">chunker = NlpSentenceChunking(model='en_core_web_sm')
<pre><code class="language-python">chunker = NlpSentenceChunking()
chunks = chunker.chunk("This is a sample text. It will be split into sentences.")
</code></pre>
</div>

View File

@@ -2,7 +2,6 @@ aiohttp==3.9.5
aiosqlite==0.20.0
bs4==0.0.2
fastapi==0.111.0
typer==0.9.0
html2text==2024.2.26
httpx==0.27.0
lazy_import==0.2.2
@@ -14,7 +13,6 @@ requests==2.31.0
rich==13.7.1
scikit-learn==1.4.2
selenium==4.20.0
spacy==3.7.4
uvicorn==0.29.0
transformers==4.40.2
chromedriver-autoinstaller==0.6.4

View File

@@ -1,24 +1,18 @@
from setuptools import setup, find_packages
from setuptools.command.install import install as _install
import subprocess
import sys
class InstallCommand(_install):
def run(self):
# Run the standard install first
_install.run(self)
# Now handle the dependencies manually
self.manual_dependencies_install()
# Read the requirements from requirements.txt
with open("requirements.txt") as f:
requirements = f.read().splitlines()
def manual_dependencies_install(self):
with open('requirements.txt') as f:
dependencies = f.read().splitlines()
for dependency in dependencies:
subprocess.check_call([sys.executable, '-m', 'pip', 'install', dependency])
# Define the requirements for different environments
requirements_without_torch = [req for req in requirements if not req.startswith("torch")]
requirements_without_transformers = [req for req in requirements if not req.startswith("transformers")]
requirements_without_nltk = [req for req in requirements if not req.startswith("nltk")]
requirements_without_torch_transformers_nlkt = [req for req in requirements if not req.startswith("torch") and not req.startswith("transformers") and not req.startswith("nltk")]
setup(
name="Crawl4AI",
version="0.1.0",
version="0.2.0",
description="🔥🕷️ Crawl4AI: Open-source LLM Friendly Web Crawler & Scraper",
long_description=open("README.md").read(),
long_description_content_type="text/markdown",
@@ -27,9 +21,11 @@ setup(
author_email="unclecode@kidocode.com",
license="MIT",
packages=find_packages(),
install_requires=[], # Leave this empty to avoid default dependency resolution
cmdclass={
'install': InstallCommand,
install_requires=requirements_without_torch_transformers_nlkt,
extras_require={
"all": requirements, # Include all requirements
"colab": requirements_without_torch, # Exclude torch for Colab
"crawl": requirements_without_torch_transformers_nlkt
},
entry_points={
'console_scripts': [
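The extras logic in the new setup.py boils down to filtering requirement lines by package-name prefix. A self-contained sketch of that filtering (the requirements list here is invented for illustration; the real one is read from requirements.txt):

```python
requirements = ["torch==2.3.0", "transformers==4.40.2", "nltk==3.8.1",
                "requests==2.31.0", "selenium==4.20.0"]

def without(reqs, *prefixes):
    """Drop any requirement whose name starts with one of the given
    prefixes, mirroring the list comprehensions in setup.py."""
    return [r for r in reqs if not r.startswith(prefixes)]

extras = {
    "all": requirements,                                    # everything
    "colab": without(requirements, "torch"),                # Colab already ships torch
    "crawl": without(requirements, "torch", "transformers", "nltk"),
}
print(extras["crawl"])  # ['requests==2.31.0', 'selenium==4.20.0']
```

Installing with `pip install "crawl4ai[all]"` then pulls the full set, while the bare install only needs the lightweight `crawl` subset.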

View File

@@ -1,35 +0,0 @@
import os
def install_crawl4ai():
print("Installing Crawl4AI and its dependencies...")
# Install dependencies
!pip install -U 'spacy[cuda12x]'
!apt-get update -y
!apt install chromium-chromedriver -y
!pip install chromedriver_autoinstaller
!pip install git+https://github.com/unclecode/crawl4ai.git@new-release-0.0.2
# Install ChromeDriver
import chromedriver_autoinstaller
chromedriver_autoinstaller.install()
# Download the reuters model
repo_url = "https://github.com/unclecode/crawl4ai.git"
branch = "new-release-0.0.2"
folder_path = "models/reuters"
!git clone -b {branch} {repo_url}
!mkdir -p models
repo_folder = "crawl4ai"
source_folder = os.path.join(repo_folder, folder_path)
destination_folder = "models"
!mv "{source_folder}" "{destination_folder}"
!rm -rf "{repo_folder}"
print("Installation and model download completed successfully!")
# Run the installer
install_crawl4ai()