πŸ”₯πŸ•·οΈ Crawl4AI: Web Data for your Thoughts

Installation πŸ’»

There are two ways to use Crawl4AI: as a library in your Python projects or as a standalone local server.

You can also try Crawl4AI in a Google Colab notebook.

Using Crawl4AI as a Library πŸ“š

To install Crawl4AI as a library, follow these steps:

  1. Install the package from GitHub:
    pip install git+https://github.com/unclecode/crawl4ai.git
  2. Alternatively, you can clone the repository and install the package locally:
    virtualenv venv
    source venv/bin/activate
    git clone https://github.com/unclecode/crawl4ai.git
    cd crawl4ai
    pip install -e .
            
  3. Import the necessary modules in your Python script:
    from crawl4ai.web_crawler import WebCrawler
    from crawl4ai.chunking_strategy import *
    from crawl4ai.extraction_strategy import *
    import os
    
    crawler = WebCrawler()
    
    # Single page crawl
    result = crawler.run(
        url='https://www.nbcnews.com/business',
        word_count_threshold=5,  # Minimum word count for an HTML tag to be considered a worthy block
        chunking_strategy=RegexChunking(patterns=["\\n\\n"]),  # Default is RegexChunking
        extraction_strategy=CosineStrategy(word_count_threshold=20, max_dist=0.2, linkage_method='ward', top_k=3),  # Default is CosineStrategy
        # extraction_strategy=LLMExtractionStrategy(provider="openai/gpt-4o", api_token=os.getenv('OPENAI_API_KEY')),
        bypass_cache=False,  # Set to True to force a fresh crawl
        extract_blocks=True,  # Whether to extract semantic blocks of text from the HTML
        css_selector="",  # E.g., "div.article-body" to target part of the page
        verbose=True,
        include_raw_html=True,  # Whether to include the raw HTML content in the response
    )
    print(result.model_dump())
            

For more information about how to run Crawl4AI as a local server, please refer to the GitHub repository.

πŸ“– Parameters

| Parameter | Description | Required | Default Value |
|-----------|-------------|----------|---------------|
| urls | A list of URLs to crawl and extract data from. | Yes | - |
| include_raw_html | Whether to include the raw HTML content in the response. | No | false |
| bypass_cache | Whether to force a fresh crawl even if the URL has been previously crawled. | No | false |
| extract_blocks | Whether to extract semantic blocks of text from the HTML. | No | true |
| word_count_threshold | The minimum number of words a block must contain to be considered meaningful (minimum value is 5). | No | 5 |
| extraction_strategy | The strategy to use for extracting content from the HTML (e.g., "CosineStrategy"). | No | CosineStrategy |
| chunking_strategy | The strategy to use for chunking the text before processing (e.g., "RegexChunking"). | No | RegexChunking |
| css_selector | The CSS selector to target specific parts of the HTML for extraction. | No | None |
| verbose | Whether to enable verbose logging. | No | true |

Extraction Strategies
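
The default extraction strategy, CosineStrategy, clusters text blocks by similarity (the example above passes `max_dist`, `linkage_method='ward'`, and `top_k` to tune that clustering). A rough illustration of the underlying idea follows; this is a sketch, not the library's implementation, and `cosine_similarity` and `group_similar` are hypothetical helpers that use bag-of-words counts rather than real embeddings:

```python
from collections import Counter
from math import sqrt

def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity between two texts using bag-of-words counts.

    A toy stand-in for the embedding-based similarity a
    CosineStrategy-style extractor would use to relate blocks.
    """
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = sqrt(sum(v * v for v in va.values())) * sqrt(sum(v * v for v in vb.values()))
    return dot / norm if norm else 0.0

def group_similar(blocks, threshold=0.5):
    """Greedy grouping: attach each block to the first group whose
    seed block is at least `threshold` similar, else start a new group."""
    groups = []
    for block in blocks:
        for group in groups:
            if cosine_similarity(group[0], block) >= threshold:
                group.append(block)
                break
        else:
            groups.append([block])
    return groups

blocks = ["stock market news", "market news today", "cooking pasta recipe"]
print(group_similar(blocks, threshold=0.3))
```

Blocks about the same topic share vocabulary, so they end up in the same group; `top_k` in the real strategy would then keep only the highest-ranked clusters.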

Chunking Strategies
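
The default chunking strategy, RegexChunking, splits the page text on regex patterns (paragraph breaks, `"\n\n"`, by default) before extraction runs. A minimal sketch of that idea, assuming nothing about the library's internals (`SimpleRegexChunker` is a hypothetical stand-in, not the real class):

```python
import re

class SimpleRegexChunker:
    """Illustrative stand-in for a RegexChunking-style strategy.

    Applies each pattern in turn, splitting every current chunk,
    mirroring the default of chunking on blank lines ("\n\n").
    """

    def __init__(self, patterns=None):
        # Default: split on paragraph boundaries.
        self.patterns = patterns or [r"\n\n"]

    def chunk(self, text):
        chunks = [text]
        for pattern in self.patterns:
            new_chunks = []
            for c in chunks:
                new_chunks.extend(re.split(pattern, c))
            chunks = new_chunks
        # Drop empty chunks produced by consecutive delimiters.
        return [c.strip() for c in chunks if c.strip()]

chunker = SimpleRegexChunker()
print(chunker.chunk("First paragraph.\n\nSecond paragraph."))
```

Passing a different `patterns` list (as in the `RegexChunking(patterns=["\\n\\n"])` call above) changes the boundaries on which the text is cut.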

πŸ€” Why build this?

In recent times, we've witnessed a surge of startups emerging, riding the AI hype wave and charging for services that should rightfully be accessible to everyone. πŸŒπŸ’Έ One such example is scraping and crawling web pages and transforming them into a format suitable for Large Language Models (LLMs). πŸ•ΈοΈπŸ€– We believe that building a business around this is not the right approach; instead, it should definitely be open-source. πŸ†“πŸŒŸ So, if you possess the skills to build such tools and share our philosophy, we invite you to join our "Robinhood" band and help set these products free for the benefit of all. 🀝πŸ’ͺ

βš™οΈ Installation

To install and run Crawl4AI as a library or a local server, please refer to the πŸ“š GitHub repository.