πŸ”₯πŸ•·οΈ Crawl4AI: Web Data for your Thoughts

🌟 Welcome to the Crawl4ai Quickstart Guide! Let's dive into some web crawling fun!
First Step: Create an instance of WebCrawler and call the warmup() function.
crawler = WebCrawler()
crawler.warmup()
🧠 Understanding 'bypass_cache' and 'include_raw_html' parameters:
First crawl (caches the result):
result = crawler.run(url="https://www.nbcnews.com/business")
Second crawl (Force to crawl again):
result = crawler.run(url="https://www.nbcnews.com/business", bypass_cache=True)
Crawl result without raw HTML content:
result = crawler.run(url="https://www.nbcnews.com/business", include_raw_html=False)
πŸ“„ The 'include_raw_html' parameter, when set to True, includes the raw HTML content in the response. By default, it is set to True.
To bypass the cache on every crawl, set always_by_pass_cache to True:
crawler.always_by_pass_cache = True
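The caching behaviour above can be pictured with a small stand-alone sketch. This is not Crawl4AI's internals; run(), fetch(), and the cache dict here are hypothetical stand-ins that only illustrate when a page is actually re-downloaded:

```python
# Conceptual sketch of bypass_cache semantics (not the library's code).
cache = {}

def run(url, fetch, bypass_cache=False):
    if not bypass_cache and url in cache:
        return cache[url]          # cache hit: no new crawl
    result = fetch(url)            # fresh crawl
    cache[url] = result
    return result

calls = []
def fetch(url):
    # Stand-in for a real page download; records each network hit.
    calls.append(url)
    return f"content of {url}"

run("https://example.com", fetch)                      # first crawl: fetches
run("https://example.com", fetch)                      # served from cache
run("https://example.com", fetch, bypass_cache=True)   # forced re-crawl
```

After these three calls, fetch() has run only twice: the middle call was answered from the cache.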
🧩 Let's add a chunking strategy: RegexChunking!
Using RegexChunking:
result = crawler.run(
    url="https://www.nbcnews.com/business",
    chunking_strategy=RegexChunking(patterns=["\n\n"])
)
Using NlpSentenceChunking:
result = crawler.run(
    url="https://www.nbcnews.com/business",
    chunking_strategy=NlpSentenceChunking()
)
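The idea behind RegexChunking is simple: split the page text on the given patterns (e.g. double newlines for paragraphs). A minimal sketch of that idea — regex_chunk is a hypothetical helper, not the library's implementation:

```python
import re

def regex_chunk(text, patterns):
    # Split the text on each pattern in turn, dropping empty pieces,
    # mimicking a paragraph-based chunker.
    chunks = [text]
    for pattern in patterns:
        chunks = [piece for chunk in chunks for piece in re.split(pattern, chunk)]
    return [c.strip() for c in chunks if c.strip()]

chunks = regex_chunk("First paragraph.\n\nSecond paragraph.", [r"\n\n"])
# chunks -> ["First paragraph.", "Second paragraph."]
```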
🧠 Let's get smarter with an extraction strategy: CosineStrategy!
Using CosineStrategy:
result = crawler.run(
    url="https://www.nbcnews.com/business",
    extraction_strategy=CosineStrategy(word_count_threshold=10, max_dist=0.2, linkage_method="ward", top_k=3)
)
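CosineStrategy groups text blocks by similarity, keeping clusters within max_dist of each other. To build intuition for the distance measure involved, here is a bag-of-words cosine distance — a hypothetical illustration only; the library may well use embeddings rather than raw word counts:

```python
import math
from collections import Counter

def cosine_distance(a, b):
    # Bag-of-words cosine distance: 0.0 for identical texts,
    # 1.0 for texts sharing no words.
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return 1.0 - dot / (na * nb)
```

With max_dist=0.2, only blocks whose pairwise distance stays below 0.2 would end up clustered together.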
πŸ€– Time to bring in the big guns: LLMExtractionStrategy without instructions!
Using LLMExtractionStrategy without instructions:
result = crawler.run(
    url="https://www.nbcnews.com/business",
    extraction_strategy=LLMExtractionStrategy(provider="openai/gpt-4o", api_token=os.getenv('OPENAI_API_KEY'))
)
πŸ“œ Let's make it even more interesting: LLMExtractionStrategy with instructions!
Using LLMExtractionStrategy with instructions:
result = crawler.run(
    url="https://www.nbcnews.com/business",
    extraction_strategy=LLMExtractionStrategy(
        provider="openai/gpt-4o",
        api_token=os.getenv('OPENAI_API_KEY'),
        instruction="I am interested in only financial news"
    )
)
🎯 Targeted extraction: Let's use a CSS selector to extract only H2 tags!
Using CSS selector to extract H2 tags:
result = crawler.run(
    url="https://www.nbcnews.com/business",
    css_selector="h2"
)
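Conceptually, css_selector="h2" keeps only the matching elements from the fetched page. The effect can be sketched with the standard library alone (H2Collector is a hypothetical helper for illustration; Crawl4AI does the selection for you):

```python
from html.parser import HTMLParser

class H2Collector(HTMLParser):
    # Collects the text content of every <h2> tag, roughly what
    # css_selector="h2" would keep from the page.
    def __init__(self):
        super().__init__()
        self.in_h2 = False
        self.headings = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self.in_h2 = True
            self.headings.append("")

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_h2 = False

    def handle_data(self, data):
        if self.in_h2:
            self.headings[-1] += data

parser = H2Collector()
parser.feed("<h1>Top</h1><h2>Markets</h2><p>x</p><h2>Economy</h2>")
# parser.headings -> ["Markets", "Economy"]
```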
πŸ–±οΈ Let's get interactive: Passing JavaScript code to click 'Load More' button!
Using JavaScript to click 'Load More' button:
js_code = """
const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More'));
loadMoreButton && loadMoreButton.click();
"""
crawler_strategy = LocalSeleniumCrawlerStrategy(js_code=js_code)
crawler = WebCrawler(crawler_strategy=crawler_strategy, always_by_pass_cache=True)
result = crawler.run(url="https://www.nbcnews.com/business")
πŸŽ‰ Congratulations! You've made it through the Crawl4ai Quickstart Guide! Now go forth and crawl the web like a pro! πŸ•ΈοΈ

Installation πŸ’»

There are two ways to use Crawl4AI: as a library in your Python projects or as a standalone local server.

You can also try Crawl4AI in a Google Colab Open In Colab

Using Crawl4AI as a Library πŸ“š

To install Crawl4AI as a library, follow these steps:

  1. Install the package from GitHub:
    pip install git+https://github.com/unclecode/crawl4ai.git
  2. Alternatively, you can clone the repository and install the package locally:
    virtualenv venv
    source venv/bin/activate
    git clone https://github.com/unclecode/crawl4ai.git
    cd crawl4ai
    pip install -e .
            
  3. Import the necessary modules in your Python script:
    from crawl4ai.web_crawler import WebCrawler
    from crawl4ai.chunking_strategy import *
    from crawl4ai.extraction_strategy import *
    import os
    
    crawler = WebCrawler()
    crawler.warmup()
    
    # Single page crawl
    result = crawler.run(
        url="https://www.nbcnews.com/business",
        word_count_threshold=5,  # Minimum word count for an HTML tag to be considered a worthy block
        chunking_strategy=RegexChunking(patterns=["\n\n"]),  # Default is RegexChunking
        extraction_strategy=CosineStrategy(word_count_threshold=10, max_dist=0.2, linkage_method='ward', top_k=3),  # Default is CosineStrategy
        # extraction_strategy=LLMExtractionStrategy(provider="openai/gpt-4o", api_token=os.getenv('OPENAI_API_KEY')),
        bypass_cache=False,
        extract_blocks=True,  # Whether to extract semantic blocks of text from the HTML
        css_selector="",  # E.g. "div.article-body"
        verbose=True,
        include_raw_html=True,  # Whether to include the raw HTML content in the response
    )
    print(result.model_dump())
            

For more information about how to run Crawl4AI as a local server, please refer to the GitHub repository.

πŸ“– Parameters

| Parameter | Description | Required | Default Value |
|---|---|---|---|
| urls | A list of URLs to crawl and extract data from. | Yes | - |
| include_raw_html | Whether to include the raw HTML content in the response. | No | true |
| bypass_cache | Whether to force a fresh crawl even if the URL has been previously crawled. | No | false |
| extract_blocks | Whether to extract semantic blocks of text from the HTML. | No | true |
| word_count_threshold | The minimum number of words a block must contain to be considered meaningful (minimum value is 5). | No | 5 |
| extraction_strategy | The strategy to use for extracting content from the HTML (e.g., CosineStrategy). | No | CosineStrategy |
| chunking_strategy | The strategy to use for chunking the text before processing (e.g., RegexChunking). | No | RegexChunking |
| css_selector | The CSS selector to target specific parts of the HTML for extraction. | No | None |
| verbose | Whether to enable verbose logging. | No | true |
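The interplay of word_count_threshold and extract_blocks can be pictured as a simple filter over candidate text blocks. This is a conceptual sketch only — filter_blocks is a hypothetical helper, not the library's code:

```python
def filter_blocks(blocks, word_count_threshold=5):
    # Drop text blocks with fewer words than the threshold,
    # keeping only "meaningful" blocks, as word_count_threshold does.
    return [b for b in blocks if len(b.split()) >= word_count_threshold]

blocks = ["Too short", "This block easily clears the five word minimum threshold"]
kept = filter_blocks(blocks)
# kept -> ["This block easily clears the five word minimum threshold"]
```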


πŸ€” Why build this?

In recent times, we've witnessed a surge of startups emerging, riding the AI hype wave and charging for services that should rightfully be accessible to everyone. πŸŒπŸ’Έ One such example is scraping and crawling web pages and transforming them into a format suitable for Large Language Models (LLMs). πŸ•ΈοΈπŸ€– We believe that building a business around this is not the right approach; instead, it should definitely be open-source. πŸ†“πŸŒŸ So, if you possess the skills to build such tools and share our philosophy, we invite you to join our "Robinhood" band and help set these products free for the benefit of all. 🀝πŸ’ͺ

βš™οΈ Installation

To install and run Crawl4AI as a library or a local server, please refer to the πŸ“š GitHub repository.