First, create an instance of `WebCrawler` and call its `warmup()` function to load the required models:

```python
crawler = WebCrawler()
crawler.warmup()
```

Then run a basic crawl:

```python
result = crawler.run(url="https://www.nbcnews.com/business")
```

To force a fresh crawl even if the URL has been crawled before, set `bypass_cache=True`:

```python
result = crawler.run(url="https://www.nbcnews.com/business", bypass_cache=True)
```

To exclude the raw HTML content from the response, set `include_raw_html=False`:

```python
result = crawler.run(url="https://www.nbcnews.com/business", include_raw_html=False)
```
To bypass the cache on every run, set the crawler's `always_by_pass_cache` attribute to `True`:

```python
crawler.always_by_pass_cache = True
```
To chunk the extracted text with a regular expression, pass a `RegexChunking` strategy:

```python
result = crawler.run(
    url="https://www.nbcnews.com/business",
    chunking_strategy=RegexChunking(patterns=["\n\n"])
)
```
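Conceptually, regex chunking just splits the page text on the supplied patterns. The following is a standalone sketch of that idea, not the library's implementation:

```python
import re

def regex_chunk(text, patterns):
    """Split text on each regex pattern in turn and drop empty pieces
    (a sketch of the idea behind RegexChunking, not its actual code)."""
    chunks = [text]
    for pattern in patterns:
        chunks = [piece for chunk in chunks for piece in re.split(pattern, chunk)]
    return [c.strip() for c in chunks if c.strip()]

print(regex_chunk("First paragraph.\n\nSecond paragraph.", [r"\n\n"]))
# → ['First paragraph.', 'Second paragraph.']
```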
To chunk the text sentence by sentence using an NLP model, use `NlpSentenceChunking`:

```python
result = crawler.run(
    url="https://www.nbcnews.com/business",
    chunking_strategy=NlpSentenceChunking()
)
```
To cluster text blocks by cosine similarity, use `CosineStrategy`:

```python
result = crawler.run(
    url="https://www.nbcnews.com/business",
    extraction_strategy=CosineStrategy(word_count_threshold=10, max_dist=0.2, linkage_method="ward", top_k=3)
)
```
To extract content with a Large Language Model, use `LLMExtractionStrategy` with a provider and API token:

```python
result = crawler.run(
    url="https://www.nbcnews.com/business",
    extraction_strategy=LLMExtractionStrategy(provider="openai/gpt-4o", api_token=os.getenv('OPENAI_API_KEY'))
)
```
You can also pass an `instruction` to steer what the LLM extracts:

```python
result = crawler.run(
    url="https://www.nbcnews.com/business",
    extraction_strategy=LLMExtractionStrategy(
        provider="openai/gpt-4o",
        api_token=os.getenv('OPENAI_API_KEY'),
        instruction="I am interested in only financial news"
    )
)
```
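The instruction is folded into the prompt the strategy sends to the model. The sketch below shows how such an assembly might look; the function name and prompt wording are assumptions for illustration, not Crawl4AI's actual prompt:

```python
def build_extraction_prompt(page_text, instruction=None):
    """Hypothetical helper: combine an optional user instruction with the
    page content before sending it to the LLM. Names and wording are
    assumed, not taken from Crawl4AI."""
    parts = []
    if instruction:
        parts.append(f"Instruction: {instruction}")
    parts.append("Extract the relevant content blocks from the page below.")
    parts.append(page_text)
    return "\n\n".join(parts)

prompt = build_extraction_prompt(
    "NBC News business page text ...",
    instruction="I am interested in only financial news",
)
```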
To target specific parts of the page, pass a CSS selector. For example, to extract only h2 headings:

```python
result = crawler.run(
    url="https://www.nbcnews.com/business",
    css_selector="h2"
)
```
To execute custom JavaScript before crawling, for example clicking a "Load More" button, pass `js_code` to the crawler strategy:

```python
js_code = """
const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More'));
loadMoreButton && loadMoreButton.click();
"""

crawler_strategy = LocalSeleniumCrawlerStrategy(js_code=js_code)
crawler = WebCrawler(crawler_strategy=crawler_strategy, always_by_pass_cache=True)
result = crawler.run(url="https://www.nbcnews.com/business")
```
There are two ways to use Crawl4AI: as a library in your Python projects or as a standalone local server. You can also try Crawl4AI in a Google Colab notebook.
To install Crawl4AI as a library, install it directly from GitHub:

```shell
pip install git+https://github.com/unclecode/crawl4ai.git
```

Alternatively, create a virtual environment and install an editable copy from a local clone:

```shell
virtualenv venv
source venv/bin/activate
git clone https://github.com/unclecode/crawl4ai.git
cd crawl4ai
pip install -e .
```
Import the crawler and strategies, then run a single-page crawl:

```python
from crawl4ai.web_crawler import WebCrawler
from crawl4ai.chunking_strategy import *
from crawl4ai.extraction_strategy import *
import os

crawler = WebCrawler()
crawler.warmup()

# Single page crawl
result = crawler.run(
    url="https://www.nbcnews.com/business",
    word_count_threshold=5,  # Minimum word count for an HTML tag to be considered a worthy block
    chunking_strategy=RegexChunking(patterns=["\\n\\n"]),  # Default is RegexChunking
    extraction_strategy=CosineStrategy(word_count_threshold=10, max_dist=0.2, linkage_method="ward", top_k=3),  # Default is CosineStrategy
    # extraction_strategy=LLMExtractionStrategy(provider="openai/gpt-4o", api_token=os.getenv('OPENAI_API_KEY')),
    bypass_cache=False,
    extract_blocks=True,  # Whether to extract semantic blocks of text from the HTML
    css_selector="",  # E.g.: "div.article-body"
    verbose=True,
    include_raw_html=True,  # Whether to include the raw HTML content in the response
)
print(result.model_dump())
```
For more information about how to run Crawl4AI as a local server, please refer to the GitHub repository.
| Parameter | Description | Required | Default Value |
|---|---|---|---|
| urls | A list of URLs to crawl and extract data from. | Yes | - |
| include_raw_html | Whether to include the raw HTML content in the response. | No | false |
| bypass_cache | Whether to force a fresh crawl even if the URL has been previously crawled. | No | false |
| extract_blocks | Whether to extract semantic blocks of text from the HTML. | No | true |
| word_count_threshold | The minimum number of words a block must contain to be considered meaningful (minimum value is 5). | No | 5 |
| extraction_strategy | The strategy to use for extracting content from the HTML (e.g., "CosineStrategy"). | No | CosineStrategy |
| chunking_strategy | The strategy to use for chunking the text before processing (e.g., "RegexChunking"). | No | RegexChunking |
| css_selector | The CSS selector to target specific parts of the HTML for extraction. | No | None |
| verbose | Whether to enable verbose logging. | No | true |
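As a quick check on how these parameters combine, here is a small sketch that merges user overrides onto the documented default values. The `DEFAULTS` dict and `make_run_kwargs` helper are illustrative only, not part of Crawl4AI:

```python
# Default values from the parameter table above (illustrative only).
DEFAULTS = {
    "include_raw_html": False,
    "bypass_cache": False,
    "extract_blocks": True,
    "word_count_threshold": 5,
    "css_selector": None,
    "verbose": True,
}

def make_run_kwargs(url, **overrides):
    """Hypothetical helper: merge caller overrides onto the defaults."""
    kwargs = dict(DEFAULTS)
    kwargs.update(overrides)
    kwargs["url"] = url
    return kwargs

kwargs = make_run_kwargs("https://www.nbcnews.com/business", bypass_cache=True)
# bypass_cache is now True; every parameter not overridden keeps its default.
```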
In recent times, we've witnessed a surge of startups emerging, riding the AI hype wave and charging for services that should rightfully be accessible to everyone. One such example is scraping and crawling web pages and transforming them into a format suitable for Large Language Models (LLMs). We believe that building a business around this is not the right approach; instead, it should definitely be open-source. So, if you possess the skills to build such tools and share our philosophy, we invite you to join our "Robinhood" band and help set these products free for the benefit of all.
To install and run Crawl4AI as a library or a local server, please refer to the GitHub repository.