How to Guide

🌟 Welcome to the Crawl4ai Quickstart Guide! Let's dive into some web crawling fun!
First Step: Create an instance of WebCrawler and call the warmup() function.
from crawl4ai import WebCrawler
crawler = WebCrawler()
crawler.warmup()
🧠 Understanding 'bypass_cache' and 'include_raw_html' parameters:
First crawl (caches the result):
result = crawler.run(url="https://www.nbcnews.com/business")
Second crawl (forces a fresh crawl):
result = crawler.run(url="https://www.nbcnews.com/business", bypass_cache=True)
⚠️ Don't forget to set `bypass_cache` to True if you want to try different strategies on the same URL; otherwise the cached result is returned. You can also pass `always_by_pass_cache=True` to the constructor to bypass the cache on every call.
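To make the caching behaviour described above concrete, here is a minimal sketch in plain Python of how a URL-keyed cache with a bypass flag could behave. `ToyCrawler`, `_fetch`, and `fetch_count` are hypothetical names used only for illustration; this is not Crawl4ai's implementation, and it performs no network access.

```python
class ToyCrawler:
    """Hypothetical stand-in that mimics URL-keyed caching; not Crawl4ai code."""

    def __init__(self, always_by_pass_cache=False):
        self.always_by_pass_cache = always_by_pass_cache
        self._cache = {}      # url -> cached result
        self.fetch_count = 0  # number of simulated fetches

    def _fetch(self, url):
        # Simulated fetch: a real crawler would download the page here.
        self.fetch_count += 1
        return f"content of {url} (fetch #{self.fetch_count})"

    def run(self, url, bypass_cache=False):
        # The cache is consulted only when neither bypass switch is on.
        use_cache = not (bypass_cache or self.always_by_pass_cache)
        if use_cache and url in self._cache:
            return self._cache[url]
        result = self._fetch(url)
        self._cache[url] = result
        return result


crawler = ToyCrawler()
first = crawler.run("https://example.com")
second = crawler.run("https://example.com")                    # served from cache
third = crawler.run("https://example.com", bypass_cache=True)  # fetched again
```

In this sketch the second call returns the stored result without fetching, while the third call fetches again and refreshes the stored entry.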
Crawl result without raw HTML content:
result = crawler.run(url="https://www.nbcnews.com/business", include_raw_html=False)
📄 The 'include_raw_html' parameter, when set to True, includes the raw HTML content in the response. By default, it is set to True.
Set always_by_pass_cache to True:
crawler.always_by_pass_cache = True
📸 Let's take a screenshot of the page!
import base64

result = crawler.run(
    url="https://www.nbcnews.com/business",
    screenshot=True
)
# result.screenshot is base64-encoded, so decode it before writing the file
with open("screenshot.png", "wb") as f:
    f.write(base64.b64decode(result.screenshot))
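The screenshot is written by base64-decoding `result.screenshot`. As a hedged sketch, the decoding step could be wrapped in a small helper that fails loudly when the data is missing or is not an image. `save_screenshot` is a hypothetical helper, and the PNG signature check is an assumption based only on the `.png` file name used above. The example runs on stand-in data rather than a real crawl result.

```python
import base64
import os
import tempfile

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # first eight bytes of every PNG file


def save_screenshot(b64_data, path):
    """Decode base64 image data and write it to path; return bytes written."""
    if not b64_data:
        raise ValueError("no screenshot data was returned")
    raw = base64.b64decode(b64_data)
    # Assumption: the screenshot is a PNG image.
    if not raw.startswith(PNG_SIGNATURE):
        raise ValueError("decoded data does not look like a PNG image")
    with open(path, "wb") as f:
        return f.write(raw)


# Stand-in for result.screenshot: a PNG signature plus dummy bytes.
fake_screenshot = base64.b64encode(PNG_SIGNATURE + b"dummy").decode("ascii")
out_path = os.path.join(tempfile.gettempdir(), "screenshot_demo.png")
written = save_screenshot(fake_screenshot, out_path)
```

With a real crawl you would pass `result.screenshot` in place of `fake_screenshot`.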
🧩 Let's add a chunking strategy: RegexChunking!
Using RegexChunking:
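from crawl4ai.chunking_strategy import RegexChunking

# Split the page text into chunks at blank lines
result = crawler.run(
    url="https://www.nbcnews.com/business",
    chunking_strategy=RegexChunking(patterns=["\n\n"])
)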
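To illustrate what splitting on the pattern `"\n\n"` does to a block of text, here is a minimal sketch using only Python's `re` module. `regex_chunks` is a hypothetical function written for this guide; it shows the general idea of regex-based chunking and is not a copy of the library's `RegexChunking` class.

```python
import re


def regex_chunks(text, patterns):
    """Split text on each pattern in turn and drop empty pieces."""
    chunks = [text]
    for pattern in patterns:
        next_chunks = []
        for chunk in chunks:
            next_chunks.extend(re.split(pattern, chunk))
        chunks = next_chunks
    # Trim whitespace and discard chunks that end up empty.
    return [c.strip() for c in chunks if c.strip()]


sample = "Markets rose today.\n\nTech stocks led gains.\n\n\nOil prices fell."
paragraphs = regex_chunks(sample, ["\n\n"])
```

Here `paragraphs` holds the three paragraphs of `sample` as separate strings, which is the kind of input a later extraction step would work on.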
Using NlpSentenceChunking:
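from crawl4ai.chunking_strategy import NlpSentenceChunking

# Split the page text into chunks at sentence boundaries
result = crawler.run(
    url="https://www.nbcnews.com/business",
    chunking_strategy=NlpSentenceChunking()
)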
🧠 Let's get smarter with an extraction strategy: CosineStrategy!
Using CosineStrategy:
from crawl4ai.extraction_strategy import CosineStrategy

result = crawler.run(
    url="https://www.nbcnews.com/business",
    extraction_strategy=CosineStrategy(word_count_threshold=10, max_dist=0.2, linkage_method="ward", top_k=3)
)
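The `max_dist` argument suggests a distance threshold between pieces of text. As a rough illustration of the cosine measure the strategy is named after, the sketch below computes a cosine distance between two texts. It assumes simple word-count vectors, which is a simplification chosen for this example; how `CosineStrategy` actually represents and clusters text is not shown here, and `cosine_distance` is a hypothetical helper.

```python
import math
from collections import Counter


def cosine_distance(text_a, text_b):
    """Return 1 - cosine similarity of the two texts' word-count vectors."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    # Only words present in both texts contribute to the dot product.
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # treat empty text as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


same = cosine_distance("stocks rise on earnings", "stocks rise on earnings")
unrelated = cosine_distance("stocks rise on earnings", "weather forecast for sunday")
```

Identical texts have a distance of zero and texts with no words in common have a distance of one, so a small threshold such as 0.2 would group only closely related text.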
🤖 Time to bring in the big guns: LLMExtractionStrategy without instructions!
Using LLMExtractionStrategy without instructions:
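import os
from crawl4ai.extraction_strategy import LLMExtractionStrategy

# The API key is read from the OPENAI_API_KEY environment variable
result = crawler.run(
    url="https://www.nbcnews.com/business",
    extraction_strategy=LLMExtractionStrategy(provider="openai/gpt-4o", api_token=os.getenv('OPENAI_API_KEY'))
)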
📜 Let's make it even more interesting: LLMExtractionStrategy with instructions!
Using LLMExtractionStrategy with instructions:
result = crawler.run(
    url="https://www.nbcnews.com/business",
    extraction_strategy=LLMExtractionStrategy(
        provider="openai/gpt-4o",
        api_token=os.getenv('OPENAI_API_KEY'),
        instruction="I am interested in only financial news"
    )
)
🎯 Targeted extraction: Let's use a CSS selector to extract only H2 tags!
Using CSS selector to extract H2 tags:
result = crawler.run(
    url="https://www.nbcnews.com/business",
    css_selector="h2"
)
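To show what restricting extraction to `h2` elements means in practice, here is a small sketch that pulls H2 text out of an HTML string using only the standard library. `H2Collector` and the sample HTML are hypothetical and written for this guide. It handles only the bare tag selector `h2`, not general CSS selectors, and it is not how the crawler applies `css_selector` internally.

```python
from html.parser import HTMLParser


class H2Collector(HTMLParser):
    """Collect the text content of every <h2> element."""

    def __init__(self):
        super().__init__()
        self._depth = 0      # greater than 0 while inside an <h2>
        self.headings = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self._depth += 1
            self.headings.append("")

    def handle_endtag(self, tag):
        if tag == "h2" and self._depth > 0:
            self._depth -= 1

    def handle_data(self, data):
        # Text inside nested tags such as <em> is kept as part of the heading.
        if self._depth > 0:
            self.headings[-1] += data


html = "<h1>Business</h1><h2>Markets <em>today</em></h2><p>Body text</p><h2>Tech</h2>"
collector = H2Collector()
collector.feed(html)
collector.close()
headings = [h.strip() for h in collector.headings]
```

Only the two H2 headings survive; the H1 and the paragraph text are ignored.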
πŸ–±οΈ Let's get interactive: Passing JavaScript code to click 'Load More' button!
Using JavaScript to click 'Load More' button:
js_code = ["""
const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More'));
loadMoreButton && loadMoreButton.click();
"""]
crawler = WebCrawler(always_by_pass_cache=True)
crawler.warmup()
result = crawler.run(url="https://www.nbcnews.com/business", js=js_code)
Remember that you can pass multiple JavaScript snippets in the list. They are executed in the order in which they appear.
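When several snippets are needed, it can help to build the list programmatically. The sketch below assumes a hypothetical helper, `click_button_js`, that generates a click snippet for a given button label. The "Accept" label is invented for illustration, and each snippet is wrapped in a block so that its `const` declaration cannot clash with another snippet's.

```python
import json


def click_button_js(label):
    """Build a JavaScript snippet that clicks the first button containing label."""
    quoted = json.dumps(label)  # safely quote the label as a JS string literal
    return (
        "{\n"
        "  const btn = Array.from(document.querySelectorAll('button'))\n"
        f"    .find(b => b.textContent.includes({quoted}));\n"
        "  btn && btn.click();\n"
        "}"
    )


# Snippets are listed in the order they should run.
js_code = [click_button_js("Accept"), click_button_js("Load More")]
```

The resulting `js_code` list has the same shape as the hand-written one above and can be passed to the crawler in the same way.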
🎉 Congratulations! You've made it through the Crawl4ai Quickstart Guide! Now go forth and crawl the web like a pro! 🕸️