Welcome to the Crawl4ai Quickstart Guide! Let's dive into some web crawling fun!
First Step: Create an instance of WebCrawler and call the warmup() function.
crawler = WebCrawler()
crawler.warmup()
🧠 Understanding 'bypass_cache' and 'include_raw_html' parameters:
First crawl (caches the result):
result = crawler.run(url="https://www.nbcnews.com/business")
Second crawl (Force to crawl again):
result = crawler.run(url="https://www.nbcnews.com/business", bypass_cache=True)
⚠️ Don't forget to set `bypass_cache` to True if you want to try different strategies for the same URL; otherwise, the cached result is returned. You can also pass `always_by_pass_cache=True` to the constructor to bypass the cache on every call.
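The caching behaviour described above can be pictured with a small stand-alone sketch. This is not Crawl4ai's implementation: `TinyCachedFetcher`, its `fetch` callback, and the in-memory dict are hypothetical names, used only to illustrate how a per-call `bypass_cache` flag and a constructor-level `always_by_pass_cache` flag might interact.

```python
from typing import Callable, Dict, List


class TinyCachedFetcher:
    """Hypothetical URL-keyed cache; illustrates the two flags, not Crawl4ai internals."""

    def __init__(self, fetch: Callable[[str], str], always_by_pass_cache: bool = False):
        self.fetch = fetch
        self.always_by_pass_cache = always_by_pass_cache
        self._cache: Dict[str, str] = {}

    def run(self, url: str, bypass_cache: bool = False) -> str:
        # Either flag forces a fresh fetch; otherwise a cached result is reused.
        use_cache = not (bypass_cache or self.always_by_pass_cache)
        if use_cache and url in self._cache:
            return self._cache[url]
        result = self.fetch(url)
        self._cache[url] = result
        return result


calls: List[str] = []


def fake_fetch(url: str) -> str:
    # Stand-in for a real page download; records every call it receives.
    calls.append(url)
    return f"content #{len(calls)} for {url}"


fetcher = TinyCachedFetcher(fake_fetch)
first = fetcher.run("https://example.com")                      # fetches
second = fetcher.run("https://example.com")                     # served from cache
third = fetcher.run("https://example.com", bypass_cache=True)   # fetches again
```

In this sketch the second call returns the same object as the first, which is why changing the strategy without bypassing the cache would appear to have no effect.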
Crawl result without raw HTML content:
result = crawler.run(url="https://www.nbcnews.com/business", include_raw_html=False)
The 'include_raw_html' parameter, when set to True, includes the raw HTML content in the response. By default, it is set to True.
Set always_by_pass_cache to True:
crawler.always_by_pass_cache = True
📸 Let's take a screenshot of the page!
import base64
result = crawler.run(
    url="https://www.nbcnews.com/business",
    screenshot=True
)
# result.screenshot is base64-encoded; decode it before writing the file
with open("screenshot.png", "wb") as f:
    f.write(base64.b64decode(result.screenshot))
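Before writing the file, it can be worth checking that the decoded bytes really look like an image. The sketch below assumes the screenshot is a PNG (suggested by the `screenshot.png` filename above, but not confirmed by this guide); `decode_screenshot` is a hypothetical helper, and `fake_screenshot` stands in for `result.screenshot`.

```python
import base64

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # the 8 bytes that open every PNG file


def decode_screenshot(b64_data: str) -> bytes:
    """Decode base64 text and check that the result starts like a PNG file."""
    raw = base64.b64decode(b64_data)
    if not raw.startswith(PNG_SIGNATURE):
        raise ValueError("decoded data does not start with the PNG signature")
    return raw


# Stand-in for result.screenshot: base64 text wrapping fake PNG bytes.
fake_screenshot = base64.b64encode(PNG_SIGNATURE + b"fake image data").decode("ascii")
png_bytes = decode_screenshot(fake_screenshot)
```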
🧩 Let's add a chunking strategy: RegexChunking!
Using RegexChunking:
result = crawler.run(
    url="https://www.nbcnews.com/business",
    chunking_strategy=RegexChunking(patterns=["\n\n"])
)
Using NlpSentenceChunking:
result = crawler.run(
    url="https://www.nbcnews.com/business",
    chunking_strategy=NlpSentenceChunking()
)
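To get a feel for what splitting on the pattern `"\n\n"` does to a text, here is a minimal stand-alone sketch built on the standard library's `re.split`. It is an illustration of regex-based chunking in general, not the code behind `RegexChunking`; `split_on_patterns` and the sample text are hypothetical.

```python
import re
from typing import List


def split_on_patterns(text: str, patterns: List[str]) -> List[str]:
    """Split text on each regex pattern in turn and drop empty chunks."""
    chunks = [text]
    for pattern in patterns:
        next_chunks: List[str] = []
        for chunk in chunks:
            next_chunks.extend(re.split(pattern, chunk))
        chunks = next_chunks
    return [c.strip() for c in chunks if c.strip()]


sample = "First paragraph.\n\nSecond paragraph,\nstill second.\n\n\n\nThird."
chunks = split_on_patterns(sample, ["\n\n"])
# chunks == ["First paragraph.", "Second paragraph,\nstill second.", "Third."]
```

A single newline stays inside a chunk, while blank lines act as paragraph boundaries.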
🧠 Let's get smarter with an extraction strategy: CosineStrategy!
Using CosineStrategy:
result = crawler.run(
    url="https://www.nbcnews.com/business",
    extraction_strategy=CosineStrategy(word_count_threshold=10, max_dist=0.2, linkage_method="ward", top_k=3)
)
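This guide does not say how CosineStrategy computes similarity, so the following is only a sketch of the underlying idea: a cosine distance between two pieces of text, here using plain word counts as a stand-in for whatever vectors the library actually uses. Reading `max_dist=0.2` as a cutoff on such a distance is an assumption, and `cosine_distance` is a hypothetical helper.

```python
import math
from collections import Counter


def cosine_distance(a: str, b: str) -> float:
    """Return 1 - cosine similarity of simple word-count vectors."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # treat empty text as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


d_same = cosine_distance("stocks rise on earnings", "stocks rise on earnings")
d_close = cosine_distance("stocks rise on earnings", "stocks fall on earnings")
d_far = cosine_distance("stocks rise on earnings", "weather forecast for tuesday")
```

Under this toy measure, identical texts have distance 0, texts sharing three of four words have distance 0.25, and texts with no words in common have distance 1.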
🤖 Time to bring in the big guns: LLMExtractionStrategy without instructions!
Using LLMExtractionStrategy without instructions:
import os
result = crawler.run(
    url="https://www.nbcnews.com/business",
    extraction_strategy=LLMExtractionStrategy(provider="openai/gpt-4o", api_token=os.getenv('OPENAI_API_KEY'))
)
Let's make it even more interesting: LLMExtractionStrategy with instructions!
Using LLMExtractionStrategy with instructions:
result = crawler.run(
    url="https://www.nbcnews.com/business",
    extraction_strategy=LLMExtractionStrategy(
        provider="openai/gpt-4o",
        api_token=os.getenv('OPENAI_API_KEY'),
        instruction="I am interested in only financial news"
    )
)
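Note that `os.getenv` returns `None` when the variable is not set, so a missing key would only surface later, when the provider rejects the request. A small guard can fail earlier with a clearer message. `require_env` is a hypothetical helper, and `DEMO_API_KEY` is a made-up variable name used so the sketch runs without a real key.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or fail with a clear error."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"environment variable {name} is not set")
    return value


# Demo only: set a placeholder so the lookup succeeds.
os.environ["DEMO_API_KEY"] = "sk-demo"
token = require_env("DEMO_API_KEY")
```

In the examples above, this would replace `os.getenv('OPENAI_API_KEY')` with `require_env('OPENAI_API_KEY')`.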
🎯 Targeted extraction: Let's use a CSS selector to extract only H2 tags!
Using CSS selector to extract H2 tags:
result = crawler.run(
    url="https://www.nbcnews.com/business",
    css_selector="h2"
)
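To make concrete what "only H2 tags" means, the sketch below pulls the text of every `<h2>` element out of a small HTML string using only the standard library. It illustrates the effect of an `h2` selector on sample markup; it is not how Crawl4ai applies `css_selector`, and `H2Collector` and the sample HTML are hypothetical.

```python
from html.parser import HTMLParser
from typing import List


class H2Collector(HTMLParser):
    """Collect the text content of every <h2> element."""

    def __init__(self) -> None:
        super().__init__()
        self._in_h2 = False
        self._buffer: List[str] = []
        self.headings: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self._in_h2 = True
            self._buffer = []

    def handle_data(self, data):
        # Text inside nested tags such as <em> is still part of the heading.
        if self._in_h2:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "h2" and self._in_h2:
            self.headings.append("".join(self._buffer).strip())
            self._in_h2 = False


html = "<h1>Business</h1><h2>Markets <em>rally</em></h2><p>Body text</p><h2>Tech earnings</h2>"
collector = H2Collector()
collector.feed(html)
collector.close()
headings = collector.headings
# headings == ["Markets rally", "Tech earnings"]
```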
🖱️ Let's get interactive: Passing JavaScript code to click the 'Load More' button!
Using JavaScript to click the 'Load More' button:
js_code = ["""
const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More'));
loadMoreButton && loadMoreButton.click();
"""]
crawler = WebCrawler(verbose=True, always_by_pass_cache=True)
result = crawler.run(url="https://www.nbcnews.com/business", js=js_code)
Remember that you can pass multiple JavaScript snippets in the list. They are executed in the order they are passed.
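As an illustration of a multi-snippet list, the sketch below puts a scroll step ahead of the 'Load More' click, so the button has a chance to be on screen before it is clicked. The scroll snippet and the `as_snippet_list` helper are assumptions added for this example; only the click snippet comes from the guide.

```python
from typing import List, Union


def as_snippet_list(js: Union[str, List[str]]) -> List[str]:
    """Return a list of non-empty snippets, preserving the order given."""
    snippets = [js] if isinstance(js, str) else list(js)
    return [s.strip() for s in snippets if s.strip()]


# Hypothetical first step: scroll to the bottom of the page.
scroll_down = "window.scrollTo(0, document.body.scrollHeight);"

# Second step, taken from the example above: click the 'Load More' button.
click_load_more = """
const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More'));
loadMoreButton && loadMoreButton.click();
"""

js_code = as_snippet_list([scroll_down, click_load_more])
```

The resulting `js_code` list could then be passed as the `js` argument in the call shown above.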
Congratulations! You've made it through the Crawl4ai Quickstart Guide! Now go forth and crawl the web like a pro! 🕸️