Import the necessary modules in your Python script:

from crawl4ai.web_crawler import WebCrawler
from crawl4ai.chunking_strategy import *
from crawl4ai.extraction_strategy import *
import os

crawler = WebCrawler()
crawler.warmup()

# Single page crawl
result = crawler.run(
    url='https://www.nbcnews.com/business',
    word_count_threshold=5,  # Minimum word count for an HTML tag to be considered a worthy block
    chunking_strategy=RegexChunking(patterns=["\n\n"]),  # Default is RegexChunking
    extraction_strategy=CosineStrategy(word_count_threshold=10, max_dist=0.2, linkage_method='ward', top_k=3),  # Default is CosineStrategy
    # extraction_strategy=LLMExtractionStrategy(provider="openai/gpt-4o", api_token=os.getenv('OPENAI_API_KEY')),
    bypass_cache=False,
    extract_blocks=True,  # Whether to extract semantic blocks of text from the HTML
    css_selector="",  # e.g. "div.article-body"
    verbose=True,
    include_raw_html=True,  # Whether to include the raw HTML content in the response
)
print(result.model_dump())
-
For more information about how to run Crawl4AI as a local server, please refer to the GitHub repository.
Parameters

| Parameter | Description | Required | Default Value |
| --- | --- | --- | --- |
| urls | A list of URLs to crawl and extract data from. | Yes | - |
| include_raw_html | Whether to include the raw HTML content in the response. | No | false |
| bypass_cache | Whether to force a fresh crawl even if the URL has been previously crawled. | No | false |
| extract_blocks | Whether to extract semantic blocks of text from the HTML. | No | true |
| word_count_threshold | The minimum number of words a block must contain to be considered meaningful (minimum value is 5). | No | 5 |
| extraction_strategy | The strategy to use for extracting content from the HTML (e.g., "CosineStrategy"). | No | CosineStrategy |
| chunking_strategy | The strategy to use for chunking the text before processing (e.g., "RegexChunking"). | No | RegexChunking |
| css_selector | The CSS selector to target specific parts of the HTML for extraction. | No | None |
| verbose | Whether to enable verbose logging. | No | true |
-
Why build this?

In recent times, we've witnessed a surge of startups riding the AI hype wave and charging for services that should rightfully be accessible to everyone. One such example is scraping and crawling web pages and transforming them into a format suitable for Large Language Models (LLMs). We believe that building a business around this is not the right approach; instead, it should be open-source. So, if you possess the skills to build such tools and share our philosophy, we invite you to join our "Robinhood" band and help set these products free for the benefit of all.
Installation

To install and run Crawl4AI as a library or a local server, please refer to the GitHub repository.

Crawl4AI: Open-source, LLM-Friendly Web Scraper
-
-
How to Guide

Welcome to the Crawl4ai Quickstart Guide! Let's dive into some web crawling fun!

First Step: Create an instance of WebCrawler and call the warmup() function.

crawler = WebCrawler()
crawler.warmup()
-
-
-
-
Understanding the 'bypass_cache' and 'include_raw_html' parameters:

First crawl (caches the result):

result = crawler.run(url="https://www.nbcnews.com/business")

Second crawl (forces a fresh crawl):

result = crawler.run(url="https://www.nbcnews.com/business", bypass_cache=True)

Note: Don't forget to set `bypass_cache` to True if you want to try different strategies for the same URL; otherwise, the cached result will be returned. You can also set `always_by_pass_cache` to True in the constructor to always bypass the cache.

Crawl result without raw HTML content:

result = crawler.run(url="https://www.nbcnews.com/business", include_raw_html=False)

The 'include_raw_html' parameter, when set to True, includes the raw HTML content in the response. By default, it is set to True.

Set always_by_pass_cache to True:

crawler.always_by_pass_cache = True
-
-
-
- πΈ
- Let's take a screenshot of the page!
-
-
-
result = crawler.run(
- url="https://www.nbcnews.com/business",
- screenshot=True
-)
-with open("screenshot.png", "wb") as f:
- f.write(base64.b64decode(result.screenshot))
- Run the following command to load the required models. This is optional, but it will boost the performance and speed of the crawler. You need to do this only once.
-
crawl4ai-download-models
-
-
- Alternatively, you can clone the repository and install the package locally:
-
docker build -t crawl4ai .
-# docker build --platform linux/amd64 -t crawl4ai . For Mac users
-docker run -d -p 8000:80 crawl4ai
-
-
-
For more information about how to run Crawl4AI as a local server, please refer to the GitHub repository.
-
-
-
Using RegexChunking:

result = crawler.run(
    url="https://www.nbcnews.com/business",
    chunking_strategy=RegexChunking(patterns=["\n\n"])
)

Using NlpSentenceChunking:

result = crawler.run(
    url="https://www.nbcnews.com/business",
    chunking_strategy=NlpSentenceChunking()
)
-
-
-
-
Let's get smarter with an extraction strategy: CosineStrategy!

Using CosineStrategy:

result = crawler.run(
    url="https://www.nbcnews.com/business",
    extraction_strategy=CosineStrategy(word_count_threshold=20, max_dist=0.2, linkage_method="ward", top_k=3)
)
-
-
-
-
Time to bring in the big guns: LLMExtractionStrategy without instructions!

Using LLMExtractionStrategy without instructions:

result = crawler.run(
    url="https://www.nbcnews.com/business",
    extraction_strategy=LLMExtractionStrategy(provider="openai/gpt-4o", api_token=os.getenv('OPENAI_API_KEY'))
)
-
-
-
-
Let's make it even more interesting: LLMExtractionStrategy with instructions!

Using LLMExtractionStrategy with instructions:

result = crawler.run(
    url="https://www.nbcnews.com/business",
    extraction_strategy=LLMExtractionStrategy(
        provider="openai/gpt-4o",
        api_token=os.getenv('OPENAI_API_KEY'),
        instruction="I am interested in only financial news"
    )
)
-
-
-
-
Targeted extraction: Let's use a CSS selector to extract only H2 tags!

Using a CSS selector to extract H2 tags:

result = crawler.run(
    url="https://www.nbcnews.com/business",
    css_selector="h2"
)
-
-
-
-
Let's get interactive: passing JavaScript code to click the 'Load More' button!
-
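No snippet survived for this step, so here is a hedged sketch. The JavaScript string below finds and clicks a 'Load More' button; the keyword used to pass it to crawler.run (shown commented out) is an assumption — check the GitHub repository for the exact parameter name in your version:

```python
# JavaScript that finds a 'Load More' button on the page and clicks it.
js_code = (
    "const btn = Array.from(document.querySelectorAll('button'))"
    ".find(b => b.textContent.includes('Load More'));"
    "if (btn) { btn.click(); }"
)

# Hypothetical call; the parameter name is an assumption, not confirmed by this guide:
# result = crawler.run(url="https://www.nbcnews.com/business", js=js_code)
```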
Congratulations! You've made it through the Crawl4ai Quickstart Guide! Now go forth and crawl the web like a pro!
-
-
-
-
-
-
-
-
Chunking Strategies

RegexChunking

RegexChunking is a text chunking strategy that splits a given text into smaller parts using regular expressions. This is useful for preparing large texts for processing by language models, ensuring they are divided into manageable segments.

Constructor Parameters:

patterns (list, optional): A list of regular expression patterns used to split the text. Default is to split by double newlines (['\n\n']).

Example usage:

chunker = RegexChunking(patterns=[r'\n\n', r'\. '])
chunks = chunker.chunk("This is a sample text. It will be split into chunks.")
-
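To make the behavior concrete, here is a minimal stdlib sketch of regex-based chunking — an illustration of the idea, not the library's actual implementation:

```python
import re

def regex_chunk(text, patterns=(r"\n\n",)):
    """Split text on each regex pattern in turn and drop empty pieces."""
    chunks = [text]
    for pattern in patterns:
        chunks = [piece for chunk in chunks for piece in re.split(pattern, chunk)]
    return [c.strip() for c in chunks if c.strip()]

print(regex_chunk("First paragraph.\n\nSecond paragraph."))
# ['First paragraph.', 'Second paragraph.']
```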
-
-
-
-
-
NlpSentenceChunking

NlpSentenceChunking uses a natural language processing model to chunk a given text into sentences. This approach leverages SpaCy to accurately split text based on sentence boundaries.

Constructor Parameters:

None.

Example usage:

chunker = NlpSentenceChunking()
chunks = chunker.chunk("This is a sample text. It will be split into sentences.")
-
-
-
-
-
-
TopicSegmentationChunking

TopicSegmentationChunking uses the TextTiling algorithm to segment a given text into topic-based chunks. This method identifies thematic boundaries in the text.

Constructor Parameters:

num_keywords (int, optional): The number of keywords to extract for each topic segment. Default is 3.

Example usage:

chunker = TopicSegmentationChunking(num_keywords=3)
chunks = chunker.chunk("This is a sample text. It will be split into topic-based segments.")
-
-
-
-
-
-
FixedLengthWordChunking

FixedLengthWordChunking splits a given text into chunks of fixed length, based on the number of words.

Constructor Parameters:

chunk_size (int, optional): The number of words in each chunk. Default is 100.

Example usage:

chunker = FixedLengthWordChunking(chunk_size=100)
chunks = chunker.chunk("This is a sample text. It will be split into fixed-length word chunks.")
-
-
-
-
-
-
SlidingWindowChunking

SlidingWindowChunking uses a sliding window approach to chunk a given text. Each chunk has a fixed length, and the window slides by a specified step size.

Constructor Parameters:

window_size (int, optional): The number of words in each chunk. Default is 100.

step (int, optional): The number of words to slide the window. Default is 50.

Example usage:

chunker = SlidingWindowChunking(window_size=100, step=50)
chunks = chunker.chunk("This is a sample text. It will be split using a sliding window approach.")
-
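The two word-based chunkers above are easy to picture. A minimal stdlib sketch of both behaviors (illustrative only, not the library's actual code):

```python
def fixed_length_chunks(text, chunk_size=100):
    """Non-overlapping chunks of chunk_size words each."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]

def sliding_window_chunks(text, window_size=100, step=50):
    """Overlapping chunks: a window_size-word window advanced by step words."""
    words = text.split()
    if len(words) <= window_size:
        return [" ".join(words)]
    return [" ".join(words[i:i + window_size])
            for i in range(0, len(words) - window_size + 1, step)]

print(fixed_length_chunks("a b c d e", chunk_size=2))             # ['a b', 'c d', 'e']
print(sliding_window_chunks("a b c d e", window_size=3, step=2))  # ['a b c', 'c d e']
```

The overlap produced by the sliding window is what keeps context that straddles a chunk boundary from being lost.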
-
-
-
-
-
-
-
Extraction Strategies

NoExtractionStrategy

NoExtractionStrategy is a basic extraction strategy that returns the entire HTML content without any modification. It is useful when no specific extraction is required and only the cleaned HTML and markdown are needed.

LLMExtractionStrategy

LLMExtractionStrategy uses a Language Model (LLM) to extract meaningful blocks or chunks from the given HTML content. This strategy leverages an external provider for language model completions.

Constructor Parameters:

provider (str, optional): The provider to use for the language model completions. Default is DEFAULT_PROVIDER (e.g., openai/gpt-4).

api_token (str, optional): The API token for the provider. If not provided, it will try to load it from the environment variable OPENAI_API_KEY.

instruction (str, optional): An instruction to guide the LLM on how to perform the extraction. This allows users to specify the type of data they are interested in or set the tone of the response. Default is None.

Example usage:

extractor = LLMExtractionStrategy(provider='openai', api_token='your_api_token', instruction='Extract only news about AI.')
extracted_content = extractor.extract(url, html)

By providing clear instructions, users can tailor the extraction process to their specific needs, enhancing the relevance and utility of the extracted content.
-
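Conceptually, the instruction is folded into the prompt sent to the provider. A hypothetical sketch of that idea — the function name and prompt wording are illustrative, not the library's actual prompt:

```python
def build_extraction_prompt(html_chunk, instruction=None):
    """Prepend the user's instruction, if any, to a generic extraction request."""
    prompt = ("Extract the meaningful content blocks from the following HTML:\n\n"
              + html_chunk)
    if instruction:
        prompt = "Instruction: " + instruction + "\n\n" + prompt
    return prompt

print(build_extraction_prompt("<p>AI news</p>", instruction="Extract only news about AI."))
```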
-
-
-
-
-
CosineStrategy

CosineStrategy uses hierarchical clustering based on cosine similarity to extract clusters of text from the given HTML content. This strategy is suitable for identifying related content sections.

Constructor Parameters:

semantic_filter (str, optional): A string containing keywords for filtering relevant documents before clustering. If provided, documents are filtered based on their cosine similarity to the keyword filter embedding. Default is None.

word_count_threshold (int, optional): Minimum number of words per cluster. Default is 20.

max_dist (float, optional): The maximum cophenetic distance on the dendrogram to form clusters. Default is 0.2.

linkage_method (str, optional): The linkage method for hierarchical clustering. Default is 'ward'.

top_k (int, optional): Number of top categories to extract. Default is 3.

model_name (str, optional): The model name for embedding generation. Default is 'BAAI/bge-small-en-v1.5'.

When a semantic_filter is provided, the CosineStrategy applies an embedding-based filtering process to select relevant documents before performing hierarchical clustering.
-
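To illustrate the semantic filtering idea with toy embedding vectors — not the library's implementation, which uses real sentence embeddings from the model above — a tiny sketch of cosine-similarity filtering:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def filter_by_similarity(doc_embeddings, filter_embedding, threshold=0.5):
    """Keep the indices of documents close enough to the filter embedding."""
    return [i for i, emb in enumerate(doc_embeddings)
            if cosine_similarity(emb, filter_embedding) >= threshold]

docs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
print(filter_by_similarity(docs, [1.0, 0.0]))  # [0, 2]
```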
-
-
-
-
-
TopicExtractionStrategy

TopicExtractionStrategy uses the TextTiling algorithm to segment the HTML content into topics and extracts keywords for each segment. This strategy is useful for identifying and summarizing thematic content.

Constructor Parameters:

num_keywords (int, optional): Number of keywords to represent each topic segment. Default is 3.