From edad7b6a742249f324d3baba01095f93fc05912f Mon Sep 17 00:00:00 2001
From: unclecode
- There are two ways to use Crawl4AI: as a library in your Python projects or as a standalone local
- server.
-
- You can also try Crawl4AI in a Google Colab notebook.
- To install Crawl4AI as a library, follow these steps:
- For more information about how to run Crawl4AI as a local server, please refer to the
- GitHub repository.
-
- In recent times, we've witnessed a surge of startups emerging, riding the AI hype wave and charging
- for services that should rightfully be accessible to everyone. One such example is scraping and
- crawling web pages and transforming them into a format suitable for Large Language Models (LLMs).
- We believe that building a business around this is not the right approach; instead, it should
- definitely be open-source. So, if you possess the skills to build such tools and share our
- philosophy, we invite you to join our "Robinhood" band and help set these products free for the
- benefit of all.
-
- To install and run Crawl4AI as a library or a local server, please refer to the
- GitHub repository.
- To install and run Crawl4AI locally or on your own server, the best way is to use Docker. 🐳 Follow
- these steps:
-
- For more detailed instructions and advanced configuration options, please refer to the
- GitHub repository.
-
- There are three ways to use Crawl4AI:
- 🔥🕷️ Crawl4AI: Web Data for your Thoughts
- Try It Now
-
- Create an instance of WebCrawler and call the warmup() function:
-
- crawler = WebCrawler()
- crawler.warmup()
- result = crawler.run(url="https://www.nbcnews.com/business")
- result = crawler.run(url="https://www.nbcnews.com/business", bypass_cache=True)
- result = crawler.run(url="https://www.nbcnews.com/business", include_raw_html=False)
-
- To always bypass the cache, set always_by_pass_cache to True:
- crawler.always_by_pass_cache = True
- result = crawler.run(
- url="https://www.nbcnews.com/business",
- chunking_strategy=RegexChunking(patterns=["\n\n"])
- )
- result = crawler.run(
- url="https://www.nbcnews.com/business",
- chunking_strategy=NlpSentenceChunking()
- )
- result = crawler.run(
- url="https://www.nbcnews.com/business",
- extraction_strategy=CosineStrategy(word_count_threshold=10, max_dist=0.2, linkage_method="ward", top_k=3)
- )
- result = crawler.run(
- url="https://www.nbcnews.com/business",
- extraction_strategy=LLMExtractionStrategy(provider="openai/gpt-4o", api_token=os.getenv('OPENAI_API_KEY'))
- )
- result = crawler.run(
- url="https://www.nbcnews.com/business",
- extraction_strategy=LLMExtractionStrategy(
- provider="openai/gpt-4o",
- api_token=os.getenv('OPENAI_API_KEY'),
- instruction="I am interested in only financial news"
- )
- )
- result = crawler.run(
- url="https://www.nbcnews.com/business",
- css_selector="h2"
- )
- js_code = """
- const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More'));
- loadMoreButton && loadMoreButton.click();
- """
- crawler_strategy = LocalSeleniumCrawlerStrategy(js_code=js_code)
- crawler = WebCrawler(crawler_strategy=crawler_strategy, always_by_pass_cache=True)
- result = crawler.run(url="https://www.nbcnews.com/business")
-
- Installation 💻
-
-
Using Crawl4AI as a Library
-
- pip install git+https://github.com/unclecode/crawl4ai.git
- virtualenv venv
-source venv/bin/activate
-git clone https://github.com/unclecode/crawl4ai.git
-cd crawl4ai
-pip install -e .
-
- from crawl4ai.web_crawler import WebCrawler
-from crawl4ai.chunking_strategy import *
-from crawl4ai.extraction_strategy import *
-import os
-
-crawler = WebCrawler()
-
-# Single page crawl
-result = crawler.run(
-    url='https://www.nbcnews.com/business',
-    word_count_threshold=5,  # Minimum word count for an HTML tag's text to be considered a worthy block
-    chunking_strategy=RegexChunking(patterns=["\\n\\n"]),  # Default is RegexChunking
-    extraction_strategy=CosineStrategy(word_count_threshold=10, max_dist=0.2, linkage_method='ward', top_k=3),  # Default is CosineStrategy
-    # extraction_strategy=LLMExtractionStrategy(provider="openai/gpt-4o", api_token=os.getenv('OPENAI_API_KEY')),
-    bypass_cache=False,
-    extract_blocks=True,  # Whether to extract semantic blocks of text from the HTML
-    css_selector="",  # E.g., "div.article-body"
-    verbose=True,
-    include_raw_html=True,  # Whether to include the raw HTML content in the response
-)
-print(result.model_dump())
- Parameters
-
- | Parameter | Description | Required | Default Value |
- |---|---|---|---|
- | urls | A list of URLs to crawl and extract data from. | Yes | - |
- | include_raw_html | Whether to include the raw HTML content in the response. | No | false |
- | bypass_cache | Whether to force a fresh crawl even if the URL has been previously crawled. | No | false |
- | extract_blocks | Whether to extract semantic blocks of text from the HTML. | No | true |
- | word_count_threshold | The minimum number of words a block must contain to be considered meaningful (minimum value is 5). | No | 5 |
- | extraction_strategy | The strategy to use for extracting content from the HTML (e.g., "CosineStrategy"). | No | CosineStrategy |
- | chunking_strategy | The strategy to use for chunking the text before processing (e.g., "RegexChunking"). | No | RegexChunking |
- | css_selector | The CSS selector to target specific parts of the HTML for extraction. | No | None |
- | verbose | Whether to enable verbose logging. | No | true |
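As a quick summary, the defaults above can be captured in a small Python sketch. The `build_run_kwargs` helper is purely illustrative (it is not part of Crawl4AI); it only mirrors the documented defaults and the one required parameter:

```python
# Documented defaults for crawler.run(), mirroring the parameter table above.
DEFAULTS = {
    "include_raw_html": False,
    "bypass_cache": False,
    "extract_blocks": True,
    "word_count_threshold": 5,
    "extraction_strategy": "CosineStrategy",
    "chunking_strategy": "RegexChunking",
    "css_selector": None,
    "verbose": True,
}

def build_run_kwargs(urls, **overrides):
    """Merge caller overrides onto the defaults; `urls` is the only required parameter."""
    if not urls:
        raise ValueError("`urls` is required")
    kwargs = {**DEFAULTS, **overrides}
    kwargs["urls"] = urls
    return kwargs

kwargs = build_run_kwargs(["https://www.nbcnews.com/business"], bypass_cache=True)
```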
-
- ⚙️ Installation
-
- git clone https://github.com/unclecode/crawl4ai.git
- cd crawl4ai
- docker build -t crawl4ai .
- # On Mac:
- docker build --platform linux/amd64 -t crawl4ai .
- docker run -p 8000:80 crawl4ai
-
- How to Guide
- Create an instance of WebCrawler and call the warmup() function:
-
- crawler = WebCrawler()
-crawler.warmup()
- result = crawler.run(url="https://www.nbcnews.com/business")
- result = crawler.run(url="https://www.nbcnews.com/business", bypass_cache=True)
-
- Set `bypass_cache` to True if you want to try different strategies for the same URL; otherwise, the cached result will be returned. You can also set `always_by_pass_cache` to True in the constructor to always bypass the cache.
-
- result = crawler.run(url="https://www.nbcnews.com/business", include_raw_html=False)
-
- To always bypass the cache, set always_by_pass_cache to True:
- crawler.always_by_pass_cache = True
- result = crawler.run(
- url="https://www.nbcnews.com/business",
- screenshot=True
-)
-import base64  # needed to decode the base64-encoded screenshot
-with open("screenshot.png", "wb") as f:
-    f.write(base64.b64decode(result.screenshot))
- result = crawler.run(
- url="https://www.nbcnews.com/business",
- chunking_strategy=RegexChunking(patterns=["\n\n"])
-)
- result = crawler.run(
- url="https://www.nbcnews.com/business",
- chunking_strategy=NlpSentenceChunking()
-)
- result = crawler.run(
- url="https://www.nbcnews.com/business",
- extraction_strategy=CosineStrategy(word_count_threshold=10, max_dist=0.2, linkage_method="ward", top_k=3)
-)
- result = crawler.run(
- url="https://www.nbcnews.com/business",
- extraction_strategy=LLMExtractionStrategy(provider="openai/gpt-4o", api_token=os.getenv('OPENAI_API_KEY'))
-)
- result = crawler.run(
- url="https://www.nbcnews.com/business",
- extraction_strategy=LLMExtractionStrategy(
- provider="openai/gpt-4o",
- api_token=os.getenv('OPENAI_API_KEY'),
- instruction="I am interested in only financial news"
-)
-)
- result = crawler.run(
- url="https://www.nbcnews.com/business",
- css_selector="h2"
-)
- js_code = ["""
-const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More'));
-loadMoreButton && loadMoreButton.click();
-"""]
-crawler_strategy = LocalSeleniumCrawlerStrategy(js_code=js_code)
-crawler = WebCrawler(crawler_strategy=crawler_strategy, always_by_pass_cache=True)
-result = crawler.run(url="https://www.nbcnews.com/business")
-
- Installation 💻
-
To install Crawl4AI as a library, follow these steps:
-virtualenv venv
-source venv/bin/activate
-pip install "crawl4ai[all] @ git+https://github.com/unclecode/crawl4ai.git"
-
- crawl4ai-download-models
- virtualenv venv
-source venv/bin/activate
-git clone https://github.com/unclecode/crawl4ai.git
-cd crawl4ai
-pip install -e .[all]
-
- docker build -t crawl4ai .
-# docker build --platform linux/amd64 -t crawl4ai . For Mac users
-docker run -d -p 8000:80 crawl4ai
-
- For more information about how to run Crawl4AI as a local server, please refer to the GitHub repository.
\ No newline at end of file
diff --git a/pages/partial/try_it.html b/pages/partial/try_it.html
deleted file mode 100644
index e3033eec..00000000
--- a/pages/partial/try_it.html
+++ /dev/null
@@ -1,217 +0,0 @@
-Loading... Please wait.
-
- There are three ways to use Crawl4AI:
-To install Crawl4AI as a library, follow these steps:
-pip install git+https://github.com/unclecode/crawl4ai.git
- virtualenv venv
-source venv/bin/activate
-git clone https://github.com/unclecode/crawl4ai.git
-cd crawl4ai
-pip install -e .
-
- docker build -t crawl4ai .
-# docker build --platform linux/amd64 -t crawl4ai . For Mac users
-docker run -d -p 8000:80 crawl4ai
-
- For more information about how to run Crawl4AI as a local server, please refer to the GitHub repository.
-Create an instance of WebCrawler and call the warmup() function:
- crawler = WebCrawler()
-crawler.warmup()
- result = crawler.run(url="https://www.nbcnews.com/business")
- result = crawler.run(url="https://www.nbcnews.com/business", bypass_cache=True)
- Set `bypass_cache` to True if you want to try different strategies
- for the same URL; otherwise, the cached result will be returned. You can also set
- `always_by_pass_cache` to True in the constructor to always bypass the cache.
- result = crawler.run(url="https://www.nbcnews.com/business", include_raw_html=False)
-
- To always bypass the cache, set always_by_pass_cache to True:
-crawler.always_by_pass_cache = True
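The caching rules described above can be illustrated with a toy stand-in. This is not the library's cache, just a sketch of the bypass semantics; the class name and `_fetch` helper are invented for the example:

```python
class ToyCachingCrawler:
    """Illustrates bypass_cache / always_by_pass_cache semantics with a dict cache."""
    def __init__(self, always_by_pass_cache=False):
        self.always_by_pass_cache = always_by_pass_cache
        self._cache = {}
        self.fetch_count = 0  # counts actual (non-cached) fetches

    def _fetch(self, url):
        self.fetch_count += 1
        return f"content of {url}"

    def run(self, url, bypass_cache=False):
        if not bypass_cache and not self.always_by_pass_cache and url in self._cache:
            return self._cache[url]  # cached result returned
        result = self._fetch(url)   # fresh crawl
        self._cache[url] = result
        return result

crawler = ToyCachingCrawler()
crawler.run("https://example.com")                     # fresh fetch
crawler.run("https://example.com")                     # served from cache
crawler.run("https://example.com", bypass_cache=True)  # forced fresh fetch
```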
- result = crawler.run(
-url="https://www.nbcnews.com/business",
-chunking_strategy=RegexChunking(patterns=["\n\n"])
-)
- result = crawler.run(
-url="https://www.nbcnews.com/business",
-chunking_strategy=NlpSentenceChunking()
-)
- result = crawler.run(
-url="https://www.nbcnews.com/business",
-extraction_strategy=CosineStrategy(word_count_threshold=20, max_dist=0.2, linkage_method="ward", top_k=3)
-)
- result = crawler.run(
-url="https://www.nbcnews.com/business",
-extraction_strategy=LLMExtractionStrategy(provider="openai/gpt-4o", api_token=os.getenv('OPENAI_API_KEY'))
-)
- result = crawler.run(
-url="https://www.nbcnews.com/business",
-extraction_strategy=LLMExtractionStrategy(
-provider="openai/gpt-4o",
-api_token=os.getenv('OPENAI_API_KEY'),
-instruction="I am interested in only financial news"
-)
-)
- result = crawler.run(
-url="https://www.nbcnews.com/business",
-css_selector="h2"
-)
- js_code = """
-const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More'));
-loadMoreButton && loadMoreButton.click();
-"""
-crawler_strategy = LocalSeleniumCrawlerStrategy(js_code=js_code)
-crawler = WebCrawler(crawler_strategy=crawler_strategy, always_by_pass_cache=True)
-result = crawler.run(url="https://www.nbcnews.com/business")
-
- RegexChunking is a text chunking strategy that splits a given text into smaller parts
- using regular expressions. This is useful for preparing large texts for processing by language
- models, ensuring they are divided into manageable segments.
-
patterns (list, optional): A list of regular expression patterns used to split the
- text. Default is to split by double newlines (['\n\n']).
- chunker = RegexChunking(patterns=[r'\n\n', r'\. '])
-chunks = chunker.chunk("This is a sample text. It will be split into chunks.")
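The splitting idea behind RegexChunking can be re-implemented in a few lines of plain Python. This is only a sketch of the logic, not the library's code:

```python
import re

def regex_chunk(text, patterns=None):
    """Split text successively on each regex pattern, dropping empty pieces."""
    patterns = patterns or [r"\n\n"]  # default: split on double newlines
    chunks = [text]
    for pattern in patterns:
        chunks = [piece for chunk in chunks for piece in re.split(pattern, chunk)]
    return [c.strip() for c in chunks if c.strip()]

chunks = regex_chunk("Para one.\n\nPara two.\n\nPara three.")
# chunks == ["Para one.", "Para two.", "Para three."]
```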
-
-
- NlpSentenceChunking uses a natural language processing model to chunk a given text into
- sentences. This approach leverages SpaCy to accurately split text based on sentence boundaries.
-
chunker = NlpSentenceChunking()
-chunks = chunker.chunk("This is a sample text. It will be split into sentences.")
-
-
- TopicSegmentationChunking uses the TextTiling algorithm to segment a given text into
- topic-based chunks. This method identifies thematic boundaries in the text.
-
num_keywords (int, optional): The number of keywords to extract for each topic
- segment. Default is 3.
- chunker = TopicSegmentationChunking(num_keywords=3)
-chunks = chunker.chunk("This is a sample text. It will be split into topic-based segments.")
-
-
- FixedLengthWordChunking splits a given text into chunks of fixed length, based on the
- number of words.
-
chunk_size (int, optional): The number of words in each chunk. Default is
- 100.
- chunker = FixedLengthWordChunking(chunk_size=100)
-chunks = chunker.chunk("This is a sample text. It will be split into fixed-length word chunks.")
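The fixed-length behavior is simple enough to sketch directly; this toy function (not the library's implementation) groups words into consecutive chunks:

```python
def fixed_length_chunks(text, chunk_size=100):
    """Group the text's words into consecutive chunks of at most chunk_size words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]

chunks = fixed_length_chunks("one two three four five six seven", chunk_size=3)
# chunks == ["one two three", "four five six", "seven"]
```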
-
-
- SlidingWindowChunking uses a sliding window approach to chunk a given text. Each chunk
- has a fixed length, and the window slides by a specified step size.
-
window_size (int, optional): The number of words in each chunk. Default is
- 100.
- step (int, optional): The number of words to slide the window. Default is
- 50.
- chunker = SlidingWindowChunking(window_size=100, step=50)
-chunks = chunker.chunk("This is a sample text. It will be split using a sliding window approach.")
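The sliding-window mechanics can be sketched in plain Python as well. Note this toy version (not the library's code) drops trailing words that do not fill a complete window:

```python
def sliding_window_chunks(text, window_size=100, step=50):
    """Produce overlapping windows of window_size words, advancing by step words."""
    words = text.split()
    if len(words) <= window_size:
        return [" ".join(words)]
    return [
        " ".join(words[start:start + window_size])
        for start in range(0, len(words) - window_size + 1, step)
    ]

chunks = sliding_window_chunks("w1 w2 w3 w4 w5 w6 w7 w8", window_size=4, step=2)
# chunks == ["w1 w2 w3 w4", "w3 w4 w5 w6", "w5 w6 w7 w8"]
```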
-
-
- NoExtractionStrategy is a basic extraction strategy that returns the entire HTML
- content without any modification. It is useful for cases where no specific extraction is required.
- It returns only cleaned HTML and markdown.
-
None.
-extractor = NoExtractionStrategy()
-extracted_content = extractor.extract(url, html)
-
-
- LLMExtractionStrategy uses a Language Model (LLM) to extract meaningful blocks or
- chunks from the given HTML content. This strategy leverages an external provider for language model
- completions.
-
provider (str, optional): The provider to use for the language model completions.
- Default is DEFAULT_PROVIDER (e.g., openai/gpt-4).
- api_token (str, optional): The API token for the provider. If not provided, it will
- try to load from the environment variable OPENAI_API_KEY.
- instruction (str, optional): An instruction to guide the LLM on how to perform the
- extraction. This allows users to specify the type of data they are interested in or set the tone
- of the response. Default is None.
- extractor = LLMExtractionStrategy(provider='openai', api_token='your_api_token', instruction='Extract only news about AI.')
-extracted_content = extractor.extract(url, html)
-
-
- By providing clear instructions, users can tailor the extraction process to their specific needs, enhancing the relevance and utility of the extracted content.
-
- CosineStrategy uses hierarchical clustering based on cosine similarity to extract
- clusters of text from the given HTML content. This strategy is suitable for identifying related
- content sections.
-
semantic_filter (str, optional): A string containing keywords for filtering relevant
- documents before clustering. If provided, documents are filtered based on their cosine
- similarity to the keyword filter embedding. Default is None.
- word_count_threshold (int, optional): Minimum number of words per cluster. Default
- is 20.
- max_dist (float, optional): The maximum cophenetic distance on the dendrogram to
- form clusters. Default is 0.2.
- linkage_method (str, optional): The linkage method for hierarchical clustering.
- Default is 'ward'.
- top_k (int, optional): Number of top categories to extract. Default is
- 3.
- model_name (str, optional): The model name for embedding generation. Default is
- 'BAAI/bge-small-en-v1.5'.
- extractor = CosineStrategy(semantic_filter='artificial intelligence', word_count_threshold=10, max_dist=0.2, linkage_method='ward', top_k=3, model_name='BAAI/bge-small-en-v1.5')
-extracted_content = extractor.extract(url, html)
-
-
- When a semantic_filter is provided, the CosineStrategy applies an
- embedding-based filtering process to select relevant documents before performing hierarchical
- clustering.
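The semantic filtering step can be sketched with plain cosine similarity. The toy 2-d vectors below stand in for real sentence embeddings; this is an illustration of the idea, not the library's implementation:

```python
import math

def cosine_sim(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def filter_relevant(doc_vectors, filter_vector, threshold=0.5):
    """Keep indices of documents whose embedding is close to the keyword-filter embedding."""
    return [
        i for i, vec in enumerate(doc_vectors)
        if cosine_sim(vec, filter_vector) >= threshold
    ]

# Toy embeddings: docs 0 and 2 point roughly along the filter direction.
kept = filter_relevant([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]], [1.0, 0.0])
# kept == [0, 2]
```

Only the documents that survive this filter would then proceed to hierarchical clustering.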
-
- TopicExtractionStrategy uses the TextTiling algorithm to segment the HTML content into
- topics and extracts keywords for each segment. This strategy is useful for identifying and
- summarizing thematic content.
-
num_keywords (int, optional): Number of keywords to represent each topic segment.
- Default is 3.
- extractor = TopicExtractionStrategy(num_keywords=3)
-extracted_content = extractor.extract(url, html)
-
-