diff --git a/README.md b/README.md index 6bbef7e4..8d85d870 100644 --- a/README.md +++ b/README.md @@ -1,4 +1,4 @@ -# Crawl4AI v0.2.77 πŸ•·οΈπŸ€– +# Crawl4AI Async Version πŸ•·οΈπŸ€– [![GitHub Stars](https://img.shields.io/github/stars/unclecode/crawl4ai?style=social)](https://github.com/unclecode/crawl4ai/stargazers) [![GitHub Forks](https://img.shields.io/github/forks/unclecode/crawl4ai?style=social)](https://github.com/unclecode/crawl4ai/network/members) @@ -6,34 +6,22 @@ [![GitHub Pull Requests](https://img.shields.io/github/issues-pr/unclecode/crawl4ai)](https://github.com/unclecode/crawl4ai/pulls) [![License](https://img.shields.io/github/license/unclecode/crawl4ai)](https://github.com/unclecode/crawl4ai/blob/main/LICENSE) -Crawl4AI simplifies web crawling and data extraction, making it accessible for large language models (LLMs) and AI applications. πŸ†“πŸŒ +Crawl4AI simplifies asynchronous web crawling and data extraction, making it accessible for large language models (LLMs) and AI applications. πŸ†“πŸŒ -#### [v0.2.77] - 2024-08-02 +> Looking for the synchronous version? Check out [README.sync.md](./README.sync.md). -Major improvements in functionality, performance, and cross-platform compatibility! πŸš€ - -- 🐳 **Docker enhancements**: - - Significantly improved Dockerfile for easy installation on Linux, Mac, and Windows. -- 🌐 **Official Docker Hub image**: - - Launched our first official image on Docker Hub for streamlined deployment (unclecode/crawl4ai). -- πŸ”§ **Selenium upgrade**: - - Removed dependency on ChromeDriver, now using Selenium's built-in capabilities for better compatibility. -- πŸ–ΌοΈ **Image description**: - - Implemented ability to generate textual descriptions for extracted images from web pages. -- ⚑ **Performance boost**: - - Various improvements to enhance overall speed and performance. - ## Try it Now! 
✨ Play around with this [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1sJPAmeLj5PMrg2VgOwMJ2ubGIcK0cJeX?usp=sharing) -✨ visit our [Documentation Website](https://crawl4ai.com/mkdocs/) +✨ Visit our [Documentation Website](https://crawl4ai.com/mkdocs/) -✨ Check [Demo](https://crawl4ai.com/mkdocs/demo) +✨ Check out the [Demo](https://crawl4ai.com/mkdocs/demo) ## Features ✨ - πŸ†“ Completely free and open-source +- πŸš€ Blazing fast performance, outperforming many paid services - πŸ€– LLM-friendly output formats (JSON, cleaned HTML, markdown) - 🌍 Supports crawling multiple URLs simultaneously - 🎨 Extracts and returns all media tags (Images, Audio, and Video) @@ -43,44 +31,17 @@ Major improvements in functionality, performance, and cross-platform compatibili - πŸ•΅οΈ User-agent customization - πŸ–ΌοΈ Takes screenshots of the page - πŸ“œ Executes multiple custom JavaScripts before crawling +- πŸ“Š Generates structured output without LLM using JsonCssExtractionStrategy - πŸ“š Various chunking strategies: topic-based, regex, sentence, and more - 🧠 Advanced extraction strategies: cosine clustering, LLM, and more -- 🎯 CSS selector support +- 🎯 CSS selector support for precise data extraction - πŸ“ Passes instructions/keywords to refine extraction +- πŸ”’ Proxy support for enhanced privacy and access +- πŸ”„ Session management for complex multi-page crawling scenarios +- 🌐 Asynchronous architecture for improved performance and scalability -# Crawl4AI -## 🌟 Shoutout to Contributors of v0.2.77! - -A big thank you to the amazing contributors who've made this release possible: - -- [@aravindkarnam](https://github.com/aravindkarnam) for the new image description feature -- [@FractalMind](https://github.com/FractalMind) for our official Docker Hub image -- [@ketonkss4](https://github.com/ketonkss4) for helping streamline our Selenium setup - -Your contributions are driving Crawl4AI forward! 
πŸš€ - -## Cool Examples πŸš€ - -### Quick Start - -```python -from crawl4ai import WebCrawler - -# Create an instance of WebCrawler -crawler = WebCrawler() - -# Warm up the crawler (load necessary models) -crawler.warmup() - -# Run the crawler on a URL -result = crawler.run(url="https://www.nbcnews.com/business") - -# Print the extracted content -print(result.markdown) -``` - -## How to install πŸ›  +## Installation πŸ› οΈ ### Using pip 🐍 ```bash @@ -105,118 +66,264 @@ docker pull unclecode/crawl4ai:latest docker run -d -p 8000:80 unclecode/crawl4ai:latest ``` - -## Speed-First Design πŸš€ - -Perhaps the most important design principle for this library is speed. We need to ensure it can handle many links and resources in parallel as quickly as possible. By combining this speed with fast LLMs like Groq, the results will be truly amazing. +## Quick Start πŸš€ ```python -import time -from crawl4ai.web_crawler import WebCrawler -crawler = WebCrawler() -crawler.warmup() +import asyncio +from crawl4ai import AsyncWebCrawler -start = time.time() -url = r"https://www.nbcnews.com/business" -result = crawler.run( url, word_count_threshold=10, bypass_cache=True) -end = time.time() -print(f"Time taken: {end - start}") +async def main(): + async with AsyncWebCrawler(verbose=True) as crawler: + result = await crawler.arun(url="https://www.nbcnews.com/business") + print(result.markdown) + +if __name__ == "__main__": + asyncio.run(main()) ``` -Let's take a look the calculated time for the above code snippet: +## Advanced Usage πŸ”¬ -```bash -[LOG] πŸš€ Crawling done, success: True, time taken: 1.3623387813568115 seconds -[LOG] πŸš€ Content extracted, success: True, time taken: 0.05715131759643555 seconds -[LOG] πŸš€ Extraction, time taken: 0.05750393867492676 seconds. 
-Time taken: 1.439958095550537 +### Executing JavaScript and Using CSS Selectors + +```python +import asyncio +from crawl4ai import AsyncWebCrawler + +async def main(): + async with AsyncWebCrawler(verbose=True) as crawler: + js_code = ["const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More')); loadMoreButton && loadMoreButton.click();"] + result = await crawler.arun( + url="https://www.nbcnews.com/business", + js_code=js_code, + css_selector="article.tease-card", + bypass_cache=True + ) + print(result.extracted_content) + +if __name__ == "__main__": + asyncio.run(main()) ``` -Fetching the content from the page took 1.3623 seconds, and extracting the content took 0.0575 seconds. πŸš€ -### Extract Structured Data from Web Pages πŸ“Š +### Using a Proxy -Crawl all OpenAI models and their fees from the official page. +```python +import asyncio +from crawl4ai import AsyncWebCrawler + +async def main(): + async with AsyncWebCrawler(verbose=True, proxy="http://127.0.0.1:7890") as crawler: + result = await crawler.arun( + url="https://www.nbcnews.com/business", + bypass_cache=True + ) + print(result.markdown) + +if __name__ == "__main__": + asyncio.run(main()) +``` + +### Extracting Structured Data with OpenAI ```python import os -from crawl4ai import WebCrawler +import asyncio +from crawl4ai import AsyncWebCrawler from crawl4ai.extraction_strategy import LLMExtractionStrategy from pydantic import BaseModel, Field class OpenAIModelFee(BaseModel): model_name: str = Field(..., description="Name of the OpenAI model.") input_fee: str = Field(..., description="Fee for input token for the OpenAI model.") - output_fee: str = Field(..., description="Fee for output token ßfor the OpenAI model.") + output_fee: str = Field(..., description="Fee for output token for the OpenAI model.") -url = 'https://openai.com/api/pricing/' -crawler = WebCrawler() -crawler.warmup() +async def main(): + async with 
AsyncWebCrawler(verbose=True) as crawler: + result = await crawler.arun( + url='https://openai.com/api/pricing/', + word_count_threshold=1, + extraction_strategy=LLMExtractionStrategy( + provider="openai/gpt-4o", api_token=os.getenv('OPENAI_API_KEY'), + schema=OpenAIModelFee.schema(), + extraction_type="schema", + instruction="""From the crawled content, extract all mentioned model names along with their fees for input and output tokens. + Do not miss any models in the entire content. One extracted model JSON format should look like this: + {"model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", "output_fee": "US$30.00 / 1M tokens"}.""" + ), + bypass_cache=True, + ) + print(result.extracted_content) -result = crawler.run( - url=url, - word_count_threshold=1, - extraction_strategy= LLMExtractionStrategy( - provider= "openai/gpt-4o", api_token = os.getenv('OPENAI_API_KEY'), - schema=OpenAIModelFee.schema(), - extraction_type="schema", - instruction="""From the crawled content, extract all mentioned model names along with their fees for input and output tokens. - Do not miss any models in the entire content. One extracted model JSON format should look like this: - {"model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", "output_fee": "US$30.00 / 1M tokens"}.""" - ), - bypass_cache=True, - ) - -print(result.extracted_content) +if __name__ == "__main__": + asyncio.run(main()) ``` -### Execute JS, Filter Data with CSS Selector, and Clustering +### Advanced Multi-Page Crawling with JavaScript Execution + +Crawl4AI excels at handling complex scenarios, such as crawling multiple pages with dynamic content loaded via JavaScript. 
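A key building block in such crawls is waiting until freshly loaded content actually differs from what was seen before clicking "next". That polling loop can be sketched as a standalone asyncio helper; the helper name and parameters here are illustrative, not part of the crawl4ai API:

```python
import asyncio

async def wait_for_change(fetch, previous, interval=0.5, timeout=10.0):
    """Poll `fetch()` until it returns a non-empty value different from `previous`.

    `fetch` is any zero-argument coroutine function, e.g. one that reads the
    first commit title from the page. Times out if nothing new appears.
    """
    async def poll():
        while True:
            current = await fetch()
            if current and current != previous:
                return current
            await asyncio.sleep(interval)

    return await asyncio.wait_for(poll(), timeout)
```

In a GitHub-commits crawl, `fetch` would read the text of the first commit heading via the page handle exposed to the hook, and `previous` would be the first commit title seen on the prior page.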
Here's an example of crawling GitHub commits across multiple pages: ```python -from crawl4ai import WebCrawler -from crawl4ai.chunking_strategy import CosineStrategy +import asyncio +import re +from bs4 import BeautifulSoup +from crawl4ai import AsyncWebCrawler -js_code = ["const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More')); loadMoreButton && loadMoreButton.click();"] +async def crawl_typescript_commits(): + first_commit = "" + async def on_execution_started(page): + nonlocal first_commit + try: + while True: + await page.wait_for_selector('li.Box-sc-g0xbh4-0 h4') + commit = await page.query_selector('li.Box-sc-g0xbh4-0 h4') + commit = await commit.evaluate('(element) => element.textContent') + commit = re.sub(r'\s+', '', commit) + if commit and commit != first_commit: + first_commit = commit + break + await asyncio.sleep(0.5) + except Exception as e: + print(f"Warning: New content didn't appear after JavaScript execution: {e}") -crawler = WebCrawler() -crawler.warmup() + async with AsyncWebCrawler(verbose=True) as crawler: + crawler.crawler_strategy.set_hook('on_execution_started', on_execution_started) -result = crawler.run( - url="https://www.nbcnews.com/business", - js=js_code, - css_selector="p", - extraction_strategy=CosineStrategy(semantic_filter="technology") -) + url = "https://github.com/microsoft/TypeScript/commits/main" + session_id = "typescript_commits_session" + all_commits = [] -print(result.extracted_content) + js_next_page = """ + const button = document.querySelector('a[data-testid="pagination-next-button"]'); + if (button) button.click(); + """ + + for page in range(3): # Crawl 3 pages + result = await crawler.arun( + url=url, + session_id=session_id, + css_selector="li.Box-sc-g0xbh4-0", + js=js_next_page if page > 0 else None, + bypass_cache=True, + js_only=page > 0 + ) + + assert result.success, f"Failed to crawl page {page + 1}" + + soup = 
BeautifulSoup(result.cleaned_html, 'html.parser') + commits = soup.select("li") + all_commits.extend(commits) + + print(f"Page {page + 1}: Found {len(commits)} commits") + + await crawler.crawler_strategy.kill_session(session_id) + print(f"Successfully crawled {len(all_commits)} commits across 3 pages") + +if __name__ == "__main__": + asyncio.run(crawl_typescript_commits()) ``` -### Extract Structured Data from Web Pages With Proxy and BaseUrl +This example demonstrates Crawl4AI's ability to handle complex scenarios where content is loaded asynchronously. It crawls multiple pages of GitHub commits, executing JavaScript to load new content and using custom hooks to ensure data is loaded before proceeding. + +### Using JsonCssExtractionStrategy + +The `JsonCssExtractionStrategy` allows for precise extraction of structured data from web pages using CSS selectors. ```python -from crawl4ai import WebCrawler -from crawl4ai.extraction_strategy import LLMExtractionStrategy +import asyncio +import json +from crawl4ai import AsyncWebCrawler +from crawl4ai.extraction_strategy import JsonCssExtractionStrategy -def create_crawler(): - crawler = WebCrawler(verbose=True, proxy="http://127.0.0.1:7890") - crawler.warmup() - return crawler +async def extract_news_teasers(): + schema = { + "name": "News Teaser Extractor", + "baseSelector": ".wide-tease-item__wrapper", + "fields": [ + { + "name": "category", + "selector": ".unibrow span[data-testid='unibrow-text']", + "type": "text", + }, + { + "name": "headline", + "selector": ".wide-tease-item__headline", + "type": "text", + }, + { + "name": "summary", + "selector": ".wide-tease-item__description", + "type": "text", + }, + { + "name": "time", + "selector": "[data-testid='wide-tease-date']", + "type": "text", + }, + { + "name": "image", + "type": "nested", + "selector": "picture.teasePicture img", + "fields": [ + {"name": "src", "type": "attribute", "attribute": "src"}, + {"name": "alt", "type": "attribute", "attribute": "alt"}, + ], 
+ }, + { + "name": "link", + "selector": "a[href]", + "type": "attribute", + "attribute": "href", + }, + ], + } -crawler = create_crawler() + extraction_strategy = JsonCssExtractionStrategy(schema, verbose=True) -crawler.warmup() + async with AsyncWebCrawler(verbose=True) as crawler: + result = await crawler.arun( + url="https://www.nbcnews.com/business", + extraction_strategy=extraction_strategy, + bypass_cache=True, + ) -result = crawler.run( - url="https://www.nbcnews.com/business", - extraction_strategy=LLMExtractionStrategy( - provider="openai/gpt-4o", - api_token="sk-", - base_url="https://api.openai.com/v1" - ) -) + assert result.success, "Failed to crawl the page" -print(result.markdown) + news_teasers = json.loads(result.extracted_content) + print(f"Successfully extracted {len(news_teasers)} news teasers") + print(json.dumps(news_teasers[0], indent=2)) + +if __name__ == "__main__": + asyncio.run(extract_news_teasers()) ``` +## Speed Comparison πŸš€ + +Crawl4AI is designed with speed as a primary focus. Our goal is to provide the fastest possible response with high-quality data extraction, minimizing abstractions between the data and the user. + +We've conducted a speed comparison between Crawl4AI and Firecrawl, a paid service. The results demonstrate Crawl4AI's superior performance: + +``` +Firecrawl: +Time taken: 7.02 seconds +Content length: 42074 characters +Images found: 49 + +Crawl4AI (simple crawl): +Time taken: 1.60 seconds +Content length: 18238 characters +Images found: 49 + +Crawl4AI (with JavaScript execution): +Time taken: 4.64 seconds +Content length: 40869 characters +Images found: 89 +``` + +As you can see, Crawl4AI outperforms Firecrawl significantly: +- Simple crawl: Crawl4AI is over 4 times faster than Firecrawl. +- With JavaScript execution: Even when executing JavaScript to load more content (doubling the number of images found), Crawl4AI is still faster than Firecrawl's simple crawl. 
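A comparison like this needs nothing more than wall-clock timing around each crawl. A minimal sketch of such a harness is below; the coroutine you pass in is whatever crawl call you are benchmarking, and the label strings are arbitrary:

```python
import asyncio
import time

async def benchmark(label, crawl):
    """Await a crawl coroutine, then report wall-clock time and content size."""
    start = time.perf_counter()
    content = await crawl
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.2f} seconds, {len(content)} characters")
    return elapsed
```

For example, `await benchmark("Crawl4AI (simple crawl)", crawler.arun(url=...))` would time a single `arun` call (assuming the result object is replaced by its markdown or HTML string when you measure content length).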
+ +You can find the full comparison code in our repository at `docs/examples/crawl4ai_vs_firecrawl.py`. + ## Documentation πŸ“š For detailed documentation, including installation instructions, advanced features, and API reference, visit our [Documentation Website](https://crawl4ai.com/mkdocs/). diff --git a/README.sync.md b/README.sync.md new file mode 100644 index 00000000..6bbef7e4 --- /dev/null +++ b/README.sync.md @@ -0,0 +1,244 @@ +# Crawl4AI v0.2.77 πŸ•·οΈπŸ€– + +[![GitHub Stars](https://img.shields.io/github/stars/unclecode/crawl4ai?style=social)](https://github.com/unclecode/crawl4ai/stargazers) +[![GitHub Forks](https://img.shields.io/github/forks/unclecode/crawl4ai?style=social)](https://github.com/unclecode/crawl4ai/network/members) +[![GitHub Issues](https://img.shields.io/github/issues/unclecode/crawl4ai)](https://github.com/unclecode/crawl4ai/issues) +[![GitHub Pull Requests](https://img.shields.io/github/issues-pr/unclecode/crawl4ai)](https://github.com/unclecode/crawl4ai/pulls) +[![License](https://img.shields.io/github/license/unclecode/crawl4ai)](https://github.com/unclecode/crawl4ai/blob/main/LICENSE) + +Crawl4AI simplifies web crawling and data extraction, making it accessible for large language models (LLMs) and AI applications. πŸ†“πŸŒ + +#### [v0.2.77] - 2024-08-02 + +Major improvements in functionality, performance, and cross-platform compatibility! πŸš€ + +- 🐳 **Docker enhancements**: + - Significantly improved Dockerfile for easy installation on Linux, Mac, and Windows. +- 🌐 **Official Docker Hub image**: + - Launched our first official image on Docker Hub for streamlined deployment (unclecode/crawl4ai). +- πŸ”§ **Selenium upgrade**: + - Removed dependency on ChromeDriver, now using Selenium's built-in capabilities for better compatibility. +- πŸ–ΌοΈ **Image description**: + - Implemented ability to generate textual descriptions for extracted images from web pages. 
+- ⚑ **Performance boost**: + - Various improvements to enhance overall speed and performance. + +## Try it Now! + +✨ Play around with this [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1sJPAmeLj5PMrg2VgOwMJ2ubGIcK0cJeX?usp=sharing) + +✨ visit our [Documentation Website](https://crawl4ai.com/mkdocs/) + +✨ Check [Demo](https://crawl4ai.com/mkdocs/demo) + +## Features ✨ + +- πŸ†“ Completely free and open-source +- πŸ€– LLM-friendly output formats (JSON, cleaned HTML, markdown) +- 🌍 Supports crawling multiple URLs simultaneously +- 🎨 Extracts and returns all media tags (Images, Audio, and Video) +- πŸ”— Extracts all external and internal links +- πŸ“š Extracts metadata from the page +- πŸ”„ Custom hooks for authentication, headers, and page modifications before crawling +- πŸ•΅οΈ User-agent customization +- πŸ–ΌοΈ Takes screenshots of the page +- πŸ“œ Executes multiple custom JavaScripts before crawling +- πŸ“š Various chunking strategies: topic-based, regex, sentence, and more +- 🧠 Advanced extraction strategies: cosine clustering, LLM, and more +- 🎯 CSS selector support +- πŸ“ Passes instructions/keywords to refine extraction + +# Crawl4AI + +## 🌟 Shoutout to Contributors of v0.2.77! + +A big thank you to the amazing contributors who've made this release possible: + +- [@aravindkarnam](https://github.com/aravindkarnam) for the new image description feature +- [@FractalMind](https://github.com/FractalMind) for our official Docker Hub image +- [@ketonkss4](https://github.com/ketonkss4) for helping streamline our Selenium setup + +Your contributions are driving Crawl4AI forward! 
πŸš€ + +## Cool Examples πŸš€ + +### Quick Start + +```python +from crawl4ai import WebCrawler + +# Create an instance of WebCrawler +crawler = WebCrawler() + +# Warm up the crawler (load necessary models) +crawler.warmup() + +# Run the crawler on a URL +result = crawler.run(url="https://www.nbcnews.com/business") + +# Print the extracted content +print(result.markdown) +``` + +## How to install πŸ›  + +### Using pip 🐍 +```bash +virtualenv venv +source venv/bin/activate +pip install "crawl4ai @ git+https://github.com/unclecode/crawl4ai.git" +``` + +### Using Docker 🐳 + +```bash +# For Mac users (M1/M2) +# docker build --platform linux/amd64 -t crawl4ai . +docker build -t crawl4ai . +docker run -d -p 8000:80 crawl4ai +``` + +### Using Docker Hub 🐳 + +```bash +docker pull unclecode/crawl4ai:latest +docker run -d -p 8000:80 unclecode/crawl4ai:latest +``` + + +## Speed-First Design πŸš€ + +Perhaps the most important design principle for this library is speed. We need to ensure it can handle many links and resources in parallel as quickly as possible. By combining this speed with fast LLMs like Groq, the results will be truly amazing. + +```python +import time +from crawl4ai.web_crawler import WebCrawler +crawler = WebCrawler() +crawler.warmup() + +start = time.time() +url = r"https://www.nbcnews.com/business" +result = crawler.run( url, word_count_threshold=10, bypass_cache=True) +end = time.time() +print(f"Time taken: {end - start}") +``` + +Let's take a look the calculated time for the above code snippet: + +```bash +[LOG] πŸš€ Crawling done, success: True, time taken: 1.3623387813568115 seconds +[LOG] πŸš€ Content extracted, success: True, time taken: 0.05715131759643555 seconds +[LOG] πŸš€ Extraction, time taken: 0.05750393867492676 seconds. +Time taken: 1.439958095550537 +``` +Fetching the content from the page took 1.3623 seconds, and extracting the content took 0.0575 seconds. 
πŸš€ + +### Extract Structured Data from Web Pages πŸ“Š + +Crawl all OpenAI models and their fees from the official page. + +```python +import os +from crawl4ai import WebCrawler +from crawl4ai.extraction_strategy import LLMExtractionStrategy +from pydantic import BaseModel, Field + +class OpenAIModelFee(BaseModel): + model_name: str = Field(..., description="Name of the OpenAI model.") + input_fee: str = Field(..., description="Fee for input token for the OpenAI model.") + output_fee: str = Field(..., description="Fee for output token for the OpenAI model.") + +url = 'https://openai.com/api/pricing/' +crawler = WebCrawler() +crawler.warmup() + +result = crawler.run( + url=url, + word_count_threshold=1, + extraction_strategy=LLMExtractionStrategy( + provider="openai/gpt-4o", api_token=os.getenv('OPENAI_API_KEY'), + schema=OpenAIModelFee.schema(), + extraction_type="schema", + instruction="""From the crawled content, extract all mentioned model names along with their fees for input and output tokens. + Do not miss any models in the entire content.
One extracted model JSON format should look like this: + {"model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", "output_fee": "US$30.00 / 1M tokens"}.""" + ), + bypass_cache=True, + ) + +print(result.extracted_content) +``` + +### Execute JS, Filter Data with CSS Selector, and Clustering + +```python +from crawl4ai import WebCrawler +from crawl4ai.chunking_strategy import CosineStrategy + +js_code = ["const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More')); loadMoreButton && loadMoreButton.click();"] + +crawler = WebCrawler() +crawler.warmup() + +result = crawler.run( + url="https://www.nbcnews.com/business", + js=js_code, + css_selector="p", + extraction_strategy=CosineStrategy(semantic_filter="technology") +) + +print(result.extracted_content) +``` + +### Extract Structured Data from Web Pages With Proxy and BaseUrl + +```python +from crawl4ai import WebCrawler +from crawl4ai.extraction_strategy import LLMExtractionStrategy + +def create_crawler(): + crawler = WebCrawler(verbose=True, proxy="http://127.0.0.1:7890") + crawler.warmup() + return crawler + +crawler = create_crawler() + +crawler.warmup() + +result = crawler.run( + url="https://www.nbcnews.com/business", + extraction_strategy=LLMExtractionStrategy( + provider="openai/gpt-4o", + api_token="sk-", + base_url="https://api.openai.com/v1" + ) +) + +print(result.markdown) +``` + +## Documentation πŸ“š + +For detailed documentation, including installation instructions, advanced features, and API reference, visit our [Documentation Website](https://crawl4ai.com/mkdocs/). + +## Contributing 🀝 + +We welcome contributions from the open-source community. Check out our [contribution guidelines](https://github.com/unclecode/crawl4ai/blob/main/CONTRIBUTING.md) for more information. + +## License πŸ“„ + +Crawl4AI is released under the [Apache 2.0 License](https://github.com/unclecode/crawl4ai/blob/main/LICENSE). 
+ +## Contact πŸ“§ + +For questions, suggestions, or feedback, feel free to reach out: + +- GitHub: [unclecode](https://github.com/unclecode) +- Twitter: [@unclecode](https://twitter.com/unclecode) +- Website: [crawl4ai.com](https://crawl4ai.com) + +Happy Crawling! πŸ•ΈοΈπŸš€ + +## Star History + +[![Star History Chart](https://api.star-history.com/svg?repos=unclecode/crawl4ai&type=Date)](https://star-history.com/#unclecode/crawl4ai&Date) \ No newline at end of file diff --git a/crawl4ai/model_loader.py b/crawl4ai/model_loader.py index f49a0659..7608ca51 100644 --- a/crawl4ai/model_loader.py +++ b/crawl4ai/model_loader.py @@ -80,47 +80,6 @@ def load_bge_small_en_v1_5(): model, device = set_model_device(model) return tokenizer, model -@lru_cache() -def load_onnx_all_MiniLM_l6_v2(): - from crawl4ai.onnx_embedding import DefaultEmbeddingModel - - model_path = "models/onnx.tar.gz" - model_url = "https://unclecode-files.s3.us-west-2.amazonaws.com/onnx.tar.gz" - __location__ = os.path.realpath( - os.path.join(os.getcwd(), os.path.dirname(__file__))) - download_path = os.path.join(__location__, model_path) - onnx_dir = os.path.join(__location__, "models/onnx") - - # Create the models directory if it does not exist - os.makedirs(os.path.dirname(download_path), exist_ok=True) - - # Download the tar.gz file if it does not exist - if not os.path.exists(download_path): - def download_with_progress(url, filename): - def reporthook(block_num, block_size, total_size): - downloaded = block_num * block_size - percentage = 100 * downloaded / total_size - if downloaded < total_size: - print(f"\rDownloading: {percentage:.2f}% ({downloaded / (1024 * 1024):.2f} MB of {total_size / (1024 * 1024):.2f} MB)", end='') - else: - print("\rDownload complete!") - - urllib.request.urlretrieve(url, filename, reporthook) - - download_with_progress(model_url, download_path) - - # Extract the tar.gz file if the onnx directory does not exist - if not os.path.exists(onnx_dir): - with 
tarfile.open(download_path, "r:gz") as tar: - tar.extractall(path=os.path.join(__location__, "models")) - - # remove the tar.gz file - os.remove(download_path) - - - - model = DefaultEmbeddingModel() - return model @lru_cache() def load_text_classifier(): diff --git a/crawl4ai/onnx_embedding.py b/crawl4ai/onnx_embedding.py deleted file mode 100644 index af5e5f20..00000000 --- a/crawl4ai/onnx_embedding.py +++ /dev/null @@ -1,50 +0,0 @@ -# A dependency-light way to run the onnx model - - -import numpy as np -from typing import List -import os - -__location__ = os.path.realpath(os.path.join(os.getcwd(), os.path.dirname(__file__))) -MODEL_ID = "sentence-transformers/all-MiniLM-L6-v2" - -def normalize(v): - norm = np.linalg.norm(v, axis=1) - norm[norm == 0] = 1e-12 - return v / norm[:, np.newaxis] - -# Sampel implementation of the default sentence-transformers model using ONNX -class DefaultEmbeddingModel(): - - def __init__(self): - from tokenizers import Tokenizer - import onnxruntime as ort - # max_seq_length = 256, for some reason sentence-transformers uses 256 even though the HF config has a max length of 128 - # https://github.com/UKPLab/sentence-transformers/blob/3e1929fddef16df94f8bc6e3b10598a98f46e62d/docs/_static/html/models_en_sentence_embeddings.html#LL480 - self.tokenizer = Tokenizer.from_file(os.path.join(__location__, "models/onnx/tokenizer.json")) - self.tokenizer.enable_truncation(max_length=256) - self.tokenizer.enable_padding(pad_id=0, pad_token="[PAD]", length=256) - self.model = ort.InferenceSession(os.path.join(__location__,"models/onnx/model.onnx")) - - - def __call__(self, documents: List[str], batch_size: int = 32): - all_embeddings = [] - for i in range(0, len(documents), batch_size): - batch = documents[i:i + batch_size] - encoded = [self.tokenizer.encode(d) for d in batch] - input_ids = np.array([e.ids for e in encoded]) - attention_mask = np.array([e.attention_mask for e in encoded]) - onnx_input = { - "input_ids": np.array(input_ids, 
dtype=np.int64), - "attention_mask": np.array(attention_mask, dtype=np.int64), - "token_type_ids": np.array([np.zeros(len(e), dtype=np.int64) for e in input_ids], dtype=np.int64), - } - model_output = self.model.run(None, onnx_input) - last_hidden_state = model_output[0] - # Perform mean pooling with attention weighting - input_mask_expanded = np.broadcast_to(np.expand_dims(attention_mask, -1), last_hidden_state.shape) - embeddings = np.sum(last_hidden_state * input_mask_expanded, 1) / np.clip(input_mask_expanded.sum(1), a_min=1e-9, a_max=None) - embeddings = normalize(embeddings).astype(np.float32) - all_embeddings.append(embeddings) - return np.concatenate(all_embeddings) - diff --git a/docs/examples/quickstart.ipynb b/docs/examples/quickstart.ipynb new file mode 100644 index 00000000..738b21b4 --- /dev/null +++ b/docs/examples/quickstart.ipynb @@ -0,0 +1,442 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Crawl4AI: Advanced Web Crawling and Data Extraction\n", + "\n", + "Welcome to this interactive notebook showcasing Crawl4AI, an advanced asynchronous web crawling and data extraction library.\n", + "\n", + "- GitHub Repository: [https://github.com/unclecode/crawl4ai](https://github.com/unclecode/crawl4ai)\n", + "- Twitter: [@unclecode](https://twitter.com/unclecode)\n", + "- Website: [https://crawl4ai.com](https://crawl4ai.com)\n", + "\n", + "Let's explore the powerful features of Crawl4AI!" 
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Installation\n", + "\n", + "First, let's install Crawl4AI from GitHub:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "!pip install \"crawl4ai @ git+https://github.com/unclecode/crawl4ai.git\"\n", + "!pip install nest-asyncio\n", + "!playwright install" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now, let's import the necessary libraries:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import asyncio\n", + "import nest_asyncio\n", + "from crawl4ai import AsyncWebCrawler\n", + "from crawl4ai.extraction_strategy import JsonCssExtractionStrategy, LLMExtractionStrategy\n", + "import json\n", + "import time\n", + "from pydantic import BaseModel, Field\n", + "\n", + "nest_asyncio.apply()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Basic Usage\n", + "\n", + "Let's start with a simple crawl example:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "async def simple_crawl():\n", + " async with AsyncWebCrawler(verbose=True) as crawler:\n", + " result = await crawler.arun(url=\"https://www.nbcnews.com/business\")\n", + " print(result.markdown[:500]) # Print first 500 characters\n", + "\n", + "await simple_crawl()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Advanced Features\n", + "\n", + "### Executing JavaScript and Using CSS Selectors" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "async def js_and_css():\n", + " async with AsyncWebCrawler(verbose=True) as crawler:\n", + " js_code = [\"const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More')); loadMoreButton && 
loadMoreButton.click();\"]\n", + " result = await crawler.arun(\n", + " url=\"https://www.nbcnews.com/business\",\n", + " js_code=js_code,\n", + " css_selector=\"article.tease-card\",\n", + " bypass_cache=True\n", + " )\n", + " print(result.extracted_content[:500]) # Print first 500 characters\n", + "\n", + "await js_and_css()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Using a Proxy\n", + "\n", + "Note: You'll need to replace the proxy URL with a working proxy for this example to run successfully." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "async def use_proxy():\n", + " async with AsyncWebCrawler(verbose=True, proxy=\"http://your-proxy-url:port\") as crawler:\n", + " result = await crawler.arun(\n", + " url=\"https://www.nbcnews.com/business\",\n", + " bypass_cache=True\n", + " )\n", + " print(result.markdown[:500]) # Print first 500 characters\n", + "\n", + "# Uncomment the following line to run the proxy example\n", + "# await use_proxy()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Extracting Structured Data with OpenAI\n", + "\n", + "Note: You'll need to set your OpenAI API key as an environment variable for this example to work." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import os\n", + "\n", + "class OpenAIModelFee(BaseModel):\n", + " model_name: str = Field(..., description=\"Name of the OpenAI model.\")\n", + " input_fee: str = Field(..., description=\"Fee for input token for the OpenAI model.\")\n", + " output_fee: str = Field(..., description=\"Fee for output token for the OpenAI model.\")\n", + "\n", + "async def extract_openai_fees():\n", + " async with AsyncWebCrawler(verbose=True) as crawler:\n", + " result = await crawler.arun(\n", + " url='https://openai.com/api/pricing/',\n", + " word_count_threshold=1,\n", + " extraction_strategy=LLMExtractionStrategy(\n", + " provider=\"openai/gpt-4o\", api_token=os.getenv('OPENAI_API_KEY'), \n", + " schema=OpenAIModelFee.model_json_schema(),\n", + " extraction_type=\"schema\",\n", + " instruction=\"\"\"From the crawled content, extract all mentioned model names along with their fees for input and output tokens. \n", + " Do not miss any models in the entire content. 
One extracted model JSON format should look like this: \n", + " {\"model_name\": \"GPT-4\", \"input_fee\": \"US$10.00 / 1M tokens\", \"output_fee\": \"US$30.00 / 1M tokens\"}.\"\"\"\n", + " ), \n", + " bypass_cache=True,\n", + " )\n", + " print(result.extracted_content)\n", + "\n", + "# Uncomment the following line to run the OpenAI extraction example\n", + "# await extract_openai_fees()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Advanced Multi-Page Crawling with JavaScript Execution" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import re\n", + "from bs4 import BeautifulSoup\n", + "\n", + "async def crawl_typescript_commits():\n", + " first_commit = \"\"\n", + " async def on_execution_started(page):\n", + " nonlocal first_commit \n", + " try:\n", + " while True:\n", + " await page.wait_for_selector('li.Box-sc-g0xbh4-0 h4')\n", + " commit = await page.query_selector('li.Box-sc-g0xbh4-0 h4')\n", + " commit = await commit.evaluate('(element) => element.textContent')\n", + " commit = re.sub(r'\\s+', '', commit)\n", + " if commit and commit != first_commit:\n", + " first_commit = commit\n", + " break\n", + " await asyncio.sleep(0.5)\n", + " except Exception as e:\n", + " print(f\"Warning: New content didn't appear after JavaScript execution: {e}\")\n", + "\n", + " async with AsyncWebCrawler(verbose=True) as crawler:\n", + " crawler.crawler_strategy.set_hook('on_execution_started', on_execution_started)\n", + "\n", + " url = \"https://github.com/microsoft/TypeScript/commits/main\"\n", + " session_id = \"typescript_commits_session\"\n", + " all_commits = []\n", + "\n", + " js_next_page = \"\"\"\n", + " const button = document.querySelector('a[data-testid=\"pagination-next-button\"]');\n", + " if (button) button.click();\n", + " \"\"\"\n", + "\n", + " for page in range(3): # Crawl 3 pages\n", + " result = await crawler.arun(\n", + " url=url,\n", + " 
session_id=session_id,\n", + " css_selector=\"li.Box-sc-g0xbh4-0\",\n", + " js_code=js_next_page if page > 0 else None,\n", + " bypass_cache=True,\n", + " js_only=page > 0\n", + " )\n", + "\n", + " assert result.success, f\"Failed to crawl page {page + 1}\"\n", + "\n", + " soup = BeautifulSoup(result.cleaned_html, 'html.parser')\n", + " commits = soup.select(\"li\")\n", + " all_commits.extend(commits)\n", + "\n", + " print(f\"Page {page + 1}: Found {len(commits)} commits\")\n", + "\n", + " await crawler.crawler_strategy.kill_session(session_id)\n", + " print(f\"Successfully crawled {len(all_commits)} commits across 3 pages\")\n", + "\n", + "await crawl_typescript_commits()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Using JsonCssExtractionStrategy for Fast Structured Output" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "async def extract_news_teasers():\n", + " schema = {\n", + " \"name\": \"News Teaser Extractor\",\n", + " \"baseSelector\": \".wide-tease-item__wrapper\",\n", + " \"fields\": [\n", + " {\n", + " \"name\": \"category\",\n", + " \"selector\": \".unibrow span[data-testid='unibrow-text']\",\n", + " \"type\": \"text\",\n", + " },\n", + " {\n", + " \"name\": \"headline\",\n", + " \"selector\": \".wide-tease-item__headline\",\n", + " \"type\": \"text\",\n", + " },\n", + " {\n", + " \"name\": \"summary\",\n", + " \"selector\": \".wide-tease-item__description\",\n", + " \"type\": \"text\",\n", + " },\n", + " {\n", + " \"name\": \"time\",\n", + " \"selector\": \"[data-testid='wide-tease-date']\",\n", + " \"type\": \"text\",\n", + " },\n", + " {\n", + " \"name\": \"image\",\n", + " \"type\": \"nested\",\n", + " \"selector\": \"picture.teasePicture img\",\n", + " \"fields\": [\n", + " {\"name\": \"src\", \"type\": \"attribute\", \"attribute\": \"src\"},\n", + " {\"name\": \"alt\", \"type\": \"attribute\", \"attribute\": \"alt\"},\n", + " ],\n", + " },\n", + " 
{\n", + " \"name\": \"link\",\n", + " \"selector\": \"a[href]\",\n", + " \"type\": \"attribute\",\n", + " \"attribute\": \"href\",\n", + " },\n", + " ],\n", + " }\n", + "\n", + " extraction_strategy = JsonCssExtractionStrategy(schema, verbose=True)\n", + "\n", + " async with AsyncWebCrawler(verbose=True) as crawler:\n", + " result = await crawler.arun(\n", + " url=\"https://www.nbcnews.com/business\",\n", + " extraction_strategy=extraction_strategy,\n", + " bypass_cache=True,\n", + " )\n", + "\n", + " assert result.success, \"Failed to crawl the page\"\n", + "\n", + " news_teasers = json.loads(result.extracted_content)\n", + " print(f\"Successfully extracted {len(news_teasers)} news teasers\")\n", + " print(json.dumps(news_teasers[0], indent=2))\n", + "\n", + "await extract_news_teasers()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Speed Comparison\n", + "\n", + "Let's compare the speed of Crawl4AI with Firecrawl, a paid service. Note that we can't run Firecrawl in this Colab environment, so we'll simulate its performance based on previously recorded data." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import time\n", + "\n", + "async def speed_comparison():\n", + " # Simulated Firecrawl performance\n", + " print(\"Firecrawl (simulated):\")\n", + " print(\"Time taken: 7.02 seconds\")\n", + " print(\"Content length: 42074 characters\")\n", + " print(\"Images found: 49\")\n", + " print()\n", + "\n", + " async with AsyncWebCrawler() as crawler:\n", + " # Crawl4AI simple crawl\n", + " start = time.time()\n", + " result = await crawler.arun(\n", + " url=\"https://www.nbcnews.com/business\",\n", + " word_count_threshold=0,\n", + " bypass_cache=True, \n", + " verbose=False\n", + " )\n", + " end = time.time()\n", + " print(\"Crawl4AI (simple crawl):\")\n", + " print(f\"Time taken: {end - start:.2f} seconds\")\n", + " print(f\"Content length: {len(result.markdown)} characters\")\n", + " print(f\"Images found: {result.markdown.count('cldnry.s-nbcnews.com')}\")\n", + " print()\n", + "\n", + " # Crawl4AI with JavaScript execution\n", + " start = time.time()\n", + " result = await crawler.arun(\n", + " url=\"https://www.nbcnews.com/business\",\n", + " js_code=[\"const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More')); loadMoreButton && loadMoreButton.click();\"],\n", + " word_count_threshold=0,\n", + " bypass_cache=True, \n", + " verbose=False\n", + " )\n", + " end = time.time()\n", + " print(\"Crawl4AI (with JavaScript execution):\")\n", + " print(f\"Time taken: {end - start:.2f} seconds\")\n", + " print(f\"Content length: {len(result.markdown)} characters\")\n", + " print(f\"Images found: {result.markdown.count('cldnry.s-nbcnews.com')}\")\n", + "\n", + "await speed_comparison()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "As you can see, Crawl4AI outperforms Firecrawl significantly:\n", + "- Simple crawl: Crawl4AI is typically over 4 times faster than 
Firecrawl.\n", + "- With JavaScript execution: Even when executing JavaScript to load more content (potentially doubling the number of images found), Crawl4AI is still faster than Firecrawl's simple crawl.\n", + "\n", + "Please note that actual performance may vary depending on network conditions and the specific content being crawled." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Conclusion\n", + "\n", + "In this notebook, we've explored the powerful features of Crawl4AI, including:\n", + "\n", + "1. Basic crawling\n", + "2. JavaScript execution and CSS selector usage\n", + "3. Proxy support\n", + "4. Structured data extraction with OpenAI\n", + "5. Advanced multi-page crawling with JavaScript execution\n", + "6. Fast structured output using JsonCssExtractionStrategy\n", + "7. Speed comparison with other services\n", + "\n", + "Crawl4AI offers a fast, flexible, and powerful solution for web crawling and data extraction tasks. Its asynchronous architecture and advanced features make it suitable for a wide range of applications, from simple web scraping to complex, multi-page data extraction scenarios.\n", + "\n", + "For more information and advanced usage, please visit the [Crawl4AI documentation](https://crawl4ai.com/mkdocs/).\n", + "\n", + "Happy crawling!" 
+ ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.8.10" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/requirements.txt b/requirements.txt index 2574cf60..772acd7e 100644 --- a/requirements.txt +++ b/requirements.txt @@ -1,24 +1,22 @@ -numpy==1.25.0 -aiohttp==3.9.5 -aiosqlite==0.20.0 -beautifulsoup4==4.12.3 -fastapi==0.111.0 -html2text==2024.2.26 -httpx==0.27.0 -litellm==1.40.17 -nltk==3.8.1 -pydantic==2.7.4 -python-dotenv==1.0.1 -requests==2.32.3 -rich==13.7.1 -scikit-learn==1.5.0 -selenium==4.23.1 -uvicorn==0.30.1 -transformers==4.41.2 -# webdriver-manager==4.0.1 -# chromedriver-autoinstaller==0.6.4 -torch==2.3.1 -onnxruntime==1.18.0 -tokenizers==0.19.1 -pillow==10.3.0 -slowapi==0.1.9 \ No newline at end of file +numpy>=1.25.0 +aiohttp>=3.9.5 +aiosqlite>=0.20.0 +beautifulsoup4>=4.12.3 +fastapi>=0.111.0 +html2text>=2024.2.26 +httpx>=0.27.0 +litellm>=1.40.17 +nltk>=3.8.1 +pydantic>=2.7.4 +python-dotenv>=1.0.1 +requests>=2.32.3 +rich>=13.7.1 +scikit-learn>=1.5.0 +selenium>=4.23.1 +uvicorn>=0.30.1 +transformers>=4.41.2 +torch>=2.3.1 +tokenizers>=0.19.1 +pillow>=10.3.0 +slowapi>=0.1.9 +playwright>=1.46.0 \ No newline at end of file