Compare commits

..

58 Commits

Author SHA1 Message Date
UncleCode
c0e87abaee fix: update package versions in requirements.txt for compatibility 2024-11-28 21:43:08 +08:00
UncleCode
c8485776fe docs: update README to reflect latest version v0.3.745 2024-11-28 20:04:16 +08:00
UncleCode
aa3e2d0fe6 Merge branch 'main' of https://github.com/unclecode/crawl4ai 2024-11-28 20:03:43 +08:00
UncleCode
98c64f9d5f Merge branch 'next' 2024-11-28 20:03:11 +08:00
UncleCode
7d81c17cca fix: improve handling of CRAWL4_AI_BASE_DIRECTORY environment variable in setup.py 2024-11-28 20:02:39 +08:00
UncleCode
652d396a81 chore: update version to 0.3.745 2024-11-28 20:00:29 +08:00
UncleCode
1d83c493af Enhance setup process and update contributors list
- Acknowledge contributor paulokuong for fixing RAWL4_AI_BASE_DIRECTORY issue
  - Refine base directory handling in `setup.py`
  - Clarify Playwright installation instructions and improve error handling
2024-11-28 19:58:40 +08:00
Paulo Kuong
cf35cbe59e CRAWL4_AI_BASE_DIRECTORY should be Path object instead of string (#298)
Thank you so much for your point. Yes, that's correct. I accept your pull request, and I add your name to a contribution list. Thank you again.
2024-11-28 19:46:36 +08:00
UncleCode
9221c08418 docs: fix link formatting for recent updates section in README 2024-11-28 19:33:36 +08:00
UncleCode
48d43c14b1 docs: fix link formatting for recent updates section in README 2024-11-28 19:33:02 +08:00
UncleCode
776efa74a4 docs: fix link formatting for recent updates section in README 2024-11-28 19:32:32 +08:00
UncleCode
b14e83f499 docs: fix link formatting for recent updates section in README 2024-11-28 19:31:09 +08:00
UncleCode
a9b6b65238 chore: update version to 0.3.744 and add publish.sh to .gitignore 2024-11-28 19:26:50 +08:00
UncleCode
a036b7f122 feat: implement create_box_message utility for formatted error messages and enhance error logging in AsyncWebCrawler 2024-11-28 19:24:07 +08:00
UncleCode
0bccf23db3 docs: update quickstart_async.py to enable example function calls for better demonstration 2024-11-28 18:19:42 +08:00
UncleCode
0cbd594512 Merge branch 'next' - Update README, and quickstart examples 2024-11-28 16:43:16 +08:00
UncleCode
efe93a5f57 docs: enhance README with development TODOs and refine mission statement for clarity 2024-11-28 16:41:11 +08:00
UncleCode
3fda66b85b docs: refine README content for clarity and conciseness, improving descriptions and formatting 2024-11-28 16:36:24 +08:00
UncleCode
ddfb6707b4 docs: update README to reflect new branding and improve section headings for clarity 2024-11-28 16:34:08 +08:00
UncleCode
a69f7a9531 fix: correct typo in function documentation for clarity and accuracy 2024-11-28 16:31:41 +08:00
UncleCode
d583aa43ca refactor: update cache handling in quickstart_async example to use CacheMode enum 2024-11-28 15:53:25 +08:00
UncleCode
3abb573142 docs: update README for version 0.3.743 with improved formatting and contributor acknowledgments 2024-11-28 13:07:59 +08:00
UncleCode
d556dada9f docs: update README to keep details open for extraction capabilities, browser integration, input/output flexibility, utility & debugging, security & accessibility, community & documentation, and cutting-edge features 2024-11-28 13:07:33 +08:00
UncleCode
ce7d49484f docs: update README for version 0.3.743 with new features, enhancements, and contributor acknowledgments 2024-11-28 13:06:46 +08:00
UncleCode
e4acd18429 docs: update README for version 0.3.743 with new features, enhancements, and contributor acknowledgments 2024-11-28 13:06:30 +08:00
UncleCode
c2d4784810 fix: resolve merge conflict in DefaultMarkdownGenerator affecting fit_markdown generation 2024-11-28 12:56:31 +08:00
UncleCode
76bea6c577 Merge branch 'main' into 0.3.743 2024-11-28 12:53:30 +08:00
UncleCode
3ff0b0b2c4 feat: update changelog for version 0.3.743 with new features, improvements, and contributor acknowledgments 2024-11-28 12:48:07 +08:00
UncleCode
a1c7dc17ce Merge branch 'next' of https://github.com/unclecode/crawl4ai into next 2024-11-28 12:45:57 +08:00
UncleCode
24723b2f10 Enhance features and documentation
- Updated version to 0.3.743
  - Improved ManagedBrowser configuration with dynamic host/port
  - Implemented fast HTML formatting in web crawler
  - Enhanced markdown generation with a new generator class
  - Improved sanitization and utility functions
  - Added contributor details and pull request acknowledgments
  - Updated documentation for clearer usage scenarios
  - Adjusted tests to reflect class name changes
2024-11-28 12:45:05 +08:00
Hamza Farhan
f998e9e949 Fix: handled the cases where markdown_with_citations, references_markdown, and filtered_html might not be defined. (#293)
Thanks, dear Farhan, for the changes you made in the code. I accepted and merged them into the main branch. Also, I will add your name to our contributor list. Thank you so much.
2024-11-27 19:20:54 +08:00
zhounan
73661f7d1f docs: enhance development installation instructions (#286)
Thanks for your contribution. I'm merging your changes and I'll add your name to our contributor list. Thank you so much.
2024-11-27 15:04:20 +08:00
UncleCode
b5d4db07d1 Merge branch 'main' of https://github.com/unclecode/crawl4ai 2024-11-27 14:55:58 +08:00
UncleCode
c6a022132b docs: update CONTRIBUTORS.md to acknowledge aadityakanjolia4 for fixing 'CustomHTML2Text' bug 2024-11-27 14:55:56 +08:00
unclecode
195c0ccf8a chore: remove deprecated Docker Compose configurations for crawl4ai service 2024-11-24 19:40:27 +08:00
unclecode
b09a86c0c1 chore: remove deprecated Docker Compose configurations for crawl4ai service 2024-11-24 19:40:10 +08:00
unclecode
de43505ae4 feat: update version to 0.3.742 2024-11-24 19:36:30 +08:00
unclecode
d7c5b900b8 feat: add support for arm64 platform in Docker commands and update INSTALL_TYPE variable in docker-compose 2024-11-24 19:35:53 +08:00
unclecode
edad7b6a74 chore: remove Railway deployment configuration and related documentation 2024-11-24 18:48:39 +08:00
UncleCode
829a1f7992 feat: update version to 0.3.741 and enhance content filtering with heuristic strategy. Fixing the issue that when the past HTML to BM25 content filter does not have any HTML elements. 2024-11-23 19:45:41 +08:00
UncleCode
d729aa7d5e refactor: Add group ID to for images extracted from srcset. 2024-11-23 18:00:32 +08:00
UncleCode
0d0cef3438 feat: add enhanced markdown generation example with citations and file output 2024-11-22 20:14:58 +08:00
UncleCode
d7a112fefe Merge branch 'main' of https://github.com/unclecode/crawl4ai 2024-11-22 19:56:56 +08:00
UncleCode
a5decaa7cf Merge branch '0.3.74' 2024-11-22 19:55:52 +08:00
UncleCode
8dea3f470f chore: update README to include new features and improvements for version 0.3.74 2024-11-22 18:50:12 +08:00
UncleCode
e02935dc5b chore: update README to reflect new features and improvements in version 0.3.74 2024-11-22 18:49:22 +08:00
UncleCode
24ad2fe2dd feat: enhance Markdown generation to include fit_html attribute 2024-11-22 18:47:17 +08:00
UncleCode
571dda6549 Update Redme 2024-11-22 18:27:43 +08:00
UncleCode
006bee4a5a feat: enhance image processing capabilities
- Enhanced image processing with srcset support and validation checks for better image selection.
2024-11-22 16:00:17 +08:00
UncleCode
dbb751c8f0 In this commit, we introduce the new concept of MakrdownGenerationStrategy, which allows us to expand our future strategies to generate better markdown. Right now, we generate raw markdown as we were doing before. We have a new algorithm for fitting markdown based on BM25, and now we add the ability to refine markdown into a citation form. Our links will be extracted and replaced by a citation reference number, and then we will have reference sections at the very end; we add all the links with the descriptions. This format is more suitable for large language models. In case we don't need to pass links, we can reduce the size of the markdown significantly and also attach the list of references as a separate file to a large language model. This commit contains changes for this direction. 2024-11-21 18:21:43 +08:00
程序员阿江(Relakkes)
3439f7886d fix: crawler strategy exception handling and fixes (#271) 2024-11-20 20:30:25 +08:00
Darwing Medina
d418a04602 Fix #260 prevent pass duplicated kwargs to scrapping_strategy (#269)
Thank you for the suggestions. It totally makes sense now. Change to pop operator.
2024-11-20 18:52:11 +08:00
UncleCode
b654c49e55 Update .gitignore to exclude additional scripts and files 2024-11-19 19:32:06 +08:00
UncleCode
fbcff85ecb Remove test files 2024-11-19 19:03:23 +08:00
UncleCode
788c67c29a Merge branch 'main' of https://github.com/unclecode/crawl4ai 2024-11-19 19:02:44 +08:00
UncleCode
2f19d38693 Update .gitignore to include .gitboss/ and todo_executor.md 2024-11-19 19:02:41 +08:00
ntohidikplay
3aae30ed2a test1: trying to push to main 2024-11-19 11:57:07 +01:00
ntohidikplay
593c7ad307 test: trying to push to main 2024-11-19 11:45:26 +01:00
43 changed files with 1564 additions and 4150 deletions

6
.gitignore vendored
View File

@@ -211,5 +211,7 @@ git_issues.md
.docs/
.issues/
.gitboss/
manage-collab.sh
todo_executor.md
protect-all-except-feature.sh
manage-collab.sh
publish.sh

View File

@@ -1,5 +1,53 @@
# Changelog
## [0.3.743] November 27, 2024
Enhance features and documentation
- Updated version to 0.3.743
- Improved ManagedBrowser configuration with dynamic host/port
- Implemented fast HTML formatting in web crawler
- Enhanced markdown generation with a new generator class
- Improved sanitization and utility functions
- Added contributor details and pull request acknowledgments
- Updated documentation for clearer usage scenarios
- Adjusted tests to reflect class name changes
### CONTRIBUTORS.md
Added new contributors and pull request details.
Updated community contributions and acknowledged pull requests.
### crawl4ai/__version__.py
Version update.
Bumped version to 0.3.743.
### crawl4ai/async_crawler_strategy.py
Improved ManagedBrowser configuration.
Enhanced browser initialization with configurable host and debugging port; improved hook execution.
### crawl4ai/async_webcrawler.py
Optimized HTML processing.
Implemented 'fast_format_html' for optimized HTML formatting; applied it when 'prettiify' is enabled.
### crawl4ai/content_scraping_strategy.py
Enhanced markdown generation strategy.
Updated to use DefaultMarkdownGenerator and improved markdown generation with filters option.
### crawl4ai/markdown_generation_strategy.py
Refactored markdown generation class.
Renamed DefaultMarkdownGenerationStrategy to DefaultMarkdownGenerator; added content filter handling.
### crawl4ai/utils.py
Enhanced utility functions.
Improved input sanitization and enhanced HTML formatting method.
### docs/md_v2/advanced/hooks-auth.md
Improved documentation for hooks.
Updated code examples to include cookies in crawler strategy initialization.
### tests/async/test_markdown_genertor.py
Refactored tests to match class renaming.
Updated tests to use renamed DefaultMarkdownGenerator class.
## [0.3.74] November 17, 2024
This changelog details the updates and changes introduced in Crawl4AI version 0.3.74. It's designed to inform developers about new features, modifications to existing components, removals, and other important information.

View File

@@ -10,11 +10,20 @@ We would like to thank the following people for their contributions to Crawl4AI:
## Community Contributors
- [aadityakanjolia4](https://github.com/aadityakanjolia4) - Fix for `CustomHTML2Text` is not defined.
- [FractalMind](https://github.com/FractalMind) - Created the first official Docker Hub image and fixed Dockerfile errors
- [ketonkss4](https://github.com/ketonkss4) - Identified Selenium's new capabilities, helping reduce dependencies
- [jonymusky](https://github.com/jonymusky) - Javascript execution documentation, and wait_for
- [datehoer](https://github.com/datehoer) - Add browser prxy support
## Pull Requests
- [nelzomal](https://github.com/nelzomal) - Enhance development installation instructions [#286](https://github.com/unclecode/crawl4ai/pull/286)
- [HamzaFarhan](https://github.com/HamzaFarhan) - Handled the cases where markdown_with_citations, references_markdown, and filtered_html might not be defined [#293](https://github.com/unclecode/crawl4ai/pull/293)
- [NanmiCoder](https://github.com/NanmiCoder) - fix: crawler strategy exception handling and fixes [#271](https://github.com/unclecode/crawl4ai/pull/271)
- [paulokuong](https://github.com/paulokuong) - fix: RAWL4_AI_BASE_DIRECTORY should be Path object instead of string [#298](https://github.com/unclecode/crawl4ai/pull/298)
## Other Contributors
- [Gokhan](https://github.com/gkhngyk)

571
README.md
View File

@@ -1,4 +1,4 @@
# 🔥🕷️ Crawl4AI: LLM Friendly Web Crawler & Scraper
# 🔥🕷️ Crawl4AI: Crawl Smarter, Faster, Freely. For AI.
<a href="https://trendshift.io/repositories/11716" target="_blank"><img src="https://trendshift.io/api/badge/repositories/11716" alt="unclecode%2Fcrawl4ai | Trendshift" style="width: 250px; height: 55px;" width="250" height="55"/></a>
@@ -9,19 +9,115 @@
[![GitHub Pull Requests](https://img.shields.io/github/issues-pr/unclecode/crawl4ai)](https://github.com/unclecode/crawl4ai/pulls)
[![License](https://img.shields.io/github/license/unclecode/crawl4ai)](https://github.com/unclecode/crawl4ai/blob/main/LICENSE)
Crawl4AI simplifies asynchronous web crawling and data extraction, making it accessible for large language models (LLMs) and AI applications. 🆓🌐
Crawl4AI is the #1 trending GitHub repository, actively maintained by a vibrant community. It delivers blazing-fast, AI-ready web crawling tailored for LLMs, AI agents, and data pipelines. Open source, flexible, and built for real-time performance, Crawl4AI empowers developers with unmatched speed, precision, and deployment ease.
[✨ Check out latest update v0.3.745](#-recent-updates)
## 🧐 Why Crawl4AI?
1. **Built for LLMs**: Creates smart, concise Markdown optimized for RAG and fine-tuning applications.
2. **Lightning Fast**: Delivers results 6x faster with real-time, cost-efficient performance.
3. **Flexible Browser Control**: Offers session management, proxies, and custom hooks for seamless data access.
4. **Heuristic Intelligence**: Uses advanced algorithms for efficient extraction, reducing reliance on costly models.
5. **Open Source & Deployable**: Fully open-source with no API keys—ready for Docker and cloud integration.
6. **Thriving Community**: Actively maintained by a vibrant community and the #1 trending GitHub repository.
## 🚀 Quick Start
1. Install Crawl4AI:
```bash
pip install crawl4ai
```
2. Run a simple web crawl:
```python
import asyncio
from crawl4ai import AsyncWebCrawler, CacheMode
async def main():
async with AsyncWebCrawler(verbose=True) as crawler:
result = await crawler.arun(url="https://www.nbcnews.com/business")
# Soone will be change to result.markdown
print(result.markdown_v2.raw_markdown)
if __name__ == "__main__":
asyncio.run(main())
```
## ✨ Features
<details>
<summary>📝 <strong>Markdown Generation</strong></summary>
- 🧹 **Clean Markdown**: Generates clean, structured Markdown with accurate formatting.
- 🎯 **Fit Markdown**: Heuristic-based filtering to remove noise and irrelevant parts for AI-friendly processing.
- 🔗 **Citations and References**: Converts page links into a numbered reference list with clean citations.
- 🛠️ **Custom Strategies**: Users can create their own Markdown generation strategies tailored to specific needs.
- 📚 **BM25 Algorithm**: Employs BM25-based filtering for extracting core information and removing irrelevant content.
</details>
<details>
<summary>📊 <strong>Structured Data Extraction</strong></summary>
- 🤖 **LLM-Driven Extraction**: Supports all LLMs (open-source and proprietary) for structured data extraction.
- 🧱 **Chunking Strategies**: Implements chunking (topic-based, regex, sentence-level) for targeted content processing.
- 🌌 **Cosine Similarity**: Find relevant content chunks based on user queries for semantic extraction.
- 🔎 **CSS-Based Extraction**: Fast schema-based data extraction using XPath and CSS selectors.
- 🔧 **Schema Definition**: Define custom schemas for extracting structured JSON from repetitive patterns.
</details>
<details>
<summary>🌐 <strong>Browser Integration</strong></summary>
- 🖥️ **Managed Browser**: Use user-owned browsers with full control, avoiding bot detection.
- 🔄 **Remote Browser Control**: Connect to Chrome Developer Tools Protocol for remote, large-scale data extraction.
- 🔒 **Session Management**: Preserve browser states and reuse them for multi-step crawling.
- 🧩 **Proxy Support**: Seamlessly connect to proxies with authentication for secure access.
- ⚙️ **Full Browser Control**: Modify headers, cookies, user agents, and more for tailored crawling setups.
- 🌍 **Multi-Browser Support**: Compatible with Chromium, Firefox, and WebKit.
</details>
<details>
<summary>🔎 <strong>Crawling & Scraping</strong></summary>
- 🖼️ **Media Support**: Extract images, audio, videos, and responsive image formats like `srcset` and `picture`.
- 🚀 **Dynamic Crawling**: Execute JS and wait for async or sync for dynamic content extraction.
- 📸 **Screenshots**: Capture page screenshots during crawling for debugging or analysis.
- 📂 **Raw Data Crawling**: Directly process raw HTML (`raw:`) or local files (`file://`).
- 🔗 **Comprehensive Link Extraction**: Extracts internal, external links, and embedded iframe content.
- 🛠️ **Customizable Hooks**: Define hooks at every step to customize crawling behavior.
- 💾 **Caching**: Cache data for improved speed and to avoid redundant fetches.
- 📄 **Metadata Extraction**: Retrieve structured metadata from web pages.
- 📡 **IFrame Content Extraction**: Seamless extraction from embedded iframe content.
</details>
<details>
<summary>🚀 <strong>Deployment</strong></summary>
- 🐳 **Dockerized Setup**: Optimized Docker image with API server for easy deployment.
- 🔄 **API Gateway**: One-click deployment with secure token authentication for API-based workflows.
- 🌐 **Scalable Architecture**: Designed for mass-scale production and optimized server performance.
- ⚙️ **DigitalOcean Deployment**: Ready-to-deploy configurations for DigitalOcean and similar platforms.
</details>
<details>
<summary>🎯 <strong>Additional Features</strong></summary>
- 🕶️ **Stealth Mode**: Avoid bot detection by mimicking real users.
- 🏷️ **Tag-Based Content Extraction**: Refine crawling based on custom tags, headers, or metadata.
- 🔗 **Link Analysis**: Extract and analyze all links for detailed data exploration.
- 🛡️ **Error Handling**: Robust error management for seamless execution.
- 🔐 **CORS & Static Serving**: Supports filesystem-based caching and cross-origin requests.
- 📖 **Clear Documentation**: Simplified and updated guides for onboarding and advanced usage.
- 🙌 **Community Recognition**: Acknowledges contributors and pull requests for transparency.
</details>
## New in 0.3.74 ✨
- 🚀 **Blazing Fast Scraping:** The scraping process is now significantly faster, often completing in under 100 milliseconds (excluding web fetch time)!
- 📥 **Download Manager:** Integrated file crawling and downloading capabilities, with full control over file management and tracking within the `CrawlResult` object.
- 🔎 **Markdown Filter:** Enhanced content extraction using BM25 algorithm to create cleaner markdown with only relevant webpage content.
- 🗂️ **Local & Raw HTML:** Crawl local files (`file://`) and raw HTML strings (`raw:`) directly.
- 🤖 **Browser Control:** Use your own browser setup for crawling, with persistent contexts and stealth integration to bypass anti-bot measures.
- ☁️ **API & Cache Boost:** CORS support, static file serving, and a new filesystem-based cache for blazing-fast performance. Fine-tune caching with the `CacheMode` enum (ENABLED, DISABLED, READ_ONLY, WRITE_ONLY, BYPASS) and the `always_bypass_cache` parameter.
- 🐳 **API Gateway:** Run Crawl4AI as a local or cloud API service, enabling cross-platform usage through a containerized server with secure token authentication via `CRAWL4AI_API_TOKEN`.
- 🛠️ **Database Improvements:** Enhanced database system for handling larger content sets with improved caching and faster performance.
- 🐛 **Squashed Bugs:** Fixed browser context issues in Docker, memory leaks, enhanced error handling, and improved HTML parsing.
## Try it Now!
@@ -61,11 +157,12 @@ Crawl4AI simplifies asynchronous web crawling and data extraction, making it acc
Crawl4AI offers flexible installation options to suit various use cases. You can install it as a Python package or use Docker.
### Using pip 🐍
<details>
<summary>🐍 <strong>Using pip</strong></summary>
Choose the installation option that best fits your needs:
#### Basic Installation
### Basic Installation
For basic web crawling and scraping tasks:
@@ -75,7 +172,7 @@ pip install crawl4ai
By default, this will install the asynchronous version of Crawl4AI, using Playwright for web crawling.
👉 Note: When you install Crawl4AI, the setup script should automatically install and set up Playwright. However, if you encounter any Playwright-related errors, you can manually install it using one of these methods:
👉 **Note**: When you install Crawl4AI, the setup script should automatically install and set up Playwright. However, if you encounter any Playwright-related errors, you can manually install it using one of these methods:
1. Through the command line:
@@ -91,25 +188,42 @@ By default, this will install the asynchronous version of Crawl4AI, using Playwr
This second method has proven to be more reliable in some cases.
#### Installation with Synchronous Version
---
If you need the synchronous version using Selenium:
### Installation with Synchronous Version
The sync version is deprecated and will be removed in future versions. If you need the synchronous version using Selenium:
```bash
pip install crawl4ai[sync]
```
#### Development Installation
---
### Development Installation
For contributors who plan to modify the source code:
```bash
git clone https://github.com/unclecode/crawl4ai.git
cd crawl4ai
pip install -e .
pip install -e . # Basic installation in editable mode
```
## One-Click Deployment 🚀
Install optional features:
```bash
pip install -e ".[torch]" # With PyTorch features
pip install -e ".[transformer]" # With Transformer features
pip install -e ".[cosine]" # With cosine similarity features
pip install -e ".[sync]" # With synchronous crawling (Selenium)
pip install -e ".[all]" # Install all optional features
```
</details>
<details>
<summary>🚀 <strong>One-Click Deployment</strong></summary>
Deploy your own instance of Crawl4AI with one click:
@@ -120,14 +234,19 @@ Deploy your own instance of Crawl4AI with one click:
The deploy will:
- Set up a Docker container with Crawl4AI
- Configure Playwright and all dependencies
- Start the FastAPI server on port 11235
- Start the FastAPI server on port `11235`
- Set up health checks and auto-deployment
### Using Docker 🐳
</details>
<details>
<summary>🐳 <strong>Using Docker</strong></summary>
Crawl4AI is available as Docker images for easy deployment. You can either pull directly from Docker Hub (recommended) or build from the repository.
#### Option 1: Docker Hub (Recommended)
---
### Option 1: Docker Hub (Recommended)
```bash
# Pull and run from Docker Hub (choose one):
@@ -138,11 +257,16 @@ docker pull unclecode/crawl4ai:gpu # GPU-enabled version
# Run the container
docker run -p 11235:11235 unclecode/crawl4ai:basic # Replace 'basic' with your chosen version
# In case you want to set platform to arm64
docker run --platform linux/arm64 -p 11235:11235 unclecode/crawl4ai:basic
# In case to allocate more shared memory for the container
docker run --shm-size=2gb -p 11235:11235 unclecode/crawl4ai:basic
```
#### Option 2: Build from Repository
---
### Option 2: Build from Repository
```bash
# Clone the repository
@@ -154,11 +278,22 @@ docker build -t crawl4ai:local \
--build-arg INSTALL_TYPE=basic \ # Options: basic, all
.
# In case you want to set platform to arm64
docker build -t crawl4ai:local \
--build-arg INSTALL_TYPE=basic \ # Options: basic, all
--platform linux/arm64 \
.
# Run your local build
docker run -p 11235:11235 crawl4ai:local
```
Quick test (works for both options):
---
### Quick Test
Run a quick test (works for both Docker options):
```python
import requests
@@ -175,143 +310,134 @@ result = requests.get(f"http://localhost:11235/task/{task_id}")
For advanced configuration, environment variables, and usage examples, see our [Docker Deployment Guide](https://crawl4ai.com/mkdocs/basic/docker-deployment/).
</details>
## Quick Start 🚀
## 🔬 Advanced Usage Examples 🔬
You can check the project structure in the directory [https://github.com/unclecode/crawl4ai/docs/examples](docs/examples). Over there, you can find a variety of examples; here, some popular examples are shared.
<details>
<summary>📝 <strong>Heuristic Markdown Generation with Clean and Fit Markdown</strong></summary>
```python
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai import AsyncWebCrawler, CacheMode
from crawl4ai.content_filter_strategy import BM25ContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
async def main():
async with AsyncWebCrawler(verbose=True) as crawler:
result = await crawler.arun(url="https://www.nbcnews.com/business")
print(result.markdown)
if __name__ == "__main__":
asyncio.run(main())
```
## Advanced Usage 🔬
### Executing JavaScript and Using CSS Selectors
```python
import asyncio
from crawl4ai import AsyncWebCrawler
async def main():
async with AsyncWebCrawler(verbose=True) as crawler:
js_code = ["const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More')); loadMoreButton && loadMoreButton.click();"]
async with AsyncWebCrawler(
headless=True,
verbose=True,
) as crawler:
result = await crawler.arun(
url="https://www.nbcnews.com/business",
js_code=js_code,
css_selector=".wide-tease-item__description",
bypass_cache=True
url="https://docs.micronaut.io/4.7.6/guide/",
cache_mode=CacheMode.ENABLED,
markdown_generator=DefaultMarkdownGenerator(
content_filter=BM25ContentFilter(user_query=None, bm25_threshold=1.0)
),
)
print(result.extracted_content)
print(len(result.markdown))
print(len(result.fit_markdown))
print(len(result.markdown_v2.fit_markdown))
if __name__ == "__main__":
asyncio.run(main())
```
### Using a Proxy
</details>
<details>
<summary>🖥️ <strong>Executing JavaScript & Extract Structured Data without LLMs</strong></summary>
```python
import asyncio
from crawl4ai import AsyncWebCrawler
async def main():
async with AsyncWebCrawler(verbose=True, proxy="http://127.0.0.1:7890") as crawler:
result = await crawler.arun(
url="https://www.nbcnews.com/business",
bypass_cache=True
)
print(result.markdown)
if __name__ == "__main__":
asyncio.run(main())
```
### Extracting Structured Data without LLM
The `JsonCssExtractionStrategy` allows for precise extraction of structured data from web pages using CSS selectors.
```python
import asyncio
import json
from crawl4ai import AsyncWebCrawler
from crawl4ai import AsyncWebCrawler, CacheMode
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
import json
async def extract_news_teasers():
async def main():
schema = {
"name": "News Teaser Extractor",
"baseSelector": ".wide-tease-item__wrapper",
"fields": [
{
"name": "category",
"selector": ".unibrow span[data-testid='unibrow-text']",
"type": "text",
},
{
"name": "headline",
"selector": ".wide-tease-item__headline",
"type": "text",
},
{
"name": "summary",
"selector": ".wide-tease-item__description",
"type": "text",
},
{
"name": "time",
"selector": "[data-testid='wide-tease-date']",
"type": "text",
},
{
"name": "image",
"type": "nested",
"selector": "picture.teasePicture img",
"fields": [
{"name": "src", "type": "attribute", "attribute": "src"},
{"name": "alt", "type": "attribute", "attribute": "alt"},
],
},
{
"name": "link",
"selector": "a[href]",
"type": "attribute",
"attribute": "href",
},
],
}
"name": "KidoCode Courses",
"baseSelector": "section.charge-methodology .w-tab-content > div",
"fields": [
{
"name": "section_title",
"selector": "h3.heading-50",
"type": "text",
},
{
"name": "section_description",
"selector": ".charge-content",
"type": "text",
},
{
"name": "course_name",
"selector": ".text-block-93",
"type": "text",
},
{
"name": "course_description",
"selector": ".course-content-text",
"type": "text",
},
{
"name": "course_icon",
"selector": ".image-92",
"type": "attribute",
"attribute": "src"
}
]
}
extraction_strategy = JsonCssExtractionStrategy(schema, verbose=True)
async with AsyncWebCrawler(verbose=True) as crawler:
async with AsyncWebCrawler(
headless=False,
verbose=True
) as crawler:
# Create the JavaScript that handles clicking multiple times
js_click_tabs = """
(async () => {
const tabs = document.querySelectorAll("section.charge-methodology .tabs-menu-3 > div");
for(let tab of tabs) {
// scroll to the tab
tab.scrollIntoView();
tab.click();
// Wait for content to load and animations to complete
await new Promise(r => setTimeout(r, 500));
}
})();
"""
result = await crawler.arun(
url="https://www.nbcnews.com/business",
extraction_strategy=extraction_strategy,
bypass_cache=True,
url="https://www.kidocode.com/degrees/technology",
extraction_strategy=JsonCssExtractionStrategy(schema, verbose=True),
js_code=[js_click_tabs],
cache_mode=CacheMode.BYPASS
)
assert result.success, "Failed to crawl the page"
companies = json.loads(result.extracted_content)
print(f"Successfully extracted {len(companies)} companies")
print(json.dumps(companies[0], indent=2))
news_teasers = json.loads(result.extracted_content)
print(f"Successfully extracted {len(news_teasers)} news teasers")
print(json.dumps(news_teasers[0], indent=2))
if __name__ == "__main__":
asyncio.run(extract_news_teasers())
asyncio.run(main())
```
For more advanced usage examples, check out our [Examples](https://crawl4ai.com/mkdocs/extraction/css-advanced/) section in the documentation.
</details>
### Extracting Structured Data with OpenAI
<details>
<summary>📚 <strong>Extracting Structured Data with LLMs</strong></summary>
```python
import os
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai import AsyncWebCrawler, CacheMode
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from pydantic import BaseModel, Field
@@ -326,6 +452,8 @@ async def main():
url='https://openai.com/api/pricing/',
word_count_threshold=1,
extraction_strategy=LLMExtractionStrategy(
# Here you can use any provider that Litellm library supports, for instance: ollama/qwen2
# provider="ollama/qwen2", api_token="no-token",
provider="openai/gpt-4o", api_token=os.getenv('OPENAI_API_KEY'),
schema=OpenAIModelFee.schema(),
extraction_type="schema",
@@ -333,7 +461,7 @@ async def main():
Do not miss any models in the entire content. One extracted model JSON format should look like this:
{"model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", "output_fee": "US$30.00 / 1M tokens"}."""
),
bypass_cache=True,
cache_mode=CacheMode.BYPASS,
)
print(result.extracted_content)
@@ -341,143 +469,98 @@ if __name__ == "__main__":
asyncio.run(main())
```
### Session Management and Dynamic Content Crawling
</details>
Crawl4AI excels at handling complex scenarios, such as crawling multiple pages with dynamic content loaded via JavaScript. Here's an example of crawling GitHub commits across multiple pages:
<details>
<summary>🤖 <strong>Using You own Browswer with Custome User Profile</strong></summary>
```python
import asyncio
import re
from bs4 import BeautifulSoup
import os, sys
from pathlib import Path
import asyncio, time
from crawl4ai import AsyncWebCrawler
async def crawl_typescript_commits():
first_commit = ""
async def on_execution_started(page):
nonlocal first_commit
try:
while True:
await page.wait_for_selector('li.Box-sc-g0xbh4-0 h4')
commit = await page.query_selector('li.Box-sc-g0xbh4-0 h4')
commit = await commit.evaluate('(element) => element.textContent')
commit = re.sub(r'\s+', '', commit)
if commit and commit != first_commit:
first_commit = commit
break
await asyncio.sleep(0.5)
except Exception as e:
print(f"Warning: New content didn't appear after JavaScript execution: {e}")
async def test_news_crawl():
# Create a persistent user data directory
user_data_dir = os.path.join(Path.home(), ".crawl4ai", "browser_profile")
os.makedirs(user_data_dir, exist_ok=True)
async with AsyncWebCrawler(verbose=True) as crawler:
crawler.crawler_strategy.set_hook('on_execution_started', on_execution_started)
url = "https://github.com/microsoft/TypeScript/commits/main"
session_id = "typescript_commits_session"
all_commits = []
js_next_page = """
const button = document.querySelector('a[data-testid="pagination-next-button"]');
if (button) button.click();
"""
for page in range(3): # Crawl 3 pages
result = await crawler.arun(
url=url,
session_id=session_id,
css_selector="li.Box-sc-g0xbh4-0",
js=js_next_page if page > 0 else None,
bypass_cache=True,
js_only=page > 0
)
assert result.success, f"Failed to crawl page {page + 1}"
soup = BeautifulSoup(result.cleaned_html, 'html.parser')
commits = soup.select("li")
all_commits.extend(commits)
print(f"Page {page + 1}: Found {len(commits)} commits")
await crawler.crawler_strategy.kill_session(session_id)
print(f"Successfully crawled {len(all_commits)} commits across 3 pages")
if __name__ == "__main__":
asyncio.run(crawl_typescript_commits())
async with AsyncWebCrawler(
verbose=True,
headless=True,
user_data_dir=user_data_dir,
use_persistent_context=True,
headers={
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
"Accept-Language": "en-US,en;q=0.5",
"Accept-Encoding": "gzip, deflate, br",
"DNT": "1",
"Connection": "keep-alive",
"Upgrade-Insecure-Requests": "1",
"Sec-Fetch-Dest": "document",
"Sec-Fetch-Mode": "navigate",
"Sec-Fetch-Site": "none",
"Sec-Fetch-User": "?1",
"Cache-Control": "max-age=0",
}
) as crawler:
url = "ADDRESS_OF_A_CHALLENGING_WEBSITE"
result = await crawler.arun(
url,
cache_mode=CacheMode.BYPASS,
magic=True,
)
print(f"Successfully crawled {url}")
print(f"Content length: {len(result.markdown)}")
```
This example demonstrates Crawl4AI's ability to handle complex scenarios where content is loaded asynchronously. It crawls multiple pages of GitHub commits, executing JavaScript to load new content and using custom hooks to ensure data is loaded before proceeding.
For more advanced usage examples, check out our [Examples](https://crawl4ai.com/mkdocs/tutorial/episode_12_Session-Based_Crawling_for_Dynamic_Websites/) section in the documentation.
</details>
## Speed Comparison 🚀
## ✨ Recent Updates
Crawl4AI is designed with speed as a primary focus. Our goal is to provide the fastest possible response with high-quality data extraction, minimizing abstractions between the data and the user.
- 🚀 **Improved ManagedBrowser Configuration**: Dynamic host and port support for more flexible browser management.
- 📝 **Enhanced Markdown Generation**: New generator class for better formatting and customization.
-**Fast HTML Formatting**: Significantly optimized HTML formatting in the web crawler.
- 🛠️ **Utility & Sanitization Upgrades**: Improved sanitization and expanded utility functions for streamlined workflows.
- 👥 **Acknowledgments**: Added contributor details and pull request acknowledgments for better transparency.
We've conducted a speed comparison between Crawl4AI and Firecrawl, a paid service. The results demonstrate Crawl4AI's superior performance:
```bash
Firecrawl:
Time taken: 7.02 seconds
Content length: 42074 characters
Images found: 49
Crawl4AI (simple crawl):
Time taken: 1.60 seconds
Content length: 18238 characters
Images found: 49
Crawl4AI (with JavaScript execution):
Time taken: 4.64 seconds
Content length: 40869 characters
Images found: 89
```
As you can see, Crawl4AI outperforms Firecrawl significantly:
- Simple crawl: Crawl4AI is over 4 times faster than Firecrawl.
- With JavaScript execution: Even when executing JavaScript to load more content (doubling the number of images found), Crawl4AI is still faster than Firecrawl's simple crawl.
You can find the full comparison code in our repository at `docs/examples/crawl4ai_vs_firecrawl.py`.
## Documentation 📚
## 📖 Documentation & Roadmap
For detailed documentation, including installation instructions, advanced features, and API reference, visit our [Documentation Website](https://crawl4ai.com/mkdocs/).
## Crawl4AI Roadmap 🗺️
Moreover to check our development plans and upcoming features, check out our [Roadmap](https://github.com/unclecode/crawl4ai/blob/main/ROADMAP.md).
For detailed information on our development plans and upcoming features, check out our [Roadmap](https://github.com/unclecode/crawl4ai/blob/main/ROADMAP.md).
<details>
<summary>📈 <strong>Development TODOs</strong></summary>
### Advanced Crawling Systems 🔧
- [x] 0. Graph Crawler: Smart website traversal using graph search algorithms for comprehensive nested page extraction
- [ ] 1. Question-Based Crawler: Natural language driven web discovery and content extraction
- [ ] 2. Knowledge-Optimal Crawler: Smart crawling that maximizes knowledge while minimizing data extraction
- [ ] 3. Agentic Crawler: Autonomous system for complex multi-step crawling operations
### Specialized Features 🛠️
- [ ] 4. Automated Schema Generator: Convert natural language to extraction schemas
- [ ] 5. Domain-Specific Scrapers: Pre-configured extractors for common platforms (academic, e-commerce)
- [ ] 6. Web Embedding Index: Semantic search infrastructure for crawled content
### Development Tools 🔨
- [ ] 7. Interactive Playground: Web UI for testing, comparing strategies with AI assistance
- [ ] 8. Performance Monitor: Real-time insights into crawler operations
- [ ] 9. Cloud Integration: One-click deployment solutions across cloud providers
### Community & Growth 🌱
- [ ] 10. Sponsorship Program: Structured support system with tiered benefits
- [ ] 11. Educational Content: "How to Crawl" video series and interactive tutorials
## Contributing 🤝
</details>
## 🤝 Contributing
We welcome contributions from the open-source community. Check out our [contribution guidelines](https://github.com/unclecode/crawl4ai/blob/main/CONTRIBUTING.md) for more information.
## License 📄
## 📄 License
Crawl4AI is released under the [Apache 2.0 License](https://github.com/unclecode/crawl4ai/blob/main/LICENSE).
## Contact 📧
## 📧 Contact
For questions, suggestions, or feedback, feel free to reach out:
@@ -487,32 +570,32 @@ For questions, suggestions, or feedback, feel free to reach out:
Happy Crawling! 🕸️🚀
## 🗾 Mission
# Mission
Our mission is to unlock the value of personal and enterprise data by transforming digital footprints into structured, tradeable assets. Crawl4AI empowers individuals and organizations with open-source tools to extract and structure data, fostering a shared data economy.
Our mission is to unlock the untapped potential of personal and enterprise data in the digital age. In today's world, individuals and organizations generate vast amounts of valuable digital footprints, yet this data remains largely uncapitalized as a true asset.
We envision a future where AI is powered by real human knowledge, ensuring data creators directly benefit from their contributions. By democratizing data and enabling ethical sharing, we are laying the foundation for authentic AI advancement.
Our open-source solution empowers developers and innovators to build tools for data extraction and structuring, laying the foundation for a new era of data ownership. By transforming personal and enterprise data into structured, tradeable assets, we're creating opportunities for individuals to capitalize on their digital footprints and for organizations to unlock the value of their collective knowledge.
<details>
<summary>🔑 <strong>Key Opportunities</strong></summary>
- **Data Capitalization**: Transform digital footprints into measurable, valuable assets.
- **Authentic AI Data**: Provide AI systems with real human insights.
- **Shared Economy**: Create a fair data marketplace that benefits data creators.
This democratization of data represents the first step toward a shared data economy, where willing participation in data sharing drives AI advancement while ensuring the benefits flow back to data creators. Through this approach, we're building a future where AI development is powered by authentic human knowledge rather than synthetic alternatives.
</details>
![Mission Diagram](./docs/assets/pitch-dark.svg)
<details>
<summary>🚀 <strong>Development Pathway</strong></summary>
For a detailed exploration of our vision, opportunities, and pathway forward, please see our [full mission statement](./MISSION.md).
1. **Open-Source Tools**: Community-driven platforms for transparent data extraction.
2. **Digital Asset Structuring**: Tools to organize and value digital knowledge.
3. **Ethical Data Marketplace**: A secure, fair platform for exchanging structured data.
## Key Opportunities
For more details, see our [full mission statement](./MISSION.md).
</details>
- **Data Capitalization**: Transform digital footprints into valuable assets that can appear on personal and enterprise balance sheets
- **Authentic Data**: Unlock the vast reservoir of real human insights and knowledge for AI advancement
- **Shared Economy**: Create new value streams where data creators directly benefit from their contributions
## Development Pathway
1. **Open-Source Foundation**: Building transparent, community-driven data extraction tools
2. **Data Capitalization Platform**: Creating tools to structure and value digital assets
3. **Shared Data Marketplace**: Establishing an economic platform for ethical data exchange
For a detailed exploration of our vision, challenges, and solutions, please see our [full mission statement](./MISSION.md).
## Star History

View File

View File

View File

@@ -1,6 +1,7 @@
# __init__.py
from .async_webcrawler import AsyncWebCrawler, CacheMode
from .models import CrawlResult
from .__version__ import __version__
# __version__ = "0.3.73"

View File

@@ -1,2 +1,2 @@
# crawl4ai/_version.py
__version__ = "0.3.74"
__version__ = "0.3.745"

View File

@@ -15,7 +15,7 @@ import hashlib
import json
import uuid
from .models import AsyncCrawlResponse
from .utils import create_box_message
from playwright_stealth import StealthConfig, stealth_async
stealth_config = StealthConfig(
@@ -35,13 +35,14 @@ stealth_config = StealthConfig(
class ManagedBrowser:
def __init__(self, browser_type: str = "chromium", user_data_dir: Optional[str] = None, headless: bool = False, logger = None):
def __init__(self, browser_type: str = "chromium", user_data_dir: Optional[str] = None, headless: bool = False, logger = None, host: str = "localhost", debugging_port: int = 9222):
self.browser_type = browser_type
self.user_data_dir = user_data_dir
self.headless = headless
self.browser_process = None
self.temp_dir = None
self.debugging_port = 9222
self.debugging_port = debugging_port
self.host = host
self.logger = logger
self.shutting_down = False
@@ -70,7 +71,7 @@ class ManagedBrowser:
# Monitor browser process output for errors
asyncio.create_task(self._monitor_browser_process())
await asyncio.sleep(2) # Give browser time to start
return f"http://localhost:{self.debugging_port}"
return f"http://{self.host}:{self.debugging_port}"
except Exception as e:
await self.cleanup()
raise Exception(f"Failed to start browser: {e}")
@@ -229,6 +230,7 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
self.headless = kwargs.get("headless", True)
self.browser_type = kwargs.get("browser_type", "chromium")
self.headers = kwargs.get("headers", {})
self.cookies = kwargs.get("cookies", [])
self.sessions = {}
self.session_ttl = 1800
self.js_code = js_code
@@ -295,6 +297,8 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
# Set up the default context
if self.default_context:
await self.default_context.set_extra_http_headers(self.headers)
if self.cookies:
await self.default_context.add_cookies(self.cookies)
if self.accept_downloads:
await self.default_context.set_default_timeout(60000)
await self.default_context.set_default_navigation_timeout(60000)
@@ -317,10 +321,10 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
"--disable-infobars",
"--window-position=0,0",
"--ignore-certificate-errors",
"--ignore-certificate-errors-spki-list",
"--ignore-certificate-errors-spki-list"
]
}
# Add channel if specified (try Chrome first)
if self.chrome_channel:
browser_args["channel"] = self.chrome_channel
@@ -413,13 +417,13 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
else:
raise ValueError(f"Invalid hook type: {hook_type}")
async def execute_hook(self, hook_type: str, *args):
async def execute_hook(self, hook_type: str, *args, **kwargs):
hook = self.hooks.get(hook_type)
if hook:
if asyncio.iscoroutinefunction(hook):
return await hook(*args)
return await hook(*args, **kwargs)
else:
return hook(*args)
return hook(*args, **kwargs)
return args[0] if args else None
def update_user_agent(self, user_agent: str):
@@ -639,6 +643,7 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
session_id = kwargs.get("session_id")
# Handle page creation differently for managed browser
context = None
if self.use_managed_browser:
if session_id:
# Reuse existing session if available
@@ -669,6 +674,8 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
# downloads_path=self.downloads_path if self.accept_downloads else None
)
await context.add_cookies([{"name": "cookiesEnabled", "value": "true", "url": url}])
if self.cookies:
await context.add_cookies(self.cookies)
await context.set_extra_http_headers(self.headers)
page = await context.new_page()
self.sessions[session_id] = (context, page, time.time())
@@ -684,6 +691,8 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
proxy={"server": self.proxy} if self.proxy else None,
accept_downloads=self.accept_downloads,
)
if self.cookies:
await context.add_cookies(self.cookies)
await context.set_extra_http_headers(self.headers)
if kwargs.get("override_navigator", False) or kwargs.get("simulate_user", False) or kwargs.get("magic", False):
@@ -753,20 +762,23 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
return response
if not kwargs.get("js_only", False):
await self.execute_hook('before_goto', page)
await self.execute_hook('before_goto', page, context = context)
response = await page.goto(
url,
# wait_until=kwargs.get("wait_until", ["domcontentloaded", "networkidle"]),
wait_until=kwargs.get("wait_until", "domcontentloaded"),
timeout=kwargs.get("page_timeout", 60000)
)
try:
response = await page.goto(
url,
# wait_until=kwargs.get("wait_until", ["domcontentloaded", "networkidle"]),
wait_until=kwargs.get("wait_until", "domcontentloaded"),
timeout=kwargs.get("page_timeout", 60000),
)
except Error as e:
raise RuntimeError(f"Failed on navigating ACS-GOTO :\n{str(e)}")
# response = await page.goto("about:blank")
# await page.evaluate(f"window.location.href = '{url}'")
await self.execute_hook('after_goto', page)
await self.execute_hook('after_goto', page, context = context)
# Get status code and headers
status_code = response.status
@@ -828,9 +840,10 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
for js in js_code:
await page.evaluate(js)
await page.wait_for_load_state('networkidle')
# await page.wait_for_timeout(100)
# Check for on execution event
await self.execute_hook('on_execution_started', page)
await self.execute_hook('on_execution_started', page, context = context)
if kwargs.get("simulate_user", False) or kwargs.get("magic", False):
# Simulate user interactions
@@ -846,6 +859,9 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
await self.smart_wait(page, wait_for, timeout=kwargs.get("page_timeout", 60000))
except Exception as e:
raise RuntimeError(f"Wait condition failed: {str(e)}")
# if not wait_for and js_code:
# await page.wait_for_load_state('networkidle', timeout=5000)
# Update image dimensions
update_image_dimensions_js = """
@@ -913,7 +929,7 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
if kwargs.get("process_iframes", False):
page = await self.process_iframes(page)
await self.execute_hook('before_retrieve_html', page)
await self.execute_hook('before_retrieve_html', page, context = context)
# Check if delay_before_return_html is set then wait for that time
delay_before_return_html = kwargs.get("delay_before_return_html")
if delay_before_return_html:
@@ -924,7 +940,7 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
await self.remove_overlay_elements(page)
html = await page.content()
await self.execute_hook('before_return_html', page, html)
await self.execute_hook('before_return_html', page, html, context = context)
# Check if kwargs has screenshot=True then take screenshot
screenshot_data = None

View File

@@ -1,285 +0,0 @@
import os
from pathlib import Path
import aiosqlite
import asyncio
from typing import Optional, Tuple, Dict
from contextlib import asynccontextmanager
import logging
import json # Added for serialization/deserialization
from .utils import ensure_content_dirs, generate_content_hash
import xxhash
import aiofiles
# Set up logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
DB_PATH = os.path.join(Path.home(), ".crawl4ai")
os.makedirs(DB_PATH, exist_ok=True)
DB_PATH = os.path.join(DB_PATH, "crawl4ai.db")
class AsyncDatabaseManager:
def __init__(self, pool_size: int = 10, max_retries: int = 3):
self.db_path = DB_PATH
self.content_paths = ensure_content_dirs(os.path.dirname(DB_PATH))
self.pool_size = pool_size
self.max_retries = max_retries
self.connection_pool: Dict[int, aiosqlite.Connection] = {}
self.pool_lock = asyncio.Lock()
self.connection_semaphore = asyncio.Semaphore(pool_size)
async def initialize(self):
"""Initialize the database and connection pool"""
await self.ainit_db()
async def cleanup(self):
"""Cleanup connections when shutting down"""
async with self.pool_lock:
for conn in self.connection_pool.values():
await conn.close()
self.connection_pool.clear()
@asynccontextmanager
async def get_connection(self):
"""Connection pool manager"""
async with self.connection_semaphore:
task_id = id(asyncio.current_task())
try:
async with self.pool_lock:
if task_id not in self.connection_pool:
conn = await aiosqlite.connect(
self.db_path,
timeout=30.0
)
await conn.execute('PRAGMA journal_mode = WAL')
await conn.execute('PRAGMA busy_timeout = 5000')
self.connection_pool[task_id] = conn
yield self.connection_pool[task_id]
except Exception as e:
logger.error(f"Connection error: {e}")
raise
finally:
async with self.pool_lock:
if task_id in self.connection_pool:
await self.connection_pool[task_id].close()
del self.connection_pool[task_id]
async def execute_with_retry(self, operation, *args):
"""Execute database operations with retry logic"""
for attempt in range(self.max_retries):
try:
async with self.get_connection() as db:
result = await operation(db, *args)
await db.commit()
return result
except Exception as e:
if attempt == self.max_retries - 1:
logger.error(f"Operation failed after {self.max_retries} attempts: {e}")
raise
await asyncio.sleep(1 * (attempt + 1)) # Exponential backoff
async def ainit_db(self):
"""Initialize database schema"""
async def _init(db):
await db.execute('''
CREATE TABLE IF NOT EXISTS crawled_data (
url TEXT PRIMARY KEY,
html TEXT,
cleaned_html TEXT,
markdown TEXT,
extracted_content TEXT,
success BOOLEAN,
media TEXT DEFAULT "{}",
links TEXT DEFAULT "{}",
metadata TEXT DEFAULT "{}",
screenshot TEXT DEFAULT "",
response_headers TEXT DEFAULT "{}",
downloaded_files TEXT DEFAULT "{}" -- New column added
)
''')
await self.execute_with_retry(_init)
await self.update_db_schema()
async def update_db_schema(self):
"""Update database schema if needed"""
async def _check_columns(db):
cursor = await db.execute("PRAGMA table_info(crawled_data)")
columns = await cursor.fetchall()
return [column[1] for column in columns]
column_names = await self.execute_with_retry(_check_columns)
# List of new columns to add
new_columns = ['media', 'links', 'metadata', 'screenshot', 'response_headers', 'downloaded_files']
for column in new_columns:
if column not in column_names:
await self.aalter_db_add_column(column)
async def aalter_db_add_column(self, new_column: str):
"""Add new column to the database"""
async def _alter(db):
if new_column == 'response_headers':
await db.execute(f'ALTER TABLE crawled_data ADD COLUMN {new_column} TEXT DEFAULT "{{}}"')
else:
await db.execute(f'ALTER TABLE crawled_data ADD COLUMN {new_column} TEXT DEFAULT ""')
logger.info(f"Added column '{new_column}' to the database.")
await self.execute_with_retry(_alter)
async def aget_cached_url(self, url: str) -> Optional[Tuple[str, str, str, str, str, bool, str, str, str, str]]:
"""Retrieve cached URL data"""
async def _get(db):
async with db.execute(
'''
SELECT url, html, cleaned_html, markdown,
extracted_content, success, media, links,
metadata, screenshot, response_headers,
downloaded_files
FROM crawled_data WHERE url = ?
''',
(url,)
) as cursor:
row = await cursor.fetchone()
if row:
# Load content from files using stored hashes
html = await self._load_content(row[1], 'html') if row[1] else ""
cleaned = await self._load_content(row[2], 'cleaned') if row[2] else ""
markdown = await self._load_content(row[3], 'markdown') if row[3] else ""
extracted = await self._load_content(row[4], 'extracted') if row[4] else ""
screenshot = await self._load_content(row[9], 'screenshots') if row[9] else ""
return (
row[0], # url
html or "", # Return empty string if file not found
cleaned or "",
markdown or "",
extracted or "",
row[5], # success
json.loads(row[6] or '{}'), # media
json.loads(row[7] or '{}'), # links
json.loads(row[8] or '{}'), # metadata
screenshot or "",
json.loads(row[10] or '{}'), # response_headers
json.loads(row[11] or '[]') # downloaded_files
)
return None
try:
return await self.execute_with_retry(_get)
except Exception as e:
logger.error(f"Error retrieving cached URL: {e}")
return None
async def acache_url(self, url: str, html: str, cleaned_html: str,
markdown: str, extracted_content: str, success: bool,
media: str = "{}", links: str = "{}",
metadata: str = "{}", screenshot: str = "",
response_headers: str = "{}", downloaded_files: str = "[]"):
"""Cache URL data with content stored in filesystem"""
# Store content files and get hashes
html_hash = await self._store_content(html, 'html')
cleaned_hash = await self._store_content(cleaned_html, 'cleaned')
markdown_hash = await self._store_content(markdown, 'markdown')
extracted_hash = await self._store_content(extracted_content, 'extracted')
screenshot_hash = await self._store_content(screenshot, 'screenshots')
async def _cache(db):
await db.execute('''
INSERT INTO crawled_data (
url, html, cleaned_html, markdown,
extracted_content, success, media, links, metadata,
screenshot, response_headers, downloaded_files
)
VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
ON CONFLICT(url) DO UPDATE SET
html = excluded.html,
cleaned_html = excluded.cleaned_html,
markdown = excluded.markdown,
extracted_content = excluded.extracted_content,
success = excluded.success,
media = excluded.media,
links = excluded.links,
metadata = excluded.metadata,
screenshot = excluded.screenshot,
response_headers = excluded.response_headers,
downloaded_files = excluded.downloaded_files
''', (url, html_hash, cleaned_hash, markdown_hash, extracted_hash,
success, media, links, metadata, screenshot_hash,
response_headers, downloaded_files))
try:
await self.execute_with_retry(_cache)
except Exception as e:
logger.error(f"Error caching URL: {e}")
async def aget_total_count(self) -> int:
"""Get total number of cached URLs"""
async def _count(db):
async with db.execute('SELECT COUNT(*) FROM crawled_data') as cursor:
result = await cursor.fetchone()
return result[0] if result else 0
try:
return await self.execute_with_retry(_count)
except Exception as e:
logger.error(f"Error getting total count: {e}")
return 0
async def aclear_db(self):
"""Clear all data from the database"""
async def _clear(db):
await db.execute('DELETE FROM crawled_data')
try:
await self.execute_with_retry(_clear)
except Exception as e:
logger.error(f"Error clearing database: {e}")
async def aflush_db(self):
"""Drop the entire table"""
async def _flush(db):
await db.execute('DROP TABLE IF EXISTS crawled_data')
try:
await self.execute_with_retry(_flush)
except Exception as e:
logger.error(f"Error flushing database: {e}")
async def _store_content(self, content: str, content_type: str) -> str:
"""Store content in filesystem and return hash"""
if not content:
return ""
content_hash = generate_content_hash(content)
file_path = os.path.join(self.content_paths[content_type], content_hash)
# Only write if file doesn't exist
if not os.path.exists(file_path):
async with aiofiles.open(file_path, 'w', encoding='utf-8') as f:
await f.write(content)
return content_hash
async def _load_content(self, content_hash: str, content_type: str) -> Optional[str]:
"""Load content from filesystem by hash"""
if not content_hash:
return None
file_path = os.path.join(self.content_paths[content_type], content_hash)
try:
async with aiofiles.open(file_path, 'r', encoding='utf-8') as f:
return await f.read()
except:
logger.error(f"Failed to load content: {file_path}")
return None
# Create a singleton instance
async_db_manager = AsyncDatabaseManager()

View File

@@ -1,344 +0,0 @@
import os
import time
from pathlib import Path
from typing import Optional
import json
import asyncio
from .models import CrawlResult
from .async_database import async_db_manager
from .chunking_strategy import *
from .extraction_strategy import *
from .async_crawler_strategy import AsyncCrawlerStrategy, AsyncPlaywrightCrawlerStrategy, AsyncCrawlResponse
from .content_scrapping_strategy import WebScrapingStrategy
from .config import MIN_WORD_THRESHOLD, IMAGE_DESCRIPTION_MIN_WORD_THRESHOLD
from .utils import (
sanitize_input_encode,
InvalidCSSSelectorError,
format_html
)
from .__version__ import __version__ as crawl4ai_version
class AsyncWebCrawler:
def __init__(
self,
crawler_strategy: Optional[AsyncCrawlerStrategy] = None,
always_by_pass_cache: bool = False,
base_directory: str = str(Path.home()),
**kwargs,
):
self.crawler_strategy = crawler_strategy or AsyncPlaywrightCrawlerStrategy(
**kwargs
)
self.always_by_pass_cache = always_by_pass_cache
# self.crawl4ai_folder = os.path.join(Path.home(), ".crawl4ai")
self.crawl4ai_folder = os.path.join(base_directory, ".crawl4ai")
os.makedirs(self.crawl4ai_folder, exist_ok=True)
os.makedirs(f"{self.crawl4ai_folder}/cache", exist_ok=True)
self.ready = False
self.verbose = kwargs.get("verbose", False)
async def __aenter__(self):
await self.crawler_strategy.__aenter__()
await self.awarmup()
return self
async def __aexit__(self, exc_type, exc_val, exc_tb):
await self.crawler_strategy.__aexit__(exc_type, exc_val, exc_tb)
async def awarmup(self):
# Print a message for crawl4ai and its version
if self.verbose:
print(f"[LOG] 🚀 Crawl4AI {crawl4ai_version}")
print("[LOG] 🌤️ Warming up the AsyncWebCrawler")
# await async_db_manager.ainit_db()
# # await async_db_manager.initialize()
# await self.arun(
# url="https://google.com/",
# word_count_threshold=5,
# bypass_cache=False,
# verbose=False,
# )
self.ready = True
if self.verbose:
print("[LOG] 🌞 AsyncWebCrawler is ready to crawl")
async def arun(
self,
url: str,
word_count_threshold=MIN_WORD_THRESHOLD,
extraction_strategy: ExtractionStrategy = None,
chunking_strategy: ChunkingStrategy = RegexChunking(),
bypass_cache: bool = False,
css_selector: str = None,
screenshot: bool = False,
user_agent: str = None,
verbose=True,
disable_cache: bool = False,
no_cache_read: bool = False,
no_cache_write: bool = False,
**kwargs,
) -> CrawlResult:
"""
Runs the crawler for a single source: URL (web, local file, or raw HTML).
Args:
url (str): The URL to crawl. Supported prefixes:
- 'http://' or 'https://': Web URL to crawl.
- 'file://': Local file path to process.
- 'raw:': Raw HTML content to process.
... [other existing parameters]
Returns:
CrawlResult: The result of the crawling and processing.
"""
try:
if disable_cache:
bypass_cache = True
no_cache_read = True
no_cache_write = True
extraction_strategy = extraction_strategy or NoExtractionStrategy()
extraction_strategy.verbose = verbose
if not isinstance(extraction_strategy, ExtractionStrategy):
raise ValueError("Unsupported extraction strategy")
if not isinstance(chunking_strategy, ChunkingStrategy):
raise ValueError("Unsupported chunking strategy")
word_count_threshold = max(word_count_threshold, MIN_WORD_THRESHOLD)
async_response: AsyncCrawlResponse = None
cached = None
screenshot_data = None
extracted_content = None
is_web_url = url.startswith(('http://', 'https://'))
is_local_file = url.startswith("file://")
is_raw_html = url.startswith("raw:")
_url = url if not is_raw_html else "Raw HTML"
start_time = time.perf_counter()
cached_result = None
if is_web_url and (not bypass_cache or not no_cache_read) and not self.always_by_pass_cache:
cached_result = await async_db_manager.aget_cached_url(url)
if cached_result:
html = sanitize_input_encode(cached_result.html)
extracted_content = sanitize_input_encode(cached_result.extracted_content or "")
if screenshot:
screenshot_data = cached_result.screenshot
if not screenshot_data:
cached_result = None
if verbose:
print(
f"[LOG] 1⃣ ✅ Page fetched (cache) for {_url}, success: {bool(html)}, time taken: {time.perf_counter() - start_time:.2f} seconds"
)
if not cached or not html:
t1 = time.perf_counter()
if user_agent:
self.crawler_strategy.update_user_agent(user_agent)
async_response: AsyncCrawlResponse = await self.crawler_strategy.crawl(url, screenshot=screenshot, **kwargs)
html = sanitize_input_encode(async_response.html)
screenshot_data = async_response.screenshot
t2 = time.perf_counter()
if verbose:
print(
f"[LOG] 1⃣ ✅ Page fetched (no-cache) for {_url}, success: {bool(html)}, time taken: {t2 - t1:.2f} seconds"
)
t1 = time.perf_counter()
crawl_result = await self.aprocess_html(
url=url,
html=html,
extracted_content=extracted_content,
word_count_threshold=word_count_threshold,
extraction_strategy=extraction_strategy,
chunking_strategy=chunking_strategy,
css_selector=css_selector,
screenshot=screenshot_data,
verbose=verbose,
is_cached=bool(cached),
async_response=async_response,
bypass_cache=bypass_cache,
is_web_url = is_web_url,
is_local_file = is_local_file,
is_raw_html = is_raw_html,
**kwargs,
)
if async_response:
crawl_result.status_code = async_response.status_code
crawl_result.response_headers = async_response.response_headers
crawl_result.downloaded_files = async_response.downloaded_files
else:
crawl_result.status_code = 200
crawl_result.response_headers = cached_result.response_headers if cached_result else {}
crawl_result.success = bool(html)
crawl_result.session_id = kwargs.get("session_id", None)
if verbose:
print(
f"[LOG] 🔥 🚀 Crawling done for {_url}, success: {crawl_result.success}, time taken: {time.perf_counter() - start_time:.2f} seconds"
)
if not is_raw_html and not no_cache_write:
if not bool(cached_result) or kwargs.get("bypass_cache", False) or self.always_by_pass_cache:
await async_db_manager.acache_url(crawl_result)
return crawl_result
except Exception as e:
if not hasattr(e, "msg"):
e.msg = str(e)
print(f"[ERROR] 🚫 arun(): Failed to crawl {_url}, error: {e.msg}")
return CrawlResult(url=url, html="", markdown = f"[ERROR] 🚫 arun(): Failed to crawl {_url}, error: {e.msg}", success=False, error_message=e.msg)
async def arun_many(
self,
urls: List[str],
word_count_threshold=MIN_WORD_THRESHOLD,
extraction_strategy: ExtractionStrategy = None,
chunking_strategy: ChunkingStrategy = RegexChunking(),
bypass_cache: bool = False,
css_selector: str = None,
screenshot: bool = False,
user_agent: str = None,
verbose=True,
**kwargs,
) -> List[CrawlResult]:
"""
Runs the crawler for multiple sources: URLs (web, local files, or raw HTML).
Args:
urls (List[str]): A list of URLs with supported prefixes:
- 'http://' or 'https://': Web URL to crawl.
- 'file://': Local file path to process.
- 'raw:': Raw HTML content to process.
... [other existing parameters]
Returns:
List[CrawlResult]: The results of the crawling and processing.
"""
semaphore_count = kwargs.get('semaphore_count', 5) # Adjust as needed
semaphore = asyncio.Semaphore(semaphore_count)
async def crawl_with_semaphore(url):
async with semaphore:
return await self.arun(
url,
word_count_threshold=word_count_threshold,
extraction_strategy=extraction_strategy,
chunking_strategy=chunking_strategy,
bypass_cache=bypass_cache,
css_selector=css_selector,
screenshot=screenshot,
user_agent=user_agent,
verbose=verbose,
**kwargs,
)
tasks = [crawl_with_semaphore(url) for url in urls]
results = await asyncio.gather(*tasks, return_exceptions=True)
return [result if not isinstance(result, Exception) else str(result) for result in results]
async def aprocess_html(
self,
url: str,
html: str,
extracted_content: str,
word_count_threshold: int,
extraction_strategy: ExtractionStrategy,
chunking_strategy: ChunkingStrategy,
css_selector: str,
screenshot: str,
verbose: bool,
**kwargs,
) -> CrawlResult:
t = time.perf_counter()
# Extract content from HTML
try:
_url = url if not kwargs.get("is_raw_html", False) else "Raw HTML"
t1 = time.perf_counter()
scrapping_strategy = WebScrapingStrategy()
# result = await scrapping_strategy.ascrap(
result = scrapping_strategy.scrap(
url,
html,
word_count_threshold=word_count_threshold,
css_selector=css_selector,
only_text=kwargs.get("only_text", False),
image_description_min_word_threshold=kwargs.get(
"image_description_min_word_threshold", IMAGE_DESCRIPTION_MIN_WORD_THRESHOLD
),
**kwargs,
)
if result is None:
raise ValueError(f"Process HTML, Failed to extract content from the website: {url}")
except InvalidCSSSelectorError as e:
raise ValueError(str(e))
except Exception as e:
raise ValueError(f"Process HTML, Failed to extract content from the website: {url}, error: {str(e)}")
cleaned_html = sanitize_input_encode(result.get("cleaned_html", ""))
markdown = sanitize_input_encode(result.get("markdown", ""))
fit_markdown = sanitize_input_encode(result.get("fit_markdown", ""))
fit_html = sanitize_input_encode(result.get("fit_html", ""))
media = result.get("media", [])
links = result.get("links", [])
metadata = result.get("metadata", {})
if verbose:
print(
f"[LOG] 2⃣ ✅ Scraping done for {_url}, success: True, time taken: {time.perf_counter() - t1:.2f} seconds"
)
if extracted_content is None and extraction_strategy and chunking_strategy and not isinstance(extraction_strategy, NoExtractionStrategy):
t1 = time.perf_counter()
# Check if extraction strategy is type of JsonCssExtractionStrategy
if isinstance(extraction_strategy, JsonCssExtractionStrategy) or isinstance(extraction_strategy, JsonCssExtractionStrategy):
extraction_strategy.verbose = verbose
extracted_content = extraction_strategy.run(url, [html])
extracted_content = json.dumps(extracted_content, indent=4, default=str, ensure_ascii=False)
else:
sections = chunking_strategy.chunk(markdown)
extracted_content = extraction_strategy.run(url, sections)
extracted_content = json.dumps(extracted_content, indent=4, default=str, ensure_ascii=False)
if verbose:
print(
f"[LOG] 3⃣ ✅ Extraction done for {_url}, time taken: {time.perf_counter() - t1:.2f} seconds"
)
screenshot = None if not screenshot else screenshot
return CrawlResult(
url=url,
html=html,
cleaned_html=format_html(cleaned_html),
markdown=markdown,
fit_markdown=fit_markdown,
fit_html= fit_html,
media=media,
links=links,
metadata=metadata,
screenshot=screenshot,
extracted_content=extracted_content,
success=True,
error_message="",
)
async def aclear_cache(self):
# await async_db_manager.aclear_db()
await async_db_manager.cleanup()
async def aflush_cache(self):
await async_db_manager.aflush_db()
async def aget_cache_size(self):
return await async_db_manager.aget_total_count()

View File

@@ -7,14 +7,14 @@ from pathlib import Path
from typing import Optional, List, Union
import json
import asyncio
from .models import CrawlResult
from .models import CrawlResult, MarkdownGenerationResult
from .async_database import async_db_manager
from .chunking_strategy import *
from .content_filter_strategy import *
from .extraction_strategy import *
from .async_crawler_strategy import AsyncCrawlerStrategy, AsyncPlaywrightCrawlerStrategy, AsyncCrawlResponse
from .cache_context import CacheMode, CacheContext, _legacy_to_cache_mode
from .content_scrapping_strategy import WebScrapingStrategy
from .content_scraping_strategy import WebScrapingStrategy
from .async_logger import AsyncLogger
from .config import (
@@ -25,8 +25,11 @@ from .config import (
from .utils import (
sanitize_input_encode,
InvalidCSSSelectorError,
format_html
format_html,
fast_format_html,
create_box_message
)
from urllib.parse import urlparse
import random
from .__version__ import __version__ as crawl4ai_version
@@ -325,15 +328,15 @@ class AsyncWebCrawler:
if not hasattr(e, "msg"):
e.msg = str(e)
# print(f"{Fore.RED}{self.tag_format('ERROR')} {self.log_icons['ERROR']} Failed to crawl {cache_context.display_url[:URL_LOG_SHORTEN_LENGTH]}... | {e.msg}{Style.RESET_ALL}")
self.logger.error_status(
url=cache_context.display_url,
error=e.msg,
error=create_box_message(e.msg, type = "error"),
tag="ERROR"
)
return CrawlResult(
url=url,
html="",
markdown=f"[ERROR] 🚫 arun(): Failed to crawl {cache_context.display_url}, error: {e.msg}",
success=False,
error_message=e.msg
)
@@ -476,8 +479,8 @@ class AsyncWebCrawler:
html,
word_count_threshold=word_count_threshold,
css_selector=css_selector,
only_text=kwargs.get("only_text", False),
image_description_min_word_threshold=kwargs.get(
only_text=kwargs.pop("only_text", False),
image_description_min_word_threshold=kwargs.pop(
"image_description_min_word_threshold", IMAGE_DESCRIPTION_MIN_WORD_THRESHOLD
),
content_filter = content_filter,
@@ -491,6 +494,8 @@ class AsyncWebCrawler:
except Exception as e:
raise ValueError(f"Process HTML, Failed to extract content from the website: {url}, error: {str(e)}")
markdown_v2: MarkdownGenerationResult = result.get("markdown_v2", None)
cleaned_html = sanitize_input_encode(result.get("cleaned_html", ""))
markdown = sanitize_input_encode(result.get("markdown", ""))
fit_markdown = sanitize_input_encode(result.get("fit_markdown", ""))
@@ -532,16 +537,18 @@ class AsyncWebCrawler:
"timing": time.perf_counter() - t1
}
)
screenshot = None if not screenshot else screenshot
if kwargs.get("prettiify", False):
cleaned_html = fast_format_html(cleaned_html)
return CrawlResult(
url=url,
html=html,
cleaned_html=format_html(cleaned_html),
cleaned_html=cleaned_html,
markdown_v2=markdown_v2,
markdown=markdown,
fit_markdown=fit_markdown,
fit_html= fit_html,

View File

@@ -10,6 +10,13 @@ from abc import ABC, abstractmethod
from snowballstemmer import stemmer
# import regex
# def tokenize_text(text):
# # Regular expression to match words or CJK (Chinese, Japanese, Korean) characters
# pattern = r'\p{L}+|\p{N}+|[\p{Script=Han}\p{Script=Hiragana}\p{Script=Katakana}ー]|[\p{P}]'
# return regex.findall(pattern, text)
# from nltk.stem import PorterStemmer
# ps = PorterStemmer()
class RelevantContentFilter(ABC):
@@ -57,9 +64,14 @@ class RelevantContentFilter(ABC):
query_parts = []
# Title
if soup.title:
query_parts.append(soup.title.string)
elif soup.find('h1'):
try:
title = soup.title.string
if title:
query_parts.append(title)
except Exception:
pass
if soup.find('h1'):
query_parts.append(soup.find('h1').get_text())
# Meta tags
@@ -81,7 +93,7 @@ class RelevantContentFilter(ABC):
return ' '.join(filter(None, query_parts))
def extract_text_chunks(self, body: Tag) -> List[Tuple[str, str]]:
def extract_text_chunks(self, body: Tag, min_word_threshold: int = None) -> List[Tuple[str, str]]:
"""
Extracts text chunks from a BeautifulSoup body element while preserving order.
Returns list of tuples (text, tag_name) for classification.
@@ -155,6 +167,9 @@ class RelevantContentFilter(ABC):
if text:
chunks.append((chunk_index, text, 'content', body))
if min_word_threshold:
chunks = [chunk for chunk in chunks if len(chunk[1].split()) >= min_word_threshold]
return chunks
@@ -274,15 +289,26 @@ class BM25ContentFilter(RelevantContentFilter):
}
self.stemmer = stemmer(language)
def filter_content(self, html: str) -> List[str]:
def filter_content(self, html: str, min_word_threshold: int = None) -> List[str]:
"""Implements content filtering using BM25 algorithm with priority tag handling"""
if not html or not isinstance(html, str):
return []
soup = BeautifulSoup(html, 'lxml')
# Check if body is present
if not soup.body:
# Wrap in body tag if missing
soup = BeautifulSoup(f'<body>{html}</body>', 'lxml')
body = soup.find('body')
query = self.extract_page_query(soup.find('head'), body)
candidates = self.extract_text_chunks(body)
query = self.extract_page_query(soup, body)
if not query:
return []
# return [self.clean_element(soup)]
candidates = self.extract_text_chunks(body, min_word_threshold)
if not candidates:
return []
@@ -299,6 +325,10 @@ class BM25ContentFilter(RelevantContentFilter):
for _, chunk, _, _ in candidates]
tokenized_query = [self.stemmer.stemWord(word) for word in query.lower().split()]
# tokenized_corpus = [[self.stemmer.stemWord(word) for word in tokenize_text(chunk.lower())]
# for _, chunk, _, _ in candidates]
# tokenized_query = [self.stemmer.stemWord(word) for word in tokenize_text(query.lower())]
# Clean from stop words and noise
tokenized_corpus = [clean_tokens(tokens) for tokens in tokenized_corpus]
tokenized_query = clean_tokens(tokenized_query)
@@ -326,3 +356,147 @@ class BM25ContentFilter(RelevantContentFilter):
selected_candidates.sort(key=lambda x: x[0])
return [self.clean_element(tag) for _, _, tag in selected_candidates]
class HeuristicContentFilter(RelevantContentFilter):
def __init__(self):
super().__init__()
# Weights for different heuristics
self.tag_weights = {
'article': 10,
'main': 8,
'section': 5,
'div': 3,
'p': 2,
'pre': 2,
'code': 2,
'blockquote': 2,
'li': 1,
'span': 1,
}
self.max_depth = 5 # Maximum depth from body to consider
def filter_content(self, html: str) -> List[str]:
"""Implements heuristic content filtering without relying on a query."""
if not html or not isinstance(html, str):
return []
soup = BeautifulSoup(html, 'lxml')
# Ensure there is a body tag
if not soup.body:
soup = BeautifulSoup(f'<body>{html}</body>', 'lxml')
body = soup.body
# Extract candidate text chunks
candidates = self.extract_text_chunks(body)
if not candidates:
return []
# Score each candidate
scored_candidates = []
for index, text, tag_type, tag in candidates:
score = self.score_element(tag, text)
if score > 0:
scored_candidates.append((score, index, text, tag))
# Sort candidates by score and then by document order
scored_candidates.sort(key=lambda x: (-x[0], x[1]))
# Extract the top candidates (e.g., top 5)
top_candidates = scored_candidates[:5] # Adjust the number as needed
# Sort the top candidates back to their original document order
top_candidates.sort(key=lambda x: x[1])
# Clean and return the content
return [self.clean_element(tag) for _, _, _, tag in top_candidates]
def score_element(self, tag: Tag, text: str) -> float:
"""Compute a score for an element based on heuristics."""
if not text or not tag:
return 0
# Exclude unwanted tags
if self.is_excluded(tag):
return 0
# Text density
text_length = len(text.strip())
html_length = len(str(tag))
text_density = text_length / html_length if html_length > 0 else 0
# Link density
link_text_length = sum(len(a.get_text().strip()) for a in tag.find_all('a'))
link_density = link_text_length / text_length if text_length > 0 else 0
# Tag weight
tag_weight = self.tag_weights.get(tag.name, 1)
# Depth factor (prefer elements closer to the body tag)
depth = self.get_depth(tag)
depth_weight = max(self.max_depth - depth, 1) / self.max_depth
# Compute the final score
score = (text_density * tag_weight * depth_weight) / (1 + link_density)
return score
def get_depth(self, tag: Tag) -> int:
"""Compute the depth of the tag from the body tag."""
depth = 0
current = tag
while current and current != current.parent and current.name != 'body':
current = current.parent
depth += 1
return depth
def extract_text_chunks(self, body: Tag) -> List[Tuple[int, str, str, Tag]]:
"""
Extracts text chunks from the body element while preserving order.
Returns list of tuples (index, text, tag_type, tag) for scoring.
"""
chunks = []
index = 0
def traverse(element):
nonlocal index
if isinstance(element, NavigableString):
return
if not isinstance(element, Tag):
return
if self.is_excluded(element):
return
# Only consider included tags
if element.name in self.included_tags:
text = element.get_text(separator=' ', strip=True)
if len(text.split()) >= self.min_word_count:
tag_type = 'header' if element.name in self.header_tags else 'content'
chunks.append((index, text, tag_type, element))
index += 1
# Do not traverse children of this element to prevent duplication
return
for child in element.children:
traverse(child)
traverse(body)
return chunks
def is_excluded(self, tag: Tag) -> bool:
"""Determine if a tag should be excluded based on heuristics."""
if tag.name in self.excluded_tags:
return True
class_id = ' '.join(filter(None, [
' '.join(tag.get('class', [])),
tag.get('id', '')
]))
if self.negative_patterns.search(class_id):
return True
# Exclude tags with high link density (e.g., navigation menus)
text = tag.get_text(separator=' ', strip=True)
link_text_length = sum(len(a.get_text(strip=True)) for a in tag.find_all('a'))
text_length = len(text)
if text_length > 0 and (link_text_length / text_length) > 0.5:
return True
return False

View File

@@ -1,6 +1,6 @@
import re # Point 1: Pre-Compile Regular Expressions
from abc import ABC, abstractmethod
from typing import Dict, Any
from typing import Dict, Any, Optional
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor
import asyncio, requests, re, os
@@ -9,103 +9,19 @@ from bs4 import element, NavigableString, Comment
from urllib.parse import urljoin
from requests.exceptions import InvalidSchema
# from .content_cleaning_strategy import ContentCleaningStrategy
from .content_filter_strategy import RelevantContentFilter, BM25ContentFilter
from .content_filter_strategy import RelevantContentFilter, BM25ContentFilter#, HeuristicContentFilter
from .markdown_generation_strategy import MarkdownGenerationStrategy, DefaultMarkdownGenerator
from .models import MarkdownGenerationResult
from .utils import (
sanitize_input_encode,
sanitize_html,
extract_metadata,
InvalidCSSSelectorError,
# CustomHTML2Text,
CustomHTML2Text,
normalize_url,
is_external_url
is_external_url
)
from .html2text import HTML2Text
class CustomHTML2Text(HTML2Text):
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
self.inside_pre = False
self.inside_code = False
self.preserve_tags = set() # Set of tags to preserve
self.current_preserved_tag = None
self.preserved_content = []
self.preserve_depth = 0
# Configuration options
self.skip_internal_links = False
self.single_line_break = False
self.mark_code = False
self.include_sup_sub = False
self.body_width = 0
self.ignore_mailto_links = True
self.ignore_links = False
self.escape_backslash = False
self.escape_dot = False
self.escape_plus = False
self.escape_dash = False
self.escape_snob = False
def update_params(self, **kwargs):
"""Update parameters and set preserved tags."""
for key, value in kwargs.items():
if key == 'preserve_tags':
self.preserve_tags = set(value)
else:
setattr(self, key, value)
def handle_tag(self, tag, attrs, start):
# Handle preserved tags
if tag in self.preserve_tags:
if start:
if self.preserve_depth == 0:
self.current_preserved_tag = tag
self.preserved_content = []
# Format opening tag with attributes
attr_str = ''.join(f' {k}="{v}"' for k, v in attrs.items() if v is not None)
self.preserved_content.append(f'<{tag}{attr_str}>')
self.preserve_depth += 1
return
else:
self.preserve_depth -= 1
if self.preserve_depth == 0:
self.preserved_content.append(f'</{tag}>')
# Output the preserved HTML block with proper spacing
preserved_html = ''.join(self.preserved_content)
self.o('\n' + preserved_html + '\n')
self.current_preserved_tag = None
return
# If we're inside a preserved tag, collect all content
if self.preserve_depth > 0:
if start:
# Format nested tags with attributes
attr_str = ''.join(f' {k}="{v}"' for k, v in attrs.items() if v is not None)
self.preserved_content.append(f'<{tag}{attr_str}>')
else:
self.preserved_content.append(f'</{tag}>')
return
# Handle pre tags
if tag == 'pre':
if start:
self.o('```\n')
self.inside_pre = True
else:
self.o('\n```')
self.inside_pre = False
# elif tag in ["h1", "h2", "h3", "h4", "h5", "h6"]:
# pass
else:
super().handle_tag(tag, attrs, start)
def handle_data(self, data, entity_char=False):
"""Override handle_data to capture content within preserved tags."""
if self.preserve_depth > 0:
self.preserved_content.append(data)
return
super().handle_data(data, entity_char)
from .tools import profile_and_time
# Pre-compile regular expressions for Open Graph and Twitter metadata
OG_REGEX = re.compile(r'^og:')
@@ -164,6 +80,104 @@ class WebScrapingStrategy(ContentScrapingStrategy):
async def ascrap(self, url: str, html: str, **kwargs) -> Dict[str, Any]:
return await asyncio.to_thread(self._get_content_of_website_optimized, url, html, **kwargs)
def _generate_markdown_content(self,
cleaned_html: str,
html: str,
url: str,
success: bool,
**kwargs) -> Dict[str, Any]:
"""Generate markdown content using either new strategy or legacy method.
Args:
cleaned_html: Sanitized HTML content
html: Original HTML content
url: Base URL of the page
success: Whether scraping was successful
**kwargs: Additional options including:
- markdown_generator: Optional[MarkdownGenerationStrategy]
- html2text: Dict[str, Any] options for HTML2Text
- content_filter: Optional[RelevantContentFilter]
- fit_markdown: bool
- fit_markdown_user_query: Optional[str]
- fit_markdown_bm25_threshold: float
Returns:
Dict containing markdown content in various formats
"""
markdown_generator: Optional[MarkdownGenerationStrategy] = kwargs.get('markdown_generator', DefaultMarkdownGenerator())
if markdown_generator:
try:
if kwargs.get('fit_markdown', False) and not markdown_generator.content_filter:
markdown_generator.content_filter = BM25ContentFilter(
user_query=kwargs.get('fit_markdown_user_query', None),
bm25_threshold=kwargs.get('fit_markdown_bm25_threshold', 1.0)
)
markdown_result: MarkdownGenerationResult = markdown_generator.generate_markdown(
cleaned_html=cleaned_html,
base_url=url,
html2text_options=kwargs.get('html2text', {})
)
help_message = """"""
return {
'markdown': markdown_result.raw_markdown,
'fit_markdown': markdown_result.fit_markdown,
'fit_html': markdown_result.fit_html,
'markdown_v2': markdown_result
}
except Exception as e:
self._log('error',
message="Error using new markdown generation strategy: {error}",
tag="SCRAPE",
params={"error": str(e)}
)
markdown_generator = None
return {
'markdown': f"Error using new markdown generation strategy: {str(e)}",
'fit_markdown': "Set flag 'fit_markdown' to True to get cleaned HTML content.",
'fit_html': "Set flag 'fit_markdown' to True to get cleaned HTML content.",
'markdown_v2': None
}
# Legacy method
h = CustomHTML2Text()
h.update_params(**kwargs.get('html2text', {}))
markdown = h.handle(cleaned_html)
markdown = markdown.replace(' ```', '```')
fit_markdown = "Set flag 'fit_markdown' to True to get cleaned HTML content."
fit_html = "Set flag 'fit_markdown' to True to get cleaned HTML content."
if kwargs.get('content_filter', None) or kwargs.get('fit_markdown', False):
content_filter = kwargs.get('content_filter', None)
if not content_filter:
content_filter = BM25ContentFilter(
user_query=kwargs.get('fit_markdown_user_query', None),
bm25_threshold=kwargs.get('fit_markdown_bm25_threshold', 1.0)
)
fit_html = content_filter.filter_content(html)
fit_html = '\n'.join('<div>{}</div>'.format(s) for s in fit_html)
fit_markdown = h.handle(fit_html)
markdown_v2 = MarkdownGenerationResult(
raw_markdown=markdown,
markdown_with_citations=markdown,
references_markdown=markdown,
fit_markdown=fit_markdown
)
return {
'markdown': markdown,
'fit_markdown': fit_markdown,
'fit_html': fit_html,
'markdown_v2' : markdown_v2
}
def _get_content_of_website_optimized(self, url: str, html: str, word_count_threshold: int = MIN_WORD_THRESHOLD, css_selector: str = None, **kwargs) -> Dict[str, Any]:
success = True
if not html:
@@ -226,7 +240,9 @@ class WebScrapingStrategy(ContentScrapingStrategy):
return text_content
return None
def process_image(img, url, index, total_images):
def process_image_old(img, url, index, total_images):
#Check if an image has valid display and inside undesired html elements
def is_valid_image(img, parent, parent_classes):
style = img.get('style', '')
@@ -242,8 +258,6 @@ class WebScrapingStrategy(ContentScrapingStrategy):
#Score an image for it's usefulness
def score_image_for_usefulness(img, base_url, index, images_count):
image_height = img.get('height')
height_value, height_unit = parse_dimension(image_height)
image_width = img.get('width')
@@ -277,14 +291,14 @@ class WebScrapingStrategy(ContentScrapingStrategy):
score+=1
return score
if not is_valid_image(img, img.parent, img.parent.get('class', [])):
return None
score = score_image_for_usefulness(img, url, index, total_images)
if score <= IMAGE_SCORE_THRESHOLD:
if score <= kwargs.get('image_score_threshold', IMAGE_SCORE_THRESHOLD):
return None
return {
base_result = {
'src': img.get('src', ''),
'data-src': img.get('data-src', ''),
'alt': img.get('alt', ''),
@@ -293,6 +307,113 @@ class WebScrapingStrategy(ContentScrapingStrategy):
'type': 'image'
}
sources = []
srcset = img.get('srcset', '')
if srcset:
sources = parse_srcset(srcset)
if sources:
return [dict(base_result, src=source['url'], width=source['width'])
for source in sources]
return [base_result] # Always return a list
def process_image(img, url, index, total_images):
parse_srcset = lambda s: [{'url': u.strip().split()[0], 'width': u.strip().split()[-1].rstrip('w')
if ' ' in u else None}
for u in [f"http{p}" for p in s.split("http") if p]]
# Constants for checks
classes_to_check = frozenset(['button', 'icon', 'logo'])
tags_to_check = frozenset(['button', 'input'])
# Pre-fetch commonly used attributes
style = img.get('style', '')
alt = img.get('alt', '')
src = img.get('src', '')
data_src = img.get('data-src', '')
width = img.get('width')
height = img.get('height')
parent = img.parent
parent_classes = parent.get('class', [])
# Quick validation checks
if ('display:none' in style or
parent.name in tags_to_check or
any(c in cls for c in parent_classes for cls in classes_to_check) or
any(c in src for c in classes_to_check) or
any(c in alt for c in classes_to_check)):
return None
# Quick score calculation
score = 0
if width and width.isdigit():
width_val = int(width)
score += 1 if width_val > 150 else 0
if height and height.isdigit():
height_val = int(height)
score += 1 if height_val > 150 else 0
if alt:
score += 1
score += index/total_images < 0.5
image_format = ''
if "data:image/" in src:
image_format = src.split(',')[0].split(';')[0].split('/')[1].split(';')[0]
else:
image_format = os.path.splitext(src)[1].lower().strip('.').split('?')[0]
if image_format in ('jpg', 'png', 'webp', 'avif'):
score += 1
if score <= kwargs.get('image_score_threshold', IMAGE_SCORE_THRESHOLD):
return None
# Use set for deduplication
unique_urls = set()
image_variants = []
# Generate a unique group ID for this set of variants
group_id = index
# Base image info template
base_info = {
'alt': alt,
'desc': find_closest_parent_with_useful_text(img),
'score': score,
'type': 'image',
'group_id': group_id # Group ID for this set of variants
}
# Inline function for adding variants
def add_variant(src, width=None):
if src and not src.startswith('data:') and src not in unique_urls:
unique_urls.add(src)
image_variants.append({**base_info, 'src': src, 'width': width})
# Process all sources
add_variant(src)
add_variant(data_src)
# Handle srcset and data-srcset in one pass
for attr in ('srcset', 'data-srcset'):
if value := img.get(attr):
for source in parse_srcset(value):
add_variant(source['url'], source['width'])
# Quick picture element check
if picture := img.find_parent('picture'):
for source in picture.find_all('source'):
if srcset := source.get('srcset'):
for src in parse_srcset(srcset):
add_variant(src['url'], src['width'])
# Framework-specific attributes in one pass
for attr, value in img.attrs.items():
if attr.startswith('data-') and ('src' in attr or 'srcset' in attr) and 'http' in value:
add_variant(value)
return image_variants if image_variants else None
def remove_unwanted_attributes(element, important_attrs, keep_data_attributes=False):
attrs_to_remove = []
for attr in element.attrs:
@@ -484,13 +605,16 @@ class WebScrapingStrategy(ContentScrapingStrategy):
links['internal'] = list(internal_links_dict.values())
links['external'] = list(external_links_dict.values())
# # Process images using ThreadPoolExecutor
imgs = body.find_all('img')
with ThreadPoolExecutor() as executor:
image_results = list(executor.map(process_image, imgs, [url]*len(imgs), range(len(imgs)), [len(imgs)]*len(imgs)))
media['images'] = [result for result in image_results if result is not None]
# For test we use for loop instead of thread
media['images'] = [
img for result in (process_image(img, url, i, len(imgs))
for i, img in enumerate(imgs))
if result is not None
for img in result
]
def flatten_nested_elements(node):
if isinstance(node, NavigableString):
@@ -545,41 +669,16 @@ class WebScrapingStrategy(ContentScrapingStrategy):
cleaned_html = str_body.replace('\n\n', '\n').replace(' ', ' ')
try:
h = CustomHTML2Text()
h.update_params(**kwargs.get('html2text', {}))
markdown = h.handle(cleaned_html)
except Exception as e:
if not h:
h = CustomHTML2Text()
self._log('error',
message="Error converting HTML to markdown: {error}",
tag="SCRAPE",
params={"error": str(e)}
)
markdown = h.handle(sanitize_html(cleaned_html))
markdown = markdown.replace(' ```', '```')
markdown_content = self._generate_markdown_content(
cleaned_html=cleaned_html,
html=html,
url=url,
success=success,
**kwargs
)
fit_markdown = "Set flag 'fit_markdown' to True to get cleaned HTML content."
fit_html = "Set flag 'fit_markdown' to True to get cleaned HTML content."
if kwargs.get('content_filter', None) or kwargs.get('fit_markdown', False):
content_filter = kwargs.get('content_filter', None)
if not content_filter:
content_filter = BM25ContentFilter(
user_query= kwargs.get('fit_markdown_user_query', None),
bm25_threshold= kwargs.get('fit_markdown_bm25_threshold', 1.0)
)
fit_html = content_filter.filter_content(html)
fit_html = '\n'.join('<div>{}</div>'.format(s) for s in fit_html)
fit_markdown = h.handle(fit_html)
cleaned_html = sanitize_html(cleaned_html)
return {
'markdown': markdown,
'fit_markdown': fit_markdown,
'fit_html': fit_html,
**markdown_content,
'cleaned_html': cleaned_html,
'success': success,
'media': media,

View File

@@ -283,7 +283,7 @@ class LocalSeleniumCrawlerStrategy(CrawlerStrategy):
print(f"[LOG] ✅ Crawled {url} successfully!")
return html
except InvalidArgumentException:
except InvalidArgumentException as e:
if not hasattr(e, 'msg'):
e.msg = sanitize_input_encode(str(e))
raise InvalidArgumentException(f"Failed to crawl {url}: {e.msg}")

View File

@@ -0,0 +1,124 @@
from abc import ABC, abstractmethod
from typing import Optional, Dict, Any, Tuple
from .models import MarkdownGenerationResult
from .utils import CustomHTML2Text
from .content_filter_strategy import RelevantContentFilter, BM25ContentFilter
import re
from urllib.parse import urljoin
# Pre-compile the regex pattern
LINK_PATTERN = re.compile(r'!?\[([^\]]+)\]\(([^)]+?)(?:\s+"([^"]*)")?\)')
class MarkdownGenerationStrategy(ABC):
"""Abstract base class for markdown generation strategies."""
def __init__(self, content_filter: Optional[RelevantContentFilter] = None):
self.content_filter = content_filter
@abstractmethod
def generate_markdown(self,
cleaned_html: str,
base_url: str = "",
html2text_options: Optional[Dict[str, Any]] = None,
content_filter: Optional[RelevantContentFilter] = None,
citations: bool = True,
**kwargs) -> MarkdownGenerationResult:
"""Generate markdown from cleaned HTML."""
pass
class DefaultMarkdownGenerator(MarkdownGenerationStrategy):
"""Default implementation of markdown generation strategy."""
def __init__(self, content_filter: Optional[RelevantContentFilter] = None):
super().__init__(content_filter)
def convert_links_to_citations(self, markdown: str, base_url: str = "") -> Tuple[str, str]:
link_map = {}
url_cache = {} # Cache for URL joins
parts = []
last_end = 0
counter = 1
for match in LINK_PATTERN.finditer(markdown):
parts.append(markdown[last_end:match.start()])
text, url, title = match.groups()
# Use cached URL if available, otherwise compute and cache
if base_url and not url.startswith(('http://', 'https://', 'mailto:')):
if url not in url_cache:
url_cache[url] = fast_urljoin(base_url, url)
url = url_cache[url]
if url not in link_map:
desc = []
if title: desc.append(title)
if text and text != title: desc.append(text)
link_map[url] = (counter, ": " + " - ".join(desc) if desc else "")
counter += 1
num = link_map[url][0]
parts.append(f"{text}{num}" if not match.group(0).startswith('!') else f"![{text}{num}⟩]")
last_end = match.end()
parts.append(markdown[last_end:])
converted_text = ''.join(parts)
# Pre-build reference strings
references = ["\n\n## References\n\n"]
references.extend(
f"{num}{url}{desc}\n"
for url, (num, desc) in sorted(link_map.items(), key=lambda x: x[1][0])
)
return converted_text, ''.join(references)
def generate_markdown(self,
cleaned_html: str,
base_url: str = "",
html2text_options: Optional[Dict[str, Any]] = None,
content_filter: Optional[RelevantContentFilter] = None,
citations: bool = True,
**kwargs) -> MarkdownGenerationResult:
"""Generate markdown with citations from cleaned HTML."""
# Initialize HTML2Text with options
h = CustomHTML2Text()
if html2text_options:
h.update_params(**html2text_options)
# Generate raw markdown
raw_markdown = h.handle(cleaned_html)
raw_markdown = raw_markdown.replace(' ```', '```')
# Convert links to citations
markdown_with_citations: str = ""
references_markdown: str = ""
if citations:
markdown_with_citations, references_markdown = self.convert_links_to_citations(
raw_markdown, base_url
)
# Generate fit markdown if content filter is provided
fit_markdown: Optional[str] = ""
filtered_html: Optional[str] = ""
if content_filter or self.content_filter:
content_filter = content_filter or self.content_filter
filtered_html = content_filter.filter_content(cleaned_html)
filtered_html = '\n'.join('<div>{}</div>'.format(s) for s in filtered_html)
fit_markdown = h.handle(filtered_html)
return MarkdownGenerationResult(
raw_markdown=raw_markdown,
markdown_with_citations=markdown_with_citations,
references_markdown=references_markdown,
fit_markdown=fit_markdown,
fit_html=filtered_html,
)
def fast_urljoin(base: str, url: str) -> str:
"""Fast URL joining for common cases."""
if url.startswith(('http://', 'https://', 'mailto:', '//')):
return url
if url.startswith('/'):
# Handle absolute paths
if base.endswith('/'):
return base[:-1] + url
return base + url
return urljoin(base, url)

View File

@@ -1,5 +1,5 @@
from pydantic import BaseModel, HttpUrl
from typing import List, Dict, Optional, Callable, Awaitable
from typing import List, Dict, Optional, Callable, Awaitable, Union
@@ -7,6 +7,13 @@ class UrlModel(BaseModel):
url: HttpUrl
forced: bool = False
class MarkdownGenerationResult(BaseModel):
raw_markdown: str
markdown_with_citations: str
references_markdown: str
fit_markdown: Optional[str] = None
fit_html: Optional[str] = None
class CrawlResult(BaseModel):
url: str
html: str
@@ -16,7 +23,8 @@ class CrawlResult(BaseModel):
links: Dict[str, List[Dict]] = {}
downloaded_files: Optional[List[str]] = None
screenshot: Optional[str] = None
markdown: Optional[str] = None
markdown: Optional[Union[str, MarkdownGenerationResult]] = None
markdown_v2: Optional[MarkdownGenerationResult] = None
fit_markdown: Optional[str] = None
fit_html: Optional[str] = None
extracted_content: Optional[str] = None
@@ -36,3 +44,5 @@ class AsyncCrawlResponse(BaseModel):
class Config:
arbitrary_types_allowed = True

34
crawl4ai/tools.py Normal file
View File

@@ -0,0 +1,34 @@
import time
import cProfile
import pstats
from functools import wraps
def profile_and_time(func):
@wraps(func)
def wrapper(self, *args, **kwargs):
# Start timer
start_time = time.perf_counter()
# Setup profiler
profiler = cProfile.Profile()
profiler.enable()
# Run function
result = func(self, *args, **kwargs)
# Stop profiler
profiler.disable()
# Calculate elapsed time
elapsed_time = time.perf_counter() - start_time
# Print timing
print(f"[PROFILER] Scraping completed in {elapsed_time:.2f} seconds")
# Print profiling stats
stats = pstats.Stats(profiler)
stats.sort_stats('cumulative') # Sort by cumulative time
stats.print_stats(20) # Print top 20 time-consuming functions
return result
return wrapper

View File

@@ -17,10 +17,154 @@ from requests.exceptions import InvalidSchema
import hashlib
from typing import Optional, Tuple, Dict, Any
import xxhash
from colorama import Fore, Style, init
import textwrap
from .html2text import HTML2Text
class CustomHTML2Text(HTML2Text):
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
self.inside_pre = False
self.inside_code = False
self.preserve_tags = set() # Set of tags to preserve
self.current_preserved_tag = None
self.preserved_content = []
self.preserve_depth = 0
# Configuration options
self.skip_internal_links = False
self.single_line_break = False
self.mark_code = False
self.include_sup_sub = False
self.body_width = 0
self.ignore_mailto_links = True
self.ignore_links = False
self.escape_backslash = False
self.escape_dot = False
self.escape_plus = False
self.escape_dash = False
self.escape_snob = False
def update_params(self, **kwargs):
"""Update parameters and set preserved tags."""
for key, value in kwargs.items():
if key == 'preserve_tags':
self.preserve_tags = set(value)
else:
setattr(self, key, value)
def handle_tag(self, tag, attrs, start):
# Handle preserved tags
if tag in self.preserve_tags:
if start:
if self.preserve_depth == 0:
self.current_preserved_tag = tag
self.preserved_content = []
# Format opening tag with attributes
attr_str = ''.join(f' {k}="{v}"' for k, v in attrs.items() if v is not None)
self.preserved_content.append(f'<{tag}{attr_str}>')
self.preserve_depth += 1
return
else:
self.preserve_depth -= 1
if self.preserve_depth == 0:
self.preserved_content.append(f'</{tag}>')
# Output the preserved HTML block with proper spacing
preserved_html = ''.join(self.preserved_content)
self.o('\n' + preserved_html + '\n')
self.current_preserved_tag = None
return
# If we're inside a preserved tag, collect all content
if self.preserve_depth > 0:
if start:
# Format nested tags with attributes
attr_str = ''.join(f' {k}="{v}"' for k, v in attrs.items() if v is not None)
self.preserved_content.append(f'<{tag}{attr_str}>')
else:
self.preserved_content.append(f'</{tag}>')
return
# Handle pre tags
if tag == 'pre':
if start:
self.o('```\n')
self.inside_pre = True
else:
self.o('\n```')
self.inside_pre = False
# elif tag in ["h1", "h2", "h3", "h4", "h5", "h6"]:
# pass
else:
super().handle_tag(tag, attrs, start)
def handle_data(self, data, entity_char=False):
"""Override handle_data to capture content within preserved tags."""
if self.preserve_depth > 0:
self.preserved_content.append(data)
return
super().handle_data(data, entity_char)
class InvalidCSSSelectorError(Exception):
pass
def create_box_message(
message: str,
type: str = "info",
width: int = 80,
add_newlines: bool = True,
double_line: bool = False
) -> str:
init()
# Define border and text colors for different types
styles = {
"warning": (Fore.YELLOW, Fore.LIGHTYELLOW_EX, ""),
"info": (Fore.BLUE, Fore.LIGHTBLUE_EX, ""),
"success": (Fore.GREEN, Fore.LIGHTGREEN_EX, ""),
"error": (Fore.RED, Fore.LIGHTRED_EX, "×"),
}
border_color, text_color, prefix = styles.get(type.lower(), styles["info"])
# Define box characters based on line style
box_chars = {
"single": ("", "", "", "", "", ""),
"double": ("", "", "", "", "", "")
}
line_style = "double" if double_line else "single"
h_line, v_line, tl, tr, bl, br = box_chars[line_style]
# Process lines with lighter text color
formatted_lines = []
raw_lines = message.split('\n')
if raw_lines:
first_line = f"{prefix} {raw_lines[0].strip()}"
wrapped_first = textwrap.fill(first_line, width=width-4)
formatted_lines.extend(wrapped_first.split('\n'))
for line in raw_lines[1:]:
if line.strip():
wrapped = textwrap.fill(f" {line.strip()}", width=width-4)
formatted_lines.extend(wrapped.split('\n'))
else:
formatted_lines.append("")
# Create the box with colored borders and lighter text
horizontal_line = h_line * (width - 1)
box = [
f"{border_color}{tl}{horizontal_line}{tr}",
*[f"{border_color}{v_line}{text_color} {line:<{width-2}}{border_color}{v_line}" for line in formatted_lines],
f"{border_color}{bl}{horizontal_line}{br}{Style.RESET_ALL}"
]
result = "\n".join(box)
if add_newlines:
result = f"\n{result}\n"
return result
def calculate_semaphore_count():
cpu_count = os.cpu_count()
memory_gb = get_system_memory() / (1024 ** 3) # Convert to GB
@@ -145,12 +289,17 @@ def sanitize_html(html):
def sanitize_input_encode(text: str) -> str:
"""Sanitize input to handle potential encoding issues."""
try:
# Attempt to encode and decode as UTF-8 to handle potential encoding issues
return text.encode('utf-8', errors='ignore').decode('utf-8')
except UnicodeEncodeError as e:
print(f"Warning: Encoding issue detected. Some characters may be lost. Error: {e}")
# Fall back to ASCII if UTF-8 fails
return text.encode('ascii', errors='ignore').decode('ascii')
try:
if not text:
return ''
# Attempt to encode and decode as UTF-8 to handle potential encoding issues
return text.encode('utf-8', errors='ignore').decode('utf-8')
except UnicodeEncodeError as e:
print(f"Warning: Encoding issue detected. Some characters may be lost. Error: {e}")
# Fall back to ASCII if UTF-8 fails
return text.encode('ascii', errors='ignore').decode('ascii')
except Exception as e:
raise ValueError(f"Error sanitizing input: {str(e)}") from e
def escape_json_string(s):
"""
@@ -991,9 +1140,54 @@ def wrap_text(draw, text, font, max_width):
return '\n'.join(lines)
def format_html(html_string):
soup = BeautifulSoup(html_string, 'html.parser')
soup = BeautifulSoup(html_string, 'lxml.parser')
return soup.prettify()
def fast_format_html(html_string):
"""
A fast HTML formatter that uses string operations instead of parsing.
Args:
html_string (str): The HTML string to format
Returns:
str: The formatted HTML string
"""
# Initialize variables
indent = 0
indent_str = " " # Two spaces for indentation
formatted = []
in_content = False
# Split by < and > to separate tags and content
parts = html_string.replace('>', '>\n').replace('<', '\n<').split('\n')
for part in parts:
if not part.strip():
continue
# Handle closing tags
if part.startswith('</'):
indent -= 1
formatted.append(indent_str * indent + part)
# Handle self-closing tags
elif part.startswith('<') and part.endswith('/>'):
formatted.append(indent_str * indent + part)
# Handle opening tags
elif part.startswith('<'):
formatted.append(indent_str * indent + part)
indent += 1
# Handle content between tags
else:
content = part.strip()
if content:
formatted.append(indent_str * indent + content)
return '\n'.join(formatted)
def normalize_url(href, base_url):
"""Normalize URLs to ensure consistent format"""
from urllib.parse import urljoin, urlparse

View File

@@ -10,7 +10,7 @@ from .extraction_strategy import *
from .crawler_strategy import *
from typing import List
from concurrent.futures import ThreadPoolExecutor
from .content_scrapping_strategy import WebScrapingStrategy
from .content_scraping_strategy import WebScrapingStrategy
from .config import *
import warnings
import json

View File

@@ -1,19 +0,0 @@
# Railway Deployment
## Quick Deploy
[![Deploy on Railway](https://railway.app/button.svg)](https://railway.app/template/crawl4ai)
## Manual Setup
1. Fork this repository
2. Create a new Railway project
3. Configure environment variables:
- `INSTALL_TYPE`: basic or all
- `ENABLE_GPU`: true/false
4. Deploy!
## Configuration
See `railway.toml` for:
- Memory limits
- Health checks
- Restart policies
- Scaling options

View File

@@ -1,33 +0,0 @@
{
"name": "Crawl4AI",
"description": "LLM Friendly Web Crawler & Scraper",
"render": {
"dockerfile": {
"path": "Dockerfile"
}
},
"env": [
{
"key": "INSTALL_TYPE",
"description": "Installation type (basic/all)",
"default": "basic",
"required": true
},
{
"key": "ENABLE_GPU",
"description": "Enable GPU support",
"default": "false",
"required": false
}
],
"services": [
{
"name": "web",
"dockerfile": "./Dockerfile",
"healthcheck": {
"path": "/health",
"port": 11235
}
}
]
}

View File

@@ -1,18 +0,0 @@
# railway.toml
[build]
builder = "DOCKERFILE"
dockerfilePath = "Dockerfile"
[deploy]
startCommand = "uvicorn main:app --host 0.0.0.0 --port $PORT"
healthcheckPath = "/health"
restartPolicyType = "ON_FAILURE"
restartPolicyMaxRetries = 3
[deploy.memory]
soft = 2048 # 2GB min for Playwright
hard = 4096 # 4GB max
[deploy.scaling]
min = 1
max = 1

View File

@@ -1,27 +0,0 @@
services:
crawl4ai:
image: unclecode/crawl4ai:basic # Pull image from Docker Hub
ports:
- "11235:11235" # FastAPI server
- "8000:8000" # Alternative port
- "9222:9222" # Browser debugging
- "8080:8080" # Additional port
environment:
- CRAWL4AI_API_TOKEN=${CRAWL4AI_API_TOKEN:-} # Optional API token
- OPENAI_API_KEY=${OPENAI_API_KEY:-} # Optional OpenAI API key
- CLAUDE_API_KEY=${CLAUDE_API_KEY:-} # Optional Claude API key
volumes:
- /dev/shm:/dev/shm # Shared memory for browser operations
deploy:
resources:
limits:
memory: 4G
reservations:
memory: 1G
restart: unless-stopped
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:11235/health"]
interval: 30s
timeout: 10s
retries: 3
start_period: 40s

View File

@@ -1,33 +0,0 @@
services:
crawl4ai:
build:
context: .
dockerfile: Dockerfile
args:
PYTHON_VERSION: 3.10
INSTALL_TYPE: all
ENABLE_GPU: false
ports:
- "11235:11235" # FastAPI server
- "8000:8000" # Alternative port
- "9222:9222" # Browser debugging
- "8080:8080" # Additional port
environment:
- CRAWL4AI_API_TOKEN=${CRAWL4AI_API_TOKEN:-} # Optional API token
- OPENAI_API_KEY=${OPENAI_API_KEY:-} # Optional OpenAI API key
- CLAUDE_API_KEY=${CLAUDE_API_KEY:-} # Optional Claude API key
volumes:
- /dev/shm:/dev/shm # Shared memory for browser operations
deploy:
resources:
limits:
memory: 4G
reservations:
memory: 1G
restart: unless-stopped
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:11235/health"]
interval: 30s
timeout: 10s
retries: 3
start_period: 40s

View File

@@ -4,8 +4,8 @@ services:
context: .
dockerfile: Dockerfile
args:
PYTHON_VERSION: 3.10
INSTALL_TYPE: all
PYTHON_VERSION: "3.10"
INSTALL_TYPE: ${INSTALL_TYPE:-basic}
ENABLE_GPU: false
profiles: ["local"]
ports:

View File

@@ -13,7 +13,9 @@ import re
from typing import Dict, List
from bs4 import BeautifulSoup
from pydantic import BaseModel, Field
from crawl4ai import AsyncWebCrawler
from crawl4ai import AsyncWebCrawler, CacheMode
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
from crawl4ai.content_filter_strategy import BM25ContentFilter
from crawl4ai.extraction_strategy import (
JsonCssExtractionStrategy,
LLMExtractionStrategy,
@@ -51,7 +53,7 @@ async def simple_example_with_running_js_code():
url="https://www.nbcnews.com/business",
js_code=js_code,
# wait_for=wait_for,
bypass_cache=True,
cache_mode=CacheMode.BYPASS,
)
print(result.markdown[:500]) # Print first 500 characters
@@ -61,7 +63,7 @@ async def simple_example_with_css_selector():
result = await crawler.arun(
url="https://www.nbcnews.com/business",
css_selector=".wide-tease-item__description",
bypass_cache=True,
cache_mode=CacheMode.BYPASS,
)
print(result.markdown[:500]) # Print first 500 characters
@@ -132,7 +134,7 @@ async def extract_structured_data_using_llm(provider: str, api_token: str = None
{"model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", "output_fee": "US$30.00 / 1M tokens"}.""",
extra_args=extra_args
),
bypass_cache=True,
cache_mode=CacheMode.BYPASS,
)
print(result.extracted_content)
@@ -166,7 +168,7 @@ async def extract_structured_data_using_css_extractor():
result = await crawler.arun(
url="https://www.coinbase.com/explore",
extraction_strategy=extraction_strategy,
bypass_cache=True,
cache_mode=CacheMode.BYPASS,
)
assert result.success, "Failed to crawl the page"
@@ -213,7 +215,7 @@ async def crawl_dynamic_content_pages_method_1():
session_id=session_id,
css_selector="li.Box-sc-g0xbh4-0",
js=js_next_page if page > 0 else None,
bypass_cache=True,
cache_mode=CacheMode.BYPASS,
js_only=page > 0,
headless=False,
)
@@ -282,7 +284,7 @@ async def crawl_dynamic_content_pages_method_2():
extraction_strategy=extraction_strategy,
js_code=js_next_page_and_wait if page > 0 else None,
js_only=page > 0,
bypass_cache=True,
cache_mode=CacheMode.BYPASS,
headless=False,
)
@@ -343,7 +345,7 @@ async def crawl_dynamic_content_pages_method_3():
js_code=js_next_page if page > 0 else None,
wait_for=wait_for if page > 0 else None,
js_only=page > 0,
bypass_cache=True,
cache_mode=CacheMode.BYPASS,
headless=False,
)
@@ -384,7 +386,7 @@ async def crawl_with_user_simultion():
url = "YOUR-URL-HERE"
result = await crawler.arun(
url=url,
bypass_cache=True,
cache_mode=CacheMode.BYPASS,
magic = True, # Automatically detects and removes overlays, popups, and other elements that block content
# simulate_user = True,# Causes a series of random mouse movements and clicks to simulate user interaction
# override_navigator = True # Overrides the navigator object to make it look like a real user
@@ -408,7 +410,7 @@ async def speed_comparison():
params={'formats': ['markdown', 'html']}
)
end = time.time()
print("Firecrawl (simulated):")
print("Firecrawl:")
print(f"Time taken: {end - start:.2f} seconds")
print(f"Content length: {len(scrape_status['markdown'])} characters")
print(f"Images found: {scrape_status['markdown'].count('cldnry.s-nbcnews.com')}")
@@ -420,7 +422,7 @@ async def speed_comparison():
result = await crawler.arun(
url="https://www.nbcnews.com/business",
word_count_threshold=0,
bypass_cache=True,
cache_mode=CacheMode.BYPASS,
verbose=False,
)
end = time.time()
@@ -430,6 +432,25 @@ async def speed_comparison():
print(f"Images found: {result.markdown.count('cldnry.s-nbcnews.com')}")
print()
# Crawl4AI with advanced content filtering
start = time.time()
result = await crawler.arun(
url="https://www.nbcnews.com/business",
word_count_threshold=0,
markdown_generator=DefaultMarkdownGenerator(
content_filter=BM25ContentFilter(user_query=None, bm25_threshold=1.0)
),
cache_mode=CacheMode.BYPASS,
verbose=False,
)
end = time.time()
print("Crawl4AI (Markdown Plus):")
print(f"Time taken: {end - start:.2f} seconds")
print(f"Content length: {len(result.markdown_v2.raw_markdown)} characters")
print(f"Fit Markdown: {len(result.markdown_v2.fit_markdown)} characters")
print(f"Images found: {result.markdown.count('cldnry.s-nbcnews.com')}")
print()
# Crawl4AI with JavaScript execution
start = time.time()
result = await crawler.arun(
@@ -438,13 +459,17 @@ async def speed_comparison():
"const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More')); loadMoreButton && loadMoreButton.click();"
],
word_count_threshold=0,
bypass_cache=True,
cache_mode=CacheMode.BYPASS,
markdown_generator=DefaultMarkdownGenerator(
content_filter=BM25ContentFilter(user_query=None, bm25_threshold=1.0)
),
verbose=False,
)
end = time.time()
print("Crawl4AI (with JavaScript execution):")
print(f"Time taken: {end - start:.2f} seconds")
print(f"Content length: {len(result.markdown)} characters")
print(f"Fit Markdown: {len(result.markdown_v2.fit_markdown)} characters")
print(f"Images found: {result.markdown.count('cldnry.s-nbcnews.com')}")
print("\nNote on Speed Comparison:")
@@ -483,7 +508,7 @@ async def generate_knowledge_graph():
url = "https://paulgraham.com/love.html"
result = await crawler.arun(
url=url,
bypass_cache=True,
cache_mode=CacheMode.BYPASS,
extraction_strategy=extraction_strategy,
# magic=True
)
@@ -496,7 +521,7 @@ async def fit_markdown_remove_overlay():
url = "https://janineintheworld.com/places-to-visit-in-central-mexico"
result = await crawler.arun(
url=url,
bypass_cache=True,
cache_mode=CacheMode.BYPASS,
word_count_threshold = 10,
remove_overlay_elements=True,
screenshot = True
@@ -517,10 +542,10 @@ async def main():
await extract_structured_data_using_css_extractor()
# LLM extraction examples
await extract_structured_data_using_llm()
await extract_structured_data_using_llm("huggingface/meta-llama/Meta-Llama-3.1-8B-Instruct", os.getenv("HUGGINGFACE_API_KEY"))
# await extract_structured_data_using_llm()
# await extract_structured_data_using_llm("huggingface/meta-llama/Meta-Llama-3.1-8B-Instruct", os.getenv("HUGGINGFACE_API_KEY"))
# await extract_structured_data_using_llm("ollama/llama3.2")
await extract_structured_data_using_llm("openai/gpt-4o", os.getenv("OPENAI_API_KEY"))
await extract_structured_data_using_llm("ollama/llama3.2")
# You always can pass custom headers to the extraction strategy
custom_headers = {

View File

@@ -52,34 +52,7 @@ async def download_example():
else:
print("\nNo files were downloaded")
# 2. Content Filtering with BM25 Example
async def content_filtering_example():
"""Example of using the new BM25 content filtering"""
async with AsyncWebCrawler(verbose=True) as crawler:
# Create filter with custom query for OpenAI's blog
content_filter = BM25ContentFilter(
# user_query="Investment and fundraising",
# user_query="Robotic",
bm25_threshold=1.0
)
result = await crawler.arun(
url="https://techcrunch.com/",
content_filter=content_filter,
cache_mode=CacheMode.BYPASS
)
print(f"Filtered content: {len(result.fit_markdown)}")
print(f"Filtered content: {result.fit_markdown}")
# Save html
with open(os.path.join(__data__, "techcrunch.html"), "w") as f:
f.write(result.fit_html)
with open(os.path.join(__data__, "filtered_content.md"), "w") as f:
f.write(result.fit_markdown)
# 3. Local File and Raw HTML Processing Example
# 2. Local File and Raw HTML Processing Example
async def local_and_raw_html_example():
"""Example of processing local files and raw HTML"""
# Create a sample HTML file
@@ -115,6 +88,68 @@ async def local_and_raw_html_example():
print("Local file content:", local_result.markdown)
print("\nRaw HTML content:", raw_result.markdown)
# 3. Enhanced Markdown Generation Example
async def markdown_generation_example():
"""Example of enhanced markdown generation with citations and LLM-friendly features"""
async with AsyncWebCrawler(verbose=True) as crawler:
# Create a content filter (optional)
content_filter = BM25ContentFilter(
# user_query="History and cultivation",
bm25_threshold=1.0
)
result = await crawler.arun(
url="https://en.wikipedia.org/wiki/Apple",
css_selector="main div#bodyContent",
content_filter=content_filter,
cache_mode=CacheMode.BYPASS
)
from crawl4ai import AsyncWebCrawler
from crawl4ai.content_filter_strategy import BM25ContentFilter
result = await crawler.arun(
url="https://en.wikipedia.org/wiki/Apple",
css_selector="main div#bodyContent",
content_filter=BM25ContentFilter()
)
print(result.markdown_v2.fit_markdown)
print("\nMarkdown Generation Results:")
print(f"1. Original markdown length: {len(result.markdown)}")
print(f"2. New markdown versions (markdown_v2):")
print(f" - Raw markdown length: {len(result.markdown_v2.raw_markdown)}")
print(f" - Citations markdown length: {len(result.markdown_v2.markdown_with_citations)}")
print(f" - References section length: {len(result.markdown_v2.references_markdown)}")
if result.markdown_v2.fit_markdown:
print(f" - Filtered markdown length: {len(result.markdown_v2.fit_markdown)}")
# Save examples to files
output_dir = os.path.join(__data__, "markdown_examples")
os.makedirs(output_dir, exist_ok=True)
# Save different versions
with open(os.path.join(output_dir, "1_raw_markdown.md"), "w") as f:
f.write(result.markdown_v2.raw_markdown)
with open(os.path.join(output_dir, "2_citations_markdown.md"), "w") as f:
f.write(result.markdown_v2.markdown_with_citations)
with open(os.path.join(output_dir, "3_references.md"), "w") as f:
f.write(result.markdown_v2.references_markdown)
if result.markdown_v2.fit_markdown:
with open(os.path.join(output_dir, "4_filtered_markdown.md"), "w") as f:
f.write(result.markdown_v2.fit_markdown)
print(f"\nMarkdown examples saved to: {output_dir}")
# Show a sample of citations and references
print("\nSample of markdown with citations:")
print(result.markdown_v2.markdown_with_citations[:500] + "...\n")
print("Sample of references:")
print('\n'.join(result.markdown_v2.references_markdown.split('\n')[:10]) + "...")
# 4. Browser Management Example
async def browser_management_example():
"""Example of using enhanced browser management features"""
@@ -208,9 +243,13 @@ async def api_example():
headers=headers
) as status_response:
result = await status_response.json()
print(f"Task result: {result}")
print(f"Task status: {result['status']}")
if result["status"] == "completed":
print("Task completed!")
print("Results:")
news = json.loads(result["results"][0]['extracted_content'])
print(json.dumps(news[:4], indent=2))
break
else:
await asyncio.sleep(1)
@@ -220,15 +259,15 @@ async def main():
# print("Running Crawl4AI feature examples...")
# print("\n1. Running Download Example:")
await download_example()
# await download_example()
# print("\n2. Running Content Filtering Example:")
await content_filtering_example()
# print("\n2. Running Markdown Generation Example:")
# await markdown_generation_example()
# print("\n3. Running Local and Raw HTML Example:")
await local_and_raw_html_example()
# # print("\n3. Running Local and Raw HTML Example:")
# await local_and_raw_html_example()
# print("\n4. Running Browser Management Example:")
# # print("\n4. Running Browser Management Example:")
await browser_management_example()
# print("\n5. Running API Example:")

View File

@@ -18,7 +18,7 @@ Let's see how we can customize the AsyncWebCrawler using hooks! In this example,
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.async_crawler_strategy import AsyncPlaywrightCrawlerStrategy
from playwright.async_api import Page, Browser
from playwright.async_api import Page, Browser, BrowserContext
async def on_browser_created(browser: Browser):
print("[HOOK] on_browser_created")
@@ -71,7 +71,11 @@ from crawl4ai.async_crawler_strategy import AsyncPlaywrightCrawlerStrategy
async def main():
print("\n🔗 Using Crawler Hooks: Let's see how we can customize the AsyncWebCrawler using hooks!")
crawler_strategy = AsyncPlaywrightCrawlerStrategy(verbose=True)
initial_cookies = [
{"name": "sessionId", "value": "abc123", "domain": ".example.com"},
{"name": "userId", "value": "12345", "domain": ".example.com"}
]
crawler_strategy = AsyncPlaywrightCrawlerStrategy(verbose=True, cookies=initial_cookies)
crawler_strategy.set_hook('on_browser_created', on_browser_created)
crawler_strategy.set_hook('before_goto', before_goto)
crawler_strategy.set_hook('after_goto', after_goto)

View File

@@ -1,131 +0,0 @@
:root {
--ifm-font-size-base: 100%;
--ifm-line-height-base: 1.65;
--ifm-font-family-base: system-ui, -apple-system, Segoe UI, Roboto, Ubuntu, Cantarell, Noto Sans, sans-serif,
BlinkMacSystemFont, "Segoe UI", Helvetica, Arial, sans-serif, "Apple Color Emoji", "Segoe UI Emoji",
"Segoe UI Symbol";
}
html {
-webkit-font-smoothing: antialiased;
-webkit-text-size-adjust: 100%;
text-size-adjust: 100%;
font: var(--ifm-font-size-base) / var(--ifm-line-height-base) var(--ifm-font-family-base);
}
body {
background-color: #1a202c;
color: #fff;
}
.tab-content {
max-height: 400px;
overflow: auto;
}
pre {
white-space: pre-wrap;
font-size: 14px;
}
pre code {
width: 100%;
}
/* Custom styling for docs-item class and Markdown generated elements */
.docs-item {
background-color: #2d3748; /* bg-gray-800 */
padding: 1rem; /* p-4 */
border-radius: 0.375rem; /* rounded */
box-shadow: 0 4px 6px rgba(0, 0, 0, 0.1); /* shadow-md */
margin-bottom: 1rem; /* space between items */
line-height: 1.5; /* leading-normal */
}
.docs-item h3,
.docs-item h4 {
color: #ffffff; /* text-white */
font-size: 1.25rem; /* text-xl */
font-weight: 700; /* font-bold */
margin-bottom: 0.5rem; /* mb-2 */
}
.docs-item h4 {
font-size: 1rem; /* text-xl */
}
.docs-item p {
color: #e2e8f0; /* text-gray-300 */
margin-bottom: 0.5rem; /* mb-2 */
}
.docs-item code {
background-color: #1a202c; /* bg-gray-900 */
color: #e2e8f0; /* text-gray-300 */
padding: 0.25rem 0.5rem; /* px-2 py-1 */
border-radius: 0.25rem; /* rounded */
font-size: 0.875rem; /* text-sm */
}
.docs-item pre {
background-color: #1a202c; /* bg-gray-900 */
color: #e2e8f0; /* text-gray-300 */
padding: 0.5rem; /* p-2 */
border-radius: 0.375rem; /* rounded */
overflow: auto; /* overflow-auto */
margin-bottom: 0.5rem; /* mb-2 */
}
.docs-item div {
color: #e2e8f0; /* text-gray-300 */
font-size: 1rem; /* prose prose-sm */
line-height: 1.25rem; /* line-height for readability */
}
/* Adjustments to make prose class more suitable for dark mode */
.prose {
max-width: none; /* max-w-none */
}
.prose p,
.prose ul {
margin-bottom: 1rem; /* mb-4 */
}
.prose code {
/* background-color: #4a5568; */ /* bg-gray-700 */
color: #65a30d; /* text-white */
padding: 0.25rem 0.5rem; /* px-1 py-0.5 */
border-radius: 0.25rem; /* rounded */
display: inline-block; /* inline-block */
}
.prose pre {
background-color: #1a202c; /* bg-gray-900 */
color: #ffffff; /* text-white */
padding: 0.5rem; /* p-2 */
border-radius: 0.375rem; /* rounded */
}
.prose h3 {
color: #65a30d; /* text-white */
font-size: 1.25rem; /* text-xl */
font-weight: 700; /* font-bold */
margin-bottom: 0.5rem; /* mb-2 */
}
body {
background-color: #1a1a1a;
color: #b3ff00;
}
.sidebar {
color: #b3ff00;
border-right: 1px solid #333;
}
.sidebar a {
color: #b3ff00;
text-decoration: none;
}
.sidebar a:hover {
background-color: #555;
}
.content-section {
display: none;
}
.content-section.active {
display: block;
}

View File

@@ -1,356 +0,0 @@
// JavaScript to manage dynamic form changes and logic
document.getElementById("extraction-strategy-select").addEventListener("change", function () {
const strategy = this.value;
const providerModelSelect = document.getElementById("provider-model-select");
const tokenInput = document.getElementById("token-input");
const instruction = document.getElementById("instruction");
const semantic_filter = document.getElementById("semantic_filter");
const instruction_div = document.getElementById("instruction_div");
const semantic_filter_div = document.getElementById("semantic_filter_div");
const llm_settings = document.getElementById("llm_settings");
if (strategy === "LLMExtractionStrategy") {
// providerModelSelect.disabled = false;
// tokenInput.disabled = false;
// semantic_filter.disabled = true;
// instruction.disabled = false;
llm_settings.classList.remove("hidden");
instruction_div.classList.remove("hidden");
semantic_filter_div.classList.add("hidden");
} else if (strategy === "NoExtractionStrategy") {
semantic_filter_div.classList.add("hidden");
instruction_div.classList.add("hidden");
llm_settings.classList.add("hidden");
} else {
// providerModelSelect.disabled = true;
// tokenInput.disabled = true;
// semantic_filter.disabled = false;
// instruction.disabled = true;
llm_settings.classList.add("hidden");
instruction_div.classList.add("hidden");
semantic_filter_div.classList.remove("hidden");
}
});
// Get the selected provider model and token from local storage
const storedProviderModel = localStorage.getItem("provider_model");
const storedToken = localStorage.getItem(storedProviderModel);
if (storedProviderModel) {
document.getElementById("provider-model-select").value = storedProviderModel;
}
if (storedToken) {
document.getElementById("token-input").value = storedToken;
}
// Handle provider model dropdown change
document.getElementById("provider-model-select").addEventListener("change", () => {
const selectedProviderModel = document.getElementById("provider-model-select").value;
const storedToken = localStorage.getItem(selectedProviderModel);
if (storedToken) {
document.getElementById("token-input").value = storedToken;
} else {
document.getElementById("token-input").value = "";
}
});
// Fetch total count from the database
axios
.get("/total-count")
.then((response) => {
document.getElementById("total-count").textContent = response.data.count;
})
.catch((error) => console.error(error));
// Handle crawl button click
document.getElementById("crawl-btn").addEventListener("click", () => {
// validate input to have both URL and API token
// if selected extraction strategy is LLMExtractionStrategy, then API token is required
if (document.getElementById("extraction-strategy-select").value === "LLMExtractionStrategy") {
if (!document.getElementById("url-input").value || !document.getElementById("token-input").value) {
alert("Please enter both URL(s) and API token.");
return;
}
}
const selectedProviderModel = document.getElementById("provider-model-select").value;
const apiToken = document.getElementById("token-input").value;
const extractBlocks = document.getElementById("extract-blocks-checkbox").checked;
const bypassCache = document.getElementById("bypass-cache-checkbox").checked;
// Save the selected provider model and token to local storage
localStorage.setItem("provider_model", selectedProviderModel);
localStorage.setItem(selectedProviderModel, apiToken);
const urlsInput = document.getElementById("url-input").value;
const urls = urlsInput.split(",").map((url) => url.trim());
const data = {
urls: urls,
include_raw_html: true,
bypass_cache: bypassCache,
extract_blocks: extractBlocks,
word_count_threshold: parseInt(document.getElementById("threshold").value),
extraction_strategy: document.getElementById("extraction-strategy-select").value,
extraction_strategy_args: {
provider: selectedProviderModel,
api_token: apiToken,
instruction: document.getElementById("instruction").value,
semantic_filter: document.getElementById("semantic_filter").value,
},
chunking_strategy: document.getElementById("chunking-strategy-select").value,
chunking_strategy_args: {},
css_selector: document.getElementById("css-selector").value,
screenshot: document.getElementById("screenshot-checkbox").checked,
// instruction: document.getElementById("instruction").value,
// semantic_filter: document.getElementById("semantic_filter").value,
verbose: true,
};
// import requests
// data = {
// "urls": [
// "https://www.nbcnews.com/business"
// ],
// "word_count_threshold": 10,
// "extraction_strategy": "NoExtractionStrategy",
// }
// response = requests.post("https://crawl4ai.com/crawl", json=data) # OR local host if your run locally
// print(response.json())
// save api token to local storage
localStorage.setItem("api_token", document.getElementById("token-input").value);
document.getElementById("loading").classList.remove("hidden");
document.getElementById("result").style.visibility = "hidden";
document.getElementById("code_help").style.visibility = "hidden";
axios
.post("/crawl", data)
.then((response) => {
const result = response.data.results[0];
const parsedJson = JSON.parse(result.extracted_content);
document.getElementById("json-result").textContent = JSON.stringify(parsedJson, null, 2);
document.getElementById("cleaned-html-result").textContent = result.cleaned_html;
document.getElementById("markdown-result").textContent = result.markdown;
document.getElementById("media-result").textContent = JSON.stringify( result.media, null, 2);
if (result.screenshot){
const imgElement = document.createElement("img");
// Set the src attribute with the base64 data
imgElement.src = `data:image/png;base64,${result.screenshot}`;
document.getElementById("screenshot-result").innerHTML = "";
document.getElementById("screenshot-result").appendChild(imgElement);
}
// Update code examples dynamically
const extractionStrategy = data.extraction_strategy;
const isLLMExtraction = extractionStrategy === "LLMExtractionStrategy";
// REMOVE API TOKEN FROM CODE EXAMPLES
data.extraction_strategy_args.api_token = "your_api_token";
if (data.extraction_strategy === "NoExtractionStrategy") {
delete data.extraction_strategy_args;
delete data.extrac_blocks;
}
if (data.chunking_strategy === "RegexChunking") {
delete data.chunking_strategy_args;
}
delete data.verbose;
if (data.css_selector === "") {
delete data.css_selector;
}
if (!data.bypass_cache) {
delete data.bypass_cache;
}
if (!data.extract_blocks) {
delete data.extract_blocks;
}
if (!data.include_raw_html) {
delete data.include_raw_html;
}
document.getElementById(
"curl-code"
).textContent = `curl -X POST -H "Content-Type: application/json" -d '${JSON.stringify({
...data,
api_token: isLLMExtraction ? "your_api_token" : undefined,
}, null, 2)}' https://crawl4ai.com/crawl`;
document.getElementById("python-code").textContent = `import requests\n\ndata = ${JSON.stringify(
{ ...data, api_token: isLLMExtraction ? "your_api_token" : undefined },
null,
2
)}\n\nresponse = requests.post("https://crawl4ai.com/crawl", json=data) # OR local host if your run locally \nprint(response.json())`;
document.getElementById(
"nodejs-code"
).textContent = `const axios = require('axios');\n\nconst data = ${JSON.stringify(
{ ...data, api_token: isLLMExtraction ? "your_api_token" : undefined },
null,
2
)};\n\naxios.post("https://crawl4ai.com/crawl", data) // OR local host if your run locally \n .then(response => console.log(response.data))\n .catch(error => console.error(error));`;
document.getElementById(
"library-code"
).textContent = `from crawl4ai.web_crawler import WebCrawler\nfrom crawl4ai.extraction_strategy import *\nfrom crawl4ai.chunking_strategy import *\n\ncrawler = WebCrawler()\ncrawler.warmup()\n\nresult = crawler.run(\n url='${
urls[0]
}',\n word_count_threshold=${data.word_count_threshold},\n extraction_strategy=${
isLLMExtraction
? `${extractionStrategy}(provider="${data.provider_model}", api_token="${data.api_token}")`
: extractionStrategy + "()"
},\n chunking_strategy=${data.chunking_strategy}(),\n bypass_cache=${
data.bypass_cache
},\n css_selector="${data.css_selector}"\n)\nprint(result)`;
// Highlight code syntax
hljs.highlightAll();
// Select JSON tab by default
document.querySelector('.tab-btn[data-tab="json"]').click();
document.getElementById("loading").classList.add("hidden");
document.getElementById("result").style.visibility = "visible";
document.getElementById("code_help").style.visibility = "visible";
// increment the total count
document.getElementById("total-count").textContent =
parseInt(document.getElementById("total-count").textContent) + 1;
})
.catch((error) => {
console.error(error);
document.getElementById("loading").classList.add("hidden");
});
});
// Handle tab clicks
document.querySelectorAll(".tab-btn").forEach((btn) => {
btn.addEventListener("click", () => {
const tab = btn.dataset.tab;
document.querySelectorAll(".tab-btn").forEach((b) => b.classList.remove("bg-lime-700", "text-white"));
btn.classList.add("bg-lime-700", "text-white");
document.querySelectorAll(".tab-content.code pre").forEach((el) => el.classList.add("hidden"));
document.getElementById(`${tab}-result`).parentElement.classList.remove("hidden");
});
});
// Handle code tab clicks
document.querySelectorAll(".code-tab-btn").forEach((btn) => {
btn.addEventListener("click", () => {
const tab = btn.dataset.tab;
document.querySelectorAll(".code-tab-btn").forEach((b) => b.classList.remove("bg-lime-700", "text-white"));
btn.classList.add("bg-lime-700", "text-white");
document.querySelectorAll(".tab-content.result pre").forEach((el) => el.classList.add("hidden"));
document.getElementById(`${tab}-code`).parentElement.classList.remove("hidden");
});
});
// Handle copy to clipboard button clicks
async function copyToClipboard(text) {
if (navigator.clipboard && navigator.clipboard.writeText) {
return navigator.clipboard.writeText(text);
} else {
return fallbackCopyTextToClipboard(text);
}
}
function fallbackCopyTextToClipboard(text) {
return new Promise((resolve, reject) => {
const textArea = document.createElement("textarea");
textArea.value = text;
// Avoid scrolling to bottom
textArea.style.top = "0";
textArea.style.left = "0";
textArea.style.position = "fixed";
document.body.appendChild(textArea);
textArea.focus();
textArea.select();
try {
const successful = document.execCommand("copy");
if (successful) {
resolve();
} else {
reject();
}
} catch (err) {
reject(err);
}
document.body.removeChild(textArea);
});
}
document.querySelectorAll(".copy-btn").forEach((btn) => {
btn.addEventListener("click", () => {
const target = btn.dataset.target;
const code = document.getElementById(target).textContent;
//navigator.clipboard.writeText(code).then(() => {
copyToClipboard(code).then(() => {
btn.textContent = "Copied!";
setTimeout(() => {
btn.textContent = "Copy";
}, 2000);
});
});
});
document.addEventListener("DOMContentLoaded", async () => {
try {
const extractionResponse = await fetch("/strategies/extraction");
const extractionStrategies = await extractionResponse.json();
const chunkingResponse = await fetch("/strategies/chunking");
const chunkingStrategies = await chunkingResponse.json();
renderStrategies("extraction-strategies", extractionStrategies);
renderStrategies("chunking-strategies", chunkingStrategies);
} catch (error) {
console.error("Error fetching strategies:", error);
}
});
function renderStrategies(containerId, strategies) {
const container = document.getElementById(containerId);
container.innerHTML = ""; // Clear any existing content
strategies = JSON.parse(strategies);
Object.entries(strategies).forEach(([strategy, description]) => {
const strategyElement = document.createElement("div");
strategyElement.classList.add("bg-zinc-800", "p-4", "rounded", "shadow-md", "docs-item");
const strategyDescription = document.createElement("div");
strategyDescription.classList.add("text-gray-300", "prose", "prose-sm");
strategyDescription.innerHTML = marked.parse(description);
strategyElement.appendChild(strategyDescription);
container.appendChild(strategyElement);
});
}
document.querySelectorAll(".sidebar a").forEach((link) => {
link.addEventListener("click", function (event) {
event.preventDefault();
document.querySelectorAll(".content-section").forEach((section) => {
section.classList.remove("active");
});
const target = event.target.getAttribute("data-target");
document.getElementById(target).classList.add("active");
});
});
// Highlight code syntax
hljs.highlightAll();

View File

@@ -1,971 +0,0 @@
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<title>Crawl4AI</title>
<link rel="preconnect" href="https://fonts.googleapis.com" />
<link rel="preconnect" href="https://fonts.gstatic.com" crossorigin />
<link href="https://fonts.googleapis.com/css2?family=Inter:wght@100..900&display=swap" rel="stylesheet" />
<!-- <link href="https://cdn.jsdelivr.net/npm/tailwindcss@3.4.3/dist/tailwind.min.css" rel="stylesheet" /> -->
<script src="https://cdn.tailwindcss.com"></script>
<script src="https://cdn.jsdelivr.net/npm/axios/dist/axios.min.js"></script>
<link
rel="stylesheet"
href="https://cdnjs.cloudflare.com/ajax/libs/highlight.js/11.7.0/styles/monokai.min.css"
/>
<script src="https://cdn.jsdelivr.net/npm/marked/marked.min.js"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/highlight.js/11.7.0/highlight.min.js"></script>
<style>
:root {
--ifm-font-size-base: 100%;
--ifm-line-height-base: 1.65;
--ifm-font-family-base: system-ui, -apple-system, Segoe UI, Roboto, Ubuntu, Cantarell, Noto Sans,
sans-serif, BlinkMacSystemFont, "Segoe UI", Helvetica, Arial, sans-serif, "Apple Color Emoji",
"Segoe UI Emoji", "Segoe UI Symbol";
}
html {
-webkit-font-smoothing: antialiased;
-webkit-text-size-adjust: 100%;
text-size-adjust: 100%;
font: var(--ifm-font-size-base) / var(--ifm-line-height-base) var(--ifm-font-family-base);
}
body {
background-color: #1a202c;
color: #fff;
}
.tab-content {
max-height: 400px;
overflow: auto;
}
pre {
white-space: pre-wrap;
font-size: 14px;
}
pre code {
width: 100%;
}
</style>
<style>
/* Custom styling for docs-item class and Markdown generated elements */
.docs-item {
background-color: #2d3748; /* bg-gray-800 */
padding: 1rem; /* p-4 */
border-radius: 0.375rem; /* rounded */
box-shadow: 0 4px 6px rgba(0, 0, 0, 0.1); /* shadow-md */
margin-bottom: 1rem; /* space between items */
}
.docs-item h3,
.docs-item h4 {
color: #ffffff; /* text-white */
font-size: 1.25rem; /* text-xl */
font-weight: 700; /* font-bold */
margin-bottom: 0.5rem; /* mb-2 */
}
.docs-item p {
color: #e2e8f0; /* text-gray-300 */
margin-bottom: 0.5rem; /* mb-2 */
}
.docs-item code {
background-color: #1a202c; /* bg-gray-900 */
color: #e2e8f0; /* text-gray-300 */
padding: 0.25rem 0.5rem; /* px-2 py-1 */
border-radius: 0.25rem; /* rounded */
}
.docs-item pre {
background-color: #1a202c; /* bg-gray-900 */
color: #e2e8f0; /* text-gray-300 */
padding: 0.5rem; /* p-2 */
border-radius: 0.375rem; /* rounded */
overflow: auto; /* overflow-auto */
margin-bottom: 0.5rem; /* mb-2 */
}
.docs-item div {
color: #e2e8f0; /* text-gray-300 */
font-size: 1rem; /* prose prose-sm */
line-height: 1.25rem; /* line-height for readability */
}
/* Adjustments to make prose class more suitable for dark mode */
.prose {
max-width: none; /* max-w-none */
}
.prose p,
.prose ul {
margin-bottom: 1rem; /* mb-4 */
}
.prose code {
/* background-color: #4a5568; */ /* bg-gray-700 */
color: #65a30d; /* text-white */
padding: 0.25rem 0.5rem; /* px-1 py-0.5 */
border-radius: 0.25rem; /* rounded */
display: inline-block; /* inline-block */
}
.prose pre {
background-color: #1a202c; /* bg-gray-900 */
color: #ffffff; /* text-white */
padding: 0.5rem; /* p-2 */
border-radius: 0.375rem; /* rounded */
}
.prose h3 {
color: #65a30d; /* text-white */
font-size: 1.25rem; /* text-xl */
font-weight: 700; /* font-bold */
margin-bottom: 0.5rem; /* mb-2 */
}
</style>
</head>
<body class="bg-black text-gray-200">
<header class="bg-zinc-950 text-white py-4 flex">
<div class="mx-auto px-4">
<h1 class="text-2xl font-bold">🔥🕷️ Crawl4AI: Web Data for your Thoughts</h1>
</div>
<div class="mx-auto px-4 flex font-bold text-xl gap-2">
<span>📊 Total Website Processed</span>
<span id="total-count" class="text-lime-400">2</span>
</div>
</header>
<section class="try-it py-8 px-16 pb-20">
<div class="container mx-auto px-4">
<h2 class="text-2xl font-bold mb-4">Try It Now</h2>
<div class="grid grid-cols-1 lg:grid-cols-3 gap-4">
<div class="space-y-4">
<div class="flex flex-col">
<label for="url-input" class="text-lime-500 font-bold text-xs">URL(s)</label>
<input
type="text"
id="url-input"
value="https://www.nbcnews.com/business"
class="border border-zinc-700 rounded px-4 py-0 bg-zinc-900 text-lime-500"
placeholder="Enter URL(s) separated by commas"
/>
</div>
<div class="flex flex-col">
<label for="threshold" class="text-lime-500 font-bold text-xs">Min Words Threshold</label>
<select
id="threshold"
class="border border-zinc-700 rounded px-4 py-0 bg-zinc-900 text-lime-500"
>
<option value="5">5</option>
<option value="10" selected>10</option>
<option value="15">15</option>
<option value="20">20</option>
<option value="25">25</option>
</select>
</div>
<div class="flex flex-col">
<label for="css-selector" class="text-lime-500 font-bold text-xs">CSS Selector</label>
<input
type="text"
id="css-selector"
class="border border-zinc-700 rounded px-4 py-0 bg-zinc-900 text-lime-500"
placeholder="Enter CSS Selector"
/>
</div>
<div class="flex flex-col">
<label for="extraction-strategy-select" class="text-lime-500 font-bold text-xs"
>Extraction Strategy</label
>
<select
id="extraction-strategy-select"
class="border border-zinc-700 rounded px-4 py-1 bg-zinc-900 text-lime-500"
>
<option value="CosineStrategy">CosineStrategy</option>
<option value="LLMExtractionStrategy">LLMExtractionStrategy</option>
<option value="NoExtractionStrategy">NoExtractionStrategy</option>
</select>
</div>
<div class="flex flex-col">
<label for="chunking-strategy-select" class="text-lime-500 font-bold text-xs"
>Chunking Strategy</label
>
<select
id="chunking-strategy-select"
class="border border-zinc-700 rounded px-4 py-1 bg-zinc-900 text-lime-500"
>
<option value="RegexChunking">RegexChunking</option>
<option value="NlpSentenceChunking">NlpSentenceChunking</option>
<option value="TopicSegmentationChunking">TopicSegmentationChunking</option>
<option value="FixedLengthWordChunking">FixedLengthWordChunking</option>
<option value="SlidingWindowChunking">SlidingWindowChunking</option>
</select>
</div>
<div class="flex flex-col">
<label for="provider-model-select" class="text-lime-500 font-bold text-xs"
>Provider Model</label
>
<select
id="provider-model-select"
class="border border-zinc-700 rounded px-4 py-0 bg-zinc-900 text-lime-500"
disabled
>
<option value="groq/llama3-70b-8192">groq/llama3-70b-8192</option>
<option value="groq/llama3-8b-8192">groq/llama3-8b-8192</option>
<option value="openai/gpt-4-turbo">gpt-4-turbo</option>
<option value="openai/gpt-3.5-turbo">gpt-3.5-turbo</option>
<option value="anthropic/claude-3-haiku-20240307">claude-3-haiku</option>
<option value="anthropic/claude-3-opus-20240229">claude-3-opus</option>
<option value="anthropic/claude-3-sonnet-20240229">claude-3-sonnet</option>
</select>
</div>
<div class="flex flex-col">
<label for="token-input" class="text-lime-500 font-bold text-xs">API Token</label>
<input
type="password"
id="token-input"
class="border border-zinc-700 rounded px-4 py-0 bg-zinc-900 text-lime-500"
placeholder="Enter Groq API token"
disabled
/>
</div>
<div class="flex gap-3">
<div class="flex items-center gap-2">
<input type="checkbox" id="bypass-cache-checkbox" />
<label for="bypass-cache-checkbox" class="text-lime-500 font-bold">Bypass Cache</label>
</div>
<div class="flex items-center gap-2">
<input type="checkbox" id="extract-blocks-checkbox" checked />
<label for="extract-blocks-checkbox" class="text-lime-500 font-bold"
>Extract Blocks</label
>
</div>
<button id="crawl-btn" class="bg-lime-600 text-black font-bold px-4 py-0 rounded">
Crawl
</button>
</div>
</div>
<div id="result" class=" ">
<div id="loading" class="hidden">
<p class="text-white">Loading... Please wait.</p>
</div>
<div class="tab-buttons flex gap-2">
<button
class="tab-btn px-4 py-1 text-sm bg-zinc-700 rounded-t text-lime-500"
data-tab="json"
>
JSON
</button>
<button
class="tab-btn px-4 py-1 text-sm bg-zinc-700 rounded-t text-lime-500"
data-tab="cleaned-html"
>
Cleaned HTML
</button>
<button
class="tab-btn px-4 py-1 text-sm bg-zinc-700 rounded-t text-lime-500"
data-tab="markdown"
>
Markdown
</button>
</div>
<div class="tab-content code bg-zinc-900 p-2 rounded h-full border border-zinc-700 text-sm">
<pre class="h-full flex"><code id="json-result" class="language-json"></code></pre>
<pre
class="hidden h-full flex"
><code id="cleaned-html-result" class="language-html"></code></pre>
<pre
class="hidden h-full flex"
><code id="markdown-result" class="language-markdown"></code></pre>
</div>
</div>
<div id="code_help" class=" ">
<div class="tab-buttons flex gap-2">
<button
class="code-tab-btn px-4 py-1 text-sm bg-zinc-700 rounded-t text-lime-500"
data-tab="curl"
>
cURL
</button>
<button
class="code-tab-btn px-4 py-1 text-sm bg-zinc-700 rounded-t text-lime-500"
data-tab="library"
>
Python Library
</button>
<button
class="code-tab-btn px-4 py-1 text-sm bg-zinc-700 rounded-t text-lime-500"
data-tab="python"
>
Python (Request)
</button>
<button
class="code-tab-btn px-4 py-1 text-sm bg-zinc-700 rounded-t text-lime-500"
data-tab="nodejs"
>
Node.js
</button>
</div>
<div class="tab-content result bg-zinc-900 p-2 rounded h-full border border-zinc-700 text-sm">
<pre class="h-full flex relative">
<code id="curl-code" class="language-bash"></code>
<button class="absolute top-2 right-2 bg-zinc-700 text-white px-2 py-1 rounded copy-btn" data-target="curl-code">Copy</button>
</pre>
<pre class="hidden h-full flex relative">
<code id="python-code" class="language-python"></code>
<button class="absolute top-2 right-2 bg-zinc-700 text-white px-2 py-1 rounded copy-btn" data-target="python-code">Copy</button>
</pre>
<pre class="hidden h-full flex relative">
<code id="nodejs-code" class="language-javascript"></code>
<button class="absolute top-2 right-2 bg-zinc-700 text-white px-2 py-1 rounded copy-btn" data-target="nodejs-code">Copy</button>
</pre>
<pre class="hidden h-full flex relative">
<code id="library-code" class="language-python"></code>
<button class="absolute top-2 right-2 bg-zinc-700 text-white px-2 py-1 rounded copy-btn" data-target="library-code">Copy</button>
</pre>
</div>
</div>
</div>
</div>
</section>
<section class="bg-zinc-900 text-zinc-300 p-6 px-20">
<div class="grid grid-cols-2 gap-4 p-4 bg-zinc-900 text-lime-500">
<!-- Step 1 -->
<div class="col-span-2 bg-yellow-500 p-2 rounded text-zinc-900">
🌟 <strong>Welcome to the Crawl4ai Quickstart Guide! Let's dive into some web crawling fun!</strong>
</div>
<div class="bg-zinc-800 p-2 rounded">
First Step: Create an instance of WebCrawler and call the <code>warmup()</code> function.
</div>
<div>
<pre><code class="language-python">crawler = WebCrawler()
crawler.warmup()</code></pre>
</div>
<!-- Step 2 -->
<div class="col-span-2 bg-yellow-500 p-2 rounded text-zinc-900">
🧠 <strong>Understanding 'bypass_cache' and 'include_raw_html' parameters:</strong>
</div>
<div class="bg-zinc-800 p-2 rounded">First crawl (caches the result):</div>
<div>
<pre><code class="language-python">result = crawler.run(url="https://www.nbcnews.com/business")</code></pre>
</div>
<div class="bg-zinc-800 p-2 rounded">Second crawl (Force to crawl again):</div>
<div>
<pre><code class="language-python">result = crawler.run(url="https://www.nbcnews.com/business", bypass_cache=True)</code></pre>
</div>
<div class="bg-zinc-800 p-2 rounded">Crawl result without raw HTML content:</div>
<div>
<pre><code class="language-python">result = crawler.run(url="https://www.nbcnews.com/business", include_raw_html=False)</code></pre>
</div>
<!-- Step 3 -->
<div class="col-span-2 bg-yellow-500 p-2 rounded text-zinc-900">
📄
<strong
>The 'include_raw_html' parameter, when set to True, includes the raw HTML content in the
response. By default, it is set to True.</strong
>
</div>
<div class="bg-zinc-800 p-2 rounded">Set <code>always_by_pass_cache</code> to True:</div>
<div>
<pre><code class="language-python">crawler.always_by_pass_cache = True</code></pre>
</div>
<!-- Step 4 -->
<div class="col-span-2 bg-yellow-500 p-2 rounded text-zinc-900">
🧩 <strong>Let's add a chunking strategy: RegexChunking!</strong>
</div>
<div class="bg-zinc-800 p-2 rounded">Using RegexChunking:</div>
<div>
<pre><code class="language-python">result = crawler.run(
url="https://www.nbcnews.com/business",
chunking_strategy=RegexChunking(patterns=["\n\n"])
)</code></pre>
</div>
<div class="bg-zinc-800 p-2 rounded">Using NlpSentenceChunking:</div>
<div>
<pre><code class="language-python">result = crawler.run(
url="https://www.nbcnews.com/business",
chunking_strategy=NlpSentenceChunking()
)</code></pre>
</div>
<!-- Step 5 -->
<div class="col-span-2 bg-yellow-500 p-2 rounded text-zinc-900">
🧠 <strong>Let's get smarter with an extraction strategy: CosineStrategy!</strong>
</div>
<div class="bg-zinc-800 p-2 rounded">Using CosineStrategy:</div>
<div>
<pre><code class="language-python">result = crawler.run(
url="https://www.nbcnews.com/business",
extraction_strategy=CosineStrategy(word_count_threshold=10, max_dist=0.2, linkage_method="ward", top_k=3)
)</code></pre>
</div>
<!-- Step 6 -->
<div class="col-span-2 bg-yellow-500 p-2 rounded text-zinc-900">
🤖 <strong>Time to bring in the big guns: LLMExtractionStrategy without instructions!</strong>
</div>
<div class="bg-zinc-800 p-2 rounded">Using LLMExtractionStrategy without instructions:</div>
<div>
<pre><code class="language-python">result = crawler.run(
url="https://www.nbcnews.com/business",
extraction_strategy=LLMExtractionStrategy(provider="openai/gpt-4o", api_token=os.getenv('OPENAI_API_KEY'))
)</code></pre>
</div>
<!-- Step 7 -->
<div class="col-span-2 bg-yellow-500 p-2 rounded text-zinc-900">
📜 <strong>Let's make it even more interesting: LLMExtractionStrategy with instructions!</strong>
</div>
<div class="bg-zinc-800 p-2 rounded">Using LLMExtractionStrategy with instructions:</div>
<div>
<pre><code class="language-python">result = crawler.run(
url="https://www.nbcnews.com/business",
extraction_strategy=LLMExtractionStrategy(
provider="openai/gpt-4o",
api_token=os.getenv('OPENAI_API_KEY'),
instruction="I am interested in only financial news"
)
)</code></pre>
</div>
<!-- Step 8 -->
<div class="col-span-2 bg-yellow-500 p-2 rounded text-zinc-900">
🎯 <strong>Targeted extraction: Let's use a CSS selector to extract only H2 tags!</strong>
</div>
<div class="bg-zinc-800 p-2 rounded">Using CSS selector to extract H2 tags:</div>
<div>
<pre><code class="language-python">result = crawler.run(
url="https://www.nbcnews.com/business",
css_selector="h2"
)</code></pre>
</div>
<!-- Step 9 -->
<div class="col-span-2 bg-yellow-500 p-2 rounded text-zinc-900">
🖱️ <strong>Let's get interactive: Passing JavaScript code to click 'Load More' button!</strong>
</div>
<div class="bg-zinc-800 p-2 rounded">Using JavaScript to click 'Load More' button:</div>
<div>
<pre><code class="language-python">js_code = """
const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More'));
loadMoreButton && loadMoreButton.click();
"""
crawler_strategy = LocalSeleniumCrawlerStrategy(js_code=js_code)
crawler = WebCrawler(crawler_strategy=crawler_strategy, always_by_pass_cache=True)
result = crawler.run(url="https://www.nbcnews.com/business")</code></pre>
</div>
<!-- Conclusion -->
<div class="col-span-2 bg-yellow-500 p-2 rounded text-zinc-900">
🎉
<strong
>Congratulations! You've made it through the Crawl4ai Quickstart Guide! Now go forth and crawl
the web like a pro! 🕸️</strong
>
</div>
</div>
</section>
<section class="bg-zinc-900 text-zinc-300 p-6 px-20">
<h1 class="text-3xl font-bold mb-4">Installation 💻</h1>
<p class="mb-4">
There are two ways to use Crawl4AI: as a library in your Python projects or as a standalone local
server.
</p>
<p class="mb-4">
You can also try Crawl4AI in a Google Colab
<a href="https://colab.research.google.com/drive/1wz8u30rvbq6Scodye9AGCw8Qg_Z8QGsk"
><img
src="https://colab.research.google.com/assets/colab-badge.svg"
alt="Open In Colab"
style="display: inline-block; width: 100px; height: 20px"
/></a>
</p>
<h2 class="text-2xl font-bold mb-2">Using Crawl4AI as a Library 📚</h2>
<p class="mb-4">To install Crawl4AI as a library, follow these steps:</p>
<ol class="list-decimal list-inside mb-4">
<li class="mb-2">
Install the package from GitHub:
<pre
class="bg-zinc-800 p-4 rounded mt-2 text-zinc-100"
><code>pip install git+https://github.com/unclecode/crawl4ai.git</code></pre>
</li>
<li class="mb-2">
Alternatively, you can clone the repository and install the package locally:
<pre
class="bg-zinc-800 p-4 rounded mt-2 text-zinc-100"
><code class = "language-python bash">virtualenv venv
source venv/bin/activate
git clone https://github.com/unclecode/crawl4ai.git
cd crawl4ai
pip install -e .
</code></pre>
</li>
<li>
Import the necessary modules in your Python script:
<pre
class="bg-zinc-800 p-4 rounded mt-2 text-zinc-100"
><code class = "language-python hljs">from crawl4ai.web_crawler import WebCrawler
from crawl4ai.chunking_strategy import *
from crawl4ai.extraction_strategy import *
import os
crawler = WebCrawler()
# Single page crawl
single_url = UrlModel(url='https://www.nbcnews.com/business', forced=False)
result = crawl4ai.fetch_page(
url='https://www.nbcnews.com/business',
word_count_threshold=5, # Minimum word count for a HTML tag to be considered as a worthy block
chunking_strategy= RegexChunking( patterns = ["\\n\\n"]), # Default is RegexChunking
extraction_strategy= CosineStrategy(word_count_threshold=10, max_dist=0.2, linkage_method='ward', top_k=3) # Default is CosineStrategy
# extraction_strategy= LLMExtractionStrategy(provider= "openai/gpt-4o", api_token = os.getenv('OPENAI_API_KEY')),
bypass_cache=False,
extract_blocks =True, # Whether to extract semantical blocks of text from the HTML
css_selector = "", # Eg: "div.article-body"
verbose=True,
include_raw_html=True, # Whether to include the raw HTML content in the response
)
print(result.model_dump())
</code></pre>
</li>
</ol>
<p class="mb-4">
For more information about how to run Crawl4AI as a local server, please refer to the
<a href="https://github.com/unclecode/crawl4ai" class="text-blue-400">GitHub repository</a>.
</p>
</section>
<section class="bg-zinc-900 text-zinc-300 p-6 px-20">
<h1 class="text-3xl font-bold mb-4">📖 Parameters</h1>
<div class="overflow-x-auto">
<table class="min-w-full bg-zinc-800 border border-zinc-700">
<thead>
<tr>
<th class="py-2 px-4 border-b border-zinc-700">Parameter</th>
<th class="py-2 px-4 border-b border-zinc-700">Description</th>
<th class="py-2 px-4 border-b border-zinc-700">Required</th>
<th class="py-2 px-4 border-b border-zinc-700">Default Value</th>
</tr>
</thead>
<tbody>
<tr>
<td class="py-2 px-4 border-b border-zinc-700">urls</td>
<td class="py-2 px-4 border-b border-zinc-700">
A list of URLs to crawl and extract data from.
</td>
<td class="py-2 px-4 border-b border-zinc-700">Yes</td>
<td class="py-2 px-4 border-b border-zinc-700">-</td>
</tr>
<tr>
<td class="py-2 px-4 border-b border-zinc-700">include_raw_html</td>
<td class="py-2 px-4 border-b border-zinc-700">
Whether to include the raw HTML content in the response.
</td>
<td class="py-2 px-4 border-b border-zinc-700">No</td>
<td class="py-2 px-4 border-b border-zinc-700">false</td>
</tr>
<tr>
<td class="py-2 px-4 border-b border-zinc-700">bypass_cache</td>
<td class="py-2 px-4 border-b border-zinc-700">
Whether to force a fresh crawl even if the URL has been previously crawled.
</td>
<td class="py-2 px-4 border-b border-zinc-700">No</td>
<td class="py-2 px-4 border-b border-zinc-700">false</td>
</tr>
<tr>
<td class="py-2 px-4 border-b border-zinc-700">extract_blocks</td>
<td class="py-2 px-4 border-b border-zinc-700">
Whether to extract semantical blocks of text from the HTML.
</td>
<td class="py-2 px-4 border-b border-zinc-700">No</td>
<td class="py-2 px-4 border-b border-zinc-700">true</td>
</tr>
<tr>
<td class="py-2 px-4 border-b border-zinc-700">word_count_threshold</td>
<td class="py-2 px-4 border-b border-zinc-700">
The minimum number of words a block must contain to be considered meaningful (minimum
value is 5).
</td>
<td class="py-2 px-4 border-b border-zinc-700">No</td>
<td class="py-2 px-4 border-b border-zinc-700">5</td>
</tr>
<tr>
<td class="py-2 px-4 border-b border-zinc-700">extraction_strategy</td>
<td class="py-2 px-4 border-b border-zinc-700">
The strategy to use for extracting content from the HTML (e.g., "CosineStrategy").
</td>
<td class="py-2 px-4 border-b border-zinc-700">No</td>
<td class="py-2 px-4 border-b border-zinc-700">CosineStrategy</td>
</tr>
<tr>
<td class="py-2 px-4 border-b border-zinc-700">chunking_strategy</td>
<td class="py-2 px-4 border-b border-zinc-700">
The strategy to use for chunking the text before processing (e.g., "RegexChunking").
</td>
<td class="py-2 px-4 border-b border-zinc-700">No</td>
<td class="py-2 px-4 border-b border-zinc-700">RegexChunking</td>
</tr>
<tr>
<td class="py-2 px-4 border-b border-zinc-700">css_selector</td>
<td class="py-2 px-4 border-b border-zinc-700">
The CSS selector to target specific parts of the HTML for extraction.
</td>
<td class="py-2 px-4 border-b border-zinc-700">No</td>
<td class="py-2 px-4 border-b border-zinc-700">None</td>
</tr>
<tr>
<td class="py-2 px-4">verbose</td>
<td class="py-2 px-4">Whether to enable verbose logging.</td>
<td class="py-2 px-4">No</td>
<td class="py-2 px-4">true</td>
</tr>
</tbody>
</table>
</div>
</section>
<section id="extraction" class="py-8 px-20">
<div class="overflow-x-auto mx-auto px-6">
<h2 class="text-2xl font-bold mb-4">Extraction Strategies</h2>
<div id="extraction-strategies" class="space-y-4"></div>
</div>
</section>
<section id="chunking" class="py-8 px-20">
<div class="overflow-x-auto mx-auto px-6">
<h2 class="text-2xl font-bold mb-4">Chunking Strategies</h2>
<div id="chunking-strategies" class="space-y-4"></div>
</div>
</section>
<section class="hero bg-zinc-900 py-8 px-20">
<div class="container mx-auto px-4">
<h2 class="text-3xl font-bold mb-4">🤔 Why building this?</h2>
<p class="text-lg mb-4">
In recent times, we've witnessed a surge of startups emerging, riding the AI hype wave and charging
for services that should rightfully be accessible to everyone. 🌍💸 One such example is scraping and
crawling web pages and transforming them into a format suitable for Large Language Models (LLMs).
🕸️🤖 We believe that building a business around this is not the right approach; instead, it should
definitely be open-source. 🆓🌟 So, if you possess the skills to build such tools and share our
philosophy, we invite you to join our "Robinhood" band and help set these products free for the
benefit of all. 🤝💪
</p>
</div>
</section>
<section class="installation py-8 px-20">
<div class="container mx-auto px-4">
<h2 class="text-2xl font-bold mb-4">⚙️ Installation</h2>
<p class="mb-4">
To install and run Crawl4AI as a library or a local server, please refer to the 📚
<a href="https://github.com/unclecode/crawl4ai" class="text-blue-400">GitHub repository</a>.
</p>
</div>
</section>
<footer class="bg-zinc-900 text-white py-4">
<div class="container mx-auto px-4">
<div class="flex justify-between items-center">
<p>© 2024 Crawl4AI. All rights reserved.</p>
<div class="social-links">
<a
href="https://github.com/unclecode/crawl4ai"
class="text-white hover:text-gray-300 mx-2"
target="_blank"
>😺 GitHub</a
>
<a
href="https://twitter.com/unclecode"
class="text-white hover:text-gray-300 mx-2"
target="_blank"
>🐦 Twitter</a
>
</div>
</div>
</div>
</footer>
<script>
// JavaScript to manage dynamic form changes and logic
document.getElementById("extraction-strategy-select").addEventListener("change", function () {
const strategy = this.value;
const providerModelSelect = document.getElementById("provider-model-select");
const tokenInput = document.getElementById("token-input");
if (strategy === "LLMExtractionStrategy") {
providerModelSelect.disabled = false;
tokenInput.disabled = false;
} else {
providerModelSelect.disabled = true;
tokenInput.disabled = true;
}
});
// Get the selected provider model and token from local storage
const storedProviderModel = localStorage.getItem("provider_model");
const storedToken = localStorage.getItem(storedProviderModel);
if (storedProviderModel) {
document.getElementById("provider-model-select").value = storedProviderModel;
}
if (storedToken) {
document.getElementById("token-input").value = storedToken;
}
// Handle provider model dropdown change
document.getElementById("provider-model-select").addEventListener("change", () => {
const selectedProviderModel = document.getElementById("provider-model-select").value;
const storedToken = localStorage.getItem(selectedProviderModel);
if (storedToken) {
document.getElementById("token-input").value = storedToken;
} else {
document.getElementById("token-input").value = "";
}
});
// Fetch total count from the database
axios
.get("/total-count")
.then((response) => {
document.getElementById("total-count").textContent = response.data.count;
})
.catch((error) => console.error(error));
// Handle crawl button click
document.getElementById("crawl-btn").addEventListener("click", () => {
// validate input to have both URL and API token
if (!document.getElementById("url-input").value || !document.getElementById("token-input").value) {
alert("Please enter both URL(s) and API token.");
return;
}
const selectedProviderModel = document.getElementById("provider-model-select").value;
const apiToken = document.getElementById("token-input").value;
const extractBlocks = document.getElementById("extract-blocks-checkbox").checked;
const bypassCache = document.getElementById("bypass-cache-checkbox").checked;
// Save the selected provider model and token to local storage
localStorage.setItem("provider_model", selectedProviderModel);
localStorage.setItem(selectedProviderModel, apiToken);
const urlsInput = document.getElementById("url-input").value;
const urls = urlsInput.split(",").map((url) => url.trim());
const data = {
urls: urls,
provider_model: selectedProviderModel,
api_token: apiToken,
include_raw_html: true,
bypass_cache: bypassCache,
extract_blocks: extractBlocks,
word_count_threshold: parseInt(document.getElementById("threshold").value),
extraction_strategy: document.getElementById("extraction-strategy-select").value,
chunking_strategy: document.getElementById("chunking-strategy-select").value,
css_selector: document.getElementById("css-selector").value,
verbose: true,
};
// save api token to local storage
localStorage.setItem("api_token", document.getElementById("token-input").value);
document.getElementById("loading").classList.remove("hidden");
//document.getElementById("result").classList.add("hidden");
//document.getElementById("code_help").classList.add("hidden");
axios
.post("/crawl", data)
.then((response) => {
const result = response.data.results[0];
const parsedJson = JSON.parse(result.extracted_content);
document.getElementById("json-result").textContent = JSON.stringify(parsedJson, null, 2);
document.getElementById("cleaned-html-result").textContent = result.cleaned_html;
document.getElementById("markdown-result").textContent = result.markdown;
// Update code examples dynamically
const extractionStrategy = data.extraction_strategy;
const isLLMExtraction = extractionStrategy === "LLMExtractionStrategy";
document.getElementById(
"curl-code"
).textContent = `curl -X POST -H "Content-Type: application/json" -d '${JSON.stringify({
...data,
api_token: isLLMExtraction ? "your_api_token" : undefined,
})}' http://crawl4ai.uccode.io/crawl`;
document.getElementById(
"python-code"
).textContent = `import requests\n\ndata = ${JSON.stringify(
{ ...data, api_token: isLLMExtraction ? "your_api_token" : undefined },
null,
2
)}\n\nresponse = requests.post("http://crawl4ai.uccode.io/crawl", json=data) # OR local host if your run locally \nprint(response.json())`;
document.getElementById(
"nodejs-code"
).textContent = `const axios = require('axios');\n\nconst data = ${JSON.stringify(
{ ...data, api_token: isLLMExtraction ? "your_api_token" : undefined },
null,
2
)};\n\naxios.post("http://crawl4ai.uccode.io/crawl", data) // OR local host if your run locally \n .then(response => console.log(response.data))\n .catch(error => console.error(error));`;
document.getElementById(
"library-code"
).textContent = `from crawl4ai.web_crawler import WebCrawler\nfrom crawl4ai.extraction_strategy import *\nfrom crawl4ai.chunking_strategy import *\n\ncrawler = WebCrawler()\ncrawler.warmup()\n\nresult = crawler.run(\n url='${
urls[0]
}',\n word_count_threshold=${data.word_count_threshold},\n extraction_strategy=${
isLLMExtraction
? `${extractionStrategy}(provider="${data.provider_model}", api_token="${data.api_token}")`
: extractionStrategy + "()"
},\n chunking_strategy=${data.chunking_strategy}(),\n bypass_cache=${
data.bypass_cache
},\n css_selector="${data.css_selector}"\n)\nprint(result)`;
// Highlight code syntax
hljs.highlightAll();
// Select JSON tab by default
document.querySelector('.tab-btn[data-tab="json"]').click();
document.getElementById("loading").classList.add("hidden");
document.getElementById("result").classList.remove("hidden");
document.getElementById("code_help").classList.remove("hidden");
// increment the total count
document.getElementById("total-count").textContent =
parseInt(document.getElementById("total-count").textContent) + 1;
})
.catch((error) => {
console.error(error);
document.getElementById("loading").classList.add("hidden");
});
});
// Handle tab clicks
document.querySelectorAll(".tab-btn").forEach((btn) => {
btn.addEventListener("click", () => {
const tab = btn.dataset.tab;
document
.querySelectorAll(".tab-btn")
.forEach((b) => b.classList.remove("bg-lime-700", "text-white"));
btn.classList.add("bg-lime-700", "text-white");
document.querySelectorAll(".tab-content.code pre").forEach((el) => el.classList.add("hidden"));
document.getElementById(`${tab}-result`).parentElement.classList.remove("hidden");
});
});
// Handle code tab clicks
document.querySelectorAll(".code-tab-btn").forEach((btn) => {
btn.addEventListener("click", () => {
const tab = btn.dataset.tab;
document
.querySelectorAll(".code-tab-btn")
.forEach((b) => b.classList.remove("bg-lime-700", "text-white"));
btn.classList.add("bg-lime-700", "text-white");
document.querySelectorAll(".tab-content.result pre").forEach((el) => el.classList.add("hidden"));
document.getElementById(`${tab}-code`).parentElement.classList.remove("hidden");
});
});
// Handle copy to clipboard button clicks
async function copyToClipboard(text) {
if (navigator.clipboard && navigator.clipboard.writeText) {
return navigator.clipboard.writeText(text);
} else {
return fallbackCopyTextToClipboard(text);
}
}
function fallbackCopyTextToClipboard(text) {
return new Promise((resolve, reject) => {
const textArea = document.createElement("textarea");
textArea.value = text;
// Avoid scrolling to bottom
textArea.style.top = "0";
textArea.style.left = "0";
textArea.style.position = "fixed";
document.body.appendChild(textArea);
textArea.focus();
textArea.select();
try {
const successful = document.execCommand("copy");
if (successful) {
resolve();
} else {
reject();
}
} catch (err) {
reject(err);
}
document.body.removeChild(textArea);
});
}
document.querySelectorAll(".copy-btn").forEach((btn) => {
btn.addEventListener("click", () => {
const target = btn.dataset.target;
const code = document.getElementById(target).textContent;
//navigator.clipboard.writeText(code).then(() => {
copyToClipboard(code).then(() => {
btn.textContent = "Copied!";
setTimeout(() => {
btn.textContent = "Copy";
}, 2000);
});
});
});
document.addEventListener("DOMContentLoaded", async () => {
try {
const extractionResponse = await fetch("/strategies/extraction");
const extractionStrategies = await extractionResponse.json();
const chunkingResponse = await fetch("/strategies/chunking");
const chunkingStrategies = await chunkingResponse.json();
renderStrategies("extraction-strategies", extractionStrategies);
renderStrategies("chunking-strategies", chunkingStrategies);
} catch (error) {
console.error("Error fetching strategies:", error);
}
});
function renderStrategies(containerId, strategies) {
const container = document.getElementById(containerId);
container.innerHTML = ""; // Clear any existing content
strategies = JSON.parse(strategies);
Object.entries(strategies).forEach(([strategy, description]) => {
const strategyElement = document.createElement("div");
strategyElement.classList.add("bg-zinc-800", "p-4", "rounded", "shadow-md", "docs-item");
const strategyDescription = document.createElement("div");
strategyDescription.classList.add("text-gray-300", "prose", "prose-sm");
strategyDescription.innerHTML = marked.parse(description);
strategyElement.appendChild(strategyDescription);
container.appendChild(strategyElement);
});
}
// Highlight code syntax
hljs.highlightAll();
</script>
</body>
</html>

View File

@@ -1,73 +0,0 @@
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<title>Crawl4AI</title>
<link rel="preconnect" href="https://fonts.googleapis.com" />
<link rel="preconnect" href="https://fonts.gstatic.com" crossorigin />
<link href="https://fonts.googleapis.com/css2?family=Inter:wght@100..900&display=swap" rel="stylesheet" />
<!-- <link href="https://cdn.jsdelivr.net/npm/tailwindcss@3.4.3/dist/tailwind.min.css" rel="stylesheet" /> -->
<script src="https://cdn.tailwindcss.com"></script>
<script src="https://cdn.jsdelivr.net/npm/axios/dist/axios.min.js"></script>
<link rel="stylesheet" href="/pages/app.css" />
<link
rel="stylesheet"
href="https://cdnjs.cloudflare.com/ajax/libs/highlight.js/11.7.0/styles/monokai.min.css"
/>
<script src="https://cdn.jsdelivr.net/npm/marked/marked.min.js"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/highlight.js/11.7.0/highlight.min.js"></script>
</head>
<body class="bg-black text-gray-200">
<header class="bg-zinc-950 text-lime-500 py-4 flex">
<div class="mx-auto px-4">
<h1 class="text-2xl font-bold">🔥🕷️ Crawl4AI: Web Data for your Thoughts</h1>
</div>
<div class="mx-auto px-4 flex font-bold text-xl gap-2">
<span>📊 Total Website Processed</span>
<span id="total-count" class="text-lime-400">2</span>
</div>
</header>
{{ try_it | safe }}
<div class="mx-auto p-4 bg-zinc-950 text-lime-500 min-h-screen">
<div class="container mx-auto">
<div class="flex h-full px-20">
<div class="sidebar w-1/4 p-4">
<h2 class="text-lg font-bold mb-4">Outline</h2>
<ul>
<li class="mb-2"><a href="#" data-target="installation">Installation</a></li>
<li class="mb-2"><a href="#" data-target="how-to-guide">How to Guide</a></li>
<li class="mb-2"><a href="#" data-target="chunking-strategies">Chunking Strategies</a></li>
<li class="mb-2">
<a href="#" data-target="extraction-strategies">Extraction Strategies</a>
</li>
</ul>
</div>
<!-- Main Content -->
<div class="w-3/4 p-4">
{{installation | safe}} {{how_to_guide | safe}}
<section id="chunking-strategies" class="content-section">
<h1 class="text-2xl font-bold">Chunking Strategies</h1>
<p>Content for chunking strategies...</p>
</section>
<section id="extraction-strategies" class="content-section">
<h1 class="text-2xl font-bold">Extraction Strategies</h1>
<p>Content for extraction strategies...</p>
</section>
</div>
</div>
</div>
</div>
{{ footer | safe }}
<script script src="/pages/app.js"></script>
</body>
</html>

View File

@@ -1,425 +0,0 @@
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<title>Crawl4AI</title>
<link rel="preconnect" href="https://fonts.googleapis.com" />
<link rel="preconnect" href="https://fonts.gstatic.com" crossorigin />
<link href="https://fonts.googleapis.com/css2?family=Inter:wght@100..900&display=swap" rel="stylesheet" />
<link href="https://cdn.jsdelivr.net/npm/tailwindcss@2.2.19/dist/tailwind.min.css" rel="stylesheet" />
<script src="https://cdn.jsdelivr.net/npm/axios/dist/axios.min.js"></script>
<link
rel="stylesheet"
href="https://cdnjs.cloudflare.com/ajax/libs/highlight.js/11.7.0/styles/vs2015.min.css"
/>
<script src="https://cdnjs.cloudflare.com/ajax/libs/highlight.js/11.7.0/highlight.min.js"></script>
<style>
:root {
--ifm-font-size-base: 100%;
--ifm-line-height-base: 1.65;
--ifm-font-family-base: system-ui, -apple-system, Segoe UI, Roboto, Ubuntu, Cantarell, Noto Sans,
sans-serif, BlinkMacSystemFont, "Segoe UI", Helvetica, Arial, sans-serif, "Apple Color Emoji",
"Segoe UI Emoji", "Segoe UI Symbol";
}
html {
-webkit-font-smoothing: antialiased;
-webkit-text-size-adjust: 100%;
text-size-adjust: 100%;
font: var(--ifm-font-size-base) / var(--ifm-line-height-base) var(--ifm-font-family-base);
}
body {
background-color: #1a202c;
color: #fff;
}
.tab-content {
max-height: 400px;
overflow: auto;
}
pre {
white-space: pre-wrap;
font-size: 14px;
}
pre code {
width: 100%;
}
</style>
</head>
<body>
<header class="bg-gray-900 text-white py-4">
<div class="container mx-auto px-4">
<h1 class="text-2xl font-bold">🔥🕷️ Crawl4AI: Open-source LLM Friendly Web scraper</h1>
</div>
</header>
<section class="try-it py-8 pb-20">
<div class="container mx-auto px-4">
<h2 class="text-2xl font-bold mb-4">Try It Now</h2>
<div class="mb-4 flex w-full gap-2">
<input
type="text"
id="url-input"
value="https://kidocode.com"
class="border border-gray-600 rounded px-4 py-2 flex-grow bg-gray-800 text-white"
placeholder="Enter URL(s) separated by commas"
/>
<select
id="provider-model-select"
class="border border-gray-600 rounded px-4 py-2 bg-gray-800 text-white"
>
<!-- Add your option values here -->
<option value="groq/llama3-70b-8192">groq/llama3-70b-8192</option>
<option value="groq/llama3-8b-8192">groq/llama3-8b-8192</option>
<option value="openai/gpt-4-turbo">gpt-4-turbo</option>
<option value="openai/gpt-3.5-turbo">gpt-3.5-turbo</option>
<option value="anthropic/claude-3-haiku-20240307">claude-3-haiku</option>
<option value="anthropic/claude-3-opus-20240229">claude-3-opus</option>
<option value="anthropic/claude-3-sonnet-20240229">claude-3-sonnet</option>
</select>
<input
type="password"
id="token-input"
class="border border-gray-600 rounded px-4 py-2 flex-grow bg-gray-800 text-white"
placeholder="Enter Groq API token"
/>
<div class="flex items-center justify-center">
<input type="checkbox" id="extract-blocks-checkbox" class="mr-2" checked>
<label for="extract-blocks-checkbox" class="text-white">Extract Blocks</label>
</div>
<button id="crawl-btn" class="bg-blue-600 text-white px-4 py-2 rounded">Crawl</button>
</div>
<div class="grid grid-cols-1 md:grid-cols-2 gap-8">
<div id="loading" class="hidden mt-4">
<p>Loading...</p>
</div>
<div id="result" class="tab-container flex-1 h-full flex-col">
<div class="tab-buttons flex gap-2">
<button class="tab-btn px-4 py-2 bg-gray-700 rounded-t" data-tab="json">JSON</button>
<button class="tab-btn px-4 py-2 bg-gray-700 rounded-t" data-tab="cleaned-html">
Cleaned HTML
</button>
<button class="tab-btn px-4 py-2 bg-gray-700 rounded-t" data-tab="markdown">
Markdown
</button>
</div>
<div class="tab-content code bg-gray-800 p-2 rounded h-full flex-1 border border-gray-600">
<pre class="h-full flex"><code id="json-result" class="language-json "></code></pre>
<pre
class="hidden h-full flex"
><code id="cleaned-html-result" class="language-html "></code></pre>
<pre
class="hidden h-full flex"
><code id="markdown-result" class="language-markdown "></code></pre>
</div>
</div>
<div id="code_help" class="tab-container flex-1 h-full">
<div class="tab-buttons flex gap-2">
<button class="code-tab-btn px-4 py-2 bg-gray-700 rounded-t" data-tab="curl">cURL</button>
<button class="code-tab-btn px-4 py-2 bg-gray-700 rounded-t" data-tab="python">
Python
</button>
<button class="code-tab-btn px-4 py-2 bg-gray-700 rounded-t" data-tab="nodejs">
Node.js
</button>
</div>
<div class="tab-content result bg-gray-800 p-2 rounded h-full flex-1 border border-gray-600">
<pre class="h-full flex relative">
<code id="curl-code" class="language-bash"></code>
<button class="absolute top-2 right-2 bg-gray-700 text-white px-2 py-1 rounded copy-btn" data-target="curl-code">Copy</button>
</pre>
<pre class="hidden h-full flex relative">
<code id="python-code" class="language-python"></code>
<button class="absolute top-2 right-2 bg-gray-700 text-white px-2 py-1 rounded copy-btn" data-target="python-code">Copy</button>
</pre>
<pre class="hidden h-full flex relative">
<code id="nodejs-code" class="language-javascript"></code>
<button class="absolute top-2 right-2 bg-gray-700 text-white px-2 py-1 rounded copy-btn" data-target="nodejs-code">Copy</button>
</pre>
</div>
</div>
</div>
</div>
</section>
<section class="hero bg-gray-900 py-8">
<div class="container mx-auto px-4">
<h2 class="text-3xl font-bold mb-4">🤔 Why building this?</h2>
<p class="text-lg mb-4">
In recent times, we've seen numerous startups emerging, riding the AI hype wave and charging for
services that should rightfully be accessible to everyone. 🌍💸 One for example is to scrap and crawl
a web page, and transform it o a form suitable for LLM. We don't think one should build a business
out of this, but definilty should be opened source. So if you possess the skills to build such things
and you have such philosphy you should join our "Robinhood" band and help set
these products free. 🆓🤝
</p>
</div>
</section>
<section class="installation py-8">
<div class="container mx-auto px-4">
<h2 class="text-2xl font-bold mb-4">⚙️ Installation</h2>
<p class="mb-4">
To install and run Crawl4AI locally or on your own service, the best way is to use Docker. 🐳 Follow
these steps:
</p>
<ol class="list-decimal list-inside mb-4">
<li>
Clone the GitHub repository: 📥
<code>git clone https://github.com/unclecode/crawl4ai.git</code>
</li>
<li>Navigate to the project directory: 📂 <code>cd crawl4ai</code></li>
<li>
Build the Docker image: 🛠️ <code>docker build -t crawl4ai .</code> On Mac, follow: 🍎
<code>docker build --platform linux/amd64 -t crawl4ai .</code>
</li>
<li>Run the Docker container: ▶️ <code>docker run -p 8000:80 crawl4ai</code></li>
</ol>
<p>
For more detailed instructions and advanced configuration options, please refer to the 📚
<a href="https://github.com/unclecode/crawl4ai" class="text-blue-400">GitHub repository</a>.
</p>
</div>
</section>
<footer class="bg-gray-900 text-white py-4">
<div class="container mx-auto px-4">
<div class="flex justify-between items-center">
<p>© 2024 Crawl4AI. All rights reserved.</p>
<div class="social-links">
<a
href="https://github.com/unclecode/crawl4ai"
class="text-white hover:text-gray-300 mx-2"
target="_blank"
>😺 GitHub</a
>
<a
href="https://twitter.com/unclecode"
class="text-white hover:text-gray-300 mx-2"
target="_blank"
>🐦 Twitter</a
>
<a
href="https://discord.gg/your-invite-link"
class="text-white hover:text-gray-300 mx-2"
target="_blank"
>💬 Discord</a
>
</div>
</div>
</div>
</footer>
<script>
// Get the selected provider model and token from local storage
const storedProviderModel = localStorage.getItem("provider_model");
const storedToken = localStorage.getItem(storedProviderModel);
if (storedProviderModel) {
document.getElementById("provider-model-select").value = storedProviderModel;
}
if (storedToken) {
document.getElementById("token-input").value = storedToken;
}
// Handle provider model dropdown change
document.getElementById("provider-model-select").addEventListener("change", () => {
const selectedProviderModel = document.getElementById("provider-model-select").value;
const storedToken = localStorage.getItem(selectedProviderModel);
if (storedToken) {
document.getElementById("token-input").value = storedToken;
} else {
document.getElementById("token-input").value = "";
}
});
// Fetch total count from the database
axios
.get("/total-count")
.then((response) => {
document.getElementById("total-count").textContent = response.data.count;
})
.catch((error) => console.error(error));
// Handle crawl button click
document.getElementById("crawl-btn").addEventListener("click", () => {
// validate input to have both URL and API token
if (!document.getElementById("url-input").value || !document.getElementById("token-input").value) {
alert("Please enter both URL(s) and API token.");
return;
}
const selectedProviderModel = document.getElementById("provider-model-select").value;
const apiToken = document.getElementById("token-input").value;
const extractBlocks = document.getElementById("extract-blocks-checkbox").checked;
// Save the selected provider model and token to local storage
localStorage.setItem("provider_model", selectedProviderModel);
localStorage.setItem(selectedProviderModel, apiToken);
const urlsInput = document.getElementById("url-input").value;
const urls = urlsInput.split(",").map((url) => url.trim());
const data = {
urls: urls,
provider_model: selectedProviderModel,
api_token: apiToken,
include_raw_html: true,
forced: false,
extract_blocks: extractBlocks,
};
// save api token to local storage
localStorage.setItem("api_token", document.getElementById("token-input").value);
document.getElementById("loading").classList.remove("hidden");
document.getElementById("result").classList.add("hidden");
document.getElementById("code_help").classList.add("hidden");
axios
.post("/crawl", data)
.then((response) => {
const result = response.data.results[0];
const parsedJson = JSON.parse(result.extracted_content);
document.getElementById("json-result").textContent = JSON.stringify(parsedJson, null, 2);
document.getElementById("cleaned-html-result").textContent = result.cleaned_html;
document.getElementById("markdown-result").textContent = result.markdown;
// Update code examples dynamically
// Update code examples dynamically
document.getElementById(
"curl-code"
).textContent = `curl -X POST -H "Content-Type: application/json" -d '${JSON.stringify({
...data,
api_token: "your_api_token",
})}' http://localhost:8000/crawl`;
document.getElementById(
"python-code"
).textContent = `import requests\n\ndata = ${JSON.stringify(
{ ...data, api_token: "your_api_token" },
null,
2
)}\n\nresponse = requests.post("http://localhost:8000/crawl", json=data)\nprint(response.json())`;
document.getElementById(
"nodejs-code"
).textContent = `const axios = require('axios');\n\nconst data = ${JSON.stringify(
{ ...data, api_token: "your_api_token" },
null,
2
)};\n\naxios.post("http://localhost:8000/crawl", data)\n .then(response => console.log(response.data))\n .catch(error => console.error(error));`;
// Highlight code syntax
hljs.highlightAll();
// Select JSON tab by default
document.querySelector('.tab-btn[data-tab="json"]').click();
document.getElementById("loading").classList.add("hidden");
document.getElementById("result").classList.remove("hidden");
document.getElementById("code_help").classList.remove("hidden");
})
.catch((error) => {
console.error(error);
document.getElementById("loading").classList.add("hidden");
});
});
// Handle tab clicks
document.querySelectorAll(".tab-btn").forEach((btn) => {
btn.addEventListener("click", () => {
const tab = btn.dataset.tab;
document
.querySelectorAll(".tab-btn")
.forEach((b) => b.classList.remove("bg-blue-600", "text-white"));
btn.classList.add("bg-blue-600", "text-white");
document.querySelectorAll(".tab-content.code pre").forEach((el) => el.classList.add("hidden"));
document.getElementById(`${tab}-result`).parentElement.classList.remove("hidden");
});
});
// Handle code tab clicks
document.querySelectorAll(".code-tab-btn").forEach((btn) => {
btn.addEventListener("click", () => {
const tab = btn.dataset.tab;
document
.querySelectorAll(".code-tab-btn")
.forEach((b) => b.classList.remove("bg-blue-600", "text-white"));
btn.classList.add("bg-blue-600", "text-white");
document.querySelectorAll(".tab-content.result pre").forEach((el) => el.classList.add("hidden"));
document.getElementById(`${tab}-code`).parentElement.classList.remove("hidden");
});
});
// Handle copy to clipboard button clicks
document.querySelectorAll(".copy-btn").forEach((btn) => {
btn.addEventListener("click", () => {
const target = btn.dataset.target;
const code = document.getElementById(target).textContent;
navigator.clipboard.writeText(code).then(() => {
btn.textContent = "Copied!";
setTimeout(() => {
btn.textContent = "Copy";
}, 2000);
});
});
});
document.getElementById("crawl-btn").addEventListener("click", () => {
const urlsInput = document.getElementById("url-input").value;
const urls = urlsInput.split(",").map(url => url.trim());
const apiToken = document.getElementById("token-input").value;
const selectedProviderModel = document.getElementById("provider-model-select").value;
const extractBlocks = document.getElementById("extract-blocks-checkbox").checked;
const data = {
urls: urls,
provider_model: selectedProviderModel,
api_token: apiToken,
include_raw_html: true,
forced: false,
extract_blocks: extractBlocks
};
localStorage.setItem("api_token", apiToken);
document.getElementById("loading").classList.remove("hidden");
document.getElementById("result").classList.add("hidden");
document.getElementById("code_help").classList.add("hidden");
axios.post("/crawl", data)
.then(response => {
const taskId = response.data.task_id;
pollTaskStatus(taskId);
})
.catch(error => {
console.error('Error during fetch:', error);
document.getElementById("loading").classList.add("hidden");
});
});
function pollTaskStatus(taskId) {
axios.get(`/task/${taskId}`)
.then(response => {
const task = response.data;
if (task.status === 'done') {
displayResults(task.results[0]);
} else if (task.status === 'pending') {
setTimeout(() => pollTaskStatus(taskId), 2000); // Poll every 2 seconds
} else {
console.error('Task failed:', task.error);
document.getElementById("loading").classList.add("hidden");
}
})
.catch(error => {
console.error('Error polling task status:', error);
document.getElementById("loading").classList.add("hidden");
});
}
</script>
</body>
</html>

View File

@@ -1,36 +0,0 @@
<section class="hero bg-zinc-900 py-8 px-20 text-zinc-400">
<div class="container mx-auto px-4">
<h2 class="text-3xl font-bold mb-4">🤔 Why building this?</h2>
<p class="text-lg mb-4">
In recent times, we've witnessed a surge of startups emerging, riding the AI hype wave and charging
for services that should rightfully be accessible to everyone. 🌍💸 One such example is scraping and
crawling web pages and transforming them into a format suitable for Large Language Models (LLMs).
🕸️🤖 We believe that building a business around this is not the right approach; instead, it should
definitely be open-source. 🆓🌟 So, if you possess the skills to build such tools and share our
philosophy, we invite you to join our "Robinhood" band and help set these products free for the
benefit of all. 🤝💪
</p>
</div>
</section>
<footer class="bg-zinc-900 text-zinc-400 py-4">
<div class="container mx-auto px-4">
<div class="flex justify-between items-center">
<p>© 2024 Crawl4AI. All rights reserved.</p>
<div class="social-links">
<a
href="https://github.com/unclecode/crawl4ai"
class="text-zinc-400 hover:text-gray-300 mx-2"
target="_blank"
>😺 GitHub</a
>
<a
href="https://twitter.com/unclecode"
class="text-zinc-400 hover:text-gray-300 mx-2"
target="_blank"
>🐦 Twitter</a
>
</div>
</div>
</div>
</footer>

View File

@@ -1,174 +0,0 @@
<section id="how-to-guide" class="content-section">
<h1 class="text-2xl font-bold">How to Guide</h1>
<div class="flex flex-col gap-4 p-4 bg-zinc-900 text-lime-500">
<!-- Step 1 -->
<div class="col-span-2 bg-lime-800 p-2 rounded text-zinc-50">
🌟
<strong
>Welcome to the Crawl4ai Quickstart Guide! Let's dive into some web crawling
fun!</strong
>
</div>
<div class="">
First Step: Create an instance of WebCrawler and call the
<code>warmup()</code> function.
</div>
<div>
<pre><code class="language-python">crawler = WebCrawler()
crawler.warmup()</code></pre>
</div>
<!-- Step 2 -->
<div class="col-span-2 bg-lime-800 p-2 rounded text-zinc-50">
🧠 <strong>Understanding 'bypass_cache' and 'include_raw_html' parameters:</strong>
</div>
<div class="">First crawl (caches the result):</div>
<div>
<pre><code class="language-python">result = crawler.run(url="https://www.nbcnews.com/business")</code></pre>
</div>
<div class="">Second crawl (Force to crawl again):</div>
<div>
<pre><code class="language-python">result = crawler.run(url="https://www.nbcnews.com/business", bypass_cache=True)</code></pre>
<div class="bg-red-900 p-2 text-zinc-50">
⚠️ Don't forget to set <code>`bypass_cache`</code> to True if you want to try different strategies for the same URL. Otherwise, the cached result will be returned. You can also set <code>`always_by_pass_cache`</code> in constructor to True to always bypass the cache.
</div>
</div>
<div class="">Crawl result without raw HTML content:</div>
<div>
<pre><code class="language-python">result = crawler.run(url="https://www.nbcnews.com/business", include_raw_html=False)</code></pre>
</div>
<!-- Step 3 -->
<div class="col-span-2 bg-lime-800 p-2 rounded text-zinc-50">
📄
<strong
>The 'include_raw_html' parameter, when set to True, includes the raw HTML content
in the response. By default, it is set to True.</strong
>
</div>
<div class="">Set <code>always_by_pass_cache</code> to True:</div>
<div>
<pre><code class="language-python">crawler.always_by_pass_cache = True</code></pre>
</div>
<!-- Step 3.5 Screenshot -->
<div class="col-span-2 bg-lime-800 p-2 rounded text-zinc-50">
📸
<strong>Let's take a screenshot of the page!</strong>
</div>
<div>
<pre><code class="language-python">result = crawler.run(
url="https://www.nbcnews.com/business",
screenshot=True
)
with open("screenshot.png", "wb") as f:
f.write(base64.b64decode(result.screenshot))</code></pre>
</div>
<!-- Step 4 -->
<div class="col-span-2 bg-lime-800 p-2 rounded text-zinc-50">
🧩 <strong>Let's add a chunking strategy: RegexChunking!</strong>
</div>
<div class="">Using RegexChunking:</div>
<div>
<pre><code class="language-python">result = crawler.run(
url="https://www.nbcnews.com/business",
chunking_strategy=RegexChunking(patterns=["\n\n"])
)</code></pre>
</div>
<div class="">Using NlpSentenceChunking:</div>
<div>
<pre><code class="language-python">result = crawler.run(
url="https://www.nbcnews.com/business",
chunking_strategy=NlpSentenceChunking()
)</code></pre>
</div>
<!-- Step 5 -->
<div class="col-span-2 bg-lime-800 p-2 rounded text-zinc-50">
🧠 <strong>Let's get smarter with an extraction strategy: CosineStrategy!</strong>
</div>
<div class="">Using CosineStrategy:</div>
<div>
<pre><code class="language-python">result = crawler.run(
url="https://www.nbcnews.com/business",
extraction_strategy=CosineStrategy(word_count_threshold=10, max_dist=0.2, linkage_method="ward", top_k=3)
)</code></pre>
</div>
<!-- Step 6 -->
<div class="col-span-2 bg-lime-800 p-2 rounded text-zinc-50">
🤖
<strong
>Time to bring in the big guns: LLMExtractionStrategy without instructions!</strong
>
</div>
<div class="">Using LLMExtractionStrategy without instructions:</div>
<div>
<pre><code class="language-python">result = crawler.run(
url="https://www.nbcnews.com/business",
extraction_strategy=LLMExtractionStrategy(provider="openai/gpt-4o", api_token=os.getenv('OPENAI_API_KEY'))
)</code></pre>
</div>
<!-- Step 7 -->
<div class="col-span-2 bg-lime-800 p-2 rounded text-zinc-50">
📜
<strong
>Let's make it even more interesting: LLMExtractionStrategy with
instructions!</strong
>
</div>
<div class="">Using LLMExtractionStrategy with instructions:</div>
<div>
<pre><code class="language-python">result = crawler.run(
url="https://www.nbcnews.com/business",
extraction_strategy=LLMExtractionStrategy(
provider="openai/gpt-4o",
api_token=os.getenv('OPENAI_API_KEY'),
instruction="I am interested in only financial news"
)
)</code></pre>
</div>
<!-- Step 8 -->
<div class="col-span-2 bg-lime-800 p-2 rounded text-zinc-50">
🎯
<strong>Targeted extraction: Let's use a CSS selector to extract only H2 tags!</strong>
</div>
<div class="">Using CSS selector to extract H2 tags:</div>
<div>
<pre><code class="language-python">result = crawler.run(
url="https://www.nbcnews.com/business",
css_selector="h2"
)</code></pre>
</div>
<!-- Step 9 -->
<div class="col-span-2 bg-lime-800 p-2 rounded text-zinc-50">
🖱️
<strong
>Let's get interactive: Passing JavaScript code to click 'Load More' button!</strong
>
</div>
<div class="">Using JavaScript to click 'Load More' button:</div>
<div>
<pre><code class="language-python">js_code = ["""
const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More'));
loadMoreButton && loadMoreButton.click();
"""]
crawler = WebCrawler(verbos=crawler_strategy, always_by_pass_cache=True)
result = crawler.run(url="https://www.nbcnews.com/business", js = js_code)</code></pre>
<div class="">Remember that you can pass multiple JavaScript code snippets in the list. They all will be executed in the order they are passed.</div>
</div>
<!-- Conclusion -->
<div class="col-span-2 bg-lime-800 p-2 rounded text-zinc-50">
🎉
<strong
>Congratulations! You've made it through the Crawl4ai Quickstart Guide! Now go forth
and crawl the web like a pro! 🕸️</strong
>
</div>
</div>
</section>

View File

@@ -1,65 +0,0 @@
<section id="installation" class="content-section active">
<h1 class="text-2xl font-bold">Installation 💻</h1>
<p class="mb-4">
There are three ways to use Crawl4AI:
<ol class="list-decimal list-inside mb-4">
<li class="">
As a library
</li>
<li class="">
As a local server (Docker)
</li>
<li class="">
As a Google Colab notebook. <a href="https://colab.research.google.com/drive/1wz8u30rvbq6Scodye9AGCw8Qg_Z8QGsk"
><img
src="https://colab.research.google.com/assets/colab-badge.svg"
alt="Open In Colab"
style="display: inline-block; width: 100px; height: 20px"
/></a>
</li>
</p>
<p class="my-4">To install Crawl4AI as a library, follow these steps:</p>
<ol class="list-decimal list-inside mb-4">
<li class="mb-4">
Install the package from GitHub:
<pre
class="bg-zinc-800 p-4 rounded mt-2 text-zinc-100"
><code>virtualenv venv
source venv/bin/activate
pip install "crawl4ai[all] @ git+https://github.com/unclecode/crawl4ai.git"
</code></pre>
</li>
<li class="mb-4">
Run the following command to load the required models. This is optional, but it will boost the performance and speed of the crawler. You need to do this only once.
<pre
class="bg-zinc-800 p-4 rounded mt-2 text-zinc-100"
><code>crawl4ai-download-models</code></pre>
</li>
<li class="mb-4">
Alternatively, you can clone the repository and install the package locally:
<pre
class="bg-zinc-800 p-4 rounded mt-2 text-zinc-100"
><code class = "language-python bash">virtualenv venv
source venv/bin/activate
git clone https://github.com/unclecode/crawl4ai.git
cd crawl4ai
pip install -e .[all]
</code></pre>
</li>
<li class="">
Use docker to run the local server:
<pre
class="bg-zinc-800 p-4 rounded mt-2 text-zinc-100"
><code class = "language-python bash">docker build -t crawl4ai .
# docker build --platform linux/amd64 -t crawl4ai . For Mac users
docker run -d -p 8000:80 crawl4ai</code></pre>
</li>
</ol>
<p class="mb-4">
For more information about how to run Crawl4AI as a local server, please refer to the
<a href="https://github.com/unclecode/crawl4ai" class="text-blue-400">GitHub repository</a>.
</p>
</section>

View File

@@ -1,217 +0,0 @@
<section class="try-it py-8 px-16 pb-20 bg-zinc-900 overflow-hidden">
<div class="container mx-auto ">
<h2 class="text-2xl font-bold mb-4 text-lime-500">Try It Now</h2>
<div class="flex gap-4">
<div class="flex flex-col flex-1 gap-2">
<div class="flex flex-col">
<label for="url-input" class="text-lime-500 font-bold text-xs">URL(s)</label>
<input
type="text"
id="url-input"
value="https://www.nbcnews.com/business"
class="border border-zinc-700 rounded px-4 py-0 bg-zinc-900 text-zinc-300"
placeholder="Enter URL(s) separated by commas"
/>
</div>
<div class="flex gap-2">
<div class="flex flex-col">
<label for="threshold" class="text-lime-500 font-bold text-xs">Min Words Threshold</label>
<select
id="threshold"
class="border border-zinc-700 rounded px-4 py-1 bg-zinc-900 text-zinc-300"
>
<option value="1">1</option>
<option value="5">5</option>
<option value="10" selected>10</option>
<option value="15">15</option>
<option value="20">20</option>
<option value="25">25</option>
</select>
</div>
<div class="flex flex-col flex-1">
<label for="css-selector" class="text-lime-500 font-bold text-xs">CSS Selector</label>
<input
type="text"
id="css-selector"
class="border border-zinc-700 rounded px-4 py-0 bg-zinc-900 text-zinc-300 placeholder-lime-700"
placeholder="CSS Selector (e.g. .content, #main, article)"
/>
</div>
</div>
<div class="flex gap-2">
<div class="flex flex-col">
<label for="extraction-strategy-select" class="text-lime-500 font-bold text-xs"
>Extraction Strategy</label
>
<select
id="extraction-strategy-select"
class="border border-zinc-700 rounded px-4 py-1 bg-zinc-900 text-zinc-300"
>
<option value="NoExtractionStrategy" selected>NoExtractionStrategy</option>
<option value="CosineStrategy">CosineStrategy</option>
<option value="LLMExtractionStrategy">LLMExtractionStrategy</option>
</select>
</div>
<div class="flex flex-col">
<label for="chunking-strategy-select" class="text-lime-500 font-bold text-xs"
>Chunking Strategy</label
>
<select
id="chunking-strategy-select"
class="border border-zinc-700 rounded px-4 py-1 bg-zinc-900 text-zinc-300"
>
<option value="RegexChunking">RegexChunking</option>
<option value="NlpSentenceChunking">NlpSentenceChunking</option>
<option value="TopicSegmentationChunking">TopicSegmentationChunking</option>
<option value="FixedLengthWordChunking">FixedLengthWordChunking</option>
<option value="SlidingWindowChunking">SlidingWindowChunking</option>
</select>
</div>
</div>
<div id = "llm_settings" class="flex gap-2 hidden hidden">
<div class="flex flex-col">
<label for="provider-model-select" class="text-lime-500 font-bold text-xs"
>Provider Model</label
>
<select
id="provider-model-select"
class="border border-zinc-700 rounded px-4 py-1 bg-zinc-900 text-zinc-300"
>
<option value="groq/llama3-70b-8192">groq/llama3-70b-8192</option>
<option value="groq/llama3-8b-8192">groq/llama3-8b-8192</option>
<option value="groq/mixtral-8x7b-32768">groq/mixtral-8x7b-32768</option>
<option value="openai/gpt-4-turbo">gpt-4-turbo</option>
<option value="openai/gpt-3.5-turbo">gpt-3.5-turbo</option>
<option value="openai/gpt-4o">gpt-4o</option>
<option value="anthropic/claude-3-haiku-20240307">claude-3-haiku</option>
<option value="anthropic/claude-3-opus-20240229">claude-3-opus</option>
<option value="anthropic/claude-3-sonnet-20240229">claude-3-sonnet</option>
</select>
</div>
<div class="flex flex-col flex-1">
<label for="token-input" class="text-lime-500 font-bold text-xs">API Token</label>
<input
type="password"
id="token-input"
class="border border-zinc-700 rounded px-4 py-0 bg-zinc-900 text-zinc-300"
placeholder="Enter Groq API token"
/>
</div>
</div>
<div class="flex gap-2">
<!-- Add two textarea one for getting Keyword Filter and another one Instruction, make both grow whole with-->
<div id = "semantic_filter_div" class="flex flex-col flex-1 hidden">
<label for="keyword-filter" class="text-lime-500 font-bold text-xs">Keyword Filter</label>
<textarea
id="semantic_filter"
rows="3"
class="border border-zinc-700 rounded px-4 py-0 bg-zinc-900 text-zinc-300 placeholder-zinc-700"
placeholder="Enter keywords for CosineStrategy to narrow down the content."
></textarea>
</div>
<div id = "instruction_div" class="flex flex-col flex-1 hidden">
<label for="instruction" class="text-lime-500 font-bold text-xs">Instruction</label>
<textarea
id="instruction"
rows="3"
class="border border-zinc-700 rounded px-4 py-0 bg-zinc-900 text-zinc-300 placeholder-zinc-700"
placeholder="Enter instruction for the LLMEstrategy to instruct the model."
></textarea>
</div>
</div>
<div class="flex gap-3">
<div class="flex items-center gap-2">
<input type="checkbox" id="bypass-cache-checkbox" />
<label for="bypass-cache-checkbox" class="text-lime-500 font-bold">Bypass Cache</label>
</div>
<div class="flex items-center gap-2">
<input type="checkbox" id="screenshot-checkbox" checked />
<label for="screenshot-checkbox" class="text-lime-500 font-bold">Screenshot</label>
</div>
<div class="flex items-center gap-2 hidden">
<input type="checkbox" id="extract-blocks-checkbox" />
<label for="extract-blocks-checkbox" class="text-lime-500 font-bold">Extract Blocks</label>
</div>
<button id="crawl-btn" class="bg-lime-600 text-black font-bold px-4 py-0 rounded">Crawl</button>
</div>
</div>
<div id="loading" class="hidden">
<p class="text-white">Loading... Please wait.</p>
</div>
<div id="result" class="flex-1 overflow-x-auto">
<div class="tab-buttons flex gap-2">
<button class="tab-btn px-4 py-1 text-sm bg-zinc-700 rounded-t text-lime-500" data-tab="json">
JSON
</button>
<button
class="tab-btn px-4 py-1 text-sm bg-zinc-700 rounded-t text-lime-500"
data-tab="cleaned-html"
>
Cleaned HTML
</button>
<button class="tab-btn px-4 py-1 text-sm bg-zinc-700 rounded-t text-lime-500" data-tab="markdown">
Markdown
</button>
<button class="tab-btn px-4 py-1 text-sm bg-zinc-700 rounded-t text-lime-500" data-tab="media">
Medias
</button>
<button class="tab-btn px-4 py-1 text-sm bg-zinc-700 rounded-t text-lime-500" data-tab="screenshot">
Screenshot
</button>
</div>
<div class="tab-content code bg-zinc-900 p-2 rounded h-full border border-zinc-700 text-sm">
<pre class="h-full flex"><code id="json-result" class="language-json"></code></pre>
<pre class="hidden h-full flex"><code id="cleaned-html-result" class="language-html"></code></pre>
<pre class="hidden h-full flex"><code id="markdown-result" class="language-markdown"></code></pre>
<pre class="hidden h-full flex"><code id="media-result" class="language-json"></code></pre>
<pre class="hidden h-full flex"><code id="screenshot-result"></code></pre>
</div>
</div>
<div id="code_help" class="flex-1 overflow-x-auto">
<div class="tab-buttons flex gap-2">
<button class="code-tab-btn px-4 py-1 text-sm bg-zinc-700 rounded-t text-lime-500" data-tab="curl">
cURL
</button>
<button
class="code-tab-btn px-4 py-1 text-sm bg-zinc-700 rounded-t text-lime-500"
data-tab="library"
>
Python
</button>
<button
class="code-tab-btn px-4 py-1 text-sm bg-zinc-700 rounded-t text-lime-500"
data-tab="python"
>
REST API
</button>
<!-- <button
class="code-tab-btn px-4 py-1 text-sm bg-zinc-700 rounded-t text-lime-500"
data-tab="nodejs"
>
Node.js
</button> -->
</div>
<div class="tab-content result bg-zinc-900 p-2 rounded h-full border border-zinc-700 text-sm">
<pre class="h-full flex relative overflow-x-auto">
<code id="curl-code" class="language-bash"></code>
<button class="absolute top-2 right-2 bg-zinc-700 text-white px-2 py-1 rounded copy-btn" data-target="curl-code">Copy</button>
</pre>
<pre class="hidden h-full flex relative overflow-x-auto">
<code id="python-code" class="language-python"></code>
<button class="absolute top-2 right-2 bg-zinc-700 text-white px-2 py-1 rounded copy-btn" data-target="python-code">Copy</button>
</pre>
<pre class="hidden h-full flex relative overflow-x-auto">
<code id="nodejs-code" class="language-javascript"></code>
<button class="absolute top-2 right-2 bg-zinc-700 text-white px-2 py-1 rounded copy-btn" data-target="nodejs-code">Copy</button>
</pre>
<pre class="hidden h-full flex relative overflow-x-auto">
<code id="library-code" class="language-python"></code>
<button class="absolute top-2 right-2 bg-zinc-700 text-white px-2 py-1 rounded copy-btn" data-target="library-code">Copy</button>
</pre>
</div>
</div>
</div>
</div>
</section>

View File

@@ -1,434 +0,0 @@
<div class="w-3/4 p-4">
<section id="installation" class="content-section active">
<h1 class="text-2xl font-bold">Installation 💻</h1>
<p class="mb-4">There are three ways to use Crawl4AI:</p>
<ol class="list-decimal list-inside mb-4">
<li class="">As a library</li>
<li class="">As a local server (Docker)</li>
<li class="">
As a Google Colab notebook.
<a href="https://colab.research.google.com/drive/1wz8u30rvbq6Scodye9AGCw8Qg_Z8QGsk"
><img
src="https://colab.research.google.com/assets/colab-badge.svg"
alt="Open In Colab"
style="display: inline-block; width: 100px; height: 20px"
/></a>
</li>
<p></p>
<p class="my-4">To install Crawl4AI as a library, follow these steps:</p>
<ol class="list-decimal list-inside mb-4">
<li class="mb-4">
Install the package from GitHub:
<pre
class="bg-zinc-800 p-4 rounded mt-2 text-zinc-100"
><code class="hljs language-bash">pip install git+https://github.com/unclecode/crawl4ai.git</code></pre>
</li>
<li class="mb-4">
Alternatively, you can clone the repository and install the package locally:
<pre
class="bg-zinc-800 p-4 rounded mt-2 text-zinc-100"
><code class="language-python bash hljs">virtualenv venv
source venv/<span class="hljs-built_in">bin</span>/activate
git clone https://github.com/unclecode/crawl4ai.git
cd crawl4ai
pip install -e .
</code></pre>
</li>
<li class="">
Use docker to run the local server:
<pre
class="bg-zinc-800 p-4 rounded mt-2 text-zinc-100"
><code class="language-python bash hljs">docker build -t crawl4ai .
<span class="hljs-comment"># docker build --platform linux/amd64 -t crawl4ai . For Mac users</span>
docker run -d -p <span class="hljs-number">8000</span>:<span class="hljs-number">80</span> crawl4ai</code></pre>
</li>
</ol>
<p class="mb-4">
For more information about how to run Crawl4AI as a local server, please refer to the
<a href="https://github.com/unclecode/crawl4ai" class="text-blue-400">GitHub repository</a>.
</p>
</ol>
</section>
<section id="how-to-guide" class="content-section">
<h1 class="text-2xl font-bold">How to Guide</h1>
<div class="flex flex-col gap-4 p-4 bg-zinc-900 text-lime-500">
<!-- Step 1 -->
<div class="col-span-2 bg-lime-800 p-2 rounded text-zinc-50">
🌟
<strong>Welcome to the Crawl4ai Quickstart Guide! Let's dive into some web crawling fun!</strong>
</div>
<div class="">
First Step: Create an instance of WebCrawler and call the
<code>warmup()</code> function.
</div>
<div>
<pre><code class="language-python hljs">crawler = WebCrawler()
crawler.warmup()</code></pre>
</div>
<!-- Step 2 -->
<div class="col-span-2 bg-lime-800 p-2 rounded text-zinc-50">
🧠 <strong>Understanding 'bypass_cache' and 'include_raw_html' parameters:</strong>
</div>
<div class="">First crawl (caches the result):</div>
<div>
<pre><code class="language-python hljs">result = crawler.run(url=<span class="hljs-string">"https://www.nbcnews.com/business"</span>)</code></pre>
</div>
<div class="">Second crawl (Force to crawl again):</div>
<div>
<pre><code class="language-python hljs">result = crawler.run(url=<span class="hljs-string">"https://www.nbcnews.com/business"</span>, bypass_cache=<span class="hljs-literal">True</span>)</code></pre>
<div class="bg-red-900 p-2 text-zinc-50">
⚠️ Don't forget to set <code>`bypass_cache`</code> to True if you want to try different strategies
for the same URL. Otherwise, the cached result will be returned. You can also set
<code>`always_by_pass_cache`</code> in constructor to True to always bypass the cache.
</div>
</div>
<div class="">Crawl result without raw HTML content:</div>
<div>
<pre><code class="language-python hljs">result = crawler.run(url=<span class="hljs-string">"https://www.nbcnews.com/business"</span>, include_raw_html=<span class="hljs-literal">False</span>)</code></pre>
</div>
<!-- Step 3 -->
<div class="col-span-2 bg-lime-800 p-2 rounded text-zinc-50">
📄
<strong
>The 'include_raw_html' parameter, when set to True, includes the raw HTML content in the response.
By default, it is set to True.</strong
>
</div>
<div class="">Set <code>always_by_pass_cache</code> to True:</div>
<div>
<pre><code class="language-python hljs">crawler.always_by_pass_cache = <span class="hljs-literal">True</span></code></pre>
</div>
<!-- Step 4 -->
<div class="col-span-2 bg-lime-800 p-2 rounded text-zinc-50">
🧩 <strong>Let's add a chunking strategy: RegexChunking!</strong>
</div>
<div class="">Using RegexChunking:</div>
<div>
<pre><code class="language-python hljs">result = crawler.run(
url=<span class="hljs-string">"https://www.nbcnews.com/business"</span>,
chunking_strategy=RegexChunking(patterns=[<span class="hljs-string">"\n\n"</span>])
)</code></pre>
</div>
<div class="">Using NlpSentenceChunking:</div>
<div>
<pre><code class="language-python hljs">result = crawler.run(
url=<span class="hljs-string">"https://www.nbcnews.com/business"</span>,
chunking_strategy=NlpSentenceChunking()
)</code></pre>
</div>
<!-- Step 5 -->
<div class="col-span-2 bg-lime-800 p-2 rounded text-zinc-50">
🧠 <strong>Let's get smarter with an extraction strategy: CosineStrategy!</strong>
</div>
<div class="">Using CosineStrategy:</div>
<div>
<pre><code class="language-python hljs">result = crawler.run(
url=<span class="hljs-string">"https://www.nbcnews.com/business"</span>,
extraction_strategy=CosineStrategy(word_count_threshold=<span class="hljs-number">20</span>, max_dist=<span class="hljs-number">0.2</span>, linkage_method=<span class="hljs-string">"ward"</span>, top_k=<span class="hljs-number">3</span>)
)</code></pre>
</div>
<!-- Step 6 -->
<div class="col-span-2 bg-lime-800 p-2 rounded text-zinc-50">
🤖
<strong>Time to bring in the big guns: LLMExtractionStrategy without instructions!</strong>
</div>
<div class="">Using LLMExtractionStrategy without instructions:</div>
<div>
<pre><code class="language-python hljs">result = crawler.run(
url=<span class="hljs-string">"https://www.nbcnews.com/business"</span>,
extraction_strategy=LLMExtractionStrategy(provider=<span class="hljs-string">"openai/gpt-4o"</span>, api_token=os.getenv(<span class="hljs-string">'OPENAI_API_KEY'</span>))
)</code></pre>
</div>
<!-- Step 7 -->
<div class="col-span-2 bg-lime-800 p-2 rounded text-zinc-50">
📜
<strong>Let's make it even more interesting: LLMExtractionStrategy with instructions!</strong>
</div>
<div class="">Using LLMExtractionStrategy with instructions:</div>
<div>
<pre><code class="language-python hljs">result = crawler.run(
url=<span class="hljs-string">"https://www.nbcnews.com/business"</span>,
extraction_strategy=LLMExtractionStrategy(
provider=<span class="hljs-string">"openai/gpt-4o"</span>,
api_token=os.getenv(<span class="hljs-string">'OPENAI_API_KEY'</span>),
instruction=<span class="hljs-string">"I am interested in only financial news"</span>
)
)</code></pre>
</div>
<!-- Step 8 -->
<div class="col-span-2 bg-lime-800 p-2 rounded text-zinc-50">
🎯
<strong>Targeted extraction: Let's use a CSS selector to extract only H2 tags!</strong>
</div>
<div class="">Using CSS selector to extract H2 tags:</div>
<div>
<pre><code class="language-python hljs">result = crawler.run(
url=<span class="hljs-string">"https://www.nbcnews.com/business"</span>,
css_selector=<span class="hljs-string">"h2"</span>
)</code></pre>
</div>
<!-- Step 9 -->
<div class="col-span-2 bg-lime-800 p-2 rounded text-zinc-50">
🖱️
<strong>Let's get interactive: Passing JavaScript code to click 'Load More' button!</strong>
</div>
<div class="">Using JavaScript to click 'Load More' button:</div>
<div>
<pre><code class="language-python hljs">js_code = <span class="hljs-string">"""
const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button =&gt; button.textContent.includes('Load More'));
loadMoreButton &amp;&amp; loadMoreButton.click();
"""</span>
crawler_strategy = LocalSeleniumCrawlerStrategy(js_code=js_code)
crawler = WebCrawler(crawler_strategy=crawler_strategy, always_by_pass_cache=<span class="hljs-literal">True</span>)
result = crawler.run(url=<span class="hljs-string">"https://www.nbcnews.com/business"</span>)</code></pre>
</div>
<!-- Conclusion -->
<div class="col-span-2 bg-lime-800 p-2 rounded text-zinc-50">
🎉
<strong
>Congratulations! You've made it through the Crawl4ai Quickstart Guide! Now go forth and crawl the
web like a pro! 🕸️</strong
>
</div>
</div>
</section>
<section id="chunking-strategies" class="content-section">
<div class="bg-zinc-800 p-4 rounded shadow-md docs-item">
<div class="text-gray-300 prose prose-sm">
<h3>RegexChunking</h3>
<p>
<code>RegexChunking</code> is a text chunking strategy that splits a given text into smaller parts
using regular expressions. This is useful for preparing large texts for processing by language
models, ensuring they are divided into manageable segments.
</p>
<h4>Constructor Parameters:</h4>
<ul>
<li>
<code>patterns</code> (list, optional): A list of regular expression patterns used to split the
text. Default is to split by double newlines (<code>['\n\n']</code>).
</li>
</ul>
<h4>Example usage:</h4>
<pre><code class="language-python">chunker = RegexChunking(patterns=[r'\n\n', r'\. '])
chunks = chunker.chunk("This is a sample text. It will be split into chunks.")
</code></pre>
</div>
</div>
<div class="bg-zinc-800 p-4 rounded shadow-md docs-item">
<div class="text-gray-300 prose prose-sm">
<h3>NlpSentenceChunking</h3>
<p>
<code>NlpSentenceChunking</code> uses a natural language processing model to chunk a given text into
sentences. This approach leverages SpaCy to accurately split text based on sentence boundaries.
</p>
<h4>Constructor Parameters:</h4>
<ul>
<li>
None.
</li>
</ul>
<h4>Example usage:</h4>
<pre><code class="language-python">chunker = NlpSentenceChunking()
chunks = chunker.chunk("This is a sample text. It will be split into sentences.")
</code></pre>
</div>
</div>
<div class="bg-zinc-800 p-4 rounded shadow-md docs-item">
<div class="text-gray-300 prose prose-sm">
<h3>TopicSegmentationChunking</h3>
<p>
<code>TopicSegmentationChunking</code> uses the TextTiling algorithm to segment a given text into
topic-based chunks. This method identifies thematic boundaries in the text.
</p>
<h4>Constructor Parameters:</h4>
<ul>
<li>
<code>num_keywords</code> (int, optional): The number of keywords to extract for each topic
segment. Default is <code>3</code>.
</li>
</ul>
<h4>Example usage:</h4>
<pre><code class="language-python">chunker = TopicSegmentationChunking(num_keywords=3)
chunks = chunker.chunk("This is a sample text. It will be split into topic-based segments.")
</code></pre>
</div>
</div>
<div class="bg-zinc-800 p-4 rounded shadow-md docs-item">
<div class="text-gray-300 prose prose-sm">
<h3>FixedLengthWordChunking</h3>
<p>
<code>FixedLengthWordChunking</code> splits a given text into chunks of fixed length, based on the
number of words.
</p>
<h4>Constructor Parameters:</h4>
<ul>
<li>
<code>chunk_size</code> (int, optional): The number of words in each chunk. Default is
<code>100</code>.
</li>
</ul>
<h4>Example usage:</h4>
<pre><code class="language-python">chunker = FixedLengthWordChunking(chunk_size=100)
chunks = chunker.chunk("This is a sample text. It will be split into fixed-length word chunks.")
</code></pre>
</div>
</div>
<div class="bg-zinc-800 p-4 rounded shadow-md docs-item">
<div class="text-gray-300 prose prose-sm">
<h3>SlidingWindowChunking</h3>
<p>
<code>SlidingWindowChunking</code> uses a sliding window approach to chunk a given text. Each chunk
has a fixed length, and the window slides by a specified step size.
</p>
<h4>Constructor Parameters:</h4>
<ul>
<li>
<code>window_size</code> (int, optional): The number of words in each chunk. Default is
<code>100</code>.
</li>
<li>
<code>step</code> (int, optional): The number of words to slide the window. Default is
<code>50</code>.
</li>
</ul>
<h4>Example usage:</h4>
<pre><code class="language-python">chunker = SlidingWindowChunking(window_size=100, step=50)
chunks = chunker.chunk("This is a sample text. It will be split using a sliding window approach.")
</code></pre>
</div>
</div>
</section>
<section id="extraction-strategies" class="content-section">
<div class="bg-zinc-800 p-4 rounded shadow-md docs-item">
<div class="text-gray-300 prose prose-sm">
<h3>NoExtractionStrategy</h3>
<p>
<code>NoExtractionStrategy</code> is a basic extraction strategy that returns the entire HTML
content without any modification. It is useful for cases where no specific extraction is required.
Only clean html, and amrkdown.
</p>
<h4>Constructor Parameters:</h4>
<p>None.</p>
<h4>Example usage:</h4>
<pre><code class="language-python">extractor = NoExtractionStrategy()
extracted_content = extractor.extract(url, html)
</code></pre>
</div>
</div>
<div class="bg-zinc-800 p-4 rounded shadow-md docs-item">
<div class="text-gray-300 prose prose-sm">
<h3>LLMExtractionStrategy</h3>
<p>
<code>LLMExtractionStrategy</code> uses a Language Model (LLM) to extract meaningful blocks or
chunks from the given HTML content. This strategy leverages an external provider for language model
completions.
</p>
<h4>Constructor Parameters:</h4>
<ul>
<li>
<code>provider</code> (str, optional): The provider to use for the language model completions.
Default is <code>DEFAULT_PROVIDER</code> (e.g., openai/gpt-4).
</li>
<li>
<code>api_token</code> (str, optional): The API token for the provider. If not provided, it will
try to load from the environment variable <code>OPENAI_API_KEY</code>.
</li>
<li>
<code>instruction</code> (str, optional): An instruction to guide the LLM on how to perform the
extraction. This allows users to specify the type of data they are interested in or set the tone
of the response. Default is <code>None</code>.
</li>
</ul>
<h4>Example usage:</h4>
<pre><code class="language-python">extractor = LLMExtractionStrategy(provider='openai', api_token='your_api_token', instruction='Extract only news about AI.')
extracted_content = extractor.extract(url, html)
</code></pre>
<p>
By providing clear instructions, users can tailor the extraction process to their specific needs,
enhancing the relevance and utility of the extracted content.
</p>
</div>
</div>
<div class="bg-zinc-800 p-4 rounded shadow-md docs-item">
<div class="text-gray-300 prose prose-sm">
<h3>CosineStrategy</h3>
<p>
<code>CosineStrategy</code> uses hierarchical clustering based on cosine similarity to extract
clusters of text from the given HTML content. This strategy is suitable for identifying related
content sections.
</p>
<h4>Constructor Parameters:</h4>
<ul>
<li>
<code>semantic_filter</code> (str, optional): A string containing keywords for filtering relevant
documents before clustering. If provided, documents are filtered based on their cosine
similarity to the keyword filter embedding. Default is <code>None</code>.
</li>
<li>
<code>word_count_threshold</code> (int, optional): Minimum number of words per cluster. Default
is <code>20</code>.
</li>
<li>
<code>max_dist</code> (float, optional): The maximum cophenetic distance on the dendrogram to
form clusters. Default is <code>0.2</code>.
</li>
<li>
<code>linkage_method</code> (str, optional): The linkage method for hierarchical clustering.
Default is <code>'ward'</code>.
</li>
<li>
<code>top_k</code> (int, optional): Number of top categories to extract. Default is
<code>3</code>.
</li>
<li>
<code>model_name</code> (str, optional): The model name for embedding generation. Default is
<code>'BAAI/bge-small-en-v1.5'</code>.
</li>
</ul>
<h4>Example usage:</h4>
<pre><code class="language-python">extractor = CosineStrategy(semantic_filter='artificial intelligence', word_count_threshold=10, max_dist=0.2, linkage_method='ward', top_k=3, model_name='BAAI/bge-small-en-v1.5')
extracted_content = extractor.extract(url, html)
</code></pre>
<h4>Cosine Similarity Filtering</h4>
<p>
When a <code>semantic_filter</code> is provided, the <code>CosineStrategy</code> applies an
embedding-based filtering process to select relevant documents before performing hierarchical
clustering.
</p>
</div>
</div>
<div class="bg-zinc-800 p-4 rounded shadow-md docs-item">
<div class="text-gray-300 prose prose-sm">
<h3>TopicExtractionStrategy</h3>
<p>
<code>TopicExtractionStrategy</code> uses the TextTiling algorithm to segment the HTML content into
topics and extracts keywords for each segment. This strategy is useful for identifying and
summarizing thematic content.
</p>
<h4>Constructor Parameters:</h4>
<ul>
<li>
<code>num_keywords</code> (int, optional): Number of keywords to represent each topic segment.
Default is <code>3</code>.
</li>
</ul>
<h4>Example usage:</h4>
<pre><code class="language-python">extractor = TopicExtractionStrategy(num_keywords=3)
extracted_content = extractor.extract(url, html)
</code></pre>
</div>
</div>
</section>
</div>

View File

@@ -1,16 +1,16 @@
aiosqlite~=0.20
html2text~=2024.2
lxml~=5.3
litellm~=1.48
litellm>=1.53.1
numpy>=1.26.0,<3
pillow~=10.4
playwright>=1.47,<1.48
playwright>=1.49.0
python-dotenv~=1.0
requests~=2.26
beautifulsoup4~=4.12
tf-playwright-stealth~=1.0
tf-playwright-stealth>=1.1.0
xxhash~=3.4
rank-bm25~=0.2
aiofiles~=24.0
aiofiles>=24.1.0
colorama~=0.4
snowballstemmer~=2.2

View File

@@ -9,10 +9,17 @@ import asyncio
# Create the .crawl4ai folder in the user's home directory if it doesn't exist
# If the folder already exists, remove the cache folder
crawl4ai_folder = os.getenv("CRAWL4_AI_BASE_DIRECTORY", Path.home()) / ".crawl4ai"
base_dir = os.getenv("CRAWL4_AI_BASE_DIRECTORY")
crawl4ai_folder = Path(base_dir) if base_dir else Path.home()
crawl4ai_folder = crawl4ai_folder / ".crawl4ai"
cache_folder = crawl4ai_folder / "cache"
content_folders = ['html_content', 'cleaned_html', 'markdown_content',
'extracted_content', 'screenshots']
content_folders = [
"html_content",
"cleaned_html",
"markdown_content",
"extracted_content",
"screenshots",
]
# Clean up old cache if exists
if cache_folder.exists():
@@ -28,7 +35,7 @@ for folder in content_folders:
__location__ = os.path.realpath(os.path.join(os.getcwd(), os.path.dirname(__file__)))
with open(os.path.join(__location__, "requirements.txt")) as f:
requirements = f.read().splitlines()
with open("crawl4ai/__version__.py") as f:
for line in f:
if line.startswith("__version__"):
@@ -37,11 +44,12 @@ with open("crawl4ai/__version__.py") as f:
# Define requirements
default_requirements = requirements
torch_requirements = ["torch", "nltk", "scikit-learn"]
torch_requirements = ["torch", "nltk", "scikit-learn"]
transformer_requirements = ["transformers", "tokenizers"]
cosine_similarity_requirements = ["torch", "transformers", "nltk" ]
cosine_similarity_requirements = ["torch", "transformers", "nltk"]
sync_requirements = ["selenium"]
def install_playwright():
print("Installing Playwright browsers...")
try:
@@ -49,16 +57,22 @@ def install_playwright():
print("Playwright installation completed successfully.")
except subprocess.CalledProcessError as e:
print(f"Error during Playwright installation: {e}")
print("Please run 'python -m playwright install' manually after the installation.")
print(
"Please run 'python -m playwright install' manually after the installation."
)
except Exception as e:
print(f"Unexpected error during Playwright installation: {e}")
print("Please run 'python -m playwright install' manually after the installation.")
print(
"Please run 'python -m playwright install' manually after the installation."
)
def run_migration():
"""Initialize database during installation"""
try:
print("Starting database initialization...")
from crawl4ai.async_database import async_db_manager
asyncio.run(async_db_manager.initialize())
print("Database initialization completed successfully.")
except ImportError:
@@ -67,12 +81,14 @@ def run_migration():
print(f"Warning: Database initialization failed: {e}")
print("Database will be initialized on first use")
class PostInstallCommand(install):
def run(self):
install.run(self)
install_playwright()
# run_migration()
setup(
name="Crawl4AI",
version=version,
@@ -84,18 +100,23 @@ setup(
author_email="unclecode@kidocode.com",
license="MIT",
packages=find_packages(),
install_requires=default_requirements + ["playwright", "aiofiles"], # Added aiofiles
install_requires=default_requirements
+ ["playwright", "aiofiles"], # Added aiofiles
extras_require={
"torch": torch_requirements,
"transformer": transformer_requirements,
"cosine": cosine_similarity_requirements,
"sync": sync_requirements,
"all": default_requirements + torch_requirements + transformer_requirements + cosine_similarity_requirements + sync_requirements,
"all": default_requirements
+ torch_requirements
+ transformer_requirements
+ cosine_similarity_requirements
+ sync_requirements,
},
entry_points={
'console_scripts': [
'crawl4ai-download-models=crawl4ai.model_loader:main',
'crawl4ai-migrate=crawl4ai.migrations:main', # Added migration command
"console_scripts": [
"crawl4ai-download-models=crawl4ai.model_loader:main",
"crawl4ai-migrate=crawl4ai.migrations:main", # Added migration command
],
},
classifiers=[
@@ -110,6 +131,6 @@ setup(
],
python_requires=">=3.7",
cmdclass={
'install': PostInstallCommand,
"install": PostInstallCommand,
},
)
)

View File

@@ -13,8 +13,8 @@ parent_dir = os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__f
sys.path.append(parent_dir)
__location__ = os.path.realpath(os.path.join(os.getcwd(), os.path.dirname(__file__)))
from crawl4ai.content_scrapping_strategy import WebScrapingStrategy
from crawl4ai.content_scrapping_strategy import WebScrapingStrategy as WebScrapingStrategyCurrent
from crawl4ai.content_scraping_strategy import WebScrapingStrategy
from crawl4ai.content_scraping_strategy import WebScrapingStrategy as WebScrapingStrategyCurrent
# from crawl4ai.content_scrapping_strategy_current import WebScrapingStrategy as WebScrapingStrategyCurrent
@dataclass

View File

@@ -0,0 +1,165 @@
# ## Issue #236
# - **Last Updated:** 2024-11-11 01:42:14
# - **Title:** [user data crawling opens two windows, unable to control correct user browser](https://github.com/unclecode/crawl4ai/issues/236)
# - **State:** open
import os, sys, time
parent_dir = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
sys.path.append(parent_dir)
__location__ = os.path.realpath( os.path.join(os.getcwd(), os.path.dirname(__file__)))
import asyncio
import os
import time
from typing import Dict, Any
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
# Get current directory
__location__ = os.path.realpath(os.path.join(os.getcwd(), os.path.dirname(__file__)))
def print_test_result(name: str, result: Dict[str, Any], execution_time: float):
"""Helper function to print test results."""
print(f"\n{'='*20} {name} {'='*20}")
print(f"Execution time: {execution_time:.4f} seconds")
# Save markdown to files
for key, content in result.items():
if isinstance(content, str):
with open(__location__ + f"/output/{name.lower()}_{key}.md", "w") as f:
f.write(content)
# # Print first few lines of each markdown version
# for key, content in result.items():
# if isinstance(content, str):
# preview = '\n'.join(content.split('\n')[:3])
# print(f"\n{key} (first 3 lines):")
# print(preview)
# print(f"Total length: {len(content)} characters")
def test_basic_markdown_conversion():
"""Test basic markdown conversion with links."""
with open(__location__ + "/data/wikipedia.html", "r") as f:
cleaned_html = f.read()
generator = DefaultMarkdownGenerator()
start_time = time.perf_counter()
result = generator.generate_markdown(
cleaned_html=cleaned_html,
base_url="https://en.wikipedia.org"
)
execution_time = time.perf_counter() - start_time
print_test_result("Basic Markdown Conversion", {
'raw': result.raw_markdown,
'with_citations': result.markdown_with_citations,
'references': result.references_markdown
}, execution_time)
# Basic assertions
assert result.raw_markdown, "Raw markdown should not be empty"
assert result.markdown_with_citations, "Markdown with citations should not be empty"
assert result.references_markdown, "References should not be empty"
assert "" in result.markdown_with_citations, "Citations should use ⟨⟩ brackets"
assert "## References" in result.references_markdown, "Should contain references section"
def test_relative_links():
"""Test handling of relative links with base URL."""
markdown = """
Here's a [relative link](/wiki/Apple) and an [absolute link](https://example.com).
Also an [image](/images/test.png) and another [page](/wiki/Banana).
"""
generator = DefaultMarkdownGenerator()
result = generator.generate_markdown(
cleaned_html=markdown,
base_url="https://en.wikipedia.org"
)
assert "https://en.wikipedia.org/wiki/Apple" in result.references_markdown
assert "https://example.com" in result.references_markdown
assert "https://en.wikipedia.org/images/test.png" in result.references_markdown
def test_duplicate_links():
"""Test handling of duplicate links."""
markdown = """
Here's a [link](/test) and another [link](/test) and a [different link](/other).
"""
generator = DefaultMarkdownGenerator()
result = generator.generate_markdown(
cleaned_html=markdown,
base_url="https://example.com"
)
# Count citations in markdown
citations = result.markdown_with_citations.count("⟨1⟩")
assert citations == 2, "Same link should use same citation number"
def test_link_descriptions():
"""Test handling of link titles and descriptions."""
markdown = """
Here's a [link with title](/test "Test Title") and a [link with description](/other) to test.
"""
generator = DefaultMarkdownGenerator()
result = generator.generate_markdown(
cleaned_html=markdown,
base_url="https://example.com"
)
assert "Test Title" in result.references_markdown, "Link title should be in references"
assert "link with description" in result.references_markdown, "Link text should be in references"
def test_performance_large_document():
"""Test performance with large document."""
with open(__location__ + "/data/wikipedia.md", "r") as f:
markdown = f.read()
# Test with multiple iterations
iterations = 5
times = []
generator = DefaultMarkdownGenerator()
for i in range(iterations):
start_time = time.perf_counter()
result = generator.generate_markdown(
cleaned_html=markdown,
base_url="https://en.wikipedia.org"
)
end_time = time.perf_counter()
times.append(end_time - start_time)
avg_time = sum(times) / len(times)
print(f"\n{'='*20} Performance Test {'='*20}")
print(f"Average execution time over {iterations} iterations: {avg_time:.4f} seconds")
print(f"Min time: {min(times):.4f} seconds")
print(f"Max time: {max(times):.4f} seconds")
def test_image_links():
"""Test handling of image links."""
markdown = """
Here's an ![image](/image.png "Image Title") and another ![image](/other.jpg).
And a regular [link](/page).
"""
generator = DefaultMarkdownGenerator()
result = generator.generate_markdown(
cleaned_html=markdown,
base_url="https://example.com"
)
assert "![" in result.markdown_with_citations, "Image markdown syntax should be preserved"
assert "Image Title" in result.references_markdown, "Image title should be in references"
if __name__ == "__main__":
print("Running markdown generation strategy tests...")
test_basic_markdown_conversion()
test_relative_links()
test_duplicate_links()
test_link_descriptions()
test_performance_large_document()
test_image_links()