diff --git a/README.md b/README.md
index 29bae309..0826ac77 100644
--- a/README.md
+++ b/README.md
@@ -1,13 +1,22 @@
# 🚀🤖 Crawl4AI: Open-source LLM Friendly Web Crawler & Scraper.
+
+

[](https://github.com/unclecode/crawl4ai/stargazers)
-
[](https://github.com/unclecode/crawl4ai/network/members)
-[](https://github.com/unclecode/crawl4ai/issues)
-[](https://github.com/unclecode/crawl4ai/pulls)
+
+[](https://badge.fury.io/py/crawl4ai)
+[](https://pypi.org/project/crawl4ai/)
+[](https://pepy.tech/project/crawl4ai)
+
+[](https://crawl4ai.readthedocs.io/)
[](https://github.com/unclecode/crawl4ai/blob/main/LICENSE)
+[](https://github.com/psf/black)
+[](https://github.com/PyCQA/bandit)
+
+
Crawl4AI is the #1 trending GitHub repository, actively maintained by a vibrant community. It delivers blazing-fast, AI-ready web crawling tailored for LLMs, AI agents, and data pipelines. Open source, flexible, and built for real-time performance, Crawl4AI empowers developers with unmatched speed, precision, and deployment ease.
@@ -28,20 +37,28 @@ Crawl4AI is the #1 trending GitHub repository, actively maintained by a vibrant
1. Install Crawl4AI:
```bash
+# Install the package
pip install crawl4ai
-crawl4ai-setup # Setup the browser
+crawl4ai-setup
+
+# Install Playwright with system dependencies (recommended)
+playwright install --with-deps
+
+# Or install specific browsers:
+playwright install --with-deps chrome # Recommended for Colab/Linux
```
2. Run a simple web crawl:
```python
import asyncio
-from crawl4ai import AsyncWebCrawler, CacheMode
+from crawl4ai import *
async def main():
- async with AsyncWebCrawler(verbose=True) as crawler:
- result = await crawler.arun(url="https://www.nbcnews.com/business")
- # Soone will be change to result.markdown
- print(result.markdown_v2.raw_markdown)
+ async with AsyncWebCrawler() as crawler:
+ result = await crawler.arun(
+ url="https://www.nbcnews.com/business",
+ )
+ print(result.markdown)
if __name__ == "__main__":
asyncio.run(main())
@@ -200,193 +217,26 @@ pip install -e ".[all]" # Install all optional features
-🚀 One-Click Deployment
+🐳 Docker Deployment
-Deploy your own instance of Crawl4AI with one click:
+> 🚀 **Major Changes Coming!** We're developing a completely new Docker implementation that will make deployment even more efficient and seamless. The current Docker setup is being deprecated in favor of this new solution.
-[](https://www.digitalocean.com/?repo=https://github.com/unclecode/crawl4ai/tree/0.3.74&refcode=a0780f1bdb3d&utm_campaign=Referral_Invite&utm_medium=Referral_Program&utm_source=badge)
+### Current Docker Support
-> 💡 **Recommended specs**: 4GB RAM minimum. Select "professional-xs" or higher when deploying for stable operation.
+The existing Docker implementation is being deprecated and will be replaced soon. If you still need to use Docker with the current version:
-The deploy will:
-- Set up a Docker container with Crawl4AI
-- Configure Playwright and all dependencies
-- Start the FastAPI server on port `11235`
-- Set up health checks and auto-deployment
+- 📚 [Deprecated Docker Setup](./docs/deprecated/docker-deployment.md) - Instructions for the current Docker implementation
+- ⚠️ Note: This setup will be replaced in the next major release
-
+### What's Coming Next?
-
-🐳 Using Docker
+Our new Docker implementation will bring:
+- Improved performance and resource efficiency
+- Streamlined deployment process
+- Better integration with Crawl4AI features
+- Enhanced scalability options
-Crawl4AI is available as Docker images for easy deployment. You can either pull directly from Docker Hub (recommended) or build from the repository.
-
----
-
-
-🐳 Option 1: Docker Hub (Recommended)
-
-Choose the appropriate image based on your platform and needs:
-
-### For AMD64 (Regular Linux/Windows):
-```bash
-# Basic version (recommended)
-docker pull unclecode/crawl4ai:basic-amd64
-docker run -p 11235:11235 unclecode/crawl4ai:basic-amd64
-
-# Full ML/LLM support
-docker pull unclecode/crawl4ai:all-amd64
-docker run -p 11235:11235 unclecode/crawl4ai:all-amd64
-
-# With GPU support
-docker pull unclecode/crawl4ai:gpu-amd64
-docker run -p 11235:11235 unclecode/crawl4ai:gpu-amd64
-```
-
-### For ARM64 (M1/M2 Macs, ARM servers):
-```bash
-# Basic version (recommended)
-docker pull unclecode/crawl4ai:basic-arm64
-docker run -p 11235:11235 unclecode/crawl4ai:basic-arm64
-
-# Full ML/LLM support
-docker pull unclecode/crawl4ai:all-arm64
-docker run -p 11235:11235 unclecode/crawl4ai:all-arm64
-
-# With GPU support
-docker pull unclecode/crawl4ai:gpu-arm64
-docker run -p 11235:11235 unclecode/crawl4ai:gpu-arm64
-```
-
-Need more memory? Add `--shm-size`:
-```bash
-docker run --shm-size=2gb -p 11235:11235 unclecode/crawl4ai:basic-amd64
-```
-
-Test the installation:
-```bash
-curl http://localhost:11235/health
-```
-
-### For Raspberry Pi (32-bit) (coming soon):
-```bash
-# Pull and run basic version (recommended for Raspberry Pi)
-docker pull unclecode/crawl4ai:basic-armv7
-docker run -p 11235:11235 unclecode/crawl4ai:basic-armv7
-
-# With increased shared memory if needed
-docker run --shm-size=2gb -p 11235:11235 unclecode/crawl4ai:basic-armv7
-```
-
-Note: Due to hardware constraints, only the basic version is recommended for Raspberry Pi.
-
-
-
-
-🐳 Option 2: Build from Repository
-
-Build the image locally based on your platform:
-
-```bash
-# Clone the repository
-git clone https://github.com/unclecode/crawl4ai.git
-cd crawl4ai
-
-# For AMD64 (Regular Linux/Windows)
-docker build --platform linux/amd64 \
- --tag crawl4ai:local \
- --build-arg INSTALL_TYPE=basic \
- .
-
-# For ARM64 (M1/M2 Macs, ARM servers)
-docker build --platform linux/arm64 \
- --tag crawl4ai:local \
- --build-arg INSTALL_TYPE=basic \
- .
-```
-
-Build options:
-- INSTALL_TYPE=basic (default): Basic crawling features
-- INSTALL_TYPE=all: Full ML/LLM support
-- ENABLE_GPU=true: Add GPU support
-
-Example with all options:
-```bash
-docker build --platform linux/amd64 \
- --tag crawl4ai:local \
- --build-arg INSTALL_TYPE=all \
- --build-arg ENABLE_GPU=true \
- .
-```
-
-Run your local build:
-```bash
-# Regular run
-docker run -p 11235:11235 crawl4ai:local
-
-# With increased shared memory
-docker run --shm-size=2gb -p 11235:11235 crawl4ai:local
-```
-
-Test the installation:
-```bash
-curl http://localhost:11235/health
-```
-
-
-
-
-🐳 Option 3: Using Docker Compose
-
-Docker Compose provides a more structured way to run Crawl4AI, especially when dealing with environment variables and multiple configurations.
-
-```bash
-# Clone the repository
-git clone https://github.com/unclecode/crawl4ai.git
-cd crawl4ai
-```
-
-### For AMD64 (Regular Linux/Windows):
-```bash
-# Build and run locally
-docker-compose --profile local-amd64 up
-
-# Run from Docker Hub
-VERSION=basic docker-compose --profile hub-amd64 up # Basic version
-VERSION=all docker-compose --profile hub-amd64 up # Full ML/LLM support
-VERSION=gpu docker-compose --profile hub-amd64 up # GPU support
-```
-
-### For ARM64 (M1/M2 Macs, ARM servers):
-```bash
-# Build and run locally
-docker-compose --profile local-arm64 up
-
-# Run from Docker Hub
-VERSION=basic docker-compose --profile hub-arm64 up # Basic version
-VERSION=all docker-compose --profile hub-arm64 up # Full ML/LLM support
-VERSION=gpu docker-compose --profile hub-arm64 up # GPU support
-```
-
-Environment variables (optional):
-```bash
-# Create a .env file
-CRAWL4AI_API_TOKEN=your_token
-OPENAI_API_KEY=your_openai_key
-CLAUDE_API_KEY=your_claude_key
-```
-
-The compose file includes:
-- Memory management (4GB limit, 1GB reserved)
-- Shared memory volume for browser support
-- Health checks
-- Auto-restart policy
-- All necessary port mappings
-
-Test the installation:
-```bash
-curl http://localhost:11235/health
-```
+Stay connected with our [GitHub repository](https://github.com/unclecode/crawl4ai) for updates!
@@ -424,24 +274,29 @@ You can check the project structure in the directory [https://github.com/uncleco
```python
import asyncio
-from crawl4ai import AsyncWebCrawler, CacheMode
+from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from crawl4ai.content_filter_strategy import PruningContentFilter, BM25ContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
async def main():
- async with AsyncWebCrawler(
+ browser_config = BrowserConfig(
headless=True,
verbose=True,
- ) as crawler:
+ )
+ run_config = CrawlerRunConfig(
+ cache_mode=CacheMode.ENABLED,
+ markdown_generator=DefaultMarkdownGenerator(
+ content_filter=PruningContentFilter(threshold=0.48, threshold_type="fixed", min_word_threshold=0)
+ ),
+ # markdown_generator=DefaultMarkdownGenerator(
+ # content_filter=BM25ContentFilter(user_query="WHEN_WE_FOCUS_BASED_ON_A_USER_QUERY", bm25_threshold=1.0)
+ # ),
+ )
+
+ async with AsyncWebCrawler(config=browser_config) as crawler:
result = await crawler.arun(
url="https://docs.micronaut.io/4.7.6/guide/",
- cache_mode=CacheMode.ENABLED,
- markdown_generator=DefaultMarkdownGenerator(
- content_filter=PruningContentFilter(threshold=0.48, threshold_type="fixed", min_word_threshold=0)
- ),
- # markdown_generator=DefaultMarkdownGenerator(
- # content_filter=BM25ContentFilter(user_query="WHEN_WE_FOCUS_BASED_ON_A_USER_QUERY", bm25_threshold=1.0)
- # ),
+ config=run_config
)
print(len(result.markdown))
print(len(result.fit_markdown))
@@ -458,7 +313,7 @@ if __name__ == "__main__":
```python
import asyncio
-from crawl4ai import AsyncWebCrawler, CacheMode
+from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
import json
@@ -493,36 +348,26 @@ async def main():
"type": "attribute",
"attribute": "src"
}
- ]
+ }
}
extraction_strategy = JsonCssExtractionStrategy(schema, verbose=True)
- async with AsyncWebCrawler(
+ browser_config = BrowserConfig(
headless=False,
verbose=True
- ) as crawler:
+ )
+ run_config = CrawlerRunConfig(
+ extraction_strategy=extraction_strategy,
+ js_code=["""(async () => {const tabs = document.querySelectorAll("section.charge-methodology .tabs-menu-3 > div");for(let tab of tabs) {tab.scrollIntoView();tab.click();await new Promise(r => setTimeout(r, 500));}})();"""],
+ cache_mode=CacheMode.BYPASS
+ )
+
+ async with AsyncWebCrawler(config=browser_config) as crawler:
- # Create the JavaScript that handles clicking multiple times
- js_click_tabs = """
- (async () => {
- const tabs = document.querySelectorAll("section.charge-methodology .tabs-menu-3 > div");
-
- for(let tab of tabs) {
- // scroll to the tab
- tab.scrollIntoView();
- tab.click();
- // Wait for content to load and animations to complete
- await new Promise(r => setTimeout(r, 500));
- }
- })();
- """
-
result = await crawler.arun(
url="https://www.kidocode.com/degrees/technology",
- extraction_strategy=JsonCssExtractionStrategy(schema, verbose=True),
- js_code=[js_click_tabs],
- cache_mode=CacheMode.BYPASS
+ config=run_config
)
companies = json.loads(result.extracted_content)
@@ -542,7 +387,7 @@ if __name__ == "__main__":
```python
import os
import asyncio
-from crawl4ai import AsyncWebCrawler, CacheMode
+from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from pydantic import BaseModel, Field
@@ -552,21 +397,26 @@ class OpenAIModelFee(BaseModel):
output_fee: str = Field(..., description="Fee for output token for the OpenAI model.")
async def main():
- async with AsyncWebCrawler(verbose=True) as crawler:
+ browser_config = BrowserConfig(verbose=True)
+ run_config = CrawlerRunConfig(
+ word_count_threshold=1,
+ extraction_strategy=LLMExtractionStrategy(
+ # Here you can use any provider that Litellm library supports, for instance: ollama/qwen2
+ # provider="ollama/qwen2", api_token="no-token",
+ provider="openai/gpt-4o", api_token=os.getenv('OPENAI_API_KEY'),
+ schema=OpenAIModelFee.schema(),
+ extraction_type="schema",
+ instruction="""From the crawled content, extract all mentioned model names along with their fees for input and output tokens.
+ Do not miss any models in the entire content. One extracted model JSON format should look like this:
+ {"model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", "output_fee": "US$30.00 / 1M tokens"}."""
+ ),
+ cache_mode=CacheMode.BYPASS,
+ )
+
+ async with AsyncWebCrawler(config=browser_config) as crawler:
result = await crawler.arun(
url='https://openai.com/api/pricing/',
- word_count_threshold=1,
- extraction_strategy=LLMExtractionStrategy(
- # Here you can use any provider that Litellm library supports, for instance: ollama/qwen2
- # provider="ollama/qwen2", api_token="no-token",
- provider="openai/gpt-4o", api_token=os.getenv('OPENAI_API_KEY'),
- schema=OpenAIModelFee.schema(),
- extraction_type="schema",
- instruction="""From the crawled content, extract all mentioned model names along with their fees for input and output tokens.
- Do not miss any models in the entire content. One extracted model JSON format should look like this:
- {"model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", "output_fee": "US$30.00 / 1M tokens"}."""
- ),
- cache_mode=CacheMode.BYPASS,
+ config=run_config
)
print(result.extracted_content)
@@ -583,37 +433,29 @@ if __name__ == "__main__":
import os, sys
from pathlib import Path
import asyncio, time
-from crawl4ai import AsyncWebCrawler
+from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
async def test_news_crawl():
# Create a persistent user data directory
user_data_dir = os.path.join(Path.home(), ".crawl4ai", "browser_profile")
os.makedirs(user_data_dir, exist_ok=True)
- async with AsyncWebCrawler(
+ browser_config = BrowserConfig(
verbose=True,
headless=True,
user_data_dir=user_data_dir,
use_persistent_context=True,
- headers={
- "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
- "Accept-Language": "en-US,en;q=0.5",
- "Accept-Encoding": "gzip, deflate, br",
- "DNT": "1",
- "Connection": "keep-alive",
- "Upgrade-Insecure-Requests": "1",
- "Sec-Fetch-Dest": "document",
- "Sec-Fetch-Mode": "navigate",
- "Sec-Fetch-Site": "none",
- "Sec-Fetch-User": "?1",
- "Cache-Control": "max-age=0",
- }
- ) as crawler:
+ )
+ run_config = CrawlerRunConfig(
+ cache_mode=CacheMode.BYPASS
+ )
+
+ async with AsyncWebCrawler(config=browser_config) as crawler:
url = "ADDRESS_OF_A_CHALLENGING_WEBSITE"
result = await crawler.arun(
url,
- cache_mode=CacheMode.BYPASS,
+ config=run_config,
magic=True,
)
@@ -705,9 +547,6 @@ We envision a future where AI is powered by real human knowledge, ensuring data
For more details, see our [full mission statement](./MISSION.md).
-
-
-
## Star History
[](https://star-history.com/#unclecode/crawl4ai&Date)
diff --git a/docs/deprecated/docker-deployment.md b/docs/deprecated/docker-deployment.md
new file mode 100644
index 00000000..db8446e3
--- /dev/null
+++ b/docs/deprecated/docker-deployment.md
@@ -0,0 +1,189 @@
+# 🐳 Using Docker (Legacy)
+
+Crawl4AI is available as Docker images for easy deployment. You can either pull directly from Docker Hub (recommended) or build from the repository.
+
+---
+
+
+🐳 Option 1: Docker Hub (Recommended)
+
+Choose the appropriate image based on your platform and needs:
+
+### For AMD64 (Regular Linux/Windows):
+```bash
+# Basic version (recommended)
+docker pull unclecode/crawl4ai:basic-amd64
+docker run -p 11235:11235 unclecode/crawl4ai:basic-amd64
+
+# Full ML/LLM support
+docker pull unclecode/crawl4ai:all-amd64
+docker run -p 11235:11235 unclecode/crawl4ai:all-amd64
+
+# With GPU support
+docker pull unclecode/crawl4ai:gpu-amd64
+docker run -p 11235:11235 unclecode/crawl4ai:gpu-amd64
+```
+
+### For ARM64 (M1/M2 Macs, ARM servers):
+```bash
+# Basic version (recommended)
+docker pull unclecode/crawl4ai:basic-arm64
+docker run -p 11235:11235 unclecode/crawl4ai:basic-arm64
+
+# Full ML/LLM support
+docker pull unclecode/crawl4ai:all-arm64
+docker run -p 11235:11235 unclecode/crawl4ai:all-arm64
+
+# With GPU support
+docker pull unclecode/crawl4ai:gpu-arm64
+docker run -p 11235:11235 unclecode/crawl4ai:gpu-arm64
+```
+
+Need more memory? Add `--shm-size`:
+```bash
+docker run --shm-size=2gb -p 11235:11235 unclecode/crawl4ai:basic-amd64
+```
+
+Test the installation:
+```bash
+curl http://localhost:11235/health
+```
+
+### For Raspberry Pi (32-bit) (coming soon):
+```bash
+# Pull and run basic version (recommended for Raspberry Pi)
+docker pull unclecode/crawl4ai:basic-armv7
+docker run -p 11235:11235 unclecode/crawl4ai:basic-armv7
+
+# With increased shared memory if needed
+docker run --shm-size=2gb -p 11235:11235 unclecode/crawl4ai:basic-armv7
+```
+
+Note: Due to hardware constraints, only the basic version is recommended for Raspberry Pi.
+
+
+
+
+🐳 Option 2: Build from Repository
+
+Build the image locally based on your platform:
+
+```bash
+# Clone the repository
+git clone https://github.com/unclecode/crawl4ai.git
+cd crawl4ai
+
+# For AMD64 (Regular Linux/Windows)
+docker build --platform linux/amd64 \
+ --tag crawl4ai:local \
+ --build-arg INSTALL_TYPE=basic \
+ .
+
+# For ARM64 (M1/M2 Macs, ARM servers)
+docker build --platform linux/arm64 \
+ --tag crawl4ai:local \
+ --build-arg INSTALL_TYPE=basic \
+ .
+```
+
+Build options:
+- INSTALL_TYPE=basic (default): Basic crawling features
+- INSTALL_TYPE=all: Full ML/LLM support
+- ENABLE_GPU=true: Add GPU support
+
+Example with all options:
+```bash
+docker build --platform linux/amd64 \
+ --tag crawl4ai:local \
+ --build-arg INSTALL_TYPE=all \
+ --build-arg ENABLE_GPU=true \
+ .
+```
+
+Run your local build:
+```bash
+# Regular run
+docker run -p 11235:11235 crawl4ai:local
+
+# With increased shared memory
+docker run --shm-size=2gb -p 11235:11235 crawl4ai:local
+```
+
+Test the installation:
+```bash
+curl http://localhost:11235/health
+```
+
+
+
+
+🐳 Option 3: Using Docker Compose
+
+Docker Compose provides a more structured way to run Crawl4AI, especially when dealing with environment variables and multiple configurations.
+
+```bash
+# Clone the repository
+git clone https://github.com/unclecode/crawl4ai.git
+cd crawl4ai
+```
+
+### For AMD64 (Regular Linux/Windows):
+```bash
+# Build and run locally
+docker-compose --profile local-amd64 up
+
+# Run from Docker Hub
+VERSION=basic docker-compose --profile hub-amd64 up # Basic version
+VERSION=all docker-compose --profile hub-amd64 up # Full ML/LLM support
+VERSION=gpu docker-compose --profile hub-amd64 up # GPU support
+```
+
+### For ARM64 (M1/M2 Macs, ARM servers):
+```bash
+# Build and run locally
+docker-compose --profile local-arm64 up
+
+# Run from Docker Hub
+VERSION=basic docker-compose --profile hub-arm64 up # Basic version
+VERSION=all docker-compose --profile hub-arm64 up # Full ML/LLM support
+VERSION=gpu docker-compose --profile hub-arm64 up # GPU support
+```
+
+Environment variables (optional):
+```bash
+# Create a .env file
+CRAWL4AI_API_TOKEN=your_token
+OPENAI_API_KEY=your_openai_key
+CLAUDE_API_KEY=your_claude_key
+```
+
+The compose file includes:
+- Memory management (4GB limit, 1GB reserved)
+- Shared memory volume for browser support
+- Health checks
+- Auto-restart policy
+- All necessary port mappings
+
+Test the installation:
+```bash
+curl http://localhost:11235/health
+```
+
+
+
+
+🚀 One-Click Deployment
+
+Deploy your own instance of Crawl4AI with one click:
+
+[](https://www.digitalocean.com/?repo=https://github.com/unclecode/crawl4ai/tree/0.3.74&refcode=a0780f1bdb3d&utm_campaign=Referral_Invite&utm_medium=Referral_Program&utm_source=badge)
+
+> 💡 **Recommended specs**: 4GB RAM minimum. Select "professional-xs" or higher when deploying for stable operation.
+
+The deploy will:
+- Set up a Docker container with Crawl4AI
+- Configure Playwright and all dependencies
+- Start the FastAPI server on port `11235`
+- Set up health checks and auto-deployment
+
+