docs: update README badges and Docker section, reorganize documentation structure
 README.md | 347 lines changed
@@ -1,13 +1,22 @@
 # 🚀🤖 Crawl4AI: Open-source LLM Friendly Web Crawler & Scraper.
 
+<div align="center">
+
 <a href="https://trendshift.io/repositories/11716" target="_blank"><img src="https://trendshift.io/api/badge/repositories/11716" alt="unclecode%2Fcrawl4ai | Trendshift" style="width: 250px; height: 55px;" width="250" height="55"/></a>
 
 [](https://github.com/unclecode/crawl4ai/stargazers)
 [](https://github.com/unclecode/crawl4ai/network/members)
-[](https://github.com/unclecode/crawl4ai/issues)
-[](https://github.com/unclecode/crawl4ai/pulls)
+[](https://badge.fury.io/py/crawl4ai)
+[](https://pypi.org/project/crawl4ai/)
+[](https://pepy.tech/project/crawl4ai)
+
+[](https://crawl4ai.readthedocs.io/)
 [](https://github.com/unclecode/crawl4ai/blob/main/LICENSE)
+[](https://github.com/psf/black)
+[](https://github.com/PyCQA/bandit)
+
+</div>
 
 Crawl4AI is the #1 trending GitHub repository, actively maintained by a vibrant community. It delivers blazing-fast, AI-ready web crawling tailored for LLMs, AI agents, and data pipelines. Open source, flexible, and built for real-time performance, Crawl4AI empowers developers with unmatched speed, precision, and deployment ease.
 
@@ -28,20 +37,28 @@ Crawl4AI is the #1 trending GitHub repository, actively maintained by a vibrant
 
 1. Install Crawl4AI:
 ```bash
+# Install the package
 pip install crawl4ai
-crawl4ai-setup # Setup the browser
+crawl4ai-setup
+
+# Install Playwright with system dependencies (recommended)
+playwright install --with-deps
+
+# Or install specific browsers:
+playwright install --with-deps chrome # Recommended for Colab/Linux
 ```
 
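The install steps above pull in the `crawl4ai` package, the `crawl4ai-setup` CLI, and Playwright. A small stdlib sketch (not part of the diff or the package) that checks which of those pieces are present on the local machine:

```python
import importlib.util
import shutil

def install_status():
    """Report which pieces of the Crawl4AI quick-start are present locally."""
    return {
        "crawl4ai module": importlib.util.find_spec("crawl4ai") is not None,
        "playwright module": importlib.util.find_spec("playwright") is not None,
        "crawl4ai-setup CLI": shutil.which("crawl4ai-setup") is not None,
    }

for name, present in install_status().items():
    print(f"{name}: {'ok' if present else 'missing'}")
```

Running it after `pip install crawl4ai` makes it easy to spot a missing `playwright install` step before the first crawl fails.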
 2. Run a simple web crawl:
 ```python
 import asyncio
-from crawl4ai import AsyncWebCrawler, CacheMode
+from crawl4ai import *
 
 async def main():
-    async with AsyncWebCrawler(verbose=True) as crawler:
-        result = await crawler.arun(url="https://www.nbcnews.com/business")
-        # Soone will be change to result.markdown
-        print(result.markdown_v2.raw_markdown)
+    async with AsyncWebCrawler() as crawler:
+        result = await crawler.arun(
+            url="https://www.nbcnews.com/business",
+        )
+        print(result.markdown)
 
 if __name__ == "__main__":
     asyncio.run(main())
```
@@ -200,193 +217,26 @@ pip install -e ".[all]" # Install all optional features
 </details>
 
 <details>
-<summary>🚀 <strong>One-Click Deployment</strong></summary>
+<summary>🐳 <strong>Docker Deployment</strong></summary>
 
-Deploy your own instance of Crawl4AI with one click:
+> 🚀 **Major Changes Coming!** We're developing a completely new Docker implementation that will make deployment even more efficient and seamless. The current Docker setup is being deprecated in favor of this new solution.
 
-[](https://www.digitalocean.com/?repo=https://github.com/unclecode/crawl4ai/tree/0.3.74&refcode=a0780f1bdb3d&utm_campaign=Referral_Invite&utm_medium=Referral_Program&utm_source=badge)
+### Current Docker Support
 
-> 💡 **Recommended specs**: 4GB RAM minimum. Select "professional-xs" or higher when deploying for stable operation.
+The existing Docker implementation is being deprecated and will be replaced soon. If you still need to use Docker with the current version:
 
-The deploy will:
-- Set up a Docker container with Crawl4AI
-- Configure Playwright and all dependencies
-- Start the FastAPI server on port `11235`
-- Set up health checks and auto-deployment
+- 📚 [Deprecated Docker Setup](./docs/deprecated/docker-deployment.md) - Instructions for the current Docker implementation
+- ⚠️ Note: This setup will be replaced in the next major release
 
-</details>
+### What's Coming Next?
 
-<details>
-<summary>🐳 <strong>Using Docker</strong></summary>
+Our new Docker implementation will bring:
+- Improved performance and resource efficiency
+- Streamlined deployment process
+- Better integration with Crawl4AI features
+- Enhanced scalability options
 
-Crawl4AI is available as Docker images for easy deployment. You can either pull directly from Docker Hub (recommended) or build from the repository.
+Stay connected with our [GitHub repository](https://github.com/unclecode/crawl4ai) for updates!
 
----
-
-<details>
-<summary>🐳 <strong>Option 1: Docker Hub (Recommended)</strong></summary>
-
-Choose the appropriate image based on your platform and needs:
-
-### For AMD64 (Regular Linux/Windows):
-```bash
-# Basic version (recommended)
-docker pull unclecode/crawl4ai:basic-amd64
-docker run -p 11235:11235 unclecode/crawl4ai:basic-amd64
-
-# Full ML/LLM support
-docker pull unclecode/crawl4ai:all-amd64
-docker run -p 11235:11235 unclecode/crawl4ai:all-amd64
-
-# With GPU support
-docker pull unclecode/crawl4ai:gpu-amd64
-docker run -p 11235:11235 unclecode/crawl4ai:gpu-amd64
-```
-
-### For ARM64 (M1/M2 Macs, ARM servers):
-```bash
-# Basic version (recommended)
-docker pull unclecode/crawl4ai:basic-arm64
-docker run -p 11235:11235 unclecode/crawl4ai:basic-arm64
-
-# Full ML/LLM support
-docker pull unclecode/crawl4ai:all-arm64
-docker run -p 11235:11235 unclecode/crawl4ai:all-arm64
-
-# With GPU support
-docker pull unclecode/crawl4ai:gpu-arm64
-docker run -p 11235:11235 unclecode/crawl4ai:gpu-arm64
-```
-
-Need more memory? Add `--shm-size`:
-```bash
-docker run --shm-size=2gb -p 11235:11235 unclecode/crawl4ai:basic-amd64
-```
-
-Test the installation:
-```bash
-curl http://localhost:11235/health
-```
-
-### For Raspberry Pi (32-bit) (coming soon):
-```bash
-# Pull and run basic version (recommended for Raspberry Pi)
-docker pull unclecode/crawl4ai:basic-armv7
-docker run -p 11235:11235 unclecode/crawl4ai:basic-armv7
-
-# With increased shared memory if needed
-docker run --shm-size=2gb -p 11235:11235 unclecode/crawl4ai:basic-armv7
-```
-
-Note: Due to hardware constraints, only the basic version is recommended for Raspberry Pi.
-
-</details>
-
-<details>
-<summary>🐳 <strong>Option 2: Build from Repository</strong></summary>
-
-Build the image locally based on your platform:
-
-```bash
-# Clone the repository
-git clone https://github.com/unclecode/crawl4ai.git
-cd crawl4ai
-
-# For AMD64 (Regular Linux/Windows)
-docker build --platform linux/amd64 \
-  --tag crawl4ai:local \
-  --build-arg INSTALL_TYPE=basic \
-  .
-
-# For ARM64 (M1/M2 Macs, ARM servers)
-docker build --platform linux/arm64 \
-  --tag crawl4ai:local \
-  --build-arg INSTALL_TYPE=basic \
-  .
-```
-
-Build options:
-- INSTALL_TYPE=basic (default): Basic crawling features
-- INSTALL_TYPE=all: Full ML/LLM support
-- ENABLE_GPU=true: Add GPU support
-
-Example with all options:
-```bash
-docker build --platform linux/amd64 \
-  --tag crawl4ai:local \
-  --build-arg INSTALL_TYPE=all \
-  --build-arg ENABLE_GPU=true \
-  .
-```
-
-Run your local build:
-```bash
-# Regular run
-docker run -p 11235:11235 crawl4ai:local
-
-# With increased shared memory
-docker run --shm-size=2gb -p 11235:11235 crawl4ai:local
-```
-
-Test the installation:
-```bash
-curl http://localhost:11235/health
-```
-
-</details>
-
-<details>
-<summary>🐳 <strong>Option 3: Using Docker Compose</strong></summary>
-
-Docker Compose provides a more structured way to run Crawl4AI, especially when dealing with environment variables and multiple configurations.
-
-```bash
-# Clone the repository
-git clone https://github.com/unclecode/crawl4ai.git
-cd crawl4ai
-```
-
-### For AMD64 (Regular Linux/Windows):
-```bash
-# Build and run locally
-docker-compose --profile local-amd64 up
-
-# Run from Docker Hub
-VERSION=basic docker-compose --profile hub-amd64 up  # Basic version
-VERSION=all docker-compose --profile hub-amd64 up    # Full ML/LLM support
-VERSION=gpu docker-compose --profile hub-amd64 up    # GPU support
-```
-
-### For ARM64 (M1/M2 Macs, ARM servers):
-```bash
-# Build and run locally
-docker-compose --profile local-arm64 up
-
-# Run from Docker Hub
-VERSION=basic docker-compose --profile hub-arm64 up  # Basic version
-VERSION=all docker-compose --profile hub-arm64 up    # Full ML/LLM support
-VERSION=gpu docker-compose --profile hub-arm64 up    # GPU support
-```
-
-Environment variables (optional):
-```bash
-# Create a .env file
-CRAWL4AI_API_TOKEN=your_token
-OPENAI_API_KEY=your_openai_key
-CLAUDE_API_KEY=your_claude_key
-```
-
-The compose file includes:
-- Memory management (4GB limit, 1GB reserved)
-- Shared memory volume for browser support
-- Health checks
-- Auto-restart policy
-- All necessary port mappings
-
-Test the installation:
-```bash
-curl http://localhost:11235/health
-```
-
-</details>
 </details>
 
@@ -424,24 +274,29 @@ You can check the project structure in the directory [https://github.com/uncleco
 
 ```python
 import asyncio
-from crawl4ai import AsyncWebCrawler, CacheMode
+from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
 from crawl4ai.content_filter_strategy import PruningContentFilter, BM25ContentFilter
 from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
 
 async def main():
-    async with AsyncWebCrawler(
+    browser_config = BrowserConfig(
         headless=True,
         verbose=True,
-    ) as crawler:
+    )
+    run_config = CrawlerRunConfig(
+        cache_mode=CacheMode.ENABLED,
+        markdown_generator=DefaultMarkdownGenerator(
+            content_filter=PruningContentFilter(threshold=0.48, threshold_type="fixed", min_word_threshold=0)
+        ),
+        # markdown_generator=DefaultMarkdownGenerator(
+        #     content_filter=BM25ContentFilter(user_query="WHEN_WE_FOCUS_BASED_ON_A_USER_QUERY", bm25_threshold=1.0)
+        # ),
+    )
+
+    async with AsyncWebCrawler(config=browser_config) as crawler:
         result = await crawler.arun(
             url="https://docs.micronaut.io/4.7.6/guide/",
-            cache_mode=CacheMode.ENABLED,
-            markdown_generator=DefaultMarkdownGenerator(
-                content_filter=PruningContentFilter(threshold=0.48, threshold_type="fixed", min_word_threshold=0)
-            ),
-            # markdown_generator=DefaultMarkdownGenerator(
-            #     content_filter=BM25ContentFilter(user_query="WHEN_WE_FOCUS_BASED_ON_A_USER_QUERY", bm25_threshold=1.0)
-            # ),
+            config=run_config
         )
         print(len(result.markdown))
         print(len(result.fit_markdown))
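The `PruningContentFilter(threshold=0.48, threshold_type="fixed", min_word_threshold=0)` line in the hunk above configures fixed-threshold pruning. As a hypothetical illustration of that idea (not the library's actual scoring logic), a fixed threshold applied to pre-scored text blocks could look like:

```python
def prune_blocks(scored_blocks, threshold=0.48, min_word_threshold=0):
    # Keep blocks whose relevance score clears the fixed threshold and whose
    # word count meets the minimum; everything else is pruned.
    return [
        text
        for text, score in scored_blocks
        if score >= threshold and len(text.split()) >= min_word_threshold
    ]

blocks = [
    ("The Micronaut guide explains dependency injection in depth.", 0.91),
    ("Accept cookies", 0.05),
]
print(prune_blocks(blocks))  # only the high-scoring block survives
```

With a "fixed" threshold type, the cutoff is constant across the page; the filter decides per block, which is why `result.fit_markdown` in the example above comes out shorter than `result.markdown`.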
@@ -458,7 +313,7 @@ if __name__ == "__main__":
 
 ```python
 import asyncio
-from crawl4ai import AsyncWebCrawler, CacheMode
+from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
 from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
 import json
 
@@ -493,36 +348,26 @@ async def main():
                     "type": "attribute",
                     "attribute": "src"
                 }
-            ]
+            }
         }
 
     extraction_strategy = JsonCssExtractionStrategy(schema, verbose=True)
 
-    async with AsyncWebCrawler(
+    browser_config = BrowserConfig(
         headless=False,
         verbose=True
-    ) as crawler:
+    )
+    run_config = CrawlerRunConfig(
+        extraction_strategy=extraction_strategy,
+        js_code=["""(async () => {const tabs = document.querySelectorAll("section.charge-methodology .tabs-menu-3 > div");for(let tab of tabs) {tab.scrollIntoView();tab.click();await new Promise(r => setTimeout(r, 500));}})();"""],
+        cache_mode=CacheMode.BYPASS
+    )
 
-        # Create the JavaScript that handles clicking multiple times
-        js_click_tabs = """
-        (async () => {
-            const tabs = document.querySelectorAll("section.charge-methodology .tabs-menu-3 > div");
-
-            for(let tab of tabs) {
-                // scroll to the tab
-                tab.scrollIntoView();
-                tab.click();
-                // Wait for content to load and animations to complete
-                await new Promise(r => setTimeout(r, 500));
-            }
-        })();
-        """
-
+    async with AsyncWebCrawler(config=browser_config) as crawler:
         result = await crawler.arun(
             url="https://www.kidocode.com/degrees/technology",
-            extraction_strategy=JsonCssExtractionStrategy(schema, verbose=True),
-            js_code=[js_click_tabs],
-            cache_mode=CacheMode.BYPASS
+            config=run_config
         )
 
         companies = json.loads(result.extracted_content)
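As the last context line shows, `result.extracted_content` is a JSON string; with a `JsonCssExtractionStrategy` schema it decodes to a list of records, one per matched element. A stand-alone sketch with made-up sample data (the field names here are illustrative, not taken from the schema above):

```python
import json

# Simulated extractor output: a JSON array with one record per matched element.
sample_extracted = json.dumps([
    {"name": "Example Academy", "logo": "https://example.com/logo.png"},
    {"name": "Demo School", "logo": "https://example.com/demo.png"},
])

companies = json.loads(sample_extracted)
print(len(companies), companies[0]["name"])
```

Because the output is plain JSON, it can be handed straight to downstream tooling without any Crawl4AI-specific parsing.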
@@ -542,7 +387,7 @@ if __name__ == "__main__":
 ```python
 import os
 import asyncio
-from crawl4ai import AsyncWebCrawler, CacheMode
+from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
 from crawl4ai.extraction_strategy import LLMExtractionStrategy
 from pydantic import BaseModel, Field
 
@@ -552,21 +397,26 @@ class OpenAIModelFee(BaseModel):
     output_fee: str = Field(..., description="Fee for output token for the OpenAI model.")
 
 async def main():
-    async with AsyncWebCrawler(verbose=True) as crawler:
+    browser_config = BrowserConfig(verbose=True)
+    run_config = CrawlerRunConfig(
+        word_count_threshold=1,
+        extraction_strategy=LLMExtractionStrategy(
+            # Here you can use any provider that Litellm library supports, for instance: ollama/qwen2
+            # provider="ollama/qwen2", api_token="no-token",
+            provider="openai/gpt-4o", api_token=os.getenv('OPENAI_API_KEY'),
+            schema=OpenAIModelFee.schema(),
+            extraction_type="schema",
+            instruction="""From the crawled content, extract all mentioned model names along with their fees for input and output tokens.
+            Do not miss any models in the entire content. One extracted model JSON format should look like this:
+            {"model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", "output_fee": "US$30.00 / 1M tokens"}."""
+        ),
+        cache_mode=CacheMode.BYPASS,
+    )
+
+    async with AsyncWebCrawler(config=browser_config) as crawler:
         result = await crawler.arun(
             url='https://openai.com/api/pricing/',
-            word_count_threshold=1,
-            extraction_strategy=LLMExtractionStrategy(
-                # Here you can use any provider that Litellm library supports, for instance: ollama/qwen2
-                # provider="ollama/qwen2", api_token="no-token",
-                provider="openai/gpt-4o", api_token=os.getenv('OPENAI_API_KEY'),
-                schema=OpenAIModelFee.schema(),
-                extraction_type="schema",
-                instruction="""From the crawled content, extract all mentioned model names along with their fees for input and output tokens.
-                Do not miss any models in the entire content. One extracted model JSON format should look like this:
-                {"model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", "output_fee": "US$30.00 / 1M tokens"}."""
-            ),
-            cache_mode=CacheMode.BYPASS,
+            config=run_config
         )
         print(result.extracted_content)
 
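The `instruction` string above pins down the record shape the LLM should return. A quick stdlib check that a returned item matches the `OpenAIModelFee` fields, using the sample values copied from the instruction itself:

```python
import json

# Sample record taken verbatim from the instruction string above.
item = json.loads(
    '{"model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", '
    '"output_fee": "US$30.00 / 1M tokens"}'
)

# The three keys mirror the fields of the OpenAIModelFee Pydantic model.
expected_fields = {"model_name", "input_fee", "output_fee"}
assert set(item) == expected_fields
print(item["model_name"])  # GPT-4
```

Validating each decoded item this way catches schema drift in LLM output before it propagates into a data pipeline.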
@@ -583,37 +433,29 @@ if __name__ == "__main__":
 import os, sys
 from pathlib import Path
 import asyncio, time
-from crawl4ai import AsyncWebCrawler
+from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
 
 async def test_news_crawl():
     # Create a persistent user data directory
     user_data_dir = os.path.join(Path.home(), ".crawl4ai", "browser_profile")
     os.makedirs(user_data_dir, exist_ok=True)
 
-    async with AsyncWebCrawler(
+    browser_config = BrowserConfig(
         verbose=True,
         headless=True,
         user_data_dir=user_data_dir,
         use_persistent_context=True,
-        headers={
-            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
-            "Accept-Language": "en-US,en;q=0.5",
-            "Accept-Encoding": "gzip, deflate, br",
-            "DNT": "1",
-            "Connection": "keep-alive",
-            "Upgrade-Insecure-Requests": "1",
-            "Sec-Fetch-Dest": "document",
-            "Sec-Fetch-Mode": "navigate",
-            "Sec-Fetch-Site": "none",
-            "Sec-Fetch-User": "?1",
-            "Cache-Control": "max-age=0",
-        }
-    ) as crawler:
+    )
+    run_config = CrawlerRunConfig(
+        cache_mode=CacheMode.BYPASS
+    )
+
+    async with AsyncWebCrawler(config=browser_config) as crawler:
         url = "ADDRESS_OF_A_CHALLENGING_WEBSITE"
 
         result = await crawler.arun(
             url,
-            cache_mode=CacheMode.BYPASS,
+            config=run_config,
             magic=True,
         )
 
@@ -705,9 +547,6 @@ We envision a future where AI is powered by real human knowledge, ensuring data
 For more details, see our [full mission statement](./MISSION.md).
 </details>
 
-
-
-
 ## Star History
 
 [](https://star-history.com/#unclecode/crawl4ai&Date)
 docs/deprecated/docker-deployment.md | 189 lines (new file)
@@ -0,0 +1,189 @@
+# 🐳 Using Docker (Legacy)
+
+Crawl4AI is available as Docker images for easy deployment. You can either pull directly from Docker Hub (recommended) or build from the repository.
+
+---
+
+<details>
+<summary>🐳 <strong>Option 1: Docker Hub (Recommended)</strong></summary>
+
+Choose the appropriate image based on your platform and needs:
+
+### For AMD64 (Regular Linux/Windows):
+```bash
+# Basic version (recommended)
+docker pull unclecode/crawl4ai:basic-amd64
+docker run -p 11235:11235 unclecode/crawl4ai:basic-amd64
+
+# Full ML/LLM support
+docker pull unclecode/crawl4ai:all-amd64
+docker run -p 11235:11235 unclecode/crawl4ai:all-amd64
+
+# With GPU support
+docker pull unclecode/crawl4ai:gpu-amd64
+docker run -p 11235:11235 unclecode/crawl4ai:gpu-amd64
+```
+
+### For ARM64 (M1/M2 Macs, ARM servers):
+```bash
+# Basic version (recommended)
+docker pull unclecode/crawl4ai:basic-arm64
+docker run -p 11235:11235 unclecode/crawl4ai:basic-arm64
+
+# Full ML/LLM support
+docker pull unclecode/crawl4ai:all-arm64
+docker run -p 11235:11235 unclecode/crawl4ai:all-arm64
+
+# With GPU support
+docker pull unclecode/crawl4ai:gpu-arm64
+docker run -p 11235:11235 unclecode/crawl4ai:gpu-arm64
+```
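The tags above follow a `<variant>-<arch>` convention. A small helper (hypothetical, not shipped with Crawl4AI) that composes an image reference from those two axes and rejects combinations not listed in this document:

```python
def image_ref(variant="basic", arch="amd64", repo="unclecode/crawl4ai"):
    # Compose "<repo>:<variant>-<arch>" from the flavors documented above.
    variants = {"basic", "all", "gpu"}
    arches = {"amd64", "arm64", "armv7"}
    if variant not in variants or arch not in arches:
        raise ValueError(f"unknown image flavor: {variant}-{arch}")
    return f"{repo}:{variant}-{arch}"

print(image_ref())                # unclecode/crawl4ai:basic-amd64
print(image_ref("gpu", "arm64"))  # unclecode/crawl4ai:gpu-arm64
```

This kind of helper is handy in deployment scripts that pick the image from detected platform information rather than hard-coding a tag.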
+
+Need more memory? Add `--shm-size`:
+```bash
+docker run --shm-size=2gb -p 11235:11235 unclecode/crawl4ai:basic-amd64
+```
+
+Test the installation:
+```bash
+curl http://localhost:11235/health
+```
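Beyond a one-off `curl`, a startup script usually wants to poll `/health` until the container is ready. A stdlib sketch with exponential backoff; the URL and port come from the run commands above, while the retry policy and response handling are assumptions of this sketch:

```python
import time
import urllib.request

HEALTH_URL = "http://localhost:11235/health"  # port published by `docker run -p 11235:11235`

def backoff_delays(retries=5, base=0.5):
    # Exponential schedule between attempts: 0.5s, 1s, 2s, ...
    return [base * (2 ** i) for i in range(retries)]

def wait_until_healthy(url=HEALTH_URL, retries=5):
    # Return True once the endpoint answers 200, False after all retries fail.
    for delay in backoff_delays(retries):
        try:
            with urllib.request.urlopen(url, timeout=3) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            time.sleep(delay)
    return False

print(backoff_delays(4))  # [0.5, 1.0, 2.0, 4.0]
```

Calling `wait_until_healthy()` right after `docker run` avoids racing the first crawl request against a browser that is still warming up inside the container.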
+
+### For Raspberry Pi (32-bit) (coming soon):
+```bash
+# Pull and run basic version (recommended for Raspberry Pi)
+docker pull unclecode/crawl4ai:basic-armv7
+docker run -p 11235:11235 unclecode/crawl4ai:basic-armv7
+
+# With increased shared memory if needed
+docker run --shm-size=2gb -p 11235:11235 unclecode/crawl4ai:basic-armv7
+```
+
+Note: Due to hardware constraints, only the basic version is recommended for Raspberry Pi.
+
+</details>
+
+<details>
+<summary>🐳 <strong>Option 2: Build from Repository</strong></summary>
+
+Build the image locally based on your platform:
+
+```bash
+# Clone the repository
+git clone https://github.com/unclecode/crawl4ai.git
+cd crawl4ai
+
+# For AMD64 (Regular Linux/Windows)
+docker build --platform linux/amd64 \
+  --tag crawl4ai:local \
+  --build-arg INSTALL_TYPE=basic \
+  .
+
+# For ARM64 (M1/M2 Macs, ARM servers)
+docker build --platform linux/arm64 \
+  --tag crawl4ai:local \
+  --build-arg INSTALL_TYPE=basic \
+  .
+```
+
+Build options:
+- INSTALL_TYPE=basic (default): Basic crawling features
+- INSTALL_TYPE=all: Full ML/LLM support
+- ENABLE_GPU=true: Add GPU support
+
+Example with all options:
+```bash
+docker build --platform linux/amd64 \
+  --tag crawl4ai:local \
+  --build-arg INSTALL_TYPE=all \
+  --build-arg ENABLE_GPU=true \
+  .
+```
+
+Run your local build:
+```bash
+# Regular run
+docker run -p 11235:11235 crawl4ai:local
+
+# With increased shared memory
+docker run --shm-size=2gb -p 11235:11235 crawl4ai:local
+```
+
+Test the installation:
+```bash
+curl http://localhost:11235/health
+```
+
+</details>
+
+<details>
+<summary>🐳 <strong>Option 3: Using Docker Compose</strong></summary>
+
+Docker Compose provides a more structured way to run Crawl4AI, especially when dealing with environment variables and multiple configurations.
+
+```bash
+# Clone the repository
+git clone https://github.com/unclecode/crawl4ai.git
+cd crawl4ai
+```
+
+### For AMD64 (Regular Linux/Windows):
+```bash
+# Build and run locally
+docker-compose --profile local-amd64 up
+
+# Run from Docker Hub
+VERSION=basic docker-compose --profile hub-amd64 up  # Basic version
+VERSION=all docker-compose --profile hub-amd64 up    # Full ML/LLM support
+VERSION=gpu docker-compose --profile hub-amd64 up    # GPU support
+```
+
+### For ARM64 (M1/M2 Macs, ARM servers):
+```bash
+# Build and run locally
+docker-compose --profile local-arm64 up
+
+# Run from Docker Hub
+VERSION=basic docker-compose --profile hub-arm64 up  # Basic version
+VERSION=all docker-compose --profile hub-arm64 up    # Full ML/LLM support
+VERSION=gpu docker-compose --profile hub-arm64 up    # GPU support
+```
+
+Environment variables (optional):
+```bash
+# Create a .env file
+CRAWL4AI_API_TOKEN=your_token
+OPENAI_API_KEY=your_openai_key
+CLAUDE_API_KEY=your_claude_key
+```
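Docker Compose reads a `.env` file like the one above automatically; for scripts that run outside Compose, a minimal parser for the same `KEY=value` format (a stdlib sketch, not a replacement for a full dotenv library):

```python
def parse_env(text):
    # Parse KEY=value lines; blank lines and '#' comments are ignored.
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values

sample = """\
# Create a .env file
CRAWL4AI_API_TOKEN=your_token
OPENAI_API_KEY=your_openai_key
CLAUDE_API_KEY=your_claude_key
"""
print(parse_env(sample))
```

Keeping secrets in `.env` rather than in the compose file itself means the same compose configuration can be committed while credentials stay local.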
+
+The compose file includes:
+- Memory management (4GB limit, 1GB reserved)
+- Shared memory volume for browser support
+- Health checks
+- Auto-restart policy
+- All necessary port mappings
+
+Test the installation:
+```bash
+curl http://localhost:11235/health
+```
+
+</details>
+
+<details>
+<summary>🚀 <strong>One-Click Deployment</strong></summary>
+
+Deploy your own instance of Crawl4AI with one click:
+
+[](https://www.digitalocean.com/?repo=https://github.com/unclecode/crawl4ai/tree/0.3.74&refcode=a0780f1bdb3d&utm_campaign=Referral_Invite&utm_medium=Referral_Program&utm_source=badge)
+
+> 💡 **Recommended specs**: 4GB RAM minimum. Select "professional-xs" or higher when deploying for stable operation.
+
+The deploy will:
+- Set up a Docker container with Crawl4AI
+- Configure Playwright and all dependencies
+- Start the FastAPI server on port `11235`
+- Set up health checks and auto-deployment
+
+</details>