docs: update README badges and Docker section, reorganize documentation structure

This commit is contained in:
UncleCode
2024-12-31 19:45:02 +08:00
parent 67d0999bc3
commit 5c3c05bf93
2 changed files with 283 additions and 255 deletions

347
README.md
View File

@@ -1,13 +1,22 @@
# 🚀🤖 Crawl4AI: Open-source LLM Friendly Web Crawler & Scraper.
<div align="center">
<a href="https://trendshift.io/repositories/11716" target="_blank"><img src="https://trendshift.io/api/badge/repositories/11716" alt="unclecode%2Fcrawl4ai | Trendshift" style="width: 250px; height: 55px;" width="250" height="55"/></a>
[![GitHub Stars](https://img.shields.io/github/stars/unclecode/crawl4ai?style=social)](https://github.com/unclecode/crawl4ai/stargazers)
![PyPI - Downloads](https://img.shields.io/pypi/dm/Crawl4AI)
[![GitHub Forks](https://img.shields.io/github/forks/unclecode/crawl4ai?style=social)](https://github.com/unclecode/crawl4ai/network/members)
[![GitHub Issues](https://img.shields.io/github/issues/unclecode/crawl4ai)](https://github.com/unclecode/crawl4ai/issues)
[![GitHub Pull Requests](https://img.shields.io/github/issues-pr/unclecode/crawl4ai)](https://github.com/unclecode/crawl4ai/pulls)
[![PyPI version](https://badge.fury.io/py/crawl4ai.svg)](https://badge.fury.io/py/crawl4ai)
[![Python Version](https://img.shields.io/pypi/pyversions/crawl4ai)](https://pypi.org/project/crawl4ai/)
[![Downloads](https://static.pepy.tech/badge/crawl4ai/month)](https://pepy.tech/project/crawl4ai)
[![Documentation Status](https://readthedocs.org/projects/crawl4ai/badge/?version=latest)](https://crawl4ai.readthedocs.io/)
[![License](https://img.shields.io/github/license/unclecode/crawl4ai)](https://github.com/unclecode/crawl4ai/blob/main/LICENSE)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
[![Security: bandit](https://img.shields.io/badge/security-bandit-yellow.svg)](https://github.com/PyCQA/bandit)
</div>
Crawl4AI is the #1 trending GitHub repository, actively maintained by a vibrant community. It delivers blazing-fast, AI-ready web crawling tailored for LLMs, AI agents, and data pipelines. Open source, flexible, and built for real-time performance, Crawl4AI empowers developers with unmatched speed, precision, and deployment ease.
@@ -28,20 +37,28 @@ Crawl4AI is the #1 trending GitHub repository, actively maintained by a vibrant
1. Install Crawl4AI:
```bash
# Install the package
pip install crawl4ai
crawl4ai-setup # Setup the browser
crawl4ai-setup
# Install Playwright with system dependencies (recommended)
playwright install --with-deps
# Or install specific browsers:
playwright install --with-deps chrome # Recommended for Colab/Linux
```
2. Run a simple web crawl:
```python
import asyncio
from crawl4ai import AsyncWebCrawler, CacheMode
from crawl4ai import *
async def main():
async with AsyncWebCrawler(verbose=True) as crawler:
result = await crawler.arun(url="https://www.nbcnews.com/business")
# Soone will be change to result.markdown
print(result.markdown_v2.raw_markdown)
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(
url="https://www.nbcnews.com/business",
)
print(result.markdown)
if __name__ == "__main__":
asyncio.run(main())
@@ -200,193 +217,26 @@ pip install -e ".[all]" # Install all optional features
</details>
<details>
<summary>🚀 <strong>One-Click Deployment</strong></summary>
<summary>🐳 <strong>Docker Deployment</strong></summary>
Deploy your own instance of Crawl4AI with one click:
> 🚀 **Major Changes Coming!** We're developing a completely new Docker implementation that will make deployment even more efficient and seamless. The current Docker setup is being deprecated in favor of this new solution.
[![DigitalOcean Referral Badge](https://web-platforms.sfo2.cdn.digitaloceanspaces.com/WWW/Badge%203.svg)](https://www.digitalocean.com/?repo=https://github.com/unclecode/crawl4ai/tree/0.3.74&refcode=a0780f1bdb3d&utm_campaign=Referral_Invite&utm_medium=Referral_Program&utm_source=badge)
### Current Docker Support
> 💡 **Recommended specs**: 4GB RAM minimum. Select "professional-xs" or higher when deploying for stable operation.
The existing Docker implementation is being deprecated and will be replaced soon. If you still need to use Docker with the current version:
The deploy will:
- Set up a Docker container with Crawl4AI
- Configure Playwright and all dependencies
- Start the FastAPI server on port `11235`
- Set up health checks and auto-deployment
- 📚 [Deprecated Docker Setup](./docs/deprecated/docker-deployment.md) - Instructions for the current Docker implementation
- ⚠️ Note: This setup will be replaced in the next major release
</details>
### What's Coming Next?
<details>
<summary>🐳 <strong>Using Docker</strong></summary>
Our new Docker implementation will bring:
- Improved performance and resource efficiency
- Streamlined deployment process
- Better integration with Crawl4AI features
- Enhanced scalability options
Crawl4AI is available as Docker images for easy deployment. You can either pull directly from Docker Hub (recommended) or build from the repository.
---
<details>
<summary>🐳 <strong>Option 1: Docker Hub (Recommended)</strong></summary>
Choose the appropriate image based on your platform and needs:
### For AMD64 (Regular Linux/Windows):
```bash
# Basic version (recommended)
docker pull unclecode/crawl4ai:basic-amd64
docker run -p 11235:11235 unclecode/crawl4ai:basic-amd64
# Full ML/LLM support
docker pull unclecode/crawl4ai:all-amd64
docker run -p 11235:11235 unclecode/crawl4ai:all-amd64
# With GPU support
docker pull unclecode/crawl4ai:gpu-amd64
docker run -p 11235:11235 unclecode/crawl4ai:gpu-amd64
```
### For ARM64 (M1/M2 Macs, ARM servers):
```bash
# Basic version (recommended)
docker pull unclecode/crawl4ai:basic-arm64
docker run -p 11235:11235 unclecode/crawl4ai:basic-arm64
# Full ML/LLM support
docker pull unclecode/crawl4ai:all-arm64
docker run -p 11235:11235 unclecode/crawl4ai:all-arm64
# With GPU support
docker pull unclecode/crawl4ai:gpu-arm64
docker run -p 11235:11235 unclecode/crawl4ai:gpu-arm64
```
Need more memory? Add `--shm-size`:
```bash
docker run --shm-size=2gb -p 11235:11235 unclecode/crawl4ai:basic-amd64
```
Test the installation:
```bash
curl http://localhost:11235/health
```
### For Raspberry Pi (32-bit) (coming soon):
```bash
# Pull and run basic version (recommended for Raspberry Pi)
docker pull unclecode/crawl4ai:basic-armv7
docker run -p 11235:11235 unclecode/crawl4ai:basic-armv7
# With increased shared memory if needed
docker run --shm-size=2gb -p 11235:11235 unclecode/crawl4ai:basic-armv7
```
Note: Due to hardware constraints, only the basic version is recommended for Raspberry Pi.
</details>
<details>
<summary>🐳 <strong>Option 2: Build from Repository</strong></summary>
Build the image locally based on your platform:
```bash
# Clone the repository
git clone https://github.com/unclecode/crawl4ai.git
cd crawl4ai
# For AMD64 (Regular Linux/Windows)
docker build --platform linux/amd64 \
--tag crawl4ai:local \
--build-arg INSTALL_TYPE=basic \
.
# For ARM64 (M1/M2 Macs, ARM servers)
docker build --platform linux/arm64 \
--tag crawl4ai:local \
--build-arg INSTALL_TYPE=basic \
.
```
Build options:
- INSTALL_TYPE=basic (default): Basic crawling features
- INSTALL_TYPE=all: Full ML/LLM support
- ENABLE_GPU=true: Add GPU support
Example with all options:
```bash
docker build --platform linux/amd64 \
--tag crawl4ai:local \
--build-arg INSTALL_TYPE=all \
--build-arg ENABLE_GPU=true \
.
```
Run your local build:
```bash
# Regular run
docker run -p 11235:11235 crawl4ai:local
# With increased shared memory
docker run --shm-size=2gb -p 11235:11235 crawl4ai:local
```
Test the installation:
```bash
curl http://localhost:11235/health
```
</details>
<details>
<summary>🐳 <strong>Option 3: Using Docker Compose</strong></summary>
Docker Compose provides a more structured way to run Crawl4AI, especially when dealing with environment variables and multiple configurations.
```bash
# Clone the repository
git clone https://github.com/unclecode/crawl4ai.git
cd crawl4ai
```
### For AMD64 (Regular Linux/Windows):
```bash
# Build and run locally
docker-compose --profile local-amd64 up
# Run from Docker Hub
VERSION=basic docker-compose --profile hub-amd64 up # Basic version
VERSION=all docker-compose --profile hub-amd64 up # Full ML/LLM support
VERSION=gpu docker-compose --profile hub-amd64 up # GPU support
```
### For ARM64 (M1/M2 Macs, ARM servers):
```bash
# Build and run locally
docker-compose --profile local-arm64 up
# Run from Docker Hub
VERSION=basic docker-compose --profile hub-arm64 up # Basic version
VERSION=all docker-compose --profile hub-arm64 up # Full ML/LLM support
VERSION=gpu docker-compose --profile hub-arm64 up # GPU support
```
Environment variables (optional):
```bash
# Create a .env file
CRAWL4AI_API_TOKEN=your_token
OPENAI_API_KEY=your_openai_key
CLAUDE_API_KEY=your_claude_key
```
The compose file includes:
- Memory management (4GB limit, 1GB reserved)
- Shared memory volume for browser support
- Health checks
- Auto-restart policy
- All necessary port mappings
Test the installation:
```bash
curl http://localhost:11235/health
```
Stay connected with our [GitHub repository](https://github.com/unclecode/crawl4ai) for updates!
</details>
@@ -424,24 +274,29 @@ You can check the project structure in the directory [https://github.com/uncleco
```python
import asyncio
from crawl4ai import AsyncWebCrawler, CacheMode
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from crawl4ai.content_filter_strategy import PruningContentFilter, BM25ContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
async def main():
async with AsyncWebCrawler(
browser_config = BrowserConfig(
headless=True,
verbose=True,
) as crawler:
)
run_config = CrawlerRunConfig(
cache_mode=CacheMode.ENABLED,
markdown_generator=DefaultMarkdownGenerator(
content_filter=PruningContentFilter(threshold=0.48, threshold_type="fixed", min_word_threshold=0)
),
# markdown_generator=DefaultMarkdownGenerator(
# content_filter=BM25ContentFilter(user_query="WHEN_WE_FOCUS_BASED_ON_A_USER_QUERY", bm25_threshold=1.0)
# ),
)
async with AsyncWebCrawler(config=browser_config) as crawler:
result = await crawler.arun(
url="https://docs.micronaut.io/4.7.6/guide/",
cache_mode=CacheMode.ENABLED,
markdown_generator=DefaultMarkdownGenerator(
content_filter=PruningContentFilter(threshold=0.48, threshold_type="fixed", min_word_threshold=0)
),
# markdown_generator=DefaultMarkdownGenerator(
# content_filter=BM25ContentFilter(user_query="WHEN_WE_FOCUS_BASED_ON_A_USER_QUERY", bm25_threshold=1.0)
# ),
config=run_config
)
print(len(result.markdown))
print(len(result.fit_markdown))
@@ -458,7 +313,7 @@ if __name__ == "__main__":
```python
import asyncio
from crawl4ai import AsyncWebCrawler, CacheMode
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
import json
@@ -493,36 +348,26 @@ async def main():
"type": "attribute",
"attribute": "src"
}
]
}
}
extraction_strategy = JsonCssExtractionStrategy(schema, verbose=True)
async with AsyncWebCrawler(
browser_config = BrowserConfig(
headless=False,
verbose=True
) as crawler:
)
run_config = CrawlerRunConfig(
extraction_strategy=extraction_strategy,
js_code=["""(async () => {const tabs = document.querySelectorAll("section.charge-methodology .tabs-menu-3 > div");for(let tab of tabs) {tab.scrollIntoView();tab.click();await new Promise(r => setTimeout(r, 500));}})();"""],
cache_mode=CacheMode.BYPASS
)
# Create the JavaScript that handles clicking multiple times
js_click_tabs = """
(async () => {
const tabs = document.querySelectorAll("section.charge-methodology .tabs-menu-3 > div");
for(let tab of tabs) {
// scroll to the tab
tab.scrollIntoView();
tab.click();
// Wait for content to load and animations to complete
await new Promise(r => setTimeout(r, 500));
}
})();
"""
async with AsyncWebCrawler(config=browser_config) as crawler:
result = await crawler.arun(
url="https://www.kidocode.com/degrees/technology",
extraction_strategy=JsonCssExtractionStrategy(schema, verbose=True),
js_code=[js_click_tabs],
cache_mode=CacheMode.BYPASS
config=run_config
)
companies = json.loads(result.extracted_content)
@@ -542,7 +387,7 @@ if __name__ == "__main__":
```python
import os
import asyncio
from crawl4ai import AsyncWebCrawler, CacheMode
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from pydantic import BaseModel, Field
@@ -552,21 +397,26 @@ class OpenAIModelFee(BaseModel):
output_fee: str = Field(..., description="Fee for output token for the OpenAI model.")
async def main():
async with AsyncWebCrawler(verbose=True) as crawler:
browser_config = BrowserConfig(verbose=True)
run_config = CrawlerRunConfig(
word_count_threshold=1,
extraction_strategy=LLMExtractionStrategy(
# Here you can use any provider that Litellm library supports, for instance: ollama/qwen2
# provider="ollama/qwen2", api_token="no-token",
provider="openai/gpt-4o", api_token=os.getenv('OPENAI_API_KEY'),
schema=OpenAIModelFee.schema(),
extraction_type="schema",
instruction="""From the crawled content, extract all mentioned model names along with their fees for input and output tokens.
Do not miss any models in the entire content. One extracted model JSON format should look like this:
{"model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", "output_fee": "US$30.00 / 1M tokens"}."""
),
cache_mode=CacheMode.BYPASS,
)
async with AsyncWebCrawler(config=browser_config) as crawler:
result = await crawler.arun(
url='https://openai.com/api/pricing/',
word_count_threshold=1,
extraction_strategy=LLMExtractionStrategy(
# Here you can use any provider that Litellm library supports, for instance: ollama/qwen2
# provider="ollama/qwen2", api_token="no-token",
provider="openai/gpt-4o", api_token=os.getenv('OPENAI_API_KEY'),
schema=OpenAIModelFee.schema(),
extraction_type="schema",
instruction="""From the crawled content, extract all mentioned model names along with their fees for input and output tokens.
Do not miss any models in the entire content. One extracted model JSON format should look like this:
{"model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", "output_fee": "US$30.00 / 1M tokens"}."""
),
cache_mode=CacheMode.BYPASS,
config=run_config
)
print(result.extracted_content)
@@ -583,37 +433,29 @@ if __name__ == "__main__":
import os, sys
from pathlib import Path
import asyncio, time
from crawl4ai import AsyncWebCrawler
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
async def test_news_crawl():
# Create a persistent user data directory
user_data_dir = os.path.join(Path.home(), ".crawl4ai", "browser_profile")
os.makedirs(user_data_dir, exist_ok=True)
async with AsyncWebCrawler(
browser_config = BrowserConfig(
verbose=True,
headless=True,
user_data_dir=user_data_dir,
use_persistent_context=True,
headers={
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
"Accept-Language": "en-US,en;q=0.5",
"Accept-Encoding": "gzip, deflate, br",
"DNT": "1",
"Connection": "keep-alive",
"Upgrade-Insecure-Requests": "1",
"Sec-Fetch-Dest": "document",
"Sec-Fetch-Mode": "navigate",
"Sec-Fetch-Site": "none",
"Sec-Fetch-User": "?1",
"Cache-Control": "max-age=0",
}
) as crawler:
)
run_config = CrawlerRunConfig(
cache_mode=CacheMode.BYPASS
)
async with AsyncWebCrawler(config=browser_config) as crawler:
url = "ADDRESS_OF_A_CHALLENGING_WEBSITE"
result = await crawler.arun(
url,
cache_mode=CacheMode.BYPASS,
config=run_config,
magic=True,
)
@@ -705,9 +547,6 @@ We envision a future where AI is powered by real human knowledge, ensuring data
For more details, see our [full mission statement](./MISSION.md).
</details>
## Star History
[![Star History Chart](https://api.star-history.com/svg?repos=unclecode/crawl4ai&type=Date)](https://star-history.com/#unclecode/crawl4ai&Date)

View File

@@ -0,0 +1,189 @@
# 🐳 Using Docker (Legacy)
Crawl4AI is available as Docker images for easy deployment. You can either pull directly from Docker Hub (recommended) or build from the repository.
---
<details>
<summary>🐳 <strong>Option 1: Docker Hub (Recommended)</strong></summary>
Choose the appropriate image based on your platform and needs:
### For AMD64 (Regular Linux/Windows):
```bash
# Basic version (recommended)
docker pull unclecode/crawl4ai:basic-amd64
docker run -p 11235:11235 unclecode/crawl4ai:basic-amd64
# Full ML/LLM support
docker pull unclecode/crawl4ai:all-amd64
docker run -p 11235:11235 unclecode/crawl4ai:all-amd64
# With GPU support
docker pull unclecode/crawl4ai:gpu-amd64
docker run -p 11235:11235 unclecode/crawl4ai:gpu-amd64
```
### For ARM64 (M1/M2 Macs, ARM servers):
```bash
# Basic version (recommended)
docker pull unclecode/crawl4ai:basic-arm64
docker run -p 11235:11235 unclecode/crawl4ai:basic-arm64
# Full ML/LLM support
docker pull unclecode/crawl4ai:all-arm64
docker run -p 11235:11235 unclecode/crawl4ai:all-arm64
# With GPU support
docker pull unclecode/crawl4ai:gpu-arm64
docker run -p 11235:11235 unclecode/crawl4ai:gpu-arm64
```
Need more memory? Add `--shm-size`:
```bash
docker run --shm-size=2gb -p 11235:11235 unclecode/crawl4ai:basic-amd64
```
Test the installation:
```bash
curl http://localhost:11235/health
```
### For Raspberry Pi (32-bit) (coming soon):
```bash
# Pull and run basic version (recommended for Raspberry Pi)
docker pull unclecode/crawl4ai:basic-armv7
docker run -p 11235:11235 unclecode/crawl4ai:basic-armv7
# With increased shared memory if needed
docker run --shm-size=2gb -p 11235:11235 unclecode/crawl4ai:basic-armv7
```
Note: Due to hardware constraints, only the basic version is recommended for Raspberry Pi.
</details>
<details>
<summary>🐳 <strong>Option 2: Build from Repository</strong></summary>
Build the image locally based on your platform:
```bash
# Clone the repository
git clone https://github.com/unclecode/crawl4ai.git
cd crawl4ai
# For AMD64 (Regular Linux/Windows)
docker build --platform linux/amd64 \
--tag crawl4ai:local \
--build-arg INSTALL_TYPE=basic \
.
# For ARM64 (M1/M2 Macs, ARM servers)
docker build --platform linux/arm64 \
--tag crawl4ai:local \
--build-arg INSTALL_TYPE=basic \
.
```
Build options:
- INSTALL_TYPE=basic (default): Basic crawling features
- INSTALL_TYPE=all: Full ML/LLM support
- ENABLE_GPU=true: Add GPU support
Example with all options:
```bash
docker build --platform linux/amd64 \
--tag crawl4ai:local \
--build-arg INSTALL_TYPE=all \
--build-arg ENABLE_GPU=true \
.
```
Run your local build:
```bash
# Regular run
docker run -p 11235:11235 crawl4ai:local
# With increased shared memory
docker run --shm-size=2gb -p 11235:11235 crawl4ai:local
```
Test the installation:
```bash
curl http://localhost:11235/health
```
</details>
<details>
<summary>🐳 <strong>Option 3: Using Docker Compose</strong></summary>
Docker Compose provides a more structured way to run Crawl4AI, especially when dealing with environment variables and multiple configurations.
```bash
# Clone the repository
git clone https://github.com/unclecode/crawl4ai.git
cd crawl4ai
```
### For AMD64 (Regular Linux/Windows):
```bash
# Build and run locally
docker-compose --profile local-amd64 up
# Run from Docker Hub
VERSION=basic docker-compose --profile hub-amd64 up # Basic version
VERSION=all docker-compose --profile hub-amd64 up # Full ML/LLM support
VERSION=gpu docker-compose --profile hub-amd64 up # GPU support
```
### For ARM64 (M1/M2 Macs, ARM servers):
```bash
# Build and run locally
docker-compose --profile local-arm64 up
# Run from Docker Hub
VERSION=basic docker-compose --profile hub-arm64 up # Basic version
VERSION=all docker-compose --profile hub-arm64 up # Full ML/LLM support
VERSION=gpu docker-compose --profile hub-arm64 up # GPU support
```
Environment variables (optional):
```bash
# Create a .env file
CRAWL4AI_API_TOKEN=your_token
OPENAI_API_KEY=your_openai_key
CLAUDE_API_KEY=your_claude_key
```
The compose file includes:
- Memory management (4GB limit, 1GB reserved)
- Shared memory volume for browser support
- Health checks
- Auto-restart policy
- All necessary port mappings
Test the installation:
```bash
curl http://localhost:11235/health
```
</details>
<details>
<summary>🚀 <strong>One-Click Deployment</strong></summary>
Deploy your own instance of Crawl4AI with one click:
[![DigitalOcean Referral Badge](https://web-platforms.sfo2.cdn.digitaloceanspaces.com/WWW/Badge%203.svg)](https://www.digitalocean.com/?repo=https://github.com/unclecode/crawl4ai/tree/0.3.74&refcode=a0780f1bdb3d&utm_campaign=Referral_Invite&utm_medium=Referral_Program&utm_source=badge)
> 💡 **Recommended specs**: 4GB RAM minimum. Select "professional-xs" or higher when deploying for stable operation.
The deploy will:
- Set up a Docker container with Crawl4AI
- Configure Playwright and all dependencies
- Start the FastAPI server on port `11235`
- Set up health checks and auto-deployment
</details>