Commit Message:

Enhance crawler capabilities and documentation

  - Added SSL certificate extraction in AsyncWebCrawler.
  - Introduced new content filters and chunking strategies for more robust data extraction.
  - Updated documentation management to streamline user experience.
This commit is contained in:
UncleCode
2024-12-26 15:17:07 +08:00
parent d5ed451299
commit 9a4ed6bbd7
72 changed files with 14793 additions and 363 deletions

View File

@@ -0,0 +1,267 @@
# Introduction
## Quick Start (Minimal Example)
For a fast hands-on start, try crawling a single URL and printing its Markdown output:
```python
import asyncio
from crawl4ai import AsyncWebCrawler
async def main():
async with AsyncWebCrawler(verbose=True) as crawler:
result = await crawler.arun(url="https://example.com")
print(result.markdown)
if __name__ == "__main__":
asyncio.run(main())
```
This simple snippet should immediately confirm your environment is set up correctly. If you see the page content in Markdown format, youre good to go.
---
## Overview of Crawl4AI
Crawl4AI is a state-of-the-art, **asynchronous** web crawling library optimized for large-scale data collection. Its built to integrate seamlessly into AI workflows such as fine-tuning, retrieval-augmented generation (RAG), and data pipelines. By focusing on generating structured, AI-ready data (like Markdown), it helps you build robust applications quickly.
**Why Asynchronous?**
Async architecture allows you to concurrently crawl multiple URLs without waiting on slow network operations. This results in drastically improved performance and efficiency, especially when dealing with large-scale data extraction.
### Purpose and Vision
- Offer an open-source alternative to expensive commercial APIs.
- Provide clean, structured, Markdown-based outputs for easy AI integration.
- Democratize large-scale, high-speed, and reliable web crawling solutions.
### Key Features
- **Markdown Generation**: Produces AI-friendly, concise Markdown.
- **High-Performance Crawling**: Asynchronous operations let you crawl numerous URLs concurrently.
- **Browser Control**: Fine-tune browser sessions, user agents, proxies, and viewport.
- **JavaScript Support**: Handle dynamic pages by injecting custom JavaScript snippets.
- **Content Filtering**: Use advanced strategies (e.g., BM25) to focus on what matters.
- **Extensibility**: Define custom extraction strategies for complex data schemas.
- **Deployment Ready**: Easy Docker deployment for production and scalability.
---
## Use Cases
- **LLM Training and Fine-Tuning**: Collect and preprocess large web datasets to train machine learning models.
- **RAG Pipelines**: Generate context documents for retrieval-augmented generation tasks.
- **Content Summarization**: Extract pages and produce summaries directly in Markdown.
- **Structured Data Extraction**: Pull structured JSON data suitable for building knowledge graphs or databases.
**Example: Creating a Fine-Tuning Dataset**
```python
import asyncio
from crawl4ai import AsyncWebCrawler
async def main():
urls = ["https://example.com/dataset_page_1", "https://example.com/dataset_page_2"]
async with AsyncWebCrawler(verbose=True) as crawler:
results = await asyncio.gather(*[crawler.arun(url=u) for u in urls])
# Combine Markdown outputs into a single file for model fine-tuning
with open("fine_tuning_data.md", "w") as f:
for res in results:
f.write(res.markdown + "\n")
if __name__ == "__main__":
asyncio.run(main())
```
---
## Installation and Setup
### Environment Setup (Recommended)
Use a virtual environment to keep dependencies isolated:
```bash
python3 -m venv venv
source venv/bin/activate
pip install --upgrade pip
```
### Basic Installation
```bash
pip install crawl4ai
crawl4ai-setup
```
By default, this installs the asynchronous version and sets up Playwright.
### Verify Installation
Run a quick test:
```python
import asyncio
from crawl4ai import AsyncWebCrawler
async def main():
async with AsyncWebCrawler(verbose=True) as crawler:
result = await crawler.arun(url="https://crawl4ai.com")
print(result.markdown)
if __name__ == "__main__":
asyncio.run(main())
```
If you see the page content printed as Markdown, youre ready.
### Handling JavaScript-Heavy Pages
For pages that require JavaScript actions (like clicking a “Load More” button), use the `js_code` parameter:
```python
js_code = """
(async () => {
const loadMoreBtn = document.querySelector('button.load-more');
if (loadMoreBtn) loadMoreBtn.click();
await new Promise(r => setTimeout(r, 1000));
})();
"""
async with AsyncWebCrawler(verbose=True) as crawler:
result = await crawler.arun(
url="https://example.com/js-page",
js_code=[js_code]
)
print(result.markdown)
```
### Using Cache Modes
`CacheMode` can speed up repeated crawls by reusing previously fetched data. For instance:
```python
from crawl4ai import AsyncWebCrawler, CacheMode
async with AsyncWebCrawler(verbose=True) as crawler:
result = await crawler.arun(
url="https://example.com/large-page",
cache_mode=CacheMode.ENABLED
)
print(result.markdown)
```
---
## Quick Start Guide
### Minimal Working Example
```python
import asyncio
from crawl4ai import AsyncWebCrawler
async def main():
async with AsyncWebCrawler(verbose=True) as crawler:
result = await crawler.arun(url="https://crawl4ai.com")
print(result.markdown)
if __name__ == "__main__":
asyncio.run(main())
```
### Multiple Concurrent Crawls
Harness async concurrency to run multiple crawls in parallel:
```python
import asyncio
from crawl4ai import AsyncWebCrawler
async def crawl_url(crawler, url):
return await crawler.arun(url=url)
async def main():
urls = ["https://example.com/page1", "https://example.com/page2", "https://example.com/page3"]
async with AsyncWebCrawler(verbose=True) as crawler:
results = await asyncio.gather(*[crawl_url(crawler, u) for u in urls])
for r in results:
print(r.markdown[:200])
if __name__ == "__main__":
asyncio.run(main())
```
### Dockerized Setup
Run Crawl4AI in Docker for production environments:
```bash
docker pull unclecode/crawl4ai:basic-amd64
docker run -p 11235:11235 unclecode/crawl4ai:basic-amd64
curl http://localhost:11235/health
```
### Proxy and Security Configurations
```python
async with AsyncWebCrawler(
proxies={"http": "http://proxy.server:port", "https": "https://proxy.server:port"}
) as crawler:
result = await crawler.arun(url="https://crawl4ai.com")
print(result.markdown)
```
You can also add basic auth:
```python
async with AsyncWebCrawler(
proxies={"http": "http://user:password@proxy.server:port"}
) as crawler:
result = await crawler.arun(url="https://crawl4ai.com")
print(result.markdown)
```
### Customizing Browser Settings
Customize headers, user agents, and viewport:
```python
async with AsyncWebCrawler(
verbose=True,
headers={"User-Agent": "MyCustomBrowser/1.0"},
viewport={"width": 1280, "height": 800}
) as crawler:
result = await crawler.arun("https://example.com")
print(result.markdown)
```
---
## Troubleshooting Installation
### Playwright Errors
If `crawl4ai-setup` fails, install manually:
```bash
playwright install chromium
pip install crawl4ai[all]
```
### SSL or Proxy Issues
- Check certificates or disable SSL verification (for dev only).
- Verify proxy credentials and server details.
Use `verbose=True` for detailed logs:
```python
async with AsyncWebCrawler(verbose=True) as crawler:
result = await crawler.arun(url="https://crawl4ai.com")
print(result.markdown)
```
---
## Common Pitfalls
1. **Missing Playwright Installation**: Run `playwright install chromium`.
2. **Time-Out on JavaScript-Heavy Pages**: Increase wait time or use `js_code` for page interactions.
3. **Empty Markdown**: Check if the page is JavaScript-rendered and adjust `js_code` or `wait_for` conditions.
4. **Permission Errors**: Run commands with appropriate permissions or use a virtual environment.
---
## Support and Community
- **GitHub Issues**: Have questions or found a bug? Open an issue on the [GitHub Repo](https://github.com/unclecode/crawl4ai/issues).
- **Contributions**: We welcome pull requests. Check out the [contribution guidelines](https://github.com/unclecode/crawl4ai/blob/main/CONTRIBUTING.md).
- **Community Discussions**: Join discussions on GitHub to share tips, best practices, and feedback.
---
## Further Exploration
- **Advanced Extraction Strategies**: Dive into specialized extraction strategies like `JsonCssExtractionStrategy` or `LLMExtractionStrategy` for structured data output.
- **Content Filtering**: Explore BM25-based strategies to highlight the most relevant parts of a page.
- **Production Deployment**: Refer to the Docker and environment variable configurations for large-scale, distributed crawling setups.
For more detailed code examples and advanced topics, refer to the accompanying [README](https://github.com/unclecode/crawl4ai) and the `QUICKSTART` Python file included with this distribution.