
# Introduction
## Quick Start (Minimal Example)
For a fast hands-on start, try crawling a single URL and printing its Markdown output:
```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(url="https://example.com")
        print(result.markdown)

if __name__ == "__main__":
    asyncio.run(main())
```
This simple snippet confirms that your environment is set up correctly. If you see the page content in Markdown format, you're good to go.
---
## Overview of Crawl4AI
Crawl4AI is a state-of-the-art, **asynchronous** web crawling library optimized for large-scale data collection. It's built to integrate seamlessly into AI workflows such as fine-tuning, retrieval-augmented generation (RAG), and data pipelines. By focusing on generating structured, AI-ready output (such as Markdown), it helps you build robust applications quickly.
**Why Asynchronous?**
Async architecture allows you to concurrently crawl multiple URLs without waiting on slow network operations. This results in drastically improved performance and efficiency, especially when dealing with large-scale data extraction.
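The payoff is easy to see with plain `asyncio` (a generic sketch using simulated fetches, independent of Crawl4AI): three 0.2-second "network requests" run concurrently finish in roughly 0.2 seconds total rather than 0.6.

```python
import asyncio
import time

async def fake_fetch(url: str) -> str:
    # Simulate a slow network round-trip
    await asyncio.sleep(0.2)
    return f"content of {url}"

async def main() -> float:
    urls = [f"https://example.com/page{i}" for i in range(3)]
    start = time.perf_counter()
    # All three "requests" wait on the network at the same time
    results = await asyncio.gather(*[fake_fetch(u) for u in urls])
    elapsed = time.perf_counter() - start
    print(f"fetched {len(results)} pages in {elapsed:.2f}s")
    return elapsed

if __name__ == "__main__":
    asyncio.run(main())
```

The same pattern, with `crawler.arun(...)` in place of `fake_fetch`, is what makes concurrent crawling fast.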
### Purpose and Vision
- Offer an open-source alternative to expensive commercial APIs.
- Provide clean, structured, Markdown-based outputs for easy AI integration.
- Democratize large-scale, high-speed, and reliable web crawling solutions.
### Key Features
- **Markdown Generation**: Produces AI-friendly, concise Markdown.
- **High-Performance Crawling**: Asynchronous operations let you crawl numerous URLs concurrently.
- **Browser Control**: Fine-tune browser sessions, user agents, proxies, and viewport.
- **JavaScript Support**: Handle dynamic pages by injecting custom JavaScript snippets.
- **Content Filtering**: Use advanced strategies (e.g., BM25) to focus on what matters.
- **Extensibility**: Define custom extraction strategies for complex data schemas.
- **Deployment Ready**: Easy Docker deployment for production and scalability.
---
## Use Cases
- **LLM Training and Fine-Tuning**: Collect and preprocess large web datasets to train machine learning models.
- **RAG Pipelines**: Generate context documents for retrieval-augmented generation tasks.
- **Content Summarization**: Extract pages and produce summaries directly in Markdown.
- **Structured Data Extraction**: Pull structured JSON data suitable for building knowledge graphs or databases.
**Example: Creating a Fine-Tuning Dataset**
```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    urls = ["https://example.com/dataset_page_1", "https://example.com/dataset_page_2"]
    async with AsyncWebCrawler(verbose=True) as crawler:
        results = await asyncio.gather(*[crawler.arun(url=u) for u in urls])
    # Combine Markdown outputs into a single file for model fine-tuning
    with open("fine_tuning_data.md", "w") as f:
        for res in results:
            f.write(res.markdown + "\n")

if __name__ == "__main__":
    asyncio.run(main())
```
---
## Installation and Setup
### Environment Setup (Recommended)
Use a virtual environment to keep dependencies isolated:
```bash
python3 -m venv venv
source venv/bin/activate
pip install --upgrade pip
```
### Basic Installation
```bash
pip install crawl4ai
crawl4ai-setup
```
By default, this installs the asynchronous version and sets up Playwright.
### Verify Installation
Run a quick test:
```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(url="https://crawl4ai.com")
        print(result.markdown)

if __name__ == "__main__":
    asyncio.run(main())
```
If you see the page content printed as Markdown, you're ready.
### Handling JavaScript-Heavy Pages
For pages that require JavaScript actions (like clicking a “Load More” button), use the `js_code` parameter:
```python
import asyncio
from crawl4ai import AsyncWebCrawler

js_code = """
(async () => {
    const loadMoreBtn = document.querySelector('button.load-more');
    if (loadMoreBtn) loadMoreBtn.click();
    await new Promise(r => setTimeout(r, 1000));
})();
"""

async def main():
    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(
            url="https://example.com/js-page",
            js_code=[js_code]
        )
        print(result.markdown)

asyncio.run(main())
```
### Using Cache Modes
`CacheMode` can speed up repeated crawls by reusing previously fetched data. For instance:
```python
import asyncio
from crawl4ai import AsyncWebCrawler, CacheMode

async def main():
    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(
            url="https://example.com/large-page",
            cache_mode=CacheMode.ENABLED
        )
        print(result.markdown)

asyncio.run(main())
```
---
## Quick Start Guide
### Minimal Working Example
```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(url="https://crawl4ai.com")
        print(result.markdown)

if __name__ == "__main__":
    asyncio.run(main())
```
### Multiple Concurrent Crawls
Harness async concurrency to run multiple crawls in parallel:
```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def crawl_url(crawler, url):
    return await crawler.arun(url=url)

async def main():
    urls = ["https://example.com/page1", "https://example.com/page2", "https://example.com/page3"]
    async with AsyncWebCrawler(verbose=True) as crawler:
        results = await asyncio.gather(*[crawl_url(crawler, u) for u in urls])
        for r in results:
            print(r.markdown[:200])

if __name__ == "__main__":
    asyncio.run(main())
```
### Dockerized Setup
Run Crawl4AI in Docker for production environments:
```bash
docker pull unclecode/crawl4ai:basic-amd64
docker run -p 11235:11235 unclecode/crawl4ai:basic-amd64
curl http://localhost:11235/health
```
### Proxy and Security Configurations
```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler(
        proxies={"http": "http://proxy.server:port", "https": "https://proxy.server:port"}
    ) as crawler:
        result = await crawler.arun(url="https://crawl4ai.com")
        print(result.markdown)

asyncio.run(main())
```
You can also add basic auth:
```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler(
        proxies={"http": "http://user:password@proxy.server:port"}
    ) as crawler:
        result = await crawler.arun(url="https://crawl4ai.com")
        print(result.markdown)

asyncio.run(main())
```
### Customizing Browser Settings
Customize headers, user agents, and viewport:
```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler(
        verbose=True,
        headers={"User-Agent": "MyCustomBrowser/1.0"},
        viewport={"width": 1280, "height": 800}
    ) as crawler:
        result = await crawler.arun("https://example.com")
        print(result.markdown)

asyncio.run(main())
```
---
## Troubleshooting Installation
### Playwright Errors
If `crawl4ai-setup` fails, install manually:
```bash
playwright install chromium
pip install crawl4ai[all]
```
### SSL or Proxy Issues
- Check certificates or disable SSL verification (for dev only).
- Verify proxy credentials and server details.
Use `verbose=True` for detailed logs:
```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(url="https://crawl4ai.com")
        print(result.markdown)
asyncio.run(main())
```
---
## Common Pitfalls
1. **Missing Playwright Installation**: Run `playwright install chromium`.
2. **Time-Out on JavaScript-Heavy Pages**: Increase wait time or use `js_code` for page interactions.
3. **Empty Markdown**: Check if the page is JavaScript-rendered and adjust `js_code` or `wait_for` conditions.
4. **Permission Errors**: Run commands with appropriate permissions or use a virtual environment.
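For pitfall 3, a tiny guard can surface empty results immediately instead of letting them flow silently into your pipeline. `ensure_markdown` here is a hypothetical helper, not part of the Crawl4AI API:

```python
def ensure_markdown(result, min_chars: int = 50) -> str:
    """Raise early if a crawl result's Markdown looks empty (hypothetical helper)."""
    md = (result.markdown or "").strip()
    if len(md) < min_chars:
        raise ValueError(
            "Markdown output is nearly empty; the page may be JavaScript-rendered. "
            "Consider adjusting js_code or wait_for conditions."
        )
    return md
```

Call it right after `await crawler.arun(...)` so a JavaScript-rendered page fails loudly rather than producing a blank dataset entry.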
---
## Support and Community
- **GitHub Issues**: Have questions or found a bug? Open an issue on the [GitHub Repo](https://github.com/unclecode/crawl4ai/issues).
- **Contributions**: We welcome pull requests. Check out the [contribution guidelines](https://github.com/unclecode/crawl4ai/blob/main/CONTRIBUTING.md).
- **Community Discussions**: Join discussions on GitHub to share tips, best practices, and feedback.
---
## Further Exploration
- **Advanced Extraction Strategies**: Dive into specialized extraction strategies like `JsonCssExtractionStrategy` or `LLMExtractionStrategy` for structured data output.
- **Content Filtering**: Explore BM25-based strategies to highlight the most relevant parts of a page.
- **Production Deployment**: Refer to the Docker and environment variable configurations for large-scale, distributed crawling setups.
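To build intuition for BM25-based filtering, here is a minimal, self-contained Okapi BM25 scorer. This is an illustrative sketch of the ranking formula, not Crawl4AI's implementation; it scores text segments against a query so the most relevant ones can be kept:

```python
import math
from collections import Counter

def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each document against the query with Okapi BM25 (whitespace tokenization)."""
    tokenized = [d.lower().split() for d in docs]
    avgdl = sum(len(t) for t in tokenized) / len(tokenized)
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    df = Counter()
    for toks in tokenized:
        for term in set(toks):
            df[term] += 1
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            # Term frequency saturation (k1) and length normalization (b)
            score += idf * tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(toks) / avgdl)
            )
        scores.append(score)
    return scores
```

A content filter built on this idea splits a page into segments, scores each against a user query, and keeps only the top-scoring segments.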
For more detailed code examples and advanced topics, refer to the accompanying [README](https://github.com/unclecode/crawl4ai) and the `QUICKSTART` Python file included with this distribution.