Commit Message:
Enhance crawler capabilities and documentation - Added SSL certificate extraction in AsyncWebCrawler. - Introduced new content filters and chunking strategies for more robust data extraction. - Updated documentation management to streamline user experience.
This commit is contained in:
267
.local/ttt/1_introduction.ex.md
Normal file
267
.local/ttt/1_introduction.ex.md
Normal file
@@ -0,0 +1,267 @@
|
||||
# Introduction
|
||||
|
||||
## Quick Start (Minimal Example)
|
||||
For a fast hands-on start, try crawling a single URL and printing its Markdown output:
|
||||
|
||||
```python
|
||||
import asyncio
|
||||
from crawl4ai import AsyncWebCrawler
|
||||
|
||||
async def main():
|
||||
async with AsyncWebCrawler(verbose=True) as crawler:
|
||||
result = await crawler.arun(url="https://example.com")
|
||||
print(result.markdown)
|
||||
|
||||
if __name__ == "__main__":
|
||||
asyncio.run(main())
|
||||
```
|
||||
|
||||
This simple snippet should immediately confirm your environment is set up correctly. If you see the page content in Markdown format, you’re good to go.
|
||||
|
||||
---
|
||||
|
||||
## Overview of Crawl4AI
|
||||
Crawl4AI is a state-of-the-art, **asynchronous** web crawling library optimized for large-scale data collection. It’s built to integrate seamlessly into AI workflows such as fine-tuning, retrieval-augmented generation (RAG), and data pipelines. By focusing on generating structured, AI-ready data (like Markdown), it helps you build robust applications quickly.
|
||||
|
||||
**Why Asynchronous?**
|
||||
Async architecture allows you to concurrently crawl multiple URLs without waiting on slow network operations. This results in drastically improved performance and efficiency, especially when dealing with large-scale data extraction.
|
||||
|
||||
### Purpose and Vision
|
||||
- Offer an open-source alternative to expensive commercial APIs.
|
||||
- Provide clean, structured, Markdown-based outputs for easy AI integration.
|
||||
- Democratize large-scale, high-speed, and reliable web crawling solutions.
|
||||
|
||||
### Key Features
|
||||
- **Markdown Generation**: Produces AI-friendly, concise Markdown.
|
||||
- **High-Performance Crawling**: Asynchronous operations let you crawl numerous URLs concurrently.
|
||||
- **Browser Control**: Fine-tune browser sessions, user agents, proxies, and viewport.
|
||||
- **JavaScript Support**: Handle dynamic pages by injecting custom JavaScript snippets.
|
||||
- **Content Filtering**: Use advanced strategies (e.g., BM25) to focus on what matters.
|
||||
- **Extensibility**: Define custom extraction strategies for complex data schemas.
|
||||
- **Deployment Ready**: Easy Docker deployment for production and scalability.
|
||||
|
||||
---
|
||||
|
||||
## Use Cases
|
||||
- **LLM Training and Fine-Tuning**: Collect and preprocess large web datasets to train machine learning models.
|
||||
- **RAG Pipelines**: Generate context documents for retrieval-augmented generation tasks.
|
||||
- **Content Summarization**: Extract pages and produce summaries directly in Markdown.
|
||||
- **Structured Data Extraction**: Pull structured JSON data suitable for building knowledge graphs or databases.
|
||||
|
||||
**Example: Creating a Fine-Tuning Dataset**
|
||||
```python
|
||||
import asyncio
|
||||
from crawl4ai import AsyncWebCrawler
|
||||
|
||||
async def main():
|
||||
urls = ["https://example.com/dataset_page_1", "https://example.com/dataset_page_2"]
|
||||
async with AsyncWebCrawler(verbose=True) as crawler:
|
||||
results = await asyncio.gather(*[crawler.arun(url=u) for u in urls])
|
||||
# Combine Markdown outputs into a single file for model fine-tuning
|
||||
with open("fine_tuning_data.md", "w") as f:
|
||||
for res in results:
|
||||
f.write(res.markdown + "\n")
|
||||
|
||||
if __name__ == "__main__":
|
||||
asyncio.run(main())
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Installation and Setup
|
||||
|
||||
### Environment Setup (Recommended)
|
||||
Use a virtual environment to keep dependencies isolated:
|
||||
|
||||
```bash
|
||||
python3 -m venv venv
|
||||
source venv/bin/activate
|
||||
pip install --upgrade pip
|
||||
```
|
||||
|
||||
### Basic Installation
|
||||
```bash
|
||||
pip install crawl4ai
|
||||
crawl4ai-setup
|
||||
```
|
||||
|
||||
By default, this installs the asynchronous version and sets up Playwright.
|
||||
|
||||
### Verify Installation
|
||||
Run a quick test:
|
||||
|
||||
```python
|
||||
import asyncio
|
||||
from crawl4ai import AsyncWebCrawler
|
||||
|
||||
async def main():
|
||||
async with AsyncWebCrawler(verbose=True) as crawler:
|
||||
result = await crawler.arun(url="https://crawl4ai.com")
|
||||
print(result.markdown)
|
||||
|
||||
if __name__ == "__main__":
|
||||
asyncio.run(main())
|
||||
```
|
||||
|
||||
If you see the page content printed as Markdown, you’re ready.
|
||||
|
||||
### Handling JavaScript-Heavy Pages
|
||||
For pages that require JavaScript actions (like clicking a “Load More” button), use the `js_code` parameter:
|
||||
|
||||
```python
|
||||
js_code = """
|
||||
(async () => {
|
||||
const loadMoreBtn = document.querySelector('button.load-more');
|
||||
if (loadMoreBtn) loadMoreBtn.click();
|
||||
await new Promise(r => setTimeout(r, 1000));
|
||||
})();
|
||||
"""
|
||||
|
||||
async with AsyncWebCrawler(verbose=True) as crawler:
|
||||
result = await crawler.arun(
|
||||
url="https://example.com/js-page",
|
||||
js_code=[js_code]
|
||||
)
|
||||
print(result.markdown)
|
||||
```
|
||||
|
||||
### Using Cache Modes
|
||||
`CacheMode` can speed up repeated crawls by reusing previously fetched data. For instance:
|
||||
|
||||
```python
|
||||
from crawl4ai import AsyncWebCrawler, CacheMode
|
||||
|
||||
async with AsyncWebCrawler(verbose=True) as crawler:
|
||||
result = await crawler.arun(
|
||||
url="https://example.com/large-page",
|
||||
cache_mode=CacheMode.ENABLED
|
||||
)
|
||||
print(result.markdown)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Quick Start Guide
|
||||
|
||||
### Minimal Working Example
|
||||
```python
|
||||
import asyncio
|
||||
from crawl4ai import AsyncWebCrawler
|
||||
|
||||
async def main():
|
||||
async with AsyncWebCrawler(verbose=True) as crawler:
|
||||
result = await crawler.arun(url="https://crawl4ai.com")
|
||||
print(result.markdown)
|
||||
|
||||
if __name__ == "__main__":
|
||||
asyncio.run(main())
|
||||
```
|
||||
|
||||
### Multiple Concurrent Crawls
|
||||
Harness async concurrency to run multiple crawls in parallel:
|
||||
|
||||
```python
|
||||
import asyncio
|
||||
from crawl4ai import AsyncWebCrawler
|
||||
|
||||
async def crawl_url(crawler, url):
|
||||
return await crawler.arun(url=url)
|
||||
|
||||
async def main():
|
||||
urls = ["https://example.com/page1", "https://example.com/page2", "https://example.com/page3"]
|
||||
async with AsyncWebCrawler(verbose=True) as crawler:
|
||||
results = await asyncio.gather(*[crawl_url(crawler, u) for u in urls])
|
||||
for r in results:
|
||||
print(r.markdown[:200])
|
||||
|
||||
if __name__ == "__main__":
|
||||
asyncio.run(main())
|
||||
```
|
||||
|
||||
### Dockerized Setup
|
||||
Run Crawl4AI in Docker for production environments:
|
||||
|
||||
```bash
|
||||
docker pull unclecode/crawl4ai:basic-amd64
|
||||
docker run -p 11235:11235 unclecode/crawl4ai:basic-amd64
|
||||
curl http://localhost:11235/health
|
||||
```
|
||||
|
||||
### Proxy and Security Configurations
|
||||
```python
|
||||
async with AsyncWebCrawler(
|
||||
proxies={"http": "http://proxy.server:port", "https": "https://proxy.server:port"}
|
||||
) as crawler:
|
||||
result = await crawler.arun(url="https://crawl4ai.com")
|
||||
print(result.markdown)
|
||||
```
|
||||
|
||||
You can also add basic auth:
|
||||
|
||||
```python
|
||||
async with AsyncWebCrawler(
|
||||
proxies={"http": "http://user:password@proxy.server:port"}
|
||||
) as crawler:
|
||||
result = await crawler.arun(url="https://crawl4ai.com")
|
||||
print(result.markdown)
|
||||
```
|
||||
|
||||
### Customizing Browser Settings
|
||||
Customize headers, user agents, and viewport:
|
||||
|
||||
```python
|
||||
async with AsyncWebCrawler(
|
||||
verbose=True,
|
||||
headers={"User-Agent": "MyCustomBrowser/1.0"},
|
||||
viewport={"width": 1280, "height": 800}
|
||||
) as crawler:
|
||||
result = await crawler.arun("https://example.com")
|
||||
print(result.markdown)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Troubleshooting Installation
|
||||
|
||||
### Playwright Errors
|
||||
If `crawl4ai-setup` fails, install manually:
|
||||
```bash
|
||||
playwright install chromium
|
||||
pip install crawl4ai[all]
|
||||
```
|
||||
|
||||
### SSL or Proxy Issues
|
||||
- Check certificates or disable SSL verification (for dev only).
|
||||
- Verify proxy credentials and server details.
|
||||
|
||||
Use `verbose=True` for detailed logs:
|
||||
```python
|
||||
async with AsyncWebCrawler(verbose=True) as crawler:
|
||||
result = await crawler.arun(url="https://crawl4ai.com")
|
||||
print(result.markdown)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Common Pitfalls
|
||||
|
||||
1. **Missing Playwright Installation**: Run `playwright install chromium`.
|
||||
2. **Time-Out on JavaScript-Heavy Pages**: Increase wait time or use `js_code` for page interactions.
|
||||
3. **Empty Markdown**: Check if the page is JavaScript-rendered and adjust `js_code` or `wait_for` conditions.
|
||||
4. **Permission Errors**: Run commands with appropriate permissions or use a virtual environment.
|
||||
|
||||
---
|
||||
|
||||
## Support and Community
|
||||
- **GitHub Issues**: Have questions or found a bug? Open an issue on the [GitHub Repo](https://github.com/unclecode/crawl4ai/issues).
|
||||
- **Contributions**: We welcome pull requests. Check out the [contribution guidelines](https://github.com/unclecode/crawl4ai/blob/main/CONTRIBUTING.md).
|
||||
- **Community Discussions**: Join discussions on GitHub to share tips, best practices, and feedback.
|
||||
|
||||
---
|
||||
|
||||
## Further Exploration
|
||||
- **Advanced Extraction Strategies**: Dive into specialized extraction strategies like `JsonCssExtractionStrategy` or `LLMExtractionStrategy` for structured data output.
|
||||
- **Content Filtering**: Explore BM25-based strategies to highlight the most relevant parts of a page.
|
||||
- **Production Deployment**: Refer to the Docker and environment variable configurations for large-scale, distributed crawling setups.
|
||||
|
||||
For more detailed code examples and advanced topics, refer to the accompanying [README](https://github.com/unclecode/crawl4ai) and the `QUICKSTART` Python file included with this distribution.
|
||||
Reference in New Issue
Block a user