Commit Message:

Enhance crawler capabilities and documentation

- Added SSL certificate extraction in AsyncWebCrawler.
- Introduced new content filters and chunking strategies for more robust data extraction.
- Updated documentation management to streamline user experience.
2024-12-26 15:17:07 +08:00
parent d5ed451299
commit 9a4ed6bbd7
72 changed files with 14793 additions and 363 deletions
--- a/.local/ttt/1_introduction.ex.md
+++ b/.local/ttt/1_introduction.ex.md
@@ -0,0 +1,267 @@
+# Introduction
+
+## Quick Start (Minimal Example)
+For a fast hands-on start, try crawling a single URL and printing its Markdown output:
+
+```python
+import asyncio
+from crawl4ai import AsyncWebCrawler
+
+async def main():
+    async with AsyncWebCrawler(verbose=True) as crawler:
+        result = await crawler.arun(url="https://example.com")
+        print(result.markdown)
+
+if __name__ == "__main__":
+    asyncio.run(main())
+```
+
+This simple snippet should immediately confirm your environment is set up correctly. If you see the page content in Markdown format, you’re good to go.
+
+---
+
+## Overview of Crawl4AI
+Crawl4AI is an **asynchronous** web crawling library optimized for large-scale data collection. It is built to fit into AI workflows such as fine-tuning, retrieval-augmented generation (RAG), and data pipelines. Because it produces structured, AI-ready output (such as Markdown), crawled pages can be passed to downstream tooling with little extra cleanup.
+
+**Why Asynchronous?**  
+An async architecture lets one crawler keep many page fetches in flight at once instead of waiting for each slow network operation to finish before starting the next. When crawling many URLs, total wall-clock time moves toward the duration of the slowest pages rather than the sum of all page loads.
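The difference between waiting on each fetch in turn and overlapping the waits can be illustrated without any network access. The sketch below is not Crawl4AI code: `fake_fetch` is a hypothetical stand-in that simulates a slow page load with `asyncio.sleep`, and the latencies are made-up values.

```python
import asyncio
import time

async def fake_fetch(url: str, delay: float) -> str:
    # Hypothetical stand-in for a page fetch: sleeping simulates network latency.
    await asyncio.sleep(delay)
    return f"content of {url}"

async def fetch_sequentially(delays: dict[str, float]) -> list[str]:
    # Each fetch starts only after the previous one has finished.
    return [await fake_fetch(url, d) for url, d in delays.items()]

async def fetch_concurrently(delays: dict[str, float]) -> list[str]:
    # All fetches are started together; gather returns results in input order.
    return await asyncio.gather(*(fake_fetch(url, d) for url, d in delays.items()))

def timed(coro) -> tuple[list[str], float]:
    # Run a coroutine to completion and report the elapsed wall-clock time.
    start = time.perf_counter()
    result = asyncio.run(coro)
    return result, time.perf_counter() - start

if __name__ == "__main__":
    delays = {"page1": 0.3, "page2": 0.2, "page3": 0.1}  # made-up latencies
    _, sequential = timed(fetch_sequentially(delays))
    _, concurrent = timed(fetch_concurrently(delays))
    print(f"sequential: {sequential:.2f}s, concurrent: {concurrent:.2f}s")
```

In the sequential version the simulated delays add up, while in the concurrent version the total is close to the longest single delay.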
+
+### Purpose and Vision
+- Offer an open-source alternative to expensive commercial APIs.
+- Provide clean, structured, Markdown-based outputs for easy AI integration.
+- Democratize large-scale, high-speed, and reliable web crawling solutions.
+
+### Key Features
+- **Markdown Generation**: Produces AI-friendly, concise Markdown.
+- **High-Performance Crawling**: Asynchronous operations let you crawl numerous URLs concurrently.
+- **Browser Control**: Fine-tune browser sessions, user agents, proxies, and viewport.
+- **JavaScript Support**: Handle dynamic pages by injecting custom JavaScript snippets.
+- **Content Filtering**: Use advanced strategies (e.g., BM25) to focus on what matters.
+- **Extensibility**: Define custom extraction strategies for complex data schemas.
+- **Deployment Ready**: Easy Docker deployment for production and scalability.
+
+---
+
+## Use Cases
+- **LLM Training and Fine-Tuning**: Collect and preprocess large web datasets to train machine learning models.
+- **RAG Pipelines**: Generate context documents for retrieval-augmented generation tasks.
+- **Content Summarization**: Extract pages and produce summaries directly in Markdown.
+- **Structured Data Extraction**: Pull structured JSON data suitable for building knowledge graphs or databases.
+
+**Example: Creating a Fine-Tuning Dataset**
+```python
+import asyncio
+from crawl4ai import AsyncWebCrawler
+
+async def main():
+    urls = ["https://example.com/dataset_page_1", "https://example.com/dataset_page_2"]
+    async with AsyncWebCrawler(verbose=True) as crawler:
+        results = await asyncio.gather(*[crawler.arun(url=u) for u in urls])
+        # Combine Markdown outputs into a single file for model fine-tuning
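+        with open("fine_tuning_data.md", "w", encoding="utf-8") as f:
+            for res in results:
+                # Skip failed crawls, whose Markdown may be empty or None
+                if res.success and res.markdown:
+                    f.write(res.markdown + "\n")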
+
+if __name__ == "__main__":
+    asyncio.run(main())
+```
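For the RAG use case listed above, the Markdown output usually has to be cut into smaller context documents before indexing. The following is a minimal sketch of one way to do that, splitting on headings. `split_markdown_by_heading` is a hypothetical helper written for this illustration, not part of Crawl4AI, and it assumes ATX-style (`#`) headings.

```python
import re

FENCE = "`" * 3  # Markdown code-fence marker

def split_markdown_by_heading(markdown: str, max_level: int = 2) -> list[dict[str, str]]:
    """Split Markdown into chunks, starting a new chunk at each heading up to max_level."""
    heading = re.compile(rf"^(#{{1,{max_level}}})\s+(.*)$")
    chunks: list[dict[str, str]] = []
    title, lines = "", []
    in_fence = False

    def flush() -> None:
        # Store the chunk collected so far, unless it is completely empty.
        if title or any(line.strip() for line in lines):
            chunks.append({"title": title, "text": "\n".join(lines).strip()})

    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence  # "#" lines inside code fences are not headings
        match = None if in_fence else heading.match(line)
        if match:
            flush()
            title, lines = match.group(2).strip(), []
        else:
            lines.append(line)
    flush()
    return chunks
```

Each chunk keeps its heading as `title`, so a retriever can show where a passage came from. In a crawl loop, the input would be `result.markdown` from a successful crawl.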
+
+---
+
+## Installation and Setup
+
+### Environment Setup (Recommended)
+Use a virtual environment to keep dependencies isolated:
+
+```bash
+python3 -m venv venv
+source venv/bin/activate
+pip install --upgrade pip
+```
+
+### Basic Installation
+```bash
+pip install crawl4ai
+crawl4ai-setup
+```
+
+By default, this installs the asynchronous version and sets up Playwright.
+
+### Verify Installation
+Run a quick test:
+
+```python
+import asyncio
+from crawl4ai import AsyncWebCrawler
+
+async def main():
+    async with AsyncWebCrawler(verbose=True) as crawler:
+        result = await crawler.arun(url="https://crawl4ai.com")
+        print(result.markdown)
+
+if __name__ == "__main__":
+    asyncio.run(main())
+```
+
+If you see the page content printed as Markdown, you’re ready.
+
+### Handling JavaScript-Heavy Pages
+For pages that require JavaScript actions (like clicking a “Load More” button), use the `js_code` parameter:
+
+```python
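+import asyncio
+from crawl4ai import AsyncWebCrawler
+
+# JavaScript run in the page: click "Load More" if present, then wait one second
+js_code = """
+(async () => {
+    const loadMoreBtn = document.querySelector('button.load-more');
+    if (loadMoreBtn) loadMoreBtn.click();
+    await new Promise(r => setTimeout(r, 1000));
+})();
+"""
+
+async def main():
+    async with AsyncWebCrawler(verbose=True) as crawler:
+        result = await crawler.arun(
+            url="https://example.com/js-page",
+            js_code=[js_code]
+        )
+        print(result.markdown)
+
+if __name__ == "__main__":
+    asyncio.run(main())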
+```
+
+### Using Cache Modes
+`CacheMode` can speed up repeated crawls by reusing previously fetched data. For instance:
+
+```python
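+import asyncio
+from crawl4ai import AsyncWebCrawler, CacheMode
+
+async def main():
+    async with AsyncWebCrawler(verbose=True) as crawler:
+        result = await crawler.arun(
+            url="https://example.com/large-page",
+            cache_mode=CacheMode.ENABLED  # reuse a cached copy when one exists
+        )
+        print(result.markdown)
+
+if __name__ == "__main__":
+    asyncio.run(main())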
+```
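To make the idea of reusing previously fetched data concrete, here is a small conceptual sketch. It is not how Crawl4AI implements caching: `InMemoryPageCache`, its `bypass` flag, and `slow_fetch` are hypothetical names invented for this illustration.

```python
import asyncio
from typing import Awaitable, Callable

class InMemoryPageCache:
    """Hypothetical URL-keyed cache wrapped around any async fetch function."""

    def __init__(self, fetch: Callable[[str], Awaitable[str]]):
        self._fetch = fetch
        self._store: dict[str, str] = {}
        self.fetch_count = 0  # how many times the underlying fetch actually ran

    async def get(self, url: str, bypass: bool = False) -> str:
        # Serve from memory unless the caller asks to bypass the cache.
        if not bypass and url in self._store:
            return self._store[url]
        self.fetch_count += 1
        content = await self._fetch(url)
        self._store[url] = content  # remember the fresh copy for later calls
        return content

async def demo() -> int:
    async def slow_fetch(url: str) -> str:
        await asyncio.sleep(0.05)  # stands in for a real page load
        return f"# Markdown for {url}"

    cache = InMemoryPageCache(slow_fetch)
    await cache.get("https://example.com/large-page")               # miss: fetches
    await cache.get("https://example.com/large-page")               # hit: no fetch
    await cache.get("https://example.com/large-page", bypass=True)  # forced refetch
    return cache.fetch_count

if __name__ == "__main__":
    print("fetches performed:", asyncio.run(demo()))
```

Only the first and the bypassed request reach the fetch function; the second is answered from memory, which is where the speed-up on repeated crawls comes from.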
+
+---
+
+## Quick Start Guide
+
+### Minimal Working Example
+```python
+import asyncio
+from crawl4ai import AsyncWebCrawler
+
+async def main():
+    async with AsyncWebCrawler(verbose=True) as crawler:
+        result = await crawler.arun(url="https://crawl4ai.com")
+        print(result.markdown)
+
+if __name__ == "__main__":
+    asyncio.run(main())
+```
+
+### Multiple Concurrent Crawls
+Harness async concurrency to run multiple crawls in parallel:
+
+```python
+import asyncio
+from crawl4ai import AsyncWebCrawler
+
+async def crawl_url(crawler, url):
+    return await crawler.arun(url=url)
+
+async def main():
+    urls = ["https://example.com/page1", "https://example.com/page2", "https://example.com/page3"]
+    async with AsyncWebCrawler(verbose=True) as crawler:
+        results = await asyncio.gather(*[crawl_url(crawler, u) for u in urls])
+        for r in results:
+            print(r.markdown[:200])
+
+if __name__ == "__main__":
+    asyncio.run(main())
+```
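`asyncio.gather` starts every crawl at once, which can overwhelm the local machine or the target site when the URL list is long. A common way to cap the number of crawls in flight is an `asyncio.Semaphore`. The sketch below assumes nothing about Crawl4AI itself: `gather_limited` is a hypothetical helper, and `fake_crawl` stands in for `crawler.arun`.

```python
import asyncio

async def gather_limited(factories, limit: int):
    """Run zero-argument coroutine factories with at most `limit` running at once."""
    semaphore = asyncio.Semaphore(limit)

    async def run(factory):
        async with semaphore:  # waits here while `limit` tasks are already running
            return await factory()

    # Results come back in the same order as `factories`.
    return await asyncio.gather(*(run(f) for f in factories))

async def demo(limit: int = 2) -> int:
    active = 0
    peak = 0

    async def fake_crawl(url: str) -> str:
        # Hypothetical stand-in for crawler.arun(url=url); tracks peak concurrency.
        nonlocal active, peak
        active += 1
        peak = max(peak, active)
        await asyncio.sleep(0.05)
        active -= 1
        return url

    urls = [f"https://example.com/page{i}" for i in range(6)]
    results = await gather_limited([lambda u=u: fake_crawl(u) for u in urls], limit)
    assert results == urls
    return peak

if __name__ == "__main__":
    print("peak concurrent crawls:", asyncio.run(demo(limit=2)))
```

With a real crawler, the factories would be `lambda u=u: crawler.arun(url=u)`. The `u=u` default binds each URL at definition time, which avoids the late-binding pitfall of lambdas created in a loop.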
+
+### Dockerized Setup
+Run Crawl4AI in Docker for production environments:
+
+```bash
+docker pull unclecode/crawl4ai:basic-amd64
+docker run -p 11235:11235 unclecode/crawl4ai:basic-amd64
+curl http://localhost:11235/health
+```
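A container can take a few seconds to start accepting requests, so a single `curl` right after `docker run` may fail. The sketch below polls the same health URL from Python using only the standard library. `wait_for_health` is a hypothetical helper, and treating HTTP 200 as "healthy" is an assumption about the endpoint.

```python
import time
import urllib.error
import urllib.request

def wait_for_health(url: str, attempts: int = 10, delay: float = 1.0,
                    timeout: float = 2.0) -> bool:
    """Poll `url` until it answers with HTTP 200, or give up after `attempts` tries."""
    for attempt in range(attempts):
        try:
            with urllib.request.urlopen(url, timeout=timeout) as response:
                if response.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # not reachable yet, timed out, or answered with an error status
        if attempt < attempts - 1:
            time.sleep(delay)  # pause before the next try
    return False
```

Calling `wait_for_health("http://localhost:11235/health")` after `docker run` returns `True` once the endpoint responds, or `False` if it never does within the allotted attempts.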
+
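+### Proxy and Security Configurations
+The snippets in this and the following sections are fragments: place them inside an `async def main()` and run it with `asyncio.run(main())`, as in the examples above.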
+```python
+async with AsyncWebCrawler(
+    proxies={"http": "http://proxy.server:port", "https": "https://proxy.server:port"}
+) as crawler:
+    result = await crawler.arun(url="https://crawl4ai.com")
+    print(result.markdown)
+```
+
+You can also add basic auth:
+
+```python
+async with AsyncWebCrawler(
+    proxies={"http": "http://user:password@proxy.server:port"}
+) as crawler:
+    result = await crawler.arun(url="https://crawl4ai.com")
+    print(result.markdown)
+```
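Credentials embedded in a proxy URL must be percent-encoded when they contain characters such as `@`, `:` or `/`, otherwise the URL is parsed incorrectly. The helper below is a hypothetical sketch using only the standard library; the host, port, and credentials are placeholders.

```python
from urllib.parse import quote

def build_proxy_url(host: str, port: int, user: str, password: str,
                    scheme: str = "http") -> str:
    """Return scheme://user:password@host:port with the credentials percent-encoded."""
    # safe="" also encodes "/", ":" and "@", which would otherwise corrupt the URL.
    user_part = quote(user, safe="")
    password_part = quote(password, safe="")
    return f"{scheme}://{user_part}:{password_part}@{host}:{port}"

if __name__ == "__main__":
    # Placeholder values for illustration only.
    print(build_proxy_url("proxy.server", 8080, "user", "p@ss:w/rd"))
```

The resulting string can be used wherever the examples above place a proxy URL.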
+
+### Customizing Browser Settings
+Customize headers, user agents, and viewport:
+
+```python
+async with AsyncWebCrawler(
+    verbose=True,
+    headers={"User-Agent": "MyCustomBrowser/1.0"},
+    viewport={"width": 1280, "height": 800}
+) as crawler:
+    result = await crawler.arun("https://example.com")
+    print(result.markdown)
+```
+
+---
+
+## Troubleshooting Installation
+
+### Playwright Errors
+If `crawl4ai-setup` fails, install manually:
+```bash
+playwright install chromium
+pip install "crawl4ai[all]"
+```
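Before reinstalling, it can help to confirm which Python packages are actually visible to the interpreter in use. This is a minimal sketch with a hypothetical helper, `missing_packages`. It checks only that the packages are importable, not that the Chromium browser binary has been downloaded.

```python
import importlib.util

def missing_packages(names: list[str]) -> list[str]:
    """Return the names from `names` that cannot be imported in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

if __name__ == "__main__":
    missing = missing_packages(["crawl4ai", "playwright"])
    print("missing:", missing if missing else "none")
```

If a package is reported missing even though it was installed, the script is probably running outside the virtual environment where `pip install` was executed.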
+
+### SSL or Proxy Issues
+- Check certificates or disable SSL verification (for dev only).
+- Verify proxy credentials and server details.
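One way to check a site's certificate independently of the crawler is Python's standard `ssl` module. The sketch below is not Crawl4AI's certificate feature. `fetch_certificate` and `days_until_expiry` are hypothetical helpers, and the first one needs network access to the target host.

```python
import socket
import ssl
import time
from typing import Optional

def fetch_certificate(host: str, port: int = 443, timeout: float = 5.0) -> dict:
    """Connect with default verification and return the server's parsed certificate."""
    context = ssl.create_default_context()  # verifies the chain and the hostname
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()

def days_until_expiry(cert: dict, now: Optional[float] = None) -> float:
    """Days left before the certificate's notAfter date (negative if expired)."""
    expires = ssl.cert_time_to_seconds(cert["notAfter"])
    current = time.time() if now is None else now
    return (expires - current) / 86400
```

If verification fails, `fetch_certificate` raises `ssl.SSLCertVerificationError`, whose message usually names the cause (expired, self-signed, or hostname mismatch). That is more informative than disabling verification outright.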
+
+Use `verbose=True` for detailed logs:
+```python
+async with AsyncWebCrawler(verbose=True) as crawler:
+    result = await crawler.arun(url="https://crawl4ai.com")
+    print(result.markdown)
+```
+
+---
+
+## Common Pitfalls
+
+1. **Missing Playwright Installation**: Run `playwright install chromium`.
+2. **Timeouts on JavaScript-Heavy Pages**: Increase the wait time or use `js_code` for page interactions.
+3. **Empty Markdown**: Check if the page is JavaScript-rendered and adjust `js_code` or `wait_for` conditions.
+4. **Permission Errors**: Run commands with appropriate permissions or use a virtual environment.
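Timeouts and empty results on slow pages (pitfalls 2 and 3) are often transient, so a retry wrapper with a per-attempt timeout can help. The following is a generic sketch under stated assumptions: `crawl_with_retries` is a hypothetical helper, `make_attempt` is any zero-argument function returning a coroutine, and `is_ok` is whatever check the caller considers a usable result.

```python
import asyncio

async def crawl_with_retries(make_attempt, is_ok, retries: int = 3,
                             timeout: float = 30.0, backoff: float = 1.0):
    """Retry make_attempt() until is_ok(result) holds or the attempts run out.

    Returns the first acceptable result, otherwise the last result obtained
    (None if every attempt timed out).
    """
    if retries < 1:
        raise ValueError("retries must be at least 1")
    result = None
    for attempt in range(retries):
        try:
            result = await asyncio.wait_for(make_attempt(), timeout=timeout)
            if is_ok(result):
                return result
        except asyncio.TimeoutError:
            pass  # treat a timed-out attempt like a failed one
        if attempt < retries - 1:
            await asyncio.sleep(backoff * 2 ** attempt)  # exponential backoff
    return result
```

With a crawler, a call might look like `await crawl_with_retries(lambda: crawler.arun(url=url), lambda r: r.success and bool(r.markdown))`, which retries both on timeouts and on crawls that come back without Markdown.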
+
+---
+
+## Support and Community
+- **GitHub Issues**: Have questions or found a bug? Open an issue on the [GitHub Repo](https://github.com/unclecode/crawl4ai/issues).
+- **Contributions**: We welcome pull requests. Check out the [contribution guidelines](https://github.com/unclecode/crawl4ai/blob/main/CONTRIBUTING.md).
+- **Community Discussions**: Join discussions on GitHub to share tips, best practices, and feedback.
+
+---
+
+## Further Exploration
+- **Advanced Extraction Strategies**: Dive into specialized extraction strategies like `JsonCssExtractionStrategy` or `LLMExtractionStrategy` for structured data output.
+- **Content Filtering**: Explore BM25-based strategies to highlight the most relevant parts of a page.
+- **Production Deployment**: Refer to the Docker and environment variable configurations for large-scale, distributed crawling setups.
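To give a feel for what BM25-based filtering does, the sketch below scores text passages against a query with the Okapi BM25 formula, using the IDF variant that adds 1 inside the logarithm so scores stay non-negative. It is a standalone illustration, not Crawl4AI's filter: `tokenize` and `bm25_scores` are hypothetical helpers, and `k1=1.5`, `b=0.75` are commonly used defaults chosen here as assumptions.

```python
import math
import re
from collections import Counter

def tokenize(text: str) -> list[str]:
    # Lowercase and keep alphanumeric runs; a deliberately simple tokenizer.
    return re.findall(r"[a-z0-9]+", text.lower())

def bm25_scores(query: str, passages: list[str],
                k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each passage against the query with the Okapi BM25 formula."""
    docs = [tokenize(p) for p in passages]
    n = len(docs)
    avg_len = sum(len(d) for d in docs) / n if n else 0.0
    # Document frequency: number of passages that contain each term.
    df = Counter(term for d in docs for term in set(d))
    query_terms = set(tokenize(query))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log((n - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Longer-than-average passages are penalized through the b term.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

if __name__ == "__main__":
    passages = [
        "Install the crawler with pip",
        "Cookie banner accept all cookies",
        "The crawler outputs markdown",
    ]
    for passage, score in zip(passages, bm25_scores("crawler markdown", passages)):
        print(f"{score:.3f}  {passage}")
```

Passages that share no term with the query score zero, which is how a filter of this kind can drop navigation or cookie-banner text while keeping the sections relevant to a query.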
+
+For more detailed code examples and advanced topics, refer to the accompanying [README](https://github.com/unclecode/crawl4ai) and the `QUICKSTART` Python file included with this distribution.