Enhance crawler capabilities and documentation

- Add llm.txt generator
  - Added SSL certificate extraction in AsyncWebCrawler.
  - Introduced new content filters and chunking strategies for more robust data extraction.
  - Updated documentation.
UncleCode
2024-12-25 21:34:31 +08:00
parent 84b311760f
commit d5ed451299
59 changed files with 2208 additions and 1763 deletions

@@ -1,50 +1,9 @@
Below is a structured list of hypothetical questions derived from the file's content, followed by a bullet-point summary of key topics discussed.
### Hypothetical Questions
1. **Motivation and Use Cases**
- *"Why should I use the PDF-based screenshot approach for very long web pages?"*
- *"What are the benefits of generating a PDF before converting it to an image?"*
2. **Workflow and Technical Process**
- *"How does Crawl4AI generate a PDF and then convert it into a screenshot?"*
- *"Do I need to manually scroll or stitch images to capture large pages?"*
3. **Practical Steps**
- *"What code do I need to write to request both a PDF and a screenshot in one crawl?"*
- *"How do I save the resulting PDF and screenshot to disk?"*
4. **Performance and Reliability**
- *"Will this PDF-based method time out or fail for extremely long pages?"*
- *"Is this approach faster or more memory-efficient than traditional full-page screenshots?"*
5. **Additional Features and Customization**
- *"Can I save only the PDF without generating a screenshot?"*
- *"If I have a PDF, can I easily convert it to multiple images or just the first page?"*
6. **Integration with Other Crawl4AI Features**
- *"Can I combine PDF/screenshot generation with other Crawl4AI extraction strategies or hooks?"*
- *"Is caching or proxying affected by PDF or screenshot generation?"*
7. **Troubleshooting**
- *"What should I do if the screenshot or PDF does not appear in the result?"*
- *"How do I handle large PDF sizes or slow saves when dealing with massive pages?"*
### Topics Discussed in the File
- **New Approach to Large Page Screenshots**:
The document introduces a method to first export a page as a PDF using the browser's built-in PDF rendering capabilities and then convert that PDF to an image if a screenshot is requested.
- **Advantages Over Traditional Methods**:
This approach avoids timeouts, memory issues, and the complexity of stitching multiple images for extremely long pages. The PDF rendering is stable, reliable, and does not require the crawler to scroll through the entire page.
- **One-Stop Solution**:
By enabling `pdf=True` and `screenshot=True`, you receive both the full-page PDF and a screenshot (converted from the PDF) in a single crawl. This reduces repetitive processes and complexity.
- **How to Implement**:
Demonstrates code usage with `arun` to request both the PDF and screenshot, and how to save them to files. Explains that if a PDF is already generated, the screenshot is derived directly from it, simplifying the workflow.
- **Integration and Efficiency**:
Compatible with other Crawl4AI features like caching and extraction strategies. Simplifies large-scale crawling pipelines needing both a textual representation (HTML extraction) and visual confirmations (PDF/screenshot).
In summary, the file outlines a new feature for capturing full-page screenshots of massive web pages by first generating a stable, reliable PDF, then converting it into an image. This technique eliminates previous issues related to large content pages, ensuring smoother performance and simpler code maintenance.
page_capture: Full-page screenshots and PDFs can be generated for massive webpages using Crawl4AI | webpage capture, full page screenshot, pdf export | AsyncWebCrawler().arun(url=url, pdf=True, screenshot=True)
pdf_approach: Pages are first exported as PDF then converted to high-quality images for better handling of large content | pdf conversion, image export, page rendering | result.pdf, result.screenshot
export_benefits: PDF export avoids the timeouts and memory issues of scrolling screenshots and handles very long pages | timeout handling, page size limits, reliability | pdf=True
dual_output: Get both PDF and screenshot in single crawl without reloading | multiple formats, single pass, efficient capture | pdf=True, screenshot=True
result_handling: Screenshot and PDF data are returned as base64 encoded strings | base64 encoding, binary data, file saving | b64decode(result.screenshot), b64decode(result.pdf)
cache_control: Cache mode can be bypassed for fresh page captures | caching, fresh content, bypass cache | cache_mode=CacheMode.BYPASS
async_operation: Crawler operates asynchronously using Python's asyncio framework | async/await, concurrent execution | async with AsyncWebCrawler() as crawler
file_saving: Screenshots and PDFs can be saved directly to local files | file output, save results, local storage | open("screenshot.png", "wb"), open("page.pdf", "wb")
error_handling: Success status can be checked before processing results | error checking, result validation | if result.success:
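
The snippets listed above can be combined into one short script. What follows is a minimal sketch, not the project's reference implementation. It assumes that `AsyncWebCrawler` and `CacheMode` are importable from the top-level `crawl4ai` package, and that `result.screenshot` and `result.pdf` are base64-encoded strings, as the `result_handling` entry states. The names `save_capture` and `capture` are hypothetical helpers introduced only for illustration.

```python
import base64
from pathlib import Path


def save_capture(screenshot_b64, pdf_b64, out_dir="."):
    """Decode base64 capture data, write it under out_dir, return the paths written."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for name, payload in (("screenshot.png", screenshot_b64), ("page.pdf", pdf_b64)):
        if payload:  # skip outputs that were not requested or not produced
            path = out / name
            path.write_bytes(base64.b64decode(payload))
            written.append(path)
    return written


async def capture(url, out_dir="."):
    """Request a PDF and a screenshot in one crawl, then save both (hypothetical helper)."""
    # Assumption: these names are exported at package level. Imported lazily so
    # that save_capture stays usable without crawl4ai installed.
    from crawl4ai import AsyncWebCrawler, CacheMode

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url=url,
            cache_mode=CacheMode.BYPASS,  # force a fresh capture
            pdf=True,
            screenshot=True,
        )
    if not result.success:  # check status before touching the payloads
        return []
    return save_capture(result.screenshot, result.pdf, out_dir)
```

To try it, call `asyncio.run(capture("https://example.com", "captures"))` after `import asyncio`. Keeping the decoding step in its own function means the file-saving logic can be exercised without launching a browser.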