Enhance crawler capabilities and documentation

- Add llm.txt generator.
- Add SSL certificate extraction in AsyncWebCrawler.
- Introduce new content filters and chunking strategies for more robust data extraction.
- Update documentation.
UncleCode
2024-12-25 21:34:31 +08:00
parent 84b311760f
commit d5ed451299
59 changed files with 2208 additions and 1763 deletions

@@ -1,58 +1,10 @@
### Hypothetical Questions
1. **General Understanding of the New Caching System**
- *"Why did Crawl4AI move from boolean cache flags to a `CacheMode` enum?"*
- *"What are the benefits of using a single `CacheMode` enum over multiple booleans?"*
2. **CacheMode Usage**
- *"What `CacheMode` should I use if I want normal caching (both read and write)?"*
- *"How do I enable a mode that only reads from cache, or only writes to cache?"*
- *"What does `CacheMode.BYPASS` do, and how is it different from `CacheMode.DISABLED`?"*
3. **Migrating from Old to New System**
- *"How do I translate `bypass_cache=True` to the new `CacheMode` approach?"*
- *"I used to set `disable_cache=True`; what `CacheMode` should I use now?"*
- *"If I previously used `no_cache_read=True`, how do I achieve the same effect with `CacheMode`?"*
4. **Implementation Details**
- *"How do I specify the `CacheMode` in my crawler runs?"*
- *"Can I pass the `CacheMode` to `arun` directly, or do I need a `CrawlerRunConfig` object?"*
5. **Suppressing Deprecation Warnings**
- *"How can I temporarily disable deprecation warnings while I migrate my code?"*
6. **Edge Cases and Best Practices**
- *"What if I forget to update my code and still use the old flags?"*
- *"Is there a `CacheMode` for scenarios where I want to only write to cache and never read old data?"*
7. **Examples and Code Snippets**
- *"Can I see a side-by-side comparison of old and new caching code for a given URL?"*
- *"How can I confirm that using `CacheMode.BYPASS` skips both reading and writing cache?"*
8. **Performance and Reliability**
- *"Will switching to `CacheMode` improve my code's readability and reduce confusion?"*
- *"Can the new caching system still handle large-scale crawling scenarios efficiently?"*
### Topics Discussed in the File
- **Old vs. New Caching Approach**:
Previously, multiple boolean flags (`bypass_cache`, `disable_cache`, `no_cache_read`, `no_cache_write`) controlled caching. Now, a single `CacheMode` enum simplifies configuration.
- **CacheMode Enum**:
Provides clear modes:
- `ENABLED`: Normal caching (read and write)
- `DISABLED`: No caching at all
- `READ_ONLY`: Only read from cache, don't write new data
- `WRITE_ONLY`: Only write to cache, don't read old data
- `BYPASS`: Skip cache entirely for this operation
- **Migration Patterns**:
A simple mapping table helps developers switch old boolean flags to the corresponding `CacheMode` value.
- **Suppressing Deprecation Warnings**:
Temporarily disabling deprecation warnings provides a grace period to update old code.
- **Code Examples**:
Side-by-side comparisons show how to update code from old flags to the new `CacheMode` approach.
In summary, the file guides developers in transitioning from the old caching boolean flags to the new `CacheMode` enum, explaining the rationale, providing a mapping table, and offering code snippets to facilitate a smooth migration.
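
The flag-to-mode translation this file describes can be sketched as a small helper. This is a stand-in, not Crawl4AI's own code: the `CacheMode` enum below only mirrors the five mode names from the file, `legacy_flags_to_mode` is a hypothetical name, and the precedence applied when several legacy flags are set at once is an assumption.

```python
from enum import Enum


class CacheMode(Enum):
    """Stand-in mirroring the five documented mode names (not the library's class)."""
    ENABLED = "enabled"
    DISABLED = "disabled"
    READ_ONLY = "read_only"
    WRITE_ONLY = "write_only"
    BYPASS = "bypass"


def legacy_flags_to_mode(
    bypass_cache: bool = False,
    disable_cache: bool = False,
    no_cache_read: bool = False,
    no_cache_write: bool = False,
) -> CacheMode:
    """Hypothetical helper: translate the old boolean flags into a single mode."""
    # Assumed precedence: an explicit "disable" wins, then "bypass".
    if disable_cache:
        return CacheMode.DISABLED
    if bypass_cache:
        return CacheMode.BYPASS
    # Assumption: switching off both reads and writes amounts to no caching.
    if no_cache_read and no_cache_write:
        return CacheMode.DISABLED
    if no_cache_read:
        return CacheMode.WRITE_ONLY
    if no_cache_write:
        return CacheMode.READ_ONLY
    # No legacy flag set: normal caching (read and write).
    return CacheMode.ENABLED
```

With this sketch, `legacy_flags_to_mode(no_cache_read=True)` yields `CacheMode.WRITE_ONLY`, and calling it with no arguments yields `CacheMode.ENABLED`.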
cache_system: Crawl4AI v0.5.0 introduces CacheMode enum to replace boolean cache flags | caching system, cache control, cache configuration | CacheMode.ENABLED
cache_modes: CacheMode enum supports five states: ENABLED, DISABLED, READ_ONLY, WRITE_ONLY, and BYPASS | cache states, caching options, cache settings | CacheMode.ENABLED, CacheMode.DISABLED, CacheMode.READ_ONLY, CacheMode.WRITE_ONLY, CacheMode.BYPASS
cache_migration_bypass: Replace bypass_cache=True with cache_mode=CacheMode.BYPASS | skip cache, bypass caching | cache_mode=CacheMode.BYPASS
cache_migration_disable: Replace disable_cache=True with cache_mode=CacheMode.DISABLED | disable caching, turn off cache | cache_mode=CacheMode.DISABLED
cache_migration_read: Replace no_cache_read=True with cache_mode=CacheMode.WRITE_ONLY | write-only cache, disable read | cache_mode=CacheMode.WRITE_ONLY
cache_migration_write: Replace no_cache_write=True with cache_mode=CacheMode.READ_ONLY | read-only cache, disable write | cache_mode=CacheMode.READ_ONLY
crawler_config: Use CrawlerRunConfig to set cache mode in AsyncWebCrawler | crawler settings, configuration object | CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
deprecation_warnings: Suppress cache deprecation warnings by setting SHOW_DEPRECATION_WARNINGS to False | warning suppression, legacy support | SHOW_DEPRECATION_WARNINGS = False
async_crawler_usage: AsyncWebCrawler requires async/await syntax and supports configuration via CrawlerRunConfig | async crawler, web crawler setup | async with AsyncWebCrawler(verbose=True) as crawler
crawler_execution: Run AsyncWebCrawler using asyncio.run() in main script | crawler execution, async main | asyncio.run(main())
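
To make the read/write behaviour of each mode concrete, here is a minimal asynchronous sketch driven by `asyncio.run()`. It does not use Crawl4AI: the `CacheMode` stand-in, the in-memory `dict` cache, and the `fetch_page` helper are all hypothetical, and the "download" merely builds a string. It assumes, following the mode descriptions in this file, that `BYPASS` and `DISABLED` neither read from nor write to the cache.

```python
import asyncio
from enum import Enum


class CacheMode(Enum):
    """Stand-in mirroring the five documented mode names (not the library's class)."""
    ENABLED = "enabled"
    DISABLED = "disabled"
    READ_ONLY = "read_only"
    WRITE_ONLY = "write_only"
    BYPASS = "bypass"


def should_read(mode: CacheMode) -> bool:
    # Only ENABLED and READ_ONLY may serve a result from the cache.
    return mode in (CacheMode.ENABLED, CacheMode.READ_ONLY)


def should_write(mode: CacheMode) -> bool:
    # Only ENABLED and WRITE_ONLY may store a fresh result.
    return mode in (CacheMode.ENABLED, CacheMode.WRITE_ONLY)


async def fetch_page(url: str, mode: CacheMode, cache: dict, fetched: list) -> str:
    """Hypothetical fetch: consult the cache as the mode allows, else 'download'."""
    if should_read(mode) and url in cache:
        return cache[url]
    await asyncio.sleep(0)        # placeholder for the real network request
    fetched.append(url)           # record that a download happened
    html = f"<html>{url}</html>"  # fake page body
    if should_write(mode):
        cache[url] = html
    return html


async def main() -> None:
    cache: dict = {}
    fetched: list = []
    url = "https://example.com"
    await fetch_page(url, CacheMode.BYPASS, cache, fetched)     # downloads, stores nothing
    await fetch_page(url, CacheMode.ENABLED, cache, fetched)    # downloads, stores result
    await fetch_page(url, CacheMode.READ_ONLY, cache, fetched)  # served from cache
    print(f"downloads: {len(fetched)}, cached URLs: {list(cache)}")


if __name__ == "__main__":
    asyncio.run(main())
```

Keeping the read and write decisions in two small predicates means each mode's behaviour can be checked in isolation, without a crawler or a network connection.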