Enhance crawler capabilities and documentation
- Add llm.txt generator - Added SSL certificate extraction in AsyncWebCrawler. - Introduced new content filters and chunking strategies for more robust data extraction. - Updated documentation.
This commit is contained in:
@@ -1,58 +1,10 @@
|
||||
### Hypothetical Questions
|
||||
|
||||
1. **General Understanding of the New Caching System**
|
||||
- *"Why did Crawl4AI move from boolean cache flags to a `CacheMode` enum?"*
|
||||
- *"What are the benefits of using a single `CacheMode` enum over multiple booleans?"*
|
||||
|
||||
2. **CacheMode Usage**
|
||||
- *"What `CacheMode` should I use if I want normal caching (both read and write)?"*
|
||||
- *"How do I enable a mode that only reads from cache, or only writes to cache?"*
|
||||
- *"What does `CacheMode.BYPASS` do, and how is it different from `CacheMode.DISABLED`?"*
|
||||
|
||||
3. **Migrating from Old to New System**
|
||||
- *"How do I translate `bypass_cache=True` to the new `CacheMode` approach?"*
|
||||
- *"I used to set `disable_cache=True`; what `CacheMode` should I use now?"*
|
||||
- *"If I previously used `no_cache_read=True`, how do I achieve the same effect with `CacheMode`?"*
|
||||
|
||||
4. **Implementation Details**
|
||||
- *"How do I specify the `CacheMode` in my crawler runs?"*
|
||||
- *"Can I pass the `CacheMode` to `arun` directly, or do I need a `CrawlerRunConfig` object?"*
|
||||
|
||||
5. **Suppressing Deprecation Warnings**
|
||||
- *"How can I temporarily disable deprecation warnings while I migrate my code?"*
|
||||
|
||||
6. **Edge Cases and Best Practices**
|
||||
- *"What if I forget to update my code and still use the old flags?"*
|
||||
- *"Is there a `CacheMode` for scenarios where I want to only write to cache and never read old data?"*
|
||||
|
||||
7. **Examples and Code Snippets**
|
||||
- *"Can I see a side-by-side comparison of old and new caching code for a given URL?"*
|
||||
- *"How can I confirm that using `CacheMode.BYPASS` skips both reading and writing cache?"*
|
||||
|
||||
8. **Performance and Reliability**
|
||||
- *"Will switching to `CacheMode` improve my code’s readability and reduce confusion?"*
|
||||
- *"Can the new caching system still handle large-scale crawling scenarios efficiently?"*
|
||||
|
||||
### Topics Discussed in the File
|
||||
|
||||
- **Old vs. New Caching Approach**:
|
||||
Previously, multiple boolean flags (`bypass_cache`, `disable_cache`, `no_cache_read`, `no_cache_write`) controlled caching. Now, a single `CacheMode` enum simplifies configuration.
|
||||
|
||||
- **CacheMode Enum**:
|
||||
Provides clear modes:
|
||||
- `ENABLED`: Normal caching (read and write)
|
||||
- `DISABLED`: No caching at all
|
||||
- `READ_ONLY`: Only read from cache, don’t write new data
|
||||
- `WRITE_ONLY`: Only write to cache, don’t read old data
|
||||
- `BYPASS`: Skip cache entirely for this operation
|
||||
|
||||
- **Migration Patterns**:
|
||||
A simple mapping table helps developers switch old boolean flags to the corresponding `CacheMode` value.
|
||||
|
||||
- **Suppressing Deprecation Warnings**:
|
||||
Temporarily disabling deprecation warnings provides a grace period to update old code.
|
||||
|
||||
- **Code Examples**:
|
||||
Side-by-side comparisons show how to update code from old flags to the new `CacheMode` approach.
|
||||
|
||||
In summary, the file guides developers in transitioning from the old caching boolean flags to the new `CacheMode` enum, explaining the rationale, providing a mapping table, and offering code snippets to facilitate a smooth migration.
|
||||
cache_system: Crawl4AI v0.5.0 introduces CacheMode enum to replace boolean cache flags | caching system, cache control, cache configuration | CacheMode.ENABLED
|
||||
cache_modes: CacheMode enum supports five states: ENABLED, DISABLED, READ_ONLY, WRITE_ONLY, and BYPASS | cache states, caching options, cache settings | CacheMode.ENABLED, CacheMode.DISABLED, CacheMode.READ_ONLY, CacheMode.WRITE_ONLY, CacheMode.BYPASS
|
||||
cache_migration_bypass: Replace bypass_cache=True with cache_mode=CacheMode.BYPASS | skip cache, bypass caching | cache_mode=CacheMode.BYPASS
|
||||
cache_migration_disable: Replace disable_cache=True with cache_mode=CacheMode.DISABLED | disable caching, turn off cache | cache_mode=CacheMode.DISABLED
|
||||
cache_migration_read: Replace no_cache_read=True with cache_mode=CacheMode.WRITE_ONLY | write-only cache, disable read | cache_mode=CacheMode.WRITE_ONLY
|
||||
cache_migration_write: Replace no_cache_write=True with cache_mode=CacheMode.READ_ONLY | read-only cache, disable write | cache_mode=CacheMode.READ_ONLY
|
||||
crawler_config: Use CrawlerRunConfig to set cache mode in AsyncWebCrawler | crawler settings, configuration object | CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
|
||||
deprecation_warnings: Suppress cache deprecation warnings by setting SHOW_DEPRECATION_WARNINGS to False | warning suppression, legacy support | SHOW_DEPRECATION_WARNINGS = False
|
||||
async_crawler_usage: AsyncWebCrawler requires async/await syntax and supports configuration via CrawlerRunConfig | async crawler, web crawler setup | async with AsyncWebCrawler(verbose=True) as crawler
|
||||
crawler_execution: Run AsyncWebCrawler using asyncio.run() in main script | crawler execution, async main | asyncio.run(main())
|
||||
Reference in New Issue
Block a user