Enhance crawler strategies with new features

- ReImplemented JsonXPathExtractionStrategy for enhanced JSON data extraction.
  - Updated existing extraction strategies for better performance.
  - Improved handling of response status codes during crawls.
This commit is contained in:
UncleCode
2024-12-17 22:40:10 +08:00
parent 4a5f1aebee
commit 393bb911c0
4 changed files with 48 additions and 41 deletions

View File

@@ -709,7 +709,7 @@ This commit introduces several key enhancements, including improved error handli
- Improved `AsyncPlaywrightCrawlerStrategy.close()` method to use a shorter sleep time (0.5 seconds instead of 500), significantly reducing wait time when closing the crawler.
- Enhanced flexibility in `CosineStrategy`:
- Now uses a more generic `load_HF_embedding_model` function, allowing for easier swapping of embedding models.
- Updated `JsonCssExtractionStrategy` and `JsonXPATHExtractionStrategy` for better JSON-based extraction.
- Updated `JsonCssExtractionStrategy` and `JsonXPathExtractionStrategy` for better JSON-based extraction.
### Fixed
- Addressed potential issues with the sliding window chunking strategy to ensure all text is properly chunked.