crawl4ai

Files

unclecode fb6ed5f000 feat: Sanitize input and handle encoding issues in LLMExtractionStrategy

This commit modifies the LLMExtractionStrategy class in `extraction_strategy.py` to sanitize input and handle potential encoding issues. The `sanitize_input_encode` function is introduced in `utils.py` to encode and decode the input text as UTF-8 or ASCII, depending on the encoding issues encountered. If an encoding error occurs, the function falls back to ASCII encoding and logs a warning message. This change improves the robustness of the extraction process and ensures that characters are not lost due to encoding issues.

2024-07-05 17:30:58 +08:00
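The fallback described in the commit message above can be sketched roughly as follows. This is a minimal illustration based only on that description, not the code in `utils.py`: the function name comes from the commit message, while the strict UTF-8 round trip, the use of the `logging` module, and the choice to silently drop non-ASCII characters in the fallback are assumptions.

```python
import logging

logger = logging.getLogger(__name__)


def sanitize_input_encode(text: str) -> str:
    """Return text that is safe to encode, per the commit description.

    Assumption: a strict UTF-8 round trip is tried first; only if that
    fails does the function fall back to ASCII and drop what it cannot encode.
    """
    try:
        # Well-formed text survives a UTF-8 round trip unchanged.
        return text.encode("utf-8").decode("utf-8")
    except UnicodeEncodeError as exc:
        # Lone surrogates and similar malformed code points end up here.
        logger.warning("Encoding issue detected, falling back to ASCII: %s", exc)
        # Assumption: unencodable characters are dropped rather than replaced.
        return text.encode("ascii", errors="ignore").decode("ascii")
```

Note that in this sketch the ASCII fallback also discards any legitimate non-ASCII characters in the same string, so it only preserves everything when the input is already valid UTF-8.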

models/onnx

- Add ONNX embedding model for CPU devices, update the similarity threshold, improve the embedding speed.

2024-05-19 22:30:10 +08:00

__init__.py

Change the project folder name from crawler to crawl4ai

2024-05-09 22:16:28 +08:00

chunking_strategy.py

chore: Update extraction strategy to support GPU, MPS, and CPU, add batch processing for CPU devices

2024-05-19 16:18:58 +00:00

config.py

chore: Update configuration values for chunk token threshold, overlap rate, and minimum word threshold. Create a new example for LLMExtractionStrategy, and update the Dockerfile and README