refactor(deep-crawl): reorganize deep crawling functionality into dedicated module

Restructure deep crawling code into a dedicated module with improved organization:
- Move deep crawl logic from async_deep_crawl.py to deep_crawling/
- Create separate files for BFS strategy, filters, and scorers
- Improve code organization and maintainability
- Add optimized implementations for URL filtering and scoring
- Rename DeepCrawlHandler to DeepCrawlDecorator for clarity

BREAKING CHANGE: DeepCrawlStrategy and BreadthFirstSearchStrategy are no longer exported from the package root; update imports to the new deep_crawling package structure
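
Downstream code that has to run against both the old and the new layout could resolve the moved classes with a fallback lookup. The sketch below is an illustration only: the helper `resolve_symbol` is hypothetical, and the dotted paths `crawl4ai.deep_crawling` and `crawl4ai.async_deep_crawl` are assumptions inferred from the paths named in this commit, not a confirmed public API.

```python
import importlib


def resolve_symbol(name, module_paths):
    """Return attribute `name` from the first importable module that defines it.

    Hypothetical helper for illustration; not part of the library.
    """
    for path in module_paths:
        try:
            module = importlib.import_module(path)
        except ImportError:
            # Module is absent in this installed version; try the next path.
            continue
        if hasattr(module, name):
            return getattr(module, name)
    raise ImportError(f"{name!r} not found in any of: {', '.join(module_paths)}")


# Assumed module paths, newest layout first (not confirmed by this commit page).
CANDIDATE_MODULES = [
    "crawl4ai.deep_crawling",
    "crawl4ai.async_deep_crawl",
]

# Usage, assuming the package is installed:
# DeepCrawlStrategy = resolve_symbol("DeepCrawlStrategy", CANDIDATE_MODULES)
```

Listing the new package first means the old module is only consulted on installations that predate this refactor.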
UncleCode
2025-02-04 23:28:17 +08:00
parent bc7559586f
commit c308a794e8
11 changed files with 2921 additions and 201 deletions

@@ -15,8 +15,6 @@ from .extraction_strategy import (
JsonCssExtractionStrategy,
JsonXPathExtractionStrategy
)
-from .async_deep_crawl import DeepCrawlStrategy, BreadthFirstSearchStrategy
from .chunking_strategy import ChunkingStrategy, RegexChunking
from .markdown_generation_strategy import DefaultMarkdownGenerator
from .content_filter_strategy import PruningContentFilter, BM25ContentFilter, LLMContentFilter, RelevantContentFilter
@@ -33,8 +31,6 @@ from .docker_client import Crawl4aiDockerClient
from .hub import CrawlerHub
__all__ = [
-"DeepCrawlStrategy",
-"BreadthFirstSearchStrategy",
"AsyncWebCrawler",
"CrawlResult",
"CrawlerHub",