feat: Add advanced link head extraction with three-layer scoring system (#1)

Squashed commit from feature/link-extractor branch implementing comprehensive link analysis:

- Extract HTML head content from discovered links with parallel processing
- Three-layer scoring: Intrinsic (URL quality), Contextual (BM25), and Total scores
- New LinkExtractionConfig class for type-safe configuration
- Pattern-based filtering for internal/external links
- Comprehensive documentation and examples
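
As a rough illustration of the three-layer scoring listed above, here is a minimal, self-contained sketch. It is not the code from this commit: `intrinsic_score`, `bm25_scores`, `score_links`, the URL heuristics, the weights, and the `head_text` field are all hypothetical names and values chosen for the example. It assumes the intrinsic layer looks only at the URL, the contextual layer ranks extracted head text against a query with a standard BM25 formula, and the total is a weighted sum of the two.

```python
import math
import re
from collections import Counter
from urllib.parse import urlparse


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def intrinsic_score(url):
    """Hypothetical URL-quality heuristic in [0, 1] (not the project's formula)."""
    parsed = urlparse(url)
    score = 1.0
    if parsed.scheme != "https":
        score -= 0.2  # assumption: penalize non-HTTPS links
    if parsed.query:
        score -= 0.2  # assumption: penalize query strings
    depth = len([part for part in parsed.path.split("/") if part])
    score -= 0.1 * max(0, depth - 2)  # assumption: penalize deep paths
    return max(0.0, min(1.0, score))


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with a standard BM25 formula."""
    tokenized = [tokenize(doc) for doc in docs]
    n = len(tokenized)
    avgdl = sum(len(tokens) for tokens in tokenized) / n if n else 0.0
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    query_terms = set(tokenize(query))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def score_links(query, links, w_intrinsic=0.4, w_contextual=0.6):
    """Combine both layers into a total score; the weights are illustrative."""
    contextual = bm25_scores(query, [link["head_text"] for link in links])
    top = max(contextual, default=0.0)
    results = []
    for link, ctx in zip(links, contextual):
        ctx_norm = ctx / top if top > 0 else 0.0  # scale BM25 into [0, 1]
        intr = intrinsic_score(link["url"])
        results.append({
            "url": link["url"],
            "intrinsic_score": intr,
            "contextual_score": ctx_norm,
            "total_score": w_intrinsic * intr + w_contextual * ctx_norm,
        })
    return sorted(results, key=lambda r: r["total_score"], reverse=True)


# Made-up sample data standing in for head content extracted from links.
links = [
    {"url": "https://example.com/docs/crawling",
     "head_text": "Crawling guide: async web crawling basics"},
    {"url": "http://example.com/a/b/c/d?ref=1",
     "head_text": "Company news and press releases"},
]
ranked = score_links("async crawling", links)
```

Normalizing the BM25 values by the best match keeps both layers on the same 0 to 1 scale, so the weights in `score_links` express relative importance directly.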
UncleCode
2025-06-27 20:06:04 +08:00
parent e528086341
commit 5c9c305dbf
10 changed files with 2126 additions and 15 deletions

@@ -37,6 +37,7 @@ from .content_filter_strategy import (
 )
 from .models import CrawlResult, MarkdownGenerationResult, DisplayMode
 from .components.crawler_monitor import CrawlerMonitor
+from .link_extractor import LinkExtractor
 from .async_dispatcher import (
     MemoryAdaptiveDispatcher,
     SemaphoreDispatcher,
@@ -141,6 +142,7 @@ __all__ = [
     "SemaphoreDispatcher",
     "RateLimiter",
     "CrawlerMonitor",
+    "LinkExtractor",
     "DisplayMode",
     "MarkdownGenerationResult",
     "Crawl4aiDockerClient",