This commit introduces AsyncUrlSeeder, a high-performance URL discovery system that enables intelligent crawling at scale by pre-discovering and filtering URLs before crawling.

## Core Features

### AsyncUrlSeeder Component
- Discovers URLs from multiple sources:
  - Sitemaps (including nested and gzipped)
  - Common Crawl index
  - Combined sources for maximum coverage
- Extracts page metadata without full crawling:
  - Title, description, keywords
  - Open Graph and Twitter Card tags
  - JSON-LD structured data
  - Language and charset information
- BM25 relevance scoring for intelligent filtering:
  - Query-based URL discovery
  - Configurable score thresholds
  - Automatic ranking by relevance
- Performance optimizations:
  - Async/concurrent processing with configurable workers
  - Rate limiting (hits per second)
  - Automatic caching with TTL
  - Streaming results for large datasets

### SeedingConfig
- Comprehensive configuration for URL seeding:
  - Source selection (sitemap, cc, or both)
  - URL pattern filtering with wildcards
  - Live URL validation options
  - Metadata extraction controls
  - BM25 scoring parameters
  - Concurrency and rate limiting

### Integration with AsyncWebCrawler
- Seamless pipeline: discover → filter → crawl
- Direct compatibility with arun_many()
- Significant resource savings by pre-filtering URLs

## Documentation
- Comprehensive guide comparing URL seeding vs deep crawling
- Complete API reference with parameter tables
- Practical examples showing all features
- Performance benchmarks and best practices
- Integration patterns with AsyncWebCrawler

## Examples
- url_seeder_demo.py: Interactive Rich-based demo with:
  - Basic discovery
  - Cache management
  - Live validation
  - BM25 scoring
  - Multi-domain discovery
  - Complete pipeline integration
- url_seeder_quick_demo.py: Screenshot-friendly examples:
  - Pattern-based filtering
  - Metadata exploration
  - Smart search with BM25

## Testing
- Comprehensive test suite (test_async_url_seeder_bm25.py)
- Coverage of all major features
- Edge cases and error handling
- Performance and consistency tests

## Implementation Details
- Built on httpx with HTTP/2 support
- Optional dependencies: lxml, brotli, rank_bm25
- Cache management in ~/.crawl4ai/seeder_cache/
- Logger integration with AsyncLoggerBase
- Proper error handling and retry logic

## Bug Fixes
- Fixed logger color compatibility (lightblack → bright_black)
- Corrected URL extraction from seeder results for arun_many()
- Updated all examples and documentation with proper usage

This feature enables users to crawl smarter, not harder, by discovering and analyzing URLs before committing resources to crawling them.
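To make the "filter, score, rank" idea from the feature list concrete, here is a minimal standard-library sketch of wildcard URL filtering followed by Okapi BM25 scoring with a threshold. It is an illustration only and not the AsyncUrlSeeder implementation: the function names (`tokenize`, `bm25_scores`, `seed`), the candidate dictionaries, and the `example.com` URLs are all hypothetical, and the commit itself lists `rank_bm25` as the optional dependency for scoring.

```python
import math
import re
from collections import Counter
from fnmatch import fnmatch


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Return one Okapi BM25 score per document for the given query."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    if n_docs == 0:
        return []
    avg_len = sum(len(t) for t in tokenized) / n_docs or 1.0
    # Document frequency: how many documents contain each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in set(tokenize(query)):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def seed(candidates, pattern, query, threshold):
    """Filter by wildcard pattern, score by BM25, drop low scores, rank."""
    kept = [c for c in candidates if fnmatch(c["url"], pattern)]
    texts = [f"{c['title']} {c['description']}" for c in kept]
    scored = zip(kept, bm25_scores(query, texts))
    relevant = [(c["url"], s) for c, s in scored if s >= threshold]
    return sorted(relevant, key=lambda pair: pair[1], reverse=True)


# Hypothetical pre-extracted page metadata (no network access involved).
candidates = [
    {"url": "https://example.com/blog/async-python",
     "title": "Async Python crawling",
     "description": "Crawling with asyncio"},
    {"url": "https://example.com/blog/css-tips",
     "title": "CSS tips",
     "description": "Styling terminal themes"},
    {"url": "https://example.com/about",
     "title": "About async crawling",
     "description": "Company page"},
]

ranked = seed(candidates, "*/blog/*", "async crawling", threshold=0.1)
for url, score in ranked:
    print(f"{score:.3f}  {url}")
```

In this sketch the `/about` page is dropped by the wildcard pattern before any scoring happens, and the CSS post is dropped by the score threshold, so only URLs that pass both checks would be handed on to a crawler.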
261 lines
5.9 KiB
CSS
@font-face {
    font-family: "Monaco";
    font-style: normal;
    font-weight: normal;
    src: local("Monaco"), url("Monaco.woff") format("woff");
}

:root {
    --global-font-size: 14px;
    --global-code-font-size: 13px;
    --global-line-height: 1.5em;
    --global-space: 10px;
    --font-stack: Menlo, Monaco, Lucida Console, Liberation Mono, DejaVu Sans Mono, Bitstream Vera Sans Mono,
        Courier New, monospace, serif;
    --font-stack: dm, Monaco, Courier New, monospace, serif;
    --mono-font-stack: Menlo, Monaco, Lucida Console, Liberation Mono, DejaVu Sans Mono, Bitstream Vera Sans Mono,
        Courier New, monospace, serif;

    --secondary-dimmed-color: #8b857a; /* Dimmed secondary color */
    --block-background-color: #202020; /* Darker background for block elements */
    --global-font-color: #eaeaea; /* Light font color for global elements */

    --background-color: #070708;
    --page-width: 70em;
    --font-color: #e8e9ed;
    --invert-font-color: #222225;
    --secondary-color: #a3abba;
    --secondary-color: #d5cec0;
    --tertiary-color: #a3abba;
    --primary-dimmed-color: #09b5a5; /* Updated to the brand color */
    --primary-color: #0fbbaa; /* Updated to the brand color */
    --accent-color: rgb(243, 128, 245);
    --error-color: #ff3c74;
    --progress-bar-background: #3f3f44;
    --progress-bar-fill: #09b5a5; /* Updated to the brand color */
    --code-bg-color: #3f3f44;
    --input-style: solid;
    --display-h1-decoration: none;

    --display-h1-decoration: none;

    --header-height: 65px; /* Adjust based on your actual header height */
    --sidebar-width: 280px; /* Adjust based on your desired sidebar width */
    --toc-width: 240px; /* Adjust based on your desired ToC width */
    --layout-transition-speed: 0.2s; /* For potential future animations */

    --page-width: 100em; /* Adjust based on your design */
}

/* body {
    background-color: var(--background-color);
    color: var(--font-color);
}

a {
    color: var(--primary-color);
}

a:hover {
    background-color: var(--primary-color);
    color: var(--invert-font-color);
}

blockquote::after {
    color: #444;
}

pre, code {
    background-color: var(--code-bg-color);
    color: var(--font-color);
}

.terminal-nav:first-child {
    border-bottom: 1px dashed var(--secondary-color);
} */

.terminal-mkdocs-main-content {
    line-height: var(--global-line-height);
}

strong {
    /* color : var(--primary-dimmed-color); */
    /* background-color: #50ffff17; */
    text-shadow: 0 0 0px var(--font-color), 0 0 0px var(--font-color);
}

.highlight {
    /* background: url(//s2.svgbox.net/pen-brushes.svg?ic=brush-1&color=50ffff); */
    background-color: #50ffff17;
}

div.highlight {
    margin-bottom: 2em;
}

.terminal-card > header {
    color: var(--font-color);
    text-align: center;
    background-color: var(--progress-bar-background);
    padding: 0.3em 0.5em;
}
.btn.btn-sm {
    color: var(--font-color);
    padding: 0.2em 0.5em;
    font-size: 0.8em;
}

.loading-message {
    display: none;
    margin-top: 20px;
}

.response-section {
    display: none;
    padding-top: 20px;
}

.tabs {
    display: flex;
    flex-direction: column;
}
.tab-list {
    display: flex;
    padding: 0;
    margin: 0;
    list-style-type: none;
    border-bottom: 1px solid var(--font-color);
}
.tab-item {
    cursor: pointer;
    padding: 10px;
    border: 1px solid var(--font-color);
    margin-right: -1px;
    border-bottom: none;
}
.tab-item:hover,
.tab-item:focus,
.tab-item:active {
    background-color: var(--progress-bar-background);
}
.tab-content {
    display: none;
    border: 1px solid var(--font-color);
    border-top: none;
}
.tab-content:first-of-type {
    display: block;
}

.tab-content header {
    padding: 0.5em;
    display: flex;
    justify-content: end;
    align-items: center;
    background-color: var(--progress-bar-background);
}
.tab-content pre {
    margin: 0;
    max-height: 300px;
    overflow: auto;
    border: none;
}

ol li::before {
    content: counters(item, ".") ". ";
    counter-increment: item;
    /* float: left; */
    /* padding-right: 5px; */
}

/* 8 TERMINAL CSS */

.terminal code {
    font-size: var(--global-code-font-size);
    background: var(--block-background-color);
    /* color: var(--secondary-color); */
    color: var(--primary-dimmed-color);
}

.terminal pre code {
    background: var(--block-background-color);
    color: var(--secondary-color);
}

.hljs-keyword, .hljs-selector-tag, .hljs-built_in, .hljs-name, .hljs-tag {
    color: var(--accent-color);
}
.hljs-string {
    color: var(--primary-dimmed-color);
}
.hljs-comment {
    color: var(--secondary-dimmed-color);
    font-style: italic;
    font-size: 0.9em;
}
.hljs-number {
    color: var(--primary-dimmed-color);
}

.terminal strong > code, .terminal h2 > code, .terminal h3 > code {
    background-color: transparent;
    /* color: var(--font-color); */
    color: var(--primary-dimmed-color);
    text-shadow: none;
}

blockquote {
    background-color: var(--invert-font-color);
    padding: 1em 2em;
    border-left: 2px solid var(--primary-dimmed-color);
}

blockquote::after {
    content: "💡";
    white-space: pre;
    position: absolute;
    top: 1em;
    left: 5px;
    line-height: var(--global-line-height);
    color: #9ca2ab;
}

pre {
    display: block;
    word-break: break-word;
    word-wrap: break-word;
}

.terminal h1 {
    font-size: 2em;
}

.terminal h2 {
    font-size: 1.5em;
    margin-bottom: 0.8em;
}

.terminal h3 {
    font-size: 1.3em;
    margin-bottom: 0.8em;
}

.terminal h1, .terminal h2, .terminal h3, .terminal h4, .terminal h5, .terminal h6 {
    text-shadow: 0 0 0px var(--font-color), 0 0 0px var(--font-color), 0 0 0px var(--font-color);
}

/* Lower max height or width for these images */
div.badges a {
    /* no underline */
    text-decoration: none !important;
}
div.badges a > img {
    width: auto;
}

table td, table th {
    border: 1px solid var(--code-bg-color) !important;
}