docs(linkedin, url_seeder): update and reorganize LinkedIn data discovery and URL seeder documentation
This commit introduces significant updates to the LinkedIn data discovery documentation by adding two new Jupyter notebooks that provide detailed insights into data discovery processes. The previous workshop notebook has been removed to streamline the content and avoid redundancy. Additionally, the URL seeder documentation has been expanded with a new tutorial and several enhancements to existing scripts, improving usability and clarity. The changes include:

- Added two new Jupyter notebooks for comprehensive LinkedIn data discovery.
- Removed the previous workshop notebook to eliminate outdated content.
- Updated existing files to reflect new data visualization requirements.
- Introduced a new tutorial and supporting materials to facilitate easier access to URL seeding techniques.
- Enhanced existing Python scripts and markdown files in the URL seeder section for better documentation and examples.

These changes aim to improve the overall documentation quality and user experience for developers working with LinkedIn data and URL seeding techniques.
@@ -173,12 +173,19 @@ Creating a URL seeder is simple:

```python
from crawl4ai import AsyncUrlSeeder, SeedingConfig

# Method 1: Manual cleanup
seeder = AsyncUrlSeeder()
try:
    config = SeedingConfig(source="sitemap")
    urls = await seeder.urls("example.com", config)
finally:
    await seeder.close()

# Method 2: Context manager (recommended)
async with AsyncUrlSeeder() as seeder:
    config = SeedingConfig(source="sitemap")
    urls = await seeder.urls("example.com", config)
    # Automatically cleaned up on exit
```

The seeder can discover URLs from two powerful sources:

@@ -193,6 +200,23 @@ urls = await seeder.urls("example.com", config)

Sitemaps are XML files that websites create specifically to list all their URLs. It's like getting a menu at a restaurant - everything is listed upfront.

**Sitemap Index Support**: For large websites like TechCrunch that use sitemap indexes (a sitemap of sitemaps), the seeder automatically detects and processes all sub-sitemaps in parallel:

```xml
<!-- Example sitemap index -->
<sitemapindex>
  <sitemap>
    <loc>https://techcrunch.com/sitemap-1.xml</loc>
  </sitemap>
  <sitemap>
    <loc>https://techcrunch.com/sitemap-2.xml</loc>
  </sitemap>
  <!-- ... more sitemaps ... -->
</sitemapindex>
```

The seeder handles this transparently - you'll get all URLs from all sub-sitemaps automatically!
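To make the parallel fan-out concrete, here is a hypothetical sketch of expanding a sitemap index with `asyncio.gather`. This is not crawl4ai's internal code: `FAKE_SITEMAPS` and `fetch_xml` are stand-ins for real HTTP fetches, and real sitemaps also carry the sitemaps.org XML namespace, omitted here for brevity.

```python
import asyncio
import xml.etree.ElementTree as ET

# Canned responses standing in for real HTTP GETs (illustration only)
FAKE_SITEMAPS = {
    "https://techcrunch.com/sitemap.xml": (
        "<sitemapindex>"
        "<sitemap><loc>https://techcrunch.com/sitemap-1.xml</loc></sitemap>"
        "<sitemap><loc>https://techcrunch.com/sitemap-2.xml</loc></sitemap>"
        "</sitemapindex>"
    ),
    "https://techcrunch.com/sitemap-1.xml": (
        "<urlset><url><loc>https://techcrunch.com/a</loc></url></urlset>"
    ),
    "https://techcrunch.com/sitemap-2.xml": (
        "<urlset><url><loc>https://techcrunch.com/b</loc></url></urlset>"
    ),
}

async def fetch_xml(url):
    # Stand-in for an async HTTP client fetching and parsing the sitemap
    return ET.fromstring(FAKE_SITEMAPS[url])

async def discover(sitemap_url):
    root = await fetch_xml(sitemap_url)
    if root.tag == "sitemapindex":
        # A sitemap of sitemaps: recurse into every sub-sitemap concurrently
        subs = [loc.text for loc in root.iter("loc")]
        batches = await asyncio.gather(*(discover(u) for u in subs))
        return [url for batch in batches for url in batch]
    # A plain <urlset>: return its page URLs directly
    return [loc.text for loc in root.iter("loc")]

urls = asyncio.run(discover("https://techcrunch.com/sitemap.xml"))
```

Because `asyncio.gather` launches all sub-sitemap fetches at once, the total time is bounded by the slowest sub-sitemap rather than the sum of all of them.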
#### 2. Common Crawl (Most Comprehensive)

@@ -349,6 +373,35 @@ The head extraction gives you a treasure trove of information:
This metadata is gold for filtering! You can find exactly what you need without crawling a single page.

### Smart URL-Based Filtering (No Head Extraction)

When `extract_head=False` but you still provide a query, the seeder uses intelligent URL-based scoring:

```python
# Fast filtering based on URL structure alone
config = SeedingConfig(
    source="sitemap",
    extract_head=False,   # Don't fetch page metadata
    query="python tutorial async",
    scoring_method="bm25",
    score_threshold=0.3
)

urls = await seeder.urls("example.com", config)

# URLs are scored based on:
# 1. Domain parts matching (e.g., 'python' in python.example.com)
# 2. Path segments (e.g., '/tutorials/python-async/')
# 3. Query parameters (e.g., '?topic=python')
# 4. Fuzzy matching using character n-grams

# Example URL scoring:
# https://example.com/tutorials/python/async-guide.html - High score
# https://example.com/blog/javascript-tips.html - Low score
```

This approach is much faster than head extraction while still providing intelligent filtering!
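The scoring signals above can be illustrated with a small self-contained sketch (an assumption for illustration, not crawl4ai's actual BM25/URL scorer): exact hits on domain labels, path segments, and query parameters score highest, with a character-trigram overlap as the fuzzy fallback.

```python
from urllib.parse import urlparse

def trigrams(s):
    # Character 3-grams of a lowercased string
    s = s.lower()
    return {s[i:i + 3] for i in range(len(s) - 2)}

def url_score(url, query):
    parsed = urlparse(url)
    # Searchable parts: domain labels, path segments, and the query string
    parts = parsed.netloc.lower().split(".") + [
        p.lower() for p in parsed.path.split("/") if p
    ] + [parsed.query.lower()]
    terms = query.lower().split()
    score = 0.0
    for term in terms:
        if any(term in part for part in parts):
            score += 1.0  # exact hit in a domain/path/query part
        else:
            # Fuzzy fallback: trigram overlap between the term and the URL
            overlap = trigrams(term) & trigrams(url)
            score += 0.5 * len(overlap) / max(len(trigrams(term)), 1)
    return score / max(len(terms), 1)

high = url_score("https://example.com/tutorials/python/async-guide.html",
                 "python tutorial async")
low = url_score("https://example.com/blog/javascript-tips.html",
                "python tutorial async")
```

Here `high` reaches the maximum 1.0 (every term matches a path segment) while `low` scores 0.0, mirroring the high/low example in the snippet above.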
### Understanding Results

Each URL in the results has this structure:

@@ -710,7 +763,16 @@ from crawl4ai import AsyncUrlSeeder, AsyncWebCrawler, SeedingConfig, CrawlerRunC
class ResearchAssistant:
    def __init__(self):
        self.seeder = None

    async def __aenter__(self):
        self.seeder = AsyncUrlSeeder()
        await self.seeder.__aenter__()
        return self

    async def __aexit__(self, exc_type, exc_val, exc_tb):
        if self.seeder:
            await self.seeder.__aexit__(exc_type, exc_val, exc_tb)

    async def research_topic(self, topic, domains, max_articles=20):
        """Research a topic across multiple domains."""

@@ -812,18 +874,17 @@ class ResearchAssistant:
# Use the research assistant
async def main():
    async with ResearchAssistant() as assistant:
        # Research Python async programming across multiple sources
        topic = "python asyncio best practices performance optimization"
        domains = [
            "realpython.com",
            "python.org",
            "stackoverflow.com",
            "medium.com"
        ]

        summary = await assistant.research_topic(topic, domains, max_articles=15)

        # Display results
        print("\n" + "="*60)

@@ -878,6 +939,24 @@ async with AsyncWebCrawler() as crawler:
    process_immediately(result)  # Don't wait for all
```
|
||||
4. **Memory protection for large domains**
|
||||
|
||||
The seeder uses bounded queues to prevent memory issues when processing domains with millions of URLs:
|
||||
|
||||
```python
|
||||
# Safe for domains with 1M+ URLs
|
||||
config = SeedingConfig(
|
||||
source="cc+sitemap",
|
||||
concurrency=50, # Queue size adapts to concurrency
|
||||
max_urls=100000 # Process in batches if needed
|
||||
)
|
||||
|
||||
# The seeder automatically manages memory by:
|
||||
# - Using bounded queues (prevents RAM spikes)
|
||||
# - Applying backpressure when queue is full
|
||||
# - Processing URLs as they're discovered
|
||||
```
|
||||
|
||||
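The bounded-queue idea can be sketched in a few lines of plain asyncio (an illustration of the backpressure mechanism, not the seeder's actual internals): the producer blocks on `put()` whenever the queue is full, so in-flight URLs never exceed the queue size regardless of how many the domain yields.

```python
import asyncio

async def producer(queue, n):
    # Discovery side: emits URLs; put() blocks when the queue is full
    for i in range(n):
        await queue.put(f"https://example.com/page-{i}")
    await queue.put(None)  # sentinel: discovery finished

async def consumer(queue, out):
    # Processing side: drains URLs as they are discovered
    while True:
        url = await queue.get()
        if url is None:
            break
        out.append(url)

async def run(n):
    queue = asyncio.Queue(maxsize=100)  # bounded: caps memory use
    out = []
    await asyncio.gather(producer(queue, n), consumer(queue, out))
    return out

urls = asyncio.run(run(1000))
```

Even though 1000 URLs flow through, at most 100 ever sit in memory at once; the same pattern scales to millions of URLs with flat RAM usage.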
## Best Practices & Tips

### Cache Management

@@ -975,6 +1054,8 @@ config = SeedingConfig(
| Missing metadata | Ensure `extract_head=True` |
| Low relevance scores | Refine query, lower `score_threshold` |
| Rate limit errors | Reduce `hits_per_sec` and `concurrency` |
| Memory issues with large sites | Use `max_urls` to limit results, reduce `concurrency` |
| Connection not closed | Use context manager or call `await seeder.close()` |

### Performance Benchmarks

@@ -997,4 +1078,12 @@ URL seeding transforms web crawling from a blind expedition into a surgical stri
Whether you're building a research tool, monitoring competitors, or creating a content aggregator, URL seeding gives you the intelligence to crawl smarter, not harder.

### Key Features Summary

1. **Parallel Sitemap Index Processing**: Automatically detects and processes sitemap indexes in parallel
2. **Memory Protection**: Bounded queues prevent RAM issues with large domains (1M+ URLs)
3. **Context Manager Support**: Automatic cleanup with `async with` statement
4. **URL-Based Scoring**: Smart filtering even without head extraction
5. **Dual Caching**: Separate caches for URL lists and metadata

Now go forth and seed intelligently! 🌱🚀