docs(linkdin, url_seeder): update and reorganize LinkedIn data discovery and URL seeder documentation

This commit introduces significant updates to the LinkedIn data discovery documentation by adding two new Jupyter notebooks that provide detailed insights into data discovery processes. The previous workshop notebook has been removed to streamline the content and avoid redundancy. Additionally, the URL seeder documentation has been expanded with a new tutorial and several enhancements to existing scripts, improving usability and clarity.

The changes include:
- Added  and  for comprehensive LinkedIn data discovery.
- Removed  to eliminate outdated content.
- Updated  to reflect new data visualization requirements.
- Introduced  and  to facilitate easier access to URL seeding techniques.
- Enhanced existing Python scripts and markdown files in the URL seeder section for better documentation and examples.

These changes aim to improve the overall documentation quality and user experience for developers working with LinkedIn data and URL seeding techniques.
UncleCode
2025-06-05 15:06:25 +08:00
parent b5c2732f88
commit c6fc5c0518
11 changed files with 9744 additions and 1464 deletions


@@ -173,12 +173,19 @@ Creating a URL seeder is simple:
 ```python
 from crawl4ai import AsyncUrlSeeder
-# Create a seeder instance
+# Method 1: Manual cleanup
 seeder = AsyncUrlSeeder()
+try:
+    config = SeedingConfig(source="sitemap")
+    urls = await seeder.urls("example.com", config)
+finally:
+    await seeder.close()
-# Discover URLs from a domain
-config = SeedingConfig(source="sitemap")
-urls = await seeder.urls("example.com", config)
+# Method 2: Context manager (recommended)
+async with AsyncUrlSeeder() as seeder:
+    config = SeedingConfig(source="sitemap")
+    urls = await seeder.urls("example.com", config)
+    # Automatically cleaned up on exit
 ```
The seeder can discover URLs from two powerful sources:
@@ -193,6 +200,23 @@ urls = await seeder.urls("example.com", config)
Sitemaps are XML files that websites create specifically to list all their URLs. It's like getting a menu at a restaurant - everything is listed upfront.
**Sitemap Index Support**: For large websites like TechCrunch that use sitemap indexes (a sitemap of sitemaps), the seeder automatically detects and processes all sub-sitemaps in parallel:
```xml
<!-- Example sitemap index -->
<sitemapindex>
  <sitemap>
    <loc>https://techcrunch.com/sitemap-1.xml</loc>
  </sitemap>
  <sitemap>
    <loc>https://techcrunch.com/sitemap-2.xml</loc>
  </sitemap>
  <!-- ... more sitemaps ... -->
</sitemapindex>
```
The seeder handles this transparently - you'll get all URLs from all sub-sitemaps automatically!
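To make the idea concrete, here is a minimal sketch of how a sitemap index can be told apart from a regular sitemap using only the standard library. This is not the seeder's actual implementation: `local_name` and `parse_sitemap` are hypothetical helpers written for illustration, and the sketch does no fetching or parallelism.

```python
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Strip an XML namespace prefix such as '{http://...}loc'."""
    return tag.rsplit("}", 1)[-1]


def parse_sitemap(xml_text: str) -> tuple[str, list[str]]:
    """Return ('index' or 'urlset', list of <loc> values).

    Hypothetical helper: an 'index' result means each <loc> points at
    another sitemap that still has to be downloaded and parsed.
    """
    root = ET.fromstring(xml_text.strip())
    kind = "index" if local_name(root.tag) == "sitemapindex" else "urlset"
    locs = [
        el.text.strip()
        for el in root.iter()
        if local_name(el.tag) == "loc" and el.text and el.text.strip()
    ]
    return kind, locs


index_xml = """
<sitemapindex>
  <sitemap><loc>https://techcrunch.com/sitemap-1.xml</loc></sitemap>
  <sitemap><loc>https://techcrunch.com/sitemap-2.xml</loc></sitemap>
</sitemapindex>
"""

kind, locs = parse_sitemap(index_xml)
print(kind, locs)
```

Real sitemaps usually declare the `http://www.sitemaps.org/schemas/sitemap/0.9` namespace, which is why the sketch compares local tag names instead of raw tags.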
#### 2. Common Crawl (Most Comprehensive)
```python
@@ -349,6 +373,35 @@ The head extraction gives you a treasure trove of information:
This metadata is gold for filtering! You can find exactly what you need without crawling a single page.
### Smart URL-Based Filtering (No Head Extraction)
When `extract_head=False` but you still provide a query, the seeder uses intelligent URL-based scoring:
```python
# Fast filtering based on URL structure alone
config = SeedingConfig(
    source="sitemap",
    extract_head=False, # Don't fetch page metadata
    query="python tutorial async",
    scoring_method="bm25",
    score_threshold=0.3
)
urls = await seeder.urls("example.com", config)
# URLs are scored based on:
# 1. Domain parts matching (e.g., 'python' in python.example.com)
# 2. Path segments (e.g., '/tutorials/python-async/')
# 3. Query parameters (e.g., '?topic=python')
# 4. Fuzzy matching using character n-grams
# Example URL scoring:
# https://example.com/tutorials/python/async-guide.html - High score
# https://example.com/blog/javascript-tips.html - Low score
```
This approach is much faster than head extraction while still providing intelligent filtering!
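For intuition, the sketch below scores a URL by plain token overlap with the query. It is a deliberately simplified stand-in, not BM25 and not the seeder's real scoring: it has no fuzzy n-gram matching, and `url_tokens` and `overlap_score` are hypothetical names introduced only for this example.

```python
import re
from urllib.parse import parse_qsl, urlsplit


def url_tokens(url: str) -> set[str]:
    """Split a URL's host, path, and query parameters into lowercase tokens."""
    parts = urlsplit(url)
    raw = [parts.hostname or "", parts.path]
    raw += [f"{key} {value}" for key, value in parse_qsl(parts.query)]
    return {t for t in re.split(r"[^a-z0-9]+", " ".join(raw).lower()) if t}


def overlap_score(query: str, url: str) -> float:
    """Fraction of query terms that appear verbatim among the URL tokens."""
    terms = set(query.lower().split())
    if not terms:
        return 0.0
    return len(terms & url_tokens(url)) / len(terms)


query = "python tutorial async"
high = overlap_score(query, "https://example.com/tutorials/python/async-guide.html")
low = overlap_score(query, "https://example.com/blog/javascript-tips.html")
print(high, low)
```

Exact matching misses near-hits such as `tutorial` versus `tutorials`, which is the kind of gap fuzzy matching is meant to close.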
### Understanding Results
Each URL in the results has this structure:
@@ -710,7 +763,16 @@ from crawl4ai import AsyncUrlSeeder, AsyncWebCrawler, SeedingConfig, CrawlerRunC
class ResearchAssistant:
    def __init__(self):
        self.seeder = None

    async def __aenter__(self):
        self.seeder = AsyncUrlSeeder()
        await self.seeder.__aenter__()
        return self

    async def __aexit__(self, exc_type, exc_val, exc_tb):
        if self.seeder:
            await self.seeder.__aexit__(exc_type, exc_val, exc_tb)

    async def research_topic(self, topic, domains, max_articles=20):
        """Research a topic across multiple domains."""
@@ -812,18 +874,17 @@ class ResearchAssistant:
 # Use the research assistant
 async def main():
-    assistant = ResearchAssistant()
-    # Research Python async programming across multiple sources
-    topic = "python asyncio best practices performance optimization"
-    domains = [
-        "realpython.com",
-        "python.org",
-        "stackoverflow.com",
-        "medium.com"
-    ]
-    summary = await assistant.research_topic(topic, domains, max_articles=15)
+    async with ResearchAssistant() as assistant:
+        # Research Python async programming across multiple sources
+        topic = "python asyncio best practices performance optimization"
+        domains = [
+            "realpython.com",
+            "python.org",
+            "stackoverflow.com",
+            "medium.com"
+        ]
+        summary = await assistant.research_topic(topic, domains, max_articles=15)
     # Display results
     print("\n" + "="*60)
@@ -878,6 +939,24 @@ async with AsyncWebCrawler() as crawler:
process_immediately(result) # Don't wait for all
```
4. **Memory protection for large domains**
The seeder uses bounded queues to prevent memory issues when processing domains with millions of URLs:
```python
# Safe for domains with 1M+ URLs
config = SeedingConfig(
    source="cc+sitemap",
    concurrency=50, # Queue size adapts to concurrency
    max_urls=100000 # Process in batches if needed
)
# The seeder automatically manages memory by:
# - Using bounded queues (prevents RAM spikes)
# - Applying backpressure when queue is full
# - Processing URLs as they're discovered
```
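The bounded-queue idea can be illustrated with `asyncio.Queue` on its own. This is a generic sketch of backpressure, not the seeder's internal code: `producer`, `consumer`, and `run` are hypothetical names, and the URLs are generated placeholders.

```python
import asyncio


async def producer(queue: asyncio.Queue, total: int) -> None:
    for i in range(total):
        # put() suspends while the queue is full: this is the backpressure
        await queue.put(f"https://example.com/page-{i}")
    await queue.put(None)  # sentinel: nothing more to process


async def consumer(queue: asyncio.Queue, seen: list, peak: list) -> None:
    while True:
        # Record the deepest the queue ever gets
        peak[0] = max(peak[0], queue.qsize())
        url = await queue.get()
        if url is None:
            break
        seen.append(url)
        await asyncio.sleep(0)  # stand-in for real per-URL work


async def run(total: int = 1000, maxsize: int = 10) -> tuple:
    queue = asyncio.Queue(maxsize=maxsize)
    seen = []
    peak = [0]
    await asyncio.gather(producer(queue, total), consumer(queue, seen, peak))
    return len(seen), peak[0]


processed, peak_depth = asyncio.run(run())
print(processed, peak_depth)
```

All 1000 URLs are processed, yet the queue never holds more than `maxsize` items, so memory use stays flat regardless of how many URLs the producer discovers.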
## Best Practices & Tips
### Cache Management
@@ -975,6 +1054,8 @@ config = SeedingConfig(
| Missing metadata | Ensure `extract_head=True` |
| Low relevance scores | Refine query, lower `score_threshold` |
| Rate limit errors | Reduce `hits_per_sec` and `concurrency` |
| Memory issues with large sites | Use `max_urls` to limit results, reduce `concurrency` |
| Connection not closed | Use context manager or call `await seeder.close()` |
### Performance Benchmarks
@@ -997,4 +1078,12 @@ URL seeding transforms web crawling from a blind expedition into a surgical stri
Whether you're building a research tool, monitoring competitors, or creating a content aggregator, URL seeding gives you the intelligence to crawl smarter, not harder.
### Key Features Summary
1. **Parallel Sitemap Index Processing**: Automatically detects and processes sitemap indexes in parallel
2. **Memory Protection**: Bounded queues prevent RAM issues with large domains (1M+ URLs)
3. **Context Manager Support**: Automatic cleanup with `async with` statement
4. **URL-Based Scoring**: Smart filtering even without head extraction
5. **Dual Caching**: Separate caches for URL lists and metadata
Now go forth and seed intelligently! 🌱🚀