# Link Analysis and Scoring

## Introduction

Link Analysis is a powerful feature that extracts, analyzes, and scores all links found on a webpage. This endpoint helps you understand the link structure, identify high-value links, and get insights into the connectivity patterns of any website.

Think of it as a smart link discovery tool that not only extracts links but also evaluates their importance, relevance, and quality through advanced scoring algorithms.
## Key Concepts

### What Link Analysis Does

When you analyze a webpage, the system:

- **Extracts All Links** - Finds every hyperlink on the page
- **Scores Links** - Assigns relevance scores based on multiple factors
- **Categorizes Links** - Groups links by type (internal, external, etc.)
- **Provides Metadata** - Captures each link's URL, anchor text, attributes, and context
- **Ranks by Importance** - Orders links from most to least valuable
### Scoring Factors

The link scoring algorithm considers several signals (a simplified, illustrative sketch follows this list):

- **Text Content**: Link anchor text relevance and descriptiveness
- **URL Structure**: Depth, parameters, and path patterns
- **Context**: Surrounding text and page position
- **Attributes**: Title, rel attributes, and other metadata
- **Link Type**: Internal vs. external classification
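The exact weights are internal to the service, but a toy sketch can make the idea concrete. Everything below (the weights, the generic-text list, the depth penalty) is an illustrative assumption, not the endpoint's actual algorithm:

```python
from urllib.parse import urlparse

def illustrative_link_score(url: str, text: str, is_internal: bool) -> float:
    """Toy scoring heuristic -- NOT the endpoint's real algorithm."""
    score = 0.0

    # Descriptive, non-generic anchor text scores higher
    generic = {"click here", "read more", "link", "here"}
    if text and text.lower().strip() not in generic:
        score += min(len(text.split()), 5) / 5 * 0.4

    # Shallow, parameter-free URL paths score higher than deep ones
    parsed = urlparse(url)
    depth = len([part for part in parsed.path.split("/") if part])
    score += max(0.0, (5 - depth) / 5) * 0.3
    if not parsed.query:
        score += 0.1

    # Give internal links a small boost in this toy model
    if is_internal:
        score += 0.2

    return round(min(score, 1.0), 2)

print(illustrative_link_score("https://example.com/about", "About Us", True))  # 0.7
```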
## Quick Start

### Basic Usage

```python
import requests

# Analyze links on a webpage
response = requests.post(
    "http://localhost:8000/links/analyze",
    headers={"Authorization": "Bearer YOUR_TOKEN"},
    json={"url": "https://example.com"}
)
result = response.json()

print(f"Found {len(result.get('internal', []))} internal links")
print(f"Found {len(result.get('external', []))} external links")

# Show the top 3 links in each category by score
for link_type in ['internal', 'external']:
    if link_type in result:
        top_links = sorted(result[link_type], key=lambda x: x.get('score', 0), reverse=True)[:3]
        print(f"\nTop {link_type} links:")
        for link in top_links:
            print(f"- {link.get('url', 'N/A')} (score: {link.get('score', 0):.2f})")
```
### With Custom Configuration

```python
response = requests.post(
    "http://localhost:8000/links/analyze",
    headers={"Authorization": "Bearer YOUR_TOKEN"},
    json={
        "url": "https://news.example.com",
        "config": {
            "force": False,              # Use cached results if available
            "wait_for": 2.0,             # Wait for dynamic content
            "simulate_user": True,       # User-like browsing
            "override_navigator": True   # Override browser navigator properties
        }
    }
)
```
## Configuration Options

The `config` parameter accepts a `LinkPreviewConfig` dictionary.

### Basic Options

```python
config = {
    "force": False,              # Force a fresh crawl (default: False)
    "wait_for": None,            # CSS selector or timeout in seconds
    "simulate_user": True,       # Simulate human behavior
    "override_navigator": True,  # Override browser navigator properties
    "headers": {                 # Custom HTTP headers
        "Accept-Language": "en-US,en;q=0.9"
    }
}
```
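As the comment on `wait_for` notes, the field can take either a numeric timeout or a CSS selector. A brief sketch of both forms; the `css:` prefix follows the convention used elsewhere in Crawl4AI, so verify it against your server version:

```python
# Wait a fixed number of seconds before extracting links
config_timeout = {"wait_for": 2.0}

# Wait until a specific element appears (assumed "css:" prefix syntax)
config_selector = {"wait_for": "css:.article-body"}
```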
### Advanced Options

```python
config = {
    # Timing and behavior
    "delay_before_return_html": 0.5,  # Delay (seconds) before HTML extraction
    "js_code": ["window.scrollTo(0, document.body.scrollHeight)"],  # JS to execute

    # Content processing
    "word_count_threshold": 1,  # Minimum word count
    "exclusion_patterns": [     # Regex patterns for links to exclude
        r".*/logout.*",
        r".*/admin.*"
    ],

    # Caching and session
    "session_id": "my-session-123",  # Session identifier
    "magic": False                   # Magic link processing
}
```
## Response Structure

The endpoint returns a JSON object with categorized links:

```json
{
  "internal": [
    {
      "url": "https://example.com/about",
      "text": "About Us",
      "title": "Learn about our company",
      "score": 0.85,
      "context": "footer navigation",
      "attributes": {
        "rel": ["nofollow"],
        "target": "_blank"
      }
    }
  ],
  "external": [
    {
      "url": "https://partner-site.com",
      "text": "Partner Site",
      "title": "Visit our partner",
      "score": 0.72,
      "context": "main content",
      "attributes": {}
    }
  ],
  "social": [...],
  "download": [...],
  "email": [...],
  "phone": [...]
}
```
## Link Categories

| Category | Description | Example |
|---|---|---|
| internal | Links within the same domain | `/about`, `https://example.com/contact` |
| external | Links to different domains | `https://google.com` |
| social | Social media platform links | `https://twitter.com/user` |
| download | File download links | `/files/document.pdf` |
| email | Email addresses | `mailto:contact@example.com` |
| phone | Phone numbers | `tel:+1234567890` |
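If you need to reproduce a similar bucketing client-side, for example to re-categorize links gathered from another source, a minimal sketch based on the table above could look like this (the social-domain list and download extensions are illustrative assumptions):

```python
from urllib.parse import urlparse

# Illustrative lists -- extend to match your needs
SOCIAL_DOMAINS = {"twitter.com", "x.com", "facebook.com", "linkedin.com", "instagram.com"}
DOWNLOAD_EXTENSIONS = (".pdf", ".zip", ".docx", ".xlsx", ".csv")

def categorize_link(href: str, page_domain: str) -> str:
    """Bucket a single href into the categories from the table above."""
    if href.startswith("mailto:"):
        return "email"
    if href.startswith("tel:"):
        return "phone"
    parsed = urlparse(href)
    if parsed.path.lower().endswith(DOWNLOAD_EXTENSIONS):
        return "download"
    domain = parsed.netloc.lower().removeprefix("www.")
    if domain in SOCIAL_DOMAINS:
        return "social"
    if not domain or domain == page_domain:
        return "internal"
    return "external"

print(categorize_link("/files/document.pdf", "example.com"))       # download
print(categorize_link("https://twitter.com/user", "example.com"))  # social
```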
## Link Metadata

Each link object contains:

```python
{
    "url": str,         # The actual href value
    "text": str,        # Anchor text content
    "title": str,       # Title attribute (if any)
    "score": float,     # Relevance score (0.0-1.0)
    "context": str,     # Where the link was found
    "attributes": dict, # All HTML attributes
    "hash": str,        # URL fragment (if any)
    "domain": str,      # Extracted domain name
    "scheme": str,      # URL scheme (http/https/etc.)
}
```
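If you want type hints for these objects in your own client code, a `TypedDict` mirroring the shape above is one option (this type lives only in your client; it is not exported by the API):

```python
from typing import TypedDict

class LinkObject(TypedDict, total=False):
    url: str          # the actual href value
    text: str         # anchor text content
    title: str        # title attribute (if any)
    score: float      # relevance score (0.0-1.0)
    context: str      # where the link was found
    attributes: dict  # all HTML attributes
    hash: str         # URL fragment (if any)
    domain: str       # extracted domain name
    scheme: str       # URL scheme (http/https/etc.)
```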
## Practical Examples

### SEO Audit Tool

```python
import requests

def seo_audit(url: str):
    """Perform SEO link analysis on a webpage."""
    response = requests.post(
        "http://localhost:8000/links/analyze",
        headers={"Authorization": "Bearer YOUR_TOKEN"},
        json={"url": url}
    )
    result = response.json()

    print(f"📊 SEO Audit for {url}")
    print(f"Internal links: {len(result.get('internal', []))}")
    print(f"External links: {len(result.get('external', []))}")

    internal_links = result.get('internal', [])
    external_links = result.get('external', [])

    # Flag internal links with low scores
    low_score_links = [link for link in internal_links if link.get('score', 0) < 0.3]
    if low_score_links:
        print(f"⚠️ Found {len(low_score_links)} low-quality internal links")

    # Highlight high-value external links
    high_value_external = [link for link in external_links if link.get('score', 0) > 0.7]
    if high_value_external:
        print(f"✅ Found {len(high_value_external)} high-value external links")

    return result

# Usage
audit_result = seo_audit("https://example.com")
```
### Competitor Analysis

```python
import requests

def competitor_analysis(urls: list):
    """Analyze link patterns across multiple competitor sites."""
    all_results = {}

    for url in urls:
        response = requests.post(
            "http://localhost:8000/links/analyze",
            headers={"Authorization": "Bearer YOUR_TOKEN"},
            json={"url": url}
        )
        all_results[url] = response.json()

    # Compare external link strategies
    print("🔍 Competitor Link Analysis")
    for url, result in all_results.items():
        external_links = result.get('external', [])
        avg_score = (
            sum(link.get('score', 0) for link in external_links) / len(external_links)
            if external_links else 0
        )
        print(f"{url}: {len(external_links)} external links (avg score: {avg_score:.2f})")

    return all_results

# Usage
competitors = [
    "https://competitor1.com",
    "https://competitor2.com",
    "https://competitor3.com"
]
analysis = competitor_analysis(competitors)
```
### Content Discovery

```python
import requests

def discover_related_content(start_url: str, max_depth: int = 2):
    """Discover related content through breadth-first link analysis."""
    visited = set()
    queue = [(start_url, 0)]

    while queue and len(visited) < 20:
        current_url, depth = queue.pop(0)
        if current_url in visited or depth > max_depth:
            continue
        visited.add(current_url)

        try:
            response = requests.post(
                "http://localhost:8000/links/analyze",
                headers={"Authorization": "Bearer YOUR_TOKEN"},
                json={"url": current_url}
            )
            result = response.json()
            internal_links = result.get('internal', [])

            # Sort by score and queue the top links for the next depth level
            top_links = sorted(internal_links, key=lambda x: x.get('score', 0), reverse=True)[:3]
            for link in top_links:
                if link.get('url') and link['url'] not in visited:
                    queue.append((link['url'], depth + 1))
                    print(f"🔗 Found: {link.get('text', '')} ({link.get('score', 0):.2f})")
        except Exception as e:
            print(f"❌ Error analyzing {current_url}: {e}")

    return visited

# Usage
related_pages = discover_related_content("https://blog.example.com")
print(f"Discovered {len(related_pages)} related pages")
```
## Best Practices

### 1. Request Optimization

```python
# ✅ Good: Use appropriate timeouts
response = requests.post(
    "http://localhost:8000/links/analyze",
    headers={"Authorization": "Bearer YOUR_TOKEN"},
    json={"url": url},
    timeout=30  # 30-second timeout
)

# ✅ Good: Configure wait times for dynamic sites
config = {
    "wait_for": 2.0,  # Wait for JavaScript to load
    "simulate_user": True
}
```
### 2. Error Handling

```python
def safe_link_analysis(url: str):
    try:
        response = requests.post(
            "http://localhost:8000/links/analyze",
            headers={"Authorization": "Bearer YOUR_TOKEN"},
            json={"url": url},
            timeout=30
        )
        if response.status_code == 200:
            return response.json()
        elif response.status_code == 400:
            print("❌ Invalid request format")
        elif response.status_code == 500:
            print("❌ Server error during analysis")
        else:
            print(f"❌ Unexpected status code: {response.status_code}")
    except requests.Timeout:
        print("⏰ Request timed out")
    except requests.ConnectionError:
        print("🔌 Connection error")
    except Exception as e:
        print(f"❌ Unexpected error: {e}")
    return None
```
### 3. Data Processing

```python
def process_links_data(result: dict):
    """Process and filter link analysis results."""
    # Keep only links at or above a minimum score
    min_score = 0.5
    high_quality_links = {}

    for category, links in result.items():
        filtered_links = [
            link for link in links
            if link.get('score', 0) >= min_score
        ]
        if filtered_links:
            high_quality_links[category] = filtered_links

    # Extract unique external domains
    domains = set()
    for link in result.get('external', []):
        domains.add(link.get('domain', ''))

    return {
        'filtered_links': high_quality_links,
        'unique_domains': list(domains),
        'total_links': sum(len(links) for links in result.values())
    }
```
## Performance Considerations

### Response Times

Typical response times vary with page complexity (a timeout-selection sketch follows this list):

- Simple pages: 2-5 seconds
- Complex pages: 5-15 seconds
- JavaScript-heavy pages: 10-30 seconds
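As flagged above, you can pick the client timeout to match the kind of page you expect. A small sketch (the buckets simply mirror the rough ranges above):

```python
import requests

def timeout_for(page_kind: str) -> int:
    """Map a rough page-complexity bucket to a client timeout in seconds."""
    return {"simple": 10, "complex": 20, "js_heavy": 40}.get(page_kind, 30)

response = requests.post(
    "http://localhost:8000/links/analyze",
    headers={"Authorization": "Bearer YOUR_TOKEN"},
    json={"url": "https://spa.example.com"},
    timeout=timeout_for("js_heavy"),
)
```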
### Rate Limiting

The endpoint includes built-in rate limiting. For bulk analysis, add a delay between requests:

```python
import time

def bulk_link_analysis(urls: list, delay: float = 1.0):
    """Analyze multiple URLs while respecting rate limits."""
    results = {}
    for url in urls:
        result = safe_link_analysis(url)  # defined in Best Practices above
        if result:
            results[url] = result
        time.sleep(delay)  # Respect rate limits
    return results
```
## Error Handling

### Common Errors and Solutions

| Error Code | Cause | Solution |
|---|---|---|
| 400 | Invalid URL or config | Check URL format and config structure |
| 401 | Invalid authentication | Verify your API token |
| 429 | Rate limit exceeded | Add delays between requests |
| 500 | Crawl failure | Check if the site is accessible |
| 503 | Service unavailable | Try again later |
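For the transient errors (429 and 503), retrying with exponential backoff is usually the right response. A minimal sketch; the retry count and delays are illustrative defaults:

```python
import time
import requests

def analyze_with_retry(url: str, token: str, max_retries: int = 3):
    """Retry transient failures (429/503) with exponential backoff."""
    delay = 1.0
    for attempt in range(max_retries + 1):
        response = requests.post(
            "http://localhost:8000/links/analyze",
            headers={"Authorization": f"Bearer {token}"},
            json={"url": url},
            timeout=30,
        )
        if response.status_code not in (429, 503):
            response.raise_for_status()  # raise on other 4xx/5xx
            return response.json()
        if attempt < max_retries:
            time.sleep(delay)
            delay *= 2  # back off: 1s, 2s, 4s, ...
    response.raise_for_status()  # out of retries: surface the last error
```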
### Debug Mode

```python
# Tag requests with a distinctive User-Agent so they stand out in server logs
config = {
    "headers": {
        "User-Agent": "Crawl4AI-Debug/1.0"
    }
}

# Surface error details from the response body
try:
    response = requests.post(
        "http://localhost:8000/links/analyze",
        headers={"Authorization": "Bearer YOUR_TOKEN"},
        json={"url": url, "config": config}
    )
    response.raise_for_status()
except requests.HTTPError as e:
    print(f"Error details: {e.response.text}")
```
## API Reference

### Endpoint Details

- **URL:** `/links/analyze`
- **Method:** `POST`
- **Content-Type:** `application/json`
- **Authentication:** Bearer token required
### Request Schema

```python
{
    "url": str,       # Required: URL to analyze
    "config": {       # Optional: LinkPreviewConfig
        "force": bool,
        "wait_for": float,
        "simulate_user": bool,
        "override_navigator": bool,
        "headers": dict,
        "js_code": list,
        "delay_before_return_html": float,
        "word_count_threshold": int,
        "exclusion_patterns": list,
        "session_id": str,
        "magic": bool
    }
}
```
### Response Schema

```python
{
    "internal": [LinkObject],
    "external": [LinkObject],
    "social": [LinkObject],
    "download": [LinkObject],
    "email": [LinkObject],
    "phone": [LinkObject]
}
```
### LinkObject Schema

```python
{
    "url": str,
    "text": str,
    "title": str,
    "score": float,
    "context": str,
    "attributes": dict,
    "hash": str,
    "domain": str,
    "scheme": str
}
```
## Next Steps

- Learn about Advanced Link Processing
- Explore the Link Preview Configuration
- See more Examples
## FAQ

**Q: How is the link score calculated?**
A: The score considers multiple factors, including anchor text relevance, URL structure, page context, and link attributes. Scores range from 0.0 (lowest quality) to 1.0 (highest quality).

**Q: Can I analyze password-protected pages?**
A: Yes! Use the `js_code` parameter to handle authentication, or include session cookies in the `headers` configuration, as sketched below.
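For example, forwarding a session cookie through the `headers` config might look like this (the cookie name, value, and member URL are placeholders for whatever your target site uses):

```python
import requests

config = {
    "headers": {
        # Placeholder: copy the real cookie from an authenticated browser session
        "Cookie": "sessionid=YOUR_SESSION_COOKIE"
    }
}

response = requests.post(
    "http://localhost:8000/links/analyze",
    headers={"Authorization": "Bearer YOUR_TOKEN"},
    json={"url": "https://example.com/members", "config": config},
)
```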
**Q: How many links can I analyze at once?**
A: There's no hard limit on the number of links per page, but very large pages (more than 10,000 links) may take longer to process.

**Q: Can I filter out certain types of links?**
A: Use the `exclusion_patterns` parameter in the config to filter out unwanted links with regex patterns.

**Q: Does this work with JavaScript-heavy sites?**
A: Absolutely! The crawler waits for JavaScript execution and can even run custom JavaScript via the `js_code` parameter.