Link Analysis and Scoring

Introduction

Link Analysis is a powerful feature that extracts, analyzes, and scores every link found on a webpage. The /links/analyze endpoint helps you understand a site's link structure, identify high-value links, and gain insight into its connectivity patterns.

Think of it as a smart link discovery tool that not only extracts links but also evaluates their importance, relevance, and quality through advanced scoring algorithms.

Key Concepts

When you analyze a webpage, the system:

  1. Extracts All Links - Finds every hyperlink on the page
  2. Scores Links - Assigns relevance scores based on multiple factors
  3. Categorizes Links - Groups links by type (internal, external, etc.)
  4. Provides Metadata - URL, anchor text, attributes, and context information
  5. Ranks by Importance - Orders links from most to least valuable

Scoring Factors

The link scoring algorithm considers:

  • Text Content: Link anchor text relevance and descriptiveness
  • URL Structure: Depth, parameters, and path patterns
  • Context: Surrounding text and page position
  • Attributes: Title, rel attributes, and other metadata
  • Link Type: Internal vs external classification
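
The exact weights are internal to the endpoint, but a toy heuristic combining these factors (purely illustrative, not the actual algorithm) conveys the intuition:

from urllib.parse import urlparse

GENERIC_ANCHORS = {"click here", "read more", "here", "link"}

def toy_link_score(url: str, text: str, rel=None) -> float:
    """Illustrative scorer mimicking the factors above (not the real algorithm)."""
    score = 0.5
    # Text content: descriptive anchor text beats generic phrases
    anchor = text.strip().lower()
    if not anchor or anchor in GENERIC_ANCHORS:
        score -= 0.2
    elif len(anchor.split()) >= 3:
        score += 0.2
    # URL structure: shallow, parameter-free paths score higher
    parsed = urlparse(url)
    depth = len([part for part in parsed.path.split("/") if part])
    score -= 0.05 * max(0, depth - 2)
    if parsed.query:
        score -= 0.1
    # Attributes: rel="nofollow" suggests a lower-trust link
    if rel and "nofollow" in rel:
        score -= 0.1
    return max(0.0, min(1.0, score))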

Quick Start

Basic Usage

import requests

# Analyze links on a webpage
response = requests.post(
    "http://localhost:8000/links/analyze",
    headers={"Authorization": "Bearer YOUR_TOKEN"},
    json={
        "url": "https://example.com"
    }
)

result = response.json()
print(f"Found {len(result.get('internal', []))} internal links")
print(f"Found {len(result.get('external', []))} external links")

# Show top 3 links by score
for link_type in ['internal', 'external']:
    if link_type in result:
        top_links = sorted(result[link_type], key=lambda x: x.get('score', 0), reverse=True)[:3]
        print(f"\nTop {link_type} links:")
        for link in top_links:
            print(f"- {link.get('url', 'N/A')} (score: {link.get('score', 0):.2f})")

With Custom Configuration

response = requests.post(
    "http://localhost:8000/links/analyze",
    headers={"Authorization": "Bearer YOUR_TOKEN"},
    json={
        "url": "https://news.example.com",
        "config": {
            "force": False,           # Skip cache
            "wait_for": 2.0,          # Wait for dynamic content
            "simulate_user": True,     # User-like browsing
            "override_navigator": True # Custom user agent
        }
    }
)

Configuration Options

The config parameter accepts a LinkPreviewConfig dictionary:

Basic Options

config = {
    "force": False,                    # Force fresh crawl (default: False)
    "wait_for": None,                  # CSS selector or timeout in seconds
    "simulate_user": True,             # Simulate human behavior
    "override_navigator": True,        # Override browser navigator
    "headers": {                       # Custom headers
        "Accept-Language": "en-US,en;q=0.9"
    }
}

Advanced Options

config = {
    # Timing and behavior
    "delay_before_return_html": 0.5,   # Delay before HTML extraction
    "js_code": ["window.scrollTo(0, document.body.scrollHeight)"],  # JS to execute

    # Content processing
    "word_count_threshold": 1,         # Minimum word count
    "exclusion_patterns": [            # Link patterns to exclude
        r".*/logout.*",
        r".*/admin.*"
    ],

    # Caching and session
    "session_id": "my-session-123",    # Session identifier
    "magic": False                     # Magic link processing
}

Response Structure

The endpoint returns a JSON object with categorized links:

{
  "internal": [
    {
      "url": "https://example.com/about",
      "text": "About Us",
      "title": "Learn about our company",
      "score": 0.85,
      "context": "footer navigation",
      "attributes": {
        "rel": ["nofollow"],
        "target": "_blank"
      }
    }
  ],
  "external": [
    {
      "url": "https://partner-site.com",
      "text": "Partner Site",
      "title": "Visit our partner",
      "score": 0.72,
      "context": "main content",
      "attributes": {}
    }
  ],
  "social": [...],
  "download": [...],
  "email": [...],
  "phone": [...]
}
| Category | Description | Example |
|----------|-------------|---------|
| internal | Links within the same domain | /about, https://example.com/contact |
| external | Links to different domains | https://google.com |
| social | Social media platform links | https://twitter.com/user |
| download | File download links | /files/document.pdf |
| email | Email addresses | mailto:contact@example.com |
| phone | Phone numbers | tel:+1234567890 |
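
A quick way to tally a response across all six categories (reusing the result dictionary from the Quick Start):

# Count how many links landed in each category
for category in ("internal", "external", "social", "download", "email", "phone"):
    print(f"{category}: {len(result.get(category, []))} links")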

Each link object contains:

{
    "url": str,           # The actual href value
    "text": str,          # Anchor text content
    "title": str,         # Title attribute (if any)
    "score": float,       # Relevance score (0.0-1.0)
    "context": str,       # Where the link was found
    "attributes": dict,   # All HTML attributes
    "hash": str,          # URL fragment (if any)
    "domain": str,        # Extracted domain name
    "scheme": str,        # URL scheme (http/https/etc)
}
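
Because every category shares this shape, a response flattens naturally into a single ranked list (a small sketch, again reusing result from the Quick Start):

# Merge all categories into one list, tagged with their category and sorted by score
all_links = [
    {**link, "category": category}
    for category, links in result.items()
    for link in links
]
all_links.sort(key=lambda link: link.get("score", 0), reverse=True)
if all_links:
    print(f"Top link overall: {all_links[0].get('url')} ({all_links[0]['category']})")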

Practical Examples

SEO Audit Tool

def seo_audit(url: str):
    """Perform SEO link analysis on a webpage"""
    response = requests.post(
        "http://localhost:8000/links/analyze",
        headers={"Authorization": "Bearer YOUR_TOKEN"},
        json={"url": url}
    )

    result = response.json()

    print(f"📊 SEO Audit for {url}")
    print(f"Internal links: {len(result.get('internal', []))}")
    print(f"External links: {len(result.get('external', []))}")

    # Check for SEO issues
    internal_links = result.get('internal', [])
    external_links = result.get('external', [])

    # Find links with low scores
    low_score_links = [link for link in internal_links if link.get('score', 0) < 0.3]
    if low_score_links:
        print(f"⚠️  Found {len(low_score_links)} low-quality internal links")

    # Find external opportunities
    high_value_external = [link for link in external_links if link.get('score', 0) > 0.7]
    if high_value_external:
        print(f"✅ Found {len(high_value_external)} high-value external links")

    return result

# Usage
audit_result = seo_audit("https://example.com")

Competitor Analysis

def competitor_analysis(urls: list):
    """Analyze link patterns across multiple competitor sites"""
    all_results = {}

    for url in urls:
        response = requests.post(
            "http://localhost:8000/links/analyze",
            headers={"Authorization": "Bearer YOUR_TOKEN"},
            json={"url": url}
        )
        all_results[url] = response.json()

    # Compare external link strategies
    print("🔍 Competitor Link Analysis")
    for url, result in all_results.items():
        external_links = result.get('external', [])
        avg_score = sum(link.get('score', 0) for link in external_links) / len(external_links) if external_links else 0
        print(f"{url}: {len(external_links)} external links (avg score: {avg_score:.2f})")

    return all_results

# Usage
competitors = [
    "https://competitor1.com",
    "https://competitor2.com",
    "https://competitor3.com"
]
analysis = competitor_analysis(competitors)

Content Discovery

def discover_related_content(start_url: str, max_depth: int = 2):
    """Discover related content through link analysis"""
    visited = set()
    queue = [(start_url, 0)]

    while queue and len(visited) < 20:
        current_url, depth = queue.pop(0)

        if current_url in visited or depth > max_depth:
            continue

        visited.add(current_url)

        try:
            response = requests.post(
                "http://localhost:8000/links/analyze",
                headers={"Authorization": "Bearer YOUR_TOKEN"},
                json={"url": current_url}
            )

            result = response.json()
            internal_links = result.get('internal', [])

            # Sort by score and add top links to queue
            top_links = sorted(internal_links, key=lambda x: x.get('score', 0), reverse=True)[:3]

            for link in top_links:
                url = link.get('url')
                if url and url not in visited:
                    queue.append((url, depth + 1))
                    print(f"🔗 Found: {link.get('text', '')} ({link.get('score', 0):.2f})")

        except Exception as e:
            print(f"❌ Error analyzing {current_url}: {e}")

    return visited

# Usage
related_pages = discover_related_content("https://blog.example.com")
print(f"Discovered {len(related_pages)} related pages")

Best Practices

1. Request Optimization

# ✅ Good: Use appropriate timeouts
response = requests.post(
    "http://localhost:8000/links/analyze",
    headers={"Authorization": "Bearer YOUR_TOKEN"},
    json={"url": url},
    timeout=30  # 30 second timeout
)

# ✅ Good: Configure wait times for dynamic sites
config = {
    "wait_for": 2.0,  # Wait for JavaScript to load
    "simulate_user": True
}

2. Error Handling

def safe_link_analysis(url: str):
    try:
        response = requests.post(
            "http://localhost:8000/links/analyze",
            headers={"Authorization": "Bearer YOUR_TOKEN"},
            json={"url": url},
            timeout=30
        )

        if response.status_code == 200:
            return response.json()
        elif response.status_code == 400:
            print("❌ Invalid request format")
        elif response.status_code == 500:
            print("❌ Server error during analysis")
        else:
            print(f"❌ Unexpected status code: {response.status_code}")

    except requests.Timeout:
        print("⏰ Request timed out")
    except requests.ConnectionError:
        print("🔌 Connection error")
    except Exception as e:
        print(f"❌ Unexpected error: {e}")

    return None

3. Data Processing

def process_links_data(result: dict):
    """Process and filter link analysis results"""

    # Filter by minimum score
    min_score = 0.5
    high_quality_links = {}

    for category, links in result.items():
        filtered_links = [
            link for link in links
            if link.get('score', 0) >= min_score
        ]
        if filtered_links:
            high_quality_links[category] = filtered_links

    # Extract unique domains
    domains = set()
    for link in result.get('external', []):
        domains.add(link.get('domain', ''))

    return {
        'filtered_links': high_quality_links,
        'unique_domains': list(domains),
        'total_links': sum(len(links) for links in result.values())
    }

Performance Considerations

Response Times

  • Simple pages: 2-5 seconds
  • Complex pages: 5-15 seconds
  • JavaScript-heavy: 10-30 seconds
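
These are rough figures; to see what you actually get, time the call yourself. A minimal sketch against the same endpoint:

import time
import requests

def timed_analysis(url: str):
    """Measure how long a single analysis takes."""
    start = time.perf_counter()
    response = requests.post(
        "http://localhost:8000/links/analyze",
        headers={"Authorization": "Bearer YOUR_TOKEN"},
        json={"url": url},
        timeout=60,  # generous ceiling for JavaScript-heavy pages
    )
    elapsed = time.perf_counter() - start
    return (response.json() if response.ok else None), elapsed

result, seconds = timed_analysis("https://example.com")
print(f"Analysis took {seconds:.1f}s")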

Rate Limiting

The endpoint includes built-in rate limiting. For bulk analysis:

import time

def bulk_link_analysis(urls: list, delay: float = 1.0):
    """Analyze multiple URLs with rate limiting"""
    results = {}

    for url in urls:
        result = safe_link_analysis(url)
        if result:
            results[url] = result

        # Respect rate limits
        time.sleep(delay)

    return results

Error Handling

Common Errors and Solutions

| Error Code | Cause | Solution |
|------------|-------|----------|
| 400 | Invalid URL or config | Check URL format and config structure |
| 401 | Invalid authentication | Verify your API token |
| 429 | Rate limit exceeded | Add delays between requests |
| 500 | Crawl failure | Check if site is accessible |
| 503 | Service unavailable | Try again later |
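
For transient errors (429 and 503), retrying with backoff usually resolves the issue. A sketch of one approach (whether the server sends a Retry-After header is an assumption; the fallback is plain exponential backoff):

import time
import requests

def analyze_with_retry(url: str, max_retries: int = 3):
    """Retry on 429/503 with exponential backoff (illustrative)."""
    for attempt in range(max_retries + 1):
        response = requests.post(
            "http://localhost:8000/links/analyze",
            headers={"Authorization": "Bearer YOUR_TOKEN"},
            json={"url": url},
            timeout=30,
        )
        if response.status_code in (429, 503) and attempt < max_retries:
            # Use Retry-After if present (assumed, not guaranteed), else back off exponentially
            wait = float(response.headers.get("Retry-After", 2 ** attempt))
            time.sleep(wait)
            continue
        response.raise_for_status()
        return response.json()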

Debug Mode

# Tag requests with a distinctive User-Agent so they are easy to spot in logs
config = {
    "headers": {
        "User-Agent": "Crawl4AI-Debug/1.0"
    }
}

# Surface error details from a failed response
try:
    response = requests.post(
        "http://localhost:8000/links/analyze",
        headers={"Authorization": "Bearer YOUR_TOKEN"},
        json={"url": url, "config": config}
    )
    response.raise_for_status()
except requests.HTTPError as e:
    print(f"Error details: {e.response.text}")

API Reference

Endpoint Details

  • URL: /links/analyze
  • Method: POST
  • Content-Type: application/json
  • Authentication: Bearer token required

Request Schema

{
    "url": str,                    # Required: URL to analyze
    "config": {                    # Optional: LinkPreviewConfig
        "force": bool,
        "wait_for": float,
        "simulate_user": bool,
        "override_navigator": bool,
        "headers": dict,
        "js_code": list,
        "delay_before_return_html": float,
        "word_count_threshold": int,
        "exclusion_patterns": list,
        "session_id": str,
        "magic": bool
    }
}
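
If you want editor autocompletion while building payloads, the request schema maps naturally onto TypedDicts. This is a client-side convenience only, not something the API ships:

from typing import TypedDict

class LinkPreviewConfig(TypedDict, total=False):
    force: bool
    wait_for: float
    simulate_user: bool
    override_navigator: bool
    headers: dict
    js_code: list
    delay_before_return_html: float
    word_count_threshold: int
    exclusion_patterns: list
    session_id: str
    magic: bool

class AnalyzeRequest(TypedDict, total=False):
    url: str                   # required
    config: LinkPreviewConfig  # optional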

Response Schema

{
    "internal": [LinkObject],
    "external": [LinkObject],
    "social": [LinkObject],
    "download": [LinkObject],
    "email": [LinkObject],
    "phone": [LinkObject]
}

LinkObject Schema

{
    "url": str,
    "text": str,
    "title": str,
    "score": float,
    "context": str,
    "attributes": dict,
    "hash": str,
    "domain": str,
    "scheme": str
}

FAQ

Q: How is the link score calculated?
A: The score considers multiple factors including anchor text relevance, URL structure, page context, and link attributes. Scores range from 0.0 (lowest quality) to 1.0 (highest quality).

Q: Can I analyze password-protected pages?
A: Yes! Use the js_code parameter to handle authentication, or include session cookies in the headers configuration.
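
For example, a cookie-based session might be forwarded like this (the cookie name, value, and URL are placeholders for your own session):

import requests

# Reuse an authenticated session by passing its cookie as a header
config = {
    "headers": {
        "Cookie": "sessionid=YOUR_SESSION_COOKIE"  # placeholder value
    }
}
response = requests.post(
    "http://localhost:8000/links/analyze",
    headers={"Authorization": "Bearer YOUR_TOKEN"},
    json={"url": "https://example.com/members", "config": config},
)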

Q: How many links can I analyze at once?
A: There's no hard limit on the number of links per page, but very large pages (>10,000 links) may take longer to process.

Q: Can I filter out certain types of links?
A: Use the exclusion_patterns parameter in the config to filter out unwanted links using regex patterns.

Q: Does this work with JavaScript-heavy sites?
A: Absolutely! The crawler waits for JavaScript execution and can even run custom JavaScript using the js_code parameter.