feat: Major Chrome Extension overhaul with Click2Crawl, instant Schema extraction, and modular architecture

✨ New Features:
- Click2Crawl: Visual element selection with markdown conversion
  - Ctrl/Cmd+Click to select multiple elements
  - Visual text mode for WYSIWYG extraction
  - Real-time markdown preview with syntax highlighting
  - Export to .md file or clipboard
- Schema Builder Enhancement: Instant data extraction without LLMs
  - Test schemas directly in the browser
  - See JSON results immediately
  - Export data or Python code
  - Cloud deployment ready (coming soon)
- Modular Architecture:
  - Separated into schemaBuilder.js, scriptBuilder.js, and click2CrawlBuilder.js
  - Added contentAnalyzer.js and markdownConverter.js modules
  - Shared utilities and CSS reset system
  - Integrated marked.js for markdown rendering

🎨 UI/UX Improvements:
- Added edgy cloud announcement banner with seamless shimmer animation
- Direct, technical copy: "You don't need Puppeteer. You need Crawl4AI Cloud."
- Enhanced feature cards with emojis
- Fixed CSS conflicts with a targeted reset approach
- Improved badge hover effects (red on hover)
- Added wrap toggle for code preview

📚 Documentation Updates:
- Split extraction diagrams into LLM and no-LLM versions
- Updated llms-full.txt with latest content
- Added versioned LLM context (v0.1.1)

🔧 Technical Enhancements:
- Refactored 3,464 lines of monolithic content.js into modules
- Added proper event handling and cleanup
- Improved z-index management
- Better scroll-position tracking for badges
- Enhanced error handling throughout

This release transforms the Chrome Extension from a simple tool into a powerful visual data extraction suite, making web scraping accessible to everyone.
@@ -16,7 +16,11 @@
     "Bash(/Users/unclecode/.npm-global/lib/node_modules/@anthropic-ai/claude-code/vendor/ripgrep/arm64-darwin/rg -A 5 -B 5 \"Script Builder\" docs/md_v2/apps/crawl4ai-assistant/)",
     "Bash(/Users/unclecode/.npm-global/lib/node_modules/@anthropic-ai/claude-code/vendor/ripgrep/arm64-darwin/rg -A 30 \"generateCode\\(events, format\\)\" docs/md_v2/apps/crawl4ai-assistant/content/content.js)",
     "Bash(/Users/unclecode/.npm-global/lib/node_modules/@anthropic-ai/claude-code/vendor/ripgrep/arm64-darwin/rg \"<style>\" docs/md_v2/apps/crawl4ai-assistant/index.html -A 5)",
-    "Bash(git checkout:*)"
+    "Bash(git checkout:*)",
+    "Bash(docker logs:*)",
+    "Bash(curl:*)",
+    "Bash(docker compose:*)",
+    "Bash(./test-final-integration.sh:*)"
   ]
 },
 "enableAllProjectMcpServers": false
@@ -626,6 +626,16 @@ code {
    background: var(--primary-pink);
}

.tool-status.new {
    background: var(--primary-green);
    animation: pulse 2s ease-in-out infinite;
}

@keyframes pulse {
    0%, 100% { opacity: 1; }
    50% { opacity: 0.8; }
}

/* Tool Details Panel */
.tool-details {
    background: var(--bg-secondary);

@@ -1027,3 +1037,515 @@ code {
        font-size: 1.5rem;
    }
}

/* Code Examples Grid Layout */
.code-example > div[style*="grid"] {
    min-height: 500px;
}

.code-example > div[style*="grid"] .terminal-window {
    height: 100%;
    display: flex;
    flex-direction: column;
}

.code-example > div[style*="grid"] .terminal-content {
    flex: 1;
    overflow: auto;
    max-height: 450px;
}

@media (max-width: 1200px) {
    .code-example > div[style*="grid"] {
        grid-template-columns: 1fr !important;
        gap: 12px !important;
    }
}

/* Cloud Banner Section (Thin Version) */
.cloud-banner-section {
    margin: 2rem 0 3rem 0;
}

.cloud-banner {
    background: linear-gradient(135deg, rgba(15, 187, 170, 0.05) 0%, rgba(243, 128, 245, 0.05) 100%);
    border: 1px solid rgba(15, 187, 170, 0.3);
    border-radius: 12px;
    padding: 1.5rem 2rem;
    position: relative;
    overflow: hidden;
}

.cloud-banner::before {
    content: "";
    position: absolute;
    top: 0;
    left: 0;
    width: 200%;
    height: 100%;
    background: linear-gradient(90deg,
        transparent 0%,
        rgba(15, 187, 170, 0.1) 25%,
        transparent 50%,
        rgba(15, 187, 170, 0.1) 75%,
        transparent 100%
    );
    animation: cloud-shimmer 4s linear infinite;
}

@keyframes cloud-shimmer {
    0% { transform: translateX(0); }
    100% { transform: translateX(-50%); }
}

.cloud-banner-content {
    display: flex;
    align-items: center;
    justify-content: space-between;
    gap: 2rem;
    position: relative;
    z-index: 1;
}

.cloud-banner-text {
    flex: 1;
    text-align: left;
}

.cloud-banner-text h3 {
    margin: 0;
    font-size: 1.25rem;
    color: var(--text-primary);
    font-weight: 600;
    letter-spacing: -0.02em;
}

.cloud-banner-text p {
    margin: 0.25rem 0 0;
    font-size: 0.875rem;
    color: var(--text-secondary);
}

.cloud-banner-btn {
    background: var(--primary-green);
    color: var(--bg-dark);
    border: none;
    padding: 0.75rem 1.5rem;
    font-size: 0.875rem;
    font-weight: 600;
    border-radius: 25px;
    cursor: pointer;
    transition: all 0.3s ease;
    font-family: var(--font-primary);
    white-space: nowrap;
    flex-shrink: 0;
}

.cloud-banner-btn:hover {
    background: #1fcbba;
    transform: translateY(-2px);
    box-shadow: 0 6px 20px rgba(15, 187, 170, 0.3);
}

@media (max-width: 768px) {
    .cloud-banner-content {
        flex-direction: column;
        text-align: center;
        gap: 1rem;
    }

    .cloud-banner-text {
        text-align: center;
    }

    .cloud-banner-icon {
        font-size: 2rem;
    }

    .cloud-banner-text h3 {
        font-size: 1.25rem;
    }
}

/* Crawl4AI Cloud Section */
.cloud-section {
    margin: 5rem 0;
}

.cloud-announcement {
    background: linear-gradient(135deg, #1a1a1a 0%, #2a2a2a 100%);
    border: 2px solid var(--primary-green);
    border-radius: 20px;
    padding: 4rem 3rem;
    position: relative;
    overflow: hidden;
    box-shadow: 0 20px 60px rgba(15, 187, 170, 0.2);
    text-align: center;
}

.cloud-announcement::before {
    content: "";
    position: absolute;
    top: -50%;
    left: -50%;
    width: 200%;
    height: 450%;
    background: radial-gradient(circle, rgba(15, 187, 170, 0.1) 0%, transparent 70%);
    animation: rotate 20s linear infinite;
}

@keyframes rotate {
    from { transform: rotate(0deg); }
    to { transform: rotate(360deg); }
}

@keyframes float {
    0%, 100% { transform: translateY(0); }
    50% { transform: translateY(-10px); }
}

.cloud-announcement h2 {
    font-size: 2.5rem;
    margin: 0 0 0.5rem 0;
    color: var(--text-primary);
    font-weight: 700;
    letter-spacing: -0.03em;
    position: relative;
    z-index: 1;
}

.cloud-tagline {
    font-size: 1.25rem;
    color: var(--text-secondary);
    margin: 0.5rem 0 2rem;
    position: relative;
    z-index: 1;
}

.cloud-features-preview {
    display: flex;
    justify-content: center;
    gap: 2rem;
    margin: 2rem 0 3rem;
    flex-wrap: wrap;
    position: relative;
    z-index: 1;
}

.cloud-feature-item {
    font-size: 0.875rem;
    color: var(--text-secondary);
    font-family: var(--font-code);
    padding: 0.5rem 1rem;
    background: var(--bg-secondary);
    border: 1px solid var(--border-color);
    border-radius: 6px;
}

.cloud-cta-button {
    background: var(--primary-green);
    color: var(--bg-dark);
    border: none;
    padding: 0.875rem 2rem;
    font-size: 1rem;
    font-weight: 600;
    border-radius: 6px;
    cursor: pointer;
    transition: all 0.2s ease;
    position: relative;
    z-index: 1;
    font-family: var(--font-primary);
    text-transform: none;
    letter-spacing: -0.01em;
}

.cloud-cta-button:hover {
    transform: translateY(-2px);
    box-shadow: 0 10px 30px rgba(15, 187, 170, 0.4);
    background: #1fcbba;
}

.cloud-hint {
    margin-top: 1.5rem;
    font-size: 0.875rem;
    color: var(--text-secondary);
    position: relative;
    z-index: 1;
    font-style: italic;
}

/* Signup Overlay */
.signup-overlay {
    position: fixed;
    top: 0;
    left: 0;
    right: 0;
    bottom: 0;
    background: rgba(0, 0, 0, 0.9);
    backdrop-filter: blur(10px);
    z-index: 10000;
    display: none;
    align-items: center;
    justify-content: center;
    padding: 2rem;
}

.signup-overlay.active {
    display: flex;
}

.signup-container {
    background: var(--bg-secondary);
    border: 2px solid var(--primary-green);
    border-radius: 16px;
    max-width: 600px;
    width: 100%;
    max-height: 90vh;
    overflow: auto;
    position: relative;
    box-shadow: 0 20px 60px rgba(15, 187, 170, 0.3);
}

.close-signup {
    position: absolute;
    top: 1rem;
    right: 1rem;
    background: var(--bg-tertiary);
    border: none;
    color: var(--text-secondary);
    width: 40px;
    height: 40px;
    border-radius: 50%;
    font-size: 24px;
    cursor: pointer;
    transition: all 0.2s ease;
    z-index: 10;
}

.close-signup:hover {
    background: var(--primary-pink);
    color: var(--bg-dark);
    transform: rotate(90deg);
}

.signup-content {
    padding: 3rem;
}

.signup-content h3 {
    font-size: 1.75rem;
    margin: 0 0 0.5rem;
    color: var(--text-primary);
}

.signup-content p {
    color: var(--text-secondary);
    margin-bottom: 2rem;
}

.waitlist-form {
    display: flex;
    flex-direction: column;
    gap: 1.5rem;
}

.form-field {
    display: flex;
    flex-direction: column;
    gap: 0.5rem;
}

.form-field label {
    font-size: 0.875rem;
    color: var(--text-secondary);
    text-transform: uppercase;
    font-weight: 600;
}

.form-field input,
.form-field select {
    background: var(--bg-tertiary);
    border: 1px solid var(--border-color);
    color: var(--text-primary);
    padding: 0.75rem 1rem;
    border-radius: 8px;
    font-size: 1rem;
    font-family: var(--font-primary);
    transition: all 0.2s ease;
}

.form-field input:focus,
.form-field select:focus {
    outline: none;
    border-color: var(--primary-green);
    box-shadow: 0 0 0 3px rgba(15, 187, 170, 0.2);
}

.submit-button {
    background: var(--primary-green);
    color: var(--bg-dark);
    border: none;
    padding: 1rem 2rem;
    font-size: 1.125rem;
    font-weight: 600;
    border-radius: 8px;
    cursor: pointer;
    transition: all 0.2s ease;
    font-family: var(--font-primary);
    display: flex;
    align-items: center;
    justify-content: center;
    gap: 0.5rem;
    margin-top: 1rem;
}

.submit-button:hover {
    background: #1fcbba;
    transform: translateY(-2px);
    box-shadow: 0 8px 24px rgba(15, 187, 170, 0.3);
}

/* Crawling Animation */
.crawl-animation {
    padding: 3rem;
    text-align: left;
}

.crawl-terminal {
    margin-bottom: 2rem;
}

.crawl-terminal .terminal-content {
    max-height: 400px;
    overflow-y: auto;
}

.crawl-terminal code {
    white-space: pre;
    display: block;
    line-height: 1.6;
}

.crawl-log {
    color: var(--text-primary);
    font-family: var(--font-code);
}

.crawl-log .log-init { color: #0fbbaa; }
.crawl-log .log-fetch { color: #4169e1; }
.crawl-log .log-scrape { color: #f380f5; }
.crawl-log .log-extract { color: #ffbd2e; }
.crawl-log .log-complete { color: #0fbbaa; }
.crawl-log .log-success { color: #0fbbaa; }
.crawl-log .log-time { color: #666; }

.extracted-preview {
    background: var(--bg-tertiary);
    border-radius: 12px;
    padding: 1.5rem;
    margin-bottom: 2rem;
}

.extracted-preview h4 {
    margin: 0 0 1rem;
    color: var(--primary-green);
    font-size: 1.25rem;
}

.json-preview {
    background: var(--bg-dark);
    border: 1px solid var(--border-color);
    border-radius: 8px;
    padding: 1rem;
    overflow-x: auto;
    max-height: 300px;
}

.json-preview code {
    color: var(--text-primary);
    font-size: 0.875rem;
}

.success-message {
    text-align: center;
    padding: 2rem;
}

.continue-button {
    background: var(--primary-green);
    color: var(--bg-dark);
    border: none;
    padding: 1rem 2rem;
    font-size: 1.125rem;
    font-weight: 600;
    border-radius: 8px;
    cursor: pointer;
    transition: all 0.2s ease;
    font-family: var(--font-primary);
    margin-top: 2rem;
}

.continue-button:hover {
    background: #1fcbba;
    transform: translateY(-2px);
    box-shadow: 0 8px 24px rgba(15, 187, 170, 0.3);
}

.success-icon {
    font-size: 4rem;
    margin-bottom: 1rem;
    animation: bounce 0.5s ease;
}

@keyframes bounce {
    0%, 100% { transform: translateY(0); }
    50% { transform: translateY(-20px); }
}

.success-message h3 {
    font-size: 2rem;
    margin: 0 0 1rem;
    color: var(--primary-green);
}

.success-message ul {
    list-style: none;
    margin: 1.5rem 0;
    padding: 0;
    text-align: left;
    max-width: 400px;
    margin-left: auto;
    margin-right: auto;
}

.success-message li {
    padding: 0.5rem 0;
    color: var(--text-primary);
    font-size: 1.125rem;
}

.success-note {
    color: var(--text-secondary);
    font-size: 1rem;
    margin-top: 2rem;
    padding: 1rem;
    background: var(--bg-tertiary);
    border-radius: 8px;
}

@media (max-width: 768px) {
    .cloud-announcement h2 {
        font-size: 2rem;
    }

    .cloud-features-preview {
        flex-direction: column;
        gap: 1rem;
    }

    .signup-content {
        padding: 2rem;
    }
}
docs/md_v2/apps/crawl4ai-assistant/content/click2CrawlBuilder.js (new file, 732 lines)
@@ -0,0 +1,732 @@
|
||||
class Click2CrawlBuilder {
|
||||
constructor() {
|
||||
this.selectedElements = new Set();
|
||||
this.highlightBoxes = new Map();
|
||||
this.selectionMode = false;
|
||||
this.toolbar = null;
|
||||
this.previewPanel = null;
|
||||
this.selectionCounter = 0;
|
||||
this.markdownConverter = null;
|
||||
this.contentAnalyzer = null;
|
||||
|
||||
// Configuration options
|
||||
this.options = {
|
||||
includeImages: true,
|
||||
preserveTables: true,
|
||||
keepCodeFormatting: true,
|
||||
simplifyLayout: false,
|
||||
preserveLinks: true,
|
||||
addSeparators: true,
|
||||
includeXPath: false,
|
||||
textOnly: false
|
||||
};
|
||||
|
||||
this.init();
|
||||
}
|
||||
|
||||
async init() {
|
||||
// Initialize dependencies
|
||||
this.markdownConverter = new MarkdownConverter();
|
||||
this.contentAnalyzer = new ContentAnalyzer();
|
||||
|
||||
this.createToolbar();
|
||||
this.setupEventListeners();
|
||||
}
|
||||
|
||||
createToolbar() {
|
||||
// Create floating toolbar
|
||||
this.toolbar = document.createElement('div');
|
||||
this.toolbar.className = 'c4ai-c2c-toolbar';
|
||||
this.toolbar.innerHTML = `
|
||||
<div class="c4ai-toolbar-header">
|
||||
<div class="c4ai-toolbar-dots">
|
||||
<span class="c4ai-dot c4ai-dot-red"></span>
|
||||
<span class="c4ai-dot c4ai-dot-yellow"></span>
|
||||
<span class="c4ai-dot c4ai-dot-green"></span>
|
||||
</div>
|
||||
<span class="c4ai-toolbar-title">Click2Crawl</span>
|
||||
<button class="c4ai-close-btn" title="Close">×</button>
|
||||
</div>
|
||||
<div class="c4ai-toolbar-content">
|
||||
<div class="c4ai-selection-info">
|
||||
<span class="c4ai-selection-count">0 elements selected</span>
|
||||
<button class="c4ai-clear-btn" title="Clear selection" disabled>Clear</button>
|
||||
</div>
|
||||
<div class="c4ai-toolbar-actions">
|
||||
<button class="c4ai-preview-btn" disabled>Preview Markdown</button>
|
||||
<button class="c4ai-copy-btn" disabled>Copy to Clipboard</button>
|
||||
</div>
|
||||
<div class="c4ai-toolbar-instructions">
|
||||
<p>💡 <strong>Ctrl/Cmd + Click</strong> to select multiple elements</p>
|
||||
<p>📝 Selected elements will be converted to clean markdown</p>
|
||||
<p>⌨️ Press <strong>ESC</strong> to exit</p>
|
||||
</div>
|
||||
</div>
|
||||
`;
|
||||
|
||||
document.body.appendChild(this.toolbar);
|
||||
makeDraggableByHeader(this.toolbar);
|
||||
|
||||
// Position toolbar
|
||||
this.toolbar.style.position = 'fixed';
|
||||
this.toolbar.style.top = '20px';
|
||||
this.toolbar.style.right = '20px';
|
||||
this.toolbar.style.zIndex = '999999';
|
||||
}
|
||||
|
||||
setupEventListeners() {
|
||||
// Close button
|
||||
this.toolbar.querySelector('.c4ai-close-btn').addEventListener('click', () => {
|
||||
this.deactivate();
|
||||
});
|
||||
|
||||
// Clear selection button
|
||||
this.toolbar.querySelector('.c4ai-clear-btn').addEventListener('click', () => {
|
||||
this.clearSelection();
|
||||
});
|
||||
|
||||
// Preview button
|
||||
this.toolbar.querySelector('.c4ai-preview-btn').addEventListener('click', () => {
|
||||
this.showPreview();
|
||||
});
|
||||
|
||||
// Copy button
|
||||
this.toolbar.querySelector('.c4ai-copy-btn').addEventListener('click', () => {
|
||||
this.copyToClipboard();
|
||||
});
|
||||
|
||||
// Document click handler for element selection
|
||||
this.documentClickHandler = (event) => this.handleElementClick(event);
|
||||
document.addEventListener('click', this.documentClickHandler, true);
|
||||
|
||||
// Prevent default link behavior during selection mode
|
||||
this.linkClickHandler = (event) => {
|
||||
if (event.ctrlKey || event.metaKey) {
|
||||
event.preventDefault();
|
||||
event.stopPropagation();
|
||||
}
|
||||
};
|
||||
document.addEventListener('click', this.linkClickHandler, true);
|
||||
|
||||
// Hover effect
|
||||
this.documentHoverHandler = (event) => this.handleElementHover(event);
|
||||
document.addEventListener('mouseover', this.documentHoverHandler, true);
|
||||
|
||||
// Remove hover on mouseout
|
||||
this.documentMouseOutHandler = (event) => this.handleElementMouseOut(event);
|
||||
document.addEventListener('mouseout', this.documentMouseOutHandler, true);
|
||||
|
||||
// Keyboard shortcuts
|
||||
this.keyboardHandler = (event) => this.handleKeyboard(event);
|
||||
document.addEventListener('keydown', this.keyboardHandler);
|
||||
}
|
||||
|
||||
handleElementClick(event) {
|
||||
// Check if Ctrl/Cmd is pressed
|
||||
if (!event.ctrlKey && !event.metaKey) return;
|
||||
|
||||
// Prevent default behavior
|
||||
event.preventDefault();
|
||||
event.stopPropagation();
|
||||
|
||||
const element = event.target;
|
||||
|
||||
// Don't select our own UI elements
|
||||
if (element.closest('.c4ai-c2c-toolbar') ||
|
||||
element.closest('.c4ai-c2c-preview') ||
|
||||
element.closest('.c4ai-highlight-box')) {
|
||||
return;
|
||||
}
|
||||
|
||||
// Toggle element selection
|
||||
if (this.selectedElements.has(element)) {
|
||||
this.deselectElement(element);
|
||||
} else {
|
||||
this.selectElement(element);
|
||||
}
|
||||
|
||||
this.updateUI();
|
||||
}
|
||||
|
||||
handleElementHover(event) {
|
||||
const element = event.target;
|
||||
|
||||
// Don't hover our own UI elements
|
||||
if (element.closest('.c4ai-c2c-toolbar') ||
|
||||
element.closest('.c4ai-c2c-preview') ||
|
||||
element.closest('.c4ai-highlight-box') ||
|
||||
element.hasAttribute('data-c4ai-badge')) {
|
||||
return;
|
||||
}
|
||||
|
||||
// Add hover class
|
||||
element.classList.add('c4ai-hover-candidate');
|
||||
}
|
||||
|
||||
handleElementMouseOut(event) {
|
||||
const element = event.target;
|
||||
element.classList.remove('c4ai-hover-candidate');
|
||||
}
|
||||
|
||||
handleKeyboard(event) {
|
||||
// ESC to deactivate
|
||||
if (event.key === 'Escape') {
|
||||
this.deactivate();
|
||||
}
|
||||
// Ctrl/Cmd + A to select all visible elements
|
||||
else if ((event.ctrlKey || event.metaKey) && event.key === 'a') {
|
||||
event.preventDefault();
|
||||
// Select all visible text-containing elements
|
||||
const elements = document.querySelectorAll('p, h1, h2, h3, h4, h5, h6, li, td, th, div, span, article, section');
|
||||
elements.forEach(el => {
|
||||
if (el.textContent.trim() && this.isVisible(el) && !this.selectedElements.has(el)) {
|
||||
this.selectElement(el);
|
||||
}
|
||||
});
|
||||
this.updateUI();
|
||||
}
|
||||
}
|
||||
|
||||
isVisible(element) {
|
||||
const rect = element.getBoundingClientRect();
|
||||
const style = window.getComputedStyle(element);
|
||||
return rect.width > 0 &&
|
||||
rect.height > 0 &&
|
||||
style.display !== 'none' &&
|
||||
style.visibility !== 'hidden' &&
|
||||
style.opacity !== '0';
|
||||
}
|
||||
|
||||
selectElement(element) {
|
||||
this.selectedElements.add(element);
|
||||
|
||||
// Create highlight box
|
||||
const box = this.createHighlightBox(element);
|
||||
this.highlightBoxes.set(element, box);
|
||||
|
||||
// Add selected class
|
||||
element.classList.add('c4ai-selected');
|
||||
|
||||
this.selectionCounter++;
|
||||
}
|
||||
|
||||
deselectElement(element) {
|
||||
this.selectedElements.delete(element);
|
||||
|
||||
// Remove highlight box (badge)
|
||||
const badge = this.highlightBoxes.get(element);
|
||||
if (badge) {
|
||||
// Remove scroll/resize listeners
|
||||
if (badge._updatePosition) {
|
||||
window.removeEventListener('scroll', badge._updatePosition, true);
|
||||
window.removeEventListener('resize', badge._updatePosition);
|
||||
}
|
||||
badge.remove();
|
||||
this.highlightBoxes.delete(element);
|
||||
}
|
||||
|
||||
// Remove outline
|
||||
element.style.outline = '';
|
||||
element.style.outlineOffset = '';
|
||||
|
||||
// Remove attributes
|
||||
element.removeAttribute('data-c4ai-selection-order');
|
||||
element.classList.remove('c4ai-selected');
|
||||
|
||||
this.selectionCounter--;
|
||||
}
|
||||
|
||||
createHighlightBox(element) {
|
||||
// Add a data attribute to track selection order
|
||||
element.setAttribute('data-c4ai-selection-order', this.selectionCounter + 1);
|
||||
|
||||
// Add selection outline directly to the element
|
||||
element.style.outline = '2px solid #0fbbaa';
|
||||
element.style.outlineOffset = '2px';
|
||||
|
||||
// Create badge with fixed positioning
|
||||
const badge = document.createElement('div');
|
||||
badge.className = 'c4ai-selection-badge-fixed';
|
||||
badge.textContent = this.selectionCounter + 1;
|
||||
badge.setAttribute('data-c4ai-badge', 'true');
|
||||
badge.title = 'Click to deselect';
|
||||
|
||||
// Get element position and set badge position
|
||||
const rect = element.getBoundingClientRect();
|
||||
badge.style.cssText = `
|
||||
position: fixed !important;
|
||||
top: ${rect.top - 12}px !important;
|
||||
left: ${rect.left - 12}px !important;
|
||||
width: 24px !important;
|
||||
height: 24px !important;
|
||||
background: #0fbbaa !important;
|
||||
color: #070708 !important;
|
||||
border-radius: 50% !important;
|
||||
display: flex !important;
|
||||
align-items: center !important;
|
||||
justify-content: center !important;
|
||||
font-size: 12px !important;
|
||||
font-weight: bold !important;
|
||||
font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", Roboto, sans-serif !important;
|
||||
box-shadow: 0 2px 8px rgba(0, 0, 0, 0.3) !important;
|
||||
z-index: 999998 !important;
|
||||
cursor: pointer !important;
|
||||
transition: all 0.2s ease !important;
|
||||
pointer-events: auto !important;
|
||||
border: none !important;
|
||||
padding: 0 !important;
|
||||
margin: 0 !important;
|
||||
line-height: 1 !important;
|
||||
text-align: center !important;
|
||||
text-decoration: none !important;
|
||||
box-sizing: border-box !important;
|
||||
`;
|
||||
|
||||
// Add hover styles dynamically
|
||||
badge.addEventListener('mouseenter', () => {
|
||||
badge.style.setProperty('background', '#ff3c74', 'important');
|
||||
badge.style.setProperty('transform', 'scale(1.1)', 'important');
|
||||
});
|
||||
|
||||
badge.addEventListener('mouseleave', () => {
|
||||
badge.style.setProperty('background', '#0fbbaa', 'important');
|
||||
badge.style.setProperty('transform', 'scale(1)', 'important');
|
||||
});
|
||||
|
||||
// Add click handler to badge for deselection
|
||||
badge.addEventListener('click', (e) => {
|
||||
e.stopPropagation();
|
||||
e.preventDefault();
|
||||
this.deselectElement(element);
|
||||
this.updateUI();
|
||||
});
|
||||
|
||||
// Add scroll listener to update position
|
||||
const updatePosition = () => {
|
||||
const newRect = element.getBoundingClientRect();
|
||||
badge.style.top = `${newRect.top - 12}px`;
|
||||
badge.style.left = `${newRect.left - 12}px`;
|
||||
};
|
||||
|
||||
// Store the update function so we can remove it later
|
||||
badge._updatePosition = updatePosition;
|
||||
window.addEventListener('scroll', updatePosition, true);
|
||||
window.addEventListener('resize', updatePosition);
|
||||
|
||||
document.body.appendChild(badge);
|
||||
|
||||
return badge;
|
||||
}
|
||||
|
||||
clearSelection() {
|
||||
// Clear all selections
|
||||
this.selectedElements.forEach(element => {
|
||||
// Remove badge
|
||||
const badge = this.highlightBoxes.get(element);
|
||||
if (badge) {
|
||||
// Remove scroll/resize listeners
|
||||
if (badge._updatePosition) {
|
||||
window.removeEventListener('scroll', badge._updatePosition, true);
|
||||
window.removeEventListener('resize', badge._updatePosition);
|
||||
}
|
||||
badge.remove();
|
||||
}
|
||||
|
||||
// Remove outline
|
||||
element.style.outline = '';
|
||||
element.style.outlineOffset = '';
|
||||
|
||||
// Remove attributes
|
||||
element.removeAttribute('data-c4ai-selection-order');
|
||||
element.classList.remove('c4ai-selected');
|
||||
});
|
||||
|
||||
this.selectedElements.clear();
|
||||
this.highlightBoxes.clear();
|
||||
this.selectionCounter = 0;
|
||||
|
||||
this.updateUI();
|
||||
}
|
||||
|
||||
updateUI() {
|
||||
const count = this.selectedElements.size;
|
||||
|
||||
// Update selection count
|
||||
this.toolbar.querySelector('.c4ai-selection-count').textContent =
|
||||
`${count} element${count !== 1 ? 's' : ''} selected`;
|
||||
|
||||
// Enable/disable buttons
|
||||
const hasSelection = count > 0;
|
||||
this.toolbar.querySelector('.c4ai-preview-btn').disabled = !hasSelection;
|
||||
this.toolbar.querySelector('.c4ai-copy-btn').disabled = !hasSelection;
|
||||
this.toolbar.querySelector('.c4ai-clear-btn').disabled = !hasSelection;
|
||||
}
|
||||
|
||||
async showPreview() {
|
||||
// Generate markdown from selected elements
|
||||
const markdown = await this.generateMarkdown();
|
||||
|
||||
// Create or update preview panel
|
||||
if (!this.previewPanel) {
|
||||
this.createPreviewPanel();
|
||||
}
|
||||
|
||||
await this.updatePreviewContent(markdown);
|
||||
this.previewPanel.style.display = 'block';
|
||||
}
|
||||
|
||||
createPreviewPanel() {
|
||||
this.previewPanel = document.createElement('div');
|
||||
this.previewPanel.className = 'c4ai-c2c-preview';
|
||||
this.previewPanel.innerHTML = `
|
||||
<div class="c4ai-preview-header">
|
||||
<div class="c4ai-toolbar-dots">
|
||||
<span class="c4ai-dot c4ai-dot-red"></span>
|
||||
<span class="c4ai-dot c4ai-dot-yellow"></span>
|
||||
<span class="c4ai-dot c4ai-dot-green"></span>
|
||||
</div>
|
||||
<span class="c4ai-preview-title">Markdown Preview</span>
|
||||
<button class="c4ai-preview-close">×</button>
|
||||
</div>
|
||||
<div class="c4ai-preview-options">
|
||||
<label><input type="checkbox" name="textOnly"> 👁️ Visual Text Mode (As You See) TRY THIS!!!</label>
|
||||
<label><input type="checkbox" name="includeImages" checked> Include Images</label>
|
||||
<label><input type="checkbox" name="preserveTables" checked> Preserve Tables</label>
|
||||
<label><input type="checkbox" name="preserveLinks" checked> Preserve Links</label>
|
||||
<label><input type="checkbox" name="keepCodeFormatting" checked> Keep Code Formatting</label>
|
||||
<label><input type="checkbox" name="simplifyLayout"> Simplify Layout</label>
|
||||
<label><input type="checkbox" name="addSeparators" checked> Add Separators</label>
|
||||
<label><input type="checkbox" name="includeXPath"> Include XPath Headers</label>
|
||||
</div>
|
||||
<div class="c4ai-preview-content">
|
||||
<div class="c4ai-preview-tabs">
|
||||
<button class="c4ai-tab active" data-tab="preview">Preview</button>
|
||||
<button class="c4ai-tab" data-tab="markdown">Markdown</button>
|
||||
<button class="c4ai-wrap-toggle" title="Toggle word wrap">↔️ Wrap</button>
|
||||
</div>
|
||||
<div class="c4ai-preview-pane active" data-pane="preview"></div>
|
||||
<div class="c4ai-preview-pane" data-pane="markdown"></div>
|
||||
</div>
|
||||
<div class="c4ai-preview-actions">
|
||||
<button class="c4ai-download-btn">Download .md</button>
|
||||
<button class="c4ai-copy-markdown-btn">Copy Markdown</button>
|
||||
<button class="c4ai-cloud-btn" disabled>Send to Cloud (Coming Soon)</button>
|
||||
</div>
|
||||
`;
|
||||
|
||||
document.body.appendChild(this.previewPanel);
|
||||
makeDraggableByHeader(this.previewPanel);
|
||||
|
||||
// Position preview panel
|
||||
this.previewPanel.style.position = 'fixed';
|
||||
this.previewPanel.style.top = '50%';
|
||||
this.previewPanel.style.left = '50%';
|
||||
this.previewPanel.style.transform = 'translate(-50%, -50%)';
|
||||
this.previewPanel.style.zIndex = '999999';
|
||||
|
||||
this.setupPreviewEventListeners();
|
||||
}
|
||||
|
||||
  setupPreviewEventListeners() {
    // Close button
    this.previewPanel.querySelector('.c4ai-preview-close').addEventListener('click', () => {
      this.previewPanel.style.display = 'none';
    });

    // Tab switching
    this.previewPanel.querySelectorAll('.c4ai-tab').forEach(tab => {
      tab.addEventListener('click', (e) => {
        const tabName = e.target.dataset.tab;
        this.switchPreviewTab(tabName);
      });
    });

    // Word-wrap toggle for both preview panes
    const wrapToggle = this.previewPanel.querySelector('.c4ai-wrap-toggle');
    wrapToggle.addEventListener('click', () => {
      const panes = this.previewPanel.querySelectorAll('.c4ai-preview-pane');
      panes.forEach(pane => {
        pane.classList.toggle('wrap');
      });
      wrapToggle.classList.toggle('active');
    });

    // Conversion options
    this.previewPanel.querySelectorAll('input[type="checkbox"]').forEach(checkbox => {
      checkbox.addEventListener('change', async (e) => {
        this.options[e.target.name] = e.target.checked;

        // If text-only is enabled, automatically disable dependent options
        if (e.target.name === 'textOnly' && e.target.checked) {
          // Update the UI checkboxes
          const preserveLinksCheckbox = this.previewPanel.querySelector('input[name="preserveLinks"]');
          if (preserveLinksCheckbox) {
            preserveLinksCheckbox.checked = false;
            preserveLinksCheckbox.disabled = true;
          }

          // Images are also disabled in text-only mode
          const includeImagesCheckbox = this.previewPanel.querySelector('input[name="includeImages"]');
          if (includeImagesCheckbox) {
            includeImagesCheckbox.disabled = true;
          }
        } else if (e.target.name === 'textOnly' && !e.target.checked) {
          // Re-enable options when text-only is disabled
          const preserveLinksCheckbox = this.previewPanel.querySelector('input[name="preserveLinks"]');
          if (preserveLinksCheckbox) {
            preserveLinksCheckbox.disabled = false;
          }

          const includeImagesCheckbox = this.previewPanel.querySelector('input[name="includeImages"]');
          if (includeImagesCheckbox) {
            includeImagesCheckbox.disabled = false;
          }
        }

        const markdown = await this.generateMarkdown();
        await this.updatePreviewContent(markdown);
      });
    });

    // Action buttons
    this.previewPanel.querySelector('.c4ai-copy-markdown-btn').addEventListener('click', () => {
      this.copyToClipboard();
    });

    this.previewPanel.querySelector('.c4ai-download-btn').addEventListener('click', () => {
      this.downloadMarkdown();
    });
  }
  switchPreviewTab(tabName) {
    // Update the active tab
    this.previewPanel.querySelectorAll('.c4ai-tab').forEach(tab => {
      tab.classList.toggle('active', tab.dataset.tab === tabName);
    });

    // Update the active pane
    this.previewPanel.querySelectorAll('.c4ai-preview-pane').forEach(pane => {
      pane.classList.toggle('active', pane.dataset.pane === tabName);
    });
  }
  async updatePreviewContent(markdown) {
    // Update the markdown pane with the escaped markdown source
    const markdownPane = this.previewPanel.querySelector('[data-pane="markdown"]');
    markdownPane.innerHTML = `<pre><code>${this.escapeHtml(markdown)}</code></pre>`;

    // Update the preview pane using marked.js
    const previewPane = this.previewPanel.querySelector('[data-pane="preview"]');

    // Configure marked options (marked.js is already loaded via the manifest)
    if (window.marked) {
      marked.setOptions({
        gfm: true,
        breaks: true,
        tables: true,
        headerIds: false,
        mangle: false
      });

      // Render markdown to HTML
      const html = marked.parse(markdown);
      previewPane.innerHTML = `<div class="c4ai-markdown-preview">${html}</div>`;
    } else {
      // Fallback if marked.js is not available
      previewPane.innerHTML = `<div class="c4ai-markdown-preview"><pre>${this.escapeHtml(markdown)}</pre></div>`;
    }
  }
  escapeHtml(unsafe) {
    return unsafe
      .replace(/&/g, "&amp;")
      .replace(/</g, "&lt;")
      .replace(/>/g, "&gt;")
      .replace(/"/g, "&quot;")
      .replace(/'/g, "&#039;");
  }
  async generateMarkdown() {
    // Get the selected elements as an array
    const elements = Array.from(this.selectedElements);

    // Sort elements by their selection order
    const sortedElements = elements.sort((a, b) => {
      const orderA = parseInt(a.getAttribute('data-c4ai-selection-order') || '0');
      const orderB = parseInt(b.getAttribute('data-c4ai-selection-order') || '0');
      return orderA - orderB;
    });

    // Convert each element separately
    const markdownParts = [];

    for (let i = 0; i < sortedElements.length; i++) {
      const element = sortedElements[i];

      // Add an XPath header if enabled
      if (this.options.includeXPath) {
        const xpath = this.getXPath(element);
        markdownParts.push(`### Element ${i + 1} - XPath: \`${xpath}\`\n`);
      }

      // Check whether the element is part of a table structure that needs special handling
      let elementsToConvert = [element];

      // In text-only mode, a selected TR is kept as a single row rather than
      // pulling in the whole table
      if (this.options.textOnly && element.tagName === 'TR') {
        const table = element.closest('table');
        if (table && !sortedElements.includes(table)) {
          // Only include this table row, not the whole table
          elementsToConvert = [element];
        }
      }

      // Analyze and convert the individual element
      const analysis = await this.contentAnalyzer.analyze(elementsToConvert);
      const markdown = await this.markdownConverter.convert(elementsToConvert, {
        ...this.options,
        analysis
      });

      markdownParts.push(markdown.trim());

      // Add a separator if enabled and this is not the last element
      if (this.options.addSeparators && i < sortedElements.length - 1) {
        markdownParts.push('\n\n---\n\n');
      }
    }

    return markdownParts.join('\n\n');
  }
  getXPath(element) {
    if (element.id) {
      return `//*[@id="${element.id}"]`;
    }

    const parts = [];
    let current = element;

    while (current && current.nodeType === Node.ELEMENT_NODE) {
      // Count preceding siblings with the same tag name to build the index
      let index = 0;
      let sibling = current.previousSibling;

      while (sibling) {
        if (sibling.nodeType === Node.ELEMENT_NODE && sibling.nodeName === current.nodeName) {
          index++;
        }
        sibling = sibling.previousSibling;
      }

      const tagName = current.nodeName.toLowerCase();
      const part = index > 0 ? `${tagName}[${index + 1}]` : tagName;
      parts.unshift(part);

      current = current.parentNode;
    }

    return '/' + parts.join('/');
  }
  sortElementsByPosition(elements) {
    return elements.sort((a, b) => {
      const position = a.compareDocumentPosition(b);
      if (position & Node.DOCUMENT_POSITION_FOLLOWING) {
        return -1;
      } else if (position & Node.DOCUMENT_POSITION_PRECEDING) {
        return 1;
      }
      return 0;
    });
  }
  async copyToClipboard() {
    const markdown = await this.generateMarkdown();

    try {
      await navigator.clipboard.writeText(markdown);
      this.showNotification('Markdown copied to clipboard!');
    } catch (err) {
      console.error('Failed to copy:', err);
      this.showNotification('Failed to copy. Please try again.', 'error');
    }
  }
  async downloadMarkdown() {
    const markdown = await this.generateMarkdown();
    const timestamp = new Date().toISOString().replace(/[:.]/g, '-').slice(0, -5);
    const filename = `crawl4ai-export-${timestamp}.md`;

    // Create a blob and trigger the download via a temporary anchor
    const blob = new Blob([markdown], { type: 'text/markdown' });
    const url = URL.createObjectURL(blob);

    const a = document.createElement('a');
    a.href = url;
    a.download = filename;
    document.body.appendChild(a);
    a.click();
    document.body.removeChild(a);
    URL.revokeObjectURL(url);

    this.showNotification(`Downloaded ${filename}`);
  }
  showNotification(message, type = 'success') {
    const notification = document.createElement('div');
    notification.className = `c4ai-notification c4ai-notification-${type}`;
    notification.textContent = message;

    document.body.appendChild(notification);

    // Animate in
    setTimeout(() => notification.classList.add('show'), 10);

    // Remove after 3 seconds
    setTimeout(() => {
      notification.classList.remove('show');
      setTimeout(() => notification.remove(), 300);
    }, 3000);
  }
  deactivate() {
    // Remove event listeners
    document.removeEventListener('click', this.documentClickHandler, true);
    document.removeEventListener('click', this.linkClickHandler, true);
    document.removeEventListener('mouseover', this.documentHoverHandler, true);
    document.removeEventListener('mouseout', this.documentMouseOutHandler, true);
    document.removeEventListener('keydown', this.keyboardHandler);

    // Clear selections
    this.clearSelection();

    // Remove UI elements
    if (this.toolbar) {
      this.toolbar.remove();
      this.toolbar = null;
    }

    if (this.previewPanel) {
      this.previewPanel.remove();
      this.previewPanel = null;
    }

    // Remove hover styles
    document.querySelectorAll('.c4ai-hover-candidate').forEach(el => {
      el.classList.remove('c4ai-hover-candidate');
    });

    // Notify the background script (with error handling)
    try {
      if (chrome.runtime && chrome.runtime.sendMessage) {
        chrome.runtime.sendMessage({
          action: 'c2cDeactivated'
        });
      }
    } catch (error) {
      // The extension context might be invalidated; ignore the error
      console.log('Click2Crawl deactivated (extension context unavailable)');
    }
  }
}
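The `getXPath` method above indexes a path segment only when an element has preceding siblings with the same tag (so the second `p` under a parent becomes `p[2]`, while a lone or first `p` stays bare). A minimal, DOM-free sketch of that indexing rule, using hypothetical plain-object nodes with `tag` and `children` fields rather than live DOM elements:

```javascript
// Build an XPath-like path for a node in a plain object tree.
// Mirrors getXPath's rule: a segment gets a 1-based [n] index only
// when it is the n-th (n > 1) child with that tag under its parent.
function pathFor(root, target) {
  function walk(node, segments) {
    if (node === target) return segments;
    const counts = {};
    for (const child of node.children || []) {
      counts[child.tag] = (counts[child.tag] || 0) + 1;
      const n = counts[child.tag];
      const seg = n > 1 ? `${child.tag}[${n}]` : child.tag;
      const found = walk(child, [...segments, seg]);
      if (found) return found;
    }
    return null;
  }
  const segments = walk(root, [root.tag]);
  return segments ? '/' + segments.join('/') : null;
}
```

As in the extension's version, the first same-tag sibling gets no index even when later siblings exist, so such paths are stable only as long as earlier siblings are not inserted.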
File diff suppressed because it is too large (Load Diff)

docs/md_v2/apps/crawl4ai-assistant/content/contentAnalyzer.js (new file, 623 lines)
@@ -0,0 +1,623 @@
class ContentAnalyzer {
  constructor() {
    // Class/id keyword patterns used to recognize common page regions
    this.patterns = {
      article: ['article', 'main', 'content', 'post', 'entry'],
      navigation: ['nav', 'menu', 'navigation', 'breadcrumb'],
      sidebar: ['sidebar', 'aside', 'widget'],
      header: ['header', 'masthead', 'banner'],
      footer: ['footer', 'copyright', 'contact'],
      list: ['list', 'items', 'results', 'products', 'cards'],
      table: ['table', 'grid', 'data'],
      media: ['gallery', 'carousel', 'slideshow', 'video', 'media']
    };
  }

  async analyze(elements) {
    const analysis = {
      structure: await this.analyzeStructure(elements),
      contentType: this.identifyContentType(elements),
      hierarchy: this.buildHierarchy(elements),
      mediaAssets: this.collectMediaAssets(elements),
      textDensity: this.calculateTextDensity(elements),
      semanticRegions: this.identifySemanticRegions(elements),
      relationships: this.analyzeRelationships(elements),
      metadata: this.extractMetadata(elements)
    };

    return analysis;
  }
  analyzeStructure(elements) {
    const structure = {
      hasHeadings: false,
      hasLists: false,
      hasTables: false,
      hasMedia: false,
      hasCode: false,
      hasLinks: false,
      layout: 'linear', // linear, grid, or mixed
      depth: 0,
      elementTypes: new Map()
    };

    // Analyze each element
    for (const element of elements) {
      this.analyzeElementStructure(element, structure);
    }

    // Determine the layout type
    structure.layout = this.determineLayout(elements);

    // Calculate the maximum nesting depth
    structure.depth = this.calculateMaxDepth(elements);

    return structure;
  }
  analyzeElementStructure(element, structure, visited = new Set()) {
    if (visited.has(element)) return;
    visited.add(element);

    const tagName = element.tagName;

    // Update the element type count
    structure.elementTypes.set(
      tagName,
      (structure.elementTypes.get(tagName) || 0) + 1
    );

    // Check for specific structures
    if (/^H[1-6]$/.test(tagName)) {
      structure.hasHeadings = true;
    } else if (['UL', 'OL', 'DL'].includes(tagName)) {
      structure.hasLists = true;
    } else if (tagName === 'TABLE') {
      structure.hasTables = true;
    } else if (['IMG', 'VIDEO', 'IFRAME', 'PICTURE'].includes(tagName)) {
      structure.hasMedia = true;
    } else if (['CODE', 'PRE'].includes(tagName)) {
      structure.hasCode = true;
    } else if (tagName === 'A') {
      structure.hasLinks = true;
    }

    // Recurse into children
    for (const child of element.children) {
      this.analyzeElementStructure(child, structure, visited);
    }
  }
  identifyContentType(elements) {
    const scores = {
      article: 0,
      list: 0,
      table: 0,
      form: 0,
      media: 0,
      mixed: 0
    };

    for (const element of elements) {
      // Score based on element types, classes, and ids
      const tagName = element.tagName;
      const className = element.className.toLowerCase();
      const id = element.id.toLowerCase();

      // Check for article patterns
      if (tagName === 'ARTICLE' ||
          this.matchesPattern(className + ' ' + id, this.patterns.article)) {
        scores.article += 10;
      }

      // Check for list patterns
      if (['UL', 'OL'].includes(tagName) ||
          this.matchesPattern(className, this.patterns.list)) {
        scores.list += 5;
      }

      // Check for tables
      if (tagName === 'TABLE') {
        scores.table += 10;
      }

      // Check for forms
      if (tagName === 'FORM' || element.querySelector('input, select, textarea')) {
        scores.form += 5;
      }

      // Check for media galleries
      if (this.matchesPattern(className, this.patterns.media) ||
          element.querySelectorAll('img, video').length > 3) {
        scores.media += 5;
      }
    }

    // Determine the primary content type
    const maxScore = Math.max(...Object.values(scores));
    if (maxScore === 0) return 'unknown';

    for (const [type, score] of Object.entries(scores)) {
      if (score === maxScore) {
        return type;
      }
    }

    return 'mixed';
  }
  buildHierarchy(elements) {
    const hierarchy = {
      root: null,
      levels: [],
      headingStructure: []
    };

    // Find the common ancestor of all selected elements
    if (elements.length > 0) {
      hierarchy.root = this.findCommonAncestor(elements);
    }

    // Collect headings from all elements
    const headings = [];
    for (const element of elements) {
      const foundHeadings = element.querySelectorAll('h1, h2, h3, h4, h5, h6');
      headings.push(...Array.from(foundHeadings));
    }

    // Sort headings by document position
    headings.sort((a, b) => {
      const position = a.compareDocumentPosition(b);
      if (position & Node.DOCUMENT_POSITION_FOLLOWING) {
        return -1;
      } else if (position & Node.DOCUMENT_POSITION_PRECEDING) {
        return 1;
      }
      return 0;
    });

    // Build a nested heading structure using a stack
    const stack = [];

    for (const heading of headings) {
      const level = parseInt(heading.tagName.substring(1));
      const item = {
        level,
        text: heading.textContent.trim(),
        element: heading,
        children: []
      };

      // Pop until the top of the stack is a shallower heading
      while (stack.length > 0 && stack[stack.length - 1].level >= level) {
        stack.pop();
      }

      if (stack.length > 0) {
        stack[stack.length - 1].children.push(item);
      } else {
        hierarchy.headingStructure.push(item);
      }

      stack.push(item);
    }

    return hierarchy;
  }
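The stack logic above turns a flat, document-ordered list of headings into a tree: each heading pops everything at its own level or deeper, then attaches to whatever remains on top. The same algorithm can be sketched over plain `{level, text}` records (a simplified stand-in for the DOM-backed items, not the extension's API):

```javascript
// Nest a flat list of heading records into a tree using the same
// stack logic as buildHierarchy: pop entries at the same or deeper
// level, then attach to the new top of the stack (or the root).
function nestHeadings(headings) {
  const root = [];
  const stack = [];
  for (const h of headings) {
    const item = { ...h, children: [] };
    while (stack.length > 0 && stack[stack.length - 1].level >= h.level) {
      stack.pop();
    }
    if (stack.length > 0) {
      stack[stack.length - 1].children.push(item);
    } else {
      root.push(item);
    }
    stack.push(item);
  }
  return root;
}
```

For `h1 A, h2 B, h2 C, h1 D` this yields two roots, with `B` and `C` as siblings under `A`; skipped levels (an `h3` directly under an `h1`) simply nest one step deeper.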
  collectMediaAssets(elements) {
    const media = {
      images: [],
      videos: [],
      iframes: [],
      audio: []
    };

    for (const element of elements) {
      // Collect images
      const images = element.querySelectorAll('img');
      for (const img of images) {
        media.images.push({
          src: img.src,
          alt: img.alt,
          title: img.title,
          width: img.width,
          height: img.height,
          element: img
        });
      }

      // Collect videos
      const videos = element.querySelectorAll('video');
      for (const video of videos) {
        media.videos.push({
          src: video.src,
          poster: video.poster,
          width: video.width,
          height: video.height,
          element: video
        });
      }

      // Collect iframes
      const iframes = element.querySelectorAll('iframe');
      for (const iframe of iframes) {
        media.iframes.push({
          src: iframe.src,
          width: iframe.width,
          height: iframe.height,
          title: iframe.title,
          element: iframe
        });
      }

      // Collect audio
      const audios = element.querySelectorAll('audio');
      for (const audio of audios) {
        media.audio.push({
          src: audio.src,
          element: audio
        });
      }
    }

    return media;
  }
  calculateTextDensity(elements) {
    let totalText = 0;
    let totalElements = 0;
    let linkText = 0;
    let codeText = 0;

    for (const element of elements) {
      const stats = this.getTextStats(element);
      totalText += stats.textLength;
      totalElements += stats.elementCount;
      linkText += stats.linkTextLength;
      codeText += stats.codeTextLength;
    }

    return {
      textLength: totalText,
      elementCount: totalElements,
      averageTextPerElement: totalElements > 0 ? totalText / totalElements : 0,
      linkDensity: totalText > 0 ? linkText / totalText : 0,
      codeDensity: totalText > 0 ? codeText / totalText : 0
    };
  }
  getTextStats(element, visited = new Set()) {
    if (visited.has(element)) {
      return { textLength: 0, elementCount: 0, linkTextLength: 0, codeTextLength: 0 };
    }
    visited.add(element);

    let stats = {
      textLength: 0,
      elementCount: 1,
      linkTextLength: 0,
      codeTextLength: 0
    };

    // Count direct text content
    for (const node of element.childNodes) {
      if (node.nodeType === Node.TEXT_NODE) {
        const text = node.textContent.trim();
        stats.textLength += text.length;

        // Text directly inside a link
        if (element.tagName === 'A') {
          stats.linkTextLength += text.length;
        }

        // Text directly inside code
        if (['CODE', 'PRE'].includes(element.tagName)) {
          stats.codeTextLength += text.length;
        }
      }
    }

    // Accumulate stats from children
    for (const child of element.children) {
      const childStats = this.getTextStats(child, visited);
      stats.textLength += childStats.textLength;
      stats.elementCount += childStats.elementCount;
      stats.linkTextLength += childStats.linkTextLength;
      stats.codeTextLength += childStats.codeTextLength;
    }

    return stats;
  }
  identifySemanticRegions(elements) {
    const regions = {
      headers: [],
      navigation: [],
      main: [],
      sidebars: [],
      footers: [],
      articles: []
    };

    for (const element of elements) {
      // Check the element and its ancestors for semantic regions
      let current = element;
      while (current) {
        const tagName = current.tagName;
        const className = current.className.toLowerCase();
        const role = current.getAttribute('role');

        // Check semantic HTML5 elements and ARIA roles
        if (tagName === 'HEADER' || role === 'banner') {
          regions.headers.push(current);
        } else if (tagName === 'NAV' || role === 'navigation') {
          regions.navigation.push(current);
        } else if (tagName === 'MAIN' || role === 'main') {
          regions.main.push(current);
        } else if (tagName === 'ASIDE' || role === 'complementary') {
          regions.sidebars.push(current);
        } else if (tagName === 'FOOTER' || role === 'contentinfo') {
          regions.footers.push(current);
        } else if (tagName === 'ARTICLE' || role === 'article') {
          regions.articles.push(current);
        }

        // Check class-name patterns
        if (this.matchesPattern(className, this.patterns.header)) {
          regions.headers.push(current);
        } else if (this.matchesPattern(className, this.patterns.navigation)) {
          regions.navigation.push(current);
        } else if (this.matchesPattern(className, this.patterns.sidebar)) {
          regions.sidebars.push(current);
        } else if (this.matchesPattern(className, this.patterns.footer)) {
          regions.footers.push(current);
        }

        current = current.parentElement;
      }
    }

    // Deduplicate each region list
    for (const key of Object.keys(regions)) {
      regions[key] = Array.from(new Set(regions[key]));
    }

    return regions;
  }
  analyzeRelationships(elements) {
    const relationships = {
      siblings: [],
      parents: [],
      children: [],
      relatedByClass: new Map(),
      relatedByStructure: []
    };

    // Find sibling relationships
    for (let i = 0; i < elements.length; i++) {
      for (let j = i + 1; j < elements.length; j++) {
        if (elements[i].parentElement === elements[j].parentElement) {
          relationships.siblings.push([elements[i], elements[j]]);
        }
      }
    }

    // Find parent-child relationships
    for (const element of elements) {
      for (const other of elements) {
        if (element !== other) {
          if (element.contains(other)) {
            relationships.parents.push({ parent: element, child: other });
          } else if (other.contains(element)) {
            relationships.children.push({ parent: other, child: element });
          }
        }
      }
    }

    // Group elements by shared classes
    for (const element of elements) {
      const classes = Array.from(element.classList);
      for (const className of classes) {
        if (!relationships.relatedByClass.has(className)) {
          relationships.relatedByClass.set(className, []);
        }
        relationships.relatedByClass.get(className).push(element);
      }
    }

    // Find structurally similar elements
    for (let i = 0; i < elements.length; i++) {
      for (let j = i + 1; j < elements.length; j++) {
        if (this.areStructurallySimilar(elements[i], elements[j])) {
          relationships.relatedByStructure.push([elements[i], elements[j]]);
        }
      }
    }

    return relationships;
  }
  areStructurallySimilar(element1, element2) {
    // Must share the same tag name
    if (element1.tagName !== element2.tagName) {
      return false;
    }

    // Similar class structure: at least 50% overlap in classes
    const classes1 = Array.from(element1.classList).sort();
    const classes2 = Array.from(element2.classList).sort();

    const intersection = classes1.filter(c => classes2.includes(c));
    const union = Array.from(new Set([...classes1, ...classes2]));

    if (union.length > 0 && intersection.length / union.length >= 0.5) {
      return true;
    }

    // Similar child structure: same number and multiset of child tags
    if (element1.children.length === element2.children.length) {
      const childTags1 = Array.from(element1.children).map(c => c.tagName).sort();
      const childTags2 = Array.from(element2.children).map(c => c.tagName).sort();

      if (JSON.stringify(childTags1) === JSON.stringify(childTags2)) {
        return true;
      }
    }

    return false;
  }
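The 50% class-overlap test above is a Jaccard similarity (intersection over union) on the two class sets. Isolated as a standalone sketch over plain class arrays (hypothetical `classSimilarity` helper, not part of the extension's API):

```javascript
// Jaccard similarity between two class lists, matching the 50%
// overlap threshold used by areStructurallySimilar:
// |A ∩ B| / |A ∪ B|, with empty-union pairs scoring 0.
function classSimilarity(classes1, classes2) {
  const set2 = new Set(classes2);
  const intersection = classes1.filter(c => set2.has(c));
  const union = new Set([...classes1, ...classes2]);
  return union.size > 0 ? intersection.length / union.size : 0;
}
```

So `['card', 'active']` vs `['card', 'hover']` scores 1/3 and fails the 0.5 threshold, which is why the method also falls back to comparing child tag structure.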
  extractMetadata(elements) {
    const metadata = {
      title: null,
      description: null,
      author: null,
      date: null,
      tags: [],
      microdata: []
    };

    for (const element of elements) {
      // Use the first h1 as the title
      const h1 = element.querySelector('h1');
      if (h1 && !metadata.title) {
        metadata.title = h1.textContent.trim();
      }

      // Look for microdata and meta-style attributes
      const metaElements = element.querySelectorAll('[itemprop], [property], [name]');
      for (const meta of metaElements) {
        const prop = meta.getAttribute('itemprop') ||
                     meta.getAttribute('property') ||
                     meta.getAttribute('name');
        const content = meta.getAttribute('content') || meta.textContent.trim();

        if (prop && content) {
          if (prop.includes('author')) {
            metadata.author = content;
          } else if (prop.includes('date') || prop.includes('time')) {
            metadata.date = content;
          } else if (prop.includes('description')) {
            metadata.description = content;
          } else if (prop.includes('tag') || prop.includes('keyword')) {
            metadata.tags.push(content);
          }

          metadata.microdata.push({ property: prop, value: content });
        }
      }

      // Fall back to <time> elements for the date
      const timeElements = element.querySelectorAll('time');
      for (const time of timeElements) {
        if (!metadata.date && time.dateTime) {
          metadata.date = time.dateTime;
        }
      }
    }

    return metadata;
  }
  determineLayout(elements) {
    // Collect bounding boxes for all elements
    const positions = elements.map(el => {
      const rect = el.getBoundingClientRect();
      return { x: rect.left, y: rect.top, width: rect.width, height: rect.height };
    });

    // Bucket elements into rows, rounding y to the nearest 10px
    const rows = new Map();
    for (const pos of positions) {
      const row = Math.round(pos.y / 10) * 10;
      if (!rows.has(row)) {
        rows.set(row, []);
      }
      rows.get(row).push(pos);
    }

    // If multiple elements share a row, it is likely a grid
    const hasGrid = Array.from(rows.values()).some(row => row.length > 1);

    if (hasGrid) {
      return 'grid';
    }

    // Significant variation in widths suggests a mixed layout
    const widths = positions.map(p => p.width);
    const avgWidth = widths.reduce((a, b) => a + b, 0) / widths.length;
    const variance = widths.reduce((sum, w) => sum + Math.pow(w - avgWidth, 2), 0) / widths.length;
    const stdDev = Math.sqrt(variance);

    if (stdDev / avgWidth > 0.3) {
      return 'mixed';
    }

    return 'linear';
  }
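The mixed-layout test above compares the standard deviation of element widths to their mean, i.e. it flags a coefficient of variation greater than 0.3. That statistic can be isolated as a small pure function (hypothetical `widthVariation` helper, for illustration only):

```javascript
// Coefficient of variation (stdDev / mean) of a list of widths,
// the statistic determineLayout compares against 0.3 to decide
// between 'mixed' and 'linear' layouts.
function widthVariation(widths) {
  const avg = widths.reduce((a, b) => a + b, 0) / widths.length;
  const variance = widths.reduce((sum, w) => sum + (w - avg) ** 2, 0) / widths.length;
  return Math.sqrt(variance) / avg;
}
```

Uniform widths give 0 (linear), while widths of 50px and 150px give 0.5, which crosses the 0.3 threshold and reads as a mixed layout.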
  calculateMaxDepth(elements) {
    let maxDepth = 0;

    for (const element of elements) {
      const depth = this.getElementDepth(element);
      maxDepth = Math.max(maxDepth, depth);
    }

    return maxDepth;
  }
  getElementDepth(element, depth = 0) {
    if (element.children.length === 0) {
      return depth;
    }

    let maxChildDepth = depth;
    for (const child of element.children) {
      const childDepth = this.getElementDepth(child, depth + 1);
      maxChildDepth = Math.max(maxChildDepth, childDepth);
    }

    return maxChildDepth;
  }
  findCommonAncestor(elements) {
    if (elements.length === 0) return null;
    if (elements.length === 1) return elements[0].parentElement;

    // Walk up from the first element, collecting its ancestors
    let ancestor = elements[0];
    const ancestors = [];

    while (ancestor) {
      ancestors.push(ancestor);
      ancestor = ancestor.parentElement;
    }

    // Return the first (deepest) ancestor that contains every element
    for (const ancestorCandidate of ancestors) {
      let isCommon = true;

      for (const element of elements) {
        if (!ancestorCandidate.contains(element)) {
          isCommon = false;
          break;
        }
      }

      if (isCommon) {
        return ancestorCandidate;
      }
    }

    return document.body;
  }
  matchesPattern(text, patterns) {
    return patterns.some(pattern => text.includes(pattern));
  }
}
docs/md_v2/apps/crawl4ai-assistant/content/markdownConverter.js (new file, 718 lines)
@@ -0,0 +1,718 @@
class MarkdownConverter {
  constructor() {
    // Conversion handlers for each element type
    this.converters = {
      'H1': async (el, ctx) => await this.convertHeading(el, 1, ctx),
      'H2': async (el, ctx) => await this.convertHeading(el, 2, ctx),
      'H3': async (el, ctx) => await this.convertHeading(el, 3, ctx),
      'H4': async (el, ctx) => await this.convertHeading(el, 4, ctx),
      'H5': async (el, ctx) => await this.convertHeading(el, 5, ctx),
      'H6': async (el, ctx) => await this.convertHeading(el, 6, ctx),
      'P': async (el, ctx) => await this.convertParagraph(el, ctx),
      'A': async (el, ctx) => await this.convertLink(el, ctx),
      'IMG': async (el, ctx) => await this.convertImage(el, ctx),
      'UL': async (el, ctx) => await this.convertList(el, 'ul', ctx),
      'OL': async (el, ctx) => await this.convertList(el, 'ol', ctx),
      'LI': async (el, ctx) => await this.convertListItem(el, ctx),
      'TABLE': async (el, ctx) => await this.convertTable(el, ctx),
      'BLOCKQUOTE': async (el, ctx) => await this.convertBlockquote(el, ctx),
      'PRE': async (el, ctx) => await this.convertPreformatted(el, ctx),
      'CODE': async (el, ctx) => await this.convertCode(el, ctx),
      'HR': async (el, ctx) => '\n---\n',
      'BR': async (el, ctx) => '  \n', // two trailing spaces: markdown hard line break
      'STRONG': async (el, ctx) => `**${await this.getTextContent(el, ctx)}**`,
      'B': async (el, ctx) => `**${await this.getTextContent(el, ctx)}**`,
      'EM': async (el, ctx) => `*${await this.getTextContent(el, ctx)}*`,
      'I': async (el, ctx) => `*${await this.getTextContent(el, ctx)}*`,
      'DEL': async (el, ctx) => `~~${await this.getTextContent(el, ctx)}~~`,
      'S': async (el, ctx) => `~~${await this.getTextContent(el, ctx)}~~`,
      'DIV': async (el, ctx) => await this.convertDiv(el, ctx),
      'SPAN': async (el, ctx) => await this.convertSpan(el, ctx),
      'ARTICLE': async (el, ctx) => await this.convertArticle(el, ctx),
      'SECTION': async (el, ctx) => await this.convertSection(el, ctx),
      'FIGURE': async (el, ctx) => await this.convertFigure(el, ctx),
      'FIGCAPTION': async (el, ctx) => await this.convertFigCaption(el, ctx),
      'VIDEO': async (el, ctx) => await this.convertVideo(el, ctx),
      'IFRAME': async (el, ctx) => await this.convertIframe(el, ctx),
      'DL': async (el, ctx) => await this.convertDefinitionList(el, ctx),
      'DT': async (el, ctx) => await this.convertDefinitionTerm(el, ctx),
      'DD': async (el, ctx) => await this.convertDefinitionDescription(el, ctx),
      'TR': async (el, ctx) => await this.convertTableRow(el, ctx)
    };

    // Context maintained during a single conversion pass
    this.conversionContext = {
      listDepth: 0,
      inTable: false,
      inCode: false,
      preserveWhitespace: false,
      references: [],
      imageCount: 0,
      linkCount: 0
    };
  }
  async convert(elements, options = {}) {
    // Reset the per-conversion context
    this.resetContext();

    // Apply options over the defaults
    this.options = {
      includeImages: true,
      preserveTables: true,
      keepCodeFormatting: true,
      simplifyLayout: false,
      preserveLinks: true,
      ...options
    };

    // Convert each element
    const markdownParts = [];

    for (const element of elements) {
      const markdown = await this.convertElement(element, this.conversionContext);
      if (markdown.trim()) {
        markdownParts.push(markdown);
      }
    }

    // Join parts with blank lines between blocks
    let result = markdownParts.join('\n\n');

    // Append references when using reference-style links
    if (this.conversionContext.references.length > 0) {
      result += '\n\n' + this.generateReferences();
    }

    // Post-process to clean up the output
    result = this.postProcess(result);

    return result;
  }
resetContext() {
|
||||
this.conversionContext = {
|
||||
listDepth: 0,
|
||||
inTable: false,
|
||||
inCode: false,
|
||||
preserveWhitespace: false,
|
||||
references: [],
|
||||
imageCount: 0,
|
||||
linkCount: 0
|
||||
};
|
||||
}
|
||||
|
||||
  async convertElement(element, context) {
    // Skip hidden elements
    if (this.isHidden(element)) {
      return '';
    }

    // Skip script and style elements
    if (['SCRIPT', 'STYLE', 'NOSCRIPT'].includes(element.tagName)) {
      return '';
    }

    // Get the converter for this element type
    const converter = this.converters[element.tagName];

    if (converter) {
      return await converter(element, context);
    } else {
      // For unknown elements, process children
      return await this.processChildren(element, context);
    }
  }

  async processChildren(element, context) {
    const parts = [];

    for (const child of element.childNodes) {
      if (child.nodeType === Node.TEXT_NODE) {
        const text = this.processTextNode(child, context);
        if (text) {
          parts.push(text);
        }
      } else if (child.nodeType === Node.ELEMENT_NODE) {
        const markdown = await this.convertElement(child, context);
        if (markdown) {
          parts.push(markdown);
        }
      }
    }

    return parts.join('');
  }

  processTextNode(node, context) {
    let text = node.textContent;

    // Preserve whitespace in code blocks
    if (!context.preserveWhitespace && !context.inCode) {
      // Normalize whitespace
      text = text.replace(/\s+/g, ' ');

      // Trim if at block boundaries
      if (this.isBlockBoundary(node.previousSibling)) {
        text = text.trimStart();
      }
      if (this.isBlockBoundary(node.nextSibling)) {
        text = text.trimEnd();
      }
    }

    // Escape markdown characters
    if (!context.inCode) {
      text = this.escapeMarkdown(text);
    }

    return text;
  }
  isBlockBoundary(node) {
    if (!node || node.nodeType !== Node.ELEMENT_NODE) {
      return true;
    }

    const blockElements = [
      'DIV', 'P', 'H1', 'H2', 'H3', 'H4', 'H5', 'H6',
      'UL', 'OL', 'LI', 'BLOCKQUOTE', 'PRE', 'TABLE',
      'HR', 'ARTICLE', 'SECTION', 'HEADER', 'FOOTER',
      'NAV', 'ASIDE', 'MAIN'
    ];

    return blockElements.includes(node.tagName);
  }

  escapeMarkdown(text) {
    // In text-only mode, don't escape characters
    if (this.options.textOnly) {
      return text;
    }

    // Escape special markdown characters (backslashes first, so later
    // replacements don't double-escape them)
    return text
      .replace(/\\/g, '\\\\')
      .replace(/\*/g, '\\*')
      .replace(/_/g, '\\_')
      .replace(/\[/g, '\\[')
      .replace(/\]/g, '\\]')
      .replace(/\(/g, '\\(')
      .replace(/\)/g, '\\)')
      .replace(/#/g, '\\#')
      .replace(/\+/g, '\\+')
      .replace(/-/g, '\\-')
      .replace(/\./g, '\\.')
      .replace(/!/g, '\\!')
      .replace(/\|/g, '\\|');
  }
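The escaping chain above is order-sensitive: backslashes must be escaped first, or the backslashes introduced by later rules would themselves be escaped. A standalone sketch of the same chain (lifted out of the class for a quick sanity check; not part of the extension):

```javascript
// Standalone version of the escaping chain above, outside the class.
// Backslashes are escaped first so later replacements don't double-escape.
function escapeMarkdown(text) {
  return text
    .replace(/\\/g, '\\\\')
    .replace(/\*/g, '\\*')
    .replace(/_/g, '\\_')
    .replace(/\[/g, '\\[')
    .replace(/\]/g, '\\]')
    .replace(/\(/g, '\\(')
    .replace(/\)/g, '\\)')
    .replace(/#/g, '\\#')
    .replace(/\+/g, '\\+')
    .replace(/-/g, '\\-')
    .replace(/\./g, '\\.')
    .replace(/!/g, '\\!')
    .replace(/\|/g, '\\|');
}

console.log(escapeMarkdown('*emphasis* [link](url)'));
// → \*emphasis\* \[link\]\(url\)
```

Note this escapes more characters than strictly necessary (`.`, `-`, `+`, `!` only matter in specific positions), which is safe but makes ordinary prose noisier.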
  async convertHeading(element, level, context) {
    const text = await this.getTextContent(element, context);
    return '#'.repeat(level) + ' ' + text + '\n';
  }

  async convertParagraph(element, context) {
    const content = await this.processChildren(element, context);
    return content.trim() ? content + '\n' : '';
  }

  async convertLink(element, context) {
    if (!this.options.preserveLinks || this.options.textOnly) {
      return await this.getTextContent(element, context);
    }

    const text = await this.getTextContent(element, context);
    const href = element.getAttribute('href');
    const title = element.getAttribute('title');

    if (!href) {
      return text;
    }

    // Convert relative URLs to absolute
    const absoluteUrl = this.makeAbsoluteUrl(href);

    // Emit an inline link, including the title attribute if present
    if (text && absoluteUrl) {
      if (title) {
        return `[${text}](${absoluteUrl} "${title}")`;
      } else {
        return `[${text}](${absoluteUrl})`;
      }
    }

    return text;
  }

  async convertImage(element, context) {
    if (!this.options.includeImages || this.options.textOnly) {
      // In text-only mode, return alt text if available
      if (this.options.textOnly) {
        const alt = element.getAttribute('alt');
        return alt ? `[Image: ${alt}]` : '';
      }
      return '';
    }

    const src = element.getAttribute('src');
    const alt = element.getAttribute('alt') || '';
    const title = element.getAttribute('title');

    if (!src) {
      return '';
    }

    // Convert relative URLs to absolute
    const absoluteUrl = this.makeAbsoluteUrl(src);

    if (title) {
      return `![${alt}](${absoluteUrl} "${title}")`;
    } else {
      return `![${alt}](${absoluteUrl})`;
    }
  }
  async convertList(element, type, context) {
    const oldDepth = context.listDepth;
    context.listDepth++;

    const items = [];
    for (const child of element.children) {
      if (child.tagName === 'LI') {
        const markdown = await this.convertListItem(child, { ...context, listType: type });
        if (markdown) {
          items.push(markdown);
        }
      }
    }

    context.listDepth = oldDepth;

    return items.join('\n') + (context.listDepth === 0 ? '\n' : '');
  }

  async convertListItem(element, context) {
    const indent = '  '.repeat(Math.max(0, context.listDepth - 1));
    const bullet = context.listType === 'ol' ? '1.' : '-';
    const content = (await this.processChildren(element, context)).trim();

    return `${indent}${bullet} ${content}`;
  }
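The nesting logic above indents each list level relative to its depth. The diff view collapses runs of spaces, so the indent string is assumed here to be a two-space unit (markdown needs at least two spaces per level for nesting to register). A standalone sketch of just the line-formatting step:

```javascript
// Standalone sketch of the list-item formatting above.
// Assumes a two-space indent unit per nesting level (the diff view
// collapses whitespace, so the exact original string is not visible).
function listItemLine(listDepth, listType, content) {
  const indent = '  '.repeat(Math.max(0, listDepth - 1));
  const bullet = listType === 'ol' ? '1.' : '-';
  return `${indent}${bullet} ${content}`;
}

console.log(listItemLine(1, 'ul', 'top level'));
// → - top level
console.log(listItemLine(2, 'ol', 'nested item'));
// → "  1. nested item" (indented two spaces)
```

Emitting `1.` for every ordered item is valid markdown; renderers renumber sequentially.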
  async convertTable(element, context) {
    if (!this.options.preserveTables || this.options.textOnly) {
      // Fall back to a simple text representation
      return await this.convertTableToText(element, context);
    }

    const rows = [];
    const headerRows = [];
    let maxCols = 0;

    // Process table rows
    for (const child of element.children) {
      if (child.tagName === 'THEAD') {
        for (const row of child.children) {
          if (row.tagName === 'TR') {
            const cells = await this.processTableRow(row, context);
            headerRows.push(cells);
            maxCols = Math.max(maxCols, cells.length);
          }
        }
      } else if (child.tagName === 'TBODY') {
        for (const row of child.children) {
          if (row.tagName === 'TR') {
            const cells = await this.processTableRow(row, context);
            rows.push(cells);
            maxCols = Math.max(maxCols, cells.length);
          }
        }
      } else if (child.tagName === 'TR') {
        const cells = await this.processTableRow(child, context);
        rows.push(cells);
        maxCols = Math.max(maxCols, cells.length);
      }
    }

    // Build the markdown table
    const markdownRows = [];

    // Add headers
    if (headerRows.length > 0) {
      for (const headerRow of headerRows) {
        const paddedRow = this.padTableRow(headerRow, maxCols);
        markdownRows.push('| ' + paddedRow.join(' | ') + ' |');
      }

      // Add the separator row
      const separator = Array(maxCols).fill('---');
      markdownRows.push('| ' + separator.join(' | ') + ' |');
    }

    // Add body rows
    for (const row of rows) {
      const paddedRow = this.padTableRow(row, maxCols);
      markdownRows.push('| ' + paddedRow.join(' | ') + ' |');
    }

    return markdownRows.join('\n') + '\n';
  }

  async processTableRow(row, context) {
    const cells = [];

    for (const cell of row.children) {
      if (cell.tagName === 'TD' || cell.tagName === 'TH') {
        const content = (await this.getTextContent(cell, context)).trim();
        cells.push(content);
      }
    }

    return cells;
  }

  async convertTableRow(element, context) {
    // Convert a single table row to markdown
    if (this.options.textOnly) {
      const cells = await this.processTableRow(element, context);
      return cells.join(' ');
    }

    // For non-text-only mode, create a simple table representation
    const cells = await this.processTableRow(element, context);
    return '| ' + cells.join(' | ') + ' |';
  }

  padTableRow(row, targetLength) {
    const padded = [...row];
    while (padded.length < targetLength) {
      padded.push('');
    }
    return padded;
  }

  async convertTableToText(element, context) {
    // Convert the table to a clean text representation
    const lines = [];
    const rows = element.querySelectorAll('tr');

    for (const row of rows) {
      const cells = row.querySelectorAll('td, th');
      const cellTexts = [];

      for (const cell of cells) {
        const text = (await this.getTextContent(cell, context)).trim();
        if (text) {
          cellTexts.push(text);
        }
      }

      if (cellTexts.length > 0) {
        // Join cells with a space, handling common patterns
        lines.push(cellTexts.join(' '));
      }
    }

    return lines.join('\n');
  }
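The table builder pads every row to the widest row's column count so the pipe table stays rectangular. The padding and row-joining steps can be sketched standalone (outside the class, for illustration only):

```javascript
// Standalone sketch of the row padding and joining used by convertTable.
function padTableRow(row, targetLength) {
  const padded = [...row];
  while (padded.length < targetLength) {
    padded.push(''); // empty cell for missing columns
  }
  return padded;
}

function toMarkdownRow(cells, maxCols) {
  return '| ' + padTableRow(cells, maxCols).join(' | ') + ' |';
}

console.log(toMarkdownRow(['Name', 'Price'], 2));
// → | Name | Price |
console.log(toMarkdownRow(['a'], 2));
// → | a |  |
```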
  async convertBlockquote(element, context) {
    const lines = (await this.processChildren(element, context)).trim().split('\n');
    return lines.map(line => '> ' + line).join('\n') + '\n';
  }

  async convertPreformatted(element, context) {
    const oldInCode = context.inCode;
    const oldPreserveWhitespace = context.preserveWhitespace;

    context.inCode = true;
    context.preserveWhitespace = true;

    let content = '';
    let language = '';

    // Check if this is a code block with a language hint
    const codeElement = element.querySelector('code');
    if (codeElement) {
      // Try to detect the language from the class name
      const className = codeElement.className;
      const langMatch = className.match(/language-(\w+)/);
      if (langMatch) {
        language = langMatch[1];
      }

      content = codeElement.textContent;
    } else {
      content = element.textContent;
    }

    context.inCode = oldInCode;
    context.preserveWhitespace = oldPreserveWhitespace;

    // Use fenced code blocks
    return '```' + language + '\n' + content + '\n```\n';
  }

  async convertCode(element, context) {
    if (element.parentElement && element.parentElement.tagName === 'PRE') {
      // Already handled by convertPreformatted
      return element.textContent;
    }

    const content = element.textContent;
    return '`' + content + '`';
  }

  async convertDiv(element, context) {
    // Check for special div types
    if (element.className.includes('code-block') ||
        element.className.includes('highlight')) {
      return await this.convertPreformatted(element, context);
    }

    const content = await this.processChildren(element, context);
    return content.trim() ? content + '\n' : '';
  }

  async convertSpan(element, context) {
    // Check for special span types
    if (element.className.includes('code') ||
        element.className.includes('inline-code')) {
      return this.convertCode(element, context);
    }

    return await this.processChildren(element, context);
  }
  async convertArticle(element, context) {
    const content = await this.processChildren(element, context);
    return content.trim() ? content + '\n' : '';
  }

  async convertSection(element, context) {
    const content = await this.processChildren(element, context);
    return content.trim() ? content + '\n' : '';
  }

  async convertFigure(element, context) {
    const content = await this.processChildren(element, context);
    return content.trim() ? content + '\n' : '';
  }

  async convertFigCaption(element, context) {
    const caption = await this.getTextContent(element, context);
    return caption ? '\n*' + caption + '*\n' : '';
  }

  async convertVideo(element, context) {
    const title = element.getAttribute('title') || 'Video';

    if (this.options.textOnly) {
      return `[Video: ${title}]`;
    }

    const src = element.getAttribute('src');
    const poster = element.getAttribute('poster');

    if (!src) {
      return '';
    }

    // Link the video, using the poster image as the link body if available
    if (poster) {
      const absolutePoster = this.makeAbsoluteUrl(poster);
      const absoluteSrc = this.makeAbsoluteUrl(src);
      return `[![${title}](${absolutePoster})](${absoluteSrc})`;
    } else {
      const absoluteSrc = this.makeAbsoluteUrl(src);
      return `[${title}](${absoluteSrc})`;
    }
  }

  async convertIframe(element, context) {
    const title = element.getAttribute('title') || 'Embedded content';

    if (this.options.textOnly) {
      const src = element.getAttribute('src') || '';
      if (src.includes('youtube.com') || src.includes('youtu.be')) {
        return `[Video: ${title}]`;
      } else if (src.includes('vimeo.com')) {
        return `[Video: ${title}]`;
      } else {
        return `[Embedded: ${title}]`;
      }
    }

    const src = element.getAttribute('src');
    if (!src) {
      return '';
    }

    // Check for common embeds
    if (src.includes('youtube.com') || src.includes('youtu.be')) {
      return `[▶️ ${title}](${src})`;
    } else if (src.includes('vimeo.com')) {
      return `[▶️ ${title}](${src})`;
    } else {
      return `[${title}](${src})`;
    }
  }
  async convertDefinitionList(element, context) {
    return await this.processChildren(element, context) + '\n';
  }

  async convertDefinitionTerm(element, context) {
    const term = await this.getTextContent(element, context);
    return '**' + term + '**\n';
  }

  async convertDefinitionDescription(element, context) {
    const description = await this.processChildren(element, context);
    return ': ' + description + '\n';
  }

  async getTextContent(element, context) {
    // Special handling for elements that might contain other markdown
    if (context.inCode) {
      return element.textContent;
    }

    return await this.processChildren(element, context);
  }
  makeAbsoluteUrl(url) {
    if (!url) return '';

    try {
      // Already absolute
      if (url.startsWith('http://') || url.startsWith('https://')) {
        return url;
      }

      // Handle protocol-relative URLs
      if (url.startsWith('//')) {
        return window.location.protocol + url;
      }

      // Convert relative to absolute
      const base = window.location.origin;
      const path = window.location.pathname;

      if (url.startsWith('/')) {
        return base + url;
      } else {
        // Relative to the current path
        const pathDir = path.substring(0, path.lastIndexOf('/') + 1);
        return base + pathDir + url;
      }
    } catch (e) {
      return url;
    }
  }

  isHidden(element) {
    const style = window.getComputedStyle(element);
    return style.display === 'none' ||
           style.visibility === 'hidden' ||
           style.opacity === '0';
  }

  generateReferences() {
    return this.conversionContext.references
      .map((ref, index) => `[${index + 1}]: ${ref.url}`)
      .join('\n');
  }
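The URL resolution above reads `window.location` directly, so it only runs in a page context. The same branching can be sketched with the base and path passed in explicitly (parameters here are illustrative; the sketch also assumes an https page for the protocol-relative branch):

```javascript
// Standalone sketch of the relative-URL resolution above, with an
// explicit origin/pathname instead of window.location.
function makeAbsoluteUrl(url, origin, pathname) {
  if (!url) return '';
  if (url.startsWith('http://') || url.startsWith('https://')) return url;
  if (url.startsWith('//')) return 'https:' + url; // assumes an https page
  if (url.startsWith('/')) return origin + url;
  // Relative to the current path: keep the directory portion of pathname
  const pathDir = pathname.substring(0, pathname.lastIndexOf('/') + 1);
  return origin + pathDir + url;
}

console.log(makeAbsoluteUrl('img.png', 'https://example.com', '/docs/page.html'));
// → https://example.com/docs/img.png
```

The built-in `new URL(url, base)` does this resolution (plus `../` handling) in one call; the hand-rolled version mirrors the extension's approach.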
  postProcess(markdown) {
    // Apply text-only specific processing
    if (this.options.textOnly) {
      markdown = this.postProcessTextOnly(markdown);
    }

    // Clean up excessive newlines
    markdown = markdown.replace(/\n{3,}/g, '\n\n');

    // Clean up spaces before punctuation
    markdown = markdown.replace(/ +([.,;:!?])/g, '$1');

    // Ensure proper spacing around headers
    markdown = markdown.replace(/\n(#{1,6} )/g, '\n\n$1');
    markdown = markdown.replace(/(#{1,6} .+)\n(?![\n#])/g, '$1\n\n');

    // Clean up list spacing
    markdown = markdown.replace(/\n\n(-|\d+\.) /g, '\n$1 ');

    // Trim the final result
    return markdown.trim();
  }

  postProcessTextOnly(markdown) {
    // Smart pattern recognition for common formats
    const lines = markdown.split('\n');
    const processedLines = [];
    let inMetadata = false;
    let currentItem = null;

    for (let i = 0; i < lines.length; i++) {
      const line = lines[i].trim();
      if (!line) {
        processedLines.push('');
        continue;
      }

      // Detect numbered list items (common on HN, Reddit, etc.)
      const numberPattern = /^(\d+)\.\s*(.+)$/;
      const numberMatch = line.match(numberPattern);

      if (numberMatch) {
        // Start of a new numbered item
        inMetadata = false;
        currentItem = numberMatch[1];
        const content = numberMatch[2];

        // Check if the content has a domain in parentheses
        const domainPattern = /^(.+?)\s*\(([^)]+)\)\s*(.*)$/;
        const domainMatch = content.match(domainPattern);

        if (domainMatch) {
          const [, title, domain, rest] = domainMatch;
          processedLines.push(`${currentItem}. **${title.trim()}** (${domain})`);
          if (rest.trim()) {
            processedLines.push(` ${rest.trim()}`);
            inMetadata = true;
          }
        } else {
          processedLines.push(`${currentItem}. **${content}**`);
        }
      } else if (line.match(/\b(points?|by|ago|hide|comments?)\b/i) && currentItem) {
        // This looks like metadata for the current item
        const cleanedLine = line
          .replace(/\s+/g, ' ')
          .replace(/\s*\|\s*/g, ' | ')
          .trim();
        processedLines.push(` ${cleanedLine}`);
        inMetadata = true;
      } else if (inMetadata && line.length < 100) {
        // Continue metadata if we're in metadata mode and the line is short
        processedLines.push(` ${line}`);
      } else {
        // Regular content
        inMetadata = false;
        processedLines.push(line);
      }
    }

    // Clean up the output
    let result = processedLines.join('\n');

    // Remove excessive blank lines
    result = result.replace(/\n{3,}/g, '\n\n');

    // Ensure proper spacing after numbered items
    result = result.replace(/^(\d+\..+)$\n^(?!\s)/gm, '$1\n\n');

    return result;
  }
}
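The post-processing pass is plain string rewriting with no DOM dependency, so three of its cleanup rules can be exercised standalone (a trimmed-down sketch, not the full method):

```javascript
// Standalone sketch of three cleanup rules from postProcess above:
// collapse blank-line runs, drop spaces before punctuation, and
// guarantee a blank line before headings.
function postProcess(markdown) {
  markdown = markdown.replace(/\n{3,}/g, '\n\n');        // collapse 3+ newlines to 2
  markdown = markdown.replace(/ +([.,;:!?])/g, '$1');    // "word ." → "word."
  markdown = markdown.replace(/\n(#{1,6} )/g, '\n\n$1'); // blank line before headings
  return markdown.trim();
}

console.log(postProcess('Hello , world !'));
// → Hello, world!
```

Note the ordering quirk: because the newline collapse runs before the heading rule, a heading that already had a blank line before it can end up with an extra one (the full method relies on a later trim and doesn't re-collapse).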
docs/md_v2/apps/crawl4ai-assistant/content/schemaBuilder.js — new file, 1812 lines (diff suppressed because it is too large)

docs/md_v2/apps/crawl4ai-assistant/content/schemaBuilder_v1.js — new file, 608 lines
@@ -0,0 +1,608 @@
// SchemaBuilder class for the Crawl4AI Chrome Extension
class SchemaBuilder {
  constructor() {
    this.mode = null;
    this.container = null;
    this.fields = [];
    this.overlay = null;
    this.toolbar = null;
    this.highlightBox = null;
    this.selectedElements = new Set();
    this.isPaused = false;
    this.codeModal = null;

    this.handleMouseMove = this.handleMouseMove.bind(this);
    this.handleClick = this.handleClick.bind(this);
    this.handleKeyPress = this.handleKeyPress.bind(this);
  }

  start() {
    this.mode = 'container';
    this.createOverlay();
    this.createToolbar();
    this.attachEventListeners();
    this.updateToolbar();
  }

  stop() {
    this.detachEventListeners();
    this.overlay?.remove();
    this.toolbar?.remove();
    this.highlightBox?.remove();
    this.removeAllHighlights();
    this.mode = null;
    this.container = null;
    this.fields = [];
    this.selectedElements.clear();
  }
  createOverlay() {
    // Create the highlight box
    this.highlightBox = document.createElement('div');
    this.highlightBox.className = 'c4ai-highlight-box';
    document.body.appendChild(this.highlightBox);
  }

  createToolbar() {
    this.toolbar = document.createElement('div');
    this.toolbar.className = 'c4ai-toolbar';
    this.toolbar.innerHTML = `
      <div class="c4ai-toolbar-titlebar">
        <div class="c4ai-titlebar-dots">
          <button class="c4ai-dot c4ai-dot-close" id="c4ai-close"></button>
          <button class="c4ai-dot c4ai-dot-minimize"></button>
          <button class="c4ai-dot c4ai-dot-maximize"></button>
        </div>
        <img src="${chrome.runtime.getURL('icons/icon-16.png')}" class="c4ai-titlebar-icon" alt="Crawl4AI">
        <div class="c4ai-titlebar-title">Crawl4AI Schema Builder</div>
      </div>
      <div class="c4ai-toolbar-content">
        <div class="c4ai-toolbar-status">
          <div class="c4ai-status-item">
            <span class="c4ai-status-label">Mode:</span>
            <span class="c4ai-status-value" id="c4ai-mode">Select Container</span>
          </div>
          <div class="c4ai-status-item">
            <span class="c4ai-status-label">Container:</span>
            <span class="c4ai-status-value" id="c4ai-container">Not selected</span>
          </div>
        </div>
        <div class="c4ai-fields-list" id="c4ai-fields-list" style="display: none;">
          <div class="c4ai-fields-header">Selected Fields:</div>
          <ul class="c4ai-fields-items" id="c4ai-fields-items"></ul>
        </div>
        <div class="c4ai-toolbar-hint" id="c4ai-hint">
          Click on a container element (e.g., product card, article, etc.)
        </div>
        <div class="c4ai-toolbar-actions">
          <button id="c4ai-pause" class="c4ai-action-btn c4ai-pause-btn">
            <span class="c4ai-pause-icon">⏸</span> Pause
          </button>
          <button id="c4ai-generate" class="c4ai-action-btn c4ai-generate-btn">
            <span class="c4ai-generate-icon">⚡</span> Generate Code
          </button>
        </div>
      </div>
    `;
    document.body.appendChild(this.toolbar);

    // Wire up the toolbar buttons
    document.getElementById('c4ai-pause').addEventListener('click', () => this.togglePause());
    document.getElementById('c4ai-generate').addEventListener('click', () => this.stopAndGenerate());
    document.getElementById('c4ai-close').addEventListener('click', () => this.stop());

    // Make the toolbar draggable
    window.C4AI_Utils.makeDraggable(this.toolbar);
  }
  attachEventListeners() {
    document.addEventListener('mousemove', this.handleMouseMove, true);
    document.addEventListener('click', this.handleClick, true);
    document.addEventListener('keydown', this.handleKeyPress, true);
  }

  detachEventListeners() {
    document.removeEventListener('mousemove', this.handleMouseMove, true);
    document.removeEventListener('click', this.handleClick, true);
    document.removeEventListener('keydown', this.handleKeyPress, true);
  }

  handleMouseMove(e) {
    if (this.isPaused) return;

    const element = document.elementFromPoint(e.clientX, e.clientY);
    if (element && !this.isOurElement(element)) {
      this.highlightElement(element);
    }
  }

  handleClick(e) {
    if (this.isPaused) return;

    const element = e.target;

    if (this.isOurElement(element)) {
      return;
    }

    e.preventDefault();
    e.stopPropagation();

    if (this.mode === 'container') {
      this.selectContainer(element);
    } else if (this.mode === 'field') {
      this.selectField(element);
    }
  }

  handleKeyPress(e) {
    if (e.key === 'Escape') {
      this.stop();
    }
  }

  isOurElement(element) {
    return window.C4AI_Utils.isOurElement(element);
  }
  togglePause() {
    this.isPaused = !this.isPaused;
    const pauseBtn = document.getElementById('c4ai-pause');
    if (this.isPaused) {
      pauseBtn.innerHTML = '<span class="c4ai-play-icon">▶</span> Resume';
      pauseBtn.classList.add('c4ai-paused');
      this.highlightBox.style.display = 'none';
    } else {
      pauseBtn.innerHTML = '<span class="c4ai-pause-icon">⏸</span> Pause';
      pauseBtn.classList.remove('c4ai-paused');
    }
  }

  stopAndGenerate() {
    if (!this.container || this.fields.length === 0) {
      alert('Please select a container and at least one field before generating code.');
      return;
    }

    const code = this.generateCode();
    this.showCodeModal(code);
  }
  highlightElement(element) {
    const rect = element.getBoundingClientRect();
    this.highlightBox.style.cssText = `
      left: ${rect.left + window.scrollX}px;
      top: ${rect.top + window.scrollY}px;
      width: ${rect.width}px;
      height: ${rect.height}px;
      display: block;
    `;

    if (this.mode === 'container') {
      this.highlightBox.className = 'c4ai-highlight-box c4ai-container-mode';
    } else {
      this.highlightBox.className = 'c4ai-highlight-box c4ai-field-mode';
    }
  }

  selectContainer(element) {
    // Remove the previous container highlight
    if (this.container) {
      this.container.element.classList.remove('c4ai-selected-container');
    }

    this.container = {
      element: element,
      html: element.outerHTML,
      selector: this.generateSelector(element),
      tagName: element.tagName.toLowerCase()
    };

    element.classList.add('c4ai-selected-container');
    this.mode = 'field';
    this.updateToolbar();
    this.updateStats();
  }

  selectField(element) {
    // Don't select the container itself
    if (element === this.container.element) {
      return;
    }

    // If already selected, deselect it
    if (this.selectedElements.has(element)) {
      this.deselectField(element);
      return;
    }

    // Must be inside the container
    if (!this.container.element.contains(element)) {
      return;
    }

    this.showFieldDialog(element);
  }
  deselectField(element) {
    // Remove from the fields array
    this.fields = this.fields.filter(f => f.element !== element);

    // Remove from the selected-elements set
    this.selectedElements.delete(element);

    // Remove the visual selection
    element.classList.remove('c4ai-selected-field');

    // Update the UI
    this.updateToolbar();
    this.updateStats();
  }

  showFieldDialog(element) {
    const dialog = document.createElement('div');
    dialog.className = 'c4ai-field-dialog';

    const rect = element.getBoundingClientRect();
    dialog.style.cssText = `
      left: ${rect.left + window.scrollX}px;
      top: ${rect.bottom + window.scrollY + 10}px;
    `;

    dialog.innerHTML = `
      <div class="c4ai-field-dialog-content">
        <h4>Name this field:</h4>
        <input type="text" id="c4ai-field-name" placeholder="e.g., title, price, description" autofocus>
        <div class="c4ai-field-preview">
          <strong>Content:</strong> ${element.textContent.trim().substring(0, 50)}...
        </div>
        <div class="c4ai-field-actions">
          <button id="c4ai-field-save">Save</button>
          <button id="c4ai-field-cancel">Cancel</button>
        </div>
      </div>
    `;

    document.body.appendChild(dialog);

    const input = dialog.querySelector('#c4ai-field-name');
    const saveBtn = dialog.querySelector('#c4ai-field-save');
    const cancelBtn = dialog.querySelector('#c4ai-field-cancel');

    const save = () => {
      const fieldName = input.value.trim();
      if (fieldName) {
        this.fields.push({
          name: fieldName,
          value: element.textContent.trim(),
          element: element,
          selector: this.generateSelector(element, this.container.element)
        });

        element.classList.add('c4ai-selected-field');
        this.selectedElements.add(element);
        this.updateToolbar();
        this.updateStats();
      }
      dialog.remove();
    };

    const cancel = () => {
      dialog.remove();
    };

    saveBtn.addEventListener('click', save);
    cancelBtn.addEventListener('click', cancel);
    input.addEventListener('keypress', (e) => {
      if (e.key === 'Enter') save();
      if (e.key === 'Escape') cancel();
    });

    input.focus();
  }
generateSelector(element, context = document) {
|
||||
// Try to generate a robust selector
|
||||
if (element.id) {
|
||||
return `#${CSS.escape(element.id)}`;
|
||||
}
|
||||
|
||||
// Check for data attributes (most stable)
|
||||
const dataAttrs = ['data-testid', 'data-id', 'data-test', 'data-cy'];
|
||||
for (const attr of dataAttrs) {
|
||||
const value = element.getAttribute(attr);
|
||||
if (value) {
|
||||
return `[${attr}="${value}"]`;
|
||||
}
|
||||
}
|
||||
|
||||
// Check for aria-label
|
||||
if (element.getAttribute('aria-label')) {
|
||||
return `[aria-label="${element.getAttribute('aria-label')}"]`;
|
||||
}
|
||||
|
||||
// Try semantic HTML elements with text
|
||||
const tagName = element.tagName.toLowerCase();
|
||||
if (['button', 'a', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6'].includes(tagName)) {
|
||||
const text = element.textContent.trim();
|
||||
if (text && text.length < 50) {
|
||||
// Use tag name with partial text match
|
||||
return `${tagName}`;
|
||||
}
|
||||
}
|
||||
|
||||
// Check for simple, non-utility classes
|
||||
const classes = Array.from(element.classList)
|
||||
.filter(c => !c.startsWith('c4ai-')) // Exclude our classes
|
||||
.filter(c => !c.includes('[') && !c.includes('(') && !c.includes(':')) // Exclude utility classes
|
||||
.filter(c => c.length < 30); // Exclude very long classes
|
||||
|
||||
if (classes.length > 0 && classes.length <= 3) {
|
||||
const selector = classes.map(c => `.${CSS.escape(c)}`).join('');
|
||||
try {
|
||||
if (context.querySelectorAll(selector).length === 1) {
|
||||
return selector;
|
||||
}
|
||||
} catch (e) {
|
||||
// Invalid selector, continue
|
||||
}
|
||||
}
|
||||
|
||||
// Use nth-child with simple parent tag
|
||||
const parent = element.parentElement;
|
||||
if (parent && parent !== context) {
|
||||
const siblings = Array.from(parent.children);
|
||||
const index = siblings.indexOf(element) + 1;
|
||||
// Just use parent tag name to avoid recursion
|
||||
const parentTag = parent.tagName.toLowerCase();
|
||||
return `${parentTag} > ${tagName}:nth-child(${index})`;
|
||||
}
|
||||
|
||||
// Final fallback
|
||||
return tagName;
|
||||
}
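The fallback chain above is easiest to see with concrete inputs. A DOM-free sketch of the same priority order (the plain-object element shape and the `pickSelector` name are hypothetical, for illustration only):

```javascript
// Hypothetical mini-harness mirroring generateSelector's priority order:
// id > data attributes > aria-label > bare tag fallback.
function pickSelector(el) {
  if (el.id) return `#${el.id}`;
  for (const attr of ['data-testid', 'data-id', 'data-test', 'data-cy']) {
    if (el.attrs && el.attrs[attr]) return `[${attr}="${el.attrs[attr]}"]`;
  }
  if (el.attrs && el.attrs['aria-label']) return `[aria-label="${el.attrs['aria-label']}"]`;
  return el.tag;
}

console.log(pickSelector({ id: 'buy-now', tag: 'button' }));                   // "#buy-now"
console.log(pickSelector({ attrs: { 'data-testid': 'price' }, tag: 'span' })); // '[data-testid="price"]'
console.log(pickSelector({ tag: 'h3' }));                                      // "h3"
```

The real method additionally verifies class selectors for uniqueness with `querySelectorAll`, which needs a live DOM.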
updateToolbar() {
  document.getElementById('c4ai-mode').textContent =
    this.mode === 'container' ? 'Select Container' : 'Select Fields';

  document.getElementById('c4ai-container').textContent =
    this.container ? `${this.container.tagName} ✓` : 'Not selected';

  // Update fields list
  const fieldsList = document.getElementById('c4ai-fields-list');
  const fieldsItems = document.getElementById('c4ai-fields-items');

  if (this.fields.length > 0) {
    fieldsList.style.display = 'block';
    fieldsItems.innerHTML = this.fields.map(field => `
      <li class="c4ai-field-item">
        <span class="c4ai-field-name">${field.name}</span>
        <span class="c4ai-field-value">${field.value.substring(0, 30)}${field.value.length > 30 ? '...' : ''}</span>
      </li>
    `).join('');
  } else {
    fieldsList.style.display = 'none';
  }

  const hint = document.getElementById('c4ai-hint');
  if (this.mode === 'container') {
    hint.textContent = 'Click on a container element (e.g., product card, article, etc.)';
  } else if (this.fields.length === 0) {
    hint.textContent = 'Click on fields inside the container to extract (title, price, etc.)';
  } else {
    hint.innerHTML = `Continue selecting fields or click <strong>Stop & Generate</strong> to finish.`;
  }
}

updateStats() {
  chrome.runtime.sendMessage({
    action: 'updateStats',
    stats: {
      container: !!this.container,
      fields: this.fields.length
    }
  });
}

removeAllHighlights() {
  document.querySelectorAll('.c4ai-selected-container').forEach(el => {
    el.classList.remove('c4ai-selected-container');
  });
  document.querySelectorAll('.c4ai-selected-field').forEach(el => {
    el.classList.remove('c4ai-selected-field');
  });
}
generateCode() {
  const fieldDescriptions = this.fields.map(f =>
    `- ${f.name} (example: "${f.value.substring(0, 50)}...")`
  ).join('\n');

  return `#!/usr/bin/env python3
"""
Generated by Crawl4AI Chrome Extension
URL: ${window.location.href}
Generated: ${new Date().toISOString()}
"""

import asyncio
import json
from pathlib import Path
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

# HTML snippet of the selected container element
HTML_SNIPPET = """
${this.container.html}
"""

# Extraction query based on your field selections
EXTRACTION_QUERY = """
Create a JSON CSS extraction schema to extract the following fields:
${fieldDescriptions}

The schema should handle multiple ${this.container.tagName} elements on the page.
Each item should be extracted as a separate object in the results array.
"""

async def generate_schema():
    """Generate extraction schema using LLM"""
    print("🔧 Generating extraction schema...")

    try:
        # Generate the schema using Crawl4AI's built-in LLM integration
        schema = JsonCssExtractionStrategy.generate_schema(
            html=HTML_SNIPPET,
            query=EXTRACTION_QUERY,
        )

        # Save the schema for reuse
        schema_path = Path('generated_schema.json')
        with open(schema_path, 'w') as f:
            json.dump(schema, f, indent=2)

        print("✅ Schema generated successfully!")
        print(f"📄 Schema saved to: {schema_path}")
        print("\\nGenerated schema:")
        print(json.dumps(schema, indent=2))

        return schema

    except Exception as e:
        print(f"❌ Error generating schema: {e}")
        return None

async def test_extraction(url: str = "${window.location.href}"):
    """Test the generated schema on the actual webpage"""
    print("\\n🧪 Testing extraction on live webpage...")

    # Load the generated schema
    try:
        with open('generated_schema.json', 'r') as f:
            schema = json.load(f)
    except FileNotFoundError:
        print("❌ Schema file not found. Run generate_schema() first.")
        return

    # Configure browser
    browser_config = BrowserConfig(
        headless=True,
        verbose=False
    )

    # Configure extraction
    crawler_config = CrawlerRunConfig(
        extraction_strategy=JsonCssExtractionStrategy(schema=schema)
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url=url,
            config=crawler_config
        )

        if result.success and result.extracted_content:
            data = json.loads(result.extracted_content)
            print(f"\\n✅ Successfully extracted {len(data)} items!")

            # Save results
            with open('extracted_data.json', 'w') as f:
                json.dump(data, f, indent=2)

            # Show sample results
            print("\\n📊 Sample results (first 2 items):")
            for i, item in enumerate(data[:2], 1):
                print(f"\\nItem {i}:")
                for key, value in item.items():
                    print(f"  {key}: {value}")
        else:
            print("❌ Extraction failed:", result.error_message)

if __name__ == "__main__":
    # Step 1: Generate the schema from HTML snippet
    asyncio.run(generate_schema())

    # Step 2: Test extraction on the live webpage
    # Uncomment the line below to test extraction:
    # asyncio.run(test_extraction())

    print("\\n🎯 Next steps:")
    print("1. Review the generated schema in 'generated_schema.json'")
    print("2. Uncomment the test_extraction() line to test on the live site")
    print("3. Use the schema in your Crawl4AI projects!")
`;
}

showCodeModal(code) {
  // Create modal
  this.codeModal = document.createElement('div');
  this.codeModal.className = 'c4ai-code-modal';
  this.codeModal.innerHTML = `
    <div class="c4ai-code-modal-content">
      <div class="c4ai-code-modal-header">
        <h2>Generated Python Code</h2>
        <button class="c4ai-close-modal" id="c4ai-close-modal">✕</button>
      </div>
      <div class="c4ai-code-modal-body">
        <pre class="c4ai-code-block"><code class="language-python">${window.C4AI_Utils.escapeHtml(code)}</code></pre>
      </div>
      <div class="c4ai-code-modal-footer">
        <button class="c4ai-action-btn c4ai-cloud-btn" id="c4ai-run-cloud" disabled>
          <span>☁️</span> Run on C4AI Cloud (Coming Soon)
        </button>
        <button class="c4ai-action-btn c4ai-download-btn" id="c4ai-download-code">
          <span>⬇</span> Download Code
        </button>
        <button class="c4ai-action-btn c4ai-copy-btn" id="c4ai-copy-code">
          <span>📋</span> Copy to Clipboard
        </button>
      </div>
    </div>
  `;

  document.body.appendChild(this.codeModal);

  // Add event listeners
  document.getElementById('c4ai-close-modal').addEventListener('click', () => {
    this.codeModal.remove();
    this.codeModal = null;
    // Don't stop the capture session
  });

  document.getElementById('c4ai-download-code').addEventListener('click', () => {
    chrome.runtime.sendMessage({
      action: 'downloadCode',
      code: code,
      filename: `crawl4ai_schema_${Date.now()}.py`
    }, (response) => {
      if (response && response.success) {
        const btn = document.getElementById('c4ai-download-code');
        const originalHTML = btn.innerHTML;
        btn.innerHTML = '<span>✓</span> Downloaded!';
        setTimeout(() => {
          btn.innerHTML = originalHTML;
        }, 2000);
      } else {
        console.error('Download failed:', response?.error);
        alert('Download failed. Please check your browser settings.');
      }
    });
  });

  document.getElementById('c4ai-copy-code').addEventListener('click', () => {
    navigator.clipboard.writeText(code).then(() => {
      const btn = document.getElementById('c4ai-copy-code');
      btn.innerHTML = '<span>✓</span> Copied!';
      setTimeout(() => {
        btn.innerHTML = '<span>📋</span> Copy to Clipboard';
      }, 2000);
    });
  });

  // Apply syntax highlighting
  window.C4AI_Utils.applySyntaxHighlighting(this.codeModal.querySelector('.language-python'));
}
}
docs/md_v2/apps/crawl4ai-assistant/content/scriptBuilder.js (new file, 2515 lines)
File diff suppressed because it is too large
docs/md_v2/apps/crawl4ai-assistant/content/shared/utils.js (new file, 253 lines)
@@ -0,0 +1,253 @@
// Shared utilities for Crawl4AI Chrome Extension

// Make element draggable by its titlebar
function makeDraggable(element) {
  let isDragging = false;
  let startX, startY, initialX, initialY;

  const titlebar = element.querySelector('.c4ai-toolbar-titlebar, .c4ai-titlebar');
  if (!titlebar) return;

  titlebar.addEventListener('mousedown', (e) => {
    // Don't drag if clicking on buttons
    if (e.target.classList.contains('c4ai-dot') || e.target.closest('button')) return;

    isDragging = true;
    startX = e.clientX;
    startY = e.clientY;

    const rect = element.getBoundingClientRect();
    initialX = rect.left;
    initialY = rect.top;

    element.style.transition = 'none';
    titlebar.style.cursor = 'grabbing';
  });

  document.addEventListener('mousemove', (e) => {
    if (!isDragging) return;

    const deltaX = e.clientX - startX;
    const deltaY = e.clientY - startY;

    element.style.left = `${initialX + deltaX}px`;
    element.style.top = `${initialY + deltaY}px`;
    element.style.right = 'auto';
  });

  document.addEventListener('mouseup', () => {
    if (isDragging) {
      isDragging = false;
      element.style.transition = '';
      titlebar.style.cursor = 'grab';
    }
  });
}
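The mousemove handler resolves to simple vector math: new top-left = initial top-left + (current pointer − pointer at mousedown). A pure-function sketch of just that arithmetic (the `dragPosition` helper is illustrative, not part of the extension):

```javascript
// Pure-function sketch of the drag math used above (hypothetical helper).
// new position = element's initial top-left + (current pointer - pointer at mousedown)
function dragPosition(initial, start, current) {
  return {
    left: initial.x + (current.x - start.x),
    top: initial.y + (current.y - start.y)
  };
}

// Element started at (100, 50); pointer moved from (10, 10) to (25, 5).
console.log(dragPosition({ x: 100, y: 50 }, { x: 10, y: 10 }, { x: 25, y: 5 }));
```

Anchoring to the initial rect rather than accumulating per-event deltas keeps the element from drifting when mousemove events are coalesced.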
// Make element draggable by a specific header element
function makeDraggableByHeader(element) {
  let isDragging = false;
  let startX, startY, initialX, initialY;

  const header = element.querySelector('.c4ai-debugger-header');
  if (!header) return;

  header.addEventListener('mousedown', (e) => {
    // Don't drag if clicking on close button
    if (e.target.id === 'c4ai-close-debugger' || e.target.closest('#c4ai-close-debugger')) return;

    isDragging = true;
    startX = e.clientX;
    startY = e.clientY;

    const rect = element.getBoundingClientRect();
    initialX = rect.left;
    initialY = rect.top;

    element.style.transition = 'none';
    header.style.cursor = 'grabbing';
  });

  document.addEventListener('mousemove', (e) => {
    if (!isDragging) return;

    const deltaX = e.clientX - startX;
    const deltaY = e.clientY - startY;

    element.style.left = `${initialX + deltaX}px`;
    element.style.top = `${initialY + deltaY}px`;
    element.style.right = 'auto';
  });

  document.addEventListener('mouseup', () => {
    if (isDragging) {
      isDragging = false;
      element.style.transition = '';
      header.style.cursor = 'grab';
    }
  });
}

// Escape HTML for safe display
function escapeHtml(text) {
  const div = document.createElement('div');
  div.textContent = text;
  return div.innerHTML;
}
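`escapeHtml` relies on the DOM: assigning `textContent` and reading back `innerHTML` entity-encodes `&`, `<` and `>`. For unit tests outside a browser, a DOM-free sketch with the same behavior (the `escapeHtmlPlain` name is hypothetical):

```javascript
// DOM-free sketch of the same escaping; mirrors what div.innerHTML produces
// for &, < and > (quotes are left alone, as in the DOM version above).
function escapeHtmlPlain(text) {
  return String(text)
    .replace(/&/g, '&amp;')  // must run first so later entities aren't double-escaped
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;');
}

console.log(escapeHtmlPlain('<b>&</b>'));  // → &lt;b&gt;&amp;&lt;/b&gt;
```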
// Apply syntax highlighting to Python code
function applySyntaxHighlighting(codeElement) {
  const code = codeElement.textContent;

  // Split by lines to handle line-by-line
  const lines = code.split('\n');
  const highlightedLines = lines.map(line => {
    let highlightedLine = escapeHtml(line);

    // Skip if line is empty
    if (!highlightedLine.trim()) return highlightedLine;

    // Comments (lines starting with #)
    if (highlightedLine.trim().startsWith('#')) {
      return `<span class="c4ai-comment">${highlightedLine}</span>`;
    }

    // Triple quoted strings
    if (highlightedLine.includes('"""')) {
      highlightedLine = highlightedLine.replace(/(""".*?""")/g, '<span class="c4ai-string">$1</span>');
    }

    // Regular strings - single and double quotes
    highlightedLine = highlightedLine.replace(/(["'])([^"']*)\1/g, '<span class="c4ai-string">$1$2$1</span>');

    // Keywords - only highlight if not inside a string
    const keywords = ['import', 'from', 'async', 'def', 'await', 'try', 'except', 'with', 'as', 'for', 'if', 'else', 'elif', 'return', 'print', 'open', 'and', 'or', 'not', 'in', 'is', 'class', 'self', 'None', 'True', 'False', '__name__', '__main__'];

    keywords.forEach(keyword => {
      // Word boundaries plus a negative lookahead so matches already inside
      // an inserted <span> are skipped
      const regex = new RegExp(`\\b(${keyword})\\b(?![^<]*</span>)`, 'g');
      highlightedLine = highlightedLine.replace(regex, '<span class="c4ai-keyword">$1</span>');
    });

    // Functions (word followed by parenthesis)
    highlightedLine = highlightedLine.replace(/\b([a-zA-Z_]\w*)\s*\(/g, '<span class="c4ai-function">$1</span>(');

    return highlightedLine;
  });

  codeElement.innerHTML = highlightedLines.join('\n');
}
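The `(?![^<]*</span>)` negative lookahead in the keyword pass is the only guard against re-highlighting text that an earlier pass already wrapped in a span: it skips a match when a `</span>` follows with no intervening `<`. Isolated for illustration (a heuristic sketch, not a full tokenizer):

```javascript
// Isolated keyword pass from applySyntaxHighlighting (illustrative only).
function highlightKeyword(line, keyword) {
  // Negative lookahead: skip matches followed by </span> with no '<' in
  // between, i.e. text already sitting inside a highlighted span.
  const regex = new RegExp(`\\b(${keyword})\\b(?![^<]*</span>)`, 'g');
  return line.replace(regex, '<span class="c4ai-keyword">$1</span>');
}

console.log(highlightKeyword('import json', 'import'));
// A keyword already wrapped by the string pass is left untouched:
console.log(highlightKeyword(`<span class="c4ai-string">'import'</span>`, 'import'));
```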
// Apply syntax highlighting to JavaScript code
function applySyntaxHighlightingJS(codeElement) {
  const code = codeElement.textContent;

  // Split by lines to handle line-by-line
  const lines = code.split('\n');
  const highlightedLines = lines.map(line => {
    let highlightedLine = escapeHtml(line);

    // Skip if line is empty
    if (!highlightedLine.trim()) return highlightedLine;

    // Comments
    if (highlightedLine.trim().startsWith('//')) {
      return `<span class="c4ai-comment">${highlightedLine}</span>`;
    }

    // Multi-line comments
    highlightedLine = highlightedLine.replace(/(\/\*.*?\*\/)/g, '<span class="c4ai-comment">$1</span>');

    // Template literals
    highlightedLine = highlightedLine.replace(/(`[^`]*`)/g, '<span class="c4ai-string">$1</span>');

    // Regular strings - single and double quotes
    highlightedLine = highlightedLine.replace(/(["'])([^"']*)\1/g, '<span class="c4ai-string">$1$2$1</span>');

    // Keywords
    const keywords = ['const', 'let', 'var', 'function', 'async', 'await', 'if', 'else', 'for', 'while', 'do', 'switch', 'case', 'break', 'continue', 'return', 'try', 'catch', 'finally', 'throw', 'new', 'this', 'class', 'extends', 'import', 'export', 'default', 'from', 'null', 'undefined', 'true', 'false'];

    keywords.forEach(keyword => {
      const regex = new RegExp(`\\b(${keyword})\\b(?![^<]*</span>)`, 'g');
      highlightedLine = highlightedLine.replace(regex, '<span class="c4ai-keyword">$1</span>');
    });

    // Functions and methods
    highlightedLine = highlightedLine.replace(/\b([a-zA-Z_$][\w$]*)\s*\(/g, '<span class="c4ai-function">$1</span>(');

    // Numbers
    highlightedLine = highlightedLine.replace(/\b(\d+)\b/g, '<span class="c4ai-number">$1</span>');

    return highlightedLine;
  });

  codeElement.innerHTML = highlightedLines.join('\n');
}

// Get element selector
function getElementSelector(element) {
  // Priority: ID > unique class > tag with position
  if (element.id) {
    return `#${element.id}`;
  }

  if (element.className && typeof element.className === 'string') {
    const classes = element.className.split(' ').filter(c => c && !c.startsWith('c4ai-'));
    if (classes.length > 0) {
      const selector = `.${classes[0]}`;
      if (document.querySelectorAll(selector).length === 1) {
        return selector;
      }
    }
  }

  // Build a path selector
  const path = [];
  let current = element;

  while (current && current !== document.body) {
    const tagName = current.tagName.toLowerCase();
    const parent = current.parentElement;

    if (parent) {
      const siblings = Array.from(parent.children);
      const index = siblings.indexOf(current) + 1;

      if (siblings.filter(s => s.tagName === current.tagName).length > 1) {
        path.unshift(`${tagName}:nth-child(${index})`);
      } else {
        path.unshift(tagName);
      }
    } else {
      path.unshift(tagName);
    }

    current = parent;
  }

  return path.join(' > ');
}

// Check if element is part of our extension UI
function isOurElement(element) {
  return element.classList.contains('c4ai-highlight-box') ||
    element.classList.contains('c4ai-toolbar') ||
    element.closest('.c4ai-toolbar') ||
    element.classList.contains('c4ai-script-toolbar') ||
    element.closest('.c4ai-script-toolbar') ||
    element.closest('.c4ai-field-dialog') ||
    element.closest('.c4ai-code-modal') ||
    element.closest('.c4ai-wait-dialog') ||
    element.closest('.c4ai-timeline-modal');
}

// Export utilities
window.C4AI_Utils = {
  makeDraggable,
  makeDraggableByHeader,
  escapeHtml,
  applySyntaxHighlighting,
  applySyntaxHighlightingJS,
  getElementSelector,
  isOurElement
};
@@ -36,6 +36,21 @@
</div>
</section>

<!-- Cloud Announcement Banner -->
<section class="cloud-banner-section">
<div class="cloud-banner">
<div class="cloud-banner-content">
<div class="cloud-banner-text">
<h3>You don't need Puppeteer. You need Crawl4AI Cloud.</h3>
<p>One API call. JS-rendered. No browser cluster to maintain.</p>
</div>
<button class="cloud-banner-btn" id="joinWaitlistBanner">
Get API Key →
</button>
</div>
</div>
</section>

<!-- Introduction -->
<section class="intro-section">
<div class="terminal-window">
@@ -43,13 +58,17 @@
<span class="terminal-title">About Crawl4AI Assistant</span>
</div>
<div class="terminal-content">
<p>Transform any website into structured data with just a few clicks! The Crawl4AI Assistant Chrome Extension provides two powerful tools for web scraping and automation.</p>
<p>Transform any website into structured data with just a few clicks! The Crawl4AI Assistant Chrome Extension provides three powerful tools for web scraping and data extraction.</p>

<div style="background: #0fbbaa; color: #070708; padding: 12px 16px; border-radius: 8px; margin: 16px 0; font-weight: 600;">
🎉 NEW: Schema Builder now extracts data INSTANTLY without any LLM! Test your schema and see JSON results immediately in the browser!
</div>

<div class="features-grid">
<div class="feature-card">
<span class="feature-icon">🎯</span>
<h3>Schema Builder</h3>
<p>Click to select elements and build extraction schemas visually</p>
<p>Extract data instantly without LLMs - see results in real-time!</p>
</div>
<div class="feature-card">
<span class="feature-icon">🔴</span>
@@ -57,15 +76,15 @@
<p>Record browser actions to create automation scripts</p>
</div>
<div class="feature-card">
<span class="feature-icon">📝</span>
<h3>Click2Crawl <span style="color: #0fbbaa; font-size: 0.75rem;">(New!)</span></h3>
<p>Select multiple elements to extract clean markdown "as you see"</p>
</div>
<!-- <div class="feature-card">
<span class="feature-icon">🐍</span>
<h3>Python Code</h3>
<p>Get production-ready Crawl4AI code instantly</p>
</div>
<div class="feature-card">
<span class="feature-icon">🎨</span>
<h3>Beautiful UI</h3>
<p>Draggable toolbar with macOS-style interface</p>
</div>
</div> -->
</div>
</div>
</div>
@@ -134,6 +153,15 @@
</div>
<div class="tool-status alpha">Alpha</div>
</div>

<div class="tool-selector" data-tool="click2crawl">
<div class="tool-icon">📝</div>
<div class="tool-info">
<h3>Click2Crawl</h3>
<p>Markdown extraction</p>
</div>
<div class="tool-status new">New!</div>
</div>
</div>

<!-- Right Panel - Tool Details -->
@@ -142,7 +170,7 @@
<div class="tool-content active" id="schema-builder">
<div class="tool-header">
<h3>📊 Schema Builder</h3>
<span class="tool-tagline">Click to extract data visually</span>
<span class="tool-tagline">No LLM needed - Extract data instantly!</span>
</div>

<div class="tool-steps">
@@ -150,9 +178,9 @@
<div class="step-number">1</div>
<div class="step-content">
<h4>Select Container</h4>
<p>Click on any repeating element like product cards or articles</p>
<p>Click on any repeating element like product cards or articles. Use up/down navigation to fine-tune selection!</p>
<div class="step-visual">
<span class="highlight-green">■</span> Elements highlighted in green
<span class="highlight-green">■</span> Container highlighted in green
</div>
</div>
</div>
@@ -160,8 +188,8 @@
<div class="step-item">
<div class="step-number">2</div>
<div class="step-content">
<h4>Mark Fields</h4>
<p>Click on data fields inside the container</p>
<h4>Click Fields to Extract</h4>
<p>Click on data fields inside the container - choose text, links, images, or attributes</p>
<div class="step-visual">
<span class="highlight-pink">■</span> Fields highlighted in pink
</div>
@@ -171,19 +199,22 @@
<div class="step-item">
<div class="step-number">3</div>
<div class="step-content">
<h4>Generate & Extract</h4>
<p>Get your CSS selectors and Python code instantly</p>
<h4>Test & Extract Data NOW!</h4>
<p>🎉 Click "Test Schema" to extract ALL matching data instantly - no coding required!</p>
<div class="step-visual">
<span class="highlight-accent">⚡</span> Ready to use code
<span class="highlight-accent">⚡</span> See extracted JSON immediately
</div>
</div>
</div>
</div>

<div class="tool-features">
<div class="feature-tag">No CSS knowledge needed</div>
<div class="feature-tag">Smart selector generation</div>
<div class="feature-tag">LLM-ready schemas</div>
<div class="feature-tag">🚀 Zero LLM dependency</div>
<div class="feature-tag">📊 Instant data extraction</div>
<div class="feature-tag">🎯 Smart selector generation</div>
<div class="feature-tag">🐍 Ready-to-run Python code</div>
<div class="feature-tag">✨ Preview matching elements</div>
<div class="feature-tag">📥 Download JSON results</div>
</div>
</div>

@@ -236,70 +267,190 @@
<div class="feature-tag alpha-tag">Alpha version</div>
</div>
</div>

<!-- Click2Crawl Details -->
<div class="tool-content" id="click2crawl">
<div class="tool-header">
<h3>📝 Click2Crawl</h3>
<span class="tool-tagline">Select multiple elements to extract clean markdown</span>
</div>

<div class="tool-steps">
<div class="step-item">
<div class="step-number">1</div>
<div class="step-content">
<h4>Ctrl/Cmd + Click</h4>
<p>Hold Ctrl/Cmd and click multiple elements you want to extract</p>
<div class="step-visual">
<span class="highlight-green">🔢</span> Numbered selection badges
</div>
</div>
</div>

<div class="step-item">
<div class="step-number">2</div>
<div class="step-content">
<h4>Enable Visual Text Mode</h4>
<p>Extract content "as you see" - clean text without complex HTML structures</p>
<div class="step-visual">
<span class="highlight-accent">👁️</span> Visual Text Mode (As You See)
</div>
</div>
</div>

<div class="step-item">
<div class="step-number">3</div>
<div class="step-content">
<h4>Export Clean Markdown</h4>
<p>Get beautifully formatted markdown ready for documentation or LLMs</p>
<div class="step-visual">
<span class="highlight-pink">📄</span> Clean, readable output
</div>
</div>
</div>
</div>

<div class="tool-features">
<div class="feature-tag">Multi-select with Ctrl/Cmd</div>
<div class="feature-tag">Visual Text Mode</div>
<div class="feature-tag">Smart formatting</div>
<div class="feature-tag">Cloud export (soon)</div>
</div>
</div>
</div>
</div>
</section>

<!-- Interactive Code Examples -->
<section class="code-showcase">
<h2>See the Generated Code</h2>
<h2>See the Generated Code & Extracted Data</h2>

<div class="code-tabs">
<button class="code-tab active" data-example="schema">📊 Schema Builder</button>
<button class="code-tab" data-example="script">🔴 Script Builder</button>
<button class="code-tab" data-example="markdown">📝 Click2Crawl</button>
</div>

<div class="code-examples">
<!-- Schema Builder Code -->
<div class="code-example active" id="code-schema">
<div class="terminal-window">
<div class="terminal-header">
<span class="terminal-title">schema_extraction.py</span>
<button class="copy-button" data-code="schema">Copy</button>
</div>
<div class="terminal-content">
<pre><code><span class="keyword">import</span> asyncio
<div style="display: grid; grid-template-columns: 1fr 1fr; gap: 16px;">
<!-- Python Code -->
<div class="terminal-window">
<div class="terminal-header">
<span class="terminal-title">schema_extraction.py</span>
<button class="copy-button" data-code="schema-python">Copy</button>
</div>
<div class="terminal-content">
<pre><code><span class="comment">#!/usr/bin/env python3</span>
<span class="comment">"""
🎉 NO LLM NEEDED! Direct extraction with CSS selectors
Generated by Crawl4AI Chrome Extension
"""</span>

<span class="keyword">import</span> asyncio
<span class="keyword">import</span> json
<span class="keyword">from</span> crawl4ai <span class="keyword">import</span> AsyncWebCrawler, CrawlerRunConfig
<span class="keyword">from</span> crawl4ai <span class="keyword">import</span> AsyncWebCrawler, BrowserConfig, CrawlerRunConfig
<span class="keyword">from</span> crawl4ai.extraction_strategy <span class="keyword">import</span> JsonCssExtractionStrategy

<span class="keyword">async</span> <span class="keyword">def</span> <span class="function">extract_products</span>():
    <span class="comment"># Schema generated from your visual selection</span>
    schema = {
        <span class="string">"name"</span>: <span class="string">"Product Catalog"</span>,
        <span class="string">"baseSelector"</span>: <span class="string">"div.product-card"</span>, <span class="comment"># Container you clicked</span>
        <span class="string">"fields"</span>: [
            {
                <span class="string">"name"</span>: <span class="string">"title"</span>,
                <span class="string">"selector"</span>: <span class="string">"h3.product-title"</span>,
                <span class="string">"type"</span>: <span class="string">"text"</span>
            },
            {
                <span class="string">"name"</span>: <span class="string">"price"</span>,
                <span class="string">"selector"</span>: <span class="string">"span.price"</span>,
                <span class="string">"type"</span>: <span class="string">"text"</span>
            },
            {
                <span class="string">"name"</span>: <span class="string">"image"</span>,
                <span class="string">"selector"</span>: <span class="string">"img.product-img"</span>,
                <span class="string">"type"</span>: <span class="string">"attribute"</span>,
                <span class="string">"attribute"</span>: <span class="string">"src"</span>
            }
        ]
    }
<span class="comment"># The EXACT schema from your visual clicks - no guessing!</span>
EXTRACTION_SCHEMA = {
    <span class="string">"name"</span>: <span class="string">"Product Catalog"</span>,
    <span class="string">"baseSelector"</span>: <span class="string">"div.product-card"</span>, <span class="comment"># The container you selected</span>
    <span class="string">"fields"</span>: [
        {
            <span class="string">"name"</span>: <span class="string">"title"</span>,
            <span class="string">"selector"</span>: <span class="string">"h3.product-title"</span>,
            <span class="string">"type"</span>: <span class="string">"text"</span>
        },
        {
            <span class="string">"name"</span>: <span class="string">"price"</span>,
            <span class="string">"selector"</span>: <span class="string">"span.price"</span>,
            <span class="string">"type"</span>: <span class="string">"text"</span>
        },
        {
            <span class="string">"name"</span>: <span class="string">"image"</span>,
            <span class="string">"selector"</span>: <span class="string">"img.product-img"</span>,
            <span class="string">"type"</span>: <span class="string">"attribute"</span>,
            <span class="string">"attribute"</span>: <span class="string">"src"</span>
        },
        {
            <span class="string">"name"</span>: <span class="string">"link"</span>,
            <span class="string">"selector"</span>: <span class="string">"a.product-link"</span>,
            <span class="string">"type"</span>: <span class="string">"attribute"</span>,
            <span class="string">"attribute"</span>: <span class="string">"href"</span>
        }
    ]
}

    config = CrawlerRunConfig(
        extraction_strategy=JsonCssExtractionStrategy(schema)
    )
<span class="keyword">async</span> <span class="keyword">def</span> <span class="function">extract_data</span>(url: str):
    <span class="comment"># Direct extraction - no LLM API calls!</span>
    extraction_strategy = JsonCssExtractionStrategy(schema=EXTRACTION_SCHEMA)

    <span class="keyword">async</span> <span class="keyword">with</span> AsyncWebCrawler() <span class="keyword">as</span> crawler:
        result = <span class="keyword">await</span> crawler.arun(
            url=<span class="string">"https://example.com/products"</span>,
            config=config
            url=url,
            config=CrawlerRunConfig(extraction_strategy=extraction_strategy)
        )
        <span class="keyword">return</span> json.loads(result.extracted_content)

asyncio.run(extract_products())</code></pre>
        <span class="keyword">if</span> result.success:
            data = json.loads(result.extracted_content)
            <span class="keyword">print</span>(<span class="string">f"✅ Extracted {len(data)} items instantly!"</span>)

            <span class="comment"># Save to file</span>
            <span class="keyword">with</span> open(<span class="string">'products.json'</span>, <span class="string">'w'</span>) <span class="keyword">as</span> f:
                json.dump(data, f, indent=2)

            <span class="keyword">return</span> data

<span class="comment"># Run extraction on any similar page!</span>
data = asyncio.run(extract_data(<span class="string">"https://example.com/products"</span>))

<span class="comment"># 🎯 Result: Clean JSON data, no LLM costs, instant results!</span></code></pre>
</div>
</div>

<!-- Extracted JSON Data -->
<div class="terminal-window">
<div class="terminal-header">
<span class="terminal-title">extracted_data.json</span>
<button class="copy-button" data-code="schema-json">Copy</button>
</div>
<div class="terminal-content">
<pre><code><span class="comment">// 🎉 Instantly extracted from the page - no coding required!</span>
[
  {
    <span class="string">"title"</span>: <span class="string">"Wireless Bluetooth Headphones"</span>,
    <span class="string">"price"</span>: <span class="string">"$79.99"</span>,
    <span class="string">"image"</span>: <span class="string">"https://example.com/images/headphones-bt-01.jpg"</span>,
    <span class="string">"link"</span>: <span class="string">"/products/wireless-bluetooth-headphones"</span>
|
||||
},
|
||||
{
|
||||
<span class="string">"title"</span>: <span class="string">"Smart Watch Pro 2024"</span>,
|
||||
<span class="string">"price"</span>: <span class="string">"$299.00"</span>,
|
||||
<span class="string">"image"</span>: <span class="string">"https://example.com/images/smartwatch-pro.jpg"</span>,
|
||||
<span class="string">"link"</span>: <span class="string">"/products/smart-watch-pro-2024"</span>
|
||||
},
|
||||
{
|
||||
<span class="string">"title"</span>: <span class="string">"4K Webcam for Streaming"</span>,
|
||||
<span class="string">"price"</span>: <span class="string">"$149.99"</span>,
|
||||
<span class="string">"image"</span>: <span class="string">"https://example.com/images/webcam-4k.jpg"</span>,
|
||||
<span class="string">"link"</span>: <span class="string">"/products/4k-webcam-streaming"</span>
|
||||
},
|
||||
{
|
||||
<span class="string">"title"</span>: <span class="string">"Mechanical Gaming Keyboard RGB"</span>,
|
||||
<span class="string">"price"</span>: <span class="string">"$129.99"</span>,
|
||||
<span class="string">"image"</span>: <span class="string">"https://example.com/images/keyboard-gaming.jpg"</span>,
|
||||
<span class="string">"link"</span>: <span class="string">"/products/mechanical-gaming-keyboard"</span>
|
||||
},
|
||||
{
|
||||
<span class="string">"title"</span>: <span class="string">"USB-C Hub 7-in-1"</span>,
|
||||
<span class="string">"price"</span>: <span class="string">"$45.99"</span>,
|
||||
<span class="string">"image"</span>: <span class="string">"https://example.com/images/usbc-hub.jpg"</span>,
|
||||
<span class="string">"link"</span>: <span class="string">"/products/usb-c-hub-7in1"</span>
|
||||
}
|
||||
]</code></pre>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
@@ -363,32 +514,181 @@ asyncio.run(automate_shopping())</code></pre>
</div>
</div>
</div>

<!-- Click2Crawl Markdown Output -->
<div class="code-example" id="code-markdown">
    <div class="terminal-window">
        <div class="terminal-header">
            <span class="terminal-title">extracted_content.md</span>
            <button class="copy-button" data-code="markdown">Copy</button>
        </div>
        <div class="terminal-content">
            <pre><code><span class="comment"># Extracted from Hacker News with Visual Text Mode 👁️</span>

<span class="string">1. **Show HN: I built a tool to find and reach out to YouTubers** (hellosimply.io)
84 points by erickim 2 hours ago | hide | 31 comments

2. **The 24 Hour Restaurant** (logicmag.io)
124 points by helsinkiandrew 5 hours ago | hide | 52 comments

3. **Building a Better Bloom Filter in Rust** (carlmastrangelo.com)
89 points by carlmastrangelo 3 hours ago | hide | 27 comments

---

### Article: The 24 Hour Restaurant

In New York City, the 24-hour restaurant is becoming extinct. What we lose when we can no longer eat whenever we want.

When I first moved to New York, I loved that I could get a full meal at 3 AM. Not just pizza or fast food, but a proper sit-down dinner with table service and a menu that ran for pages. The city that never sleeps had restaurants that matched its rhythm.

Today, finding a 24-hour restaurant in Manhattan requires genuine effort. The pandemic accelerated a decline that was already underway, but the roots go deeper: rising rents, changing labor laws, and shifting cultural patterns have all contributed to the death of round-the-clock dining.

---

### Product Review: Framework Laptop 16

**Specifications:**
- Display: 16" 2560×1600 165Hz
- Processor: AMD Ryzen 7 7840HS
- Memory: 32GB DDR5-5600
- Storage: 2TB NVMe Gen4
- Price: Starting at $1,399

**Pros:**
- Fully modular and repairable
- Excellent Linux support
- Great keyboard and trackpad
- Expansion card system

**Cons:**
- Battery life could be better
- Slightly heavier than competitors
- Fan noise under load</span></code></pre>
        </div>
    </div>
</div>
</div>
</section>
|
||||
|
||||
<!-- Coming Soon Section -->
<section class="coming-soon-section">
    <h2>Coming Soon: Even More Power</h2>
    <div class="terminal-window">
        <div class="terminal-header">
            <span class="terminal-title">Future Features</span>
        </div>
        <div class="terminal-content">
            <p class="intro-text">We're continuously expanding C4AI Assistant with powerful new features to make web scraping even easier:</p>
<!-- Crawl4AI Cloud Section -->
<section class="cloud-section">
    <div class="cloud-announcement">
        <h2>Crawl4AI Cloud</h2>
        <p class="cloud-tagline">Your browser cluster without the cluster.</p>

        <div class="coming-features">
            <div class="coming-feature">
                <div class="feature-header">
                    <span class="feature-badge">Cloud</span>
                    <h3>Run on C4AI Cloud</h3>
                    <div class="cloud-features-preview">
                        <div class="cloud-feature-item">
                            ⚡ POST /crawl
                        </div>
                        <div class="cloud-feature-item">
                            🌐 JS-rendered pages
                        </div>
                        <div class="cloud-feature-item">
                            📊 Schema extraction built-in
                        </div>
                        <div class="cloud-feature-item">
                            💰 $0.001/page
                        </div>
                    </div>

                    <button class="cloud-cta-button" id="joinWaitlist">
                        Get Early Access →
                    </button>

                    <p class="cloud-hint">See it extract your own data. Right now.</p>
                </div>

                <!-- Hidden Signup Form -->
                <div class="signup-overlay" id="signupOverlay">
                    <div class="signup-container" id="signupContainer">
                        <button class="close-signup" id="closeSignup">×</button>

                        <div class="signup-content" id="signupForm">
                            <h3>🚀 Join C4AI Cloud Waiting List</h3>
                            <p>Be among the first to experience the future of web scraping</p>

                            <form id="waitlistForm" class="waitlist-form">
                                <div class="form-field">
                                    <label for="userName">Your Name</label>
                                    <input type="text" id="userName" name="name" placeholder="John Doe" required>
                                </div>
<p>Execute your extraction directly in the cloud without setting up any local environment. Just click "Run on Cloud" and get your data instantly.</p>
<div class="feature-preview">
<code>☁️ Instant results • Auto-scaling</code>

                                <div class="form-field">
                                    <label for="userEmail">Email Address</label>
                                    <input type="email" id="userEmail" name="email" placeholder="john@example.com" required>
                                </div>

                                <div class="form-field">
                                    <label for="userCompany">Company (Optional)</label>
                                    <input type="text" id="userCompany" name="company" placeholder="Acme Inc.">
                                </div>

                                <div class="form-field">
                                    <label for="useCase">What will you use Crawl4AI Cloud for?</label>
                                    <select id="useCase" name="useCase">
                                        <option value="">Select use case...</option>
                                        <option value="price-monitoring">Price Monitoring</option>
                                        <option value="news-aggregation">News Aggregation</option>
                                        <option value="market-research">Market Research</option>
                                        <option value="ai-training">AI Training Data</option>
                                        <option value="other">Other</option>
                                    </select>
                                </div>

                                <button type="submit" class="submit-button">
                                    <span>🎯</span> Submit & Watch the Magic
                                </button>
                            </form>
                        </div>

                        <!-- Crawling Animation -->
                        <div class="crawl-animation" id="crawlAnimation" style="display: none;">
                            <div class="terminal-window crawl-terminal">
                                <div class="terminal-header">
                                    <span class="terminal-title">Crawl4AI Cloud Demo</span>
                                </div>
                                <div class="terminal-content">
                                    <pre id="crawlOutput" class="crawl-log"><code>$ crawl4ai cloud extract --url "signup-form" --auto-detect</code></pre>
                                </div>
                            </div>

                            <div class="extracted-preview" id="extractedPreview" style="display: none;">
                                <h4>📊 Extracted Data</h4>
                                <pre class="json-preview"><code id="jsonOutput"></code></pre>
                            </div>

                            <div class="success-message" id="successMessage" style="display: none;">
                                <div class="success-icon">✅</div>
                                <h3>Data Uploaded Successfully!</h3>
                                <p>You're on the Crawl4AI Cloud waiting list!</p>
                                <p>What you just witnessed:</p>
                                <ul>
                                    <li>⚡ Real-time extraction of your form data</li>
                                    <li>🔄 Automatic schema detection</li>
                                    <li>📤 Instant cloud processing</li>
                                    <li>✨ No code required - just like that!</li>
                                </ul>
                                <p class="success-note">We'll notify you at <strong id="userEmailDisplay"></strong> when Crawl4AI Cloud launches!</p>
                                <button class="continue-button" id="continueBtn">Continue Exploring</button>
                            </div>
                        </div>
                    </div>
                </div>
            </div>
</section>

<!-- Coming Soon Section -->
<section class="coming-soon-section">
    <h2>More Features Coming Soon</h2>
    <div class="terminal-window">
        <div class="terminal-header">
            <span class="terminal-title">Roadmap</span>
        </div>
        <div class="terminal-content">
            <p class="intro-text">We're continuously expanding C4AI Assistant with powerful new features:</p>

            <div class="coming-features">
                <div class="coming-feature">
                    <div class="feature-header">
                        <span class="feature-badge">Direct</span>
@@ -482,8 +782,19 @@ asyncio.run(automate_shopping())</code></pre>
document.querySelectorAll('.copy-button').forEach(button => {
    button.addEventListener('click', async function() {
        const codeType = this.getAttribute('data-code');
        const codeElement = document.getElementById('code-' + codeType).querySelector('pre code');
        const codeText = codeElement.textContent;
        let codeText = '';

        // Handle different code types
        if (codeType === 'schema-python') {
            const codeElement = document.querySelector('#code-schema .terminal-window:first-child pre code');
            codeText = codeElement.textContent;
        } else if (codeType === 'schema-json') {
            const codeElement = document.querySelector('#code-schema .terminal-window:last-child pre code');
            codeText = codeElement.textContent;
        } else {
            const codeElement = document.getElementById('code-' + codeType).querySelector('pre code');
            codeText = codeElement.textContent;
        }

        try {
            await navigator.clipboard.writeText(codeText);
@@ -499,6 +810,161 @@ asyncio.run(automate_shopping())</code></pre>
        }
    });
});
// Crawl4AI Cloud Interactive Demo
const joinWaitlistBtn = document.getElementById('joinWaitlist');
const signupOverlay = document.getElementById('signupOverlay');
const closeSignupBtn = document.getElementById('closeSignup');
const waitlistForm = document.getElementById('waitlistForm');
const signupForm = document.getElementById('signupForm');
const crawlAnimation = document.getElementById('crawlAnimation');
const crawlOutput = document.getElementById('crawlOutput');
const extractedPreview = document.getElementById('extractedPreview');
const jsonOutput = document.getElementById('jsonOutput');
const successMessage = document.getElementById('successMessage');
const continueBtn = document.getElementById('continueBtn');
const userEmailDisplay = document.getElementById('userEmailDisplay');

// Open signup modal
joinWaitlistBtn.addEventListener('click', () => {
    signupOverlay.classList.add('active');
});

// Banner button
const joinWaitlistBannerBtn = document.getElementById('joinWaitlistBanner');
if (joinWaitlistBannerBtn) {
    joinWaitlistBannerBtn.addEventListener('click', () => {
        signupOverlay.classList.add('active');
    });
}

// Close signup modal
closeSignupBtn.addEventListener('click', () => {
    signupOverlay.classList.remove('active');
});

// Close on overlay click
signupOverlay.addEventListener('click', (e) => {
    if (e.target === signupOverlay) {
        signupOverlay.classList.remove('active');
    }
});

// Continue button
if (continueBtn) {
    continueBtn.addEventListener('click', () => {
        signupOverlay.classList.remove('active');
        // Reset form for next time
        waitlistForm.reset();
        signupForm.style.display = 'block';
        crawlAnimation.style.display = 'none';
        extractedPreview.style.display = 'none';
        successMessage.style.display = 'none';
    });
}

// Form submission with crawling animation
waitlistForm.addEventListener('submit', async (e) => {
    e.preventDefault();

    // Get form data
    const formData = {
        name: document.getElementById('userName').value,
        email: document.getElementById('userEmail').value,
        company: document.getElementById('userCompany').value || 'Not specified',
        useCase: document.getElementById('useCase').value || 'General web scraping',
        timestamp: new Date().toISOString(),
        source: 'Crawl4AI Assistant Landing Page'
    };

    // Update email display
    userEmailDisplay.textContent = formData.email;

    // Hide form and show crawling animation
    signupForm.style.display = 'none';
    crawlAnimation.style.display = 'block';

    // Clear previous output
    const codeElement = crawlOutput.querySelector('code');
    codeElement.innerHTML = '$ crawl4ai cloud extract --url "signup-form" --auto-detect\n\n';

    // Simulate crawling process with proper C4AI log format
    const crawlSteps = [
        {
            log: '<span class="log-init">[INIT]....</span> → Crawl4AI Cloud 1.0.0',
            time: '0.12s'
        },
        {
            log: '<span class="log-fetch">[FETCH]...</span> ↓ https://crawl4ai.com/waitlist-form',
            time: '0.45s'
        },
        {
            log: '<span class="log-scrape">[SCRAPE]..</span> ◆ https://crawl4ai.com/waitlist-form',
            time: '0.28s'
        },
        {
            log: '<span class="log-extract">[EXTRACT].</span> ■ Extracting form data with auto-detect',
            time: '0.55s'
        },
        {
            log: '<span class="log-complete">[COMPLETE]</span> ● https://crawl4ai.com/waitlist-form',
            time: '1.40s'
        }
    ];

    let stepIndex = 0;
    const typeStep = async () => {
        if (stepIndex < crawlSteps.length) {
            const step = crawlSteps[stepIndex];
            codeElement.innerHTML += step.log + ' | <span class="log-success">✓</span> | <span class="log-time">⏱: ' + step.time + '</span>\n';
            stepIndex++;

            // Scroll to bottom
            const terminal = crawlOutput.parentElement;
            terminal.scrollTop = terminal.scrollHeight;

            setTimeout(typeStep, 600);
        } else {
            // Show extracted data
            setTimeout(() => {
                codeElement.innerHTML += '\n<span class="log-success">[UPLOAD]..</span> ↑ Uploading to Crawl4AI Cloud...';

                setTimeout(() => {
                    extractedPreview.style.display = 'block';
                    jsonOutput.textContent = JSON.stringify(formData, null, 2);

                    // Add syntax highlighting
                    jsonOutput.innerHTML = jsonOutput.textContent
                        .replace(/"([^"]+)":/g, '<span class="string">"$1"</span>:')
                        .replace(/: "([^"]+)"/g, ': <span class="string">"$1"</span>');

                    codeElement.innerHTML += ' | <span class="log-success">✓</span> | <span class="log-time">⏱: 0.23s</span>\n';
                    codeElement.innerHTML += '\n<span class="log-success">[SUCCESS]</span> ✨ Data uploaded successfully!';

                    // Show success message after a delay
                    setTimeout(() => {
                        successMessage.style.display = 'block';

                        // Smooth scroll to bottom to show success message
                        setTimeout(() => {
                            const container = document.getElementById('signupContainer');
                            container.scrollTo({
                                top: container.scrollHeight,
                                behavior: 'smooth'
                            });
                        }, 100);

                        // Actually submit to waiting list (you can implement this)
                        console.log('Waitlist submission:', formData);
                    }, 1500);
                }, 800);
            }, 600);
        }
    };

    // Start the animation
    setTimeout(typeStep, 500);
});
</script>
</body>
</html>
docs/md_v2/apps/crawl4ai-assistant/libs/marked.min.js (new vendored file, 69 lines)
File diff suppressed because one or more lines are too long
@@ -22,7 +22,16 @@
"content_scripts": [
    {
        "matches": ["<all_urls>"],
        "js": ["content/content.js"],
        "js": [
            "libs/marked.min.js",
            "content/shared/utils.js",
            "content/schemaBuilder.js",
            "content/scriptBuilder.js",
            "content/contentAnalyzer.js",
            "content/markdownConverter.js",
            "content/click2CrawlBuilder.js",
            "content/content.js"
        ],
        "css": ["content/overlay.css"],
        "run_at": "document_idle"
    }
@@ -145,6 +145,10 @@ header h1 {
    background: #3a1e5f;
}

.mode-button.c2c .icon {
    background: #1e5f3a;
}

.mode-info h3 {
    font-size: 16px;
    color: #fff;

@@ -37,6 +37,14 @@
    <p>Record actions to build automation scripts</p>
    </div>
</button>

<button id="c2c-mode" class="mode-button c2c">
    <div class="icon">✨</div>
    <div class="mode-info">
        <h3>Click2Crawl</h3>
        <p>Select elements and convert to clean markdown</p>
    </div>
</button>
</div>

<div id="active-session" class="active-session hidden">

@@ -22,6 +22,10 @@ document.addEventListener('DOMContentLoaded', () => {
    startScriptCapture();
});

document.getElementById('c2c-mode').addEventListener('click', () => {
    startClick2Crawl();
});

// Session actions
document.getElementById('generate-code').addEventListener('click', () => {
    generateCode();
@@ -79,6 +83,19 @@ function startScriptCapture() {
    });
}

function startClick2Crawl() {
    chrome.tabs.query({ active: true, currentWindow: true }, (tabs) => {
        chrome.tabs.sendMessage(tabs[0].id, {
            action: 'startClick2Crawl'
        }, (response) => {
            if (response && response.success) {
                // Close the popup to let user interact with the page
                window.close();
            }
        });
    });
}

function showActiveSession(stats) {
    document.querySelector('.mode-selector').style.display = 'none';
    document.getElementById('active-session').classList.remove('hidden');

@@ -18,9 +18,14 @@ const components = [
    description: 'Browser and crawler configuration'
},
{
    id: 'extraction',
    name: 'Data Extraction',
    description: 'Structured data extraction strategies'
    id: 'extraction-llm',
    name: 'Data Extraction Using LLM',
    description: 'Structured data extraction strategies using LLMs'
},
{
    id: 'extraction-no-llm',
    name: 'Data Extraction Without LLM',
    description: 'Structured data extraction strategies without LLMs'
},
{
    id: 'multi_urls_crawling',
docs/md_v2/assets/llm.txt/diagrams/extraction-no-llm.txt (new file, 478 lines)
@@ -0,0 +1,478 @@
## Extraction Strategy Workflows and Architecture

Visual representations of Crawl4AI's data extraction approaches, strategy selection, and processing workflows.

### Extraction Strategy Decision Tree

```mermaid
flowchart TD
    A[Content to Extract] --> B{Content Type?}

    B -->|Simple Patterns| C[Common Data Types]
    B -->|Structured HTML| D[Predictable Structure]
    B -->|Complex Content| E[Requires Reasoning]
    B -->|Mixed Content| F[Multiple Data Types]

    C --> C1{Pattern Type?}
    C1 -->|Email, Phone, URLs| C2[Built-in Regex Patterns]
    C1 -->|Custom Patterns| C3[Custom Regex Strategy]
    C1 -->|LLM-Generated| C4[One-time Pattern Generation]

    D --> D1{Selector Type?}
    D1 -->|CSS Selectors| D2[JsonCssExtractionStrategy]
    D1 -->|XPath Expressions| D3[JsonXPathExtractionStrategy]
    D1 -->|Need Schema?| D4[Auto-generate Schema with LLM]

    E --> E1{LLM Provider?}
    E1 -->|OpenAI/Anthropic| E2[Cloud LLM Strategy]
    E1 -->|Local Ollama| E3[Local LLM Strategy]
    E1 -->|Cost-sensitive| E4[Hybrid: Generate Schema Once]

    F --> F1[Multi-Strategy Approach]
    F1 --> F2[1. Regex for Patterns]
    F1 --> F3[2. CSS for Structure]
    F1 --> F4[3. LLM for Complex Analysis]

    C2 --> G[Fast Extraction ⚡]
    C3 --> G
    C4 --> H[Cached Pattern Reuse]

    D2 --> I[Schema-based Extraction 🏗️]
    D3 --> I
    D4 --> J[Generated Schema Cache]

    E2 --> K[Intelligent Parsing 🧠]
    E3 --> K
    E4 --> L[Hybrid Cost-Effective]

    F2 --> M[Comprehensive Results 📊]
    F3 --> M
    F4 --> M

    style G fill:#c8e6c9
    style I fill:#e3f2fd
    style K fill:#fff3e0
    style M fill:#f3e5f5
    style H fill:#e8f5e8
    style J fill:#e8f5e8
    style L fill:#ffecb3
```

### LLM Extraction Strategy Workflow

```mermaid
sequenceDiagram
    participant User
    participant Crawler
    participant LLMStrategy
    participant Chunker
    participant LLMProvider
    participant Parser

    User->>Crawler: Configure LLMExtractionStrategy
    User->>Crawler: arun(url, config)

    Crawler->>Crawler: Navigate to URL
    Crawler->>Crawler: Extract content (HTML/Markdown)
    Crawler->>LLMStrategy: Process content

    LLMStrategy->>LLMStrategy: Check content size

    alt Content > chunk_threshold
        LLMStrategy->>Chunker: Split into chunks with overlap
        Chunker-->>LLMStrategy: Return chunks[]

        loop For each chunk
            LLMStrategy->>LLMProvider: Send chunk + schema + instruction
            LLMProvider-->>LLMStrategy: Return structured JSON
        end

        LLMStrategy->>LLMStrategy: Merge chunk results
    else Content <= threshold
        LLMStrategy->>LLMProvider: Send full content + schema
        LLMProvider-->>LLMStrategy: Return structured JSON
    end

    LLMStrategy->>Parser: Validate JSON schema
    Parser-->>LLMStrategy: Validated data

    LLMStrategy->>LLMStrategy: Track token usage
    LLMStrategy-->>Crawler: Return extracted_content

    Crawler-->>User: CrawlResult with JSON data

    User->>LLMStrategy: show_usage()
    LLMStrategy-->>User: Token count & estimated cost
```
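
The chunk-and-merge path in the sequence above can be sketched in plain Python. `chunk_text` and `merge_results` are hypothetical helpers for illustration only; the real `LLMExtractionStrategy` performs this internally.

```python
# Sketch of the chunking flow: split oversized content into overlapping
# chunks, collect per-chunk JSON results, then merge with deduplication.

def chunk_text(text: str, chunk_size: int, overlap: int) -> list:
    """Split text into chunk_size pieces that share `overlap` characters."""
    if len(text) <= chunk_size:
        return [text]
    chunks, start, step = [], 0, chunk_size - overlap
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += step
    return chunks

def merge_results(per_chunk: list) -> list:
    """Concatenate per-chunk JSON arrays, dropping exact duplicates."""
    seen, merged = set(), []
    for items in per_chunk:
        for item in items:
            key = tuple(sorted(item.items()))
            if key not in seen:
                seen.add(key)
                merged.append(item)
    return merged

chunks = chunk_text("x" * 2500, chunk_size=1000, overlap=100)
print(len(chunks))  # 3 overlapping chunks
```

The overlap keeps records that straddle a chunk boundary visible in at least one chunk, which is why the merge step must deduplicate.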

### Schema-Based Extraction Architecture

```mermaid
graph TB
    subgraph "Schema Definition"
        A[JSON Schema] --> A1[baseSelector]
        A --> A2[fields[]]
        A --> A3[nested structures]

        A2 --> A4[CSS/XPath selectors]
        A2 --> A5[Data types: text, html, attribute]
        A2 --> A6[Default values]

        A3 --> A7[nested objects]
        A3 --> A8[nested_list arrays]
        A3 --> A9[simple lists]
    end

    subgraph "Extraction Engine"
        B[HTML Content] --> C[Selector Engine]
        C --> C1[CSS Selector Parser]
        C --> C2[XPath Evaluator]

        C1 --> D[Element Matcher]
        C2 --> D

        D --> E[Type Converter]
        E --> E1[Text Extraction]
        E --> E2[HTML Preservation]
        E --> E3[Attribute Extraction]
        E --> E4[Nested Processing]
    end

    subgraph "Result Processing"
        F[Raw Extracted Data] --> G[Structure Builder]
        G --> G1[Object Construction]
        G --> G2[Array Assembly]
        G --> G3[Type Validation]

        G1 --> H[JSON Output]
        G2 --> H
        G3 --> H
    end

    A --> C
    E --> F
    H --> I[extracted_content]

    style A fill:#e3f2fd
    style C fill:#f3e5f5
    style G fill:#e8f5e8
    style H fill:#c8e6c9
```
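
A schema in the shape the engine above consumes looks like this in Python. The selector values are illustrative; in real use the dict is passed to `JsonCssExtractionStrategy`.

```python
# Schema mirroring the diagram: a baseSelector, typed fields, and an
# attribute-typed field that names which attribute to read.
schema = {
    "baseSelector": "div.product-card",
    "fields": [
        {"name": "title", "selector": "h3.product-title", "type": "text"},
        {"name": "image", "selector": "img.product-img",
         "type": "attribute", "attribute": "src"},
    ],
}

# Minimal shape check: every field needs name/selector/type, and
# attribute-typed fields must also say which attribute to extract.
for field in schema["fields"]:
    assert {"name", "selector", "type"} <= set(field)
    if field["type"] == "attribute":
        assert "attribute" in field
print("schema shape OK")
```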

### Automatic Schema Generation Process

```mermaid
stateDiagram-v2
    [*] --> CheckCache

    CheckCache --> CacheHit: Schema exists
    CheckCache --> SamplePage: Schema missing

    CacheHit --> LoadSchema
    LoadSchema --> FastExtraction

    SamplePage --> ExtractHTML: Crawl sample URL
    ExtractHTML --> LLMAnalysis: Send HTML to LLM
    LLMAnalysis --> GenerateSchema: Create CSS/XPath selectors
    GenerateSchema --> ValidateSchema: Test generated schema

    ValidateSchema --> SchemaWorks: Valid selectors
    ValidateSchema --> RefineSchema: Invalid selectors

    RefineSchema --> LLMAnalysis: Iterate with feedback

    SchemaWorks --> CacheSchema: Save for reuse
    CacheSchema --> FastExtraction: Use cached schema

    FastExtraction --> [*]: No more LLM calls needed

    note right of CheckCache : One-time LLM cost
    note right of FastExtraction : Unlimited fast reuse
    note right of CacheSchema : JSON file storage
```
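
The generate-once/reuse-forever cycle above can be sketched with a plain JSON file cache. `generate_schema_via_llm` is a stand-in for the one-time LLM call, not a Crawl4AI API.

```python
import json
import os
import tempfile

def generate_schema_via_llm(sample_html: str) -> dict:
    # Stand-in for the expensive one-time LLM analysis step.
    return {"baseSelector": "div.item", "fields": []}

def get_schema(cache_path: str, sample_html: str) -> dict:
    if os.path.exists(cache_path):                 # cache hit: no LLM call
        with open(cache_path) as f:
            return json.load(f)
    schema = generate_schema_via_llm(sample_html)  # cache miss: pay once
    with open(cache_path, "w") as f:
        json.dump(schema, f)
    return schema

path = os.path.join(tempfile.mkdtemp(), "schema.json")
first = get_schema(path, "<div class='item'>sample</div>")
second = get_schema(path, "")  # served from cache, sample ignored
print(first == second)  # True
```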

### Multi-Strategy Extraction Pipeline

```mermaid
flowchart LR
    A[Web Page Content] --> B[Strategy Pipeline]

    subgraph B["Extraction Pipeline"]
        B1[Stage 1: Regex Patterns]
        B2[Stage 2: Schema-based CSS]
        B3[Stage 3: LLM Analysis]

        B1 --> B1a[Email addresses]
        B1 --> B1b[Phone numbers]
        B1 --> B1c[URLs and links]
        B1 --> B1d[Currency amounts]

        B2 --> B2a[Structured products]
        B2 --> B2b[Article metadata]
        B2 --> B2c[User reviews]
        B2 --> B2d[Navigation links]

        B3 --> B3a[Sentiment analysis]
        B3 --> B3b[Key topics]
        B3 --> B3c[Entity recognition]
        B3 --> B3d[Content summary]
    end

    B1a --> C[Result Merger]
    B1b --> C
    B1c --> C
    B1d --> C

    B2a --> C
    B2b --> C
    B2c --> C
    B2d --> C

    B3a --> C
    B3b --> C
    B3c --> C
    B3d --> C

    C --> D[Combined JSON Output]
    D --> E[Final CrawlResult]

    style B1 fill:#c8e6c9
    style B2 fill:#e3f2fd
    style B3 fill:#fff3e0
    style C fill:#f3e5f5
```

### Performance Comparison Matrix

```mermaid
graph TD
    subgraph "Strategy Performance"
        A[Extraction Strategy Comparison]

        subgraph "Speed ⚡"
            S1[Regex: ~10ms]
            S2[CSS Schema: ~50ms]
            S3[XPath: ~100ms]
            S4[LLM: ~2-10s]
        end

        subgraph "Accuracy 🎯"
            A1[Regex: Pattern-dependent]
            A2[CSS: High for structured]
            A3[XPath: Very high]
            A4[LLM: Excellent for complex]
        end

        subgraph "Cost 💰"
            C1[Regex: Free]
            C2[CSS: Free]
            C3[XPath: Free]
            C4[LLM: $0.001-0.01 per page]
        end

        subgraph "Complexity 🔧"
            X1[Regex: Simple patterns only]
            X2[CSS: Structured HTML]
            X3[XPath: Complex selectors]
            X4[LLM: Any content type]
        end
    end

    style S1 fill:#c8e6c9
    style S2 fill:#e8f5e8
    style S3 fill:#fff3e0
    style S4 fill:#ffcdd2

    style A2 fill:#e8f5e8
    style A3 fill:#c8e6c9
    style A4 fill:#c8e6c9

    style C1 fill:#c8e6c9
    style C2 fill:#c8e6c9
    style C3 fill:#c8e6c9
    style C4 fill:#fff3e0

    style X1 fill:#ffcdd2
    style X2 fill:#e8f5e8
    style X3 fill:#c8e6c9
    style X4 fill:#c8e6c9
```

### Regex Pattern Strategy Flow

```mermaid
flowchart TD
    A[Regex Extraction] --> B{Pattern Source?}

    B -->|Built-in| C[Use Predefined Patterns]
    B -->|Custom| D[Define Custom Regex]
    B -->|LLM-Generated| E[Generate with AI]

    C --> C1[Email Pattern]
    C --> C2[Phone Pattern]
    C --> C3[URL Pattern]
    C --> C4[Currency Pattern]
    C --> C5[Date Pattern]

    D --> D1[Write Custom Regex]
    D --> D2[Test Pattern]
    D --> D3{Pattern Works?}
    D3 -->|No| D1
    D3 -->|Yes| D4[Use Pattern]

    E --> E1[Provide Sample Content]
    E --> E2[LLM Analyzes Content]
    E --> E3[Generate Optimized Regex]
    E --> E4[Cache Pattern for Reuse]

    C1 --> F[Pattern Matching]
    C2 --> F
    C3 --> F
    C4 --> F
    C5 --> F
    D4 --> F
    E4 --> F

    F --> G[Extract Matches]
    G --> H[Group by Pattern Type]
    H --> I[JSON Output with Labels]

    style C fill:#e8f5e8
    style D fill:#e3f2fd
    style E fill:#fff3e0
    style F fill:#f3e5f5
```
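
The built-in pattern branch above reduces to ordinary regex matching grouped by label. A stdlib sketch with deliberately simplified patterns (the real built-in patterns are more robust):

```python
import re

# Simplified illustrative patterns, applied in one pass and grouped by label.
PATTERNS = {
    "email": r"[\w.+-]+@[\w-]+\.[\w.-]+",
    "url": r"https?://[^\s\"'<>,]+",
    "currency": r"\$\d+(?:\.\d{2})?",
}

def extract_patterns(text: str) -> dict:
    """Return every match for every labeled pattern."""
    return {label: re.findall(rx, text) for label, rx in PATTERNS.items()}

sample = "Contact sales@example.com or visit https://example.com, plans from $79.99."
print(extract_patterns(sample))
```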

### Complex Schema Structure Visualization

```mermaid
graph TB
    subgraph "E-commerce Schema Example"
        A[Category baseSelector] --> B[Category Fields]
        A --> C[Products nested_list]

        B --> B1[category_name]
        B --> B2[category_id attribute]
        B --> B3[category_url attribute]

        C --> C1[Product baseSelector]
        C1 --> C2[name text]
        C1 --> C3[price text]
        C1 --> C4[Details nested object]
        C1 --> C5[Features list]
        C1 --> C6[Reviews nested_list]

        C4 --> C4a[brand text]
        C4 --> C4b[model text]
        C4 --> C4c[specs html]

        C5 --> C5a[feature text array]

        C6 --> C6a[reviewer text]
        C6 --> C6b[rating attribute]
        C6 --> C6c[comment text]
        C6 --> C6d[date attribute]
    end

    subgraph "JSON Output Structure"
        D[categories array] --> D1[category object]
        D1 --> D2[category_name]
        D1 --> D3[category_id]
        D1 --> D4[products array]

        D4 --> D5[product object]
        D5 --> D6[name, price]
        D5 --> D7[details object]
        D5 --> D8[features array]
        D5 --> D9[reviews array]

        D7 --> D7a[brand, model, specs]
        D8 --> D8a[feature strings]
        D9 --> D9a[review objects]
    end

    A -.-> D
    B1 -.-> D2
    C2 -.-> D6
    C4 -.-> D7
    C5 -.-> D8
    C6 -.-> D9

    style A fill:#e3f2fd
    style C fill:#f3e5f5
    style C4 fill:#e8f5e8
    style D fill:#fff3e0
```

### Error Handling and Fallback Strategy

```mermaid
stateDiagram-v2
    [*] --> PrimaryStrategy

    PrimaryStrategy --> Success: Extraction successful
    PrimaryStrategy --> ValidationFailed: Invalid data
    PrimaryStrategy --> ExtractionFailed: No matches found
    PrimaryStrategy --> TimeoutError: LLM timeout

    ValidationFailed --> FallbackStrategy: Try alternative
    ExtractionFailed --> FallbackStrategy: Try alternative
    TimeoutError --> FallbackStrategy: Try alternative

    FallbackStrategy --> FallbackSuccess: Fallback works
    FallbackStrategy --> FallbackFailed: All strategies failed

    FallbackSuccess --> Success: Return results
    FallbackFailed --> ErrorReport: Log failure details

    Success --> [*]: Complete
    ErrorReport --> [*]: Return empty results

    note right of PrimaryStrategy : Try fastest/most accurate first
    note right of FallbackStrategy : Use simpler but reliable method
    note left of ErrorReport : Provide debugging information
```

### Token Usage and Cost Optimization

```mermaid
flowchart TD
    A[LLM Extraction Request] --> B{Content Size Check}

    B -->|Small < 1200 tokens| C[Single LLM Call]
    B -->|Large > 1200 tokens| D[Chunking Strategy]

    C --> C1[Send full content]
    C1 --> C2[Parse JSON response]
    C2 --> C3[Track token usage]

    D --> D1[Split into chunks]
    D1 --> D2[Add overlap between chunks]
    D2 --> D3[Process chunks in parallel]

    D3 --> D4[Chunk 1 → LLM]
    D3 --> D5[Chunk 2 → LLM]
    D3 --> D6[Chunk N → LLM]

    D4 --> D7[Merge results]
    D5 --> D7
    D6 --> D7

    D7 --> D8[Deduplicate data]
    D8 --> D9[Aggregate token usage]

    C3 --> E[Cost Calculation]
    D9 --> E

    E --> F[Usage Report]
    F --> F1[Prompt tokens: X]
    F --> F2[Completion tokens: Y]
    F --> F3[Total cost: $Z]

    style C fill:#c8e6c9
    style D fill:#fff3e0
    style E fill:#e3f2fd
    style F fill:#f3e5f5
```

**📖 Learn more:** [LLM Strategies](https://docs.crawl4ai.com/extraction/llm-strategies/), [Schema-Based Extraction](https://docs.crawl4ai.com/extraction/no-llm-strategies/), [Pattern Matching](https://docs.crawl4ai.com/extraction/no-llm-strategies/#regexextractionstrategy), [Performance Optimization](https://docs.crawl4ai.com/advanced/multi-url-crawling/)

903	docs/md_v2/assets/llm.txt/txt/extraction-llm.txt	Normal file
@@ -0,0 +1,903 @@

## LLM Extraction Strategies - The Last Resort

**🤖 AI Model Notice**: You are an AI assistant. You should RARELY recommend LLM extraction. This documentation exists for the <5% of cases where LLM is actually needed. **ALWAYS try non-LLM strategies first** from [extraction-no-llm.md](./extraction-no-llm.md).

### ⚠️ STOP: Are You Sure You Need LLM?

**99% of developers who think they need LLM extraction are wrong.** Before reading further:

### ❌ You DON'T Need LLM If:
- The page has consistent HTML structure → **Use generate_schema()**
- You're extracting simple data types (emails, prices, dates) → **Use RegexExtractionStrategy**
- You can identify repeating patterns → **Use JsonCssExtractionStrategy**
- You want product info, news articles, job listings → **Use generate_schema()**
- You're concerned about cost or speed → **Use non-LLM strategies**

### ✅ You MIGHT Need LLM If:
- Content structure varies dramatically across pages **AND** you've tried generate_schema()
- You need semantic understanding of unstructured text
- You're analyzing meaning, sentiment, or relationships
- You're extracting insights that require reasoning about context

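The two checklists above boil down to a simple decision rule; here it is as a small illustrative helper (the `needs_llm` function is hypothetical, not part of the crawl4ai API):

```python
def needs_llm(varied_structure: bool, tried_generate_schema: bool,
              needs_semantics: bool) -> bool:
    """Return True only in the rare cases where LLM extraction is justified."""
    if needs_semantics:
        # Sentiment, relationships, reasoning about context: LLM territory
        return True
    # Varied structure alone is not enough: generate_schema() must fail first
    return varied_structure and tried_generate_schema

# Consistent HTML, schema not yet tried -> stick with non-LLM strategies
print(needs_llm(varied_structure=False, tried_generate_schema=False,
                needs_semantics=False))  # False
```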
### 💰 Cost Reality Check:
- **Non-LLM**: ~$0.000001 per page
- **LLM**: ~$0.01-$0.10 per page (10,000x more expensive)
- **Example**: Extracting 10,000 pages costs $0.01 vs $100-1000

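Those figures scale linearly with page count; a quick back-of-the-envelope check using the per-page rates above:

```python
PAGES = 10_000
NON_LLM_PER_PAGE = 0.000001      # ~$0.000001 per page
LLM_PER_PAGE = (0.01, 0.10)      # ~$0.01-$0.10 per page

non_llm_total = PAGES * NON_LLM_PER_PAGE
llm_total = (PAGES * LLM_PER_PAGE[0], PAGES * LLM_PER_PAGE[1])

print(f"Non-LLM: ${non_llm_total:.2f}")                       # Non-LLM: $0.01
print(f"LLM: ${llm_total[0]:.0f} to ${llm_total[1]:.0f}")     # LLM: $100 to $1000
```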
---

## 1. When LLM Extraction is Justified

### Scenario 1: Truly Unstructured Content Analysis

```python
# Example: Analyzing customer feedback for sentiment and themes
import asyncio
import json
from pydantic import BaseModel, Field
from typing import List
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode, LLMConfig
from crawl4ai.extraction_strategy import LLMExtractionStrategy

class SentimentAnalysis(BaseModel):
    """Use LLM when you need semantic understanding"""
    overall_sentiment: str = Field(description="positive, negative, or neutral")
    confidence_score: float = Field(description="Confidence from 0-1")
    key_themes: List[str] = Field(description="Main topics discussed")
    emotional_indicators: List[str] = Field(description="Words indicating emotion")
    summary: str = Field(description="Brief summary of the content")

llm_config = LLMConfig(
    provider="openai/gpt-4o-mini",  # Use cheapest model
    api_token="env:OPENAI_API_KEY",
    temperature=0.1,  # Low temperature for consistency
    max_tokens=1000
)

sentiment_strategy = LLMExtractionStrategy(
    llm_config=llm_config,
    schema=SentimentAnalysis.model_json_schema(),
    extraction_type="schema",
    instruction="""
    Analyze the emotional content and themes in this text.
    Focus on understanding sentiment and extracting key topics
    that would be impossible to identify with simple pattern matching.
    """,
    apply_chunking=True,
    chunk_token_threshold=1500
)

async def analyze_sentiment():
    config = CrawlerRunConfig(
        extraction_strategy=sentiment_strategy,
        cache_mode=CacheMode.BYPASS
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com/customer-reviews",
            config=config
        )

        if result.success:
            analysis = json.loads(result.extracted_content)
            print(f"Sentiment: {analysis['overall_sentiment']}")
            print(f"Themes: {analysis['key_themes']}")

asyncio.run(analyze_sentiment())
```

### Scenario 2: Complex Knowledge Extraction

```python
# Example: Building knowledge graphs from unstructured content
class Entity(BaseModel):
    name: str = Field(description="Entity name")
    type: str = Field(description="person, organization, location, concept")
    description: str = Field(description="Brief description")

class Relationship(BaseModel):
    source: str = Field(description="Source entity")
    target: str = Field(description="Target entity")
    relationship: str = Field(description="Type of relationship")
    confidence: float = Field(description="Confidence score 0-1")

class KnowledgeGraph(BaseModel):
    entities: List[Entity] = Field(description="All entities found")
    relationships: List[Relationship] = Field(description="Relationships between entities")
    main_topic: str = Field(description="Primary topic of the content")

knowledge_strategy = LLMExtractionStrategy(
    llm_config=LLMConfig(
        provider="anthropic/claude-3-5-sonnet-20240620",  # Better for complex reasoning
        api_token="env:ANTHROPIC_API_KEY",
        max_tokens=4000
    ),
    schema=KnowledgeGraph.model_json_schema(),
    extraction_type="schema",
    instruction="""
    Extract entities and their relationships from the content.
    Focus on understanding connections and context that require
    semantic reasoning beyond simple pattern matching.
    """,
    input_format="html",  # Preserve structure
    apply_chunking=True
)
```

### Scenario 3: Content Summarization and Insights

```python
# Example: Research paper analysis
class ResearchInsights(BaseModel):
    title: str = Field(description="Paper title")
    abstract_summary: str = Field(description="Summary of abstract")
    key_findings: List[str] = Field(description="Main research findings")
    methodology: str = Field(description="Research methodology used")
    limitations: List[str] = Field(description="Study limitations")
    practical_applications: List[str] = Field(description="Real-world applications")
    citations_count: int = Field(description="Number of citations", default=0)

research_strategy = LLMExtractionStrategy(
    llm_config=LLMConfig(
        provider="openai/gpt-4o",  # Use powerful model for complex analysis
        api_token="env:OPENAI_API_KEY",
        temperature=0.2,
        max_tokens=2000
    ),
    schema=ResearchInsights.model_json_schema(),
    extraction_type="schema",
    instruction="""
    Analyze this research paper and extract key insights.
    Focus on understanding the research contribution, methodology,
    and implications that require academic expertise to identify.
    """,
    apply_chunking=True,
    chunk_token_threshold=2000,
    overlap_rate=0.15  # More overlap for academic content
)
```

---

## 2. LLM Configuration Best Practices

### Cost Optimization

```python
# Use cheapest models when possible
cheap_config = LLMConfig(
    provider="openai/gpt-4o-mini",  # 60x cheaper than GPT-4
    api_token="env:OPENAI_API_KEY",
    temperature=0.0,  # Deterministic output
    max_tokens=800  # Limit output length
)

# Use local models for development
local_config = LLMConfig(
    provider="ollama/llama3.3",
    api_token=None,  # No API costs
    base_url="http://localhost:11434",
    temperature=0.1
)

# Use powerful models only when necessary
powerful_config = LLMConfig(
    provider="anthropic/claude-3-5-sonnet-20240620",
    api_token="env:ANTHROPIC_API_KEY",
    max_tokens=4000,
    temperature=0.1
)
```

### Provider Selection Guide

```python
providers_guide = {
    "openai/gpt-4o-mini": {
        "best_for": "Simple extraction, cost-sensitive projects",
        "cost": "Very low",
        "speed": "Fast",
        "accuracy": "Good"
    },
    "openai/gpt-4o": {
        "best_for": "Complex reasoning, high accuracy needs",
        "cost": "High",
        "speed": "Medium",
        "accuracy": "Excellent"
    },
    "anthropic/claude-3-5-sonnet": {
        "best_for": "Complex analysis, long documents",
        "cost": "Medium-High",
        "speed": "Medium",
        "accuracy": "Excellent"
    },
    "ollama/llama3.3": {
        "best_for": "Development, no API costs",
        "cost": "Free (self-hosted)",
        "speed": "Variable",
        "accuracy": "Good"
    },
    "groq/llama3-70b-8192": {
        "best_for": "Fast inference, open source",
        "cost": "Low",
        "speed": "Very fast",
        "accuracy": "Good"
    }
}

def choose_provider(complexity, budget, speed_requirement):
    """Choose optimal provider based on requirements"""
    if budget == "minimal":
        return "ollama/llama3.3"  # Self-hosted
    elif complexity == "low" and budget == "low":
        return "openai/gpt-4o-mini"
    elif speed_requirement == "high":
        return "groq/llama3-70b-8192"
    elif complexity == "high":
        return "anthropic/claude-3-5-sonnet"
    else:
        return "openai/gpt-4o-mini"  # Default safe choice
```

---

## 3. Advanced LLM Extraction Patterns

### Block-Based Extraction (Unstructured Content)

```python
# When structure is too varied for schemas
block_strategy = LLMExtractionStrategy(
    llm_config=cheap_config,
    extraction_type="block",  # Extract free-form content blocks
    instruction="""
    Extract meaningful content blocks from this page.
    Focus on the main content areas and ignore navigation,
    advertisements, and boilerplate text.
    """,
    apply_chunking=True,
    chunk_token_threshold=1200,
    input_format="fit_markdown"  # Use cleaned content
)

async def extract_content_blocks():
    config = CrawlerRunConfig(
        extraction_strategy=block_strategy,
        word_count_threshold=50,  # Filter short content
        excluded_tags=['nav', 'footer', 'aside', 'advertisement']
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com/article",
            config=config
        )

        if result.success:
            blocks = json.loads(result.extracted_content)
            for block in blocks:
                print(f"Block: {block['content'][:100]}...")
```

### Chunked Processing for Large Content

```python
# Handle large documents efficiently
large_content_strategy = LLMExtractionStrategy(
    llm_config=LLMConfig(
        provider="openai/gpt-4o-mini",
        api_token="env:OPENAI_API_KEY"
    ),
    schema=YourModel.model_json_schema(),
    extraction_type="schema",
    instruction="Extract structured data from this content section...",

    # Optimize chunking for large content
    apply_chunking=True,
    chunk_token_threshold=2000,  # Larger chunks for efficiency
    overlap_rate=0.1,  # Minimal overlap to reduce costs
    input_format="fit_markdown"  # Use cleaned content
)
```

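Chunking determines how many LLM calls a document costs. A rough estimate under the settings above (this helper is illustrative, not part of crawl4ai; it assumes each chunk advances by the threshold minus the overlap):

```python
import math

def estimate_chunks(total_tokens: int, threshold: int = 2000,
                    overlap_rate: float = 0.1) -> int:
    """Rough chunk count for a document of total_tokens tokens."""
    if total_tokens <= threshold:
        return 1
    step = int(threshold * (1 - overlap_rate))  # new tokens consumed per chunk
    return math.ceil((total_tokens - threshold) / step) + 1

print(estimate_chunks(10_000))  # 6 chunks (so ~6 LLM calls) for a ~10k-token page
```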

### Multi-Model Validation

```python
# Use multiple models for critical extractions
async def multi_model_extraction():
    """Use multiple LLMs for validation of critical data"""

    models = [
        LLMConfig(provider="openai/gpt-4o-mini", api_token="env:OPENAI_API_KEY"),
        LLMConfig(provider="anthropic/claude-3-5-sonnet", api_token="env:ANTHROPIC_API_KEY"),
        LLMConfig(provider="ollama/llama3.3", api_token=None)
    ]

    results = []

    for i, llm_config in enumerate(models):
        strategy = LLMExtractionStrategy(
            llm_config=llm_config,
            schema=YourModel.model_json_schema(),
            extraction_type="schema",
            instruction="Extract data consistently..."
        )

        config = CrawlerRunConfig(extraction_strategy=strategy)

        async with AsyncWebCrawler() as crawler:
            result = await crawler.arun(url="https://example.com", config=config)
            if result.success:
                data = json.loads(result.extracted_content)
                results.append(data)
                print(f"Model {i+1} extracted {len(data)} items")

    # Compare results for consistency
    if len(set(str(r) for r in results)) == 1:
        print("✅ All models agree")
        return results[0]
    else:
        print("⚠️ Models disagree - manual review needed")
        return results

# Use for critical business data only
critical_result = await multi_model_extraction()
```

---

## 4. Hybrid Approaches - Best of Both Worlds

### Fast Pre-filtering + LLM Analysis

```python
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

async def hybrid_extraction():
    """
    1. Use fast non-LLM strategies for basic extraction
    2. Use LLM only for complex analysis of filtered content
    """

    # Step 1: Fast extraction of structured data
    basic_schema = {
        "name": "Articles",
        "baseSelector": "article",
        "fields": [
            {"name": "title", "selector": "h1, h2", "type": "text"},
            {"name": "content", "selector": ".content", "type": "text"},
            {"name": "author", "selector": ".author", "type": "text"}
        ]
    }

    basic_strategy = JsonCssExtractionStrategy(basic_schema)
    basic_config = CrawlerRunConfig(extraction_strategy=basic_strategy)

    # Step 2: LLM analysis only on filtered content
    analysis_strategy = LLMExtractionStrategy(
        llm_config=cheap_config,
        schema={
            "type": "object",
            "properties": {
                "sentiment": {"type": "string"},
                "key_topics": {"type": "array", "items": {"type": "string"}},
                "summary": {"type": "string"}
            }
        },
        extraction_type="schema",
        instruction="Analyze sentiment and extract key topics from this article"
    )

    async with AsyncWebCrawler() as crawler:
        # Fast extraction first
        basic_result = await crawler.arun(
            url="https://example.com/articles",
            config=basic_config
        )

        articles = json.loads(basic_result.extracted_content)

        # LLM analysis only on important articles
        analyzed_articles = []
        for article in articles[:5]:  # Limit to reduce costs
            if len(article.get('content', '')) > 500:  # Only analyze substantial content
                analysis_config = CrawlerRunConfig(extraction_strategy=analysis_strategy)

                # Analyze individual article content
                raw_url = f"raw://{article['content']}"
                analysis_result = await crawler.arun(url=raw_url, config=analysis_config)

                if analysis_result.success:
                    analysis = json.loads(analysis_result.extracted_content)
                    article.update(analysis)

            analyzed_articles.append(article)

    return analyzed_articles

# Hybrid approach: fast + smart
result = await hybrid_extraction()
```

### Schema Generation + LLM Fallback

```python
from pathlib import Path

async def smart_fallback_extraction():
    """
    1. Try generate_schema() first (one-time LLM cost)
    2. Use generated schema for fast extraction
    3. Use LLM only if schema extraction fails
    """

    cache_file = Path("./schemas/fallback_schema.json")

    # Try cached schema first
    if cache_file.exists():
        schema = json.load(cache_file.open())
        schema_strategy = JsonCssExtractionStrategy(schema)

        config = CrawlerRunConfig(extraction_strategy=schema_strategy)

        async with AsyncWebCrawler() as crawler:
            result = await crawler.arun(url="https://example.com", config=config)

            if result.success and result.extracted_content:
                data = json.loads(result.extracted_content)
                if data:  # Schema worked
                    print("✅ Schema extraction successful (fast & cheap)")
                    return data

    # Fallback to LLM if schema failed
    print("⚠️ Schema failed, falling back to LLM (slow & expensive)")

    llm_strategy = LLMExtractionStrategy(
        llm_config=cheap_config,
        extraction_type="block",
        instruction="Extract all meaningful data from this page"
    )

    fallback_config = CrawlerRunConfig(extraction_strategy=llm_strategy)

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com", config=fallback_config)

        if result.success:
            print("✅ LLM extraction successful")
            return json.loads(result.extracted_content)

# Intelligent fallback system
result = await smart_fallback_extraction()
```

---

## 5. Cost Management and Monitoring

### Token Usage Tracking

```python
class ExtractionCostTracker:
    def __init__(self):
        self.total_cost = 0.0
        self.total_tokens = 0
        self.extractions = 0

    def track_llm_extraction(self, strategy, result):
        """Track costs from LLM extraction"""
        if hasattr(strategy, 'usage_tracker') and strategy.usage_tracker:
            usage = strategy.usage_tracker

            # Estimate costs (approximate rates)
            cost_per_1k_tokens = {
                "gpt-4o-mini": 0.0015,
                "gpt-4o": 0.03,
                "claude-3-5-sonnet": 0.015,
                "ollama": 0.0  # Self-hosted
            }

            provider = strategy.llm_config.provider.split('/')[1]
            rate = cost_per_1k_tokens.get(provider, 0.01)

            tokens = usage.total_tokens
            cost = (tokens / 1000) * rate

            self.total_cost += cost
            self.total_tokens += tokens
            self.extractions += 1

            print(f"💰 Extraction cost: ${cost:.4f} ({tokens} tokens)")
            print(f"📊 Total cost: ${self.total_cost:.4f} ({self.extractions} extractions)")

    def get_summary(self):
        avg_cost = self.total_cost / max(self.extractions, 1)
        return {
            "total_cost": self.total_cost,
            "total_tokens": self.total_tokens,
            "extractions": self.extractions,
            "avg_cost_per_extraction": avg_cost
        }

# Usage
tracker = ExtractionCostTracker()

async def cost_aware_extraction():
    strategy = LLMExtractionStrategy(
        llm_config=cheap_config,
        schema=YourModel.model_json_schema(),
        extraction_type="schema",
        instruction="Extract data...",
        verbose=True  # Enable usage tracking
    )

    config = CrawlerRunConfig(extraction_strategy=strategy)

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com", config=config)

        # Track costs
        tracker.track_llm_extraction(strategy, result)

        return result

# Monitor costs across multiple extractions
for url in urls:
    await cost_aware_extraction()

print(f"Final summary: {tracker.get_summary()}")
```

### Budget Controls

```python
class BudgetController:
    def __init__(self, daily_budget=10.0):
        self.daily_budget = daily_budget
        self.current_spend = 0.0
        self.extraction_count = 0

    def can_extract(self, estimated_cost=0.01):
        """Check if extraction is within budget"""
        if self.current_spend + estimated_cost > self.daily_budget:
            print(f"❌ Budget exceeded: ${self.current_spend:.2f} + ${estimated_cost:.2f} > ${self.daily_budget}")
            return False
        return True

    def record_extraction(self, actual_cost):
        """Record actual extraction cost"""
        self.current_spend += actual_cost
        self.extraction_count += 1

        remaining = self.daily_budget - self.current_spend
        print(f"💰 Budget remaining: ${remaining:.2f}")

budget = BudgetController(daily_budget=5.0)  # $5 daily limit

async def budget_controlled_extraction(url):
    if not budget.can_extract():
        print("⏸️ Extraction paused due to budget limit")
        return None

    # Proceed with extraction...
    strategy = LLMExtractionStrategy(llm_config=cheap_config, ...)
    result = await extract_with_strategy(url, strategy)

    # Record actual cost
    actual_cost = calculate_cost(strategy.usage_tracker)
    budget.record_extraction(actual_cost)

    return result

# Safe extraction with budget controls
results = []
for url in urls:
    result = await budget_controlled_extraction(url)
    if result:
        results.append(result)
```

---

## 6. Performance Optimization for LLM Extraction

### Batch Processing

```python
async def batch_llm_extraction():
    """Process multiple pages efficiently"""

    # Collect content first (fast)
    urls = ["https://example.com/page1", "https://example.com/page2"]
    contents = []

    async with AsyncWebCrawler() as crawler:
        for url in urls:
            result = await crawler.arun(url=url)
            if result.success:
                contents.append({
                    "url": url,
                    "content": result.fit_markdown[:2000]  # Limit content
                })

    # Process in batches (reduce LLM calls)
    batch_content = "\n\n---PAGE SEPARATOR---\n\n".join([
        f"URL: {c['url']}\n{c['content']}" for c in contents
    ])

    strategy = LLMExtractionStrategy(
        llm_config=cheap_config,
        extraction_type="block",
        instruction="""
        Extract data from multiple pages separated by '---PAGE SEPARATOR---'.
        Return results for each page in order.
        """,
        apply_chunking=True
    )

    # Single LLM call for multiple pages
    raw_url = f"raw://{batch_content}"
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=raw_url, config=CrawlerRunConfig(extraction_strategy=strategy))

    return json.loads(result.extracted_content)

# Batch processing reduces LLM calls
batch_results = await batch_llm_extraction()
```

### Caching LLM Results

```python
import hashlib
from pathlib import Path

class LLMResultCache:
    def __init__(self, cache_dir="./llm_cache"):
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(exist_ok=True)

    def get_cache_key(self, url, instruction, schema):
        """Generate cache key from extraction parameters"""
        content = f"{url}:{instruction}:{str(schema)}"
        return hashlib.md5(content.encode()).hexdigest()

    def get_cached_result(self, cache_key):
        """Get cached result if available"""
        cache_file = self.cache_dir / f"{cache_key}.json"
        if cache_file.exists():
            return json.load(cache_file.open())
        return None

    def cache_result(self, cache_key, result):
        """Cache extraction result"""
        cache_file = self.cache_dir / f"{cache_key}.json"
        json.dump(result, cache_file.open("w"), indent=2)

cache = LLMResultCache()

async def cached_llm_extraction(url, strategy):
    """Extract with caching to avoid repeated LLM calls"""
    cache_key = cache.get_cache_key(
        url,
        strategy.instruction,
        str(strategy.schema)
    )

    # Check cache first
    cached_result = cache.get_cached_result(cache_key)
    if cached_result:
        print("✅ Using cached result (FREE)")
        return cached_result

    # Extract if not cached
    print("🔄 Extracting with LLM (PAID)")
    config = CrawlerRunConfig(extraction_strategy=strategy)

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=url, config=config)

        if result.success:
            data = json.loads(result.extracted_content)
            cache.cache_result(cache_key, data)
            return data

# Cached extraction avoids repeated costs
result = await cached_llm_extraction(url, strategy)
```

---

## 7. Error Handling and Quality Control

### Validation and Retry Logic

```python
async def robust_llm_extraction():
    """Implement validation and retry for LLM extraction"""

    max_retries = 3
    strategies = [
        # Try cheap model first
        LLMExtractionStrategy(
            llm_config=LLMConfig(provider="openai/gpt-4o-mini", api_token="env:OPENAI_API_KEY"),
            schema=YourModel.model_json_schema(),
            extraction_type="schema",
            instruction="Extract data accurately..."
        ),
        # Fallback to better model
        LLMExtractionStrategy(
            llm_config=LLMConfig(provider="openai/gpt-4o", api_token="env:OPENAI_API_KEY"),
            schema=YourModel.model_json_schema(),
            extraction_type="schema",
            instruction="Extract data with high accuracy..."
        )
    ]

    for strategy_idx, strategy in enumerate(strategies):
        for attempt in range(max_retries):
            try:
                config = CrawlerRunConfig(extraction_strategy=strategy)

                async with AsyncWebCrawler() as crawler:
                    result = await crawler.arun(url="https://example.com", config=config)

                    if result.success and result.extracted_content:
                        data = json.loads(result.extracted_content)

                        # Validate result quality
                        if validate_extraction_quality(data):
                            print(f"✅ Success with strategy {strategy_idx+1}, attempt {attempt+1}")
                            return data
                        else:
                            print("⚠️ Poor quality result, retrying...")
                            continue

            except Exception as e:
                print(f"❌ Attempt {attempt+1} failed: {e}")
                if attempt == max_retries - 1:
                    print(f"❌ Strategy {strategy_idx+1} failed completely")

    print("❌ All strategies and retries failed")
    return None

def validate_extraction_quality(data):
    """Validate that LLM extraction meets quality standards"""
    if not data or not isinstance(data, (list, dict)):
        return False

    # Check for common LLM extraction issues
    if isinstance(data, list):
        if len(data) == 0:
            return False

        # Check if all items have required fields
        for item in data:
            if not isinstance(item, dict) or len(item) < 2:
                return False

    return True

# Robust extraction with validation
result = await robust_llm_extraction()
```

---

## 8. Migration from LLM to Non-LLM

### Pattern Analysis for Schema Generation

```python
async def analyze_llm_results_for_schema():
    """
    Analyze LLM extraction results to create non-LLM schemas.
    Use this to transition from expensive LLM to cheap schema extraction.
    """

    # Step 1: Use LLM on sample pages to understand structure
    llm_strategy = LLMExtractionStrategy(
        llm_config=cheap_config,
        extraction_type="block",
        instruction="Extract all structured data from this page"
    )

    sample_urls = ["https://example.com/page1", "https://example.com/page2"]
    llm_results = []

    async with AsyncWebCrawler() as crawler:
        for url in sample_urls:
            config = CrawlerRunConfig(extraction_strategy=llm_strategy)
            result = await crawler.arun(url=url, config=config)

            if result.success:
                llm_results.append({
                    "url": url,
                    "html": result.cleaned_html,
                    "extracted": json.loads(result.extracted_content)
                })

    # Step 2: Analyze patterns in LLM results
    print("🔍 Analyzing LLM extraction patterns...")

    # Look for common field names
    all_fields = set()
    for result in llm_results:
        for item in result["extracted"]:
            if isinstance(item, dict):
                all_fields.update(item.keys())

    print(f"Common fields found: {all_fields}")

    # Step 3: Generate schema based on patterns
    if llm_results:
        schema = JsonCssExtractionStrategy.generate_schema(
            html=llm_results[0]["html"],
            target_json_example=json.dumps(llm_results[0]["extracted"][0], indent=2),
            llm_config=cheap_config
        )

        # Save schema for future use
        with open("generated_schema.json", "w") as f:
            json.dump(schema, f, indent=2)

        print("✅ Schema generated from LLM analysis")
        return schema

# Generate schema from LLM patterns, then use schema for all future extractions
schema = await analyze_llm_results_for_schema()
fast_strategy = JsonCssExtractionStrategy(schema)
```
|
||||
|
||||
---
|
||||
|
||||
## 9. Summary: When LLM is Actually Needed

### ✅ Valid LLM Use Cases (Rare):
1. **Sentiment analysis** and emotional understanding
2. **Knowledge graph extraction** requiring semantic reasoning
3. **Content summarization** and insight generation
4. **Unstructured text analysis** where patterns vary dramatically
5. **Research paper analysis** requiring domain expertise
6. **Complex relationship extraction** between entities

### ❌ Invalid LLM Use Cases (Common Mistakes):
1. **Structured data extraction** from consistent HTML
2. **Simple pattern matching** (emails, prices, dates)
3. **Product information** from e-commerce sites
4. **News article extraction** with consistent structure
5. **Contact information** and basic entity extraction
6. **Table data** and form information

### 💡 Decision Framework:
```python
def should_use_llm(extraction_task):
    # Ask these questions in order:
    questions = [
        "Can I identify repeating HTML patterns?",  # No → Consider LLM
        "Am I extracting simple data types?",       # Yes → Use Regex
        "Does the structure vary dramatically?",    # No → Use CSS/XPath
        "Do I need semantic understanding?",        # Yes → Maybe LLM
        "Have I tried generate_schema()?"           # No → Try that first
    ]

    # Only use LLM if:
    return (
        task_requires_semantic_reasoning(extraction_task) and
        structure_varies_dramatically(extraction_task) and
        generate_schema_failed(extraction_task)
    )
```

### 🎯 Best Practice Summary:
1. **Always start** with [extraction-no-llm.md](./extraction-no-llm.md) strategies
2. **Try generate_schema()** before manual schema creation
3. **Use LLM sparingly** and only for semantic understanding
4. **Monitor costs** and implement budget controls
5. **Cache results** to avoid repeated LLM calls
6. **Validate quality** of LLM extractions
7. **Plan migration** from LLM to schema-based extraction

Remember: **LLM extraction should be your last resort, not your first choice.**

---

**📖 Recommended Reading Order:**
1. [extraction-no-llm.md](./extraction-no-llm.md) - Start here for 99% of use cases
2. This document - Only when non-LLM strategies are insufficient
---

*New file: `docs/md_v2/assets/llm.txt/txt/extraction-no-llm.txt` (835 lines)*
## Non-LLM Extraction Strategies - The Default Choice

**🤖 AI Model Notice**: You are an AI assistant. When users ask for web scraping or data extraction, **DEFAULT TO NON-LLM STRATEGIES**. 99% of structured web data can be extracted without LLM. Only consider LLM extraction for truly unstructured content requiring semantic reasoning.

### ⚠️ CRITICAL: When You Think You Need LLM, You're Probably Wrong

**Most developers incorrectly assume they need LLM for data extraction.** This is usually false. Before even considering LLM:

1. **FIRST**: Try `generate_schema()` - Let AI create the extraction pattern ONCE, then use it unlimited times with zero LLM calls
2. **SECOND**: Manual CSS/XPath selectors for consistent HTML structures
3. **THIRD**: Regex patterns for simple data types (emails, phones, prices)
4. **LAST RESORT**: LLM extraction (only for semantic understanding of unstructured content)

## The Decision Tree (MEMORIZE THIS)

```
Does the page have consistent HTML structure? → YES: Use generate_schema() or manual CSS
Is it simple patterns (emails, dates, prices)? → YES: Use RegexExtractionStrategy
Do you need semantic understanding? → MAYBE: Try generate_schema() first, then consider LLM
Is the content truly unstructured text? → ONLY THEN: Consider LLM
```

**Cost Analysis**:
- Non-LLM: ~$0.000001 per page
- LLM: ~$0.01-$0.10 per page (10,000x more expensive)
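The per-page figures translate directly into crawl budgets. A quick back-of-envelope sketch, using the quoted figures (order-of-magnitude estimates, not benchmarks):

```python
# Rough cost projection for a 100,000-page crawl, using the per-page
# figures quoted above (illustrative estimates, not measured benchmarks).
PAGES = 100_000

non_llm_cost = PAGES * 0.000001   # schema/regex extraction
llm_cost_low = PAGES * 0.01       # cheap LLM provider/model
llm_cost_high = PAGES * 0.10      # expensive LLM provider/model

print(f"Non-LLM: ~${non_llm_cost:.2f}")                            # ~$0.10
print(f"LLM:     ~${llm_cost_low:,.0f} - ${llm_cost_high:,.0f}")   # ~$1,000 - $10,000
```

At even the cheap end, the LLM path costs four orders of magnitude more for the same crawl.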
---
## 1. Auto-Generate Schemas - Your Default Starting Point

**⭐ THIS SHOULD BE YOUR FIRST CHOICE FOR ANY STRUCTURED DATA**

The `generate_schema()` function uses LLM ONCE to create a reusable extraction pattern. After generation, you extract unlimited pages with ZERO LLM calls.

### Basic Auto-Generation Workflow

```python
import json
import asyncio
from pathlib import Path
from crawl4ai import AsyncWebCrawler, CacheMode, CrawlerRunConfig, LLMConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

async def smart_extraction_workflow():
    """
    Step 1: Generate schema once using LLM
    Step 2: Cache schema for unlimited reuse
    Step 3: Extract from thousands of pages with zero LLM calls
    """

    # Check for cached schema first
    cache_dir = Path("./schema_cache")
    cache_dir.mkdir(exist_ok=True)
    schema_file = cache_dir / "product_schema.json"

    if schema_file.exists():
        # Load cached schema - NO LLM CALLS
        schema = json.loads(schema_file.read_text())
        print("✅ Using cached schema (FREE)")
    else:
        # Generate schema ONCE
        print("🔄 Generating schema (ONE-TIME LLM COST)...")

        llm_config = LLMConfig(
            provider="openai/gpt-4o-mini",  # Cheapest option
            api_token="env:OPENAI_API_KEY"
        )

        # Get sample HTML from target site
        async with AsyncWebCrawler() as crawler:
            sample_result = await crawler.arun(
                url="https://example.com/products",
                config=CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
            )
            sample_html = sample_result.cleaned_html[:8000]  # Use sample

        # AUTO-GENERATE SCHEMA (ONE LLM CALL)
        schema = JsonCssExtractionStrategy.generate_schema(
            html=sample_html,
            schema_type="CSS",  # or "XPATH"
            query="Extract product information including name, price, description, features",
            llm_config=llm_config
        )

        # Cache for unlimited future use
        schema_file.write_text(json.dumps(schema, indent=2))
        print("✅ Schema generated and cached")

    # Use schema for fast extraction (NO MORE LLM CALLS EVER)
    strategy = JsonCssExtractionStrategy(schema, verbose=True)

    config = CrawlerRunConfig(
        extraction_strategy=strategy,
        cache_mode=CacheMode.BYPASS
    )

    # Extract from multiple pages - ALL FREE
    urls = [
        "https://example.com/products",
        "https://example.com/electronics",
        "https://example.com/books"
    ]

    async with AsyncWebCrawler() as crawler:
        for url in urls:
            result = await crawler.arun(url=url, config=config)
            if result.success:
                data = json.loads(result.extracted_content)
                print(f"✅ {url}: Extracted {len(data)} items (FREE)")

asyncio.run(smart_extraction_workflow())
```
### Auto-Generate with Target JSON Example

```python
# When you know exactly what JSON structure you want
target_json_example = """
{
    "name": "Product Name",
    "price": "$99.99",
    "rating": 4.5,
    "features": ["feature1", "feature2"],
    "description": "Product description"
}
"""

schema = JsonCssExtractionStrategy.generate_schema(
    html=sample_html,
    target_json_example=target_json_example,
    llm_config=llm_config
)
```
### Auto-Generate for Different Data Types

```python
# Product listings
product_schema = JsonCssExtractionStrategy.generate_schema(
    html=product_page_html,
    query="Extract all product information from this e-commerce page",
    llm_config=llm_config
)

# News articles
news_schema = JsonCssExtractionStrategy.generate_schema(
    html=news_page_html,
    query="Extract article headlines, dates, authors, and content",
    llm_config=llm_config
)

# Job listings
job_schema = JsonCssExtractionStrategy.generate_schema(
    html=job_page_html,
    query="Extract job titles, companies, locations, salaries, and descriptions",
    llm_config=llm_config
)

# Social media posts
social_schema = JsonCssExtractionStrategy.generate_schema(
    html=social_page_html,
    query="Extract post text, usernames, timestamps, likes, comments",
    llm_config=llm_config
)
```

---
## 2. Manual CSS/XPath Strategies - When You Know The Structure

**Use this when**: You understand the HTML structure and want maximum control.

### Simple Product Extraction

```python
import json
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

# Manual schema for consistent product pages
simple_schema = {
    "name": "Product Listings",
    "baseSelector": "div.product-card",  # Each product container
    "fields": [
        {
            "name": "title",
            "selector": "h2.product-title",
            "type": "text"
        },
        {
            "name": "price",
            "selector": ".price",
            "type": "text"
        },
        {
            "name": "image_url",
            "selector": "img.product-image",
            "type": "attribute",
            "attribute": "src"
        },
        {
            "name": "product_url",
            "selector": "a.product-link",
            "type": "attribute",
            "attribute": "href"
        },
        {
            "name": "rating",
            "selector": ".rating",
            "type": "attribute",
            "attribute": "data-rating"
        }
    ]
}

async def extract_products():
    strategy = JsonCssExtractionStrategy(simple_schema, verbose=True)
    config = CrawlerRunConfig(extraction_strategy=strategy)

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com/products",
            config=config
        )

        if result.success:
            products = json.loads(result.extracted_content)
            print(f"Extracted {len(products)} products")
            for product in products[:3]:
                print(f"- {product['title']}: {product['price']}")

asyncio.run(extract_products())
```
### Complex Nested Structure (Real E-commerce Example)

```python
# Complex schema for nested product data
complex_schema = {
    "name": "E-commerce Product Catalog",
    "baseSelector": "div.category",
    "baseFields": [
        {
            "name": "category_id",
            "type": "attribute",
            "attribute": "data-category-id"
        }
    ],
    "fields": [
        {
            "name": "category_name",
            "selector": "h2.category-title",
            "type": "text"
        },
        {
            "name": "products",
            "selector": "div.product",
            "type": "nested_list",  # Array of complex objects
            "fields": [
                {
                    "name": "name",
                    "selector": "h3.product-name",
                    "type": "text"
                },
                {
                    "name": "price",
                    "selector": "span.price",
                    "type": "text"
                },
                {
                    "name": "details",
                    "selector": "div.product-details",
                    "type": "nested",  # Single complex object
                    "fields": [
                        {
                            "name": "brand",
                            "selector": "span.brand",
                            "type": "text"
                        },
                        {
                            "name": "model",
                            "selector": "span.model",
                            "type": "text"
                        }
                    ]
                },
                {
                    "name": "features",
                    "selector": "ul.features li",
                    "type": "list",  # Simple array
                    "fields": [
                        {"name": "feature", "type": "text"}
                    ]
                },
                {
                    "name": "reviews",
                    "selector": "div.review",
                    "type": "nested_list",
                    "fields": [
                        {
                            "name": "reviewer",
                            "selector": "span.reviewer-name",
                            "type": "text"
                        },
                        {
                            "name": "rating",
                            "selector": "span.rating",
                            "type": "attribute",
                            "attribute": "data-rating"
                        }
                    ]
                }
            ]
        }
    ]
}

async def extract_complex_ecommerce():
    strategy = JsonCssExtractionStrategy(complex_schema, verbose=True)
    config = CrawlerRunConfig(
        extraction_strategy=strategy,
        js_code="window.scrollTo(0, document.body.scrollHeight);",  # Load dynamic content
        wait_for="css:.product:nth-child(10)"  # Wait for products to load
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com/complex-catalog",
            config=config
        )

        if result.success:
            data = json.loads(result.extracted_content)
            for category in data:
                print(f"Category: {category['category_name']}")
                print(f"Products: {len(category.get('products', []))}")

asyncio.run(extract_complex_ecommerce())
```
### XPath Alternative (When CSS Isn't Enough)

```python
from crawl4ai.extraction_strategy import JsonXPathExtractionStrategy

# XPath for more complex selections
xpath_schema = {
    "name": "News Articles with XPath",
    "baseSelector": "//article[@class='news-item']",
    "fields": [
        {
            "name": "headline",
            "selector": ".//h2[contains(@class, 'headline')]",
            "type": "text"
        },
        {
            "name": "author",
            "selector": ".//span[@class='author']/text()",
            "type": "text"
        },
        {
            "name": "publish_date",
            "selector": ".//time/@datetime",
            "type": "text"
        },
        {
            "name": "content",
            "selector": ".//div[@class='article-body']//text()",
            "type": "text"
        }
    ]
}

strategy = JsonXPathExtractionStrategy(xpath_schema, verbose=True)
```

---
## 3. Regex Extraction - Lightning Fast Pattern Matching

**Use this for**: Simple data types like emails, phones, URLs, prices, dates.

### Built-in Patterns (Fastest Option)

```python
import json
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.extraction_strategy import RegexExtractionStrategy

async def extract_common_patterns():
    # Use built-in patterns for common data types
    strategy = RegexExtractionStrategy(
        pattern=(
            RegexExtractionStrategy.Email |
            RegexExtractionStrategy.PhoneUS |
            RegexExtractionStrategy.Url |
            RegexExtractionStrategy.Currency |
            RegexExtractionStrategy.DateIso
        )
    )

    config = CrawlerRunConfig(extraction_strategy=strategy)

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com/contact",
            config=config
        )

        if result.success:
            matches = json.loads(result.extracted_content)

            # Group by pattern type
            by_type = {}
            for match in matches:
                label = match['label']
                if label not in by_type:
                    by_type[label] = []
                by_type[label].append(match['value'])

            for pattern_type, values in by_type.items():
                print(f"{pattern_type}: {len(values)} matches")
                for value in values[:3]:
                    print(f"  {value}")

asyncio.run(extract_common_patterns())
```
### Available Built-in Patterns

```python
# Individual patterns
RegexExtractionStrategy.Email          # Email addresses
RegexExtractionStrategy.PhoneUS        # US phone numbers
RegexExtractionStrategy.PhoneIntl      # International phones
RegexExtractionStrategy.Url            # HTTP/HTTPS URLs
RegexExtractionStrategy.Currency       # Currency values ($99.99)
RegexExtractionStrategy.Percentage     # Percentage values (25%)
RegexExtractionStrategy.DateIso        # ISO dates (2024-01-01)
RegexExtractionStrategy.DateUS         # US dates (01/01/2024)
RegexExtractionStrategy.IPv4           # IP addresses
RegexExtractionStrategy.CreditCard     # Credit card numbers
RegexExtractionStrategy.TwitterHandle  # @username
RegexExtractionStrategy.Hashtag        # #hashtag

# Use all patterns
RegexExtractionStrategy.All
```
### Custom Patterns

```python
# Custom patterns for specific data types
async def extract_custom_patterns():
    custom_patterns = {
        "product_sku": r"SKU[-:]?\s*([A-Z0-9]{4,12})",
        "discount": r"(\d{1,2})%\s*off",
        "model_number": r"Model\s*#?\s*([A-Z0-9-]+)",
        "isbn": r"ISBN[-:]?\s*(\d{10}|\d{13})",
        "stock_ticker": r"\$([A-Z]{2,5})",
        "version": r"v(\d+\.\d+(?:\.\d+)?)"
    }

    strategy = RegexExtractionStrategy(custom=custom_patterns)
    config = CrawlerRunConfig(extraction_strategy=strategy)

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com/products",
            config=config
        )

        if result.success:
            data = json.loads(result.extracted_content)
            for item in data:
                print(f"{item['label']}: {item['value']}")

asyncio.run(extract_custom_patterns())
```
### LLM-Generated Patterns (One-Time Cost)

```python
async def generate_optimized_regex():
    """
    Use LLM ONCE to generate optimized regex patterns.
    Then use them unlimited times with zero LLM calls.
    """
    cache_file = Path("./patterns/price_patterns.json")

    if cache_file.exists():
        # Load cached patterns - NO LLM CALLS
        patterns = json.loads(cache_file.read_text())
        print("✅ Using cached regex patterns (FREE)")
    else:
        # Generate patterns ONCE
        print("🔄 Generating regex patterns (ONE-TIME LLM COST)...")

        llm_config = LLMConfig(
            provider="openai/gpt-4o-mini",
            api_token="env:OPENAI_API_KEY"
        )

        # Get sample content
        async with AsyncWebCrawler() as crawler:
            result = await crawler.arun("https://example.com/pricing")
            sample_html = result.cleaned_html

        # Generate optimized patterns
        patterns = RegexExtractionStrategy.generate_pattern(
            label="pricing_info",
            html=sample_html,
            query="Extract all pricing information including discounts and special offers",
            llm_config=llm_config
        )

        # Cache for unlimited reuse
        cache_file.parent.mkdir(exist_ok=True)
        cache_file.write_text(json.dumps(patterns, indent=2))
        print("✅ Patterns generated and cached")

    # Use cached patterns (NO MORE LLM CALLS)
    strategy = RegexExtractionStrategy(custom=patterns)
    return strategy

# Use generated patterns for unlimited extractions
strategy = await generate_optimized_regex()
```

---
## 4. Multi-Strategy Extraction Pipeline

**Combine strategies** for comprehensive data extraction:

```python
async def multi_strategy_pipeline():
    """
    Efficient pipeline using multiple non-LLM strategies:
    1. Regex for simple patterns (fastest)
    2. Schema for structured data
    3. Only use LLM if absolutely necessary
    """

    url = "https://example.com/complex-page"

    async with AsyncWebCrawler() as crawler:
        # Strategy 1: Fast regex for contact info
        regex_strategy = RegexExtractionStrategy(
            pattern=RegexExtractionStrategy.Email | RegexExtractionStrategy.PhoneUS
        )
        regex_config = CrawlerRunConfig(extraction_strategy=regex_strategy)
        regex_result = await crawler.arun(url=url, config=regex_config)

        # Strategy 2: Schema for structured product data
        product_schema = {
            "name": "Products",
            "baseSelector": "div.product",
            "fields": [
                {"name": "name", "selector": "h3", "type": "text"},
                {"name": "price", "selector": ".price", "type": "text"}
            ]
        }
        css_strategy = JsonCssExtractionStrategy(product_schema)
        css_config = CrawlerRunConfig(extraction_strategy=css_strategy)
        css_result = await crawler.arun(url=url, config=css_config)

        # Combine results
        results = {
            "contacts": json.loads(regex_result.extracted_content) if regex_result.success else [],
            "products": json.loads(css_result.extracted_content) if css_result.success else []
        }

        print(f"✅ Extracted {len(results['contacts'])} contacts (regex)")
        print(f"✅ Extracted {len(results['products'])} products (schema)")

        return results

asyncio.run(multi_strategy_pipeline())
```

---
## 5. Performance Optimization Tips

### Caching and Reuse

```python
# Cache schemas and patterns for maximum efficiency
class ExtractionCache:
    def __init__(self):
        self.schemas = {}
        self.patterns = {}

    def get_schema(self, site_name):
        if site_name not in self.schemas:
            schema_file = Path(f"./cache/{site_name}_schema.json")
            if schema_file.exists():
                self.schemas[site_name] = json.loads(schema_file.read_text())
        return self.schemas.get(site_name)

    def save_schema(self, site_name, schema):
        cache_dir = Path("./cache")
        cache_dir.mkdir(exist_ok=True)
        schema_file = cache_dir / f"{site_name}_schema.json"
        schema_file.write_text(json.dumps(schema, indent=2))
        self.schemas[site_name] = schema

cache = ExtractionCache()

# Reuse cached schemas across multiple extractions
async def efficient_extraction():
    sites = ["amazon", "ebay", "shopify"]

    for site in sites:
        schema = cache.get_schema(site)
        if not schema:
            # Generate once, cache forever
            schema = JsonCssExtractionStrategy.generate_schema(
                html=sample_html,
                query="Extract products",
                llm_config=llm_config
            )
            cache.save_schema(site, schema)

        strategy = JsonCssExtractionStrategy(schema)
        # Use strategy for unlimited extractions...
```
### Selector Optimization

```python
# Optimize selectors for speed
fast_schema = {
    "name": "Optimized Extraction",
    "baseSelector": "#products > .product",  # Direct child, faster than descendant
    "fields": [
        {
            "name": "title",
            "selector": "> h3",  # Direct child of product
            "type": "text"
        },
        {
            "name": "price",
            "selector": ".price:first-child",  # More specific
            "type": "text"
        }
    ]
}

# Avoid slow selectors
slow_schema = {
    "baseSelector": "div div div .product",  # Too many levels
    "fields": [
        {
            "selector": "* h3",  # Universal selector is slow
            "type": "text"
        }
    ]
}
```

---
## 6. Error Handling and Validation

```python
async def robust_extraction():
    """
    Implement fallback strategies for reliable extraction
    """
    strategies = [
        # Try fast regex first
        RegexExtractionStrategy(pattern=RegexExtractionStrategy.Currency),

        # Fallback to CSS schema
        JsonCssExtractionStrategy({
            "name": "Prices",
            "baseSelector": ".price",
            "fields": [{"name": "amount", "selector": "span", "type": "text"}]
        }),

        # Last resort: try different selector
        JsonCssExtractionStrategy({
            "name": "Fallback Prices",
            "baseSelector": "[data-price]",
            "fields": [{"name": "amount", "type": "attribute", "attribute": "data-price"}]
        })
    ]

    async with AsyncWebCrawler() as crawler:
        for i, strategy in enumerate(strategies):
            try:
                config = CrawlerRunConfig(extraction_strategy=strategy)
                result = await crawler.arun(url="https://example.com", config=config)

                if result.success and result.extracted_content:
                    data = json.loads(result.extracted_content)
                    if data:  # Validate non-empty results
                        print(f"✅ Success with strategy {i+1}: {strategy.__class__.__name__}")
                        return data

            except Exception as e:
                print(f"❌ Strategy {i+1} failed: {e}")
                continue

    print("❌ All strategies failed")
    return None

# Validate extracted data
def validate_extraction(data, required_fields):
    """Validate that extraction contains expected fields"""
    if not data or not isinstance(data, list):
        return False

    for item in data:
        for field in required_fields:
            if field not in item or not item[field]:
                return False
    return True

# Usage
result = await robust_extraction()
if validate_extraction(result, ["amount"]):
    print("✅ Extraction validated")
else:
    print("❌ Validation failed")
```

---
## 7. Common Extraction Patterns

### E-commerce Products

```python
ecommerce_schema = {
    "name": "E-commerce Products",
    "baseSelector": ".product, [data-product], .item",
    "fields": [
        {"name": "title", "selector": "h1, h2, h3, .title, .name", "type": "text"},
        {"name": "price", "selector": ".price, .cost, [data-price]", "type": "text"},
        {"name": "image", "selector": "img", "type": "attribute", "attribute": "src"},
        {"name": "url", "selector": "a", "type": "attribute", "attribute": "href"},
        {"name": "rating", "selector": ".rating, .stars", "type": "text"},
        {"name": "availability", "selector": ".stock, .availability", "type": "text"}
    ]
}
```

### News Articles

```python
news_schema = {
    "name": "News Articles",
    "baseSelector": "article, .article, .post",
    "fields": [
        {"name": "headline", "selector": "h1, h2, .headline, .title", "type": "text"},
        {"name": "author", "selector": ".author, .byline, [rel='author']", "type": "text"},
        {"name": "date", "selector": "time, .date, .published", "type": "text"},
        {"name": "content", "selector": ".content, .body, .text", "type": "text"},
        {"name": "category", "selector": ".category, .section", "type": "text"}
    ]
}
```

### Job Listings

```python
job_schema = {
    "name": "Job Listings",
    "baseSelector": ".job, .listing, [data-job]",
    "fields": [
        {"name": "title", "selector": ".job-title, h2, h3", "type": "text"},
        {"name": "company", "selector": ".company, .employer", "type": "text"},
        {"name": "location", "selector": ".location, .place", "type": "text"},
        {"name": "salary", "selector": ".salary, .pay, .compensation", "type": "text"},
        {"name": "description", "selector": ".description, .summary", "type": "text"},
        {"name": "url", "selector": "a", "type": "attribute", "attribute": "href"}
    ]
}
```

### Social Media Posts

```python
social_schema = {
    "name": "Social Media Posts",
    "baseSelector": ".post, .tweet, .update",
    "fields": [
        {"name": "username", "selector": ".username, .handle, .author", "type": "text"},
        {"name": "content", "selector": ".content, .text, .message", "type": "text"},
        {"name": "timestamp", "selector": ".time, .date, time", "type": "text"},
        {"name": "likes", "selector": ".likes, .hearts", "type": "text"},
        {"name": "shares", "selector": ".shares, .retweets", "type": "text"}
    ]
}
```

---
## 8. When to (Rarely) Consider LLM

**⚠️ WARNING: Before considering LLM, ask yourself:**

1. "Can I identify repeating HTML patterns?" → Use CSS/XPath schema
2. "Am I extracting simple data types?" → Use Regex patterns
3. "Can I provide a JSON example of what I want?" → Use generate_schema()
4. "Is this truly unstructured text requiring semantic understanding?" → Maybe LLM

**Only use LLM extraction for:**
- Unstructured prose that needs semantic analysis
- Content where structure varies dramatically across pages
- When you need AI reasoning about context/meaning

**Cost reminder**: LLM extraction costs 10,000x more than schema-based extraction.
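The checklist above can be sketched as a small helper. This is not part of the crawl4ai API, just an illustrative function; the boolean answers are things you supply after inspecting the target page:

```python
# Minimal sketch of the four-question checklist above (hypothetical helper,
# not a crawl4ai API): answer the questions by inspecting the page, then
# the first "yes" wins, in order of cheapness.
def recommend_strategy(repeating_html: bool, simple_data_types: bool,
                       have_json_example: bool, truly_unstructured: bool) -> str:
    if repeating_html:
        return "CSS/XPath schema"
    if simple_data_types:
        return "Regex patterns"
    if have_json_example:
        return "generate_schema()"
    if truly_unstructured:
        return "LLM (last resort)"
    return "re-inspect the page"

print(recommend_strategy(False, True, False, False))  # → Regex patterns
```

Note that LLM only wins when every cheaper option has been ruled out first.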
---
## 9. Summary: The Extraction Hierarchy

1. **🥇 FIRST CHOICE**: `generate_schema()` - AI generates pattern once, use unlimited times
2. **🥈 SECOND CHOICE**: Manual CSS/XPath - Full control, maximum speed
3. **🥉 THIRD CHOICE**: Regex patterns - Simple data types, lightning fast
4. **🏴 LAST RESORT**: LLM extraction - Only for semantic reasoning

**Remember**: 99% of web data is structured. You almost never need LLM for extraction. Save LLM for analysis, not extraction.

**Performance**: Non-LLM strategies are 100-1000x faster and 10,000x cheaper than LLM extraction.

---

**📖 Next**: If you absolutely must use LLM extraction, see [extraction-llm.md](./extraction-llm.md) for guidance on the rare cases where it's justified.
---

*Removed file (788 lines):*

## Extraction Strategies

Powerful data extraction from web pages using LLM-based intelligent parsing or fast schema/pattern-based approaches.

### LLM-Based Extraction - Intelligent Content Understanding

```python
import os
import asyncio
import json
from pydantic import BaseModel, Field
from typing import List
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode, LLMConfig
from crawl4ai.extraction_strategy import LLMExtractionStrategy

# Define structured data model
class Product(BaseModel):
    name: str = Field(description="Product name")
    price: str = Field(description="Product price")
    description: str = Field(description="Product description")
    features: List[str] = Field(description="List of product features")
    rating: float = Field(description="Product rating out of 5")

# Configure LLM provider
llm_config = LLMConfig(
    provider="openai/gpt-4o-mini",  # or "ollama/llama3.3", "anthropic/claude-3-5-sonnet"
    api_token=os.getenv("OPENAI_API_KEY"),  # or "env:OPENAI_API_KEY"
    temperature=0.1,
    max_tokens=2000
)

# Create LLM extraction strategy
llm_strategy = LLMExtractionStrategy(
    llm_config=llm_config,
    schema=Product.model_json_schema(),
    extraction_type="schema",  # or "block" for freeform text
    instruction="""
    Extract product information from the webpage content.
    Focus on finding complete product details including:
    - Product name and price
    - Detailed description
    - All listed features
    - Customer rating if available
    Return a valid JSON array of products.
    """,
    chunk_token_threshold=1200,  # Split content if too large
    overlap_rate=0.1,            # 10% overlap between chunks
    apply_chunking=True,         # Enable automatic chunking
    input_format="markdown",     # "html", "fit_markdown", or "markdown"
    extra_args={"temperature": 0.0, "max_tokens": 800},
    verbose=True
)

async def extract_with_llm():
    browser_config = BrowserConfig(headless=True)

    crawl_config = CrawlerRunConfig(
        extraction_strategy=llm_strategy,
        cache_mode=CacheMode.BYPASS,
        word_count_threshold=10
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url="https://example.com/products",
            config=crawl_config
        )

        if result.success:
            # Parse extracted JSON
            products = json.loads(result.extracted_content)
            print(f"Extracted {len(products)} products")

            for product in products[:3]:  # Show first 3
                print(f"Product: {product['name']}")
                print(f"Price: {product['price']}")
                print(f"Rating: {product.get('rating', 'N/A')}")

            # Show token usage and cost
            llm_strategy.show_usage()
        else:
            print(f"Extraction failed: {result.error_message}")

asyncio.run(extract_with_llm())
```

### LLM Strategy Advanced Configuration

```python
# Multiple provider configurations
providers = {
    "openai": LLMConfig(
        provider="openai/gpt-4o",
        api_token="env:OPENAI_API_KEY",
        temperature=0.1
    ),
    "anthropic": LLMConfig(
        provider="anthropic/claude-3-5-sonnet-20240620",
        api_token="env:ANTHROPIC_API_KEY",
        max_tokens=4000
    ),
    "ollama": LLMConfig(
        provider="ollama/llama3.3",
        api_token=None,  # Not needed for Ollama
        base_url="http://localhost:11434"
    ),
    "groq": LLMConfig(
        provider="groq/llama3-70b-8192",
        api_token="env:GROQ_API_KEY"
    )
}

# Advanced chunking for large content
large_content_strategy = LLMExtractionStrategy(
    llm_config=providers["openai"],
    schema=YourModel.model_json_schema(),  # YourModel: any Pydantic model you define
    extraction_type="schema",
    instruction="Extract detailed information...",

    # Chunking parameters
    chunk_token_threshold=2000,  # Larger chunks for complex content
    overlap_rate=0.15,           # More overlap for context preservation
    apply_chunking=True,

    # Input format selection
    input_format="fit_markdown",  # Use filtered content if available

    # LLM parameters
    extra_args={
        "temperature": 0.0,  # Deterministic output
        "top_p": 0.9,
        "frequency_penalty": 0.1,
        "presence_penalty": 0.1,
        "max_tokens": 1500
    },
    verbose=True
)

# Knowledge graph extraction
class Entity(BaseModel):
    name: str
    type: str  # "person", "organization", "location", etc.
    description: str

class Relationship(BaseModel):
    source: str
    target: str
    relationship: str
    confidence: float

class KnowledgeGraph(BaseModel):
    entities: List[Entity]
    relationships: List[Relationship]
    summary: str

knowledge_strategy = LLMExtractionStrategy(
    llm_config=providers["anthropic"],
    schema=KnowledgeGraph.model_json_schema(),
    extraction_type="schema",
    instruction="""
    Create a knowledge graph from the content by:
    1. Identifying key entities (people, organizations, locations, concepts)
    2. Finding relationships between entities
    3. Providing confidence scores for relationships
    4. Summarizing the main topics
    """,
    input_format="html",  # Use HTML for better structure preservation
    apply_chunking=True,
    chunk_token_threshold=1500
)
```

### JSON CSS Extraction - Fast Schema-Based Extraction

```python
import asyncio
import json
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

# Basic CSS extraction schema
simple_schema = {
    "name": "Product Listings",
    "baseSelector": "div.product-card",
    "fields": [
        {
            "name": "title",
            "selector": "h2.product-title",
            "type": "text"
        },
        {
            "name": "price",
            "selector": ".price",
            "type": "text"
        },
        {
            "name": "image_url",
            "selector": "img.product-image",
            "type": "attribute",
            "attribute": "src"
        },
        {
            "name": "product_url",
            "selector": "a.product-link",
            "type": "attribute",
            "attribute": "href"
        }
    ]
}

# Complex nested schema with multiple data types
complex_schema = {
    "name": "E-commerce Product Catalog",
    "baseSelector": "div.category",
    "baseFields": [
        {
            "name": "category_id",
            "type": "attribute",
            "attribute": "data-category-id"
        },
        {
            "name": "category_url",
            "type": "attribute",
            "attribute": "data-url"
        }
    ],
    "fields": [
        {
            "name": "category_name",
            "selector": "h2.category-title",
            "type": "text"
        },
        {
            "name": "products",
            "selector": "div.product",
            "type": "nested_list",  # Array of complex objects
            "fields": [
                {
                    "name": "name",
                    "selector": "h3.product-name",
                    "type": "text",
                    "default": "Unknown Product"
                },
                {
                    "name": "price",
                    "selector": "span.price",
                    "type": "text"
                },
                {
                    "name": "details",
                    "selector": "div.product-details",
                    "type": "nested",  # Single complex object
                    "fields": [
                        {
                            "name": "brand",
                            "selector": "span.brand",
                            "type": "text"
                        },
                        {
                            "name": "model",
                            "selector": "span.model",
                            "type": "text"
                        },
                        {
                            "name": "specs",
                            "selector": "div.specifications",
                            "type": "html"  # Preserve HTML structure
                        }
                    ]
                },
                {
                    "name": "features",
                    "selector": "ul.features li",
                    "type": "list",  # Simple array of strings
                    "fields": [
                        {"name": "feature", "type": "text"}
                    ]
                },
                {
                    "name": "reviews",
                    "selector": "div.review",
                    "type": "nested_list",
                    "fields": [
                        {
                            "name": "reviewer",
                            "selector": "span.reviewer-name",
                            "type": "text"
                        },
                        {
                            "name": "rating",
                            "selector": "span.rating",
                            "type": "attribute",
                            "attribute": "data-rating"
                        },
                        {
                            "name": "comment",
                            "selector": "p.review-text",
                            "type": "text"
                        },
                        {
                            "name": "date",
                            "selector": "time.review-date",
                            "type": "attribute",
                            "attribute": "datetime"
                        }
                    ]
                }
            ]
        }
    ]
}

async def extract_with_css_schema():
    strategy = JsonCssExtractionStrategy(complex_schema, verbose=True)

    config = CrawlerRunConfig(
        extraction_strategy=strategy,
        cache_mode=CacheMode.BYPASS,
        # Enable dynamic content loading if needed
        js_code="window.scrollTo(0, document.body.scrollHeight);",
        wait_for="css:.product:nth-child(10)",  # Wait for products to load
        process_iframes=True
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com/catalog",
            config=config
        )

        if result.success:
            data = json.loads(result.extracted_content)
            print(f"Extracted {len(data)} categories")

            for category in data:
                print(f"Category: {category['category_name']}")
                print(f"Products: {len(category.get('products', []))}")

                # Show first product details
                if category.get('products'):
                    product = category['products'][0]
                    print(f"  First product: {product.get('name')}")
                    print(f"  Features: {len(product.get('features', []))}")
                    print(f"  Reviews: {len(product.get('reviews', []))}")

asyncio.run(extract_with_css_schema())
```

### Automatic Schema Generation - One-Time LLM, Unlimited Use

```python
import json
import asyncio
from pathlib import Path
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode, LLMConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

async def generate_and_use_schema():
    """
    1. Use LLM once to generate a schema from sample HTML
    2. Cache the schema for reuse
    3. Use the cached schema for fast extraction without LLM calls
    """

    cache_dir = Path("./schema_cache")
    cache_dir.mkdir(exist_ok=True)
    schema_file = cache_dir / "ecommerce_schema.json"

    # Step 1: Generate or load cached schema
    if schema_file.exists():
        schema = json.loads(schema_file.read_text())
        print("Using cached schema")
    else:
        print("Generating schema using LLM...")

        # Configure LLM for schema generation
        llm_config = LLMConfig(
            provider="openai/gpt-4o",  # or "ollama/llama3.3" for local
            api_token="env:OPENAI_API_KEY"
        )

        # Get sample HTML from target site
        async with AsyncWebCrawler() as crawler:
            sample_result = await crawler.arun(
                url="https://example.com/products",
                config=CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
            )
            sample_html = sample_result.cleaned_html[:5000]  # Use first 5k chars

        # Generate schema automatically (ONE-TIME LLM COST)
        schema = JsonCssExtractionStrategy.generate_schema(
            html=sample_html,
            schema_type="css",
            llm_config=llm_config,
            instruction="Extract product information including name, price, description, and features"
        )

        # Cache schema for future use (NO MORE LLM CALLS)
        schema_file.write_text(json.dumps(schema, indent=2))
        print("Schema generated and cached")

    # Step 2: Use schema for fast extraction (NO LLM CALLS)
    strategy = JsonCssExtractionStrategy(schema, verbose=True)

    config = CrawlerRunConfig(
        extraction_strategy=strategy,
        cache_mode=CacheMode.BYPASS
    )

    # Step 3: Extract from multiple pages using the same schema
    urls = [
        "https://example.com/products",
        "https://example.com/electronics",
        "https://example.com/books"
    ]

    async with AsyncWebCrawler() as crawler:
        for url in urls:
            result = await crawler.arun(url=url, config=config)

            if result.success:
                data = json.loads(result.extracted_content)
                print(f"{url}: Extracted {len(data)} items")
            else:
                print(f"{url}: Failed - {result.error_message}")

asyncio.run(generate_and_use_schema())
```

### XPath Extraction Strategy

```python
from crawl4ai import LLMConfig
from crawl4ai.extraction_strategy import JsonXPathExtractionStrategy

# XPath-based schema (alternative to CSS)
xpath_schema = {
    "name": "News Articles",
    "baseSelector": "//article[@class='news-item']",
    "baseFields": [
        {
            "name": "article_id",
            "type": "attribute",
            "attribute": "data-id"
        }
    ],
    "fields": [
        {
            "name": "headline",
            "selector": ".//h2[@class='headline']",
            "type": "text"
        },
        {
            "name": "author",
            "selector": ".//span[@class='author']/text()",
            "type": "text"
        },
        {
            "name": "publish_date",
            "selector": ".//time/@datetime",
            "type": "text"
        },
        {
            "name": "content",
            "selector": ".//div[@class='article-body']",
            "type": "html"
        },
        {
            "name": "tags",
            "selector": ".//div[@class='tags']/span[@class='tag']",
            "type": "list",
            "fields": [
                {"name": "tag", "type": "text"}
            ]
        }
    ]
}

# Generate an XPath schema automatically
async def generate_xpath_schema():
    llm_config = LLMConfig(provider="ollama/llama3.3", api_token=None)

    sample_html = """
    <article class="news-item" data-id="123">
        <h2 class="headline">Breaking News</h2>
        <span class="author">John Doe</span>
        <time datetime="2024-01-01">Today</time>
        <div class="article-body"><p>Content here...</p></div>
    </article>
    """

    schema = JsonXPathExtractionStrategy.generate_schema(
        html=sample_html,
        schema_type="xpath",
        llm_config=llm_config
    )

    return schema

# Use the XPath strategy
xpath_strategy = JsonXPathExtractionStrategy(xpath_schema, verbose=True)
```

### Regex Extraction Strategy - Pattern-Based Fast Extraction

```python
import asyncio
import json
from pathlib import Path
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, LLMConfig
from crawl4ai.extraction_strategy import RegexExtractionStrategy

# Built-in patterns for common data types
async def extract_with_builtin_patterns():
    # Use multiple built-in patterns
    strategy = RegexExtractionStrategy(
        pattern=(
            RegexExtractionStrategy.Email |
            RegexExtractionStrategy.PhoneUS |
            RegexExtractionStrategy.Url |
            RegexExtractionStrategy.Currency |
            RegexExtractionStrategy.DateIso
        )
    )

    config = CrawlerRunConfig(extraction_strategy=strategy)

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com/contact",
            config=config
        )

        if result.success:
            matches = json.loads(result.extracted_content)

            # Group by pattern type
            by_type = {}
            for match in matches:
                label = match['label']
                if label not in by_type:
                    by_type[label] = []
                by_type[label].append(match['value'])

            for pattern_type, values in by_type.items():
                print(f"{pattern_type}: {len(values)} matches")
                for value in values[:3]:  # Show first 3
                    print(f"  {value}")

# Custom regex patterns
custom_patterns = {
    "product_code": r"SKU-\d{4,6}",
    "discount": r"\d{1,2}%\s*off",
    "model_number": r"Model:\s*([A-Z0-9-]+)"
}

async def extract_with_custom_patterns():
    strategy = RegexExtractionStrategy(custom=custom_patterns)

    config = CrawlerRunConfig(extraction_strategy=strategy)

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com/products",
            config=config
        )

        if result.success:
            data = json.loads(result.extracted_content)
            for item in data:
                print(f"{item['label']}: {item['value']}")

# LLM-generated patterns (one-time cost)
async def generate_custom_patterns():
    cache_file = Path("./patterns/price_patterns.json")

    if cache_file.exists():
        patterns = json.loads(cache_file.read_text())
    else:
        llm_config = LLMConfig(
            provider="openai/gpt-4o-mini",
            api_token="env:OPENAI_API_KEY"
        )

        # Get sample content
        async with AsyncWebCrawler() as crawler:
            result = await crawler.arun("https://example.com/pricing")
            sample_html = result.cleaned_html

        # Generate optimized patterns
        patterns = RegexExtractionStrategy.generate_pattern(
            label="pricing_info",
            html=sample_html,
            query="Extract all pricing information including discounts and special offers",
            llm_config=llm_config
        )

        # Cache for reuse
        cache_file.parent.mkdir(parents=True, exist_ok=True)
        cache_file.write_text(json.dumps(patterns, indent=2))

    # Use cached patterns (no more LLM calls)
    strategy = RegexExtractionStrategy(custom=patterns)
    return strategy

asyncio.run(extract_with_builtin_patterns())
asyncio.run(extract_with_custom_patterns())
```

### Complete Extraction Workflow - Combining Strategies

```python
import asyncio
import json
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, LLMConfig
from crawl4ai.extraction_strategy import (
    JsonCssExtractionStrategy,
    RegexExtractionStrategy,
    LLMExtractionStrategy
)

async def multi_strategy_extraction():
    """
    Demonstrate using multiple extraction strategies in sequence:
    1. Fast regex for common patterns
    2. Schema-based for structured data
    3. LLM for complex reasoning
    """

    browser_config = BrowserConfig(headless=True)

    # Strategy 1: Fast regex extraction
    regex_strategy = RegexExtractionStrategy(
        pattern=RegexExtractionStrategy.Email | RegexExtractionStrategy.PhoneUS
    )

    # Strategy 2: Schema-based structured extraction
    product_schema = {
        "name": "Products",
        "baseSelector": "div.product",
        "fields": [
            {"name": "name", "selector": "h3", "type": "text"},
            {"name": "price", "selector": ".price", "type": "text"},
            {"name": "rating", "selector": ".rating", "type": "attribute", "attribute": "data-rating"}
        ]
    }
    css_strategy = JsonCssExtractionStrategy(product_schema)

    # Strategy 3: LLM for complex analysis
    llm_strategy = LLMExtractionStrategy(
        llm_config=LLMConfig(provider="openai/gpt-4o-mini", api_token="env:OPENAI_API_KEY"),
        schema={
            "type": "object",
            "properties": {
                "sentiment": {"type": "string"},
                "key_topics": {"type": "array", "items": {"type": "string"}},
                "summary": {"type": "string"}
            }
        },
        extraction_type="schema",
        instruction="Analyze the content sentiment, extract key topics, and provide a summary"
    )

    url = "https://example.com/product-reviews"

    async with AsyncWebCrawler(config=browser_config) as crawler:
        # Extract contact info with regex
        regex_config = CrawlerRunConfig(extraction_strategy=regex_strategy)
        regex_result = await crawler.arun(url=url, config=regex_config)

        # Extract structured product data
        css_config = CrawlerRunConfig(extraction_strategy=css_strategy)
        css_result = await crawler.arun(url=url, config=css_config)

        # Extract insights with LLM
        llm_run_config = CrawlerRunConfig(extraction_strategy=llm_strategy)
        llm_result = await crawler.arun(url=url, config=llm_run_config)

        # Combine results
        results = {
            "contacts": json.loads(regex_result.extracted_content) if regex_result.success else [],
            "products": json.loads(css_result.extracted_content) if css_result.success else [],
            "analysis": json.loads(llm_result.extracted_content) if llm_result.success else {}
        }

        print(f"Found {len(results['contacts'])} contact entries")
        print(f"Found {len(results['products'])} products")
        print(f"Sentiment: {results['analysis'].get('sentiment', 'N/A')}")

        return results

# Performance comparison
async def compare_extraction_performance():
    """Compare speed and accuracy of different strategies"""
    import time

    url = "https://example.com/large-catalog"

    strategies = {
        "regex": RegexExtractionStrategy(pattern=RegexExtractionStrategy.Currency),
        "css": JsonCssExtractionStrategy({
            "name": "Prices",
            "baseSelector": ".price",
            "fields": [{"name": "amount", "selector": "span", "type": "text"}]
        }),
        "llm": LLMExtractionStrategy(
            llm_config=LLMConfig(provider="openai/gpt-4o-mini", api_token="env:OPENAI_API_KEY"),
            instruction="Extract all prices from the content",
            extraction_type="block"
        )
    }

    async with AsyncWebCrawler() as crawler:
        for name, strategy in strategies.items():
            start_time = time.time()

            config = CrawlerRunConfig(extraction_strategy=strategy)
            result = await crawler.arun(url=url, config=config)

            duration = time.time() - start_time

            if result.success:
                data = json.loads(result.extracted_content)
                print(f"{name}: {len(data)} items in {duration:.2f}s")
            else:
                print(f"{name}: Failed in {duration:.2f}s")

asyncio.run(multi_strategy_extraction())
asyncio.run(compare_extraction_performance())
```

### Best Practices and Strategy Selection

```python
# Strategy selection guide
def choose_extraction_strategy(use_case):
    """
    Guide for selecting the right extraction strategy
    """

    strategies = {
        # Fast pattern matching for common data types
        "contact_info": RegexExtractionStrategy(
            pattern=RegexExtractionStrategy.Email | RegexExtractionStrategy.PhoneUS
        ),

        # Structured data from consistent HTML
        "product_catalogs": JsonCssExtractionStrategy,

        # Complex reasoning and semantic understanding
        "content_analysis": LLMExtractionStrategy,

        # Mixed approach for comprehensive extraction
        "complete_site_analysis": "multi_strategy"
    }

    recommendations = {
        "speed_priority": "Use RegexExtractionStrategy for simple patterns, JsonCssExtractionStrategy for structured data",
        "accuracy_priority": "Use LLMExtractionStrategy for complex content, JsonCssExtractionStrategy for predictable structure",
        "cost_priority": "Avoid LLM strategies; use schema generation once, then JsonCssExtractionStrategy",
        "scale_priority": "Cache schemas, use regex for simple patterns, avoid LLM for high-volume extraction"
    }

    return recommendations.get(use_case, "Combine strategies based on content complexity")

# Error handling and validation
async def robust_extraction():
    strategies = [
        RegexExtractionStrategy(pattern=RegexExtractionStrategy.Email),
        JsonCssExtractionStrategy(simple_schema),
        # LLM as fallback for complex cases
    ]

    async with AsyncWebCrawler() as crawler:
        for strategy in strategies:
            try:
                config = CrawlerRunConfig(extraction_strategy=strategy)
                result = await crawler.arun(url="https://example.com", config=config)

                if result.success and result.extracted_content:
                    data = json.loads(result.extracted_content)
                    if data:  # Validate non-empty results
                        print(f"Success with {strategy.__class__.__name__}")
                        return data

            except Exception as e:
                print(f"Strategy {strategy.__class__.__name__} failed: {e}")
                continue

    print("All strategies failed")
    return None
```

**📖 Learn more:** [LLM Strategies Deep Dive](https://docs.crawl4ai.com/extraction/llm-strategies/), [Schema-Based Extraction](https://docs.crawl4ai.com/extraction/no-llm-strategies/), [Regex Patterns](https://docs.crawl4ai.com/extraction/no-llm-strategies/#regexextractionstrategy), [Performance Optimization](https://docs.crawl4ai.com/advanced/multi-url-crawling/)