feat: Major Chrome Extension overhaul with Click2Crawl, instant Schema extraction, and modular architecture

New Features:
- Click2Crawl: Visual element selection with markdown conversion
  - Ctrl/Cmd+Click to select multiple elements
  - Visual text mode for WYSIWYG extraction
  - Real-time markdown preview with syntax highlighting
  - Export to .md file or clipboard
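
The Ctrl/Cmd+Click toggle described above can be sketched without a DOM. This is a minimal illustration, not the extension's code: `toggleSelection` is a hypothetical helper, and the plain objects stand in for page elements and mouse events.

```javascript
// Hypothetical helper: toggles `element` in `selected` only when Ctrl or Cmd is held.
// `event` needs just the `ctrlKey` and `metaKey` booleans of a DOM MouseEvent.
function toggleSelection(selected, element, event) {
  if (!event.ctrlKey && !event.metaKey) return false; // plain click: leave selection alone
  if (selected.has(element)) {
    selected.delete(element); // second Ctrl/Cmd+Click deselects
  } else {
    selected.add(element);
  }
  return true;
}

const selected = new Set();
const heading = { tag: 'h1' };   // stand-ins for DOM elements
const paragraph = { tag: 'p' };

toggleSelection(selected, heading, { ctrlKey: true, metaKey: false });   // select
toggleSelection(selected, paragraph, { ctrlKey: false, metaKey: true }); // select with Cmd
toggleSelection(selected, heading, { ctrlKey: true, metaKey: false });   // deselect
toggleSelection(selected, heading, { ctrlKey: false, metaKey: false });  // ignored
console.log(selected.size); // 1
```

A `Set` keeps membership checks cheap and remembers insertion order, which is handy when the selection later has to be emitted in the order it was made.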

- Schema Builder Enhancement: Instant data extraction without LLMs
  - Test schemas directly in browser
  - See JSON results immediately
  - Export data or Python code
  - Cloud deployment (coming soon)
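
As a rough sketch of what LLM-free extraction can look like: a schema of CSS selectors is applied directly to the page. The schema shape used here (`baseSelector` plus a `fields` list) and the name `extractWithSchema` are assumptions for illustration, not necessarily what the Schema Builder emits.

```javascript
// Sketch only: the schema shape and function name are assumptions for illustration.
// `root` is anything exposing querySelectorAll (for example `document` in a page).
function extractWithSchema(root, schema) {
  const rows = [];
  for (const item of root.querySelectorAll(schema.baseSelector)) {
    const row = {};
    for (const field of schema.fields) {
      const node = item.querySelector(field.selector);
      if (!node) {
        row[field.name] = null; // selector matched nothing inside this item
        continue;
      }
      row[field.name] = field.type === 'attribute'
        ? node.getAttribute(field.attribute)
        : node.textContent.trim();
    }
    rows.push(row);
  }
  return rows;
}

// Hypothetical schema: one row per `.product`, with a text field and an attribute field.
const exampleSchema = {
  baseSelector: '.product',
  fields: [
    { name: 'title', selector: 'h2', type: 'text' },
    { name: 'url', selector: 'a', type: 'attribute', attribute: 'href' }
  ]
};
```

In a page context, `extractWithSchema(document, exampleSchema)` would return an array of plain objects that can be serialized with `JSON.stringify` right away, which is what makes the result visible without a model call.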

- Modular Architecture:
  - Separated into schemaBuilder.js, scriptBuilder.js, click2CrawlBuilder.js
  - Added contentAnalyzer.js and markdownConverter.js modules
  - Shared utilities and CSS reset system
  - Integrated marked.js for markdown rendering

🎨 UI/UX Improvements:
- Added edgy cloud announcement banner with seamless shimmer animation
- Direct, technical copy: "You don't need Puppeteer. You need Crawl4AI Cloud."
- Enhanced feature cards with emojis
- Fixed CSS conflicts with targeted reset approach
- Improved badge hover effects (red on hover)
- Added wrap toggle for code preview

📚 Documentation Updates:
- Split extraction diagrams into LLM and no-LLM versions
- Updated llms-full.txt with latest content
- Added versioned LLM context (v0.1.1)

🔧 Technical Enhancements:
- Refactored 3464 lines of monolithic content.js into modules
- Added proper event handling and cleanup
- Improved z-index management
- Better scroll position tracking for badges
- Enhanced error handling throughout
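
One way to keep event handling and cleanup symmetrical is to record every listener as it is attached and remove them all in one pass. The sketch below is an assumption about how that could be structured; `ListenerRegistry` is a hypothetical helper and does not appear in the extension.

```javascript
// Hypothetical helper, not part of the extension: tracks listeners so that
// deactivation can remove every one of them with matching arguments.
class ListenerRegistry {
  constructor() {
    this.entries = [];
  }

  add(target, type, handler, options) {
    target.addEventListener(type, handler, options);
    this.entries.push({ target, type, handler, options });
  }

  removeAll() {
    // removeEventListener must receive the same handler and capture flag used when adding.
    for (const { target, type, handler, options } of this.entries) {
      target.removeEventListener(type, handler, options);
    }
    this.entries = [];
  }
}

const target = new EventTarget();
const registry = new ListenerRegistry();
let calls = 0;

registry.add(target, 'click', () => { calls += 1; }, { capture: true });
target.dispatchEvent(new Event('click')); // handler runs
registry.removeAll();
target.dispatchEvent(new Event('click')); // handler is gone
console.log(calls); // 1
```

Storing the options next to the handler avoids the common leak where a capture-phase listener is removed without its capture flag and silently stays attached.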

This release turns the Chrome Extension from a single-purpose tool into a
visual data extraction suite: select elements, preview the result, and export it without writing code.
UncleCode
2025-06-09 23:18:27 +08:00
parent 40640badad
commit 0ac12da9f3
25 changed files with 23686 additions and 6524 deletions


@@ -626,6 +626,16 @@ code {
background: var(--primary-pink);
}
.tool-status.new {
background: var(--primary-green);
animation: pulse 2s ease-in-out infinite;
}
@keyframes pulse {
0%, 100% { opacity: 1; }
50% { opacity: 0.8; }
}
/* Tool Details Panel */
.tool-details {
background: var(--bg-secondary);
@@ -1026,4 +1036,516 @@ code {
.coming-soon-section h2 {
font-size: 1.5rem;
}
}
}
/* Code Examples Grid Layout */
.code-example > div[style*="grid"] {
min-height: 500px;
}
.code-example > div[style*="grid"] .terminal-window {
height: 100%;
display: flex;
flex-direction: column;
}
.code-example > div[style*="grid"] .terminal-content {
flex: 1;
overflow: auto;
max-height: 450px;
}
@media (max-width: 1200px) {
.code-example > div[style*="grid"] {
grid-template-columns: 1fr !important;
gap: 12px !important;
}
}
/* Cloud Banner Section (Thin Version) */
.cloud-banner-section {
margin: 2rem 0 3rem 0;
}
.cloud-banner {
background: linear-gradient(135deg, rgba(15, 187, 170, 0.05) 0%, rgba(243, 128, 245, 0.05) 100%);
border: 1px solid rgba(15, 187, 170, 0.3);
border-radius: 12px;
padding: 1.5rem 2rem;
position: relative;
overflow: hidden;
}
.cloud-banner::before {
content: "";
position: absolute;
top: 0;
left: 0;
width: 200%;
height: 100%;
background: linear-gradient(90deg,
transparent 0%,
rgba(15, 187, 170, 0.1) 25%,
transparent 50%,
rgba(15, 187, 170, 0.1) 75%,
transparent 100%
);
animation: cloud-shimmer 4s linear infinite;
}
@keyframes cloud-shimmer {
0% { transform: translateX(0); }
100% { transform: translateX(-50%); }
}
.cloud-banner-content {
display: flex;
align-items: center;
justify-content: space-between;
gap: 2rem;
position: relative;
z-index: 1;
}
.cloud-banner-text {
flex: 1;
text-align: left;
}
.cloud-banner-text h3 {
margin: 0;
font-size: 1.25rem;
color: var(--text-primary);
font-weight: 600;
letter-spacing: -0.02em;
}
.cloud-banner-text p {
margin: 0.25rem 0 0;
font-size: 0.875rem;
color: var(--text-secondary);
}
.cloud-banner-btn {
background: var(--primary-green);
color: var(--bg-dark);
border: none;
padding: 0.75rem 1.5rem;
font-size: 0.875rem;
font-weight: 600;
border-radius: 25px;
cursor: pointer;
transition: all 0.3s ease;
font-family: var(--font-primary);
white-space: nowrap;
flex-shrink: 0;
}
.cloud-banner-btn:hover {
background: #1fcbba;
transform: translateY(-2px);
box-shadow: 0 6px 20px rgba(15, 187, 170, 0.3);
}
@media (max-width: 768px) {
.cloud-banner-content {
flex-direction: column;
text-align: center;
gap: 1rem;
}
.cloud-banner-text {
text-align: center;
}
.cloud-banner-icon {
font-size: 2rem;
}
.cloud-banner-text h3 {
font-size: 1.25rem;
}
}
/* Crawl4AI Cloud Section */
.cloud-section {
margin: 5rem 0;
}
.cloud-announcement {
background: linear-gradient(135deg, #1a1a1a 0%, #2a2a2a 100%);
border: 2px solid var(--primary-green);
border-radius: 20px;
padding: 4rem 3rem;
position: relative;
overflow: hidden;
box-shadow: 0 20px 60px rgba(15, 187, 170, 0.2);
text-align: center;
}
.cloud-announcement::before {
content: "";
position: absolute;
top: -50%;
left: -50%;
width: 200%;
height: 450%;
background: radial-gradient(circle, rgba(15, 187, 170, 0.1) 0%, transparent 70%);
animation: rotate 20s linear infinite;
}
@keyframes rotate {
from { transform: rotate(0deg); }
to { transform: rotate(360deg); }
}
@keyframes float {
0%, 100% { transform: translateY(0); }
50% { transform: translateY(-10px); }
}
.cloud-announcement h2 {
font-size: 2.5rem;
margin: 0 0 0.5rem 0;
color: var(--text-primary);
font-weight: 700;
letter-spacing: -0.03em;
position: relative;
z-index: 1;
}
.cloud-tagline {
font-size: 1.25rem;
color: var(--text-secondary);
margin: 0.5rem 0 2rem;
position: relative;
z-index: 1;
}
.cloud-features-preview {
display: flex;
justify-content: center;
gap: 2rem;
margin: 2rem 0 3rem;
flex-wrap: wrap;
position: relative;
z-index: 1;
}
.cloud-feature-item {
font-size: 0.875rem;
color: var(--text-secondary);
font-family: var(--font-code);
padding: 0.5rem 1rem;
background: var(--bg-secondary);
border: 1px solid var(--border-color);
border-radius: 6px;
}
.cloud-cta-button {
background: var(--primary-green);
color: var(--bg-dark);
border: none;
padding: 0.875rem 2rem;
font-size: 1rem;
font-weight: 600;
border-radius: 6px;
cursor: pointer;
transition: all 0.2s ease;
position: relative;
z-index: 1;
font-family: var(--font-primary);
text-transform: none;
letter-spacing: -0.01em;
}
.cloud-cta-button:hover {
transform: translateY(-2px);
box-shadow: 0 10px 30px rgba(15, 187, 170, 0.4);
background: #1fcbba;
}
.cloud-hint {
margin-top: 1.5rem;
font-size: 0.875rem;
color: var(--text-secondary);
position: relative;
z-index: 1;
font-style: italic;
}
/* Signup Overlay */
.signup-overlay {
position: fixed;
top: 0;
left: 0;
right: 0;
bottom: 0;
background: rgba(0, 0, 0, 0.9);
backdrop-filter: blur(10px);
z-index: 10000;
display: none;
align-items: center;
justify-content: center;
padding: 2rem;
}
.signup-overlay.active {
display: flex;
}
.signup-container {
background: var(--bg-secondary);
border: 2px solid var(--primary-green);
border-radius: 16px;
max-width: 600px;
width: 100%;
max-height: 90vh;
overflow: auto;
position: relative;
box-shadow: 0 20px 60px rgba(15, 187, 170, 0.3);
}
.close-signup {
position: absolute;
top: 1rem;
right: 1rem;
background: var(--bg-tertiary);
border: none;
color: var(--text-secondary);
width: 40px;
height: 40px;
border-radius: 50%;
font-size: 24px;
cursor: pointer;
transition: all 0.2s ease;
z-index: 10;
}
.close-signup:hover {
background: var(--primary-pink);
color: var(--bg-dark);
transform: rotate(90deg);
}
.signup-content {
padding: 3rem;
}
.signup-content h3 {
font-size: 1.75rem;
margin: 0 0 0.5rem;
color: var(--text-primary);
}
.signup-content p {
color: var(--text-secondary);
margin-bottom: 2rem;
}
.waitlist-form {
display: flex;
flex-direction: column;
gap: 1.5rem;
}
.form-field {
display: flex;
flex-direction: column;
gap: 0.5rem;
}
.form-field label {
font-size: 0.875rem;
color: var(--text-secondary);
text-transform: uppercase;
font-weight: 600;
}
.form-field input,
.form-field select {
background: var(--bg-tertiary);
border: 1px solid var(--border-color);
color: var(--text-primary);
padding: 0.75rem 1rem;
border-radius: 8px;
font-size: 1rem;
font-family: var(--font-primary);
transition: all 0.2s ease;
}
.form-field input:focus,
.form-field select:focus {
outline: none;
border-color: var(--primary-green);
box-shadow: 0 0 0 3px rgba(15, 187, 170, 0.2);
}
.submit-button {
background: var(--primary-green);
color: var(--bg-dark);
border: none;
padding: 1rem 2rem;
font-size: 1.125rem;
font-weight: 600;
border-radius: 8px;
cursor: pointer;
transition: all 0.2s ease;
font-family: var(--font-primary);
display: flex;
align-items: center;
justify-content: center;
gap: 0.5rem;
margin-top: 1rem;
}
.submit-button:hover {
background: #1fcbba;
transform: translateY(-2px);
box-shadow: 0 8px 24px rgba(15, 187, 170, 0.3);
}
/* Crawling Animation */
.crawl-animation {
padding: 3rem;
text-align: left;
}
.crawl-terminal {
margin-bottom: 2rem;
}
.crawl-terminal .terminal-content {
max-height: 400px;
overflow-y: auto;
}
.crawl-terminal code {
white-space: pre;
display: block;
line-height: 1.6;
}
.crawl-log {
color: var(--text-primary);
font-family: var(--font-code);
}
.crawl-log .log-init { color: #0fbbaa; }
.crawl-log .log-fetch { color: #4169e1; }
.crawl-log .log-scrape { color: #f380f5; }
.crawl-log .log-extract { color: #ffbd2e; }
.crawl-log .log-complete { color: #0fbbaa; }
.crawl-log .log-success { color: #0fbbaa; }
.crawl-log .log-time { color: #666; }
.extracted-preview {
background: var(--bg-tertiary);
border-radius: 12px;
padding: 1.5rem;
margin-bottom: 2rem;
}
.extracted-preview h4 {
margin: 0 0 1rem;
color: var(--primary-green);
font-size: 1.25rem;
}
.json-preview {
background: var(--bg-dark);
border: 1px solid var(--border-color);
border-radius: 8px;
padding: 1rem;
overflow-x: auto;
max-height: 300px;
}
.json-preview code {
color: var(--text-primary);
font-size: 0.875rem;
}
.success-message {
text-align: center;
padding: 2rem;
}
.continue-button {
background: var(--primary-green);
color: var(--bg-dark);
border: none;
padding: 1rem 2rem;
font-size: 1.125rem;
font-weight: 600;
border-radius: 8px;
cursor: pointer;
transition: all 0.2s ease;
font-family: var(--font-primary);
margin-top: 2rem;
}
.continue-button:hover {
background: #1fcbba;
transform: translateY(-2px);
box-shadow: 0 8px 24px rgba(15, 187, 170, 0.3);
}
.success-icon {
font-size: 4rem;
margin-bottom: 1rem;
animation: bounce 0.5s ease;
}
@keyframes bounce {
0%, 100% { transform: translateY(0); }
50% { transform: translateY(-20px); }
}
.success-message h3 {
font-size: 2rem;
margin: 0 0 1rem;
color: var(--primary-green);
}
.success-message ul {
list-style: none;
margin: 1.5rem 0;
padding: 0;
text-align: left;
max-width: 400px;
margin-left: auto;
margin-right: auto;
}
.success-message li {
padding: 0.5rem 0;
color: var(--text-primary);
font-size: 1.125rem;
}
.success-note {
color: var(--text-secondary);
font-size: 1rem;
margin-top: 2rem;
padding: 1rem;
background: var(--bg-tertiary);
border-radius: 8px;
}
@media (max-width: 768px) {
.cloud-announcement h2 {
font-size: 2rem;
}
.cloud-features-preview {
flex-direction: column;
gap: 1rem;
}
.signup-content {
padding: 2rem;
}
}


@@ -0,0 +1,732 @@
class Click2CrawlBuilder {
constructor() {
this.selectedElements = new Set();
this.highlightBoxes = new Map();
this.selectionMode = false;
this.toolbar = null;
this.previewPanel = null;
this.selectionCounter = 0;
this.markdownConverter = null;
this.contentAnalyzer = null;
// Configuration options
this.options = {
includeImages: true,
preserveTables: true,
keepCodeFormatting: true,
simplifyLayout: false,
preserveLinks: true,
addSeparators: true,
includeXPath: false,
textOnly: false
};
this.init();
}
async init() {
// Initialize dependencies
this.markdownConverter = new MarkdownConverter();
this.contentAnalyzer = new ContentAnalyzer();
this.createToolbar();
this.setupEventListeners();
}
createToolbar() {
// Create floating toolbar
this.toolbar = document.createElement('div');
this.toolbar.className = 'c4ai-c2c-toolbar';
this.toolbar.innerHTML = `
<div class="c4ai-toolbar-header">
<div class="c4ai-toolbar-dots">
<span class="c4ai-dot c4ai-dot-red"></span>
<span class="c4ai-dot c4ai-dot-yellow"></span>
<span class="c4ai-dot c4ai-dot-green"></span>
</div>
<span class="c4ai-toolbar-title">Click2Crawl</span>
<button class="c4ai-close-btn" title="Close">×</button>
</div>
<div class="c4ai-toolbar-content">
<div class="c4ai-selection-info">
<span class="c4ai-selection-count">0 elements selected</span>
<button class="c4ai-clear-btn" title="Clear selection" disabled>Clear</button>
</div>
<div class="c4ai-toolbar-actions">
<button class="c4ai-preview-btn" disabled>Preview Markdown</button>
<button class="c4ai-copy-btn" disabled>Copy to Clipboard</button>
</div>
<div class="c4ai-toolbar-instructions">
<p>💡 <strong>Ctrl/Cmd + Click</strong> to select multiple elements</p>
<p>📝 Selected elements will be converted to clean markdown</p>
<p>⌨️ Press <strong>ESC</strong> to exit</p>
</div>
</div>
`;
document.body.appendChild(this.toolbar);
makeDraggableByHeader(this.toolbar);
// Position toolbar
this.toolbar.style.position = 'fixed';
this.toolbar.style.top = '20px';
this.toolbar.style.right = '20px';
this.toolbar.style.zIndex = '999999';
}
setupEventListeners() {
// Close button
this.toolbar.querySelector('.c4ai-close-btn').addEventListener('click', () => {
this.deactivate();
});
// Clear selection button
this.toolbar.querySelector('.c4ai-clear-btn').addEventListener('click', () => {
this.clearSelection();
});
// Preview button
this.toolbar.querySelector('.c4ai-preview-btn').addEventListener('click', () => {
this.showPreview();
});
// Copy button
this.toolbar.querySelector('.c4ai-copy-btn').addEventListener('click', () => {
this.copyToClipboard();
});
// Document click handler for element selection
this.documentClickHandler = (event) => this.handleElementClick(event);
document.addEventListener('click', this.documentClickHandler, true);
// Prevent default link behavior during selection mode
this.linkClickHandler = (event) => {
if (event.ctrlKey || event.metaKey) {
event.preventDefault();
event.stopPropagation();
}
};
document.addEventListener('click', this.linkClickHandler, true);
// Hover effect
this.documentHoverHandler = (event) => this.handleElementHover(event);
document.addEventListener('mouseover', this.documentHoverHandler, true);
// Remove hover on mouseout
this.documentMouseOutHandler = (event) => this.handleElementMouseOut(event);
document.addEventListener('mouseout', this.documentMouseOutHandler, true);
// Keyboard shortcuts
this.keyboardHandler = (event) => this.handleKeyboard(event);
document.addEventListener('keydown', this.keyboardHandler);
}
handleElementClick(event) {
// Check if Ctrl/Cmd is pressed
if (!event.ctrlKey && !event.metaKey) return;
// Prevent default behavior
event.preventDefault();
event.stopPropagation();
const element = event.target;
// Don't select our own UI elements (toolbar, preview panel, selection badges)
if (element.closest('.c4ai-c2c-toolbar') ||
element.closest('.c4ai-c2c-preview') ||
element.closest('.c4ai-highlight-box') ||
element.hasAttribute('data-c4ai-badge')) {
return;
}
// Toggle element selection
if (this.selectedElements.has(element)) {
this.deselectElement(element);
} else {
this.selectElement(element);
}
this.updateUI();
}
handleElementHover(event) {
const element = event.target;
// Don't hover our own UI elements
if (element.closest('.c4ai-c2c-toolbar') ||
element.closest('.c4ai-c2c-preview') ||
element.closest('.c4ai-highlight-box') ||
element.hasAttribute('data-c4ai-badge')) {
return;
}
// Add hover class
element.classList.add('c4ai-hover-candidate');
}
handleElementMouseOut(event) {
const element = event.target;
element.classList.remove('c4ai-hover-candidate');
}
handleKeyboard(event) {
// ESC to deactivate
if (event.key === 'Escape') {
this.deactivate();
}
// Ctrl/Cmd + A to select all visible elements
else if ((event.ctrlKey || event.metaKey) && event.key === 'a') {
event.preventDefault();
// Select all visible text-containing elements
const elements = document.querySelectorAll('p, h1, h2, h3, h4, h5, h6, li, td, th, div, span, article, section');
elements.forEach(el => {
if (el.textContent.trim() && this.isVisible(el) && !this.selectedElements.has(el)) {
this.selectElement(el);
}
});
this.updateUI();
}
}
isVisible(element) {
const rect = element.getBoundingClientRect();
const style = window.getComputedStyle(element);
return rect.width > 0 &&
rect.height > 0 &&
style.display !== 'none' &&
style.visibility !== 'hidden' &&
style.opacity !== '0';
}
selectElement(element) {
this.selectedElements.add(element);
// Create highlight box
const box = this.createHighlightBox(element);
this.highlightBoxes.set(element, box);
// Add selected class
element.classList.add('c4ai-selected');
this.selectionCounter++;
}
deselectElement(element) {
this.selectedElements.delete(element);
// Remove highlight box (badge)
const badge = this.highlightBoxes.get(element);
if (badge) {
// Remove scroll/resize listeners
if (badge._updatePosition) {
window.removeEventListener('scroll', badge._updatePosition, true);
window.removeEventListener('resize', badge._updatePosition);
}
badge.remove();
this.highlightBoxes.delete(element);
}
// Remove outline
element.style.outline = '';
element.style.outlineOffset = '';
// Remove attributes
element.removeAttribute('data-c4ai-selection-order');
element.classList.remove('c4ai-selected');
// Keep the counter monotonic: decrementing here would let a later selection reuse an existing order number
}
createHighlightBox(element) {
// Add a data attribute to track selection order
element.setAttribute('data-c4ai-selection-order', this.selectionCounter + 1);
// Add selection outline directly to the element
element.style.outline = '2px solid #0fbbaa';
element.style.outlineOffset = '2px';
// Create badge with fixed positioning
const badge = document.createElement('div');
badge.className = 'c4ai-selection-badge-fixed';
badge.textContent = this.selectionCounter + 1;
badge.setAttribute('data-c4ai-badge', 'true');
badge.title = 'Click to deselect';
// Get element position and set badge position
const rect = element.getBoundingClientRect();
badge.style.cssText = `
position: fixed !important;
top: ${rect.top - 12}px !important;
left: ${rect.left - 12}px !important;
width: 24px !important;
height: 24px !important;
background: #0fbbaa !important;
color: #070708 !important;
border-radius: 50% !important;
display: flex !important;
align-items: center !important;
justify-content: center !important;
font-size: 12px !important;
font-weight: bold !important;
font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", Roboto, sans-serif !important;
box-shadow: 0 2px 8px rgba(0, 0, 0, 0.3) !important;
z-index: 999998 !important;
cursor: pointer !important;
transition: all 0.2s ease !important;
pointer-events: auto !important;
border: none !important;
padding: 0 !important;
margin: 0 !important;
line-height: 1 !important;
text-align: center !important;
text-decoration: none !important;
box-sizing: border-box !important;
`;
// Add hover styles dynamically
badge.addEventListener('mouseenter', () => {
badge.style.setProperty('background', '#ff3c74', 'important');
badge.style.setProperty('transform', 'scale(1.1)', 'important');
});
badge.addEventListener('mouseleave', () => {
badge.style.setProperty('background', '#0fbbaa', 'important');
badge.style.setProperty('transform', 'scale(1)', 'important');
});
// Add click handler to badge for deselection
badge.addEventListener('click', (e) => {
e.stopPropagation();
e.preventDefault();
this.deselectElement(element);
this.updateUI();
});
// Add scroll listener to update position
const updatePosition = () => {
const newRect = element.getBoundingClientRect();
badge.style.top = `${newRect.top - 12}px`;
badge.style.left = `${newRect.left - 12}px`;
};
// Store the update function so we can remove it later
badge._updatePosition = updatePosition;
window.addEventListener('scroll', updatePosition, true);
window.addEventListener('resize', updatePosition);
document.body.appendChild(badge);
return badge;
}
clearSelection() {
// Clear all selections
this.selectedElements.forEach(element => {
// Remove badge
const badge = this.highlightBoxes.get(element);
if (badge) {
// Remove scroll/resize listeners
if (badge._updatePosition) {
window.removeEventListener('scroll', badge._updatePosition, true);
window.removeEventListener('resize', badge._updatePosition);
}
badge.remove();
}
// Remove outline
element.style.outline = '';
element.style.outlineOffset = '';
// Remove attributes
element.removeAttribute('data-c4ai-selection-order');
element.classList.remove('c4ai-selected');
});
this.selectedElements.clear();
this.highlightBoxes.clear();
this.selectionCounter = 0;
this.updateUI();
}
updateUI() {
const count = this.selectedElements.size;
// Update selection count
this.toolbar.querySelector('.c4ai-selection-count').textContent =
`${count} element${count !== 1 ? 's' : ''} selected`;
// Enable/disable buttons
const hasSelection = count > 0;
this.toolbar.querySelector('.c4ai-preview-btn').disabled = !hasSelection;
this.toolbar.querySelector('.c4ai-copy-btn').disabled = !hasSelection;
this.toolbar.querySelector('.c4ai-clear-btn').disabled = !hasSelection;
}
async showPreview() {
// Generate markdown from selected elements
const markdown = await this.generateMarkdown();
// Create or update preview panel
if (!this.previewPanel) {
this.createPreviewPanel();
}
await this.updatePreviewContent(markdown);
this.previewPanel.style.display = 'block';
}
createPreviewPanel() {
this.previewPanel = document.createElement('div');
this.previewPanel.className = 'c4ai-c2c-preview';
this.previewPanel.innerHTML = `
<div class="c4ai-preview-header">
<div class="c4ai-toolbar-dots">
<span class="c4ai-dot c4ai-dot-red"></span>
<span class="c4ai-dot c4ai-dot-yellow"></span>
<span class="c4ai-dot c4ai-dot-green"></span>
</div>
<span class="c4ai-preview-title">Markdown Preview</span>
<button class="c4ai-preview-close">×</button>
</div>
<div class="c4ai-preview-options">
<label><input type="checkbox" name="textOnly"> 👁️ Visual Text Mode (As You See) TRY THIS!!!</label>
<label><input type="checkbox" name="includeImages" checked> Include Images</label>
<label><input type="checkbox" name="preserveTables" checked> Preserve Tables</label>
<label><input type="checkbox" name="preserveLinks" checked> Preserve Links</label>
<label><input type="checkbox" name="keepCodeFormatting" checked> Keep Code Formatting</label>
<label><input type="checkbox" name="simplifyLayout"> Simplify Layout</label>
<label><input type="checkbox" name="addSeparators" checked> Add Separators</label>
<label><input type="checkbox" name="includeXPath"> Include XPath Headers</label>
</div>
<div class="c4ai-preview-content">
<div class="c4ai-preview-tabs">
<button class="c4ai-tab active" data-tab="preview">Preview</button>
<button class="c4ai-tab" data-tab="markdown">Markdown</button>
<button class="c4ai-wrap-toggle" title="Toggle word wrap">↔️ Wrap</button>
</div>
<div class="c4ai-preview-pane active" data-pane="preview"></div>
<div class="c4ai-preview-pane" data-pane="markdown"></div>
</div>
<div class="c4ai-preview-actions">
<button class="c4ai-download-btn">Download .md</button>
<button class="c4ai-copy-markdown-btn">Copy Markdown</button>
<button class="c4ai-cloud-btn" disabled>Send to Cloud (Coming Soon)</button>
</div>
`;
document.body.appendChild(this.previewPanel);
makeDraggableByHeader(this.previewPanel);
// Position preview panel
this.previewPanel.style.position = 'fixed';
this.previewPanel.style.top = '50%';
this.previewPanel.style.left = '50%';
this.previewPanel.style.transform = 'translate(-50%, -50%)';
this.previewPanel.style.zIndex = '999999';
this.setupPreviewEventListeners();
}
setupPreviewEventListeners() {
// Close button
this.previewPanel.querySelector('.c4ai-preview-close').addEventListener('click', () => {
this.previewPanel.style.display = 'none';
});
// Tab switching
this.previewPanel.querySelectorAll('.c4ai-tab').forEach(tab => {
tab.addEventListener('click', (e) => {
const tabName = e.target.dataset.tab;
this.switchPreviewTab(tabName);
});
});
// Wrap toggle
const wrapToggle = this.previewPanel.querySelector('.c4ai-wrap-toggle');
wrapToggle.addEventListener('click', () => {
const panes = this.previewPanel.querySelectorAll('.c4ai-preview-pane');
panes.forEach(pane => {
pane.classList.toggle('wrap');
});
wrapToggle.classList.toggle('active');
});
// Options change
this.previewPanel.querySelectorAll('input[type="checkbox"]').forEach(checkbox => {
checkbox.addEventListener('change', async (e) => {
this.options[e.target.name] = e.target.checked;
// If text-only is enabled, automatically disable certain options
if (e.target.name === 'textOnly' && e.target.checked) {
// Update UI checkboxes
const preserveLinksCheckbox = this.previewPanel.querySelector('input[name="preserveLinks"]');
if (preserveLinksCheckbox) {
preserveLinksCheckbox.checked = false;
preserveLinksCheckbox.disabled = true;
// Keep the stored option in sync with the unchecked box
this.options.preserveLinks = false;
}
// Optionally disable images in text-only mode
const includeImagesCheckbox = this.previewPanel.querySelector('input[name="includeImages"]');
if (includeImagesCheckbox) {
includeImagesCheckbox.disabled = true;
}
} else if (e.target.name === 'textOnly' && !e.target.checked) {
// Re-enable options when text-only is disabled
const preserveLinksCheckbox = this.previewPanel.querySelector('input[name="preserveLinks"]');
if (preserveLinksCheckbox) {
preserveLinksCheckbox.disabled = false;
}
const includeImagesCheckbox = this.previewPanel.querySelector('input[name="includeImages"]');
if (includeImagesCheckbox) {
includeImagesCheckbox.disabled = false;
}
}
const markdown = await this.generateMarkdown();
await this.updatePreviewContent(markdown);
});
});
// Action buttons
this.previewPanel.querySelector('.c4ai-copy-markdown-btn').addEventListener('click', () => {
this.copyToClipboard();
});
this.previewPanel.querySelector('.c4ai-download-btn').addEventListener('click', () => {
this.downloadMarkdown();
});
}
switchPreviewTab(tabName) {
// Update active tab
this.previewPanel.querySelectorAll('.c4ai-tab').forEach(tab => {
tab.classList.toggle('active', tab.dataset.tab === tabName);
});
// Update active pane
this.previewPanel.querySelectorAll('.c4ai-preview-pane').forEach(pane => {
pane.classList.toggle('active', pane.dataset.pane === tabName);
});
}
async updatePreviewContent(markdown) {
// Update markdown pane
const markdownPane = this.previewPanel.querySelector('[data-pane="markdown"]');
markdownPane.innerHTML = `<pre><code>${this.escapeHtml(markdown)}</code></pre>`;
// Update preview pane using marked.js
const previewPane = this.previewPanel.querySelector('[data-pane="preview"]');
// Configure marked options (marked.js is already loaded via manifest)
if (window.marked) {
marked.setOptions({
gfm: true,
breaks: true,
tables: true,
headerIds: false,
mangle: false
});
// Render markdown to HTML
const html = marked.parse(markdown);
previewPane.innerHTML = `<div class="c4ai-markdown-preview">${html}</div>`;
} else {
// Fallback if marked.js is not available
previewPane.innerHTML = `<div class="c4ai-markdown-preview"><pre>${this.escapeHtml(markdown)}</pre></div>`;
}
}
escapeHtml(unsafe) {
return unsafe
.replace(/&/g, "&amp;")
.replace(/</g, "&lt;")
.replace(/>/g, "&gt;")
.replace(/"/g, "&quot;")
.replace(/'/g, "&#039;");
}
async generateMarkdown() {
// Get selected elements as array
const elements = Array.from(this.selectedElements);
// Sort elements by their selection order
const sortedElements = elements.sort((a, b) => {
const orderA = parseInt(a.getAttribute('data-c4ai-selection-order') || '0');
const orderB = parseInt(b.getAttribute('data-c4ai-selection-order') || '0');
return orderA - orderB;
});
// Convert each element separately
const markdownParts = [];
for (let i = 0; i < sortedElements.length; i++) {
const element = sortedElements[i];
// Add XPath header if enabled
if (this.options.includeXPath) {
const xpath = this.getXPath(element);
markdownParts.push(`### Element ${i + 1} - XPath: \`${xpath}\`\n`);
}
// Convert each selected element on its own; a selected TR is converted as a single row, not as its whole table
const elementsToConvert = [element];
// Analyze and convert individual element
const analysis = await this.contentAnalyzer.analyze(elementsToConvert);
const markdown = await this.markdownConverter.convert(elementsToConvert, {
...this.options,
analysis
});
markdownParts.push(markdown.trim());
// Add separator if enabled and not last element
if (this.options.addSeparators && i < sortedElements.length - 1) {
markdownParts.push('---'); // join('\n\n') below supplies the blank lines around the rule
}
}
return markdownParts.join('\n\n');
}
getXPath(element) {
if (element.id) {
return `//*[@id="${element.id}"]`;
}
const parts = [];
let current = element;
while (current && current.nodeType === Node.ELEMENT_NODE) {
let index = 0;
let sibling = current.previousSibling;
while (sibling) {
if (sibling.nodeType === Node.ELEMENT_NODE && sibling.nodeName === current.nodeName) {
index++;
}
sibling = sibling.previousSibling;
}
const tagName = current.nodeName.toLowerCase();
const part = index > 0 ? `${tagName}[${index + 1}]` : tagName;
parts.unshift(part);
current = current.parentNode;
}
return '/' + parts.join('/');
}
sortElementsByPosition(elements) {
return elements.sort((a, b) => {
const position = a.compareDocumentPosition(b);
if (position & Node.DOCUMENT_POSITION_FOLLOWING) {
return -1;
} else if (position & Node.DOCUMENT_POSITION_PRECEDING) {
return 1;
}
return 0;
});
}
async copyToClipboard() {
const markdown = await this.generateMarkdown();
try {
await navigator.clipboard.writeText(markdown);
this.showNotification('Markdown copied to clipboard!');
} catch (err) {
console.error('Failed to copy:', err);
this.showNotification('Failed to copy. Please try again.', 'error');
}
}
async downloadMarkdown() {
const markdown = await this.generateMarkdown();
const timestamp = new Date().toISOString().replace(/[:.]/g, '-').slice(0, -5);
const filename = `crawl4ai-export-${timestamp}.md`;
// Create blob and download
const blob = new Blob([markdown], { type: 'text/markdown' });
const url = URL.createObjectURL(blob);
const a = document.createElement('a');
a.href = url;
a.download = filename;
document.body.appendChild(a);
a.click();
document.body.removeChild(a);
URL.revokeObjectURL(url);
this.showNotification(`Downloaded ${filename}`);
}
showNotification(message, type = 'success') {
const notification = document.createElement('div');
notification.className = `c4ai-notification c4ai-notification-${type}`;
notification.textContent = message;
document.body.appendChild(notification);
// Animate in
setTimeout(() => notification.classList.add('show'), 10);
// Remove after 3 seconds
setTimeout(() => {
notification.classList.remove('show');
setTimeout(() => notification.remove(), 300);
}, 3000);
}
deactivate() {
// Remove event listeners
document.removeEventListener('click', this.documentClickHandler, true);
document.removeEventListener('click', this.linkClickHandler, true);
document.removeEventListener('mouseover', this.documentHoverHandler, true);
document.removeEventListener('mouseout', this.documentMouseOutHandler, true);
document.removeEventListener('keydown', this.keyboardHandler);
// Clear selections
this.clearSelection();
// Remove UI elements
if (this.toolbar) {
this.toolbar.remove();
this.toolbar = null;
}
if (this.previewPanel) {
this.previewPanel.remove();
this.previewPanel = null;
}
// Remove hover styles
document.querySelectorAll('.c4ai-hover-candidate').forEach(el => {
el.classList.remove('c4ai-hover-candidate');
});
// Notify background script (with error handling)
try {
if (chrome.runtime && chrome.runtime.sendMessage) {
chrome.runtime.sendMessage({
action: 'c2cDeactivated'
});
}
} catch (error) {
// Extension context might be invalidated, ignore the error
console.log('Click2Crawl deactivated (extension context unavailable)');
}
}
}
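
The export code above derives its filename from an ISO timestamp with the characters that are awkward in filenames replaced. A minimal sketch of just that step, runnable outside the browser; `exportFilename` is a hypothetical helper name used for illustration and is not part of the extension.

```javascript
// Hypothetical helper (assumption, not in the extension): builds the export
// filename from a Date the same way the download code above does.
function exportFilename(date = new Date()) {
  // "2025-06-09T15:18:27.123Z" -> "2025-06-09T15-18-27-123Z" -> drop "-123Z"
  const timestamp = date.toISOString().replace(/[:.]/g, '-').slice(0, -5);
  return `crawl4ai-export-${timestamp}.md`;
}

console.log(exportFilename(new Date('2025-06-09T15:18:27.123Z')));
// prints "crawl4ai-export-2025-06-09T15-18-27.md"
```

Because `toISOString()` always reports UTC, the timestamp in the filename may differ from the user's local clock time.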

File diff suppressed because it is too large Load Diff


@@ -0,0 +1,623 @@
class ContentAnalyzer {
constructor() {
this.patterns = {
article: ['article', 'main', 'content', 'post', 'entry'],
navigation: ['nav', 'menu', 'navigation', 'breadcrumb'],
sidebar: ['sidebar', 'aside', 'widget'],
header: ['header', 'masthead', 'banner'],
footer: ['footer', 'copyright', 'contact'],
list: ['list', 'items', 'results', 'products', 'cards'],
table: ['table', 'grid', 'data'],
media: ['gallery', 'carousel', 'slideshow', 'video', 'media']
};
}
async analyze(elements) {
const analysis = {
structure: await this.analyzeStructure(elements),
contentType: this.identifyContentType(elements),
hierarchy: this.buildHierarchy(elements),
mediaAssets: this.collectMediaAssets(elements),
textDensity: this.calculateTextDensity(elements),
semanticRegions: this.identifySemanticRegions(elements),
relationships: this.analyzeRelationships(elements),
metadata: this.extractMetadata(elements)
};
return analysis;
}
analyzeStructure(elements) {
const structure = {
hasHeadings: false,
hasLists: false,
hasTables: false,
hasMedia: false,
hasCode: false,
hasLinks: false,
layout: 'linear', // linear, grid, mixed
depth: 0,
elementTypes: new Map()
};
// Analyze each element
for (const element of elements) {
this.analyzeElementStructure(element, structure);
}
// Determine layout type
structure.layout = this.determineLayout(elements);
// Calculate max depth
structure.depth = this.calculateMaxDepth(elements);
return structure;
}
analyzeElementStructure(element, structure, visited = new Set()) {
if (visited.has(element)) return;
visited.add(element);
const tagName = element.tagName;
// Update element type count
structure.elementTypes.set(
tagName,
(structure.elementTypes.get(tagName) || 0) + 1
);
// Check for specific structures
if (/^H[1-6]$/.test(tagName)) {
structure.hasHeadings = true;
} else if (['UL', 'OL', 'DL'].includes(tagName)) {
structure.hasLists = true;
} else if (tagName === 'TABLE') {
structure.hasTables = true;
} else if (['IMG', 'VIDEO', 'IFRAME', 'PICTURE'].includes(tagName)) {
structure.hasMedia = true;
} else if (['CODE', 'PRE'].includes(tagName)) {
structure.hasCode = true;
} else if (tagName === 'A') {
structure.hasLinks = true;
}
// Analyze children
for (const child of element.children) {
this.analyzeElementStructure(child, structure, visited);
}
}
identifyContentType(elements) {
const scores = {
article: 0,
list: 0,
table: 0,
form: 0,
media: 0,
mixed: 0
};
for (const element of elements) {
// Score based on element types and classes
const tagName = element.tagName;
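// Read the class attribute directly: className is not a string on SVG elements
const className = (element.getAttribute('class') || '').toLowerCase();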
const id = element.id.toLowerCase();
// Check for article patterns
if (tagName === 'ARTICLE' ||
this.matchesPattern(className + ' ' + id, this.patterns.article)) {
scores.article += 10;
}
// Check for list patterns
if (['UL', 'OL'].includes(tagName) ||
this.matchesPattern(className, this.patterns.list)) {
scores.list += 5;
}
// Check for table
if (tagName === 'TABLE') {
scores.table += 10;
}
// Check for form
if (tagName === 'FORM' || element.querySelector('input, select, textarea')) {
scores.form += 5;
}
// Check for media gallery
if (this.matchesPattern(className, this.patterns.media) ||
element.querySelectorAll('img, video').length > 3) {
scores.media += 5;
}
}
// Determine primary content type; a tie between top scores is reported as 'mixed'
const maxScore = Math.max(...Object.values(scores));
if (maxScore === 0) return 'unknown';
const topTypes = Object.keys(scores).filter(type => scores[type] === maxScore);
return topTypes.length === 1 ? topTypes[0] : 'mixed';
}
buildHierarchy(elements) {
const hierarchy = {
root: null,
levels: [],
headingStructure: []
};
// Find common ancestor
if (elements.length > 0) {
hierarchy.root = this.findCommonAncestor(elements);
}
// Build heading hierarchy
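const headingSet = new Set(); // a Set avoids duplicates when selections are nested
for (const element of elements) {
// Include the selected element itself when it is a heading
if (element.matches('h1, h2, h3, h4, h5, h6')) {
headingSet.add(element);
}
for (const heading of element.querySelectorAll('h1, h2, h3, h4, h5, h6')) {
headingSet.add(heading);
}
}
const headings = Array.from(headingSet);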
// Sort headings by document position
headings.sort((a, b) => {
const position = a.compareDocumentPosition(b);
if (position & Node.DOCUMENT_POSITION_FOLLOWING) {
return -1;
} else if (position & Node.DOCUMENT_POSITION_PRECEDING) {
return 1;
}
return 0;
});
// Build heading structure; the stack holds the chain of currently open ancestors
const stack = [];
for (const heading of headings) {
const level = parseInt(heading.tagName.substring(1), 10);
const item = {
level,
text: heading.textContent.trim(),
element: heading,
children: []
};
// Find parent in stack
while (stack.length > 0 && stack[stack.length - 1].level >= level) {
stack.pop();
}
if (stack.length > 0) {
stack[stack.length - 1].children.push(item);
} else {
hierarchy.headingStructure.push(item);
}
stack.push(item);
}
return hierarchy;
}
collectMediaAssets(elements) {
const media = {
images: [],
videos: [],
iframes: [],
audio: []
};
for (const element of elements) {
// Collect images
const images = element.querySelectorAll('img');
for (const img of images) {
media.images.push({
src: img.src,
alt: img.alt,
title: img.title,
width: img.width,
height: img.height,
element: img
});
}
// Collect videos
const videos = element.querySelectorAll('video');
for (const video of videos) {
media.videos.push({
src: video.src,
poster: video.poster,
width: video.width,
height: video.height,
element: video
});
}
// Collect iframes
const iframes = element.querySelectorAll('iframe');
for (const iframe of iframes) {
media.iframes.push({
src: iframe.src,
width: iframe.width,
height: iframe.height,
title: iframe.title,
element: iframe
});
}
// Collect audio
const audios = element.querySelectorAll('audio');
for (const audio of audios) {
media.audio.push({
src: audio.src,
element: audio
});
}
}
return media;
}
calculateTextDensity(elements) {
let totalText = 0;
let totalElements = 0;
let linkText = 0;
let codeText = 0;
for (const element of elements) {
const stats = this.getTextStats(element);
totalText += stats.textLength;
totalElements += stats.elementCount;
linkText += stats.linkTextLength;
codeText += stats.codeTextLength;
}
return {
textLength: totalText,
elementCount: totalElements,
averageTextPerElement: totalElements > 0 ? totalText / totalElements : 0,
linkDensity: totalText > 0 ? linkText / totalText : 0,
codeDensity: totalText > 0 ? codeText / totalText : 0
};
}
getTextStats(element, visited = new Set()) {
if (visited.has(element)) {
return { textLength: 0, elementCount: 0, linkTextLength: 0, codeTextLength: 0 };
}
visited.add(element);
let stats = {
textLength: 0,
elementCount: 1,
linkTextLength: 0,
codeTextLength: 0
};
// Get direct text content
for (const node of element.childNodes) {
if (node.nodeType === Node.TEXT_NODE) {
const text = node.textContent.trim();
stats.textLength += text.length;
// Check if this text is within a link (directly or via an ancestor)
if (element.closest('a')) {
stats.linkTextLength += text.length;
}
// Check if this text is within code (directly or via an ancestor)
if (element.closest('code, pre')) {
stats.codeTextLength += text.length;
}
}
}
// Process children
for (const child of element.children) {
const childStats = this.getTextStats(child, visited);
stats.textLength += childStats.textLength;
stats.elementCount += childStats.elementCount;
stats.linkTextLength += childStats.linkTextLength;
stats.codeTextLength += childStats.codeTextLength;
}
return stats;
}
identifySemanticRegions(elements) {
const regions = {
headers: [],
navigation: [],
main: [],
sidebars: [],
footers: [],
articles: []
};
for (const element of elements) {
// Check element and its ancestors for semantic regions
let current = element;
while (current) {
const tagName = current.tagName;
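// Read the class attribute directly: className is not a string on SVG elements
const className = (current.getAttribute('class') || '').toLowerCase();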
const role = current.getAttribute('role');
// Check semantic HTML5 elements
if (tagName === 'HEADER' || role === 'banner') {
regions.headers.push(current);
} else if (tagName === 'NAV' || role === 'navigation') {
regions.navigation.push(current);
} else if (tagName === 'MAIN' || role === 'main') {
regions.main.push(current);
} else if (tagName === 'ASIDE' || role === 'complementary') {
regions.sidebars.push(current);
} else if (tagName === 'FOOTER' || role === 'contentinfo') {
regions.footers.push(current);
} else if (tagName === 'ARTICLE' || role === 'article') {
regions.articles.push(current);
}
// Check class patterns
if (this.matchesPattern(className, this.patterns.header)) {
regions.headers.push(current);
} else if (this.matchesPattern(className, this.patterns.navigation)) {
regions.navigation.push(current);
} else if (this.matchesPattern(className, this.patterns.sidebar)) {
regions.sidebars.push(current);
} else if (this.matchesPattern(className, this.patterns.footer)) {
regions.footers.push(current);
}
current = current.parentElement;
}
}
// Deduplicate
for (const key of Object.keys(regions)) {
regions[key] = Array.from(new Set(regions[key]));
}
return regions;
}
analyzeRelationships(elements) {
const relationships = {
siblings: [],
parents: [],
children: [],
relatedByClass: new Map(),
relatedByStructure: []
};
// Find sibling relationships
for (let i = 0; i < elements.length; i++) {
for (let j = i + 1; j < elements.length; j++) {
if (elements[i].parentElement === elements[j].parentElement) {
relationships.siblings.push([elements[i], elements[j]]);
}
}
}
// Find parent-child relationships
for (const element of elements) {
for (const other of elements) {
if (element !== other) {
if (element.contains(other)) {
relationships.parents.push({ parent: element, child: other });
} else if (other.contains(element)) {
relationships.children.push({ parent: other, child: element });
}
}
}
}
// Group by similar classes
for (const element of elements) {
const classes = Array.from(element.classList);
for (const className of classes) {
if (!relationships.relatedByClass.has(className)) {
relationships.relatedByClass.set(className, []);
}
relationships.relatedByClass.get(className).push(element);
}
}
// Find structurally similar elements
for (let i = 0; i < elements.length; i++) {
for (let j = i + 1; j < elements.length; j++) {
if (this.areStructurallySimilar(elements[i], elements[j])) {
relationships.relatedByStructure.push([elements[i], elements[j]]);
}
}
}
return relationships;
}
areStructurallySimilar(element1, element2) {
// Same tag name
if (element1.tagName !== element2.tagName) {
return false;
}
// Similar class structure
const classes1 = Array.from(element1.classList).sort();
const classes2 = Array.from(element2.classList).sort();
// At least 50% overlap in classes
const intersection = classes1.filter(c => classes2.includes(c));
const union = Array.from(new Set([...classes1, ...classes2]));
if (union.length > 0 && intersection.length / union.length >= 0.5) {
return true;
}
// Similar child structure
if (element1.children.length === element2.children.length) {
const childTags1 = Array.from(element1.children).map(c => c.tagName).sort();
const childTags2 = Array.from(element2.children).map(c => c.tagName).sort();
if (JSON.stringify(childTags1) === JSON.stringify(childTags2)) {
return true;
}
}
return false;
}
extractMetadata(elements) {
const metadata = {
title: null,
description: null,
author: null,
date: null,
tags: [],
microdata: []
};
for (const element of elements) {
// Look for title
const h1 = element.querySelector('h1');
if (h1 && !metadata.title) {
metadata.title = h1.textContent.trim();
}
// Look for meta information
const metaElements = element.querySelectorAll('[itemprop], [property], [name]');
for (const meta of metaElements) {
const prop = meta.getAttribute('itemprop') ||
meta.getAttribute('property') ||
meta.getAttribute('name');
const content = meta.getAttribute('content') || meta.textContent.trim();
if (prop && content) {
if (prop.includes('author')) {
metadata.author = content;
} else if (prop.includes('date') || prop.includes('time')) {
metadata.date = content;
} else if (prop.includes('description')) {
metadata.description = content;
} else if (prop.includes('tag') || prop.includes('keyword')) {
metadata.tags.push(content);
}
metadata.microdata.push({ property: prop, value: content });
}
}
// Look for time elements
const timeElements = element.querySelectorAll('time');
for (const time of timeElements) {
if (!metadata.date && time.dateTime) {
metadata.date = time.dateTime;
}
}
}
return metadata;
}
determineLayout(elements) {
// Check if elements form a grid
const positions = elements.map(el => {
const rect = el.getBoundingClientRect();
return { x: rect.left, y: rect.top, width: rect.width, height: rect.height };
});
// Check for grid layout (multiple elements on same row)
const rows = new Map();
for (const pos of positions) {
const row = Math.round(pos.y / 10) * 10; // Round to nearest 10px
if (!rows.has(row)) {
rows.set(row, []);
}
rows.get(row).push(pos);
}
// If multiple elements share rows, it's likely a grid
const hasGrid = Array.from(rows.values()).some(row => row.length > 1);
if (hasGrid) {
return 'grid';
}
// Check for mixed layout (significant variation in widths)
const widths = positions.map(p => p.width);
const avgWidth = widths.reduce((a, b) => a + b, 0) / widths.length;
const variance = widths.reduce((sum, w) => sum + Math.pow(w - avgWidth, 2), 0) / widths.length;
const stdDev = Math.sqrt(variance);
if (stdDev / avgWidth > 0.3) {
return 'mixed';
}
return 'linear';
}
calculateMaxDepth(elements) {
let maxDepth = 0;
for (const element of elements) {
const depth = this.getElementDepth(element);
maxDepth = Math.max(maxDepth, depth);
}
return maxDepth;
}
getElementDepth(element, depth = 0) {
if (element.children.length === 0) {
return depth;
}
let maxChildDepth = depth;
for (const child of element.children) {
const childDepth = this.getElementDepth(child, depth + 1);
maxChildDepth = Math.max(maxChildDepth, childDepth);
}
return maxChildDepth;
}
findCommonAncestor(elements) {
if (elements.length === 0) return null;
if (elements.length === 1) return elements[0].parentElement;
// Start with the first element's ancestors
let ancestor = elements[0];
const ancestors = [];
while (ancestor) {
ancestors.push(ancestor);
ancestor = ancestor.parentElement;
}
// Find the deepest common ancestor
for (const ancestorCandidate of ancestors) {
let isCommon = true;
for (const element of elements) {
if (!ancestorCandidate.contains(element)) {
isCommon = false;
break;
}
}
if (isCommon) {
return ancestorCandidate;
}
}
return document.body;
}
matchesPattern(text, patterns) {
return patterns.some(pattern => text.includes(pattern));
}
}
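
The heading nesting in `buildHierarchy` relies on a stack of open ancestors: before placing a heading, every entry at the same or a deeper level is popped, so the top of the stack is always the nearest shallower heading. The sketch below runs the same idea on plain objects rather than DOM nodes; `nestHeadings` and the sample data are illustrative assumptions, not code from the extension.

```javascript
// Illustrative sketch (assumption): same stack-based nesting as buildHierarchy,
// but on plain { level, text } objects so it runs without a DOM.
function nestHeadings(headings) {
  const roots = [];
  const stack = []; // chain of currently open ancestor headings
  for (const { level, text } of headings) {
    const item = { level, text, children: [] };
    // Close every heading that is at the same level or deeper
    while (stack.length > 0 && stack[stack.length - 1].level >= level) {
      stack.pop();
    }
    if (stack.length > 0) {
      stack[stack.length - 1].children.push(item);
    } else {
      roots.push(item);
    }
    stack.push(item);
  }
  return roots;
}

const tree = nestHeadings([
  { level: 1, text: 'A' },
  { level: 2, text: 'B' },
  { level: 3, text: 'C' },
  { level: 2, text: 'D' },
  { level: 1, text: 'E' }
]);
console.log(tree.map(node => node.text)); // top-level headings: A and E
```

A skipped level (for example an `h3` directly under an `h1`) is simply attached to the nearest shallower heading, which matches how the loop in the class behaves.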


@@ -0,0 +1,718 @@
class MarkdownConverter {
constructor() {
// Conversion handlers for different element types
this.converters = {
'H1': async (el, ctx) => await this.convertHeading(el, 1, ctx),
'H2': async (el, ctx) => await this.convertHeading(el, 2, ctx),
'H3': async (el, ctx) => await this.convertHeading(el, 3, ctx),
'H4': async (el, ctx) => await this.convertHeading(el, 4, ctx),
'H5': async (el, ctx) => await this.convertHeading(el, 5, ctx),
'H6': async (el, ctx) => await this.convertHeading(el, 6, ctx),
'P': async (el, ctx) => await this.convertParagraph(el, ctx),
'A': async (el, ctx) => await this.convertLink(el, ctx),
'IMG': async (el, ctx) => await this.convertImage(el, ctx),
'UL': async (el, ctx) => await this.convertList(el, 'ul', ctx),
'OL': async (el, ctx) => await this.convertList(el, 'ol', ctx),
'LI': async (el, ctx) => await this.convertListItem(el, ctx),
'TABLE': async (el, ctx) => await this.convertTable(el, ctx),
'BLOCKQUOTE': async (el, ctx) => await this.convertBlockquote(el, ctx),
'PRE': async (el, ctx) => await this.convertPreformatted(el, ctx),
'CODE': async (el, ctx) => await this.convertCode(el, ctx),
'HR': async (el, ctx) => '\n---\n',
'BR': async (el, ctx) => '  \n', // two trailing spaces form a Markdown hard line break
'STRONG': async (el, ctx) => `**${await this.getTextContent(el, ctx)}**`,
'B': async (el, ctx) => `**${await this.getTextContent(el, ctx)}**`,
'EM': async (el, ctx) => `*${await this.getTextContent(el, ctx)}*`,
'I': async (el, ctx) => `*${await this.getTextContent(el, ctx)}*`,
'DEL': async (el, ctx) => `~~${await this.getTextContent(el, ctx)}~~`,
'S': async (el, ctx) => `~~${await this.getTextContent(el, ctx)}~~`,
'DIV': async (el, ctx) => await this.convertDiv(el, ctx),
'SPAN': async (el, ctx) => await this.convertSpan(el, ctx),
'ARTICLE': async (el, ctx) => await this.convertArticle(el, ctx),
'SECTION': async (el, ctx) => await this.convertSection(el, ctx),
'FIGURE': async (el, ctx) => await this.convertFigure(el, ctx),
'FIGCAPTION': async (el, ctx) => await this.convertFigCaption(el, ctx),
'VIDEO': async (el, ctx) => await this.convertVideo(el, ctx),
'IFRAME': async (el, ctx) => await this.convertIframe(el, ctx),
'DL': async (el, ctx) => await this.convertDefinitionList(el, ctx),
'DT': async (el, ctx) => await this.convertDefinitionTerm(el, ctx),
'DD': async (el, ctx) => await this.convertDefinitionDescription(el, ctx),
'TR': async (el, ctx) => await this.convertTableRow(el, ctx)
};
// Maintain context during conversion
this.conversionContext = {
listDepth: 0,
inTable: false,
inCode: false,
preserveWhitespace: false,
references: [],
imageCount: 0,
linkCount: 0
};
}
async convert(elements, options = {}) {
// Reset context
this.resetContext();
// Apply options
this.options = {
includeImages: true,
preserveTables: true,
keepCodeFormatting: true,
simplifyLayout: false,
preserveLinks: true,
textOnly: false, // plain-text mode: no escaping, links, images or tables
...options
};
// Convert elements
const markdownParts = [];
for (const element of elements) {
const markdown = await this.convertElement(element, this.conversionContext);
if (markdown.trim()) {
markdownParts.push(markdown);
}
}
// Join parts with appropriate spacing
let result = markdownParts.join('\n\n');
// Add references if using reference-style links
if (this.conversionContext.references.length > 0) {
result += '\n\n' + this.generateReferences();
}
// Post-process to clean up
result = this.postProcess(result);
return result;
}
resetContext() {
this.conversionContext = {
listDepth: 0,
inTable: false,
inCode: false,
preserveWhitespace: false,
references: [],
imageCount: 0,
linkCount: 0
};
}
async convertElement(element, context) {
// Skip hidden elements
if (this.isHidden(element)) {
return '';
}
// Skip script and style elements
if (['SCRIPT', 'STYLE', 'NOSCRIPT'].includes(element.tagName)) {
return '';
}
// Get converter for this element type
const converter = this.converters[element.tagName];
if (converter) {
return await converter(element, context);
} else {
// For unknown elements, process children
return await this.processChildren(element, context);
}
}
async processChildren(element, context) {
const parts = [];
for (const child of element.childNodes) {
if (child.nodeType === Node.TEXT_NODE) {
const text = this.processTextNode(child, context);
if (text) {
parts.push(text);
}
} else if (child.nodeType === Node.ELEMENT_NODE) {
const markdown = await this.convertElement(child, context);
if (markdown) {
parts.push(markdown);
}
}
}
return parts.join('');
}
processTextNode(node, context) {
let text = node.textContent;
// Preserve whitespace in code blocks
if (!context.preserveWhitespace && !context.inCode) {
// Normalize whitespace
text = text.replace(/\s+/g, ' ');
// Trim if at block boundaries
if (this.isBlockBoundary(node.previousSibling)) {
text = text.trimStart();
}
if (this.isBlockBoundary(node.nextSibling)) {
text = text.trimEnd();
}
}
// Escape markdown characters
if (!context.inCode) {
text = this.escapeMarkdown(text);
}
return text;
}
isBlockBoundary(node) {
if (!node || node.nodeType !== Node.ELEMENT_NODE) {
return true;
}
const blockElements = [
'DIV', 'P', 'H1', 'H2', 'H3', 'H4', 'H5', 'H6',
'UL', 'OL', 'LI', 'BLOCKQUOTE', 'PRE', 'TABLE',
'HR', 'ARTICLE', 'SECTION', 'HEADER', 'FOOTER',
'NAV', 'ASIDE', 'MAIN'
];
return blockElements.includes(node.tagName);
}
escapeMarkdown(text) {
// In text-only mode, don't escape characters
if (this.options.textOnly) {
return text;
}
// Escape special markdown characters
return text
.replace(/\\/g, '\\\\')
.replace(/\*/g, '\\*')
.replace(/_/g, '\\_')
.replace(/\[/g, '\\[')
.replace(/\]/g, '\\]')
.replace(/\(/g, '\\(')
.replace(/\)/g, '\\)')
.replace(/\#/g, '\\#')
.replace(/\+/g, '\\+')
.replace(/\-/g, '\\-')
.replace(/\./g, '\\.')
.replace(/\!/g, '\\!')
.replace(/\|/g, '\\|')
.replace(/`/g, '\\`');
}
async convertHeading(element, level, context) {
const text = await this.getTextContent(element, context);
return '#'.repeat(level) + ' ' + text + '\n';
}
async convertParagraph(element, context) {
const content = await this.processChildren(element, context);
return content.trim() ? content + '\n' : '';
}
async convertLink(element, context) {
if (!this.options.preserveLinks || this.options.textOnly) {
return await this.getTextContent(element, context);
}
const text = await this.getTextContent(element, context);
const href = element.getAttribute('href');
const title = element.getAttribute('title');
if (!href) {
return text;
}
// Convert relative URLs to absolute
const absoluteUrl = this.makeAbsoluteUrl(href);
// Emit an inline link, including the title when one is present
if (text && absoluteUrl) {
if (title) {
return `[${text}](${absoluteUrl} "${title}")`;
} else {
return `[${text}](${absoluteUrl})`;
}
}
return text;
}
async convertImage(element, context) {
if (!this.options.includeImages || this.options.textOnly) {
// In text-only mode, return alt text if available
if (this.options.textOnly) {
const alt = element.getAttribute('alt');
return alt ? `[Image: ${alt}]` : '';
}
return '';
}
const src = element.getAttribute('src');
const alt = element.getAttribute('alt') || '';
const title = element.getAttribute('title');
if (!src) {
return '';
}
// Convert relative URLs to absolute
const absoluteUrl = this.makeAbsoluteUrl(src);
if (title) {
return `![${alt}](${absoluteUrl} "${title}")`;
} else {
return `![${alt}](${absoluteUrl})`;
}
}
async convertList(element, type, context) {
const oldDepth = context.listDepth;
context.listDepth++;
const items = [];
for (const child of element.children) {
if (child.tagName === 'LI') {
const markdown = await this.convertListItem(child, { ...context, listType: type });
if (markdown) {
items.push(markdown);
}
}
}
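context.listDepth = oldDepth;
// A nested list must start on its own line under its parent item;
// a top-level list ends with a trailing newline
return (oldDepth > 0 ? '\n' : '') + items.join('\n') + (oldDepth === 0 ? '\n' : '');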
}
async convertListItem(element, context) {
// Two spaces per nesting level so nested items render as children
const indent = '  '.repeat(Math.max(0, context.listDepth - 1));
const bullet = context.listType === 'ol' ? '1.' : '-';
const content = (await this.processChildren(element, context)).trim();
return `${indent}${bullet} ${content}`;
}
async convertTable(element, context) {
if (!this.options.preserveTables || this.options.textOnly) {
// Fallback to simple text representation
return await this.convertTableToText(element, context);
}
const rows = [];
const headerRows = [];
let maxCols = 0;
// Process table rows
for (const child of element.children) {
if (child.tagName === 'THEAD') {
for (const row of child.children) {
if (row.tagName === 'TR') {
const cells = await this.processTableRow(row, context);
headerRows.push(cells);
maxCols = Math.max(maxCols, cells.length);
}
}
} else if (child.tagName === 'TBODY') {
for (const row of child.children) {
if (row.tagName === 'TR') {
const cells = await this.processTableRow(row, context);
rows.push(cells);
maxCols = Math.max(maxCols, cells.length);
}
}
} else if (child.tagName === 'TR') {
const cells = await this.processTableRow(child, context);
rows.push(cells);
maxCols = Math.max(maxCols, cells.length);
}
}
// Build markdown table. A pipe table needs exactly one header row followed
// by a separator, so use the first <thead> row or promote the first body row.
const allRows = [...headerRows, ...rows];
if (allRows.length === 0) {
return '';
}
const markdownRows = allRows.map(row =>
'| ' + this.padTableRow(row, maxCols).join(' | ') + ' |'
);
// Insert the separator directly after the header row
const separator = Array(maxCols).fill('---');
markdownRows.splice(1, 0, '| ' + separator.join(' | ') + ' |');
return markdownRows.join('\n') + '\n';
}
async processTableRow(row, context) {
const cells = [];
for (const cell of row.children) {
if (cell.tagName === 'TD' || cell.tagName === 'TH') {
const content = (await this.getTextContent(cell, context)).trim();
cells.push(content);
}
}
return cells;
}
async convertTableRow(element, context) {
// Convert a single table row to markdown
if (this.options.textOnly) {
const cells = await this.processTableRow(element, context);
return cells.join(' ');
}
// For non-text-only mode, create a simple table representation
const cells = await this.processTableRow(element, context);
return '| ' + cells.join(' | ') + ' |';
}
padTableRow(row, targetLength) {
const padded = [...row];
while (padded.length < targetLength) {
padded.push('');
}
return padded;
}
async convertTableToText(element, context) {
// Convert table to clean text representation
const lines = [];
const rows = element.querySelectorAll('tr');
for (const row of rows) {
const cells = row.querySelectorAll('td, th');
const cellTexts = [];
for (const cell of cells) {
const text = (await this.getTextContent(cell, context)).trim();
if (text) {
cellTexts.push(text);
}
}
if (cellTexts.length > 0) {
// Join cells with space, handling common patterns
lines.push(cellTexts.join(' '));
}
}
return lines.join('\n');
}
async convertBlockquote(element, context) {
const lines = (await this.processChildren(element, context)).trim().split('\n');
return lines.map(line => '> ' + line).join('\n') + '\n';
}
async convertPreformatted(element, context) {
const oldInCode = context.inCode;
const oldPreserveWhitespace = context.preserveWhitespace;
context.inCode = true;
context.preserveWhitespace = true;
let content = '';
let language = '';
// Check if this is a code block with language
const codeElement = element.querySelector('code');
if (codeElement) {
// Try to detect language from class
const className = codeElement.className;
const langMatch = className.match(/language-(\w+)/);
if (langMatch) {
language = langMatch[1];
}
content = codeElement.textContent;
} else {
content = element.textContent;
}
context.inCode = oldInCode;
context.preserveWhitespace = oldPreserveWhitespace;
// Use fenced code blocks
return '```' + language + '\n' + content + '\n```\n';
}
async convertCode(element, context) {
if (element.parentElement && element.parentElement.tagName === 'PRE') {
// Already handled by convertPreformatted
return element.textContent;
}
const content = element.textContent;
return '`' + content + '`';
}
async convertDiv(element, context) {
// Check for special div types
if (element.className.includes('code-block') ||
element.className.includes('highlight')) {
return await this.convertPreformatted(element, context);
}
const content = await this.processChildren(element, context);
return content.trim() ? content + '\n' : '';
}
async convertSpan(element, context) {
// Check for special span types
if (element.className.includes('code') ||
element.className.includes('inline-code')) {
return this.convertCode(element, context);
}
return await this.processChildren(element, context);
}
async convertArticle(element, context) {
const content = await this.processChildren(element, context);
return content.trim() ? content + '\n' : '';
}
async convertSection(element, context) {
const content = await this.processChildren(element, context);
return content.trim() ? content + '\n' : '';
}
async convertFigure(element, context) {
const content = await this.processChildren(element, context);
return content.trim() ? content + '\n' : '';
}
async convertFigCaption(element, context) {
const caption = await this.getTextContent(element, context);
return caption ? '\n*' + caption + '*\n' : '';
}
async convertVideo(element, context) {
const title = element.getAttribute('title') || 'Video';
if (this.options.textOnly) {
return `[Video: ${title}]`;
}
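// Videos often declare their URL on a child <source> element instead of src
const src = element.getAttribute('src') ||
element.querySelector('source')?.getAttribute('src');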
const poster = element.getAttribute('poster');
if (!src) {
return '';
}
// Convert to markdown with poster image if available
if (poster) {
const absolutePoster = this.makeAbsoluteUrl(poster);
const absoluteSrc = this.makeAbsoluteUrl(src);
return `[![${title}](${absolutePoster})](${absoluteSrc})`;
} else {
const absoluteSrc = this.makeAbsoluteUrl(src);
return `[${title}](${absoluteSrc})`;
}
}
async convertIframe(element, context) {
const title = element.getAttribute('title') || 'Embedded content';
const src = element.getAttribute('src') || '';
// Treat common video hosts as video embeds
const isVideo = ['youtube.com', 'youtu.be', 'vimeo.com'].some(host => src.includes(host));
if (this.options.textOnly) {
return isVideo ? `[Video: ${title}]` : `[Embedded: ${title}]`;
}
if (!src) {
return '';
}
const absoluteSrc = this.makeAbsoluteUrl(src);
return isVideo ? `[▶️ ${title}](${absoluteSrc})` : `[${title}](${absoluteSrc})`;
}
async convertDefinitionList(element, context) {
return await this.processChildren(element, context) + '\n';
}
async convertDefinitionTerm(element, context) {
const term = await this.getTextContent(element, context);
return '**' + term + '**\n';
}
async convertDefinitionDescription(element, context) {
const description = await this.processChildren(element, context);
return ': ' + description + '\n';
}
async getTextContent(element, context) {
// Special handling for elements that might contain other markdown
if (context.inCode) {
return element.textContent;
}
return await this.processChildren(element, context);
}
makeAbsoluteUrl(url) {
if (!url) return '';
try {
// Resolve against the document base URL. The URL constructor handles
// absolute, protocol-relative, root-relative and path-relative inputs,
// and leaves non-HTTP schemes (mailto:, tel:, data:) unchanged.
return new URL(url, document.baseURI).href;
} catch (e) {
// Malformed URL: fall back to the original value
return url;
}
}
isHidden(element) {
const style = window.getComputedStyle(element);
return style.display === 'none' ||
style.visibility === 'hidden' ||
style.opacity === '0';
}
generateReferences() {
return this.conversionContext.references
.map((ref, index) => `[${index + 1}]: ${ref.url}`)
.join('\n');
}
postProcess(markdown) {
// Apply text-only specific processing
if (this.options.textOnly) {
markdown = this.postProcessTextOnly(markdown);
}
// Clean up excessive newlines
markdown = markdown.replace(/\n{3,}/g, '\n\n');
// Clean up spaces before punctuation
markdown = markdown.replace(/ +([.,;:!?])/g, '$1');
// Ensure proper spacing around headers
markdown = markdown.replace(/\n(#{1,6} )/g, '\n\n$1');
markdown = markdown.replace(/(#{1,6} .+)\n(?![\n#])/g, '$1\n\n');
// Clean up list spacing
markdown = markdown.replace(/\n\n(-|\d+\.) /g, '\n$1 ');
// Trim final result
return markdown.trim();
}
postProcessTextOnly(markdown) {
// Smart pattern recognition for common formats
const lines = markdown.split('\n');
const processedLines = [];
let inMetadata = false;
let currentItem = null;
for (let i = 0; i < lines.length; i++) {
const line = lines[i].trim();
if (!line) {
processedLines.push('');
continue;
}
// Detect numbered list items (common in HN, Reddit, etc.)
const numberPattern = /^(\d+)\.\s*(.+)$/;
const numberMatch = line.match(numberPattern);
if (numberMatch) {
// Start of a new numbered item
inMetadata = false;
currentItem = numberMatch[1];
const content = numberMatch[2];
// Check if content has domain in parentheses
const domainPattern = /^(.+?)\s*\(([^)]+)\)\s*(.*)$/;
const domainMatch = content.match(domainPattern);
if (domainMatch) {
const [, title, domain, rest] = domainMatch;
processedLines.push(`${currentItem}. **${title.trim()}** (${domain})`);
if (rest.trim()) {
processedLines.push(` ${rest.trim()}`);
inMetadata = true;
}
} else {
processedLines.push(`${currentItem}. **${content}**`);
}
} else if (line.match(/\b(points?|by|ago|hide|comments?)\b/i) && currentItem) {
// This looks like metadata for the current item
const cleanedLine = line
.replace(/\s+/g, ' ')
.replace(/\s*\|\s*/g, ' | ')
.trim();
processedLines.push(` ${cleanedLine}`);
inMetadata = true;
} else if (inMetadata && line.length < 100) {
// Continue metadata if we're in metadata mode and line is short
processedLines.push(` ${line}`);
} else {
// Regular content
inMetadata = false;
processedLines.push(line);
}
}
// Clean up the output
let result = processedLines.join('\n');
// Remove excessive blank lines
result = result.replace(/\n{3,}/g, '\n\n');
// Ensure proper spacing after numbered items
result = result.replace(/^(\d+\..+)$\n^(?!\s)/gm, '$1\n\n');
return result;
}
}
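Review note: the cleanup passes in postProcess() are order-sensitive. The sketch below restates them as a standalone function, with the newline collapse placed after the header rules so that blank lines added by those rules are normalized too. `tidyMarkdown` is a hypothetical name used only for illustration; it is not part of the extension.

```javascript
// Hypothetical standalone helper mirroring the passes in postProcess().
// Not part of the extension; shown only to illustrate pass ordering.
function tidyMarkdown(markdown) {
  return markdown
    // Remove spaces before punctuation: "Hello , world" -> "Hello, world"
    .replace(/ +([.,;:!?])/g, '$1')
    // Put a blank line before and after headers
    .replace(/\n(#{1,6} )/g, '\n\n$1')
    .replace(/(#{1,6} .+)\n(?![\n#])/g, '$1\n\n')
    // Collapse runs of three or more newlines into a single blank line
    .replace(/\n{3,}/g, '\n\n')
    // Keep list items tight: no blank line before "- " or "1. "
    .replace(/\n\n(-|\d+\.) /g, '\n$1 ')
    .trim();
}
```

Because every pass is a plain string replacement, the function can be exercised outside the browser, which makes ordering regressions cheap to catch.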

File diff suppressed because it is too large

File diff suppressed because it is too large


@@ -0,0 +1,608 @@
// SchemaBuilder class for Crawl4AI Chrome Extension
class SchemaBuilder {
constructor() {
this.mode = null;
this.container = null;
this.fields = [];
this.overlay = null;
this.toolbar = null;
this.highlightBox = null;
this.selectedElements = new Set();
this.isPaused = false;
this.codeModal = null;
this.handleMouseMove = this.handleMouseMove.bind(this);
this.handleClick = this.handleClick.bind(this);
this.handleKeyPress = this.handleKeyPress.bind(this);
}
start() {
this.mode = 'container';
this.createOverlay();
this.createToolbar();
this.attachEventListeners();
this.updateToolbar();
}
stop() {
this.detachEventListeners();
this.overlay?.remove();
this.toolbar?.remove();
this.highlightBox?.remove();
this.removeAllHighlights();
this.mode = null;
this.container = null;
this.fields = [];
this.selectedElements.clear();
// Reset pause state so a later start() does not begin paused
this.isPaused = false;
}
createOverlay() {
// Create highlight box
this.highlightBox = document.createElement('div');
this.highlightBox.className = 'c4ai-highlight-box';
document.body.appendChild(this.highlightBox);
}
createToolbar() {
this.toolbar = document.createElement('div');
this.toolbar.className = 'c4ai-toolbar';
this.toolbar.innerHTML = `
<div class="c4ai-toolbar-titlebar">
<div class="c4ai-titlebar-dots">
<button class="c4ai-dot c4ai-dot-close" id="c4ai-close"></button>
<button class="c4ai-dot c4ai-dot-minimize"></button>
<button class="c4ai-dot c4ai-dot-maximize"></button>
</div>
<img src="${chrome.runtime.getURL('icons/icon-16.png')}" class="c4ai-titlebar-icon" alt="Crawl4AI">
<div class="c4ai-titlebar-title">Crawl4AI Schema Builder</div>
</div>
<div class="c4ai-toolbar-content">
<div class="c4ai-toolbar-status">
<div class="c4ai-status-item">
<span class="c4ai-status-label">Mode:</span>
<span class="c4ai-status-value" id="c4ai-mode">Select Container</span>
</div>
<div class="c4ai-status-item">
<span class="c4ai-status-label">Container:</span>
<span class="c4ai-status-value" id="c4ai-container">Not selected</span>
</div>
</div>
<div class="c4ai-fields-list" id="c4ai-fields-list" style="display: none;">
<div class="c4ai-fields-header">Selected Fields:</div>
<ul class="c4ai-fields-items" id="c4ai-fields-items"></ul>
</div>
<div class="c4ai-toolbar-hint" id="c4ai-hint">
Click on a container element (e.g., product card, article, etc.)
</div>
<div class="c4ai-toolbar-actions">
<button id="c4ai-pause" class="c4ai-action-btn c4ai-pause-btn">
<span class="c4ai-pause-icon">⏸</span> Pause
</button>
<button id="c4ai-generate" class="c4ai-action-btn c4ai-generate-btn">
<span class="c4ai-generate-icon">⚡</span> Generate Code
</button>
</div>
</div>
`;
document.body.appendChild(this.toolbar);
// Add event listeners for toolbar buttons
document.getElementById('c4ai-pause').addEventListener('click', () => this.togglePause());
document.getElementById('c4ai-generate').addEventListener('click', () => this.stopAndGenerate());
document.getElementById('c4ai-close').addEventListener('click', () => this.stop());
// Make toolbar draggable
window.C4AI_Utils.makeDraggable(this.toolbar);
}
attachEventListeners() {
document.addEventListener('mousemove', this.handleMouseMove, true);
document.addEventListener('click', this.handleClick, true);
document.addEventListener('keydown', this.handleKeyPress, true);
}
detachEventListeners() {
document.removeEventListener('mousemove', this.handleMouseMove, true);
document.removeEventListener('click', this.handleClick, true);
document.removeEventListener('keydown', this.handleKeyPress, true);
}
handleMouseMove(e) {
if (this.isPaused) return;
const element = document.elementFromPoint(e.clientX, e.clientY);
if (element && !this.isOurElement(element)) {
this.highlightElement(element);
}
}
handleClick(e) {
if (this.isPaused) return;
const element = e.target;
if (this.isOurElement(element)) {
return;
}
e.preventDefault();
e.stopPropagation();
if (this.mode === 'container') {
this.selectContainer(element);
} else if (this.mode === 'field') {
this.selectField(element);
}
}
handleKeyPress(e) {
if (e.key === 'Escape') {
this.stop();
}
}
isOurElement(element) {
return window.C4AI_Utils.isOurElement(element);
}
togglePause() {
this.isPaused = !this.isPaused;
const pauseBtn = document.getElementById('c4ai-pause');
if (this.isPaused) {
pauseBtn.innerHTML = '<span class="c4ai-play-icon">▶</span> Resume';
pauseBtn.classList.add('c4ai-paused');
this.highlightBox.style.display = 'none';
} else {
pauseBtn.innerHTML = '<span class="c4ai-pause-icon">⏸</span> Pause';
pauseBtn.classList.remove('c4ai-paused');
}
}
stopAndGenerate() {
if (!this.container || this.fields.length === 0) {
alert('Please select a container and at least one field before generating code.');
return;
}
const code = this.generateCode();
this.showCodeModal(code);
}
highlightElement(element) {
const rect = element.getBoundingClientRect();
this.highlightBox.style.cssText = `
left: ${rect.left + window.scrollX}px;
top: ${rect.top + window.scrollY}px;
width: ${rect.width}px;
height: ${rect.height}px;
display: block;
`;
if (this.mode === 'container') {
this.highlightBox.className = 'c4ai-highlight-box c4ai-container-mode';
} else {
this.highlightBox.className = 'c4ai-highlight-box c4ai-field-mode';
}
}
selectContainer(element) {
// Remove previous container highlight
if (this.container) {
this.container.element.classList.remove('c4ai-selected-container');
}
this.container = {
element: element,
html: element.outerHTML,
selector: this.generateSelector(element),
tagName: element.tagName.toLowerCase()
};
element.classList.add('c4ai-selected-container');
this.mode = 'field';
this.updateToolbar();
this.updateStats();
}
selectField(element) {
// Don't select the container itself
if (element === this.container.element) {
return;
}
// Check if already selected - if so, deselect it
if (this.selectedElements.has(element)) {
this.deselectField(element);
return;
}
// Must be inside the container
if (!this.container.element.contains(element)) {
return;
}
this.showFieldDialog(element);
}
deselectField(element) {
// Remove from fields array
this.fields = this.fields.filter(f => f.element !== element);
// Remove from selected elements set
this.selectedElements.delete(element);
// Remove visual selection
element.classList.remove('c4ai-selected-field');
// Update UI
this.updateToolbar();
this.updateStats();
}
showFieldDialog(element) {
const dialog = document.createElement('div');
dialog.className = 'c4ai-field-dialog';
const rect = element.getBoundingClientRect();
dialog.style.cssText = `
left: ${rect.left + window.scrollX}px;
top: ${rect.bottom + window.scrollY + 10}px;
`;
dialog.innerHTML = `
<div class="c4ai-field-dialog-content">
<h4>Name this field:</h4>
<input type="text" id="c4ai-field-name" placeholder="e.g., title, price, description" autofocus>
<div class="c4ai-field-preview">
<strong>Content:</strong> ${window.C4AI_Utils.escapeHtml(element.textContent.trim().substring(0, 50))}...
</div>
<div class="c4ai-field-actions">
<button id="c4ai-field-save">Save</button>
<button id="c4ai-field-cancel">Cancel</button>
</div>
</div>
`;
document.body.appendChild(dialog);
const input = dialog.querySelector('#c4ai-field-name');
const saveBtn = dialog.querySelector('#c4ai-field-save');
const cancelBtn = dialog.querySelector('#c4ai-field-cancel');
const save = () => {
const fieldName = input.value.trim();
if (fieldName) {
this.fields.push({
name: fieldName,
value: element.textContent.trim(),
element: element,
selector: this.generateSelector(element, this.container.element)
});
element.classList.add('c4ai-selected-field');
this.selectedElements.add(element);
this.updateToolbar();
this.updateStats();
}
dialog.remove();
};
const cancel = () => {
dialog.remove();
};
saveBtn.addEventListener('click', save);
cancelBtn.addEventListener('click', cancel);
// Use keydown: keypress is deprecated and does not fire for Escape
input.addEventListener('keydown', (e) => {
if (e.key === 'Enter') save();
if (e.key === 'Escape') cancel();
});
input.focus();
}
generateSelector(element, context = document) {
// Try to generate a robust selector
if (element.id) {
return `#${CSS.escape(element.id)}`;
}
// Check for data attributes (most stable)
const dataAttrs = ['data-testid', 'data-id', 'data-test', 'data-cy'];
for (const attr of dataAttrs) {
const value = element.getAttribute(attr);
if (value) {
return `[${attr}="${CSS.escape(value)}"]`;
}
}
// Check for aria-label
if (element.getAttribute('aria-label')) {
return `[aria-label="${CSS.escape(element.getAttribute('aria-label'))}"]`;
}
// Try semantic HTML elements with text
const tagName = element.tagName.toLowerCase();
if (['button', 'a', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6'].includes(tagName)) {
const text = element.textContent.trim();
if (text && text.length < 50) {
// Use the bare tag name; the text is only checked here, not encoded into the selector
return `${tagName}`;
}
}
// Check for simple, non-utility classes
const classes = Array.from(element.classList)
.filter(c => !c.startsWith('c4ai-')) // Exclude our classes
.filter(c => !c.includes('[') && !c.includes('(') && !c.includes(':')) // Exclude utility classes
.filter(c => c.length < 30); // Exclude very long classes
if (classes.length > 0 && classes.length <= 3) {
const selector = classes.map(c => `.${CSS.escape(c)}`).join('');
try {
if (context.querySelectorAll(selector).length === 1) {
return selector;
}
} catch (e) {
// Invalid selector, continue
}
}
// Use nth-child with simple parent tag
const parent = element.parentElement;
if (parent && parent !== context) {
const siblings = Array.from(parent.children);
const index = siblings.indexOf(element) + 1;
// Just use parent tag name to avoid recursion
const parentTag = parent.tagName.toLowerCase();
return `${parentTag} > ${tagName}:nth-child(${index})`;
}
// Final fallback
return tagName;
}
updateToolbar() {
document.getElementById('c4ai-mode').textContent =
this.mode === 'container' ? 'Select Container' : 'Select Fields';
document.getElementById('c4ai-container').textContent =
this.container ? `${this.container.tagName}` : 'Not selected';
// Update fields list
const fieldsList = document.getElementById('c4ai-fields-list');
const fieldsItems = document.getElementById('c4ai-fields-items');
if (this.fields.length > 0) {
fieldsList.style.display = 'block';
fieldsItems.innerHTML = this.fields.map(field => `
<li class="c4ai-field-item">
<span class="c4ai-field-name">${window.C4AI_Utils.escapeHtml(field.name)}</span>
<span class="c4ai-field-value">${window.C4AI_Utils.escapeHtml(field.value.substring(0, 30))}${field.value.length > 30 ? '...' : ''}</span>
</li>
`).join('');
} else {
fieldsList.style.display = 'none';
}
const hint = document.getElementById('c4ai-hint');
if (this.mode === 'container') {
hint.textContent = 'Click on a container element (e.g., product card, article, etc.)';
} else if (this.fields.length === 0) {
hint.textContent = 'Click on fields inside the container to extract (title, price, etc.)';
} else {
hint.innerHTML = `Continue selecting fields or click <strong>Generate Code</strong> to finish.`;
}
}
updateStats() {
chrome.runtime.sendMessage({
action: 'updateStats',
stats: {
container: !!this.container,
fields: this.fields.length
}
});
}
removeAllHighlights() {
document.querySelectorAll('.c4ai-selected-container').forEach(el => {
el.classList.remove('c4ai-selected-container');
});
document.querySelectorAll('.c4ai-selected-field').forEach(el => {
el.classList.remove('c4ai-selected-field');
});
}
generateCode() {
const fieldDescriptions = this.fields.map(f =>
`- ${f.name} (example: "${f.value.substring(0, 50)}...")`
).join('\n');
return `#!/usr/bin/env python3
"""
Generated by Crawl4AI Chrome Extension
URL: ${window.location.href}
Generated: ${new Date().toISOString()}
"""
import asyncio
import json
from pathlib import Path
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
# HTML snippet of the selected container element
HTML_SNIPPET = """
${this.container.html}
"""
# Extraction query based on your field selections
EXTRACTION_QUERY = """
Create a JSON CSS extraction schema to extract the following fields:
${fieldDescriptions}
The schema should handle multiple ${this.container.tagName} elements on the page.
Each item should be extracted as a separate object in the results array.
"""
async def generate_schema():
"""Generate extraction schema using LLM"""
print("🔧 Generating extraction schema...")
try:
# Generate the schema using Crawl4AI's built-in LLM integration
schema = JsonCssExtractionStrategy.generate_schema(
html=HTML_SNIPPET,
query=EXTRACTION_QUERY,
)
# Save the schema for reuse
schema_path = Path('generated_schema.json')
with open(schema_path, 'w') as f:
json.dump(schema, f, indent=2)
print("✅ Schema generated successfully!")
print(f"📄 Schema saved to: {schema_path}")
print("\\nGenerated schema:")
print(json.dumps(schema, indent=2))
return schema
except Exception as e:
print(f"❌ Error generating schema: {e}")
return None
async def test_extraction(url: str = "${window.location.href}"):
"""Test the generated schema on the actual webpage"""
print("\\n🧪 Testing extraction on live webpage...")
# Load the generated schema
try:
with open('generated_schema.json', 'r') as f:
schema = json.load(f)
except FileNotFoundError:
print("❌ Schema file not found. Run generate_schema() first.")
return
# Configure browser
browser_config = BrowserConfig(
headless=True,
verbose=False
)
# Configure extraction
crawler_config = CrawlerRunConfig(
extraction_strategy=JsonCssExtractionStrategy(schema=schema)
)
async with AsyncWebCrawler(config=browser_config) as crawler:
result = await crawler.arun(
url=url,
config=crawler_config
)
if result.success and result.extracted_content:
data = json.loads(result.extracted_content)
print(f"\\n✅ Successfully extracted {len(data)} items!")
# Save results
with open('extracted_data.json', 'w') as f:
json.dump(data, f, indent=2)
# Show sample results
print("\\n📊 Sample results (first 2 items):")
for i, item in enumerate(data[:2], 1):
print(f"\\nItem {i}:")
for key, value in item.items():
print(f" {key}: {value}")
else:
print("❌ Extraction failed:", result.error_message)
if __name__ == "__main__":
# Step 1: Generate the schema from HTML snippet
asyncio.run(generate_schema())
# Step 2: Test extraction on the live webpage
# Uncomment the line below to test extraction:
# asyncio.run(test_extraction())
print("\\n🎯 Next steps:")
print("1. Review the generated schema in 'generated_schema.json'")
print("2. Uncomment the test_extraction() line to test on the live site")
print("3. Use the schema in your Crawl4AI projects!")
`;
}
showCodeModal(code) {
// Create modal
this.codeModal = document.createElement('div');
this.codeModal.className = 'c4ai-code-modal';
this.codeModal.innerHTML = `
<div class="c4ai-code-modal-content">
<div class="c4ai-code-modal-header">
<h2>Generated Python Code</h2>
<button class="c4ai-close-modal" id="c4ai-close-modal">✕</button>
</div>
<div class="c4ai-code-modal-body">
<pre class="c4ai-code-block"><code class="language-python">${window.C4AI_Utils.escapeHtml(code)}</code></pre>
</div>
<div class="c4ai-code-modal-footer">
<button class="c4ai-action-btn c4ai-cloud-btn" id="c4ai-run-cloud" disabled>
<span>☁️</span> Run on C4AI Cloud (Coming Soon)
</button>
<button class="c4ai-action-btn c4ai-download-btn" id="c4ai-download-code">
<span>⬇</span> Download Code
</button>
<button class="c4ai-action-btn c4ai-copy-btn" id="c4ai-copy-code">
<span>📋</span> Copy to Clipboard
</button>
</div>
</div>
`;
document.body.appendChild(this.codeModal);
// Add event listeners
document.getElementById('c4ai-close-modal').addEventListener('click', () => {
this.codeModal.remove();
this.codeModal = null;
// Don't stop the capture session
});
document.getElementById('c4ai-download-code').addEventListener('click', () => {
chrome.runtime.sendMessage({
action: 'downloadCode',
code: code,
filename: `crawl4ai_schema_${Date.now()}.py`
}, (response) => {
if (response && response.success) {
const btn = document.getElementById('c4ai-download-code');
const originalHTML = btn.innerHTML;
btn.innerHTML = '<span>✓</span> Downloaded!';
setTimeout(() => {
btn.innerHTML = originalHTML;
}, 2000);
} else {
console.error('Download failed:', response?.error);
alert('Download failed. Please check your browser settings.');
}
});
});
document.getElementById('c4ai-copy-code').addEventListener('click', () => {
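navigator.clipboard.writeText(code).then(() => {
const btn = document.getElementById('c4ai-copy-code');
btn.innerHTML = '<span>✓</span> Copied!';
setTimeout(() => {
btn.innerHTML = '<span>📋</span> Copy to Clipboard';
}, 2000);
}).catch((err) => {
// Clipboard access can be denied (permissions, unfocused document)
console.error('Copy failed:', err);
alert('Copy failed. Please copy the code manually.');
});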
});
// Apply syntax highlighting
window.C4AI_Utils.applySyntaxHighlighting(this.codeModal.querySelector('.language-python'));
}
}
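Review note: generateCode() interpolates `this.container.html` and field values directly into Python `"""..."""` literals, so page content containing `"""` or backslashes could produce invalid or altered Python. A minimal sketch of one way to guard against that, assuming a hypothetical helper named `escapeForPythonTripleQuoted` (not part of the extension):

```javascript
// Hypothetical helper: make arbitrary text safe inside a non-raw Python """...""" literal.
function escapeForPythonTripleQuoted(text) {
  return String(text)
    // Escape backslashes first so the backslashes added below are not doubled
    .replace(/\\/g, '\\\\')
    // Escape every double quote so the text can never close the literal early
    .replace(/"/g, '\\"');
}
```

In generateCode() this would wrap `this.container.html` and each `f.value` before they are placed in the template. Python reads `\"` and `\\` inside a triple-quoted string as a literal quote and backslash, so the original text is recovered at runtime.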

File diff suppressed because it is too large


@@ -0,0 +1,253 @@
// Shared utilities for Crawl4AI Chrome Extension
// Make element draggable by its titlebar
function makeDraggable(element) {
let isDragging = false;
let startX, startY, initialX, initialY;
const titlebar = element.querySelector('.c4ai-toolbar-titlebar, .c4ai-titlebar');
if (!titlebar) return;
titlebar.addEventListener('mousedown', (e) => {
// Don't drag if clicking on buttons
if (e.target.classList.contains('c4ai-dot') || e.target.closest('button')) return;
isDragging = true;
startX = e.clientX;
startY = e.clientY;
const rect = element.getBoundingClientRect();
initialX = rect.left;
initialY = rect.top;
element.style.transition = 'none';
titlebar.style.cursor = 'grabbing';
});
document.addEventListener('mousemove', (e) => {
if (!isDragging) return;
const deltaX = e.clientX - startX;
const deltaY = e.clientY - startY;
element.style.left = `${initialX + deltaX}px`;
element.style.top = `${initialY + deltaY}px`;
element.style.right = 'auto';
});
document.addEventListener('mouseup', () => {
if (isDragging) {
isDragging = false;
element.style.transition = '';
titlebar.style.cursor = 'grab';
}
});
}
// Make element draggable by a specific header element
function makeDraggableByHeader(element) {
let isDragging = false;
let startX, startY, initialX, initialY;
const header = element.querySelector('.c4ai-debugger-header');
if (!header) return;
header.addEventListener('mousedown', (e) => {
// Don't drag if clicking on close button
if (e.target.id === 'c4ai-close-debugger' || e.target.closest('#c4ai-close-debugger')) return;
isDragging = true;
startX = e.clientX;
startY = e.clientY;
const rect = element.getBoundingClientRect();
initialX = rect.left;
initialY = rect.top;
element.style.transition = 'none';
header.style.cursor = 'grabbing';
});
document.addEventListener('mousemove', (e) => {
if (!isDragging) return;
const deltaX = e.clientX - startX;
const deltaY = e.clientY - startY;
element.style.left = `${initialX + deltaX}px`;
element.style.top = `${initialY + deltaY}px`;
element.style.right = 'auto';
});
document.addEventListener('mouseup', () => {
if (isDragging) {
isDragging = false;
element.style.transition = '';
header.style.cursor = 'grab';
}
});
}
// Escape HTML for safe display
function escapeHtml(text) {
const div = document.createElement('div');
div.textContent = text;
return div.innerHTML;
}
// Apply syntax highlighting to Python code
function applySyntaxHighlighting(codeElement) {
const code = codeElement.textContent;
// Split by lines to handle line-by-line
const lines = code.split('\n');
const highlightedLines = lines.map(line => {
let highlightedLine = escapeHtml(line);
// Skip if line is empty
if (!highlightedLine.trim()) return highlightedLine;
// Comments (lines starting with #)
if (highlightedLine.trim().startsWith('#')) {
return `<span class="c4ai-comment">${highlightedLine}</span>`;
}
// Triple quoted strings
if (highlightedLine.includes('"""')) {
highlightedLine = highlightedLine.replace(/(""".*?""")/g, '<span class="c4ai-string">$1</span>');
}
// Regular strings - single and double quotes
highlightedLine = highlightedLine.replace(/(["'])([^"']*)\1/g, '<span class="c4ai-string">$1$2$1</span>');
// Keywords - only highlight if not inside a string
const keywords = ['import', 'from', 'async', 'def', 'await', 'try', 'except', 'with', 'as', 'for', 'if', 'else', 'elif', 'return', 'print', 'open', 'and', 'or', 'not', 'in', 'is', 'class', 'self', 'None', 'True', 'False', '__name__', '__main__'];
keywords.forEach(keyword => {
// Use word boundaries plus a negative lookahead to skip text already wrapped in a span (e.g. strings)
const regex = new RegExp(`\\b(${keyword})\\b(?![^<]*</span>)`, 'g');
highlightedLine = highlightedLine.replace(regex, '<span class="c4ai-keyword">$1</span>');
});
// Functions (word followed by parenthesis)
highlightedLine = highlightedLine.replace(/\b([a-zA-Z_]\w*)\s*\(/g, '<span class="c4ai-function">$1</span>(');
return highlightedLine;
});
codeElement.innerHTML = highlightedLines.join('\n');
}
// Apply syntax highlighting to JavaScript code
function applySyntaxHighlightingJS(codeElement) {
const code = codeElement.textContent;
// Split by lines to handle line-by-line
const lines = code.split('\n');
const highlightedLines = lines.map(line => {
let highlightedLine = escapeHtml(line);
// Skip if line is empty
if (!highlightedLine.trim()) return highlightedLine;
// Comments
if (highlightedLine.trim().startsWith('//')) {
return `<span class="c4ai-comment">${highlightedLine}</span>`;
}
// Block comments that open and close on the same line
highlightedLine = highlightedLine.replace(/(\/\*.*?\*\/)/g, '<span class="c4ai-comment">$1</span>');
// Template literals
highlightedLine = highlightedLine.replace(/(`[^`]*`)/g, '<span class="c4ai-string">$1</span>');
// Regular strings - single and double quotes
highlightedLine = highlightedLine.replace(/(["'])([^"']*)\1/g, '<span class="c4ai-string">$1$2$1</span>');
// Keywords
const keywords = ['const', 'let', 'var', 'function', 'async', 'await', 'if', 'else', 'for', 'while', 'do', 'switch', 'case', 'break', 'continue', 'return', 'try', 'catch', 'finally', 'throw', 'new', 'this', 'class', 'extends', 'import', 'export', 'default', 'from', 'null', 'undefined', 'true', 'false'];
keywords.forEach(keyword => {
const regex = new RegExp(`\\b(${keyword})\\b(?![^<]*</span>)`, 'g');
highlightedLine = highlightedLine.replace(regex, '<span class="c4ai-keyword">$1</span>');
});
// Functions and methods
highlightedLine = highlightedLine.replace(/\b([a-zA-Z_$][\w$]*)\s*\(/g, '<span class="c4ai-function">$1</span>(');
// Numbers
highlightedLine = highlightedLine.replace(/\b(\d+)\b/g, '<span class="c4ai-number">$1</span>');
return highlightedLine;
});
codeElement.innerHTML = highlightedLines.join('\n');
}
// Get element selector
function getElementSelector(element) {
// Priority: ID > unique class > tag with position
if (element.id) {
return `#${CSS.escape(element.id)}`;
}
if (element.className && typeof element.className === 'string') {
const classes = element.className.split(/\s+/).filter(c => c && !c.startsWith('c4ai-'));
if (classes.length > 0) {
const selector = `.${CSS.escape(classes[0])}`;
if (document.querySelectorAll(selector).length === 1) {
return selector;
}
}
}
// Build a path selector
const path = [];
let current = element;
while (current && current !== document.body) {
const tagName = current.tagName.toLowerCase();
const parent = current.parentElement;
if (parent) {
const siblings = Array.from(parent.children);
const index = siblings.indexOf(current) + 1;
if (siblings.filter(s => s.tagName === current.tagName).length > 1) {
path.unshift(`${tagName}:nth-child(${index})`);
} else {
path.unshift(tagName);
}
} else {
path.unshift(tagName);
}
current = parent;
}
return path.join(' > ');
}
// Check if element is part of our extension UI
function isOurElement(element) {
return element.classList.contains('c4ai-highlight-box') ||
element.classList.contains('c4ai-toolbar') ||
element.closest('.c4ai-toolbar') ||
element.classList.contains('c4ai-script-toolbar') ||
element.closest('.c4ai-script-toolbar') ||
element.closest('.c4ai-field-dialog') ||
element.closest('.c4ai-code-modal') ||
element.closest('.c4ai-wait-dialog') ||
element.closest('.c4ai-timeline-modal');
}
// Export utilities
window.C4AI_Utils = {
makeDraggable,
makeDraggableByHeader,
escapeHtml,
applySyntaxHighlighting,
applySyntaxHighlightingJS,
getElementSelector,
isOurElement
};
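Review note: the DOM-based escapeHtml() above escapes `&`, `<`, and `>` but leaves quotes untouched, which is fine for text content and not sufficient inside attribute values. A DOM-free sketch that also escapes quotes, under the hypothetical name `escapeHtmlStrict` (not part of the extension):

```javascript
// Hypothetical DOM-free variant; also escapes quotes for use in attribute values.
const HTML_ESCAPES = {
  '&': '&amp;',
  '<': '&lt;',
  '>': '&gt;',
  '"': '&quot;',
  "'": '&#39;'
};

function escapeHtmlStrict(text) {
  // Single pass over the string, so already-inserted entities are never re-escaped
  return String(text).replace(/[&<>"']/g, (ch) => HTML_ESCAPES[ch]);
}
```

This is not a drop-in replacement inside the syntax highlighters: their string regexes match literal quote characters, so they rely on quotes staying unescaped.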


@@ -36,6 +36,21 @@
</div>
</section>
<!-- Cloud Announcement Banner -->
<section class="cloud-banner-section">
<div class="cloud-banner">
<div class="cloud-banner-content">
<div class="cloud-banner-text">
<h3>You don't need Puppeteer. You need Crawl4AI Cloud.</h3>
<p>One API call. JS-rendered. No browser cluster to maintain.</p>
</div>
<button class="cloud-banner-btn" id="joinWaitlistBanner">
Get API Key →
</button>
</div>
</div>
</section>
<!-- Introduction -->
<section class="intro-section">
<div class="terminal-window">
@@ -43,13 +58,17 @@
<span class="terminal-title">About Crawl4AI Assistant</span>
</div>
<div class="terminal-content">
<p>Transform any website into structured data with just a few clicks! The Crawl4AI Assistant Chrome Extension provides two powerful tools for web scraping and automation.</p>
<p>Transform any website into structured data with just a few clicks! The Crawl4AI Assistant Chrome Extension provides three powerful tools for web scraping and data extraction.</p>
<div style="background: #0fbbaa; color: #070708; padding: 12px 16px; border-radius: 8px; margin: 16px 0; font-weight: 600;">
🎉 NEW: Schema Builder now extracts data INSTANTLY without any LLM! Test your schema and see JSON results immediately in the browser!
</div>
<div class="features-grid">
<div class="feature-card">
<span class="feature-icon">🎯</span>
<h3>Schema Builder</h3>
<p>Click to select elements and build extraction schemas visually</p>
<p>Extract data instantly without LLMs - see results in real-time!</p>
</div>
<div class="feature-card">
<span class="feature-icon">🔴</span>
@@ -57,15 +76,15 @@
<p>Record browser actions to create automation scripts</p>
</div>
<div class="feature-card">
<span class="feature-icon">📝</span>
<h3>Click2Crawl <span style="color: #0fbbaa; font-size: 0.75rem;">(New!)</span></h3>
<p>Select multiple elements to extract clean markdown "as you see"</p>
</div>
<!-- <div class="feature-card">
<span class="feature-icon">🐍</span>
<h3>Python Code</h3>
<p>Get production-ready Crawl4AI code instantly</p>
</div>
<div class="feature-card">
<span class="feature-icon">🎨</span>
<h3>Beautiful UI</h3>
<p>Draggable toolbar with macOS-style interface</p>
</div>
</div> -->
</div>
</div>
</div>
@@ -134,6 +153,15 @@
</div>
<div class="tool-status alpha">Alpha</div>
</div>
<div class="tool-selector" data-tool="click2crawl">
<div class="tool-icon">📝</div>
<div class="tool-info">
<h3>Click2Crawl</h3>
<p>Markdown extraction</p>
</div>
<div class="tool-status new">New!</div>
</div>
</div>
<!-- Right Panel - Tool Details -->
@@ -142,7 +170,7 @@
<div class="tool-content active" id="schema-builder">
<div class="tool-header">
<h3>📊 Schema Builder</h3>
<span class="tool-tagline">Click to extract data visually</span>
<span class="tool-tagline">No LLM needed - Extract data instantly!</span>
</div>
<div class="tool-steps">
@@ -150,9 +178,9 @@
<div class="step-number">1</div>
<div class="step-content">
<h4>Select Container</h4>
<p>Click on any repeating element like product cards or articles</p>
<p>Click on any repeating element like product cards or articles. Use up/down navigation to fine-tune selection!</p>
<div class="step-visual">
<span class="highlight-green"></span> Elements highlighted in green
<span class="highlight-green"></span> Container highlighted in green
</div>
</div>
</div>
@@ -160,8 +188,8 @@
<div class="step-item">
<div class="step-number">2</div>
<div class="step-content">
<h4>Mark Fields</h4>
<p>Click on data fields inside the container</p>
<h4>Click Fields to Extract</h4>
<p>Click on data fields inside the container - choose text, links, images, or attributes</p>
<div class="step-visual">
<span class="highlight-pink"></span> Fields highlighted in pink
</div>
@@ -171,19 +199,22 @@
<div class="step-item">
<div class="step-number">3</div>
<div class="step-content">
<h4>Generate & Extract</h4>
<p>Get your CSS selectors and Python code instantly</p>
<h4>Test & Extract Data NOW!</h4>
<p>🎉 Click "Test Schema" to extract ALL matching data instantly - no coding required!</p>
<div class="step-visual">
<span class="highlight-accent"></span> Ready to use code
<span class="highlight-accent"></span> See extracted JSON immediately
</div>
</div>
</div>
</div>
<div class="tool-features">
<div class="feature-tag">No CSS knowledge needed</div>
<div class="feature-tag">Smart selector generation</div>
<div class="feature-tag">LLM-ready schemas</div>
<div class="feature-tag">🚀 Zero LLM dependency</div>
<div class="feature-tag">📊 Instant data extraction</div>
<div class="feature-tag">🎯 Smart selector generation</div>
<div class="feature-tag">🐍 Ready-to-run Python code</div>
<div class="feature-tag">✨ Preview matching elements</div>
<div class="feature-tag">📥 Download JSON results</div>
</div>
</div>
@@ -236,70 +267,190 @@
<div class="feature-tag alpha-tag">Alpha version</div>
</div>
</div>
<!-- Click2Crawl Details -->
<div class="tool-content" id="click2crawl">
<div class="tool-header">
<h3>📝 Click2Crawl</h3>
<span class="tool-tagline">Select multiple elements to extract clean markdown</span>
</div>
<div class="tool-steps">
<div class="step-item">
<div class="step-number">1</div>
<div class="step-content">
<h4>Ctrl/Cmd + Click</h4>
<p>Hold Ctrl/Cmd and click multiple elements you want to extract</p>
<div class="step-visual">
<span class="highlight-green">🔢</span> Numbered selection badges
</div>
</div>
</div>
<div class="step-item">
<div class="step-number">2</div>
<div class="step-content">
<h4>Enable Visual Text Mode</h4>
<p>Extract content "as you see" - clean text without complex HTML structures</p>
<div class="step-visual">
<span class="highlight-accent">👁️</span> Visual Text Mode (As You See)
</div>
</div>
</div>
<div class="step-item">
<div class="step-number">3</div>
<div class="step-content">
<h4>Export Clean Markdown</h4>
<p>Get beautifully formatted markdown ready for documentation or LLMs</p>
<div class="step-visual">
<span class="highlight-pink">📄</span> Clean, readable output
</div>
</div>
</div>
</div>
<div class="tool-features">
<div class="feature-tag">Multi-select with Ctrl/Cmd</div>
<div class="feature-tag">Visual Text Mode</div>
<div class="feature-tag">Smart formatting</div>
<div class="feature-tag">Cloud export (soon)</div>
</div>
</div>
</div>
</div>
</section>
<!-- Interactive Code Examples -->
<section class="code-showcase">
<h2>See the Generated Code</h2>
<h2>See the Generated Code & Extracted Data</h2>
<div class="code-tabs">
<button class="code-tab active" data-example="schema">📊 Schema Builder</button>
<button class="code-tab" data-example="script">🔴 Script Builder</button>
<button class="code-tab" data-example="markdown">📝 Click2Crawl</button>
</div>
<div class="code-examples">
<!-- Schema Builder Code -->
<div class="code-example active" id="code-schema">
<div class="terminal-window">
<div class="terminal-header">
<span class="terminal-title">schema_extraction.py</span>
<button class="copy-button" data-code="schema">Copy</button>
</div>
<div class="terminal-content">
<pre><code><span class="keyword">import</span> asyncio
<div style="display: grid; grid-template-columns: 1fr 1fr; gap: 16px;">
<!-- Python Code -->
<div class="terminal-window">
<div class="terminal-header">
<span class="terminal-title">schema_extraction.py</span>
<button class="copy-button" data-code="schema-python">Copy</button>
</div>
<div class="terminal-content">
<pre><code><span class="comment">#!/usr/bin/env python3</span>
<span class="comment">"""
🎉 NO LLM NEEDED! Direct extraction with CSS selectors
Generated by Crawl4AI Chrome Extension
"""</span>
<span class="keyword">import</span> asyncio
<span class="keyword">import</span> json
<span class="keyword">from</span> crawl4ai <span class="keyword">import</span> AsyncWebCrawler, CrawlerRunConfig
<span class="keyword">from</span> crawl4ai <span class="keyword">import</span> AsyncWebCrawler, BrowserConfig, CrawlerRunConfig
<span class="keyword">from</span> crawl4ai.extraction_strategy <span class="keyword">import</span> JsonCssExtractionStrategy
<span class="keyword">async</span> <span class="keyword">def</span> <span class="function">extract_products</span>():
<span class="comment"># Schema generated from your visual selection</span>
schema = {
<span class="string">"name"</span>: <span class="string">"Product Catalog"</span>,
<span class="string">"baseSelector"</span>: <span class="string">"div.product-card"</span>, <span class="comment"># Container you clicked</span>
<span class="string">"fields"</span>: [
{
<span class="string">"name"</span>: <span class="string">"title"</span>,
<span class="string">"selector"</span>: <span class="string">"h3.product-title"</span>,
<span class="string">"type"</span>: <span class="string">"text"</span>
},
{
<span class="string">"name"</span>: <span class="string">"price"</span>,
<span class="string">"selector"</span>: <span class="string">"span.price"</span>,
<span class="string">"type"</span>: <span class="string">"text"</span>
},
{
<span class="string">"name"</span>: <span class="string">"image"</span>,
<span class="string">"selector"</span>: <span class="string">"img.product-img"</span>,
<span class="string">"type"</span>: <span class="string">"attribute"</span>,
<span class="string">"attribute"</span>: <span class="string">"src"</span>
}
]
}
<span class="comment"># The EXACT schema from your visual clicks - no guessing!</span>
EXTRACTION_SCHEMA = {
<span class="string">"name"</span>: <span class="string">"Product Catalog"</span>,
<span class="string">"baseSelector"</span>: <span class="string">"div.product-card"</span>, <span class="comment"># The container you selected</span>
<span class="string">"fields"</span>: [
{
<span class="string">"name"</span>: <span class="string">"title"</span>,
<span class="string">"selector"</span>: <span class="string">"h3.product-title"</span>,
<span class="string">"type"</span>: <span class="string">"text"</span>
},
{
<span class="string">"name"</span>: <span class="string">"price"</span>,
<span class="string">"selector"</span>: <span class="string">"span.price"</span>,
<span class="string">"type"</span>: <span class="string">"text"</span>
},
{
<span class="string">"name"</span>: <span class="string">"image"</span>,
<span class="string">"selector"</span>: <span class="string">"img.product-img"</span>,
<span class="string">"type"</span>: <span class="string">"attribute"</span>,
<span class="string">"attribute"</span>: <span class="string">"src"</span>
},
{
<span class="string">"name"</span>: <span class="string">"link"</span>,
<span class="string">"selector"</span>: <span class="string">"a.product-link"</span>,
<span class="string">"type"</span>: <span class="string">"attribute"</span>,
<span class="string">"attribute"</span>: <span class="string">"href"</span>
}
]
}
config = CrawlerRunConfig(
extraction_strategy=JsonCssExtractionStrategy(schema)
)
<span class="keyword">async</span> <span class="keyword">def</span> <span class="function">extract_data</span>(url: str):
<span class="comment"># Direct extraction - no LLM API calls!</span>
extraction_strategy = JsonCssExtractionStrategy(schema=EXTRACTION_SCHEMA)
<span class="keyword">async</span> <span class="keyword">with</span> AsyncWebCrawler() <span class="keyword">as</span> crawler:
result = <span class="keyword">await</span> crawler.arun(
url=<span class="string">"https://example.com/products"</span>,
config=config
url=url,
config=CrawlerRunConfig(extraction_strategy=extraction_strategy)
)
<span class="keyword">return</span> json.loads(result.extracted_content)
<span class="keyword">if</span> result.success:
data = json.loads(result.extracted_content)
<span class="keyword">print</span>(<span class="string">f"✅ Extracted {len(data)} items instantly!"</span>)
<span class="comment"># Save to file</span>
<span class="keyword">with</span> open(<span class="string">'products.json'</span>, <span class="string">'w'</span>) <span class="keyword">as</span> f:
json.dump(data, f, indent=2)
<span class="keyword">return</span> data
asyncio.run(extract_products())</code></pre>
<span class="comment"># Run extraction on any similar page!</span>
data = asyncio.run(extract_data(<span class="string">"https://example.com/products"</span>))
<span class="comment"># 🎯 Result: Clean JSON data, no LLM costs, instant results!</span></code></pre>
</div>
</div>
<!-- Extracted JSON Data -->
<div class="terminal-window">
<div class="terminal-header">
<span class="terminal-title">extracted_data.json</span>
<button class="copy-button" data-code="schema-json">Copy</button>
</div>
<div class="terminal-content">
<pre><code><span class="comment">// 🎉 Instantly extracted from the page - no coding required!</span>
[
{
<span class="string">"title"</span>: <span class="string">"Wireless Bluetooth Headphones"</span>,
<span class="string">"price"</span>: <span class="string">"$79.99"</span>,
<span class="string">"image"</span>: <span class="string">"https://example.com/images/headphones-bt-01.jpg"</span>,
<span class="string">"link"</span>: <span class="string">"/products/wireless-bluetooth-headphones"</span>
},
{
<span class="string">"title"</span>: <span class="string">"Smart Watch Pro 2024"</span>,
<span class="string">"price"</span>: <span class="string">"$299.00"</span>,
<span class="string">"image"</span>: <span class="string">"https://example.com/images/smartwatch-pro.jpg"</span>,
<span class="string">"link"</span>: <span class="string">"/products/smart-watch-pro-2024"</span>
},
{
<span class="string">"title"</span>: <span class="string">"4K Webcam for Streaming"</span>,
<span class="string">"price"</span>: <span class="string">"$149.99"</span>,
<span class="string">"image"</span>: <span class="string">"https://example.com/images/webcam-4k.jpg"</span>,
<span class="string">"link"</span>: <span class="string">"/products/4k-webcam-streaming"</span>
},
{
<span class="string">"title"</span>: <span class="string">"Mechanical Gaming Keyboard RGB"</span>,
<span class="string">"price"</span>: <span class="string">"$129.99"</span>,
<span class="string">"image"</span>: <span class="string">"https://example.com/images/keyboard-gaming.jpg"</span>,
<span class="string">"link"</span>: <span class="string">"/products/mechanical-gaming-keyboard"</span>
},
{
<span class="string">"title"</span>: <span class="string">"USB-C Hub 7-in-1"</span>,
<span class="string">"price"</span>: <span class="string">"$45.99"</span>,
<span class="string">"image"</span>: <span class="string">"https://example.com/images/usbc-hub.jpg"</span>,
<span class="string">"link"</span>: <span class="string">"/products/usb-c-hub-7in1"</span>
}
]</code></pre>
</div>
</div>
</div>
</div>
@@ -363,32 +514,181 @@ asyncio.run(automate_shopping())</code></pre>
</div>
</div>
</div>
<!-- Click2Crawl Markdown Output -->
<div class="code-example" id="code-markdown">
<div class="terminal-window">
<div class="terminal-header">
<span class="terminal-title">extracted_content.md</span>
<button class="copy-button" data-code="markdown">Copy</button>
</div>
<div class="terminal-content">
<pre><code><span class="comment"># Extracted from Hacker News with Visual Text Mode 👁️</span>
<span class="string">1. **Show HN: I built a tool to find and reach out to YouTubers** (hellosimply.io)
84 points by erickim 2 hours ago | hide | 31 comments
2. **The 24 Hour Restaurant** (logicmag.io)
124 points by helsinkiandrew 5 hours ago | hide | 52 comments
3. **Building a Better Bloom Filter in Rust** (carlmastrangelo.com)
89 points by carlmastrangelo 3 hours ago | hide | 27 comments
---
### Article: The 24 Hour Restaurant
In New York City, the 24-hour restaurant is becoming extinct. What we lose when we can no longer eat whenever we want.
When I first moved to New York, I loved that I could get a full meal at 3 AM. Not just pizza or fast food, but a proper sit-down dinner with table service and a menu that ran for pages. The city that never sleeps had restaurants that matched its rhythm.
Today, finding a 24-hour restaurant in Manhattan requires genuine effort. The pandemic accelerated a decline that was already underway, but the roots go deeper: rising rents, changing labor laws, and shifting cultural patterns have all contributed to the death of round-the-clock dining.
---
### Product Review: Framework Laptop 16
**Specifications:**
- Display: 16" 2560×1600 165Hz
- Processor: AMD Ryzen 7 7840HS
- Memory: 32GB DDR5-5600
- Storage: 2TB NVMe Gen4
- Price: Starting at $1,399
**Pros:**
- Fully modular and repairable
- Excellent Linux support
- Great keyboard and trackpad
- Expansion card system
**Cons:**
- Battery life could be better
- Slightly heavier than competitors
- Fan noise under load</span></code></pre>
</div>
</div>
</div>
</div>
</section>
<!-- Coming Soon Section -->
<section class="coming-soon-section">
<h2>Coming Soon: Even More Power</h2>
<div class="terminal-window">
<div class="terminal-header">
<span class="terminal-title">Future Features</span>
<!-- Crawl4AI Cloud Section -->
<section class="cloud-section">
<div class="cloud-announcement">
<h2>Crawl4AI Cloud</h2>
<p class="cloud-tagline">Your browser cluster without the cluster.</p>
<div class="cloud-features-preview">
<div class="cloud-feature-item">
⚡ POST /crawl
</div>
<div class="cloud-feature-item">
🌐 JS-rendered pages
</div>
<div class="cloud-feature-item">
📊 Schema extraction built-in
</div>
<div class="cloud-feature-item">
💰 $0.001/page
</div>
</div>
<div class="terminal-content">
<p class="intro-text">We're continuously expanding C4AI Assistant with powerful new features to make web scraping even easier:</p>
<button class="cloud-cta-button" id="joinWaitlist">
Get Early Access →
</button>
<p class="cloud-hint">See it extract your own data. Right now.</p>
</div>
<!-- Hidden Signup Form -->
<div class="signup-overlay" id="signupOverlay">
<div class="signup-container" id="signupContainer">
<button class="close-signup" id="closeSignup">×</button>
<div class="coming-features">
<div class="coming-feature">
<div class="feature-header">
<span class="feature-badge">Cloud</span>
<h3>Run on C4AI Cloud</h3>
<div class="signup-content" id="signupForm">
<h3>🚀 Join C4AI Cloud Waiting List</h3>
<p>Be among the first to experience the future of web scraping</p>
<form id="waitlistForm" class="waitlist-form">
<div class="form-field">
<label for="userName">Your Name</label>
<input type="text" id="userName" name="name" placeholder="John Doe" required>
</div>
<p>Execute your extraction directly in the cloud without setting up any local environment. Just click "Run on Cloud" and get your data instantly.</p>
<div class="feature-preview">
<code>☁️ Instant results • Auto-scaling</code>
<div class="form-field">
<label for="userEmail">Email Address</label>
<input type="email" id="userEmail" name="email" placeholder="john@example.com" required>
</div>
<div class="form-field">
<label for="userCompany">Company (Optional)</label>
<input type="text" id="userCompany" name="company" placeholder="Acme Inc.">
</div>
<div class="form-field">
<label for="useCase">What will you use Crawl4AI Cloud for?</label>
<select id="useCase" name="useCase">
<option value="">Select use case...</option>
<option value="price-monitoring">Price Monitoring</option>
<option value="news-aggregation">News Aggregation</option>
<option value="market-research">Market Research</option>
<option value="ai-training">AI Training Data</option>
<option value="other">Other</option>
</select>
</div>
<button type="submit" class="submit-button">
<span>🎯</span> Submit & Watch the Magic
</button>
</form>
</div>
<!-- Crawling Animation -->
<div class="crawl-animation" id="crawlAnimation" style="display: none;">
<div class="terminal-window crawl-terminal">
<div class="terminal-header">
<span class="terminal-title">Crawl4AI Cloud Demo</span>
</div>
<div class="terminal-content">
<pre id="crawlOutput" class="crawl-log"><code>$ crawl4ai cloud extract --url "signup-form" --auto-detect</code></pre>
</div>
</div>
<div class="extracted-preview" id="extractedPreview" style="display: none;">
<h4>📊 Extracted Data</h4>
<pre class="json-preview"><code id="jsonOutput"></code></pre>
</div>
<div class="success-message" id="successMessage" style="display: none;">
<div class="success-icon"></div>
<h3>Data Uploaded Successfully!</h3>
<p>You're on the Crawl4AI Cloud waiting list!</p>
<p>What you just witnessed:</p>
<ul>
<li>⚡ Real-time extraction of your form data</li>
<li>🔄 Automatic schema detection</li>
<li>📤 Instant cloud processing</li>
<li>✨ No code required - just like that!</li>
</ul>
<p class="success-note">We'll notify you at <strong id="userEmailDisplay"></strong> when Crawl4AI Cloud launches!</p>
<button class="continue-button" id="continueBtn">Continue Exploring</button>
</div>
</div>
</div>
</div>
</section>
<!-- Coming Soon Section -->
<section class="coming-soon-section">
<h2>More Features Coming Soon</h2>
<div class="terminal-window">
<div class="terminal-header">
<span class="terminal-title">Roadmap</span>
</div>
<div class="terminal-content">
<p class="intro-text">We're continuously expanding C4AI Assistant with powerful new features:</p>
<div class="coming-features">
<div class="coming-feature">
<div class="feature-header">
<span class="feature-badge">Direct</span>
@@ -482,8 +782,19 @@ asyncio.run(automate_shopping())</code></pre>
document.querySelectorAll('.copy-button').forEach(button => {
button.addEventListener('click', async function() {
const codeType = this.getAttribute('data-code');
const codeElement = document.getElementById('code-' + codeType).querySelector('pre code');
const codeText = codeElement.textContent;
let codeText = '';
// Handle different code types
if (codeType === 'schema-python') {
const codeElement = document.querySelector('#code-schema .terminal-window:first-child pre code');
codeText = codeElement.textContent;
} else if (codeType === 'schema-json') {
const codeElement = document.querySelector('#code-schema .terminal-window:last-child pre code');
codeText = codeElement.textContent;
} else {
const codeElement = document.getElementById('code-' + codeType).querySelector('pre code');
codeText = codeElement.textContent;
}
try {
await navigator.clipboard.writeText(codeText);
@@ -499,6 +810,161 @@ asyncio.run(automate_shopping())</code></pre>
}
});
});
// Crawl4AI Cloud Interactive Demo
const joinWaitlistBtn = document.getElementById('joinWaitlist');
const signupOverlay = document.getElementById('signupOverlay');
const closeSignupBtn = document.getElementById('closeSignup');
const waitlistForm = document.getElementById('waitlistForm');
const signupForm = document.getElementById('signupForm');
const crawlAnimation = document.getElementById('crawlAnimation');
const crawlOutput = document.getElementById('crawlOutput');
const extractedPreview = document.getElementById('extractedPreview');
const jsonOutput = document.getElementById('jsonOutput');
const successMessage = document.getElementById('successMessage');
const continueBtn = document.getElementById('continueBtn');
const userEmailDisplay = document.getElementById('userEmailDisplay');
// Open signup modal
joinWaitlistBtn.addEventListener('click', () => {
signupOverlay.classList.add('active');
});
// Banner button
const joinWaitlistBannerBtn = document.getElementById('joinWaitlistBanner');
if (joinWaitlistBannerBtn) {
joinWaitlistBannerBtn.addEventListener('click', () => {
signupOverlay.classList.add('active');
});
}
// Close signup modal
closeSignupBtn.addEventListener('click', () => {
signupOverlay.classList.remove('active');
});
// Close on overlay click
signupOverlay.addEventListener('click', (e) => {
if (e.target === signupOverlay) {
signupOverlay.classList.remove('active');
}
});
// Continue button
if (continueBtn) {
continueBtn.addEventListener('click', () => {
signupOverlay.classList.remove('active');
// Reset form for next time
waitlistForm.reset();
signupForm.style.display = 'block';
crawlAnimation.style.display = 'none';
extractedPreview.style.display = 'none';
successMessage.style.display = 'none';
});
}
// Form submission with crawling animation
waitlistForm.addEventListener('submit', async (e) => {
e.preventDefault();
// Get form data
const formData = {
name: document.getElementById('userName').value,
email: document.getElementById('userEmail').value,
company: document.getElementById('userCompany').value || 'Not specified',
useCase: document.getElementById('useCase').value || 'General web scraping',
timestamp: new Date().toISOString(),
source: 'Crawl4AI Assistant Landing Page'
};
// Update email display
userEmailDisplay.textContent = formData.email;
// Hide form and show crawling animation
signupForm.style.display = 'none';
crawlAnimation.style.display = 'block';
// Clear previous output
const codeElement = crawlOutput.querySelector('code');
codeElement.innerHTML = '$ crawl4ai cloud extract --url "signup-form" --auto-detect\n\n';
// Simulate crawling process with proper C4AI log format
const crawlSteps = [
{
log: '<span class="log-init">[INIT]....</span> → Crawl4AI Cloud 1.0.0',
time: '0.12s'
},
{
log: '<span class="log-fetch">[FETCH]...</span> ↓ https://crawl4ai.com/waitlist-form',
time: '0.45s'
},
{
log: '<span class="log-scrape">[SCRAPE]..</span> ◆ https://crawl4ai.com/waitlist-form',
time: '0.28s'
},
{
log: '<span class="log-extract">[EXTRACT].</span> ■ Extracting form data with auto-detect',
time: '0.55s'
},
{
log: '<span class="log-complete">[COMPLETE]</span> ● https://crawl4ai.com/waitlist-form',
time: '1.40s'
}
];
let stepIndex = 0;
const typeStep = async () => {
if (stepIndex < crawlSteps.length) {
const step = crawlSteps[stepIndex];
codeElement.innerHTML += step.log + ' | <span class="log-success">✓</span> | <span class="log-time">⏱: ' + step.time + '</span>\n';
stepIndex++;
// Scroll to bottom
const terminal = crawlOutput.parentElement;
terminal.scrollTop = terminal.scrollHeight;
setTimeout(typeStep, 600);
} else {
// Show extracted data
setTimeout(() => {
codeElement.innerHTML += '\n<span class="log-success">[UPLOAD]..</span> ↑ Uploading to Crawl4AI Cloud...';
setTimeout(() => {
extractedPreview.style.display = 'block';
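// Escape HTML first so user-entered values cannot inject markup via innerHTML
const escapedJson = JSON.stringify(formData, null, 2)
.replace(/&/g, '&amp;')
.replace(/</g, '&lt;')
.replace(/>/g, '&gt;');
// Add syntax highlighting
jsonOutput.innerHTML = escapedJson
.replace(/"([^"]+)":/g, '<span class="string">"$1"</span>:')
.replace(/: "([^"]+)"/g, ': <span class="string">"$1"</span>');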
codeElement.innerHTML += ' | <span class="log-success">✓</span> | <span class="log-time">⏱: 0.23s</span>\n';
codeElement.innerHTML += '\n<span class="log-success">[SUCCESS]</span> ✨ Data uploaded successfully!';
// Show success message after a delay
setTimeout(() => {
successMessage.style.display = 'block';
// Smooth scroll to bottom to show success message
setTimeout(() => {
const container = document.getElementById('signupContainer');
container.scrollTo({
top: container.scrollHeight,
behavior: 'smooth'
});
}, 100);
// TODO: send formData to the waitlist backend; for now it is only logged
console.log('Waitlist submission:', formData);
}, 1500);
}, 800);
}, 600);
}
};
// Start the animation
setTimeout(typeStep, 500);
});
</script>
</body>
</html>

File diff suppressed because one or more lines are too long

@@ -22,7 +22,16 @@
"content_scripts": [
{
"matches": ["<all_urls>"],
"js": ["content/content.js"],
"js": [
"libs/marked.min.js",
"content/shared/utils.js",
"content/schemaBuilder.js",
"content/scriptBuilder.js",
"content/contentAnalyzer.js",
"content/markdownConverter.js",
"content/click2CrawlBuilder.js",
"content/content.js"
],
"css": ["content/overlay.css"],
"run_at": "document_idle"
}

@@ -145,6 +145,10 @@ header h1 {
background: #3a1e5f;
}
.mode-button.c2c .icon {
background: #1e5f3a;
}
.mode-info h3 {
font-size: 16px;
color: #fff;

@@ -37,6 +37,14 @@
<p>Record actions to build automation scripts</p>
</div>
</button>
<button id="c2c-mode" class="mode-button c2c">
<div class="icon"></div>
<div class="mode-info">
<h3>Click2Crawl</h3>
<p>Select elements and convert to clean markdown</p>
</div>
</button>
</div>
<div id="active-session" class="active-session hidden">

@@ -22,6 +22,10 @@ document.addEventListener('DOMContentLoaded', () => {
startScriptCapture();
});
document.getElementById('c2c-mode').addEventListener('click', () => {
startClick2Crawl();
});
// Session actions
document.getElementById('generate-code').addEventListener('click', () => {
generateCode();
@@ -79,6 +83,19 @@ function startScriptCapture() {
});
}
function startClick2Crawl() {
chrome.tabs.query({ active: true, currentWindow: true }, (tabs) => {
chrome.tabs.sendMessage(tabs[0].id, {
action: 'startClick2Crawl'
}, (response) => {
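// Reading lastError avoids an unchecked-error warning when no content script is listening
if (!chrome.runtime.lastError && response && response.success) {
// Close the popup so the user can interact with the page
window.close();
}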
});
});
}
function showActiveSession(stats) {
document.querySelector('.mode-selector').style.display = 'none';
document.getElementById('active-session').classList.remove('hidden');
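
Note on the receiving side of `startClick2Crawl()`: the popup only closes when the content script answers with `{ success: true }`. The content-script handler is not shown in this hunk, so the following is a minimal sketch of what such a listener might look like, not the code from this commit. `handlePopupMessage` and the `startBuilder` callback are hypothetical names; only the `'startClick2Crawl'` action string and the `success` field come from the popup code above.

```javascript
// Hypothetical content-script handler for the popup's 'startClick2Crawl' message.
// `startBuilder` stands in for whatever entry point click2CrawlBuilder.js exposes.
function handlePopupMessage(message, startBuilder) {
  if (!message || message.action !== 'startClick2Crawl') {
    return null; // not our message; leave it for other listeners
  }
  try {
    startBuilder();
    return { success: true };
  } catch (error) {
    // Report the failure so the popup stays open instead of closing silently
    return { success: false, error: String((error && error.message) || error) };
  }
}

// Register the listener only when running inside an extension context.
if (typeof chrome !== 'undefined' && chrome.runtime && chrome.runtime.onMessage) {
  chrome.runtime.onMessage.addListener((message, sender, sendResponse) => {
    const reply = handlePopupMessage(message, () => {
      // Assumption: replace this with the real Click2Crawl start call.
      console.log('Click2Crawl started');
    });
    if (reply) {
      sendResponse(reply);
    }
  });
}
```

Keeping the decision logic in a plain function, separate from the `chrome.runtime.onMessage` registration, makes it possible to exercise the handler outside the browser.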

@@ -18,9 +18,14 @@ const components = [
description: 'Browser and crawler configuration'
},
{
id: 'extraction',
name: 'Data Extraction',
description: 'Structured data extraction strategies'
id: 'extraction-llm',
name: 'Data Extraction Using LLM',
description: 'Structured data extraction strategies using LLMs'
},
{
id: 'extraction-no-llm',
name: 'Data Extraction Without LLM',
description: 'Structured data extraction strategies without LLMs'
},
{
id: 'multi_urls_crawling',