Commit Graph

281 Commits

Author SHA1 Message Date
UncleCode
40640badad feat: add Script Builder to Chrome Extension and reorganize LLM context files
This commit introduces significant enhancements to the Crawl4AI ecosystem:

  Chrome Extension - Script Builder (Alpha):
  - Add recording functionality to capture user interactions (clicks, typing, scrolling)
  - Implement smart event grouping for cleaner script generation
  - Support export to both JavaScript and C4A script formats
  - Add timeline view for visualizing and editing recorded actions
  - Include wait commands (time-based and element-based)
  - Add saved flows functionality for reusing automation scripts
  - Update UI with consistent dark terminal theme (Dank Mono font, green/pink accents)
  - Release new extension versions: v1.1.0, v1.2.0, v1.2.1

  LLM Context Builder Improvements:
  - Reorganize context files from llmtxt/ to llm.txt/ with better structure
  - Separate diagram templates from text content (diagrams/ and txt/ subdirectories)
  - Add comprehensive context files for all major Crawl4AI components
  - Improve file naming convention for better discoverability

  Documentation Updates:
  - Update apps index page to match main documentation theme
  - Standardize color scheme: "Available" tags use primary color (#50ffff)
  - Change "Coming Soon" tags to dark gray for better visual hierarchy
  - Add interactive two-column layout for extension landing page
  - Include code examples for both Schema Builder and Script Builder features

  Technical Improvements:
  - Enhance event capture mechanism with better element selection
  - Add support for contenteditable elements and complex form interactions
  - Implement proper scroll event handling for both window and element scrolling
  - Add meta key support for keyboard shortcuts
  - Improve selector generation for more reliable element targeting

  The Script Builder is released as Alpha, acknowledging potential bugs while providing
  early access to this powerful automation recording feature.
2025-06-08 22:02:12 +08:00
UncleCode
926592649e Add Crawl4AI Assistant Chrome Extension
- Created manifest.json for the Crawl4AI Assistant extension.
- Added popup HTML, CSS, and JS files for the extension interface.
- Included icons and favicon for the extension.
- Implemented functionality for schema capture and code generation.
- Updated index.md to reflect the availability of the new extension.
- Enhanced LLM Context Builder layout and styles for consistency.
- Adjusted global styles for better branding and responsiveness.
2025-06-08 18:34:05 +08:00
UncleCode
6f3a0ea38e Create "Apps" section in documentation and Add interactive c4a-script playground and LLM context builder for Crawl4AI
- Created a new HTML page (`index.html`) for the interactive LLM context builder, allowing users to select and combine different `crawl4ai` context files.
- Implemented JavaScript functionality (`llmtxt.js`) to manage component selection, context types, and file downloads.
- Added CSS styles (`llmtxt.css`) for a terminal-themed UI.
- Introduced a new Markdown file (`build.md`) detailing the requirements and functionality of the context builder.
- Updated the navigation in `mkdocs.yml` to include links to the new context builder and demo apps.
- Added a new Markdown file (`why.md`) explaining the motivation behind the new context structure and its benefits for AI coding assistants.
2025-06-08 15:48:17 +08:00
UncleCode
08a2cdae53 Add C4A-Script support and documentation
- Generate OneShot js code geenrator
- Introduced a new C4A-Script tutorial example for login flow using Blockly.
- Updated index.html to include Blockly theme and event editor modal for script editing.
- Created a test HTML file for testing Blockly integration.
- Added comprehensive C4A-Script API reference documentation covering commands, syntax, and examples.
- Developed core documentation for C4A-Script, detailing its features, commands, and real-world examples.
- Updated mkdocs.yml to include new C4A-Script documentation in navigation.
2025-06-07 23:07:19 +08:00
UncleCode
ca03acbc82 Add some new commands for the Crawl4ai script transpiler and creating an interactive tutorial that allows users to go through multiple steps and apply the syntax to automate the page. Fixed some issues and add several new commands for setting input values, variables, clearing input fields, and more. 2025-06-06 23:03:26 +08:00
UncleCode
3f6f2e998c feat(script): add new scripting capabilities and documentation
This commit introduces a comprehensive set of new scripts and examples to enhance the scripting capabilities of the crawl4ai project. The changes include the addition of several Python scripts for compiling and executing scripts, as well as a variety of example scripts demonstrating different functionalities such as login flows, data extraction, and multi-step workflows. Additionally, detailed documentation has been created to guide users on how to utilize these new features effectively.

The following significant modifications were made:
- Added core scripting files: , , and .
- Created a new documentation file  to provide an overview of the new features.
- Introduced multiple example scripts in the  directory to showcase various use cases.
- Updated  and  to integrate the new functionalities.
- Added font assets for improved documentation presentation.

These changes significantly expand the functionality of the crawl4ai project, allowing users to create more complex and varied scripts with ease.
2025-06-06 17:16:53 +08:00
UncleCode
e731596315 docs(tutorial_url_seeder): refine summary and next steps, enhance agentic design patterns section 2025-06-05 16:20:58 +08:00
UncleCode
641526af81 docs(tutorial_url_seeder): add advanced agentic patterns and implementation examples 2025-06-05 16:07:05 +08:00
UncleCode
82a25c037a feat(async_url_seeder): add smart URL filtering to exclude nonsense URLs
This update introduces a new feature in the URL seeding process that allows for the automatic filtering of utility URLs, such as robots.txt and sitemap.xml, which are not useful for content crawling. The  class has been enhanced with a new parameter, , which is enabled by default. This change aims to improve the efficiency of the crawling process by reducing the number of irrelevant URLs processed.

Significant modifications include:
- Added  parameter to  in .
- Implemented logic in  to check and filter out nonsense URLs during the seeding process in .
- Updated documentation to reflect the new filtering feature and provide examples of its usage in .

This change enhances the overall functionality of the URL seeder, making it smarter and more efficient in identifying and excluding non-content URLs.

BREAKING CHANGE: The  now requires the  parameter to be explicitly set if the default behavior is to be altered.

Related issues: #123
2025-06-05 15:46:24 +08:00
UncleCode
c6fc5c0518 docs(linkdin, url_seeder): update and reorganize LinkedIn data discovery and URL seeder documentation
This commit introduces significant updates to the LinkedIn data discovery documentation by adding two new Jupyter notebooks that provide detailed insights into data discovery processes. The previous workshop notebook has been removed to streamline the content and avoid redundancy. Additionally, the URL seeder documentation has been expanded with a new tutorial and several enhancements to existing scripts, improving usability and clarity.

The changes include:
- Added  and  for comprehensive LinkedIn data discovery.
- Removed  to eliminate outdated content.
- Updated  to reflect new data visualization requirements.
- Introduced  and  to facilitate easier access to URL seeding techniques.
- Enhanced existing Python scripts and markdown files in the URL seeder section for better documentation and examples.

These changes aim to improve the overall documentation quality and user experience for developers working with LinkedIn data and URL seeding techniques.
2025-06-05 15:06:25 +08:00
UncleCode
b5c2732f88 Add BBC Sp0ort Research Assistant pipeline example
- Implemented a comprehensive research pipeline using URLSeeder.
- Steps include user query input, optional LLM enhancement, URL discovery and ranking, content crawling, and synthesis generation.
- Introduced caching mechanism for enhanced query results and crawled content.
- Configurable settings for testing and production modes.
- Output results in JSON and Markdown formats with detailed research insights and citations.
2025-06-04 23:23:21 +08:00
UncleCode
09fd3e152a fix: Import os and adjust file saving path in URL seeder demo 2025-06-03 23:34:11 +08:00
UncleCode
3048cc1ff9 feat: Add AsyncUrlSeeder for intelligent URL discovery and filtering
This commit introduces AsyncUrlSeeder, a high-performance URL discovery system that enables intelligent crawling at scale by pre-discovering and filtering URLs before crawling.

## Core Features

### AsyncUrlSeeder Component
- Discovers URLs from multiple sources:
  - Sitemaps (including nested and gzipped)
  - Common Crawl index
  - Combined sources for maximum coverage
- Extracts page metadata without full crawling:
  - Title, description, keywords
  - Open Graph and Twitter Card tags
  - JSON-LD structured data
  - Language and charset information
- BM25 relevance scoring for intelligent filtering:
  - Query-based URL discovery
  - Configurable score thresholds
  - Automatic ranking by relevance
- Performance optimizations:
  - Async/concurrent processing with configurable workers
  - Rate limiting (hits per second)
  - Automatic caching with TTL
  - Streaming results for large datasets

### SeedingConfig
- Comprehensive configuration for URL seeding:
  - Source selection (sitemap, cc, or both)
  - URL pattern filtering with wildcards
  - Live URL validation options
  - Metadata extraction controls
  - BM25 scoring parameters
  - Concurrency and rate limiting

### Integration with AsyncWebCrawler
- Seamless pipeline: discover → filter → crawl
- Direct compatibility with arun_many()
- Significant resource savings by pre-filtering URLs

## Documentation
- Comprehensive guide comparing URL seeding vs deep crawling
- Complete API reference with parameter tables
- Practical examples showing all features
- Performance benchmarks and best practices
- Integration patterns with AsyncWebCrawler

## Examples
- url_seeder_demo.py: Interactive Rich-based demo with:
  - Basic discovery
  - Cache management
  - Live validation
  - BM25 scoring
  - Multi-domain discovery
  - Complete pipeline integration
- url_seeder_quick_demo.py: Screenshot-friendly examples:
  - Pattern-based filtering
  - Metadata exploration
  - Smart search with BM25

## Testing
- Comprehensive test suite (test_async_url_seeder_bm25.py)
- Coverage of all major features
- Edge cases and error handling
- Performance and consistency tests

## Implementation Details
- Built on httpx with HTTP/2 support
- Optional dependencies: lxml, brotli, rank_bm25
- Cache management in ~/.crawl4ai/seeder_cache/
- Logger integration with AsyncLoggerBase
- Proper error handling and retry logic

## Bug Fixes
- Fixed logger color compatibility (lightblack → bright_black)
- Corrected URL extraction from seeder results for arun_many()
- Updated all examples and documentation with proper usage

This feature enables users to crawl smarter, not harder, by discovering
and analyzing URLs before committing resources to crawling them.
2025-06-03 23:27:12 +08:00
UncleCode
3b766e1aac Add Google Colab button to LinkedIn Prospect Wizard README
- Added Colab badge linking to the demo notebook
- Added call-to-action encouraging users to try the demo in Colab
- Provides zero-setup cloud environment for testing

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-05-26 14:35:06 +08:00
UncleCode
c3b7b7e918 Add linkedin example ipynb. 2025-05-25 17:55:22 +08:00
UncleCode
1fc45ffac8 Fix temperature typo and enhance LinkedIn extraction with Colab support
- Fixed widespread typo: `temprature` → `temperature` across LLMConfig and related files
- Enhanced CSS/XPath selector guidance for more reliable LinkedIn data extraction
- Added Google Colab display server support for running Crawl4AI in notebook environments
- Improved browser debugging with verbose startup args logging
- Updated LinkedIn schemas and HTML snippets for better parsing accuracy

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-05-25 16:47:12 +08:00
devin-ai-integration[bot]
9c2cc7f73c Fix BM25ContentFilter documentation to use language parameter instead of use_stemming (#1152)
Co-authored-by: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Co-authored-by: UncleCode <unclecode@kidocode.com>
2025-05-25 10:02:13 +08:00
UncleCode
1c5e76d51a Adjust positioning and set only core component as selected item by default 2025-05-24 20:49:44 +08:00
UncleCode
7665a6832f Add LLMContext article and updte JS to not show all components. 2025-05-24 20:46:24 +08:00
UncleCode
a06710ff03 Adding LLMContext generator to website. 2025-05-24 20:37:09 +08:00
UncleCode
becc4624bb feat(favicon): add new favicon images for improved branding 2025-05-17 19:03:51 +08:00
UncleCode
ac9981a1f5 feat(favicon): add favicon image and update mkdocs configuration 2025-05-16 21:59:23 +08:00
UncleCode
83ef15fd47 feat(favicon): add favicon.ico for improved branding 2025-05-16 21:55:07 +08:00
UncleCode
a3cb938675 feat(theme): enable dark color mode in mkdocs configuration 2025-05-16 21:44:56 +08:00
UncleCode
9b60988232 feat(feedback): add feedback modal styles and integrate into mkdocs configuration 2025-05-16 21:25:10 +08:00
UncleCode
baca2df8df feat(analytics): add Google Tag Manager script and gtag.js for tracking 2025-05-16 20:49:02 +08:00
UncleCode
8a5e23d374 feat(crawler): add separate timeout for wait_for condition
Adds a new wait_for_timeout parameter to CrawlerRunConfig that allows specifying
a separate timeout for the wait_for condition, independent of the page_timeout.
This provides more granular control over waiting behaviors in the crawler.

Also removes unused colorama dependency and updates LinkedIn crawler example.

BREAKING CHANGE: LinkedIn crawler example now uses different wait_for_images timing
2025-05-16 17:00:45 +08:00
UncleCode
76dd86d1b3 Merge remote-tracking branch 'origin/linkedin-prep' into next 2025-05-08 17:13:59 +08:00
UncleCode
206a9dfabd feat(crawler): add session management and view-source support
Add session_id feature to allow reusing browser pages across multiple crawls.
Add support for view-source: protocol in URL handling.
Fix browser config reference and string formatting issues.
Update examples to demonstrate new session management features.

BREAKING CHANGE: Browser page handling now persists when using session_id
2025-05-08 17:13:35 +08:00
Aravind Karnam
aaf05910eb fix: removed unnecessary imports and installs 2025-05-06 15:53:55 +05:30
Aravind Karnam
a0555d5fa6 merge:from next branch 2025-05-06 15:16:47 +05:30
Aravind Karnam
38ebcbb304 fix: provide support for local llm by adding it to the arguments 2025-05-05 10:34:38 +05:30
UncleCode
9b5ccac76e feat(extraction): add RegexExtractionStrategy for pattern-based extraction
Add new RegexExtractionStrategy for fast, zero-LLM extraction of common data types:
- Built-in patterns for emails, URLs, phones, dates, and more
- Support for custom regex patterns
- LLM-assisted pattern generation utility
- Optimized HTML preprocessing with fit_html field
- Enhanced network response body capture

Breaking changes: None
2025-05-02 21:15:24 +08:00
Aravind Karnam
87d4b0fff4 format bash scripts properly so copy & paste may work without issues 2025-05-02 17:21:09 +05:30
Aravind Karnam
bd5a9ac632 updated readme with arguments for litellm 2025-05-02 17:04:42 +05:30
Aravind Karnam
6650b2f34a fix: replace openAI with litellm to support multiple llm providers 2025-05-02 16:51:15 +05:30
Aravind Karnam
5cc58f9bb3 fix: 1. duplicate verbose flag 2.inconsistency in argument name --profile-name 3. duplicate initialisaiton of env_defaults 2025-05-02 16:40:58 +05:30
Aravind Karnam
baf7f6a6f5 fix: typo in readme 2025-05-02 16:33:11 +05:30
UncleCode
94e9959fe0 feat(docker-api): add job-based polling endpoints for crawl and LLM tasks
Implements new asynchronous endpoints for handling long-running crawl and LLM tasks:
- POST /crawl/job and GET /crawl/job/{task_id} for crawl operations
- POST /llm/job and GET /llm/job/{task_id} for LLM operations
- Added Redis-based task management with configurable TTL
- Moved schema definitions to dedicated schemas.py
- Added example polling client demo_docker_polling.py

This change allows clients to handle long-running operations asynchronously through a polling pattern rather than holding connections open.
2025-05-01 21:24:52 +08:00
Aravind Karnam
7c2fd5202e fix: incorrect params and commands in linkedin app readme 2025-05-01 18:27:03 +05:30
UncleCode
50f0b83fcd feat(linkedin): add prospect-wizard app with scraping and visualization
Add new LinkedIn prospect discovery tool with three main components:
- c4ai_discover.py for company and people scraping
- c4ai_insights.py for org chart and decision maker analysis
- Interactive graph visualization with company/people exploration

Features include:
- Configurable LinkedIn search and scraping
- Org chart generation with decision maker scoring
- Interactive network graph visualization
- Company similarity analysis
- Chat interface for data exploration

Requires: crawl4ai, openai, sentence-transformers, networkx
2025-04-30 19:38:25 +08:00
UncleCode
9499164d3c feat(browser): improve browser profile management and cleanup
Enhance browser profile handling with better process cleanup and documentation:
- Add process cleanup for existing Chromium instances on Windows/Unix
- Fix profile creation by passing complete browser config
- Add comprehensive documentation for browser and CLI components
- Add initial profile creation test
- Bump version to 0.6.3

This change improves reliability when managing browser profiles and provides better documentation for developers.
2025-04-29 23:04:32 +08:00
UncleCode
2140d9aca4 fix(browser): correct headless mode default behavior
Modify BrowserConfig to respect explicit headless parameter setting instead of forcing True. Update version to 0.6.2 and clean up code formatting in examples.

BREAKING CHANGE: BrowserConfig no longer defaults to headless=True when explicitly set to False
2025-04-26 21:09:50 +08:00
UncleCode
ccec40ed17 feat(models): add dedicated tables field to CrawlResult
- Add tables field to CrawlResult model while maintaining backward compatibility
- Update async_webcrawler.py to extract tables from media and pass to tables field
- Update crypto_analysis_example.py to use the new tables field
- Add /config/dump examples to demo_docker_api.py
- Bump version to 0.6.1
2025-04-24 18:36:25 +08:00
UncleCode
ad4dfb21e1 Remoce "rc1" 2025-04-23 21:00:00 +08:00
UncleCode
7784b2468e feat(docs): enhance Ask AI button UX and add v0.6.0 release notes
Improve Ask AI button with better mobile support, animations, and positioning:
- Add button animations and hover effects
- Improve mobile responsiveness
- Add icon to button
- Fix positioning logic for different viewport sizes
- Add keyboard (Escape) support

Add comprehensive v0.6.0 release documentation:
- Create detailed release notes
- Update blog index with latest release
- Document all major features and breaking changes

BREAKING CHANGE: Documentation structure updated with new v0.6.0 section
2025-04-23 20:07:03 +08:00
UncleCode
146f9d415f Update README 2025-04-23 19:50:33 +08:00
UncleCode
37fd80e4b9 feat(docs): add mobile-friendly navigation menu
Implements a responsive hamburger menu for mobile devices with the following changes:
- Add new mobile_menu.js for handling mobile navigation
- Update layout.css with mobile-specific styles and animations
- Enhance README with updated geolocation example
- Register mobile_menu.js in mkdocs.yml

The mobile menu includes:
- Hamburger button animation
- Slide-out sidebar
- Backdrop overlay
- Touch-friendly navigation
- Proper event handling
2025-04-23 19:44:25 +08:00
UncleCode
949a93982e feat(docs): update documentation and disable Ask AI feature
Major documentation updates including:
- Add comprehensive code examples page
- Add video tutorial to homepage
- Update Docker deployment instructions for v0.6.0
- Temporarily disable Ask AI feature
- Add table border styling
- Update site version to v0.6.x

BREAKING CHANGE: Ask AI feature temporarily disabled pending launch
2025-04-23 19:02:39 +08:00
UncleCode
c4f5651199 chore(deps): upgrade to Python 3.12 and prepare for 0.6.0 release
- Update Docker base image to Python 3.12-slim-bookworm
- Bump version from 0.6.0rc1 to 0.6.0
- Update documentation to reflect release version changes
- Fix license specification in pyproject.toml and setup.py
- Clean up code formatting in demo_docker_api.py

BREAKING CHANGE: Base Python version upgraded from 3.10 to 3.12
2025-04-23 16:35:15 +08:00