Commit Graph

823 Commits

Author SHA1 Message Date
Ahmed-Tawfik94
984524ca1c fix(auth): add token authorization header in request preparation to ensure authenticated requests are made 2025-05-21 13:27:17 +08:00
ntohidi
cb8d581e47 fix(docs): update CrawlerRunConfig to use CacheMode for bypassing cache. REF: #1125 2025-05-19 18:03:05 +02:00
Ahmed Tawfik
ce09648af1 Merge pull request #1054 from Sacristaan/feature/readme_example
Fix: README.md urls list
2025-05-19 14:20:21 +08:00
Ahmed-Tawfik94
a97654270b #1086 fix(markdown): update BM25 filter to use language parameter for stemming 2025-05-19 14:11:46 +08:00
Ahmed-Tawfik94
b4fc60a555 #1103 fix(url): enhance URL normalization to handle invalid schemes and trailing slashes 2025-05-19 13:51:16 +08:00
Ahmed-Tawfik94
137ac014fb #1105 :fix(metadata): optimize article metadata extraction using XPath for improved performance 2025-05-19 13:48:02 +08:00
Ahmed-Tawfik94
faa98eefbc #1105 got fixed (metadata now matches with meta property article:* 2025-05-19 11:35:13 +08:00
ntohidi
22725ca87b fix(crawler): initialize captured_console to prevent unbound local error for local HTML files. REF: #1072
Resolved a bug where running the crawler on local HTML files with `capture_console_messages=False`
(default) raised `UnboundLocalError` due to `captured_console` being accessed before assignment.
2025-05-15 11:29:36 +02:00
ntohidi
e0fbd2b0a0 fix(schema): update f parameter description to use lowercase enum values. REF: #1070
Revised the description for the `f` parameter in the `/mcp/md` tool schema to use lowercase enum values
(`raw`, `fit`, `bm25`, `llm`) for consistency with the actual `enum` definition. This change prevents
LLM-based clients (e.g., Gemini via LibreChat) from generating uppercase values like `"FIT"`, which
caused 422 validation errors due to strict case-sensitive matching.
2025-05-15 10:45:23 +02:00
ntohidi
32966bea11 fix(extraction): resolve 'str' object has no attribute 'choices' error in LLMExtractionStrategy. Refs: #979
This patch ensures consistent handling of `response.choices[0].message.content` by avoiding redefinition
of the `response` variable, which caused downstream exceptions during error handling.
2025-05-15 10:09:19 +02:00
Ahmed-Tawfik94
a3b0cab52a #1088 is sloved flag -bc now if for --byPass-cache 2025-05-15 11:25:06 +08:00
UncleCode
897e017361 Set version to 0.6.3 vr0.6.3 v0.6.3 2025-05-12 21:20:10 +08:00
UncleCode
a3e9ef91ad fix(crawler): remove automatic page closure in screenshot methods
Removes automatic page closure in take_screenshot and take_screenshot_naive methods
to prevent premature closure of pages that might still be needed in the calling context.
This allows for more flexible page lifecycle management by the caller.

BREAKING CHANGE: Page objects are no longer automatically closed after taking screenshots.
Callers must explicitly handle page closure when appropriate.
2025-05-12 21:17:57 +08:00
UncleCode
76dd86d1b3 Merge remote-tracking branch 'origin/linkedin-prep' into next 2025-05-08 17:13:59 +08:00
UncleCode
206a9dfabd feat(crawler): add session management and view-source support
Add session_id feature to allow reusing browser pages across multiple crawls.
Add support for view-source: protocol in URL handling.
Fix browser config reference and string formatting issues.
Update examples to demonstrate new session management features.

BREAKING CHANGE: Browser page handling now persists when using session_id
2025-05-08 17:13:35 +08:00
Aravind Karnam
aaf05910eb fix: removed unnecessary imports and installs 2025-05-06 15:53:55 +05:30
Aravind Karnam
a0555d5fa6 merge:from next branch 2025-05-06 15:16:47 +05:30
Aravind Karnam
38ebcbb304 fix: provide support for local llm by adding it to the arguments 2025-05-05 10:34:38 +05:30
UncleCode
9b5ccac76e feat(extraction): add RegexExtractionStrategy for pattern-based extraction
Add new RegexExtractionStrategy for fast, zero-LLM extraction of common data types:
- Built-in patterns for emails, URLs, phones, dates, and more
- Support for custom regex patterns
- LLM-assisted pattern generation utility
- Optimized HTML preprocessing with fit_html field
- Enhanced network response body capture

Breaking changes: None
2025-05-02 21:15:24 +08:00
Aravind Karnam
87d4b0fff4 format bash scripts properly so copy & paste may work without issues 2025-05-02 17:21:09 +05:30
Aravind Karnam
bd5a9ac632 updated readme with arguments for litellm 2025-05-02 17:04:42 +05:30
Aravind Karnam
6650b2f34a fix: replace openAI with litellm to support multiple llm providers 2025-05-02 16:51:15 +05:30
Aravind Karnam
5cc58f9bb3 fix: 1. duplicate verbose flag 2.inconsistency in argument name --profile-name 3. duplicate initialisaiton of env_defaults 2025-05-02 16:40:58 +05:30
Aravind Karnam
baf7f6a6f5 fix: typo in readme 2025-05-02 16:33:11 +05:30
UncleCode
94e9959fe0 feat(docker-api): add job-based polling endpoints for crawl and LLM tasks
Implements new asynchronous endpoints for handling long-running crawl and LLM tasks:
- POST /crawl/job and GET /crawl/job/{task_id} for crawl operations
- POST /llm/job and GET /llm/job/{task_id} for LLM operations
- Added Redis-based task management with configurable TTL
- Moved schema definitions to dedicated schemas.py
- Added example polling client demo_docker_polling.py

This change allows clients to handle long-running operations asynchronously through a polling pattern rather than holding connections open.
2025-05-01 21:24:52 +08:00
Aravind Karnam
7c2fd5202e fix: incorrect params and commands in linkedin app readme 2025-05-01 18:27:03 +05:30
UncleCode
ee01b81f3e Merge branch 'merge-pr971' into next 2025-05-01 18:58:41 +08:00
UncleCode
0e5d672763 Merge branch 'pr-971' into merge-pr971 2025-05-01 18:57:28 +08:00
wakaka6
cd2b490b40 refactor(logger): Apply the Enumeration for color 2025-05-01 17:04:44 +08:00
UncleCode
50f0b83fcd feat(linkedin): add prospect-wizard app with scraping and visualization
Add new LinkedIn prospect discovery tool with three main components:
- c4ai_discover.py for company and people scraping
- c4ai_insights.py for org chart and decision maker analysis
- Interactive graph visualization with company/people exploration

Features include:
- Configurable LinkedIn search and scraping
- Org chart generation with decision maker scoring
- Interactive network graph visualization
- Company similarity analysis
- Chat interface for data exploration

Requires: crawl4ai, openai, sentence-transformers, networkx
2025-04-30 19:38:25 +08:00
UncleCode
9499164d3c feat(browser): improve browser profile management and cleanup
Enhance browser profile handling with better process cleanup and documentation:
- Add process cleanup for existing Chromium instances on Windows/Unix
- Fix profile creation by passing complete browser config
- Add comprehensive documentation for browser and CLI components
- Add initial profile creation test
- Bump version to 0.6.3

This change improves reliability when managing browser profiles and provides better documentation for developers.
2025-04-29 23:04:32 +08:00
Marc Sacristán
53245e4e0e Fix: README.md urls list 2025-04-29 16:26:35 +02:00
UncleCode
2140d9aca4 fix(browser): correct headless mode default behavior
Modify BrowserConfig to respect explicit headless parameter setting instead of forcing True. Update version to 0.6.2 and clean up code formatting in examples.

BREAKING CHANGE: BrowserConfig no longer defaults to headless=True when explicitly set to False
2025-04-26 21:09:50 +08:00
UncleCode
ccec40ed17 feat(models): add dedicated tables field to CrawlResult
- Add tables field to CrawlResult model while maintaining backward compatibility
- Update async_webcrawler.py to extract tables from media and pass to tables field
- Update crypto_analysis_example.py to use the new tables field
- Add /config/dump examples to demo_docker_api.py
- Bump version to 0.6.1
2025-04-24 18:36:25 +08:00
UncleCode
ad4dfb21e1 Remoce "rc1" 2025-04-23 21:00:00 +08:00
UncleCode
7784b2468e feat(docs): enhance Ask AI button UX and add v0.6.0 release notes
Improve Ask AI button with better mobile support, animations, and positioning:
- Add button animations and hover effects
- Improve mobile responsiveness
- Add icon to button
- Fix positioning logic for different viewport sizes
- Add keyboard (Escape) support

Add comprehensive v0.6.0 release documentation:
- Create detailed release notes
- Update blog index with latest release
- Document all major features and breaking changes

BREAKING CHANGE: Documentation structure updated with new v0.6.0 section
2025-04-23 20:07:03 +08:00
UncleCode
146f9d415f Update README vr0.6.0 2025-04-23 19:50:33 +08:00
UncleCode
37fd80e4b9 feat(docs): add mobile-friendly navigation menu
Implements a responsive hamburger menu for mobile devices with the following changes:
- Add new mobile_menu.js for handling mobile navigation
- Update layout.css with mobile-specific styles and animations
- Enhance README with updated geolocation example
- Register mobile_menu.js in mkdocs.yml

The mobile menu includes:
- Hamburger button animation
- Slide-out sidebar
- Backdrop overlay
- Touch-friendly navigation
- Proper event handling
2025-04-23 19:44:25 +08:00
UncleCode
949a93982e feat(docs): update documentation and disable Ask AI feature
Major documentation updates including:
- Add comprehensive code examples page
- Add video tutorial to homepage
- Update Docker deployment instructions for v0.6.0
- Temporarily disable Ask AI feature
- Add table border styling
- Update site version to v0.6.x

BREAKING CHANGE: Ask AI feature temporarily disabled pending launch
2025-04-23 19:02:39 +08:00
UncleCode
c4f5651199 chore(deps): upgrade to Python 3.12 and prepare for 0.6.0 release
- Update Docker base image to Python 3.12-slim-bookworm
- Bump version from 0.6.0rc1 to 0.6.0
- Update documentation to reflect release version changes
- Fix license specification in pyproject.toml and setup.py
- Clean up code formatting in demo_docker_api.py

BREAKING CHANGE: Base Python version upgraded from 3.10 to 3.12
2025-04-23 16:35:15 +08:00
UncleCode
b0aa8bc9f7 Update README vr0.6.0rc1 2025-04-22 23:21:42 +08:00
UncleCode
c98ffe2130 Update CHANGELOG 2025-04-22 22:36:41 +08:00
UncleCode
4812f08a73 feat(docker): update Docker deployment for v0.6.0
Major updates to Docker deployment infrastructure:
- Switch default port to 11235 for all services
- Add MCP (Model Context Protocol) support with WebSocket/SSE endpoints
- Simplify docker-compose.yml with auto-platform detection
- Update documentation with new features and examples
- Consolidate configuration and improve resource management

BREAKING CHANGE: Default port changed from 8020 to 11235. Update your configurations and deployment scripts accordingly.
2025-04-22 22:35:25 +08:00
unclecode
f3ebb38edf Merge PR #899 into next, resolve conflicts in server.py and docs/browser-crawler-config.md 2025-04-22 14:56:47 +08:00
UncleCode
0007aea204 Update changelog 2025-04-21 23:21:49 +08:00
UncleCode
b5c25731e6 feat(browser): add geolocation, locale and timezone support
Add support for controlling browser geolocation, locale and timezone settings:
- New GeolocationConfig class for managing GPS coordinates
- Add locale and timezone_id parameters to CrawlerRunConfig
- Update browser context creation to handle location settings
- Add example script for geolocation usage
- Update documentation with location-based identity features

This enables more precise control over browser identity and location reporting.
2025-04-21 23:20:59 +08:00
UncleCode
5297e362f3 feat(mcp): Implement MCP protocol and enhance server capabilities
This commit introduces several significant enhancements to the Crawl4AI Docker deployment:

  1. Add MCP Protocol Support:
     - Implement WebSocket and SSE transport layers for MCP server communication
     - Create mcp_bridge.py to expose existing API endpoints via MCP protocol
     - Add comprehensive tests for both socket and SSE transport methods

  2. Enhance Docker Server Capabilities:
     - Add PDF generation endpoint with file saving functionality
     - Add screenshot capture endpoint with configurable wait time
     - Implement JavaScript execution endpoint for dynamic page interaction
     - Add intelligent file path handling for saving generated assets

  3. Improve Search and Context Functionality:
     - Implement syntax-aware code function chunking using AST parsing
     - Add BM25-based intelligent document search with relevance scoring
     - Create separate code and documentation context endpoints
     - Enhance response format with structured results and scores

  4. Rename and Fix File Organization:
     - Fix typo in test_docker_config_gen.py filename
     - Update import statements and dependencies
     - Add FileResponse for context endpoints

  This enhancement significantly improves the machine-to-machine communication
  capabilities of Crawl4AI, making it more suitable for integration with LLM agents
  and other automated systems.

  The CHANGELOG update has been applied successfully, highlighting the key features and improvements made in this release. The commit message provides a detailed explanation of all the
  changes, which will be helpful for tracking the project's evolution.
2025-04-21 22:22:02 +08:00
UncleCode
a58c8000aa refactor(server): migrate to pool-based crawler management
Replace crawler_manager.py with simpler crawler_pool.py implementation:
- Add global page semaphore for hard concurrency cap
- Implement browser pool with idle cleanup
- Add playground UI for testing and stress testing
- Update API handlers to use pooled crawlers
- Enhance logging levels and symbols

BREAKING CHANGE: Removes CrawlerManager class in favor of simpler pool-based approach
2025-04-20 20:14:26 +08:00
Aravind Karnam
b27bb367e8 merge next. Resolve conflicts. Fix some import errors and error handling in server.py 2025-04-19 20:27:47 +05:30
Aravind Karnam
d2648eaa39 fix: solved with deepcopy of elements https://github.com/unclecode/crawl4ai/issues/902 2025-04-19 20:08:36 +05:30