crawl4ai

Author	SHA1	Message	Date
coderabbitai[bot]	32fcacafa6	📝 Add docstrings to `codex/find-and-fix-a-bug` Docstrings generation was requested by @unclecode. * https://github.com/unclecode/crawl4ai/pull/1122#issuecomment-2887985865 The following files were modified: * `crawl4ai/utils.py`	2025-05-17 02:37:00 +00:00
UncleCode	45f1652d98	Fix merge_chunks splitter usage and remove incorrect return	2025-05-17 10:31:19 +08:00
UncleCode	897e017361	Set version to 0.6.3 (tags: vr0.6.3, v0.6.3)	2025-05-12 21:20:10 +08:00
UncleCode	a3e9ef91ad	fix(crawler): remove automatic page closure in screenshot methods Removes automatic page closure in take_screenshot and take_screenshot_naive methods to prevent premature closure of pages that might still be needed in the calling context. This allows for more flexible page lifecycle management by the caller. BREAKING CHANGE: Page objects are no longer automatically closed after taking screenshots. Callers must explicitly handle page closure when appropriate.	2025-05-12 21:17:57 +08:00
UncleCode	76dd86d1b3	Merge remote-tracking branch 'origin/linkedin-prep' into next	2025-05-08 17:13:59 +08:00
UncleCode	206a9dfabd	feat(crawler): add session management and view-source support Add session_id feature to allow reusing browser pages across multiple crawls. Add support for view-source: protocol in URL handling. Fix browser config reference and string formatting issues. Update examples to demonstrate new session management features. BREAKING CHANGE: Browser page handling now persists when using session_id	2025-05-08 17:13:35 +08:00
Aravind Karnam	aaf05910eb	fix: removed unnecessary imports and installs	2025-05-06 15:53:55 +05:30
Aravind Karnam	a0555d5fa6	merge:from next branch	2025-05-06 15:16:47 +05:30
Aravind Karnam	38ebcbb304	fix: provide support for local llm by adding it to the arguments	2025-05-05 10:34:38 +05:30
UncleCode	9b5ccac76e	feat(extraction): add RegexExtractionStrategy for pattern-based extraction Add new RegexExtractionStrategy for fast, zero-LLM extraction of common data types: - Built-in patterns for emails, URLs, phones, dates, and more - Support for custom regex patterns - LLM-assisted pattern generation utility - Optimized HTML preprocessing with fit_html field - Enhanced network response body capture Breaking changes: None	2025-05-02 21:15:24 +08:00
Aravind Karnam	87d4b0fff4	format bash scripts properly so copy & paste may work without issues	2025-05-02 17:21:09 +05:30
Aravind Karnam	bd5a9ac632	updated readme with arguments for litellm	2025-05-02 17:04:42 +05:30
Aravind Karnam	6650b2f34a	fix: replace openAI with litellm to support multiple llm providers	2025-05-02 16:51:15 +05:30
Aravind Karnam	5cc58f9bb3	fix: 1. duplicate verbose flag 2. inconsistency in argument name --profile-name 3. duplicate initialisation of env_defaults	2025-05-02 16:40:58 +05:30
Aravind Karnam	baf7f6a6f5	fix: typo in readme	2025-05-02 16:33:11 +05:30
UncleCode	94e9959fe0	feat(docker-api): add job-based polling endpoints for crawl and LLM tasks Implements new asynchronous endpoints for handling long-running crawl and LLM tasks: - POST /crawl/job and GET /crawl/job/{task_id} for crawl operations - POST /llm/job and GET /llm/job/{task_id} for LLM operations - Added Redis-based task management with configurable TTL - Moved schema definitions to dedicated schemas.py - Added example polling client demo_docker_polling.py This change allows clients to handle long-running operations asynchronously through a polling pattern rather than holding connections open.	2025-05-01 21:24:52 +08:00
Aravind Karnam	7c2fd5202e	fix: incorrect params and commands in linkedin app readme	2025-05-01 18:27:03 +05:30
UncleCode	ee01b81f3e	Merge branch 'merge-pr971' into next	2025-05-01 18:58:41 +08:00
UncleCode	0e5d672763	Merge branch 'pr-971' into merge-pr971	2025-05-01 18:57:28 +08:00
wakaka6	cd2b490b40	refactor(logger): Apply the Enumeration for color	2025-05-01 17:04:44 +08:00
UncleCode	50f0b83fcd	feat(linkedin): add prospect-wizard app with scraping and visualization Add new LinkedIn prospect discovery tool with three main components: - c4ai_discover.py for company and people scraping - c4ai_insights.py for org chart and decision maker analysis - Interactive graph visualization with company/people exploration Features include: - Configurable LinkedIn search and scraping - Org chart generation with decision maker scoring - Interactive network graph visualization - Company similarity analysis - Chat interface for data exploration Requires: crawl4ai, openai, sentence-transformers, networkx	2025-04-30 19:38:25 +08:00
UncleCode	9499164d3c	feat(browser): improve browser profile management and cleanup Enhance browser profile handling with better process cleanup and documentation: - Add process cleanup for existing Chromium instances on Windows/Unix - Fix profile creation by passing complete browser config - Add comprehensive documentation for browser and CLI components - Add initial profile creation test - Bump version to 0.6.3 This change improves reliability when managing browser profiles and provides better documentation for developers.	2025-04-29 23:04:32 +08:00
UncleCode	2140d9aca4	fix(browser): correct headless mode default behavior Modify BrowserConfig to respect explicit headless parameter setting instead of forcing True. Update version to 0.6.2 and clean up code formatting in examples. BREAKING CHANGE: BrowserConfig no longer defaults to headless=True when explicitly set to False	2025-04-26 21:09:50 +08:00
UncleCode	ccec40ed17	feat(models): add dedicated tables field to CrawlResult - Add tables field to CrawlResult model while maintaining backward compatibility - Update async_webcrawler.py to extract tables from media and pass to tables field - Update crypto_analysis_example.py to use the new tables field - Add /config/dump examples to demo_docker_api.py - Bump version to 0.6.1	2025-04-24 18:36:25 +08:00
UncleCode	ad4dfb21e1	Remove "rc1"	2025-04-23 21:00:00 +08:00
UncleCode	7784b2468e	feat(docs): enhance Ask AI button UX and add v0.6.0 release notes Improve Ask AI button with better mobile support, animations, and positioning: - Add button animations and hover effects - Improve mobile responsiveness - Add icon to button - Fix positioning logic for different viewport sizes - Add keyboard (Escape) support Add comprehensive v0.6.0 release documentation: - Create detailed release notes - Update blog index with latest release - Document all major features and breaking changes BREAKING CHANGE: Documentation structure updated with new v0.6.0 section	2025-04-23 20:07:03 +08:00
UncleCode	146f9d415f	Update README (tag: vr0.6.0)	2025-04-23 19:50:33 +08:00
UncleCode	37fd80e4b9	feat(docs): add mobile-friendly navigation menu Implements a responsive hamburger menu for mobile devices with the following changes: - Add new mobile_menu.js for handling mobile navigation - Update layout.css with mobile-specific styles and animations - Enhance README with updated geolocation example - Register mobile_menu.js in mkdocs.yml The mobile menu includes: - Hamburger button animation - Slide-out sidebar - Backdrop overlay - Touch-friendly navigation - Proper event handling	2025-04-23 19:44:25 +08:00
UncleCode	949a93982e	feat(docs): update documentation and disable Ask AI feature Major documentation updates including: - Add comprehensive code examples page - Add video tutorial to homepage - Update Docker deployment instructions for v0.6.0 - Temporarily disable Ask AI feature - Add table border styling - Update site version to v0.6.x BREAKING CHANGE: Ask AI feature temporarily disabled pending launch	2025-04-23 19:02:39 +08:00
UncleCode	c4f5651199	chore(deps): upgrade to Python 3.12 and prepare for 0.6.0 release - Update Docker base image to Python 3.12-slim-bookworm - Bump version from 0.6.0rc1 to 0.6.0 - Update documentation to reflect release version changes - Fix license specification in pyproject.toml and setup.py - Clean up code formatting in demo_docker_api.py BREAKING CHANGE: Base Python version upgraded from 3.10 to 3.12	2025-04-23 16:35:15 +08:00
UncleCode	b0aa8bc9f7	Update README (tag: vr0.6.0rc1)	2025-04-22 23:21:42 +08:00
UncleCode	c98ffe2130	Update CHANGELOG	2025-04-22 22:36:41 +08:00
UncleCode	4812f08a73	feat(docker): update Docker deployment for v0.6.0 Major updates to Docker deployment infrastructure: - Switch default port to 11235 for all services - Add MCP (Model Context Protocol) support with WebSocket/SSE endpoints - Simplify docker-compose.yml with auto-platform detection - Update documentation with new features and examples - Consolidate configuration and improve resource management BREAKING CHANGE: Default port changed from 8020 to 11235. Update your configurations and deployment scripts accordingly.	2025-04-22 22:35:25 +08:00
unclecode	f3ebb38edf	Merge PR #899 into next, resolve conflicts in server.py and docs/browser-crawler-config.md	2025-04-22 14:56:47 +08:00
UncleCode	0007aea204	Update changelog	2025-04-21 23:21:49 +08:00
UncleCode	b5c25731e6	feat(browser): add geolocation, locale and timezone support Add support for controlling browser geolocation, locale and timezone settings: - New GeolocationConfig class for managing GPS coordinates - Add locale and timezone_id parameters to CrawlerRunConfig - Update browser context creation to handle location settings - Add example script for geolocation usage - Update documentation with location-based identity features This enables more precise control over browser identity and location reporting.	2025-04-21 23:20:59 +08:00
UncleCode	5297e362f3	feat(mcp): Implement MCP protocol and enhance server capabilities This commit introduces several significant enhancements to the Crawl4AI Docker deployment: 1. Add MCP Protocol Support: - Implement WebSocket and SSE transport layers for MCP server communication - Create mcp_bridge.py to expose existing API endpoints via MCP protocol - Add comprehensive tests for both socket and SSE transport methods 2. Enhance Docker Server Capabilities: - Add PDF generation endpoint with file saving functionality - Add screenshot capture endpoint with configurable wait time - Implement JavaScript execution endpoint for dynamic page interaction - Add intelligent file path handling for saving generated assets 3. Improve Search and Context Functionality: - Implement syntax-aware code function chunking using AST parsing - Add BM25-based intelligent document search with relevance scoring - Create separate code and documentation context endpoints - Enhance response format with structured results and scores 4. Rename and Fix File Organization: - Fix typo in test_docker_config_gen.py filename - Update import statements and dependencies - Add FileResponse for context endpoints This enhancement significantly improves the machine-to-machine communication capabilities of Crawl4AI, making it more suitable for integration with LLM agents and other automated systems. The CHANGELOG update has been applied successfully, highlighting the key features and improvements made in this release. The commit message provides a detailed explanation of all the changes, which will be helpful for tracking the project's evolution.	2025-04-21 22:22:02 +08:00
UncleCode	a58c8000aa	refactor(server): migrate to pool-based crawler management Replace crawler_manager.py with simpler crawler_pool.py implementation: - Add global page semaphore for hard concurrency cap - Implement browser pool with idle cleanup - Add playground UI for testing and stress testing - Update API handlers to use pooled crawlers - Enhance logging levels and symbols BREAKING CHANGE: Removes CrawlerManager class in favor of simpler pool-based approach	2025-04-20 20:14:26 +08:00
Aravind Karnam	b27bb367e8	merge next. Resolve conflicts. Fix some import errors and error handling in server.py	2025-04-19 20:27:47 +05:30
Aravind Karnam	d2648eaa39	fix: solved with deepcopy of elements https://github.com/unclecode/crawl4ai/issues/902	2025-04-19 20:08:36 +05:30
Aravind Karnam	c2902fd200	revert: last change in order of execution, since it introduced a new issue in the generated content. https://github.com/unclecode/crawl4ai/issues/902	2025-04-19 19:46:20 +05:30
UncleCode	16b2318242	feat(api): implement crawler pool manager for improved resource handling Adds a new CrawlerManager class to handle browser instance pooling and failover: - Implements auto-scaling based on system resources - Adds primary/backup crawler management - Integrates memory monitoring and throttling - Adds streaming support with memory tracking - Updates API endpoints to use pooled crawlers BREAKING CHANGE: API endpoints now require CrawlerManager initialization	2025-04-18 22:26:24 +08:00
UncleCode	907cba194f	Merge branch 'next-stress' into next	2025-04-17 22:34:43 +08:00
UncleCode	3bf78ff47a	refactor(docker-demo): enhance error handling and output formatting Improve the Docker API demo script with better error handling, more detailed output, and enhanced visualization: - Add detailed error messages and stack traces for debugging - Implement better status code handling and display - Enhance JSON output formatting with monokai theme and word wrap - Add depth information display for deep crawls - Improve proxy usage reporting - Fix port number inconsistency No breaking changes.	2025-04-17 22:32:58 +08:00
UncleCode	921e0c46b6	feat(tests): implement high volume stress testing framework Add comprehensive stress testing solution for SDK using arun_many and dispatcher system: - Create test_stress_sdk.py for running high volume crawl tests - Add run_benchmark.py for orchestrating tests with predefined configs - Implement benchmark_report.py for generating performance reports - Add memory tracking and local test site generation - Support both streaming and batch processing modes - Add detailed documentation in README.md The framework enables testing SDK performance, concurrency handling, and memory behavior under high-volume scenarios.	2025-04-17 22:31:51 +08:00
UncleCode	fd899f66aa	Merge branch 'next-fix-markdown-source' into next	2025-04-17 20:16:15 +08:00
UncleCode	30ec4f571f	feat(docs): add comprehensive Docker API demo script Add a new example script demonstrating Docker API usage with extensive features: - Basic crawling with single/multi URL support - Markdown generation with various filters - Parameter demonstrations (CSS, JS, screenshots, SSL, proxies) - Extraction strategies using CSS and LLM - Deep crawling capabilities with streaming - Integration examples with proxy rotation and SSL certificate fetching Also includes minor formatting improvements in async_webcrawler.py	2025-04-17 20:16:11 +08:00
UncleCode	7db6b468d9	feat(markdown): add content source selection for markdown generation Adds a new content_source parameter to MarkdownGenerationStrategy that allows selecting which HTML content to use for markdown generation: - cleaned_html (default): uses post-processed HTML - raw_html: uses original webpage HTML - fit_html: uses preprocessed HTML for schema extraction Changes include: - Added content_source parameter to MarkdownGenerationStrategy - Updated AsyncWebCrawler to handle HTML source selection - Added examples and tests for the new feature - Updated documentation with new parameter details BREAKING CHANGE: Renamed cleaned_html parameter to input_html in generate_markdown() method signature to better reflect its generalized purpose	2025-04-17 20:13:53 +08:00
Aravind Karnam	eed7f88f29	Merge branch 'next' into 2025-MAR-ALPHA-1	2025-04-17 10:50:02 +05:30
UncleCode	94d486579c	docs(tests): clarify server URL comments in deep crawl tests Improve documentation of test configuration URLs by adding clearer comments explaining when to use each URL configuration - Docker vs development mode. No functional changes, only comment improvements.	2025-04-15 22:32:27 +08:00
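
Commit 94e9959fe0 above describes a polling pattern: a client submits a job (POST /crawl/job), receives a task id, and then repeatedly checks GET /crawl/job/{task_id} instead of holding a connection open. The following is a minimal sketch of that client-side loop, not the project's own demo_docker_polling.py. The helper name `poll_job`, the `status` field, and the terminal values `"completed"` and `"failed"` are assumptions made for illustration; the commit message does not specify the response schema.

```python
import time
from typing import Any, Callable, Dict


def poll_job(
    submit: Callable[[], str],
    fetch: Callable[[str], Dict[str, Any]],
    interval: float = 1.0,
    timeout: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Dict[str, Any]:
    """Submit a job, then poll until it reaches a terminal state.

    `submit` returns a task id (hypothetically the result of POST /crawl/job).
    `fetch` returns the job's current state for that id (hypothetically the
    JSON body of GET /crawl/job/{task_id}).
    The "status" key and its terminal values are assumed, not confirmed.
    """
    task_id = submit()
    deadline = clock() + timeout
    while True:
        state = fetch(task_id)
        # Assumed terminal states; adjust to the server's real schema.
        if state.get("status") in ("completed", "failed"):
            return state
        if clock() >= deadline:
            raise TimeoutError(f"job {task_id} did not finish within {timeout}s")
        sleep(interval)


if __name__ == "__main__":
    # Simulated server: two in-progress responses, then a finished one.
    responses = iter(
        [
            {"status": "processing"},
            {"status": "processing"},
            {"status": "completed", "result": "ok"},
        ]
    )
    final = poll_job(
        submit=lambda: "task-1",
        fetch=lambda task_id: next(responses),
        interval=0.0,
    )
    print(final["status"])
```

Passing `submit` and `fetch` as callables keeps the loop independent of any HTTP library. In real use they would wrap the two endpoints named in the commit, on whatever host and port the server is deployed to.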

813 Commits