Commit Graph

848 Commits

Author SHA1 Message Date
ntohidi
dc85481180 refactor: Update LLM extraction example with the updated structure 2025-06-12 12:23:03 +02:00
ntohidi
5d9213a0e9 fix: Update JavaScript execution in AsyncPlaywrightCrawlerStrategy to handle script errors and add basic download test case. ref #1215 2025-06-12 12:21:40 +02:00
ntohidi
4679ee023d fix: Enhance URLPatternFilter to enforce path boundary checks for prefix matching. ref #1003 2025-06-10 11:19:18 +02:00
Nasrin
f9b7090084 Merge pull request #1186 from zimmski/fix-typo-provoder
fix, Typo
2025-06-10 10:26:45 +02:00
AHMET YILMAZ
9442597f81 #1127: Improve URL handling and normalization in scraping strategies 2025-06-10 11:57:06 +08:00
AHMET YILMAZ
74b06d4b80 #1167 Add PHP MIME types to ContentTypeFilter for better file handling 2025-06-09 11:49:33 +08:00
ntohidi
5ac19a61d7 feat: Implement max_scroll_steps parameter for full page scanning. ref: #1168 2025-06-05 16:40:34 +02:00
Markus Zimmermann
022cc2d92a fix, Typo 2025-06-05 15:30:38 +02:00
ntohidi
fcc2abe4db (fix): Update document about LLM extraction strategy to use LLMConfig. REF #1146 2025-06-03 12:53:59 +02:00
ntohidi
cc95d3abd4 Fix raw URL parsing logic to correctly handle "raw://" and "raw:" prefixes. REF #1118 2025-06-03 11:19:08 +02:00
Nasrin
5ce3e682f3 Merge pull request #752 from jl-martins/fix-raw-url-parsing
Fix `raw://` URL parsing logic. issue ref #1118
2025-06-03 11:10:29 +02:00
João Martins
58c1e17170 Merge branch 'main' into fix-raw-url-parsing 2025-05-30 13:03:25 +01:00
UncleCode
3b766e1aac Add Google Colab button to LinkedIn Prospect Wizard README
- Added Colab badge linking to the demo notebook
- Added call-to-action encouraging users to try the demo in Colab
- Provides zero-setup cloud environment for testing

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-05-26 14:35:06 +08:00
UncleCode
c3b7b7e918 Add linkedin example ipynb. 2025-05-25 17:55:22 +08:00
UncleCode
7d0b447e1c Update setup script to clarify virtual display setup message 2025-05-25 16:55:18 +08:00
UncleCode
33b0e222ca Add Colab utilities and rename setup function for clarity 2025-05-25 16:50:56 +08:00
UncleCode
1fc45ffac8 Fix temperature typo and enhance LinkedIn extraction with Colab support
- Fixed widespread typo: `temprature` → `temperature` across LLMConfig and related files
- Enhanced CSS/XPath selector guidance for more reliable LinkedIn data extraction
- Added Google Colab display server support for running Crawl4AI in notebook environments
- Improved browser debugging with verbose startup args logging
- Updated LinkedIn schemas and HTML snippets for better parsing accuracy

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-05-25 16:47:12 +08:00
devin-ai-integration[bot]
9c2cc7f73c Fix BM25ContentFilter documentation to use language parameter instead of use_stemming (#1152)
Co-authored-by: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Co-authored-by: UncleCode <unclecode@kidocode.com>
2025-05-25 10:02:13 +08:00
UncleCode
1c5e76d51a Adjust positioning and set only core component as selected item by default 2025-05-24 20:49:44 +08:00
UncleCode
7665a6832f Add LLMContext article and updte JS to not show all components. 2025-05-24 20:46:24 +08:00
UncleCode
a06710ff03 Adding LLMContext generator to website. 2025-05-24 20:37:09 +08:00
unclecode
ad078c3f18 fix(pdf): add timeout to PDF downloads to prevent hanging (#1141)
- Added timeout=(20, 600) to requests.get() to prevent indefinite hanging
- Added download progress logging for better visibility
- Improved error handling with specific timeout exceptions
- Better temp file cleanup tracking

Fixes #1141
2025-05-23 16:05:44 +08:00
unclecode
400a6621ee Add debug folder to gitignore 2025-05-23 10:43:05 +08:00
UncleCode
bf56787874 refactor(browser): remove commented-out code for clarity 2025-05-21 20:32:40 +08:00
UncleCode
08ad7ef257 feat(browser): improve browser session management and profile handling
Enhance browser session management with the following improvements:
- Add state cloning between browser contexts
- Implement smarter page closing logic based on total pages and browser config
- Add storage state persistence during profile creation
- Improve managed browser context handling with storage state support

This change improves browser session reliability and persistence across runs.
2025-05-21 20:23:17 +08:00
UncleCode
1c0ce41328 Fix managed browser page retrieval when no pages (#1137)
This pull request addresses the issue of handling default context pages when none are open.  
- Introduces a conditional check to determine if a page exists in the context.  
- If no pages exist, a new page is created via await context.new_page().
2025-05-20 21:12:32 +08:00
UncleCode
85ac6fa523 Merge branch 'next' of https://github.com/unclecode/crawl4ai into next 2025-05-17 19:04:03 +08:00
UncleCode
becc4624bb feat(favicon): add new favicon images for improved branding 2025-05-17 19:03:51 +08:00
UncleCode
754ba731fa Fix chunk splitting utilities (#1122)
* Fix merge_chunks splitter usage and remove incorrect return

* 📝 Add docstrings to `codex/find-and-fix-a-bug` (#1123)

Docstrings generation was requested by @unclecode.

* https://github.com/unclecode/crawl4ai/pull/1122#issuecomment-2887985865

The following files were modified:

* `crawl4ai/utils.py`

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

---------

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
2025-05-17 15:06:53 +08:00
UncleCode
ac9981a1f5 feat(favicon): add favicon image and update mkdocs configuration 2025-05-16 21:59:23 +08:00
UncleCode
83ef15fd47 feat(favicon): add favicon.ico for improved branding 2025-05-16 21:55:07 +08:00
UncleCode
a3cb938675 feat(theme): enable dark color mode in mkdocs configuration 2025-05-16 21:44:56 +08:00
UncleCode
9b60988232 feat(feedback): add feedback modal styles and integrate into mkdocs configuration 2025-05-16 21:25:10 +08:00
UncleCode
98e951f611 fix(mkdocs): remove duplicate gtag.js entry in extra_javascript 2025-05-16 20:52:41 +08:00
UncleCode
baca2df8df feat(analytics): add Google Tag Manager script and gtag.js for tracking 2025-05-16 20:49:02 +08:00
UncleCode
8a5e23d374 feat(crawler): add separate timeout for wait_for condition
Adds a new wait_for_timeout parameter to CrawlerRunConfig that allows specifying
a separate timeout for the wait_for condition, independent of the page_timeout.
This provides more granular control over waiting behaviors in the crawler.

Also removes unused colorama dependency and updates LinkedIn crawler example.

BREAKING CHANGE: LinkedIn crawler example now uses different wait_for_images timing
2025-05-16 17:00:45 +08:00
UncleCode
897e017361 Set version to 0.6.3 vr0.6.3 v0.6.3 2025-05-12 21:20:10 +08:00
UncleCode
a3e9ef91ad fix(crawler): remove automatic page closure in screenshot methods
Removes automatic page closure in take_screenshot and take_screenshot_naive methods
to prevent premature closure of pages that might still be needed in the calling context.
This allows for more flexible page lifecycle management by the caller.

BREAKING CHANGE: Page objects are no longer automatically closed after taking screenshots.
Callers must explicitly handle page closure when appropriate.
2025-05-12 21:17:57 +08:00
UncleCode
76dd86d1b3 Merge remote-tracking branch 'origin/linkedin-prep' into next 2025-05-08 17:13:59 +08:00
UncleCode
206a9dfabd feat(crawler): add session management and view-source support
Add session_id feature to allow reusing browser pages across multiple crawls.
Add support for view-source: protocol in URL handling.
Fix browser config reference and string formatting issues.
Update examples to demonstrate new session management features.

BREAKING CHANGE: Browser page handling now persists when using session_id
2025-05-08 17:13:35 +08:00
Aravind Karnam
aaf05910eb fix: removed unnecessary imports and installs 2025-05-06 15:53:55 +05:30
Aravind Karnam
a0555d5fa6 merge:from next branch 2025-05-06 15:16:47 +05:30
Aravind Karnam
38ebcbb304 fix: provide support for local llm by adding it to the arguments 2025-05-05 10:34:38 +05:30
UncleCode
9b5ccac76e feat(extraction): add RegexExtractionStrategy for pattern-based extraction
Add new RegexExtractionStrategy for fast, zero-LLM extraction of common data types:
- Built-in patterns for emails, URLs, phones, dates, and more
- Support for custom regex patterns
- LLM-assisted pattern generation utility
- Optimized HTML preprocessing with fit_html field
- Enhanced network response body capture

Breaking changes: None
2025-05-02 21:15:24 +08:00
Aravind Karnam
87d4b0fff4 format bash scripts properly so copy & paste may work without issues 2025-05-02 17:21:09 +05:30
Aravind Karnam
bd5a9ac632 updated readme with arguments for litellm 2025-05-02 17:04:42 +05:30
Aravind Karnam
6650b2f34a fix: replace openAI with litellm to support multiple llm providers 2025-05-02 16:51:15 +05:30
Aravind Karnam
5cc58f9bb3 fix: 1. duplicate verbose flag 2.inconsistency in argument name --profile-name 3. duplicate initialisaiton of env_defaults 2025-05-02 16:40:58 +05:30
Aravind Karnam
baf7f6a6f5 fix: typo in readme 2025-05-02 16:33:11 +05:30
UncleCode
94e9959fe0 feat(docker-api): add job-based polling endpoints for crawl and LLM tasks
Implements new asynchronous endpoints for handling long-running crawl and LLM tasks:
- POST /crawl/job and GET /crawl/job/{task_id} for crawl operations
- POST /llm/job and GET /llm/job/{task_id} for LLM operations
- Added Redis-based task management with configurable TTL
- Moved schema definitions to dedicated schemas.py
- Added example polling client demo_docker_polling.py

This change allows clients to handle long-running operations asynchronously through a polling pattern rather than holding connections open.
2025-05-01 21:24:52 +08:00