Commit Graph

1109 Commits

Author SHA1 Message Date
Ahmed-Tawfik94
984524ca1c fix(auth): add token authorization header in request preparation to ensure authenticated requests are made 2025-05-21 13:27:17 +08:00
UncleCode
1c0ce41328 Fix managed browser page retrieval when no pages (#1137)
This pull request addresses the issue of handling default context pages when none are open.  
- Introduces a conditional check to determine if a page exists in the context.  
- If no pages exist, a new page is created via await context.new_page().
2025-05-20 21:12:32 +08:00
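The check described in this commit can be sketched as follows. This is a minimal illustration, not the project's code: `FakeContext`, `FakePage`, and `get_or_create_page` are hypothetical stand-ins that mimic only the `pages` list and `new_page()` coroutine of a browser context.

```python
import asyncio


class FakePage:
    """Hypothetical stand-in for a browser page."""


class FakeContext:
    """Hypothetical stand-in exposing `pages` and `new_page()` like a browser context."""

    def __init__(self, pages=None):
        self.pages = list(pages or [])

    async def new_page(self):
        page = FakePage()
        self.pages.append(page)
        return page


async def get_or_create_page(context):
    # Reuse the first open page when the context already has one.
    if context.pages:
        return context.pages[0]
    # Otherwise create a page instead of indexing into an empty list.
    return await context.new_page()


empty_context = FakeContext()
created_page = asyncio.run(get_or_create_page(empty_context))
```

Without the conditional, `context.pages[0]` on an empty context would raise `IndexError`.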
ntohidi
cb8d581e47 fix(docs): update CrawlerRunConfig to use CacheMode for bypassing cache. REF: #1125 2025-05-19 18:03:05 +02:00
Ahmed-Tawfik94
a55c2b3f88 refactor(logging): update extraction logging to use url_status method 2025-05-19 16:32:22 +08:00
Ahmed Tawfik
ce09648af1 Merge pull request #1054 from Sacristaan/feature/readme_example
Fix: README.md urls list
2025-05-19 14:20:21 +08:00
Ahmed-Tawfik94
a97654270b #1086 fix(markdown): update BM25 filter to use language parameter for stemming 2025-05-19 14:11:46 +08:00
Ahmed-Tawfik94
b4fc60a555 #1103 fix(url): enhance URL normalization to handle invalid schemes and trailing slashes 2025-05-19 13:51:16 +08:00
Ahmed-Tawfik94
137ac014fb #1105 fix(metadata): optimize article metadata extraction using XPath for improved performance 2025-05-19 13:48:02 +08:00
Ahmed-Tawfik94
faa98eefbc #1105 got fixed (metadata now matches with meta property article:*) 2025-05-19 11:35:13 +08:00
UncleCode
85ac6fa523 Merge branch 'next' of https://github.com/unclecode/crawl4ai into next 2025-05-17 19:04:03 +08:00
UncleCode
becc4624bb feat(favicon): add new favicon images for improved branding 2025-05-17 19:03:51 +08:00
UncleCode
754ba731fa Fix chunk splitting utilities (#1122)
* Fix merge_chunks splitter usage and remove incorrect return

* 📝 Add docstrings to `codex/find-and-fix-a-bug` (#1123)

Docstrings generation was requested by @unclecode.

* https://github.com/unclecode/crawl4ai/pull/1122#issuecomment-2887985865

The following files were modified:

* `crawl4ai/utils.py`

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

---------

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
2025-05-17 15:06:53 +08:00
UncleCode
ac9981a1f5 feat(favicon): add favicon image and update mkdocs configuration 2025-05-16 21:59:23 +08:00
UncleCode
83ef15fd47 feat(favicon): add favicon.ico for improved branding 2025-05-16 21:55:07 +08:00
UncleCode
a3cb938675 feat(theme): enable dark color mode in mkdocs configuration 2025-05-16 21:44:56 +08:00
UncleCode
9b60988232 feat(feedback): add feedback modal styles and integrate into mkdocs configuration 2025-05-16 21:25:10 +08:00
UncleCode
98e951f611 fix(mkdocs): remove duplicate gtag.js entry in extra_javascript 2025-05-16 20:52:41 +08:00
UncleCode
baca2df8df feat(analytics): add Google Tag Manager script and gtag.js for tracking 2025-05-16 20:49:02 +08:00
UncleCode
8a5e23d374 feat(crawler): add separate timeout for wait_for condition
Adds a new wait_for_timeout parameter to CrawlerRunConfig that allows specifying
a separate timeout for the wait_for condition, independent of the page_timeout.
This provides more granular control over waiting behaviors in the crawler.

Also removes unused colorama dependency and updates LinkedIn crawler example.

BREAKING CHANGE: LinkedIn crawler example now uses different wait_for_images timing
2025-05-16 17:00:45 +08:00
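A rough sketch of how two independent timeouts might be resolved. `RunConfigSketch` is a hypothetical stand-in, not the real `CrawlerRunConfig`; the default value and the fallback to `page_timeout` when `wait_for_timeout` is unset are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RunConfigSketch:
    """Hypothetical stand-in for a run config with two timeouts (milliseconds)."""

    page_timeout: int = 60_000               # assumed default, for illustration
    wait_for_timeout: Optional[int] = None   # None means "not set by the caller"


def effective_wait_for_timeout(cfg: RunConfigSketch) -> int:
    # Assumption: an unset wait_for_timeout falls back to page_timeout.
    if cfg.wait_for_timeout is None:
        return cfg.page_timeout
    return cfg.wait_for_timeout
```

With this shape, a caller could keep a generous page timeout while giving up quickly on a selector that never appears.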
ntohidi
22725ca87b fix(crawler): initialize captured_console to prevent unbound local error for local HTML files. REF: #1072
Resolved a bug where running the crawler on local HTML files with `capture_console_messages=False`
(default) raised `UnboundLocalError` due to `captured_console` being accessed before assignment.
2025-05-15 11:29:36 +02:00
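The failure mode described here is easy to reproduce in isolation. The two functions below are hypothetical reductions, not the crawler's code; only the names `captured_console` and `capture_console_messages` come from the commit message.

```python
def crawl_local_buggy(capture_console_messages=False):
    if capture_console_messages:
        captured_console = []
    # With the flag off, the name was never bound, so this line raises UnboundLocalError.
    return {"console_messages": captured_console}


def crawl_local_fixed(capture_console_messages=False):
    # Bind the name up front so every code path can read it safely.
    captured_console = None
    if capture_console_messages:
        captured_console = []
    return {"console_messages": captured_console}
```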
ntohidi
e0fbd2b0a0 fix(schema): update f parameter description to use lowercase enum values. REF: #1070
Revised the description for the `f` parameter in the `/mcp/md` tool schema to use lowercase enum values
(`raw`, `fit`, `bm25`, `llm`) for consistency with the actual `enum` definition. This change prevents
LLM-based clients (e.g., Gemini via LibreChat) from generating uppercase values like `"FIT"`, which
caused 422 validation errors due to strict case-sensitive matching.
2025-05-15 10:45:23 +02:00
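Strict, case-sensitive enum matching of this kind can be illustrated with a plain Python `Enum`. `FilterType` and `parse_filter` are hypothetical names; only the four lowercase values come from the commit message, and the real endpoint reports the mismatch as a 422 response rather than a `ValueError`.

```python
from enum import Enum


class FilterType(str, Enum):
    """Hypothetical enum holding the lowercase values named in the commit."""

    RAW = "raw"
    FIT = "fit"
    BM25 = "bm25"
    LLM = "llm"


def parse_filter(value: str) -> FilterType:
    # Lookup is by value and case-sensitive: "fit" matches, "FIT" raises ValueError.
    return FilterType(value)
```

Because the lookup never normalizes case, a description that advertises uppercase names invites clients to send values the schema will reject.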
ntohidi
32966bea11 fix(extraction): resolve 'str' object has no attribute 'choices' error in LLMExtractionStrategy. Refs: #979
This patch ensures consistent handling of `response.choices[0].message.content` by avoiding redefinition
of the `response` variable, which caused downstream exceptions during error handling.
2025-05-15 10:09:19 +02:00
Ahmed-Tawfik94
a3b0cab52a #1088 is solved: flag -bc is now for --byPass-cache 2025-05-15 11:25:06 +08:00
medo94my
137556b3dc fix the EXTRACT to match the styling of the other methods 2025-05-14 16:01:10 +08:00
ntohidi
260e2dc347 fix(browser): create browser config before launching managed browser instance. REF: https://discord.com/channels/1278297938551902308/1278298697540567132/1371683009459392716 2025-05-13 14:03:20 +02:00
ntohidi
25d97d56e4 fix(dependencies): remove duplicated aiofiles from project dependencies. REF #1045 2025-05-13 13:56:12 +02:00
Aravind Karnam
98a56e6e01 Merge next branch 2025-05-13 17:12:11 +05:30
Emmanuel Ferdman
1e1c887a2f fix(docker-api): migrate to modern datetime library API
Signed-off-by: Emmanuel Ferdman <emmanuelferdman@gmail.com>
2025-05-13 00:04:58 -07:00
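The commit message does not say which calls were migrated. One common migration of this kind, shown here purely as an assumption, replaces the naive `datetime.utcnow()` (deprecated since Python 3.12) with a timezone-aware equivalent. `utc_now` is a hypothetical helper name.

```python
from datetime import datetime, timezone


def utc_now() -> datetime:
    # Timezone-aware replacement for the deprecated, naive datetime.utcnow().
    return datetime.now(timezone.utc)


stamp = utc_now()
```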
UncleCode
897e017361 Set version to 0.6.3 (tags: vr0.6.3, v0.6.3) 2025-05-12 21:20:10 +08:00
UncleCode
a3e9ef91ad fix(crawler): remove automatic page closure in screenshot methods
Removes automatic page closure in take_screenshot and take_screenshot_naive methods
to prevent premature closure of pages that might still be needed in the calling context.
This allows for more flexible page lifecycle management by the caller.

BREAKING CHANGE: Page objects are no longer automatically closed after taking screenshots.
Callers must explicitly handle page closure when appropriate.
2025-05-12 21:17:57 +08:00
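Under the new contract, the caller owns the page lifecycle. A minimal sketch with a hypothetical `FakePage`; the `take_screenshot` below is a simplified stand-in that shares only its name with the real method.

```python
import asyncio


class FakePage:
    """Hypothetical stand-in that records whether it has been closed."""

    def __init__(self):
        self.closed = False

    async def screenshot(self):
        return b"png-bytes"

    async def close(self):
        self.closed = True


async def take_screenshot(page):
    # Capture only; the page is deliberately left open for the caller.
    return await page.screenshot()


async def caller():
    page = FakePage()
    try:
        shot = await take_screenshot(page)
        still_open = not page.closed  # the page can be reused at this point
    finally:
        await page.close()            # the caller closes it explicitly
    return shot, still_open, page.closed


result = asyncio.run(caller())
```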
UncleCode
76dd86d1b3 Merge remote-tracking branch 'origin/linkedin-prep' into next 2025-05-08 17:13:59 +08:00
UncleCode
206a9dfabd feat(crawler): add session management and view-source support
Add session_id feature to allow reusing browser pages across multiple crawls.
Add support for view-source: protocol in URL handling.
Fix browser config reference and string formatting issues.
Update examples to demonstrate new session management features.

BREAKING CHANGE: Browser page handling now persists when using session_id
2025-05-08 17:13:35 +08:00
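Page reuse keyed by `session_id` can be sketched with a small registry. `SessionPages` and `FakePage` are hypothetical; the real implementation may differ in how it stores, expires, and closes sessions.

```python
import asyncio


class FakePage:
    """Hypothetical stand-in for a browser page."""


class SessionPages:
    """Hypothetical registry mapping a session_id to a persistent page."""

    def __init__(self):
        self._pages = {}

    async def get_page(self, session_id=None):
        # A known session_id returns the page from the earlier crawl.
        if session_id is not None and session_id in self._pages:
            return self._pages[session_id]
        page = FakePage()
        # Only pages opened under a session_id are kept for later reuse.
        if session_id is not None:
            self._pages[session_id] = page
        return page
```

Calls without a `session_id` get a fresh page each time, which matches the behavior before this change as far as the commit message describes it.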
ntohidi
1af3d1c2e0 Merge branch '2025-APR-1' of https://github.com/unclecode/crawl4ai into 2025-APR-1 2025-05-08 11:11:32 +02:00
Aravind Karnam
c1041b9bbe fix: exclude_external_images flag simply discards elements ref: https://github.com/unclecode/crawl4ai/issues/345 2025-05-07 18:43:29 +05:30
Aravind Karnam
f6e25e2a6b fix: check_robots_txt to support wildcard rules ref: #699 2025-05-07 17:53:30 +05:30
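Wildcard rules in robots.txt conventionally use `*` for any run of characters and a trailing `$` to anchor the end of the path. The sketch below shows one way to match them; it is not the project's implementation, the function names are hypothetical, and it ignores `Allow` rules and rule precedence.

```python
import re


def robots_pattern_to_regex(pattern: str) -> "re.Pattern[str]":
    # A trailing "$" anchors the rule to the end of the path.
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and join them with ".*" for each "*".
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_blocked(path: str, disallow_patterns) -> bool:
    # Empty Disallow values block nothing, so they are skipped.
    return any(
        robots_pattern_to_regex(p).match(path) for p in disallow_patterns if p
    )
```

For example, `is_blocked("/private/data.json", ["/private/*.json"])` is true, while a plain prefix comparison would miss the rule.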
ntohidi
ee93acbd06 fix(async_playwright_crawler): use config directly instead of self.config for verbosity check 2025-05-07 12:32:38 +02:00
Aravind Karnam
2b17f234f8 docs: stop passing content_filter directly to CrawlerRunConfig; pass it via MarkdownGenerator instead. Ref: #603 2025-05-07 15:20:36 +05:30
ntohidi
eebb8c84f0 fix(requirements): add PyPDF2 dependency for PDF processing 2025-05-07 11:18:44 +02:00
ntohidi
12783fabda fix(dependencies): update pillow version constraint to allow newer releases. ref #709 2025-05-07 11:18:13 +02:00
Aravind Karnam
39e3b792a1 Merge branch 'next' into 2025-APR-1 2025-05-07 10:25:25 +05:30
Aravind Karnam
aaf05910eb fix: removed unnecessary imports and installs 2025-05-06 15:53:55 +05:30
Aravind Karnam
a0555d5fa6 merge: from next branch 2025-05-06 15:16:47 +05:30
Aravind Karnam
38ebcbb304 fix: provide support for local llm by adding it to the arguments 2025-05-05 10:34:38 +05:30
UncleCode
9b5ccac76e feat(extraction): add RegexExtractionStrategy for pattern-based extraction
Add new RegexExtractionStrategy for fast, zero-LLM extraction of common data types:
- Built-in patterns for emails, URLs, phones, dates, and more
- Support for custom regex patterns
- LLM-assisted pattern generation utility
- Optimized HTML preprocessing with fit_html field
- Enhanced network response body capture

Breaking changes: None
2025-05-02 21:15:24 +08:00
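Pattern-based extraction with built-in and custom patterns can be sketched in a few lines. This is not the `RegexExtractionStrategy` API: `PATTERNS`, `extract`, and the simplified regular expressions are assumptions made for illustration.

```python
import re

# Simplified built-in patterns (assumed, not the library's own).
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "url": re.compile(r"https?://[^\s<>\"']+"),
}


def extract(text, custom=None):
    # Merge caller-supplied patterns over the built-ins.
    patterns = dict(PATTERNS)
    for label, source in (custom or {}).items():
        patterns[label] = re.compile(source)
    results = []
    for label, regex in patterns.items():
        for match in regex.finditer(text):
            results.append(
                {"label": label, "value": match.group(0), "span": match.span()}
            )
    return results


sample = "Contact ops@example.com or visit https://example.com/docs today. Order A-1234 shipped."
found = extract(sample, custom={"order": r"A-\d{4}"})
```

No model call is involved at extraction time, which is where the speed advantage over LLM-based strategies comes from.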
Aravind Karnam
87d4b0fff4 format bash scripts properly so copy & paste may work without issues 2025-05-02 17:21:09 +05:30
Aravind Karnam
bd5a9ac632 updated readme with arguments for litellm 2025-05-02 17:04:42 +05:30
Aravind Karnam
6650b2f34a fix: replace OpenAI with litellm to support multiple LLM providers 2025-05-02 16:51:15 +05:30
Aravind Karnam
5cc58f9bb3 fix: 1. duplicate verbose flag 2. inconsistency in argument name --profile-name 3. duplicate initialisation of env_defaults 2025-05-02 16:40:58 +05:30
Aravind Karnam
baf7f6a6f5 fix: typo in readme 2025-05-02 16:33:11 +05:30
ntohidi
e0cd3e10de fix(crawler): initialize captured_console variable for local file processing 2025-05-02 10:35:35 +02:00