crawl4ai

Author	SHA1	Message	Date
UncleCode	c3b7b7e918	Add linkedin example ipynb.	2025-05-25 17:55:22 +08:00
UncleCode	7d0b447e1c	Update setup script to clarify virtual display setup message	2025-05-25 16:55:18 +08:00
UncleCode	33b0e222ca	Add Colab utilities and rename setup function for clarity	2025-05-25 16:50:56 +08:00
UncleCode	1fc45ffac8	Fix temperature typo and enhance LinkedIn extraction with Colab support - Fixed widespread typo: `temprature` → `temperature` across LLMConfig and related files - Enhanced CSS/XPath selector guidance for more reliable LinkedIn data extraction - Added Google Colab display server support for running Crawl4AI in notebook environments - Improved browser debugging with verbose startup args logging - Updated LinkedIn schemas and HTML snippets for better parsing accuracy 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-05-25 16:47:12 +08:00
devin-ai-integration[bot]	9c2cc7f73c	Fix BM25ContentFilter documentation to use language parameter instead of use_stemming (#1152 ) Co-authored-by: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com> Co-authored-by: UncleCode <unclecode@kidocode.com>	2025-05-25 10:02:13 +08:00
UncleCode	1c5e76d51a	Adjust positioning and set only core component as selected item by default	2025-05-24 20:49:44 +08:00
UncleCode	7665a6832f	Add LLMContext article and update JS to not show all components.	2025-05-24 20:46:24 +08:00
UncleCode	a06710ff03	Adding LLMContext generator to website.	2025-05-24 20:37:09 +08:00
unclecode	ad078c3f18	fix(pdf): add timeout to PDF downloads to prevent hanging (#1141 ) - Added timeout=(20, 600) to requests.get() to prevent indefinite hanging - Added download progress logging for better visibility - Improved error handling with specific timeout exceptions - Better temp file cleanup tracking Fixes #1141	2025-05-23 16:05:44 +08:00
unclecode	400a6621ee	Add debug folder to gitignore	2025-05-23 10:43:05 +08:00
Aravind Karnam	3d46d89759	docs: fix https://github.com/unclecode/crawl4ai/issues/1109	2025-05-22 17:21:42 +05:30
ntohidi	da8f0dbb93	fix(browser_profiler): change logger print to info for consistent logging in interactive manager	2025-05-22 11:25:51 +02:00
ntohidi	33a0c7a17a	fix(logger): add RED color to LogColor enum for enhanced logging options	2025-05-22 11:17:28 +02:00
UncleCode	bf56787874	refactor(browser): remove commented-out code for clarity	2025-05-21 20:32:40 +08:00
UncleCode	08ad7ef257	feat(browser): improve browser session management and profile handling Enhance browser session management with the following improvements: - Add state cloning between browser contexts - Implement smarter page closing logic based on total pages and browser config - Add storage state persistence during profile creation - Improve managed browser context handling with storage state support This change improves browser session reliability and persistence across runs.	2025-05-21 20:23:17 +08:00
Ahmed-Tawfik94	984524ca1c	fix(auth): add token authorization header in request preparation to ensure authenticated requests are made	2025-05-21 13:27:17 +08:00
UncleCode	1c0ce41328	Fix managed browser page retrieval when no pages (#1137 ) This pull request addresses the issue of handling default context pages when none are open. - Introduces a conditional check to determine if a page exists in the context. - If no pages exist, a new page is created via await context.new_page().	2025-05-20 21:12:32 +08:00
ntohidi	cb8d581e47	fix(docs): update CrawlerRunConfig to use CacheMode for bypassing cache. REF: #1125	2025-05-19 18:03:05 +02:00
Ahmed-Tawfik94	a55c2b3f88	refactor(logging): update extraction logging to use url_status method	2025-05-19 16:32:22 +08:00
Ahmed Tawfik	ce09648af1	Merge pull request #1054 from Sacristaan/feature/readme_example Fix: README.md urls list	2025-05-19 14:20:21 +08:00
Ahmed-Tawfik94	a97654270b	#1086 fix(markdown): update BM25 filter to use language parameter for stemming	2025-05-19 14:11:46 +08:00
Ahmed-Tawfik94	b4fc60a555	#1103 fix(url): enhance URL normalization to handle invalid schemes and trailing slashes	2025-05-19 13:51:16 +08:00
Ahmed-Tawfik94	137ac014fb	#1105 :fix(metadata): optimize article metadata extraction using XPath for improved performance	2025-05-19 13:48:02 +08:00
Ahmed-Tawfik94	faa98eefbc	#1105 got fixed (metadata now matches with meta property article:*)	2025-05-19 11:35:13 +08:00
UncleCode	85ac6fa523	Merge branch 'next' of https://github.com/unclecode/crawl4ai into next	2025-05-17 19:04:03 +08:00
UncleCode	becc4624bb	feat(favicon): add new favicon images for improved branding	2025-05-17 19:03:51 +08:00
UncleCode	754ba731fa	Fix chunk splitting utilities (#1122 ) * Fix merge_chunks splitter usage and remove incorrect return * 📝 Add docstrings to `codex/find-and-fix-a-bug` (#1123) Docstrings generation was requested by @unclecode. * https://github.com/unclecode/crawl4ai/pull/1122#issuecomment-2887985865 The following files were modified: * `crawl4ai/utils.py` Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> --------- Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>	2025-05-17 15:06:53 +08:00
UncleCode	ac9981a1f5	feat(favicon): add favicon image and update mkdocs configuration	2025-05-16 21:59:23 +08:00
UncleCode	83ef15fd47	feat(favicon): add favicon.ico for improved branding	2025-05-16 21:55:07 +08:00
UncleCode	a3cb938675	feat(theme): enable dark color mode in mkdocs configuration	2025-05-16 21:44:56 +08:00
UncleCode	9b60988232	feat(feedback): add feedback modal styles and integrate into mkdocs configuration	2025-05-16 21:25:10 +08:00
UncleCode	98e951f611	fix(mkdocs): remove duplicate gtag.js entry in extra_javascript	2025-05-16 20:52:41 +08:00
UncleCode	baca2df8df	feat(analytics): add Google Tag Manager script and gtag.js for tracking	2025-05-16 20:49:02 +08:00
UncleCode	8a5e23d374	feat(crawler): add separate timeout for wait_for condition Adds a new wait_for_timeout parameter to CrawlerRunConfig that allows specifying a separate timeout for the wait_for condition, independent of the page_timeout. This provides more granular control over waiting behaviors in the crawler. Also removes unused colorama dependency and updates LinkedIn crawler example. BREAKING CHANGE: LinkedIn crawler example now uses different wait_for_images timing	2025-05-16 17:00:45 +08:00
ntohidi	22725ca87b	fix(crawler): initialize `captured_console` to prevent unbound local error for local HTML files. REF: #1072 Resolved a bug where running the crawler on local HTML files with `capture_console_messages=False` (default) raised `UnboundLocalError` due to `captured_console` being accessed before assignment.	2025-05-15 11:29:36 +02:00
ntohidi	e0fbd2b0a0	fix(schema): update `f` parameter description to use lowercase enum values. REF: #1070 Revised the description for the `f` parameter in the `/mcp/md` tool schema to use lowercase enum values (`raw`, `fit`, `bm25`, `llm`) for consistency with the actual `enum` definition. This change prevents LLM-based clients (e.g., Gemini via LibreChat) from generating uppercase values like `"FIT"`, which caused 422 validation errors due to strict case-sensitive matching.	2025-05-15 10:45:23 +02:00
ntohidi	32966bea11	fix(extraction): resolve `'str' object has no attribute 'choices'` error in LLMExtractionStrategy. Refs: #979 This patch ensures consistent handling of `response.choices[0].message.content` by avoiding redefinition of the `response` variable, which caused downstream exceptions during error handling.	2025-05-15 10:09:19 +02:00
Ahmed-Tawfik94	a3b0cab52a	#1088 is solved: flag -bc is now for --byPass-cache	2025-05-15 11:25:06 +08:00
medo94my	137556b3dc	fix the EXTRACT to match the styling of the other methods	2025-05-14 16:01:10 +08:00
ntohidi	260e2dc347	fix(browser): create browser config before launching managed browser instance. REF: https://discord.com/channels/1278297938551902308/1278298697540567132/1371683009459392716	2025-05-13 14:03:20 +02:00
ntohidi	25d97d56e4	fix(dependencies): remove duplicated aiofiles from project dependencies. REF #1045	2025-05-13 13:56:12 +02:00
Aravind Karnam	98a56e6e01	Merge next branch	2025-05-13 17:12:11 +05:30
UncleCode	897e017361	Set version to 0.6.3 vr0.6.3 v0.6.3	2025-05-12 21:20:10 +08:00
UncleCode	a3e9ef91ad	fix(crawler): remove automatic page closure in screenshot methods Removes automatic page closure in take_screenshot and take_screenshot_naive methods to prevent premature closure of pages that might still be needed in the calling context. This allows for more flexible page lifecycle management by the caller. BREAKING CHANGE: Page objects are no longer automatically closed after taking screenshots. Callers must explicitly handle page closure when appropriate.	2025-05-12 21:17:57 +08:00
UncleCode	76dd86d1b3	Merge remote-tracking branch 'origin/linkedin-prep' into next	2025-05-08 17:13:59 +08:00
UncleCode	206a9dfabd	feat(crawler): add session management and view-source support Add session_id feature to allow reusing browser pages across multiple crawls. Add support for view-source: protocol in URL handling. Fix browser config reference and string formatting issues. Update examples to demonstrate new session management features. BREAKING CHANGE: Browser page handling now persists when using session_id	2025-05-08 17:13:35 +08:00
ntohidi	1af3d1c2e0	Merge branch '2025-APR-1' of https://github.com/unclecode/crawl4ai into 2025-APR-1	2025-05-08 11:11:32 +02:00
Aravind Karnam	c1041b9bbe	fix: exclude_external_images flag simply discards elements ref:https://github.com/unclecode/crawl4ai/issues/345	2025-05-07 18:43:29 +05:30
Aravind Karnam	f6e25e2a6b	fix: check_robots_txt to support wildcard rules ref: #699	2025-05-07 17:53:30 +05:30
ntohidi	ee93acbd06	fix(async_playwright_crawler): use config directly instead of self.config for verbosity check	2025-05-07 12:32:38 +02:00
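
Commit 22725ca87b in the log above describes an `UnboundLocalError` caused by `captured_console` being read before it was assigned when `capture_console_messages=False`. The sketch below shows that general failure mode and the usual remedy of initializing the variable up front. It is an illustration only, not the project's code: the function names and the sample log string are hypothetical.

```python
def collect_console_buggy(capture_console_messages: bool) -> list:
    # Hypothetical reproduction: the variable is only bound inside the branch.
    if capture_console_messages:
        captured_console = ["log: page loaded"]
    # Raises UnboundLocalError when the flag is False.
    return captured_console


def collect_console_fixed(capture_console_messages: bool) -> list:
    # Bind the variable before any branch so every code path can read it.
    captured_console = []
    if capture_console_messages:
        captured_console.append("log: page loaded")
    return captured_console
```

With the flag left at `False`, the fixed variant returns an empty list, while the buggy variant raises at the `return` statement.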

973 Commits