Commit Graph

  • b2f3cb0dfa WIP: logger migrate to rich wakaka6 2025-04-10 23:02:19 +08:00
  • 18e8227dfb feat(crawler): add console message capture functionality UncleCode 2025-04-10 23:26:09 +08:00
  • 7c358a1aee fix(browser): add null check for crawlerRunConfig.url UncleCode 2025-04-10 23:25:07 +08:00
  • 108b2a8bfb Fixed capturing console messages for the case where the URL is a local file. Update docker configuration (work in progress) UncleCode 2025-04-10 23:22:38 +08:00
  • 66ac07b4f3 feat(crawler): add network request and console message capturing unclecode 2025-04-10 16:03:48 +08:00
  • a2061bf31e feat(crawler): add MHTML capture functionality UncleCode 2025-04-09 15:39:04 +08:00
  • 6f7ab9c927 fix: Revert changes to session management in AsyncHttpWebcrawler and solve the underlying issue by removing the session closure in finally block of session context. Aravind Karnam 2025-04-08 18:31:00 +05:30
  • 9038e9acbd Merge branch 'main' into next UncleCode 2025-04-08 17:43:42 +08:00
  • 02e627e0bd fix(crawler): simplify page retrieval logic in AsyncPlaywrightCrawlerStrategy UncleCode 2025-04-08 17:43:36 +08:00
  • 72d8e679ad feat(pipeline): add high-level Crawler utility class for simplified web crawling next-2-batch-crawl UncleCode 2025-04-07 22:50:44 +08:00
  • 67a790b4a6 Add test file for Pipeline batch crawl. UncleCode 2025-04-06 19:38:31 +08:00
  • 5b66208a7e Refactor next branch UncleCode 2025-04-06 18:33:09 +08:00
  • d95b2dc9f2 Some refactoring, move pipeline submodule folder into the main. UncleCode 2025-04-06 18:28:28 +08:00
  • 591f55edc7 refactor(browser): rename methods and update type hints in BrowserHub for clarity UncleCode 2025-04-06 18:22:05 +08:00
  • e1d9e2489c refactor(docs): update import statement in quickstart.py for improved clarity UncleCode 2025-04-05 23:12:06 +08:00
  • b1693b1c21 Remove old quickstart files UncleCode 2025-04-05 23:10:25 +08:00
  • 49d904ca0a refactor(docs): enhance quickstart_examples.py with improved configuration and file handling UncleCode 2025-04-05 22:57:45 +08:00
  • ca9351252a refactor(docs): update import paths and clean up example code in quickstart_examples.py UncleCode 2025-04-05 22:55:56 +08:00
  • 935d9d39f8 Add quickstart example set UncleCode 2025-04-05 21:37:25 +08:00
  • f8213c32b9 Merge branch 'vr0.5.0.post8' UncleCode 2025-04-05 21:36:17 +08:00
  • 14894b4d70 feat(config): set DefaultMarkdownGenerator as the default markdown generator in CrawlerRunConfig feat(logger): add color mapping for log message formatting options UncleCode 2025-04-03 20:34:19 +08:00
  • 7155778eac chore: move from faust-cchardet to chardet Aravind Karnam 2025-04-03 17:42:51 +05:30
  • 4133e5460d typo-fix: https://github.com/unclecode/crawl4ai/pull/918 Aravind Karnam 2025-04-03 17:42:24 +05:30
  • 73fda8a6ec fix: address the PR review: https://github.com/unclecode/crawl4ai/pull/899#discussion_r2024639193 Aravind Karnam 2025-04-03 13:47:13 +05:30
  • 86df20234b fix(crawler): handle exceptions in get_page call to ensure page retrieval UncleCode 2025-04-02 21:25:24 +08:00
  • 179921a131 fix(crawler): update get_page call to include additional return value UncleCode 2025-04-02 19:01:30 +08:00
  • 9e16a4bb26 Merge next and resolve conflicts Aravind Karnam 2025-04-02 12:18:23 +05:30
  • c5cac2b459 feat(browser): add BrowserHub for centralized browser management and resource sharing UncleCode 2025-04-01 20:35:02 +08:00
  • 555455d710 feat(browser): implement browser pooling and page pre-warming UncleCode 2025-03-31 21:55:07 +08:00
  • 765f856ed4 Merge pull request #808 from dvschuyl/bug/parse-srcset-fix-float-width Aravind 2025-03-31 18:21:09 +05:30
  • 757e3177ed fix: https://github.com/unclecode/crawl4ai/issues/839 Aravind Karnam 2025-03-31 17:10:04 +05:30
  • d8357e80d2 Merge pull request #915 from maggie-edkey/css-selector Aravind 2025-03-31 13:03:35 +05:30
  • ef1f0c4102 fix:https://github.com/unclecode/crawl4ai/issues/701 Aravind Karnam 2025-03-31 12:43:32 +05:30
  • 1119f2f5b5 fix: https://github.com/unclecode/crawl4ai/issues/911 maggie.wang 2025-03-31 14:05:54 +08:00
  • bb02398086 refactor(browser): improve browser strategy architecture and lifecycle management UncleCode 2025-03-30 20:58:39 +08:00
  • 3ff7eec8f3 refactor(browser): consolidate browser strategy implementations UncleCode 2025-03-28 22:47:28 +08:00
  • d8cbeff386 fix: https://github.com/unclecode/crawl4ai/issues/842 Aravind Karnam 2025-03-28 19:31:05 +05:30
  • 64f20ab44a refactor(docker): update Dockerfile and browser strategy to use Chromium UncleCode 2025-03-28 15:59:02 +08:00
  • 57e0423b3a fix:target_element should not affect link extraction. -> https://github.com/unclecode/crawl4ai/issues/902 Aravind Karnam 2025-03-28 12:56:37 +05:30
  • c635f6b9a2 refactor(browser): reorganize browser strategies and improve Docker implementation UncleCode 2025-03-27 21:35:13 +08:00
  • 7be5427283 Merge branch 'next' into 2025-MAR-ALPHA-1 Aravind Karnam 2025-03-27 12:29:32 +05:30
  • 7f93e88379 refactor(tests): remove unused imports in test_docker_browser.py UncleCode 2025-03-26 15:19:29 +08:00
  • 40d4dd36c9 chore(version): bump version to 0.5.0.post8 and update post-installation setup UncleCode 2025-03-25 21:56:49 +08:00
  • d8f38f2298 chore(version): bump version to 0.5.0.post7 UncleCode 2025-03-25 21:47:19 +08:00
  • 5c88d1310d feat(cli): add output file option and integrate LXML web scraping strategy UncleCode 2025-03-25 21:38:24 +08:00
  • 4a20d7f7c2 feat(cli): add quick JSON extraction and global config management UncleCode 2025-03-25 20:30:25 +08:00
  • 585e5e5973 fix: https://github.com/unclecode/crawl4ai/issues/733 Aravind Karnam 2025-03-25 15:17:59 +05:30
  • e3111d0a32 fix: prevent session closing after each request to maintain connection pool. Fixes: https://github.com/unclecode/crawl4ai/issues/867 Aravind Karnam 2025-03-25 13:46:55 +05:30
  • 2f0e217751 Chore: Add brotli as dependency to fix: https://github.com/unclecode/crawl4ai/issues/867 Aravind Karnam 2025-03-25 13:44:41 +05:30
  • 6405cf0a6f Merge branch 'vr0.5.0.post5' into next UncleCode 2025-03-25 14:51:29 +08:00
  • 6eed4adc65 Merge branch 'vr0.5.0.post5' UncleCode 2025-03-25 12:24:07 +08:00
  • bdd9db579a chore(version): bump version to 0.5.0.post6 vr0.5.0.post5 UncleCode 2025-03-25 12:01:36 +08:00
  • 1107fa1d62 feat(cli): enhance markdown generation with default content filters UncleCode 2025-03-25 11:56:00 +08:00
  • efa73257c5 Merge branch 'next' into 2025-MAR-ALPHA-1 Aravind Karnam 2025-03-24 21:57:29 +05:30
  • 4dfd270161 fix: #855 run-many-deep-crawling UncleCode 2025-03-24 22:54:53 +08:00
  • 8c08521301 feat(browser): add Docker-based browser automation strategy UncleCode 2025-03-24 21:36:58 +08:00
  • 462d5765e2 fix(browser): improve storage state persistence in CDP strategy UncleCode 2025-03-23 21:06:41 +08:00
  • 6eeb2e4076 feat(browser): enhance browser context creation with user data directory support and improved storage state handling UncleCode 2025-03-23 19:07:13 +08:00
  • 0094cac675 refactor(browser): improve parallel crawling and browser management UncleCode 2025-03-23 18:53:24 +08:00
  • 4ab0893ffb feat(browser): implement modular browser management system UncleCode 2025-03-21 22:50:00 +08:00
  • e01d1e73e1 fix: link normalisation in BestFirstStrategy Aravind Karnam 2025-03-21 17:34:13 +05:30
  • 471d110c5e fix: url normalisation ref: https://github.com/unclecode/crawl4ai/issues/841 Aravind Karnam 2025-03-21 16:48:07 +05:30
  • f89113377a fix: Move adding of visited urls to the 'visited' set, when queueing the URLs instead of after dequeuing, this is to prevent duplicate crawls. https://github.com/unclecode/crawl4ai/issues/843 Aravind Karnam 2025-03-21 13:44:57 +05:30
  • 6740e87b4d fix: remove trailing slash when the path is empty. This is causing duplicate crawls Aravind Karnam 2025-03-21 13:41:31 +05:30
  • 8b761f232b fix: improve logged url readability by decoding encoded urls Aravind Karnam 2025-03-21 13:40:23 +05:30
  • e0c2a7c284 chore: remove mistakenly committed deps.txt file Aravind Karnam 2025-03-21 11:06:46 +05:30
  • ac2f9ae533 fix: streamline url status logging via single entrypoint i.e. logger.url_status Aravind Karnam 2025-03-20 18:59:15 +05:30
  • eedda1ae5c fix: Truncate long URLs in the middle rather than at the end, since users are confused into thinking the same URL is being scraped several times. Also replace the labels on status and timer with symbols to save space and display more of the URL Aravind Karnam 2025-03-20 18:56:19 +05:30
  • 8cecbec7a7 Merge branch 'next' into 2025-MAR-ALPHA-1 Aravind Karnam 2025-03-20 17:07:53 +05:30
  • 6432ff1257 feat(browser): add builtin browser management system UncleCode 2025-03-20 12:13:59 +08:00
  • 4359b12003 docs + fix: Update example for full page screenshot & PDF export. Fix the bug Error: crawl4ai.async_webcrawler.AsyncWebCrawler.aprocess_html() got multiple values for keyword argument - for screenshot param. https://github.com/unclecode/crawl4ai/issues/822#issuecomment-2732602118 Aravind Karnam 2025-03-18 17:20:24 +05:30
  • 5358ac0fc2 refactor: clean up imports and improve JSON schema generation instructions UncleCode 2025-03-18 18:53:34 +08:00
  • 529a79725e docs: remove hallucinations from docs for CrawlerRunConfig + Add chunking strategy docs in the table Aravind Karnam 2025-03-18 16:14:00 +05:30
  • 9109ecd8fc chore: Raise an exception with clear messaging when body tag is missing in the fetched html. The message should warn users to add appropriate wait_for condition to wait until body tag is loaded into DOM. fixes: https://github.com/unclecode/crawl4ai/issues/804 Aravind Karnam 2025-03-18 15:26:20 +05:30
  • 84883be513 Merge branch 'next' into 2025-MAR-ALPHA-1 Aravind Karnam 2025-03-18 15:12:21 +05:30
  • 79328e4292 Create main.yml (#846) Aravind 2025-03-17 18:17:57 +05:30
  • a24799918c feat(llm): add additional LLM configuration parameters UncleCode 2025-03-14 21:36:23 +08:00
  • a31d7b86be feat(changelog): update CHANGELOG for version 0.5.0.post5 with new features, changes, fixes, and breaking changes UncleCode 2025-03-14 15:26:37 +08:00
  • 7884a98be7 feat(crawler): add experimental parameters support and optimize browser handling UncleCode 2025-03-14 14:39:24 +08:00
  • c190ba816d refactor: Instead of custom validation of question, rely on the built in FastAPI validator, so generated API docs also reflects this expectation correctly Aravind Karnam 2025-03-14 09:40:50 +05:30
  • a3954dd4c6 refactor: Move the checking of protocol and prepending protocol inside api handlers Aravind Karnam 2025-03-14 09:39:10 +05:30
  • 6e3c048328 feat(api): refactor crawl request handling to streamline single and multiple URL processing UncleCode 2025-03-13 22:30:38 +08:00
  • b750542e6d feat(crawler): optimize single URL handling and add performance comparison UncleCode 2025-03-13 22:15:15 +08:00
  • cbb8755972 Merge branch 'next' into 2025-MAR-ALPHA-1 Aravind Karnam 2025-03-13 10:42:22 +05:30
  • dc36997a08 feat(schema): improve HTML preprocessing for schema generation UncleCode 2025-03-12 22:40:46 +08:00
  • 1630fbdafe feat(monitor): add real-time crawler monitoring system with memory management UncleCode 2025-03-12 19:05:24 +08:00
  • 341b7a5f2a 🐛 Truncate width to integer string in parse_srcset dvschuyl 2025-03-11 11:05:14 +01:00
  • 3ea3c0520d Add all 5 deployments solution for testing deploy UncleCode 2025-03-10 18:57:14 +08:00
  • 9547bada3a feat(content): add target_elements parameter for selective content extraction UncleCode 2025-03-10 18:54:51 +08:00
  • 9d69fce834 feat(scraping): add smart table extraction and analysis capabilities UncleCode 2025-03-09 21:31:33 +08:00
  • c6a605ccce feat(filters): add reverse option to URLPatternFilter UncleCode 2025-03-08 18:54:41 +08:00
  • 4aeb7ef9ad refactor(proxy): consolidate proxy configuration handling UncleCode 2025-03-07 23:14:11 +08:00
  • a68cbb232b feat(browser): add standalone CDP browser launch and lxml extraction strategy UncleCode 2025-03-07 20:55:56 +08:00
  • e1b3bfe6fb Merge branch 'vr0.5.0.post4' UncleCode 2025-03-06 22:46:44 +08:00
  • f78c46446b feat(deep-crawling): improve URL normalization and domain filtering UncleCode 2025-03-06 22:45:57 +08:00
  • 1b72880007 chore(version): bump version to 0.5.0.post3 UncleCode 2025-03-06 20:32:32 +08:00
  • 29f7915b79 fix(models): support float timestamps in CrawlStats UncleCode 2025-03-06 20:30:57 +08:00
  • 2327db6fdc refactor(crawler): introduce CrawlResultContainer and simplify interfaces UncleCode 2025-03-05 22:23:08 +08:00
  • fd02dc782d Merge branch 'main' of https://github.com/unclecode/crawl4ai UncleCode 2025-03-05 17:15:48 +08:00
  • 3a234ec950 fix(auth): make JWT authentication optional with fallback UncleCode 2025-03-05 17:14:42 +08:00