Commit Graph

643 Commits

Author SHA1 Message Date
UncleCode
a9b6b65238 chore: update version to 0.3.744 and add publish.sh to .gitignore 2024-11-28 19:26:50 +08:00
UncleCode
a036b7f122 feat: implement create_box_message utility for formatted error messages and enhance error logging in AsyncWebCrawler 2024-11-28 19:24:07 +08:00
UncleCode
0bccf23db3 docs: update quickstart_async.py to enable example function calls for better demonstration 2024-11-28 18:19:42 +08:00
UncleCode
0cbd594512 Merge branch 'next' - Update README, and quickstart examples 2024-11-28 16:43:16 +08:00
UncleCode
efe93a5f57 docs: enhance README with development TODOs and refine mission statement for clarity 2024-11-28 16:41:11 +08:00
UncleCode
3fda66b85b docs: refine README content for clarity and conciseness, improving descriptions and formatting 2024-11-28 16:36:24 +08:00
UncleCode
ddfb6707b4 docs: update README to reflect new branding and improve section headings for clarity 2024-11-28 16:34:08 +08:00
UncleCode
a69f7a9531 fix: correct typo in function documentation for clarity and accuracy 2024-11-28 16:31:41 +08:00
UncleCode
d583aa43ca refactor: update cache handling in quickstart_async example to use CacheMode enum 2024-11-28 15:53:25 +08:00
UncleCode
3abb573142 docs: update README for version 0.3.743 with improved formatting and contributor acknowledgments 2024-11-28 13:07:59 +08:00
UncleCode
d556dada9f docs: update README to keep details open for extraction capabilities, browser integration, input/output flexibility, utility & debugging, security & accessibility, community & documentation, and cutting-edge features 2024-11-28 13:07:33 +08:00
UncleCode
ce7d49484f docs: update README for version 0.3.743 with new features, enhancements, and contributor acknowledgments 2024-11-28 13:06:46 +08:00
UncleCode
e4acd18429 docs: update README for version 0.3.743 with new features, enhancements, and contributor acknowledgments 2024-11-28 13:06:30 +08:00
UncleCode
c2d4784810 fix: resolve merge conflict in DefaultMarkdownGenerator affecting fit_markdown generation 2024-11-28 12:56:31 +08:00
UncleCode
76bea6c577 Merge branch 'main' into 0.3.743 2024-11-28 12:53:30 +08:00
UncleCode
3ff0b0b2c4 feat: update changelog for version 0.3.743 with new features, improvements, and contributor acknowledgments 2024-11-28 12:48:07 +08:00
UncleCode
a1c7dc17ce Merge branch 'next' of https://github.com/unclecode/crawl4ai into next 2024-11-28 12:45:57 +08:00
UncleCode
24723b2f10 Enhance features and documentation
- Updated version to 0.3.743
  - Improved ManagedBrowser configuration with dynamic host/port
  - Implemented fast HTML formatting in web crawler
  - Enhanced markdown generation with a new generator class
  - Improved sanitization and utility functions
  - Added contributor details and pull request acknowledgments
  - Updated documentation for clearer usage scenarios
  - Adjusted tests to reflect class name changes
2024-11-28 12:45:05 +08:00
Hamza Farhan
f998e9e949 Fix: handled the cases where markdown_with_citations, references_markdown, and filtered_html might not be defined. (#293)
Thanks, dear Farhan, for the changes you made in the code. I accepted and merged them into the main branch. Also, I will add your name to our contributor list. Thank you so much.
2024-11-27 19:20:54 +08:00
zhounan
73661f7d1f docs: enhance development installation instructions (#286)
Thanks for your contribution. I'm merging your changes and I'll add your name to our contributor list. Thank you so much.
2024-11-27 15:04:20 +08:00
UncleCode
b5d4db07d1 Merge branch 'main' of https://github.com/unclecode/crawl4ai 2024-11-27 14:55:58 +08:00
UncleCode
c6a022132b docs: update CONTRIBUTORS.md to acknowledge aadityakanjolia4 for fixing 'CustomHTML2Text' bug 2024-11-27 14:55:56 +08:00
Aravind Karnam
2f5e0598bb updated definition of can_process_url to include dept as an argument, as it's needed to skip filters for start_url 2024-11-26 18:26:57 +05:30
Aravind Karnam
ff731e4ea1 fixed the final scraper_quickstart.py example 2024-11-26 17:08:32 +05:30
Aravind Karnam
9530ded83a fixed the final scraper_quickstart.py example 2024-11-26 17:05:54 +05:30
Aravind Karnam
155c756238 <Future pending> issue fix was incorrect. Reverting 2024-11-26 17:04:04 +05:30
Aravind Karnam
a888c91790 Fix "Future attached to a different loop" error by ensuring tasks are created in the correct event loop
- Explicitly retrieve and use the correct event loop when creating tasks to avoid cross-loop issues.
- Ensures proper task scheduling in environments with multiple event loops.
2024-11-26 14:05:02 +05:30
Aravind Karnam
a98d51a62c Remove the can_process_url check from _process_links since it's already being checked in process_url 2024-11-26 11:11:49 +05:30
Aravind Karnam
ee3001b1f7 fix: moved depth as a param to can_process_url and applying filter chain only when depth is not zero. This way
filter chain is skipped but other validations are in place even for start URL
2024-11-26 10:22:14 +05:30
Aravind Karnam
b13fd71040 chore: 1. Expose process_external_links as a param
2. Removed a few unused imports
3. Removed URL normalisation for external links separately as that won't be necessary
2024-11-26 10:07:11 +05:30
unclecode
195c0ccf8a chore: remove deprecated Docker Compose configurations for crawl4ai service 2024-11-24 19:40:27 +08:00
unclecode
b09a86c0c1 chore: remove deprecated Docker Compose configurations for crawl4ai service 2024-11-24 19:40:10 +08:00
unclecode
de43505ae4 feat: update version to 0.3.742 2024-11-24 19:36:30 +08:00
unclecode
d7c5b900b8 feat: add support for arm64 platform in Docker commands and update INSTALL_TYPE variable in docker-compose 2024-11-24 19:35:53 +08:00
unclecode
edad7b6a74 chore: remove Railway deployment configuration and related documentation 2024-11-24 18:48:39 +08:00
UncleCode
829a1f7992 feat: update version to 0.3.741 and enhance content filtering with heuristic strategy. Fixing the issue that when the past HTML to BM25 content filter does not have any HTML elements. 2024-11-23 19:45:41 +08:00
UncleCode
d729aa7d5e refactor: Add group ID to for images extracted from srcset. 2024-11-23 18:00:32 +08:00
Aravind Karnam
2226ef53c8 fix: Exempting the start_url from can_process_url 2024-11-23 14:59:14 +05:30
aravind
3d52b551f2 Merge pull request #8 from aravindkarnam/main
Pulling in 0.3.74
2024-11-23 13:57:36 +05:30
Aravind Karnam
f8e85b1499 Fixed a bug in _process_links, handled condition for when url_scorer is passed as None, renamed the scrapper folder to scraper. 2024-11-23 13:52:34 +05:30
Aravind Karnam
c1797037c0 Fixed a few bugs, import errors and changed to asyncio wait_for instead of timeout to support python versions < 3.11 2024-11-23 12:39:25 +05:30
UncleCode
0d0cef3438 feat: add enhanced markdown generation example with citations and file output 2024-11-22 20:14:58 +08:00
UncleCode
d7a112fefe Merge branch 'main' of https://github.com/unclecode/crawl4ai 2024-11-22 19:56:56 +08:00
UncleCode
a5decaa7cf Merge branch '0.3.74' 2024-11-22 19:55:52 +08:00
UncleCode
8dea3f470f chore: update README to include new features and improvements for version 0.3.74 2024-11-22 18:50:12 +08:00
UncleCode
e02935dc5b chore: update README to reflect new features and improvements in version 0.3.74 2024-11-22 18:49:22 +08:00
UncleCode
24ad2fe2dd feat: enhance Markdown generation to include fit_html attribute 2024-11-22 18:47:17 +08:00
UncleCode
571dda6549 Update Redme 2024-11-22 18:27:43 +08:00
UncleCode
006bee4a5a feat: enhance image processing capabilities
- Enhanced image processing with srcset support and validation checks for better image selection.
2024-11-22 16:00:17 +08:00
UncleCode
dbb751c8f0 In this commit, we introduce the new concept of MakrdownGenerationStrategy, which allows us to expand our future strategies to generate better markdown. Right now, we generate raw markdown as we were doing before. We have a new algorithm for fitting markdown based on BM25, and now we add the ability to refine markdown into a citation form. Our links will be extracted and replaced by a citation reference number, and then we will have reference sections at the very end; we add all the links with the descriptions. This format is more suitable for large language models. In case we don't need to pass links, we can reduce the size of the markdown significantly and also attach the list of references as a separate file to a large language model. This commit contains changes for this direction. 2024-11-21 18:21:43 +08:00