UncleCode
a9b6b65238
chore: update version to 0.3.744 and add publish.sh to .gitignore
2024-11-28 19:26:50 +08:00
UncleCode
a036b7f122
feat: implement create_box_message utility for formatted error messages and enhance error logging in AsyncWebCrawler
2024-11-28 19:24:07 +08:00
UncleCode
0bccf23db3
docs: update quickstart_async.py to enable example function calls for better demonstration
2024-11-28 18:19:42 +08:00
UncleCode
0cbd594512
Merge branch 'next' - Update README and quickstart examples
2024-11-28 16:43:16 +08:00
UncleCode
efe93a5f57
docs: enhance README with development TODOs and refine mission statement for clarity
2024-11-28 16:41:11 +08:00
UncleCode
3fda66b85b
docs: refine README content for clarity and conciseness, improving descriptions and formatting
2024-11-28 16:36:24 +08:00
UncleCode
ddfb6707b4
docs: update README to reflect new branding and improve section headings for clarity
2024-11-28 16:34:08 +08:00
UncleCode
a69f7a9531
fix: correct typo in function documentation for clarity and accuracy
2024-11-28 16:31:41 +08:00
UncleCode
d583aa43ca
refactor: update cache handling in quickstart_async example to use CacheMode enum
2024-11-28 15:53:25 +08:00
UncleCode
3abb573142
docs: update README for version 0.3.743 with improved formatting and contributor acknowledgments
2024-11-28 13:07:59 +08:00
UncleCode
d556dada9f
docs: update README to keep details open for extraction capabilities, browser integration, input/output flexibility, utility & debugging, security & accessibility, community & documentation, and cutting-edge features
2024-11-28 13:07:33 +08:00
UncleCode
ce7d49484f
docs: update README for version 0.3.743 with new features, enhancements, and contributor acknowledgments
2024-11-28 13:06:46 +08:00
UncleCode
e4acd18429
docs: update README for version 0.3.743 with new features, enhancements, and contributor acknowledgments
2024-11-28 13:06:30 +08:00
UncleCode
c2d4784810
fix: resolve merge conflict in DefaultMarkdownGenerator affecting fit_markdown generation
2024-11-28 12:56:31 +08:00
UncleCode
76bea6c577
Merge branch 'main' into 0.3.743
2024-11-28 12:53:30 +08:00
UncleCode
3ff0b0b2c4
feat: update changelog for version 0.3.743 with new features, improvements, and contributor acknowledgments
2024-11-28 12:48:07 +08:00
UncleCode
a1c7dc17ce
Merge branch 'next' of https://github.com/unclecode/crawl4ai into next
2024-11-28 12:45:57 +08:00
UncleCode
24723b2f10
Enhance features and documentation
...
- Updated version to 0.3.743
- Improved ManagedBrowser configuration with dynamic host/port
- Implemented fast HTML formatting in web crawler
- Enhanced markdown generation with a new generator class
- Improved sanitization and utility functions
- Added contributor details and pull request acknowledgments
- Updated documentation for clearer usage scenarios
- Adjusted tests to reflect class name changes
2024-11-28 12:45:05 +08:00
Hamza Farhan
f998e9e949
Fix: handle the cases where markdown_with_citations, references_markdown, and filtered_html might not be defined. (#293)
...
Thanks, dear Farhan, for the changes you made in the code. I accepted and merged them into the main branch. Also, I will add your name to our contributor list. Thank you so much.
2024-11-27 19:20:54 +08:00
zhounan
73661f7d1f
docs: enhance development installation instructions (#286)
...
Thanks for your contribution. I'm merging your changes and I'll add your name to our contributor list. Thank you so much.
2024-11-27 15:04:20 +08:00
UncleCode
b5d4db07d1
Merge branch 'main' of https://github.com/unclecode/crawl4ai
2024-11-27 14:55:58 +08:00
UncleCode
c6a022132b
docs: update CONTRIBUTORS.md to acknowledge aadityakanjolia4 for fixing 'CustomHTML2Text' bug
2024-11-27 14:55:56 +08:00
Aravind Karnam
2f5e0598bb
updated definition of can_process_url to include depth as an argument, as it's needed to skip filters for start_url
2024-11-26 18:26:57 +05:30
Aravind Karnam
ff731e4ea1
fixed the final scraper_quickstart.py example
2024-11-26 17:08:32 +05:30
Aravind Karnam
9530ded83a
fixed the final scraper_quickstart.py example
2024-11-26 17:05:54 +05:30
Aravind Karnam
155c756238
<Future pending> issue fix was incorrect. Reverting
2024-11-26 17:04:04 +05:30
Aravind Karnam
a888c91790
Fix "Future attached to a different loop" error by ensuring tasks are created in the correct event loop
...
- Explicitly retrieve and use the correct event loop when creating tasks to avoid cross-loop issues.
- Ensures proper task scheduling in environments with multiple event loops.
2024-11-26 14:05:02 +05:30
Aravind Karnam
a98d51a62c
Remove the can_process_url check from _process_links since it's already being checked in process_url
2024-11-26 11:11:49 +05:30
Aravind Karnam
ee3001b1f7
fix: moved depth to a param of can_process_url and applied the filter chain only when depth is not zero. This way the filter chain is skipped for the start URL, but the other validations still run for it.
2024-11-26 10:22:14 +05:30
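The depth-gated check described in this commit could look roughly like the sketch below. It is an illustration under assumptions, not the project's code: the `UrlFilter` type, the `filters` argument, and the scheme/host validation are hypothetical stand-ins for whatever the scraper actually validates.

```python
from typing import Callable, Iterable
from urllib.parse import urlparse

# Hypothetical filter signature: takes a URL, returns True to keep it.
UrlFilter = Callable[[str], bool]


def can_process_url(url: str, depth: int, filters: Iterable[UrlFilter]) -> bool:
    """Validate a URL, applying the filter chain only below the start URL."""
    parsed = urlparse(url)
    # Basic validation always runs, including for the start URL (assumed checks).
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        return False
    # Depth 0 is the start URL: skip the filter chain for it.
    if depth == 0:
        return True
    # Every other URL must pass all filters in the chain.
    return all(f(url) for f in filters)
```

With a filter that only accepts URLs containing `/docs`, a start URL such as `https://example.com/` is still accepted at depth 0 and rejected at depth 1.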
Aravind Karnam
b13fd71040
chore: 1. Expose process_external_links as a param
...
2. Removed a few unused imports
3. Removed URL normalisation for external links separately as that won't be necessary
2024-11-26 10:07:11 +05:30
unclecode
195c0ccf8a
chore: remove deprecated Docker Compose configurations for crawl4ai service
2024-11-24 19:40:27 +08:00
unclecode
b09a86c0c1
chore: remove deprecated Docker Compose configurations for crawl4ai service
2024-11-24 19:40:10 +08:00
unclecode
de43505ae4
feat: update version to 0.3.742
2024-11-24 19:36:30 +08:00
unclecode
d7c5b900b8
feat: add support for arm64 platform in Docker commands and update INSTALL_TYPE variable in docker-compose
2024-11-24 19:35:53 +08:00
unclecode
edad7b6a74
chore: remove Railway deployment configuration and related documentation
2024-11-24 18:48:39 +08:00
UncleCode
829a1f7992
feat: update version to 0.3.741 and enhance content filtering with heuristic strategy. Fixes the issue that occurs when the HTML passed to the BM25 content filter does not have any HTML elements.
2024-11-23 19:45:41 +08:00
UncleCode
d729aa7d5e
refactor: Add group ID to images extracted from srcset.
2024-11-23 18:00:32 +08:00
Aravind Karnam
2226ef53c8
fix: Exempting the start_url from can_process_url
2024-11-23 14:59:14 +05:30
aravind
3d52b551f2
Merge pull request #8 from aravindkarnam/main
...
Pulling in 0.3.74
2024-11-23 13:57:36 +05:30
Aravind Karnam
f8e85b1499
Fixed a bug in _process_links, handled condition for when url_scorer is passed as None, renamed the scrapper folder to scraper.
2024-11-23 13:52:34 +05:30
Aravind Karnam
c1797037c0
Fixed a few bugs and import errors, and changed to asyncio.wait_for instead of asyncio.timeout to support Python versions < 3.11
2024-11-23 12:39:25 +05:30
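For context on the timeout change above: the `asyncio.timeout()` context manager only exists from Python 3.11, while `asyncio.wait_for` works on earlier versions too. A minimal sketch of the portable pattern follows; `fetch_slowly` and `run_with_timeout` are made-up names for illustration and do not come from the repository.

```python
import asyncio


async def fetch_slowly(delay: float) -> str:
    # Stand-in for a crawl step; purely illustrative.
    await asyncio.sleep(delay)
    return "done"


async def run_with_timeout(delay: float, limit: float) -> str:
    try:
        # wait_for cancels the inner coroutine once the limit is exceeded
        # and raises asyncio.TimeoutError.
        return await asyncio.wait_for(fetch_slowly(delay), timeout=limit)
    except asyncio.TimeoutError:
        return "timed out"


if __name__ == "__main__":
    print(asyncio.run(run_with_timeout(0.01, 1.0)))
    print(asyncio.run(run_with_timeout(1.0, 0.01)))
```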
UncleCode
0d0cef3438
feat: add enhanced markdown generation example with citations and file output
2024-11-22 20:14:58 +08:00
UncleCode
d7a112fefe
Merge branch 'main' of https://github.com/unclecode/crawl4ai
2024-11-22 19:56:56 +08:00
UncleCode
a5decaa7cf
Merge branch '0.3.74'
2024-11-22 19:55:52 +08:00
UncleCode
8dea3f470f
chore: update README to include new features and improvements for version 0.3.74
2024-11-22 18:50:12 +08:00
UncleCode
e02935dc5b
chore: update README to reflect new features and improvements in version 0.3.74
2024-11-22 18:49:22 +08:00
UncleCode
24ad2fe2dd
feat: enhance Markdown generation to include fit_html attribute
2024-11-22 18:47:17 +08:00
UncleCode
571dda6549
Update README
2024-11-22 18:27:43 +08:00
UncleCode
006bee4a5a
feat: enhance image processing capabilities
...
- Enhanced image processing with srcset support and validation checks for better image selection.
2024-11-22 16:00:17 +08:00
UncleCode
dbb751c8f0
Introduce MarkdownGenerationStrategy, a new concept that lets us add future strategies for generating better markdown. Raw markdown is still generated as before. A new BM25-based algorithm produces fit markdown, and markdown can now be refined into a citation form: each link is extracted and replaced by a citation reference number, and a references section at the very end lists all the links with their descriptions. This format is better suited to large language models. When the links are not needed, the markdown can be made significantly smaller, and the list of references can be attached as a separate file for a large language model. This commit contains the changes in this direction.
2024-11-21 18:21:43 +08:00
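The citation form described in this commit can be sketched as follows. This is only an illustration of the idea, not the project's MarkdownGenerationStrategy code: the `to_citations` function, the regular expression, and the `[n] url: description` reference layout are all assumptions made for the example.

```python
import re
from typing import Dict, Tuple

# Matches inline links such as [text](https://example.com "title").
# Image links (prefixed with "!") are deliberately left untouched.
LINK_RE = re.compile(r'(?<!!)\[([^\]]+)\]\((\S+?)(?:\s+"([^"]*)")?\)')


def to_citations(markdown: str) -> Tuple[str, str]:
    """Return (body_with_citation_numbers, references_block)."""
    numbers: Dict[str, int] = {}       # url -> citation number
    descriptions: Dict[str, str] = {}  # url -> description shown in references

    def replace(match: "re.Match[str]") -> str:
        text, url, title = match.group(1), match.group(2), match.group(3)
        if url not in numbers:
            # First time this URL is seen: assign the next number.
            numbers[url] = len(numbers) + 1
            descriptions[url] = title or text
        return f"{text}[{numbers[url]}]"

    body = LINK_RE.sub(replace, markdown)
    references = "\n".join(
        f"[{n}] {url}: {descriptions[url]}" for url, n in numbers.items()
    )
    return body, references
```

Returning the references separately from the body mirrors the point made in the commit message: the body can be sent to a model on its own, and the reference list attached only when the links matter.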