UncleCode
9139ef3125
feat(docker): update Dockerfile for improved installation process and enhance deployment documentation with Docker Compose setup and API token security
2024-11-16 18:19:44 +08:00
UncleCode
6360d0545a
feat(api): add API token authentication and update Dockerfile description
2024-11-16 18:08:56 +08:00
UncleCode
1961adb530
refactor(docker): remove shared memory size configuration to streamline Dockerfile
2024-11-16 17:35:27 +08:00
UncleCode
79feab89c4
refactor(deploy): remove memory utilization alert configuration from deployment template
2024-11-16 17:28:42 +08:00
UncleCode
5d0b13294c
feat(deploy): change instance size to professional-xs and update memory utilization alert window to 300 seconds
2024-11-16 17:25:07 +08:00
UncleCode
67edc2d641
feat(deploy): update instance size to professional-xs and add memory utilization alert parameters
2024-11-16 17:23:32 +08:00
UncleCode
6b569cceb5
feat(deploy): update branch to 0.3.74 and change instance size to basic-xs
2024-11-16 17:21:45 +08:00
UncleCode
6f2fe5954f
feat(deploy): update instance size to professional-xs and add memory utilization alert
2024-11-16 17:12:41 +08:00
UncleCode
fca1319b7d
feat(docker): add MkDocs installation and build step for documentation
2024-11-16 17:10:30 +08:00
UncleCode
f77f06a3bd
feat(deploy): add deployment configuration and templates for crawl4ai
2024-11-16 16:43:31 +08:00
UncleCode
e62c807295
feat(deploy): add Railway deployment configuration and setup instructions
2024-11-16 16:38:13 +08:00
UncleCode
90df6921b7
feat(crawl_sync): add synchronous crawl endpoint and corresponding test
2024-11-16 15:34:30 +08:00
UncleCode
5098442086
refactor: migrate versioning to __version__.py and remove deprecated _version.py
2024-11-16 15:30:24 +08:00
UncleCode
d0014c6793
New async database manager and migration support
- Introduced AsyncDatabaseManager for async DB management.
- Added migration feature to transition to file-based storage.
- Enhanced web crawler with improved caching logic.
- Updated requirements and setup for async processing.
2024-11-16 14:54:41 +08:00
UncleCode
ae7ebc0bd8
chore: update .gitignore and enhance changelog with major feature additions and examples
2024-11-15 20:16:13 +08:00
UncleCode
1f269f9834
test(content_filter): add comprehensive tests for BM25ContentFilter functionality
2024-11-15 18:11:11 +08:00
UncleCode
7f1ae5adcf
Update changelog
2024-11-14 22:51:51 +08:00
UncleCode
3d00fee6c2
- This commit adds file download handling to the library. Users can now specify a download folder and trigger downloads via JavaScript or other means, and all files are saved. The list of downloaded files is also added to the crawl result object.
- This commit also introduces the Relevance Content Filter, an improvement over Fit Markdown. This class of strategies aims to extract the main content of a page, the part that matters and is worth processing. The first strategy uses the BM25 algorithm to find chunks of text on the page that are relevant to its title, description, and keywords, or to a user-supplied query. The result is then returned to the main engine to be converted to Markdown. Approaches using language models are planned as well.
- The cache database was updated to hold information about response headers and downloaded files.
2024-11-14 22:50:59 +08:00
UncleCode
17913f5acf
feat(crawler): support local files and raw HTML input in AsyncWebCrawler
2024-11-13 20:00:29 +08:00
UncleCode
c38ac29edb
perf(crawler): major performance improvements & raw HTML support
- Switch to lxml parser (~4x speedup)
- Add raw HTML & local file crawling support
- Fix cache headers & async cleanup
- Add browser process monitoring
- Optimize BeautifulSoup operations
- Pre-compile regex patterns
Breaking: Raw HTML handling requires new URL prefixes
Fixes: #256, #253
2024-11-13 19:40:40 +08:00
UncleCode
38044d4afe
Merge pull request #255 from maheshpec/feature/configure-cache-directory
feat(config): Adding a configurable way of setting the cache directory for constrained environments
2024-11-13 09:43:29 +01:00
UncleCode
61b93ebf36
Update change log
2024-11-13 15:38:30 +08:00
UncleCode
bf91adf3f8
fix: Resolve unexpected BrowserContext closure during crawl in Docker
- Removed __del__ method in AsyncPlaywrightCrawlerStrategy to ensure reliable browser lifecycle management by using explicit context managers.
- Added process monitoring in ManagedBrowser to detect and log unexpected terminations of the browser subprocess.
- Updated Docker configuration to expose port 9222 for remote debugging and allocate extra shared memory to prevent browser crashes.
- Improved error handling and resource cleanup for browser instances, particularly in Docker environments.
Resolves Issue #256
2024-11-13 15:37:16 +08:00
Mahesh
00026b5f8b
feat(config): Adding a configurable way of setting the cache directory for constrained environments
2024-11-12 14:52:51 -07:00
UncleCode
8c22396d8b
Merge pull request #234 from devatnull/patch-1
Fix typo: scrapper → scraper
2024-11-12 08:37:14 +01:00
UncleCode
b6d6631b12
Enhance Async Crawler with Playwright support
- Implemented new async crawler strategy using Playwright.
- Introduced ManagedBrowser for better browser management.
- Added support for persistent browser sessions and improved error handling.
- Updated version from 0.3.73 to 0.3.731.
- Enhanced logic in main.py for conditional mounting of static files.
- Updated requirements to replace playwright_stealth with tf-playwright-stealth.
2024-11-12 12:10:58 +08:00
UncleCode
a098483cbb
Update Roadmap
2024-11-09 20:40:30 +08:00
UncleCode
f9a297e08d
Add Docker example script for testing Crawl4AI functionality
2024-11-08 19:39:05 +08:00
UncleCode
bcdd80911f
Remove some old files.
2024-11-08 19:08:58 +08:00
UncleCode
b120965b6a
Fixed ManagedBrowser issues: it could not connect to the user directory or create new pages within the ManagedBrowser context. Both issues are now resolved.
2024-11-07 20:15:03 +08:00
UncleCode
16f918621f
Merge branch 'main' of https://github.com/unclecode/crawl4ai
2024-11-07 19:30:22 +08:00
UncleCode
f7574230a1
Update API server request object, test_docker file, and README
2024-11-07 19:29:31 +08:00
devatnull
2879344d9c
Update README.md
2024-11-06 17:36:46 +03:00
UncleCode
9f5eef1f38
Refactored the CustomHTML2Text class in content_scrapping_strategy.py to remove the handling logic for header tags (h1-h6), which are now commented out. This cleanup improves code readability and reduces maintenance overhead.
2024-11-06 21:50:09 +08:00
UncleCode
c5aa1bec18
Merge pull request #229 from bizrockman/main
Preventing NoneType has no attribute get Errors
2024-11-06 07:31:07 +01:00
UncleCode
b51263664e
feat(api): add CORS support and static file serving, update root redirect
2024-11-05 21:02:47 +08:00
UncleCode
1e7db0d293
docs(README): update release notes for version 0.3.73 with new features and improvements
2024-11-05 20:12:20 +08:00
UncleCode
2a54f3c048
refactor(core): remove main_v0.py file and associated functionality
2024-11-05 20:11:07 +08:00
UncleCode
1c20b815b3
docs(README): update Docker usage instructions and add deployment options
2024-11-05 20:10:24 +08:00
UncleCode
43a2b26f63
Merge branch 'main' of https://github.com/unclecode/crawl4ai
2024-11-05 20:08:20 +08:00
UncleCode
3cf19a1bc2
chore(version): bump version to 0.3.73
2024-11-05 20:05:58 +08:00
UncleCode
67a23c3182
feat(core): Release v0.3.73 with Browser Takeover and Docker Support
Major changes:
- Add browser takeover feature using CDP for authentic browsing
- Implement Docker support with full API server documentation
- Enhance Markdown with tag preservation system
- Improve parallel crawling performance
This release focuses on authenticity and scalability, introducing the ability
to use users' own browsers while providing containerized deployment options.
Breaking changes include modified browser handling and API response structure.
See CHANGELOG.md for detailed migration guide.
2024-11-05 20:04:18 +08:00
bizrockman
796dbaf08c
Rename episode_11_3_Extraction_Strategies:_Cosine.md to episode_11_3_Extraction_Strategies_Cosine.md
Name that will work in Windows
2024-11-04 20:19:43 +01:00
bizrockman
3a3c88a2d0
Rename episode_11_2_Extraction_Strategies:_LLM.md to episode_11_2_Extraction_Strategies_LLM.md
Name that will work in Windows
2024-11-04 20:19:20 +01:00
bizrockman
870296fa7e
Rename episode_11_1_Extraction_Strategies:_JSON_CSS.md to episode_11_1_Extraction_Strategies_JSON_CSS.md
Name that will work in Windows
2024-11-04 20:18:58 +01:00
bizrockman
a28046c233
Rename episode_08_Media_Handling:_Images,_Videos,_and_Audio.md to episode_08_Media_Handling_Images_Videos_and_Audio.md
Name that will work in Windows
2024-11-04 20:18:26 +01:00
bizrockman
0bba0e074f
Preventing NoneType has no attribute get Errors
Sometimes the list contains Tag elements that do not have attrs set, which causes this error.
2024-11-04 20:12:24 +01:00
UncleCode
c4c6227962
Creating the API server component
2024-11-04 20:33:15 +08:00
UncleCode
e6c914d2fa
Refactor version management and remove deprecated gitignore.dev file
2024-11-04 16:51:59 +08:00
UncleCode
be8f4fc59a
Merge branch '0.3.73' of https://github.com/unclecode/crawl4ai into 0.3.73
2024-11-04 14:12:07 +08:00