UncleCode
2a54f3c048
refactor(core): remove main_v0.py file and associated functionality
2024-11-05 20:11:07 +08:00
UncleCode
1c20b815b3
docs(README): update Docker usage instructions and add deployment options
2024-11-05 20:10:24 +08:00
UncleCode
43a2b26f63
Merge branch 'main' of https://github.com/unclecode/crawl4ai
2024-11-05 20:08:20 +08:00
UncleCode
3cf19a1bc2
chore(version): bump version to 0.3.73
2024-11-05 20:05:58 +08:00
UncleCode
67a23c3182
feat(core): Release v0.3.73 with Browser Takeover and Docker Support
Major changes:
- Add browser takeover feature using CDP for authentic browsing
- Implement Docker support with full API server documentation
- Enhance Markdown with tag preservation system
- Improve parallel crawling performance
This release focuses on authenticity and scalability, introducing the ability
to use users' own browsers while providing containerized deployment options.
Breaking changes include modified browser handling and API response structure.
See CHANGELOG.md for detailed migration guide.
2024-11-05 20:04:18 +08:00
bizrockman
796dbaf08c
Rename episode_11_3_Extraction_Strategies:_Cosine.md to episode_11_3_Extraction_Strategies_Cosine.md
Name that will work in Windows
2024-11-04 20:19:43 +01:00
bizrockman
3a3c88a2d0
Rename episode_11_2_Extraction_Strategies:_LLM.md to episode_11_2_Extraction_Strategies_LLM.md
Name that will work in Windows
2024-11-04 20:19:20 +01:00
bizrockman
870296fa7e
Rename episode_11_1_Extraction_Strategies:_JSON_CSS.md to episode_11_1_Extraction_Strategies_JSON_CSS.md
Name that will work in Windows
2024-11-04 20:18:58 +01:00
bizrockman
a28046c233
Rename episode_08_Media_Handling:_Images,_Videos,_and_Audio.md to episode_08_Media_Handling_Images_Videos_and_Audio.md
Name that will work in Windows
2024-11-04 20:18:26 +01:00
bizrockman
0bba0e074f
Preventing NoneType has no attribute get Errors
Sometimes the list contains Tag elements that do not have attrs set, resulting in this Error.
2024-11-04 20:12:24 +01:00
UncleCode
c4c6227962
Creating the API server component
2024-11-04 20:33:15 +08:00
UncleCode
e6c914d2fa
Refactor version management and remove deprecated gitignore.dev file
2024-11-04 16:51:59 +08:00
UncleCode
be8f4fc59a
Merge branch '0.3.73' of https://github.com/unclecode/crawl4ai into 0.3.73
2024-11-04 14:12:07 +08:00
unclecode
fbdf870fbf
Update CHANGELOG
2024-11-04 14:10:27 +08:00
UncleCode
7b0cca41b4
Update gitignore
2024-11-04 13:48:26 +08:00
UncleCode
33d0e9ec8c
Update dev gitignore
2024-11-04 13:42:37 +08:00
UncleCode
42f1c67ca8
Merge branch '0.3.73' of https://github.com/unclecode/crawl4ai into 0.3.73
2024-11-04 13:39:39 +08:00
UncleCode
e28c49a8fe
Refactor .gitignore.dev file: Add ignore patterns for various files and directories
2024-11-04 13:39:38 +08:00
unclecode
54d5a3a259
Improved database management and error handling, updated README instructions, refined .gitignore, enhanced async web crawling capabilities, and updated dependencies.
2024-11-04 13:22:13 +08:00
UncleCode
de6b43f334
Merge pull request #215 from mjvankampen/build/flexible-requirements
build: make requirements more flexible
2024-11-03 08:30:06 +01:00
UncleCode
07f508bd0c
Merge pull request #218 from timoa/main
chore(docs): fix documentation links + markdown lint fix
2024-11-03 06:59:30 +01:00
UncleCode
62a86dbe8d
Refactor mission section in README and add mission diagram
2024-10-31 16:38:56 +08:00
UncleCode
492ada0ed4
Add mission diagram to MISSION.md
2024-10-31 15:26:43 +08:00
UncleCode
d8eef02867
Add link to mission statement in README
2024-10-31 15:23:58 +08:00
UncleCode
6c7235d6a7
Add mission.md file
2024-10-31 15:22:00 +08:00
Damien Laureaux
0a09d78fa5
chore(docs): fix documentation links + markdown lint
2024-10-31 05:50:22 +01:00
UncleCode
19c3f3efb2
Refactor tutorial markdown files: Update numbering and formatting
2024-10-30 20:58:07 +08:00
UncleCode
e97e8df6ba
Update README: Fix typo in project name
2024-10-30 20:45:20 +08:00
UncleCode
cb6f5323ae
Update README
2024-10-30 20:44:57 +08:00
UncleCode
47464cedec
Update README
2024-10-30 20:42:27 +08:00
UncleCode
982d203d91
Merge branch '0.3.73'
2024-10-30 20:40:09 +08:00
UncleCode
9307c19f35
Update documents, upload new version of quickstart.
2024-10-30 20:39:35 +08:00
Mark Jan van Kampen
605a82793b
fix dev requirements and lock playwright due to failing tests
2024-10-30 10:41:37 +01:00
Mark Jan van Kampen
df9ee44d42
build: make requirements more flexible
According to #102, the specified requirements are minimum versions. Currently they are pinned to fixed versions in requirements.txt and setup.py, so projects consuming this package are limited to exactly those versions instead of a more flexible range. This PR addresses this.
2024-10-30 10:03:22 +01:00
UncleCode
e9f7d5e73a
Merge branch '0.3.73'
2024-10-30 00:16:49 +08:00
UncleCode
3529c2e732
Update new tutorial documents and add them to the docs folder.
2024-10-30 00:16:18 +08:00
UncleCode
d9e0b7abab
Fix README badge
2024-10-28 15:14:16 +08:00
UncleCode
b2800fefc6
Add badges to README
2024-10-28 15:10:12 +08:00
UncleCode
d913e20edc
Update Readme
2024-10-28 15:09:37 +08:00
UncleCode
c2a71a5abe
Update Docs folder, prepare branch for new version 0.3.73
v.3.72
2024-10-27 19:35:13 +08:00
UncleCode
d61615e0b0
Merge branch '0.3.72'
2024-10-27 19:33:05 +08:00
UncleCode
ac9d83c72f
Update gitignore
2024-10-27 19:29:04 +08:00
UncleCode
ff9149b5c9
Merge branch 'main' of https://github.com/unclecode/crawl4ai
2024-10-27 19:28:05 +08:00
UncleCode
4239654722
Update Documentation
2024-10-27 19:24:46 +08:00
UncleCode
38474bd66a
Update version
2024-10-24 20:24:21 +08:00
UncleCode
bcfe83f702
feat: enhance crawler with overlay removal and improved screenshot capabilities
• Add smart overlay removal system for handling popups and modals
• Improve screenshot functionality with configurable timing controls
• Implement URL normalization and enhanced link processing
• Add custom base directory support for cache storage
• Refine external content filtering and social media domain handling
This commit significantly improves the crawler's ability to handle modern
websites by automatically removing intrusive overlays and providing better
screenshot capabilities. URL handling is now more robust with proper
normalization and duplicate detection. The cache system is more flexible
with customizable base directory support.
Breaking changes: None
Issue numbers: None
2024-10-24 20:22:47 +08:00
UncleCode
32f57c49d6
Merge pull request #194 from IdrisHanafi/feat/customize-crawl-base-directory
Support for custom crawl base directory
2024-10-24 13:09:27 +02:00
UncleCode
60ba131ac8
[v0.3.72] Enhance content extraction and proxy support
- Add ContentCleaningStrategy for improved content extraction
- Implement advanced proxy configuration with authentication
- Enhance image source detection and handling
- Add fit_markdown and fit_html for refined content output
- Improve external link and image handling flexibility
2024-10-22 20:19:22 +08:00
Idris Hanafi
a5f627ba1a
feat: customize crawl base directory
2024-10-21 17:58:39 -04:00
UncleCode
04d16e6d2b
Fix Base64 image parsing in WebScrappingStrategy (issue 182)
- Add support for extracting Base64 encoded images
- Improve image format detection to include Base64 images
- Enhance compatibility with locally saved HTML files using Base64 image encoding
2024-10-20 19:25:25 +08:00