Commit Graph

48 Commits

Author SHA1 Message Date
UncleCode
86259244e4 Add ".do" to gitignore 2024-12-31 17:30:09 +08:00
UncleCode
0ec593fa90 Update the Tutorial section for new document version 2024-12-31 17:27:31 +08:00
UncleCode
fb33a24891 Commit Message:
- Added examples for Amazon product data extraction methods
  - Updated configuration options and enhance documentation
  - Minor refactoring for improved performance and readability
  - Cleaned up version control settings.
2024-12-29 20:05:18 +08:00
UncleCode
9a4ed6bbd7 Commit Message:
Enhance crawler capabilities and documentation

  - Added SSL certificate extraction in AsyncWebCrawler.
  - Introduced new content filters and chunking strategies for more robust data extraction.
  - Updated documentation management to streamline user experience.
2024-12-26 15:17:07 +08:00
UncleCode
d5ed451299 Enhance crawler capabilities and documentation
- Add llm.txt generator
  - Added SSL certificate extraction in AsyncWebCrawler.
  - Introduced new content filters and chunking strategies for more robust data extraction.
  - Updated documentation.
2024-12-25 21:34:31 +08:00
UncleCode
84b311760f Commit Message:
Enhance Crawl4AI with CLI and documentation updates
  - Implemented Command-Line Interface (CLI) in `crawl4ai/cli.py`
  - Added chunking strategies and their documentation in `llm.txt`
2024-12-21 14:26:56 +08:00
UncleCode
399af801a1 Merge branch 'next' 2024-12-12 20:17:27 +08:00
UncleCode
3d69715dba chore: Update .gitignore to include new files and directories 2024-12-12 19:57:59 +08:00
UncleCode
0982c639ae Enhance AsyncWebCrawler and related configurations
- Introduced new configuration classes: BrowserConfig and CrawlerRunConfig.
  - Refactored AsyncWebCrawler to leverage the new configuration system for cleaner parameter management.
  - Updated AsyncPlaywrightCrawlerStrategy for better flexibility and reduced legacy parameters.
  - Improved error handling with detailed context extraction during exceptions.
  - Enhanced overall maintainability and usability of the web crawler.
2024-12-12 19:35:09 +08:00
UncleCode
2d31915f0a Commit Message:
Enhance Async Crawler with storage state handling
  - Updated Async Crawler to support storage state management.
  - Added error handling for URL validation in Async Web Crawler.
  - Modified README logo and improved .gitignore entries.
  - Fixed issues in multiple files for better code robustness.
2024-12-09 20:04:59 +08:00
UncleCode
a9b6b65238 chore: update version to 0.3.744 and add publish.sh to .gitignore 2024-11-28 19:26:50 +08:00
UncleCode
a5decaa7cf Merge branch '0.3.74' 2024-11-22 19:55:52 +08:00
UncleCode
2bdec1fa5a chore: add manage-collab.sh to .gitignore 2024-11-19 19:33:04 +08:00
UncleCode
b654c49e55 Update .gitignore to exclude additional scripts and files 2024-11-19 19:32:06 +08:00
UncleCode
2f19d38693 Update .gitignore to include .gitboss/ and todo_executor.md 2024-11-19 19:02:41 +08:00
UncleCode
73658c758a chore: update .gitignore to include manage-collab.sh 2024-11-19 16:10:43 +08:00
UncleCode
ae7ebc0bd8 chore: update .gitignore and enhance changelog with major feature additions and examples 2024-11-15 20:16:13 +08:00
UncleCode
bf91adf3f8 fix: Resolve unexpected BrowserContext closure during crawl in Docker
- Removed __del__ method in AsyncPlaywrightCrawlerStrategy to ensure reliable browser lifecycle management by using explicit context managers.
- Added process monitoring in ManagedBrowser to detect and log unexpected terminations of the browser subprocess.
- Updated Docker configuration to expose port 9222 for remote debugging and allocate extra shared memory to prevent browser crashes.
- Improved error handling and resource cleanup for browser instances, particularly in Docker environments.

Resolves Issue #256
2024-11-13 15:37:16 +08:00
unclecode
54d5a3a259 Improved database management and error handling, updated README instructions, refined .gitignore, enhanced async web crawling capabilities, and updated dependencies. 2024-11-04 13:22:13 +08:00
UncleCode
c2a71a5abe Update Docs folder, prepare branch for new version 0.3.73 2024-10-27 19:35:13 +08:00
UncleCode
4e2852d5ff [v0.3.71] Enhance chunking strategies and improve overall performance
- Add OverlappingWindowChunking and improve SlidingWindowChunking
- Update CHUNK_TOKEN_THRESHOLD to 2048 tokens
- Optimize AsyncPlaywrightCrawlerStrategy close method
- Enhance flexibility in CosineStrategy with generic embedding model loading
- Improve JSON-based extraction strategies
- Add knowledge graph generation example
2024-10-19 18:36:59 +08:00
UncleCode
768aa06ceb feat(crawler): Enhance stealth and flexibility, improve error handling
- Implement playwright_stealth for better bot detection avoidance
- Add user simulation and navigator override options
- Improve iframe processing and browser selection
- Enhance error reporting and debugging capabilities
- Optimize image processing and parallel crawling
- Add new example for user simulation feature
- Added support for including links in Markdown content, by definin g a new flag `include_links_on_markdown` in `crawl` method.
2024-10-17 21:37:48 +08:00
unclecode
740802c491 Merge branch '0.3.6' 2024-10-14 22:55:24 +08:00
unclecode
b9ac96c332 Merge branch 'main' of https://github.com/unclecode/crawl4ai 2024-10-14 22:54:23 +08:00
unclecode
d06535388a Update gitignore 2024-10-14 22:53:56 +08:00
unclecode
6aa803d712 Update gitignore 2024-10-14 21:03:40 +08:00
unclecode
320afdea64 feat: Enhance crawler flexibility and LLM extraction capabilities
- Add browser type selection (Chromium, Firefox, WebKit)
- Implement iframe content extraction
- Improve image processing and dimension updates
- Add custom headers support in AsyncPlaywrightCrawlerStrategy
- Enhance delayed content retrieval with new parameter
- Optimize HTML sanitization and Markdown conversion
- Update examples in quickstart_async.py for new features
2024-10-14 21:03:28 +08:00
unclecode
ff3524d9b1 feat(v0.3.6): Add screenshot capture, delayed content, and custom timeouts
- Implement screenshot capture functionality
- Add delayed content retrieval method
- Introduce custom page timeout parameter
- Enhance LLM support with multiple providers
- Improve database schema auto-updates
- Optimize image processing in WebScrappingStrategy
- Update error handling and logging
- Expand examples in quickstart_async.py
2024-10-12 13:42:42 +08:00
unclecode
b99d20b725 Add pypi_build.sh to .gitignore 2024-10-08 18:10:57 +08:00
unclecode
4750810a67 Enhance AsyncWebCrawler with smart waiting and screenshot capabilities
- Implement smart_wait function in AsyncPlaywrightCrawlerStrategy
- Add screenshot support to AsyncCrawlResponse and AsyncWebCrawler
- Improve error handling and timeout management in crawling process
- Fix typo in CrawlResult model (responser_headers -> response_headers)
- Update .gitignore to exclude additional files
- Adjust import path in test_basic_crawling.py
2024-10-02 17:34:56 +08:00
unclecode
8b6e88c85c Update .gitignore to ignore temporary and test directories 2024-09-26 15:09:49 +08:00
unclecode
10cdad039d Update documents and README 2024-09-25 16:52:11 +08:00
unclecode
8463aabedf chore: Remove .test_pads/ directory from .gitignore 2024-07-19 17:09:29 +08:00
unclecode
7f30144ef2 chore: Remove .tests/ directory from .gitignore 2024-07-09 15:10:18 +08:00
unclecode
d11a83c232 ## [0.2.71] 2024-06-26
• Refactored `crawler_strategy.py` to handle exceptions and improve error messages
• Improved `get_content_of_website_optimized` function in `utils.py` for better performance
• Updated `utils.py` with latest changes
• Migrated to `ChromeDriverManager` for resolving Chrome driver download issues
2024-06-26 15:34:15 +08:00
unclecode
2217904876 Update .gitignore 2024-06-22 18:12:12 +08:00
unclecode
19d3d39115 Update Marge the DOCS branch 2024-06-21 18:04:13 +08:00
unclecode
4a50781453 chore: Remove local and .files folders from .gitignore 2024-06-17 15:57:34 +08:00
unclecode
8b8683f22e Add research assistant example using Chainlit 2024-06-04 22:43:09 +08:00
QIN2DIM
5cee084340 fix(main): UnicodeDecodeError
File "T:\_GitHubProjects\Forks\crawl4ai\main.py", line 70, in read_index
    partials[filename[:-5]] = file.read()

UnicodeDecodeError: 'gbk' codec can't decode byte 0xa4 in position 149: illegal multibyte sequence
2024-05-18 23:31:11 +08:00
Unclecode
bf00c26a83 chore: Update Dockerfile to install chromium-chromedriver and spacy library 2024-05-18 09:16:52 +00:00
unclecode
199c66114c chore: Update pip installation command and requirements, add new dependencies 2024-05-16 20:58:36 +08:00
unclecode
f6e59157bf - Test all methods
- Update index.hml
- Update Readme
- Resolve some bugs
2024-05-14 21:27:41 +08:00
unclecode
82706129f5 Update:
- Text Categorization
- Crawler, Extraction, and Chunking strategies
- Clustering for semantic segmentation
2024-05-12 22:37:21 +08:00
unclecode
7039e3c1ee - Issue Resolved: Every <pre> tag's HTML content is replaced with its inner text to address situations like syntax highlighters, where each character might be in a <span>. This avoids issues where the minimum word threshold might ignore them. 2024-05-12 14:08:22 +08:00
unclecode
181250cb93 chore: Add function to clear the database 2024-05-09 19:42:43 +08:00
unclecode
c71adb29ce chore: Update .gitignore and README.md 2024-05-09 19:25:25 +08:00
unclecode
b8e743cd8d Initial Commit 2024-05-09 19:10:25 +08:00