Commit Graph

76 Commits

Author SHA1 Message Date
unclecode
740802c491 Merge branch '0.3.6' 2024-10-14 22:55:24 +08:00
unclecode
b9ac96c332 Merge branch 'main' of https://github.com/unclecode/crawl4ai 2024-10-14 22:54:23 +08:00
unclecode
d06535388a Update gitignore 2024-10-14 22:53:56 +08:00
unclecode
6aa803d712 Update gitignore 2024-10-14 21:03:40 +08:00
unclecode
320afdea64 feat: Enhance crawler flexibility and LLM extraction capabilities
- Add browser type selection (Chromium, Firefox, WebKit)
- Implement iframe content extraction
- Improve image processing and dimension updates
- Add custom headers support in AsyncPlaywrightCrawlerStrategy
- Enhance delayed content retrieval with new parameter
- Optimize HTML sanitization and Markdown conversion
- Update examples in quickstart_async.py for new features
2024-10-14 21:03:28 +08:00
unclecode
ff3524d9b1 feat(v0.3.6): Add screenshot capture, delayed content, and custom timeouts
- Implement screenshot capture functionality
- Add delayed content retrieval method
- Introduce custom page timeout parameter
- Enhance LLM support with multiple providers
- Improve database schema auto-updates
- Optimize image processing in WebScrappingStrategy
- Update error handling and logging
- Expand examples in quickstart_async.py
2024-10-12 13:42:42 +08:00
unclecode
b99d20b725 Add pypi_build.sh to .gitignore 2024-10-08 18:10:57 +08:00
unclecode
4750810a67 Enhance AsyncWebCrawler with smart waiting and screenshot capabilities
- Implement smart_wait function in AsyncPlaywrightCrawlerStrategy
- Add screenshot support to AsyncCrawlResponse and AsyncWebCrawler
- Improve error handling and timeout management in crawling process
- Fix typo in CrawlResult model (responser_headers -> response_headers)
- Update .gitignore to exclude additional files
- Adjust import path in test_basic_crawling.py
2024-10-02 17:34:56 +08:00
unclecode
8b6e88c85c Update .gitignore to ignore temporary and test directories 2024-09-26 15:09:49 +08:00
unclecode
10cdad039d Update documents and README 2024-09-25 16:52:11 +08:00
unclecode
8463aabedf chore: Remove .test_pads/ directory from .gitignore 2024-07-19 17:09:29 +08:00
unclecode
7f30144ef2 chore: Remove .tests/ directory from .gitignore 2024-07-09 15:10:18 +08:00
unclecode
d11a83c232 ## [0.2.71] 2024-06-26
• Refactored `crawler_strategy.py` to handle exceptions and improve error messages
• Improved `get_content_of_website_optimized` function in `utils.py` for better performance
• Updated `utils.py` with latest changes
• Migrated to `ChromeDriverManager` for resolving Chrome driver download issues
2024-06-26 15:34:15 +08:00
unclecode
2217904876 Update .gitignore 2024-06-22 18:12:12 +08:00
unclecode
19d3d39115 Update Marge the DOCS branch 2024-06-21 18:04:13 +08:00
unclecode
4a50781453 chore: Remove local and .files folders from .gitignore 2024-06-17 15:57:34 +08:00
unclecode
8b8683f22e Add research assistant example using Chainlit 2024-06-04 22:43:09 +08:00
QIN2DIM
5cee084340 fix(main): UnicodeDecodeError
File "T:\_GitHubProjects\Forks\crawl4ai\main.py", line 70, in read_index
    partials[filename[:-5]] = file.read()

UnicodeDecodeError: 'gbk' codec can't decode byte 0xa4 in position 149: illegal multibyte sequence
2024-05-18 23:31:11 +08:00
Unclecode
bf00c26a83 chore: Update Dockerfile to install chromium-chromedriver and spacy library 2024-05-18 09:16:52 +00:00
unclecode
199c66114c chore: Update pip installation command and requirements, add new dependencies 2024-05-16 20:58:36 +08:00
unclecode
f6e59157bf - Test all methods
- Update index.hml
- Update Readme
- Resolve some bugs
2024-05-14 21:27:41 +08:00
unclecode
82706129f5 Update:
- Text Categorization
- Crawler, Extraction, and Chunking strategies
- Clustering for semantic segmentation
2024-05-12 22:37:21 +08:00
unclecode
7039e3c1ee - Issue Resolved: Every <pre> tag's HTML content is replaced with its inner text to address situations like syntax highlighters, where each character might be in a <span>. This avoids issues where the minimum word threshold might ignore them. 2024-05-12 14:08:22 +08:00
unclecode
181250cb93 chore: Add function to clear the database 2024-05-09 19:42:43 +08:00
unclecode
c71adb29ce chore: Update .gitignore and README.md 2024-05-09 19:25:25 +08:00
unclecode
b8e743cd8d Initial Commit 2024-05-09 19:10:25 +08:00