Commit Graph

103 Commits

Author SHA1 Message Date
Aravind Karnam
ff731e4ea1 fixed the final scraper_quickstart.py example 2024-11-26 17:08:32 +05:30
Aravind Karnam
9530ded83a fixed the final scraper_quickstart.py example 2024-11-26 17:05:54 +05:30
aravind
3d52b551f2 Merge pull request #8 from aravindkarnam/main
Pulling in 0.3.74
2024-11-23 13:57:36 +05:30
Aravind Karnam
f8e85b1499 Fixed a bug in _process_links, handled condition for when url_scorer is passed as None, renamed the scrapper folder to scraper. 2024-11-23 13:52:34 +05:30
UncleCode
0d0cef3438 feat: add enhanced markdown generation example with citations and file output 2024-11-22 20:14:58 +08:00
UncleCode
b6af94cbbb Merge remote-tracking branch 'origin/main' into 0.3.74 2024-11-18 21:15:04 +08:00
UncleCode
852729ff38 feat(docker): add Docker Compose configurations for local and hub deployment; enhance GPU support checks in Dockerfile
feat(requirements): update requirements.txt to include snowballstemmer
fix(version_manager): correct version parsing to use __version__.__version__
feat(main): introduce chunking strategy and content filter in CrawlRequest model
feat(content_filter): enhance BM25 algorithm with priority tag scoring for improved content relevance
feat(logger): implement new async logger engine replacing print statements throughout library
fix(database): resolve version-related deadlock and circular lock issues in database operations
docs(docker): expand Docker deployment documentation with usage instructions for Docker Compose
2024-11-18 21:00:06 +08:00
UncleCode
df63a40606 feat(docs): update examples and documentation to replace bypass_cache with cache_mode for improved clarity 2024-11-17 19:44:45 +08:00
UncleCode
f9fe6f89fe feat(database): implement version management and migration checks during initialization 2024-11-17 18:09:33 +08:00
UncleCode
2a82455b3d feat(crawl): implement direct crawl functionality and introduce CacheMode for improved caching control 2024-11-17 17:17:34 +08:00
UncleCode
3a66aa8a60 feat(cache): introduce CacheMode and CacheContext for enhanced caching behavior
chore(requirements): add colorama dependency
refactor(config): add SHOW_DEPRECATION_WARNINGS flag and clean up code
fix(docs): update example scripts for clarity and consistency
2024-11-17 15:30:56 +08:00
UncleCode
4b45b28f25 feat(docs): enhance deployment documentation with one-click setup, API security details, and Docker Compose examples 2024-11-16 18:44:47 +08:00
UncleCode
9139ef3125 feat(docker): update Dockerfile for improved installation process and enhance deployment documentation with Docker Compose setup and API token security 2024-11-16 18:19:44 +08:00
UncleCode
6360d0545a feat(api): add API token authentication and update Dockerfile description 2024-11-16 18:08:56 +08:00
UncleCode
90df6921b7 feat(crawl_sync): add synchronous crawl endpoint and corresponding test 2024-11-16 15:34:30 +08:00
aravind
60670b2af6 Merge pull request #7 from aravindkarnam/main
pulling the main branch into scraper-uc
2024-11-15 20:43:54 +05:30
UncleCode
ae7ebc0bd8 chore: update .gitignore and enhance changelog with major feature additions and examples 2024-11-15 20:16:13 +08:00
UncleCode
c38ac29edb perf(crawler): major performance improvements & raw HTML support
- Switch to lxml parser (~4x speedup)
- Add raw HTML & local file crawling support
- Fix cache headers & async cleanup
- Add browser process monitoring
- Optimize BeautifulSoup operations
- Pre-compile regex patterns

Breaking: Raw HTML handling requires new URL prefixes
Fixes: #256, #253
2024-11-13 19:40:40 +08:00
Mahesh
00026b5f8b feat(config): Adding a configurable way of setting the cache directory for constrained environments 2024-11-12 14:52:51 -07:00
UncleCode
f9a297e08d Add Docker example script for testing Crawl4AI functionality 2024-11-08 19:39:05 +08:00
UncleCode
0d357ab7d2 feat(scraper): Enhance URL filtering and scoring systems
Implement comprehensive URL filtering and scoring capabilities:

Filters:
- Add URLPatternFilter with glob/regex support
- Implement ContentTypeFilter with MIME type checking
- Add DomainFilter for domain control
- Create FilterChain with stats tracking

Scorers:
- Complete KeywordRelevanceScorer implementation
- Add PathDepthScorer for URL structure scoring
- Implement ContentTypeScorer for file type priorities
- Add FreshnessScorer for date-based scoring
- Add DomainAuthorityScorer for domain weighting
- Create CompositeScorer for combined strategies

Features:
- Add statistics tracking for both filters and scorers
- Implement logging support throughout
- Add resource cleanup methods
- Create comprehensive documentation
- Include performance optimizations

Tests and docs included.
Note: Review URL normalization overlap with recent crawler changes.
2024-11-08 19:02:28 +08:00
UncleCode
bae4665949 feat(scraper): Enhance URL filtering and scoring systems
Implement comprehensive URL filtering and scoring capabilities:

Filters:
- Add URLPatternFilter with glob/regex support
- Implement ContentTypeFilter with MIME type checking
- Add DomainFilter for domain control
- Create FilterChain with stats tracking

Scorers:
- Complete KeywordRelevanceScorer implementation
- Add PathDepthScorer for URL structure scoring
- Implement ContentTypeScorer for file type priorities
- Add FreshnessScorer for date-based scoring
- Add DomainAuthorityScorer for domain weighting
- Create CompositeScorer for combined strategies

Features:
- Add statistics tracking for both filters and scorers
- Implement logging support throughout
- Add resource cleanup methods
- Create comprehensive documentation
- Include performance optimizations

Tests and docs included.
Note: Review URL normalization overlap with recent crawler changes.

- Quick Start is created and added
2024-11-08 18:45:12 +08:00
UncleCode
d11c004fbb Enhanced BFS Strategy: Improved monitoring, resource management & configuration
- Added CrawlStats for comprehensive crawl monitoring
- Implemented proper resource cleanup with shutdown mechanism
- Enhanced URL processing with better validation and politeness controls
- Added configuration options (max_concurrent, timeout, external_links)
- Improved error handling with retry logic
- Added domain-specific queues for better performance
- Created comprehensive documentation

Note: URL normalization needs review - potential duplicate processing
with core crawler for internal links. Currently commented out pending
further investigation of edge cases.
2024-11-08 15:57:23 +08:00
UncleCode
be472c624c Refactored AsyncWebScraper to include comprehensive error handling and progress tracking capabilities. Introduced a ScrapingProgress data class to monitor processed and failed URLs. Enhanced scraping methods to log errors and track stats throughout the scraping process. 2024-11-06 21:09:47 +08:00
UncleCode
c5aa1bec18 Merge pull request #229 from bizrockman/main
Preventing NoneType has no attribute get Errors
2024-11-06 07:31:07 +01:00
UncleCode
67a23c3182 feat(core): Release v0.3.73 with Browser Takeover and Docker Support
Major changes:
- Add browser takeover feature using CDP for authentic browsing
- Implement Docker support with full API server documentation
- Enhance Mockdown with tag preservation system
- Improve parallel crawling performance

This release focuses on authenticity and scalability, introducing the ability
to use users' own browsers while providing containerized deployment options.
Breaking changes include modified browser handling and API response structure.

See CHANGELOG.md for detailed migration guide.
2024-11-05 20:04:18 +08:00
bizrockman
796dbaf08c Rename episode_11_3_Extraction_Strategies:_Cosine.md to episode_11_3_Extraction_Strategies_Cosine.md
Name that will work in Windows
2024-11-04 20:19:43 +01:00
bizrockman
3a3c88a2d0 Rename episode_11_2_Extraction_Strategies:_LLM.md to episode_11_2_Extraction_Strategies_LLM.md
Name that will work in Windows
2024-11-04 20:19:20 +01:00
bizrockman
870296fa7e Rename episode_11_1_Extraction_Strategies:_JSON_CSS.md to episode_11_1_Extraction_Strategies_JSON_CSS.md
Name that will work in Windows
2024-11-04 20:18:58 +01:00
bizrockman
a28046c233 Rename episode_08_Media_Handling:_Images,_Videos,_and_Audio.md to episode_08_Media_Handling_Images_Videos_and_Audio.md
Name that will work in Windows
2024-11-04 20:18:26 +01:00
UncleCode
62a86dbe8d Refactor mission section in README and add mission diagram 2024-10-31 16:38:56 +08:00
UncleCode
6c7235d6a7 Add mission.md file 2024-10-31 15:22:00 +08:00
UncleCode
19c3f3efb2 Refactor tutorial markdown files: Update numbering and formatting 2024-10-30 20:58:07 +08:00
UncleCode
9307c19f35 Update documents, upload new version of quickstart. 2024-10-30 20:39:35 +08:00
UncleCode
3529c2e732 Update new tutorial documents and added to the docs folder. 2024-10-30 00:16:18 +08:00
UncleCode
c2a71a5abe Update Docs folder, prepare branch for new version 0.3.73 2024-10-27 19:35:13 +08:00
UncleCode
4239654722 Update Documentation 2024-10-27 19:24:46 +08:00
UncleCode
4e2852d5ff [v0.3.71] Enhance chunking strategies and improve overall performance
- Add OverlappingWindowChunking and improve SlidingWindowChunking
- Update CHUNK_TOKEN_THRESHOLD to 2048 tokens
- Optimize AsyncPlaywrightCrawlerStrategy close method
- Enhance flexibility in CosineStrategy with generic embedding model loading
- Improve JSON-based extraction strategies
- Add knowledge graph generation example
2024-10-19 18:36:59 +08:00
UncleCode
b309bc34e1 Fix the model nam ein quick start example 2024-10-18 15:32:25 +08:00
UncleCode
b8147b64e0 chore: Bump version to 0.3.71 and improve error handling
- Update version number to 0.3.71
- Add sleep_on_close option to AsyncPlaywrightCrawlerStrategy
- Enhance context creation with additional options
- Improve error message formatting and visibility
- Update quickstart documentation
2024-10-18 13:31:12 +08:00
UncleCode
768aa06ceb feat(crawler): Enhance stealth and flexibility, improve error handling
- Implement playwright_stealth for better bot detection avoidance
- Add user simulation and navigator override options
- Improve iframe processing and browser selection
- Enhance error reporting and debugging capabilities
- Optimize image processing and parallel crawling
- Add new example for user simulation feature
- Added support for including links in Markdown content, by definin g a new flag `include_links_on_markdown` in `crawl` method.
2024-10-17 21:37:48 +08:00
unclecode
320afdea64 feat: Enhance crawler flexibility and LLM extraction capabilities
- Add browser type selection (Chromium, Firefox, WebKit)
- Implement iframe content extraction
- Improve image processing and dimension updates
- Add custom headers support in AsyncPlaywrightCrawlerStrategy
- Enhance delayed content retrieval with new parameter
- Optimize HTML sanitization and Markdown conversion
- Update examples in quickstart_async.py for new features
2024-10-14 21:03:28 +08:00
unclecode
b9bbd42373 Update Quickstart examples 2024-10-13 14:37:45 +08:00
unclecode
68e9144ce3 feat: Enhance crawling control and LLM extraction flexibility
- Add before_retrieve_html hook and delay_before_return_html option
- Implement flexible page_timeout for smart_wait function
- Support extra_args and custom headers in LLM extraction
- Allow arbitrary kwargs in AsyncWebCrawler initialization
- Improve perform_completion_with_backoff for custom API calls
- Update examples with new features and diverse LLM providers
2024-10-12 14:48:22 +08:00
unclecode
ff3524d9b1 feat(v0.3.6): Add screenshot capture, delayed content, and custom timeouts
- Implement screenshot capture functionality
- Add delayed content retrieval method
- Introduce custom page timeout parameter
- Enhance LLM support with multiple providers
- Improve database schema auto-updates
- Optimize image processing in WebScrappingStrategy
- Update error handling and logging
- Expand examples in quickstart_async.py
2024-10-12 13:42:42 +08:00
unclecode
4750810a67 Enhance AsyncWebCrawler with smart waiting and screenshot capabilities
- Implement smart_wait function in AsyncPlaywrightCrawlerStrategy
- Add screenshot support to AsyncCrawlResponse and AsyncWebCrawler
- Improve error handling and timeout management in crawling process
- Fix typo in CrawlResult model (responser_headers -> response_headers)
- Update .gitignore to exclude additional files
- Adjust import path in test_basic_crawling.py
2024-10-02 17:34:56 +08:00
unclecode
5d4e92db7d Update quickstart_async.py to improve performance and add Firecrawl simulation 2024-09-28 00:11:39 +08:00
unclecode
8b6e88c85c Update .gitignore to ignore temporary and test directories 2024-09-26 15:09:49 +08:00
unclecode
10cdad039d Update documents and README 2024-09-25 16:52:11 +08:00
unclecode
4d48bd31ca Push async version last changes for merge to main branch 2024-09-24 20:52:08 +08:00