crawl4ai

Author	SHA1	Message	Date
Aravind Karnam	7a5f83b76f	fix: Added browser config and crawler run config from 0.4.22	2024-12-18 10:33:09 +05:30
Aravind Karnam	2f5e0598bb	updated definition of can_process_url to include depth as an argument, as it's needed to skip filters for start_url	2024-11-26 18:26:57 +05:30
Aravind Karnam	155c756238	<Future pending> issue fix was incorrect. Reverting	2024-11-26 17:04:04 +05:30
Aravind Karnam	a888c91790	Fix "Future attached to a different loop" error by ensuring tasks are created in the correct event loop - Explicitly retrieve and use the correct event loop when creating tasks to avoid cross-loop issues. - Ensures proper task scheduling in environments with multiple event loops.	2024-11-26 14:05:02 +05:30
Aravind Karnam	a98d51a62c	Remove the can_process_url check from _process_links since it's already being checked in process_url	2024-11-26 11:11:49 +05:30
Aravind Karnam	ee3001b1f7	fix: moved depth as a param to can_process_url and applying filter chain only when depth is not zero. This way filter chain is skipped but other validations are in place even for start URL	2024-11-26 10:22:14 +05:30
Aravind Karnam	b13fd71040	chore: 1. Expose process_external_links as a param 2. Removed a few unused imports 3. Removed URL normalisation for external links separately as that won't be necessary	2024-11-26 10:07:11 +05:30
Aravind Karnam	2226ef53c8	fix: Exempting the start_url from can_process_url	2024-11-23 14:59:14 +05:30
Aravind Karnam	f8e85b1499	Fixed a bug in _process_links, handled condition for when url_scorer is passed as None, renamed the scrapper folder to scraper.	2024-11-23 13:52:34 +05:30
Aravind Karnam	c1797037c0	Fixed a few bugs, import errors and changed to asyncio wait_for instead of timeout to support python versions < 3.11	2024-11-23 12:39:25 +05:30
UncleCode	0d357ab7d2	feat(scraper): Enhance URL filtering and scoring systems Implement comprehensive URL filtering and scoring capabilities: Filters: - Add URLPatternFilter with glob/regex support - Implement ContentTypeFilter with MIME type checking - Add DomainFilter for domain control - Create FilterChain with stats tracking Scorers: - Complete KeywordRelevanceScorer implementation - Add PathDepthScorer for URL structure scoring - Implement ContentTypeScorer for file type priorities - Add FreshnessScorer for date-based scoring - Add DomainAuthorityScorer for domain weighting - Create CompositeScorer for combined strategies Features: - Add statistics tracking for both filters and scorers - Implement logging support throughout - Add resource cleanup methods - Create comprehensive documentation - Include performance optimizations Tests and docs included. Note: Review URL normalization overlap with recent crawler changes.	2024-11-08 19:02:28 +08:00
UncleCode	bae4665949	feat(scraper): Enhance URL filtering and scoring systems Implement comprehensive URL filtering and scoring capabilities: Filters: - Add URLPatternFilter with glob/regex support - Implement ContentTypeFilter with MIME type checking - Add DomainFilter for domain control - Create FilterChain with stats tracking Scorers: - Complete KeywordRelevanceScorer implementation - Add PathDepthScorer for URL structure scoring - Implement ContentTypeScorer for file type priorities - Add FreshnessScorer for date-based scoring - Add DomainAuthorityScorer for domain weighting - Create CompositeScorer for combined strategies Features: - Add statistics tracking for both filters and scorers - Implement logging support throughout - Add resource cleanup methods - Create comprehensive documentation - Include performance optimizations Tests and docs included. Note: Review URL normalization overlap with recent crawler changes. - Quick Start is created and added	2024-11-08 18:45:12 +08:00
UncleCode	d11c004fbb	Enhanced BFS Strategy: Improved monitoring, resource management & configuration - Added CrawlStats for comprehensive crawl monitoring - Implemented proper resource cleanup with shutdown mechanism - Enhanced URL processing with better validation and politeness controls - Added configuration options (max_concurrent, timeout, external_links) - Improved error handling with retry logic - Added domain-specific queues for better performance - Created comprehensive documentation Note: URL normalization needs review - potential duplicate processing with core crawler for internal links. Currently commented out pending further investigation of edge cases.	2024-11-08 15:57:23 +08:00
UncleCode	3d1c9a8434	Reviewing the BFS strategy.	2024-11-07 18:54:53 +08:00
UncleCode	be472c624c	Refactored AsyncWebScraper to include comprehensive error handling and progress tracking capabilities. Introduced a ScrapingProgress data class to monitor processed and failed URLs. Enhanced scraping methods to log errors and track stats throughout the scraping process.	2024-11-06 21:09:47 +08:00
Aravind Karnam	8105fd178e	Removed stubs for remove_from_future_crawls since the visited set is updated as soon as the URL is queued. Removed add_to_retry_queue(url) since retry with exponential backoff, with the help of tenacity, is going to take care of it.	2024-10-17 15:42:43 +05:30
Aravind Karnam	ce7fce4b16	1. Moved to asyncio.wait instead of gather so that results can be yielded just as they are ready, rather than in batches 2. Moved visited.add(url) to before the task is put in the queue rather than after the crawl is completed. This makes sure that duplicate crawls don't happen when the same URL is found at a different depth and gets queued too, because the crawl is not yet completed and the visited set is not updated. 3. Renamed the yield_results attribute to stream, since that seems to be popularly used in other AI libraries for intermediate results.	2024-10-17 12:25:17 +05:30
Aravind Karnam	de28b59aca	removed unused imports	2024-10-16 22:36:48 +05:30
Aravind Karnam	04d8b47b92	Exposed min_crawl_delay for BFSScraperStrategy	2024-10-16 22:34:54 +05:30
Aravind Karnam	2943feeecf	1. Added a flag to yield each crawl result as it becomes ready, with the final scraper result as another option 2. Removed ascrape_many method, as I'm currently not focusing on it in the first cut of scraper 3. Added some error handling for cases where robots.txt cannot be fetched or parsed.	2024-10-16 22:05:29 +05:30
Aravind Karnam	8a7d29ce85	updated some comments and removed content type checking functionality from core as it's implemented as a filter	2024-10-16 15:59:37 +05:30
Aravind Karnam	d743adac68	Fixed some bugs in robots.txt processing	2024-10-03 15:58:57 +05:30
Aravind Karnam	7fe220dbd5	1. Introduced a bool flag to ascrape method to switch between sequential and concurrent processing 2. Introduced a dictionary for depth tracking across various tasks 3. Removed redundancy with crawled_urls variable. Instead, the returned object builds a list from the visited set.	2024-10-03 11:17:11 +05:30
Aravind Karnam	7f3e2e47ed	Parallel processing with retry on failure with exponential backoff - Simplified URL validation and normalisation - respecting robots.txt	2024-09-19 12:34:12 +05:30
Aravind Karnam	44ce12c62c	Created scaffolding for Scraper as per the plan. Implemented the ascrape method in bfs_scraper_strategy	2024-09-09 13:13:34 +05:30

25 Commits
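
Several of the commits above describe the same pattern: results are yielded as soon as each task finishes (asyncio.wait rather than gather), a URL is added to the visited set before its task is queued, and asyncio.wait_for is used for timeouts so that Python versions older than 3.11 are supported. The following is only a minimal sketch of that pattern under stated assumptions, not the repository's actual code. The LINKS graph and the fetch, crawl and collect names are hypothetical and exist only for illustration.

```python
import asyncio

# Hypothetical in-memory link graph standing in for real pages.
LINKS = {
    "a": ["b", "c"],
    "b": ["c", "d"],
    "c": ["d"],
    "d": [],
}


async def fetch(url):
    """Pretend to crawl a page; return the URL and its outgoing links."""
    await asyncio.sleep(0.01)
    return url, LINKS.get(url, [])


def schedule(url, timeout):
    # asyncio.wait_for works on Python versions before 3.11,
    # where the asyncio.timeout() context manager is unavailable.
    return asyncio.ensure_future(asyncio.wait_for(fetch(url), timeout))


async def crawl(start, timeout=1.0):
    """Yield (url, links) as soon as each fetch completes."""
    visited = {start}  # marked before the task is scheduled
    pending = {schedule(start, timeout)}
    while pending:
        # Return as soon as any task finishes instead of waiting for a batch.
        done, pending = await asyncio.wait(
            pending, return_when=asyncio.FIRST_COMPLETED
        )
        for task in done:
            url, links = task.result()
            yield url, links
            for link in links:
                if link not in visited:
                    # Add to visited before queueing so the same URL found
                    # by another in-flight page is not scheduled twice.
                    visited.add(link)
                    pending.add(schedule(link, timeout))


async def collect(start):
    """Gather the crawl order into a list, for demonstration only."""
    return [url async for url, _ in crawl(start)]
```

Running asyncio.run(collect("a")) visits each URL in the toy graph exactly once, even though "c" and "d" are linked from more than one page. Error handling for timeouts and retries is deliberately left out of the sketch.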