Aravind Karnam
9ef43bc5f0
Refactor: Make adeep_crawl a method of the crawler itself. Add attributes to CrawlResult to reconstruct the tree once deep crawling is completed
2025-01-29 15:58:21 +05:30
Aravind Karnam
84ffdaab9a
Refactor: Make adeep_crawl a method of the crawler itself. Add attributes to CrawlResult to reconstruct the tree once deep crawling is completed
2025-01-29 13:06:09 +05:30
Aravind Karnam
78223bc847
feat: create ScraperPageResult model to attach score and depth attributes to yielded/returned crawl results
2025-01-28 16:47:30 +05:30
Aravind Karnam
85847ff13f
feat:
...
1. Make active_crawls a dict instead of a set and remove the jobs array, for efficient lookup and storage of active crawls and crawl control.
2. Put a lock on active_crawls, so simultaneous push and pop by coroutines don't cause a race condition.
3. Move the depth check logic outside the child-link for loop, as source_url doesn't change in the loop.
2025-01-28 12:39:45 +05:30
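Note on the commit above: points 1 and 2 can be illustrated with a minimal sketch, assuming an asyncio-based scraper. The class `CrawlRegistry`, its methods, and the URL-to-depth mapping are hypothetical names chosen for illustration and are not taken from the project's code; only `asyncio.Lock` and `asyncio.gather` are real standard-library calls.

```python
import asyncio
from typing import Dict, List, Optional, Tuple


class CrawlRegistry:
    """Hypothetical stand-in for an active_crawls dict guarded by a lock."""

    def __init__(self) -> None:
        # Assumed layout: URL -> depth. A dict gives O(1) lookup and removal by key.
        self._active: Dict[str, int] = {}
        self._lock = asyncio.Lock()

    async def add(self, url: str, depth: int) -> bool:
        # The lock keeps the check-then-insert atomic across coroutines.
        async with self._lock:
            if url in self._active:
                return False
            self._active[url] = depth
            return True

    async def remove(self, url: str) -> Optional[int]:
        async with self._lock:
            return self._active.pop(url, None)

    async def count(self) -> int:
        async with self._lock:
            return len(self._active)


async def demo() -> Tuple[List[bool], int]:
    registry = CrawlRegistry()
    urls = ["https://example.com/a", "https://example.com/b", "https://example.com/a"]
    # Concurrent adds: the duplicate URL is rejected instead of being tracked twice.
    added = await asyncio.gather(*(registry.add(u, 1) for u in urls))
    await registry.remove("https://example.com/b")
    return list(added), await registry.count()


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In a single-threaded event loop the lock only matters when the critical section awaits; it is shown here because the commit message says one was added.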
Aravind Karnam
f34b4878cf
fix: code formatting
2025-01-28 10:00:01 +05:30
Aravind Karnam
0ff95c83bc
feat: change input params to scraper, add asynchronous context manager to AsyncWebScraper, optimise filter application
2025-01-27 18:13:33 +05:30
UncleCode
e6ef8d91ba
refactor(scraper): optimize URL validation and filter performance
...
- Replace validators library with built-in urlparse for URL validation
- Optimize filter statistics update logic for better performance
- Add performance benchmarking suite for filters
- Add execution time tracking to scraper examples
- Update gitignore with windsurfrules
BREAKING CHANGE: Removed dependency on validators library for URL validation
2025-01-22 19:45:56 +08:00
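Note on the commit above: a minimal sketch of URL validation with the standard library's `urlparse`, in place of a third-party validators package. The function name `is_valid_url` and the rule that only `http` and `https` are accepted are assumptions for illustration; the project's actual checks may differ.

```python
from urllib.parse import urlparse


def is_valid_url(url: str) -> bool:
    """Hypothetical helper: accept only http(s) URLs that carry a host part."""
    try:
        parsed = urlparse(url)
    except ValueError:
        # urlparse raises ValueError on some malformed inputs, e.g. a broken IPv6 host.
        return False
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


if __name__ == "__main__":
    for candidate in ("https://example.com/path", "example.com", "ftp://example.com"):
        print(candidate, is_valid_url(candidate))
```

This check is looser than a full validator: it does not verify that the host is well-formed, only that a scheme and a network location are present.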
Aravind Karnam
6e78c56dda
Refactor: Removed all scheduling logic from the scraper. From now on, the scraper expects arun_many to handle all scheduling. The scraper will only do traversal, validation, compliance checks, URL filtering, scoring, etc. Reformatted some of the scraper files with the Black code formatter
2025-01-21 18:44:43 +05:30
Aravind Karnam
67fa06c09b
Refactor: Removed all scheduling logic from the scraper. From now on, the scraper expects arun_many to handle all scheduling. The scraper will only do traversal, validation, compliance checks, URL filtering, scoring, etc. Reformatted some of the scraper files with the Black code formatter
2025-01-21 17:49:51 +05:30
Aravind Karnam
7a5f83b76f
fix: Added browser config and crawler run config from 0.4.22
2024-12-18 10:33:09 +05:30
Aravind Karnam
ff731e4ea1
fixed the final scraper_quickstart.py example
2024-11-26 17:08:32 +05:30
Aravind Karnam
9530ded83a
fixed the final scraper_quickstart.py example
2024-11-26 17:05:54 +05:30
Aravind Karnam
f8e85b1499
Fixed a bug in _process_links, handled condition for when url_scorer is passed as None, renamed the scrapper folder to scraper.
2024-11-23 13:52:34 +05:30