Commit Graph

314 Commits

Author SHA1 Message Date
Aravind Karnam
e5ecf291f3 Implemented filtering for images and grabbing the contextual text from nearest parent 2024-07-21 15:03:17 +05:30
Aravind Karnam
9d0cafcfa6 fixed import error in model_loader.py 2024-07-21 14:55:58 +05:30
unclecode
f5a4e80e2c chore: Fix typo in chunking_strategy.py and crawler_strategy.py
The commit fixes a typo in the `chunking_strategy.py` file where `nl.toknize.TextTilingTokenizer()` was corrected to `nl.tokenize.TextTilingTokenizer()`. Additionally, in the `crawler_strategy.py` file, the commit converts the screenshot image to RGB mode before saving it as a JPEG. This ensures consistent image quality and compression.
2024-07-19 17:40:31 +08:00
unclecode
4d283ab386 ## [v0.2.74] - 2024-07-08
A slew of exciting updates to improve the crawler's stability and robustness! 🎉

- 💻 **UTF encoding fix**: Resolved the Windows \"charmap\" error by adding UTF encoding.
- 🛡️ **Error handling**: Implemented MaxRetryError exception handling in LocalSeleniumCrawlerStrategy.
- 🧹 **Input sanitization**: Improved input sanitization and handled encoding issues in LLMExtractionStrategy.
- 🚮 **Database cleanup**: Removed existing database file and initialized a new one.
2024-07-08 16:33:25 +08:00
unclecode
9926eb9f95 feat: Bump version to v0.2.73 and update documentation
This commit updates the version number to v0.2.73 and makes corresponding changes in the README.md and Dockerfile.

Docker file install the default mode, this resolve many of installation issues.

Additionally, the installation instructions are updated to include support for different modes. Setup.py doesn't have anymore dependancy on Spacy.

The change log is also updated to reflect these changes.

Supporting websites need with-head browser.
2024-07-03 15:19:22 +08:00
unclecode
88d8cd8650 feat: Add page load check for LocalSeleniumCrawlerStrategy
This commit adds a page load check for the LocalSeleniumCrawlerStrategy in the `crawl` method. The `_ensure_page_load` method is introduced to ensure that the page has finished loading before proceeding. This helps to prevent issues with incomplete page sources and improves the reliability of the crawler.
2024-07-01 00:07:32 +08:00
unclecode
61ae2de841 1/Update setup.py to support following modes:
- default (most frequent mode)
- torch
- transformers
- all
2/ Update Docker file
3/ Update documentation as well.
2024-06-30 00:15:29 +08:00
unclecode
5b28eed2c0 Add a temporary solution for when we can't crawl websites in headless mode. 2024-06-29 23:25:50 +08:00
unclecode
4756d0a532 Refactor crawler_strategy.py to handle exceptions and improve error messages 2024-06-26 15:04:33 +08:00
unclecode
7ba2142363 chore: Refactor get_content_of_website_optimized function in utils.py 2024-06-26 14:43:09 +08:00
unclecode
96d1eb0d0d Some updated ins utils.py 2024-06-26 13:03:03 +08:00
unclecode
144cfa0eda Switch to ChromeDriverManager due some issues with download the chrome driver 2024-06-26 13:00:17 +08:00
unclecode
a0dff192ae Update README for speed example 2024-06-24 23:06:12 +08:00
unclecode
1fffeeedd2 Update Readme: Showcase the speed 2024-06-24 23:02:08 +08:00
unclecode
78cfad8b2f chore: Update version to 0.2.7 and improve extraction function speed 2024-06-24 22:39:56 +08:00
unclecode
d6182bedd7 chore:
- Add demo page to the new mkdocs
- Set website home page to mkdocs
2024-06-22 20:36:01 +08:00
unclecode
21b110bfd7 Update LLMExtractionStrategy to disable chunking if specified, Add example of summarization for a web page. 2024-06-19 19:03:35 +08:00
unclecode
350ca1511b chore: Update configuration values, create new example, and update Dockerfile and README 2024-06-19 18:48:20 +08:00
unclecode
539263a8ba chore: Update configuration values for chunk token threshold, overlap rate, and minimum word threshold. Create a new example for LLMExtraction Strategy, update Dockerfile, and README 2024-06-19 18:32:20 +08:00
unclecode
3f0e265baf Merge branch 'format-inline-tags' 2024-06-19 00:48:38 +08:00
unclecode
853b9d59d8 feat: Add hooks for enhanced control over Selenium drivers
- Added six hooks: on_driver_created, before_get_url, after_get_url, before_return_html, on_user_agent_updated.
- Included example usage in quickstart.py.
- Updated README and changelog.
2024-06-18 20:00:51 +08:00
unclecode
77da48050d chore: Add custom headers to LocalSeleniumCrawlerStrategy 2024-06-17 15:50:03 +08:00
unclecode
9a97aacd85 chore: Add hooks for customizing the LocalSeleniumCrawlerStrategy 2024-06-17 15:37:18 +08:00
unclecode
413595542a Enhancement: Replaced inline HTML tags with textual format for better LLM context handling #24 2024-06-17 15:14:34 +08:00
unclecode
d1d83a6ef7 Fix issue #22: Use MD5 hash for caching HTML files to handle long URLs 2024-06-17 14:44:01 +08:00
unclecode
f7e0cee1b0 vital: Right now, only raw html is retrived from datbase, therefore, css selector and other filter will be executed every time. 2024-06-08 18:37:40 +08:00
unclecode
b3a0edaa6d - User agent
- Extract Links
- Extract Metadata
- Update Readme
- Update REST API document
2024-06-08 17:59:42 +08:00
unclecode
9c34b30723 Extract internal and external links. 2024-06-08 16:53:06 +08:00
unclecode
a19379aa58 Add recipe images, update README, and REST api example 2024-06-07 20:43:50 +08:00
unclecode
8e73a482a2 feat: Add screenshot functionality to crawl_urls
The code changes in this commit add the `screenshot` parameter to the `crawl_urls` function in `main.py`. This allows users to specify whether they want to take a screenshot of the page during the crawling process. The default value is `False`.

This commit message follows the established convention of starting with a type (feat for feature) and providing a concise and descriptive summary of the changes made.
2024-06-07 15:23:32 +08:00
unclecode
0533aeb814 v0.2.3:
- Extract all media tags
- Take screenshot of the page
2024-06-07 15:23:13 +08:00
unclecode
51f26d12fe Update for v0.2.2
- Support multiple JS scripts
- Fixed some of bugs
- Resolved a few issue relevant to Colab installation
2024-06-02 15:40:18 +08:00
unclecode
f1b60b2016 chore: Update ONNX model loading process 2024-05-31 18:07:05 +08:00
Unclecode
53d1176d53 chore: Update extraction strategy to support GPU, MPS, and CPU, add batch processing for CPU devices 2024-05-19 16:18:58 +00:00
unclecode
13a3b21d19 - Add ONNX embedding model for CPU devices, Update the similarithy threshold, improve the embedding speed. 2024-05-19 22:30:10 +08:00
unclecode
3846648c12 chore: Update extraction strategy to support GPU, MPS, and CPU, add batch procesing for CPU devices 2024-05-18 15:42:19 +08:00
unclecode
eb6423875f chore: Update Selenium options in crawler_strategy.py and add verbose logging in CosineStrategy 2024-05-18 14:13:06 +08:00
unclecode
b6319c6f6e chore: Add support for GPU, MPS, and CPU 2024-05-17 21:56:13 +08:00
UncleCode
454135856e Update extraction_strategy.py Support GPU, MPS, and CPU 2024-05-17 21:40:48 +08:00
UncleCode
33fddc27ad Update model loader to support GPU, MPS, and CPU 2024-05-17 21:39:22 +08:00
unclecode
36e46be23d chore: Add verbose option to ExtractionStrategy classes
This commit adds a new `verbose` option to the `ExtractionStrategy` classes. The `verbose` option allows for logging of extraction details, such as the number of extracted blocks and the URL being processed. This improves the debugging and monitoring capabilities of the code.
2024-05-17 18:06:10 +08:00
unclecode
f52f526002 chore: Update web_crawler.py to use NoExtractionStrategy as default 2024-05-17 16:03:35 +08:00
unclecode
3f8576f870 chore: Update model_loader.py to use pretrained models without resume_download 2024-05-17 15:26:15 +08:00
unclecode
a317dc5e1d Load CosineStrategy in the function 2024-05-17 15:13:06 +08:00
unclecode
a5f9d07dbf Remove dependency on Spacy model. 2024-05-17 15:08:03 +08:00
UncleCode
5b4a586b2d Update web_crawler.py
Set CosineExtraction as defaul strategy
2024-05-16 22:28:24 +08:00
UncleCode
a856319499 Update web_crawler.py
Set NoExtractionStrategy for FetchPages
2024-05-16 22:06:33 +08:00
UncleCode
5ce1dc1622 Update web_crawler.py
Set all extraction strategies default to NoExtractionStrategy
2024-05-16 21:58:11 +08:00
unclecode
ea16dec587 Improve library loading 2024-05-16 21:19:02 +08:00
unclecode
d19488a821 chore: Update model_loader.py to create necessary folders in the home directory 2024-05-16 21:05:24 +08:00