crawl4ai

Author	SHA1	Message	Date
unclecode	fb6ed5f000	feat: Sanitize input and handle encoding issues in LLMExtractionStrategy This commit modifies the LLMExtractionStrategy class in `extraction_strategy.py` to sanitize input and handle potential encoding issues. The `sanitize_input_encode` function is introduced in `utils.py` to encode and decode the input text as UTF-8 or ASCII, depending on the encoding issues encountered. If an encoding error occurs, the function falls back to ASCII encoding and logs a warning message. This change improves the robustness of the extraction process and ensures that characters are not lost due to encoding issues.	2024-07-05 17:30:58 +08:00
unclecode	597fe8bdb7	chore: Delete existing database file and initialize new database This commit deletes the existing database file and initializes a new database in the `crawl4ai/database.py` file. The `os.remove()` function is used to delete the file if it exists, and then the `init_db()` function is called to initialize the new database. This change is necessary to start with a clean database state.	2024-07-05 17:04:57 +08:00
unclecode	3ff2a0d0e7	Merge branch 'main' of https://github.com/unclecode/crawl4ai v0.2.73	2024-07-03 15:26:47 +08:00
unclecode	3cd1b3719f	Bump version to v0.2.73, update documentation, and resolve installation issues	2024-07-03 15:26:43 +08:00
unclecode	9926eb9f95	feat: Bump version to v0.2.73 and update documentation This commit updates the version number to v0.2.73 and makes corresponding changes in the README.md and Dockerfile. The Dockerfile now installs the default mode, which resolves many installation issues. Additionally, the installation instructions are updated to include support for different modes. setup.py no longer depends on spaCy. The change log is also updated to reflect these changes. Adds support for websites that need a headed (non-headless) browser.	2024-07-03 15:19:22 +08:00
UncleCode	3abaa82501	Merge pull request #37 from shivkumar0757/fix-readme-encoding @shivkumar0757 Great work! I value your contribution and have merged your pull request. You will be credited in the upcoming change-log. Thank you for your continuous support in advancing this library, to democratize an open access crawler to everyone.	2024-07-01 07:31:07 +02:00
unclecode	88d8cd8650	feat: Add page load check for LocalSeleniumCrawlerStrategy This commit adds a page load check for the LocalSeleniumCrawlerStrategy in the `crawl` method. The `_ensure_page_load` method is introduced to ensure that the page has finished loading before proceeding. This helps to prevent issues with incomplete page sources and improves the reliability of the crawler.	2024-07-01 00:07:32 +08:00
shiv	a08f21d66c	Fix UnicodeDecodeError by reading README.md with UTF-8 encoding	2024-06-30 20:27:33 +05:30
unclecode	d58286989c	UPDATE DOCUMENTS	2024-06-30 00:34:02 +08:00
unclecode	b58af3349c	chore: Update installation instructions with support for different modes v0.2.72	2024-06-30 00:22:17 +08:00
unclecode	940df4631f	Update ChangeLog	2024-06-30 00:18:40 +08:00
unclecode	685706e0aa	Update version, and change log	2024-06-30 00:17:43 +08:00
unclecode	7b0979e134	Update README and Dockerfile	2024-06-30 00:15:43 +08:00
unclecode	61ae2de841	1/ Update setup.py to support the following modes: - default (most frequent mode) - torch - transformers - all 2/ Update Dockerfile 3/ Update documentation as well.	2024-06-30 00:15:29 +08:00
unclecode	5b28eed2c0	Add a temporary solution for when we can't crawl websites in headless mode.	2024-06-29 23:25:50 +08:00
unclecode	f8a11779fe	Update change log	2024-06-26 16:48:36 +08:00
unclecode	d11a83c232	## [0.2.71] 2024-06-26 • Refactored `crawler_strategy.py` to handle exceptions and improve error messages • Improved `get_content_of_website_optimized` function in `utils.py` for better performance • Updated `utils.py` with latest changes • Migrated to `ChromeDriverManager` for resolving Chrome driver download issues v0.2.71	2024-06-26 15:34:15 +08:00
unclecode	3255c7a3fa	Update CHANGELOG.md with recent commits	2024-06-26 15:20:34 +08:00
unclecode	4756d0a532	Refactor crawler_strategy.py to handle exceptions and improve error messages	2024-06-26 15:04:33 +08:00
unclecode	7ba2142363	chore: Refactor get_content_of_website_optimized function in utils.py	2024-06-26 14:43:09 +08:00
unclecode	96d1eb0d0d	Some updates in utils.py	2024-06-26 13:03:03 +08:00
unclecode	144cfa0eda	Switch to ChromeDriverManager due to issues with downloading the Chrome driver	2024-06-26 13:00:17 +08:00
unclecode	a0dff192ae	Update README for speed example	2024-06-24 23:06:12 +08:00
unclecode	1fffeeedd2	Update Readme: Showcase the speed	2024-06-24 23:02:08 +08:00
unclecode	f51b078042	Update README example.	2024-06-24 22:54:29 +08:00
unclecode	b6023a51fb	Add star chart	2024-06-24 22:47:46 +08:00
unclecode	78cfad8b2f	chore: Update version to 0.2.7 and improve extraction function speed v0.2.7	2024-06-24 22:39:56 +08:00
unclecode	68b3dff74a	Update CSS	2024-06-23 00:36:03 +08:00
unclecode	bfc4abd6e8	Update documents	2024-06-22 20:57:03 +08:00
unclecode	8c77a760fc	Fixed: - Redirect "/" to mkdocs	2024-06-22 20:54:32 +08:00
unclecode	b9bf8ac9d7	Fix mounting the "/" to mkdocs site folder	2024-06-22 20:41:39 +08:00
unclecode	d6182bedd7	chore: - Add demo page to the new mkdocs - Set website home page to mkdocs	2024-06-22 20:36:01 +08:00
unclecode	2217904876	Update .gitignore	2024-06-22 18:12:12 +08:00
unclecode	2c2362b4d3	issue 19 is resolved - Update Dockerfile to install mkdocs and build documentation v0.2.6	2024-06-22 17:18:00 +08:00
unclecode	612ed3fef2	chore: Update print statement to use markdown format	2024-06-21 19:10:13 +08:00
unclecode	fb2a6d0d04	chore: Update documentation link in README.md	2024-06-21 18:05:18 +08:00
unclecode	19d3d39115	Update: Merge the DOCS branch	2024-06-21 18:04:13 +08:00
unclecode	c1413e6916	chore: Update documentation link in README.md	2024-06-21 17:57:47 +08:00
unclecode	e7705e661a	ADD MKDocs	2024-06-21 17:56:54 +08:00
unclecode	21b110bfd7	Update LLMExtractionStrategy to disable chunking if specified, Add example of summarization for a web page.	2024-06-19 19:03:35 +08:00
unclecode	1fcb573909	chore: Update table of contents in README.md	2024-06-19 18:53:22 +08:00
unclecode	0f6c5f5453	chore: Update configuration values, create new example, and update Dockerfile and README	2024-06-19 18:50:58 +08:00
unclecode	350ca1511b	chore: Update configuration values, create new example, and update Dockerfile and README	2024-06-19 18:48:20 +08:00
unclecode	539263a8ba	chore: Update configuration values for chunk token threshold, overlap rate, and minimum word threshold. Create a new example for LLMExtraction Strategy, update Dockerfile, and README	2024-06-19 18:32:20 +08:00
unclecode	3f0e265baf	Merge branch 'format-inline-tags'	2024-06-19 00:48:38 +08:00
unclecode	21e2538e57	Update quickstart.py	2024-06-19 00:37:53 +08:00
unclecode	480902bd66	Update README	2024-06-18 20:02:21 +08:00
unclecode	853b9d59d8	feat: Add hooks for enhanced control over Selenium drivers - Added six hooks: on_driver_created, before_get_url, after_get_url, before_return_html, on_user_agent_updated. - Included example usage in quickstart.py. - Updated README and changelog.	2024-06-18 20:00:51 +08:00
unclecode	6d04284c44	Merge branch 'hooks'	2024-06-18 19:53:50 +08:00
unclecode	4a50781453	chore: Remove local and .files folders from .gitignore	2024-06-17 15:57:34 +08:00
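
Commit fb6ed5f000 above describes `sanitize_input_encode` as round-tripping text through UTF-8 and falling back to ASCII with a logged warning when encoding fails. The sketch below is one way that behavior could look; it is an illustration written from the commit message only, not the code in `utils.py`. The logger name, the choice to drop unencodable characters in the fallback, and the exact warning text are all assumptions.

```python
import logging

# Assumption: a module-level logger; the real project may log differently.
logger = logging.getLogger("sanitize_sketch")


def sanitize_input_encode(text: str) -> str:
    """Round-trip text through UTF-8; fall back to ASCII if that fails."""
    try:
        # Strict UTF-8 encoding fails on lone surrogates such as "\ud800".
        return text.encode("utf-8").decode("utf-8")
    except UnicodeEncodeError as exc:
        logger.warning("UTF-8 encoding failed (%s); falling back to ASCII", exc)
        # Assumption: characters that ASCII cannot represent are dropped.
        return text.encode("ascii", errors="ignore").decode("ascii")
```

Well-formed text passes through unchanged, while a string containing a lone surrogate triggers the warning and comes back with every non-ASCII character removed. That loss is a property of this sketch's fallback choice, not a statement about the project's behavior.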

186 Commits