refactor: Update image description minimum word threshold in get_content_of_website_optimized

2024-08-02 15:55:32 +08:00
parent 9ee988753d
commit 659c8cd953
8 changed files with 71 additions and 16 deletions
--- a/docs/examples/llm_extraction_openai_pricing.py
+++ b/docs/examples/llm_extraction_openai_pricing.py
@@ -21,7 +21,8 @@ result = crawler.run(
    url=url,
    word_count_threshold=1,
    extraction_strategy= LLMExtractionStrategy(
-        provider= "openai/gpt-4o", api_token = os.getenv('OPENAI_API_KEY'), 
+        # provider= "openai/gpt-4o", api_token = os.getenv('OPENAI_API_KEY'), 
+        provider= "groq/llama-3.1-70b-versatile", api_token = os.getenv('GROQ_API_KEY'), 
        schema=OpenAIModelFee.model_json_schema(),
        extraction_type="schema",
        instruction="From the crawled content, extract all mentioned model names along with their "\
--- a/docs/md/changelog.md
+++ b/docs/md/changelog.md
@@ -1,5 +1,24 @@
 # Changelog

+# Changelog
+
+## [v0.2.76] - 2024-08-02
+
+Major improvements in functionality, performance, and cross-platform compatibility! 🚀
+
+- 🐳 **Docker enhancements**: Significantly improved Dockerfile for easy installation on Linux, Mac, and Windows.
+- 🌐 **Official Docker Hub image**: Launched our first official image on Docker Hub for streamlined deployment.
+- 🔧 **Selenium upgrade**: Removed dependency on ChromeDriver, now using Selenium's built-in capabilities for better compatibility.
+- 🖼️ **Image description**: Implemented ability to generate textual descriptions for extracted images from web pages.
+- ⚡ **Performance boost**: Various improvements to enhance overall speed and performance.
+
+A big shoutout to our amazing community contributors:
+- [@aravindkarnam](https://github.com/aravindkarnam) for developing the textual description extraction feature.
+- [@FractalMind](https://github.com/FractalMind) for creating the first official Docker Hub image and fixing Dockerfile errors.
+- [@ketonkss4](https://github.com/ketonkss4) for identifying Selenium's new capabilities, helping us reduce dependencies.
+
+Your contributions are driving Crawl4AI forward! 🙌
+
 ## [v0.2.75] - 2024-07-19

 Minor improvements for a more maintainable codebase:
--- a/docs/md/index.md
+++ b/docs/md/index.md
@@ -1,4 +1,4 @@
-# Crawl4AI v0.2.75
+# Crawl4AI v0.2.76

 Welcome to the official documentation for Crawl4AI! 🕷️🤖 Crawl4AI is an open-source Python library designed to simplify web crawling and extract useful information from web pages. This documentation will guide you through the features, usage, and customization of Crawl4AI.

--- a/docs/md/installation.md
+++ b/docs/md/installation.md
@@ -8,6 +8,8 @@ There are three ways to use Crawl4AI:

 ## Option 1: Library Installation

+You can try this Colab for a quick start: [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1sJPAmeLj5PMrg2VgOwMJ2ubGIcK0cJeX#scrollTo=g1RrmI4W_rPk)
+
 Crawl4AI offers flexible installation options to suit various use cases. Choose the option that best fits your needs:

 - **Default Installation** (Basic functionality):
--- a/docs/md/introduction.md
+++ b/docs/md/introduction.md
@@ -20,18 +20,6 @@ Crawl4AI is designed to simplify the process of crawling web pages and extractin
 - **🎯 CSS Selector Support**: Extract specific content using CSS selectors.
 - **📝 Instruction/Keyword Refinement**: Pass instructions or keywords to refine the extraction process.

-## Recent Changes (v0.2.5) 🌟
-
-- **New Hooks**: Added six important hooks to the crawler:
-  - 🟢 `on_driver_created`: Called when the driver is ready for initializations.
-  - 🔵 `before_get_url`: Called right before Selenium fetches the URL.
-  - 🟣 `after_get_url`: Called after Selenium fetches the URL.
-  - 🟠 `before_return_html`: Called when the data is parsed and ready.
-  - 🟡 `on_user_agent_updated`: Called when the user changes the user agent, causing the driver to reinitialize.
-- **New Example**: Added an example in [`quickstart.py`](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/quickstart.py) in the example folder under the docs.
-- **Improved Semantic Context**: Maintaining the semantic context of inline tags (e.g., abbreviation, DEL, INS) for improved LLM-friendliness.
-- **Dockerfile Update**: Updated Dockerfile to ensure compatibility across multiple platforms.
-
 Check the [Changelog](https://github.com/unclecode/crawl4ai/blob/main/CHANGELOG.md) for more details.

 ## Power and Simplicity of Crawl4AI 🚀