Moved score threshold to config.py and replaced the separator for tag.get_text in find_closest_parent_with_useful_text from a period (.) to a space ( ) to keep the text more neutral.

Implemented image filtering and extraction of contextual text from the nearest parent
Fixed import error in model_loader.py
2024-07-21 15:18:23 +05:30 · 2024-07-21 15:03:17 +05:30 · 2024-07-21 14:55:58 +05:30 · 2024-07-19 17:42:39 +08:00 · 2024-07-19 17:40:31 +08:00 · 2024-07-19 17:09:29 +08:00
14 changed files with 275 additions and 20 deletions
--- a/.gitignore
+++ b/.gitignore
@@ -165,6 +165,8 @@ Crawl4AI.egg-info/
 Crawl4AI.egg-info/*
 crawler_data.db
 .vscode/
+.tests/
+.test_pads/
 test_pad.py
 test_pad*.py
 .data/
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -1,5 +1,14 @@
 # Changelog

+## [v0.2.75] - 2024-07-19
+
+Minor improvements for a more maintainable codebase:
+
+- 🔄 Fixed a typo in `chunking_strategy.py` (`nl.toknize` → `nl.tokenize`) and converted screenshots to RGB before JPEG encoding in `crawler_strategy.py`
+- 🔄 Added the `.tests/` and `.test_pads/` directories to `.gitignore` to keep our repository clean and organized
+
+These changes may seem small, but they contribute to a more stable and sustainable codebase. By fixing typos and updating our `.gitignore` settings, we're ensuring that our code is easier to maintain and scale in the long run.
+
 ## [v0.2.74] - 2024-07-08
 A slew of exciting updates to improve the crawler's stability and robustness! 🎉

--- a/README.md
+++ b/README.md
@@ -1,4 +1,4 @@
-# Crawl4AI v0.2.74 🕷️🤖
+# Crawl4AI v0.2.75 🕷️🤖

 [![GitHub Stars](https://img.shields.io/github/stars/unclecode/crawl4ai?style=social)](https://github.com/unclecode/crawl4ai/stargazers)
 [![GitHub Forks](https://img.shields.io/github/forks/unclecode/crawl4ai?style=social)](https://github.com/unclecode/crawl4ai/network/members)
--- a/crawl4ai/chunking_strategy.py
+++ b/crawl4ai/chunking_strategy.py
@@ -55,7 +55,7 @@ class TopicSegmentationChunking(ChunkingStrategy):
    
    def __init__(self, num_keywords=3, **kwargs):
        import nltk as nl
-        self.tokenizer = nl.toknize.TextTilingTokenizer()
+        self.tokenizer = nl.tokenize.TextTilingTokenizer()
        self.num_keywords = num_keywords

    def chunk(self, text: str) -> list:
--- a/crawl4ai/config.py
+++ b/crawl4ai/config.py
@@ -27,3 +27,13 @@ WORD_TOKEN_RATE = 1.3

 # Threshold for the minimum number of word in a HTML tag to be considered 
 MIN_WORD_THRESHOLD = 1
+
+# Threshold for image extraction - range is 1 to 6
+# Images are scored on a point-based system to filter them by usefulness. One point is
+# awarded for each of the following criteria that an image meets:
+# - Height exceeds 150px, and separately width exceeds 150px
+# - File size is greater than 10 KB
+# - The alt attribute is set
+# - Image format is jpg, png, or webp
+# - Image is in the first half of the images extracted from the page
+IMAGE_SCORE_THRESHOLD = 2
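
The point system described in the config comments can be sketched as a pure function. This is a minimal illustration under stated assumptions, not the code in `utils.py`: the name `score_image` and its parameters are hypothetical, and it takes already-parsed pixel dimensions and a known file size instead of reading them from the page.

```python
import os

IMAGE_SCORE_THRESHOLD = 2  # same value as in config.py


def score_image(src, alt, height_px, width_px, size_bytes, index, total):
    """Return a usefulness score from 0 to 6 (hypothetical helper)."""
    score = 0
    if height_px and height_px > 150:
        score += 1
    if width_px and width_px > 150:
        score += 1
    if size_bytes > 10_000:
        score += 1
    if alt:
        score += 1
    # os.path.splitext keeps the leading dot, e.g. ".jpg"
    if os.path.splitext(src)[1].lower() in {".jpg", ".png", ".webp"}:
        score += 1
    # Images in the first half of the page get one extra point
    if total and index / total < 0.5:
        score += 1
    return score


hero = score_image("/img/hero.jpg", "Team photo", 400, 600, 52_000, 0, 10)
icon = score_image("/img/dot.gif", "", 16, 16, 300, 9, 10)
print(hero, hero > IMAGE_SCORE_THRESHOLD)
print(icon, icon > IMAGE_SCORE_THRESHOLD)
```

As in the diff, an image is kept only when its score is strictly greater than `IMAGE_SCORE_THRESHOLD`.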
--- a/crawl4ai/crawler_strategy.py
+++ b/crawl4ai/crawler_strategy.py
@@ -292,15 +292,22 @@ class LocalSeleniumCrawlerStrategy(CrawlerStrategy):
            # Open the screenshot with PIL
            image = Image.open(BytesIO(screenshot))

+            # Convert image to RGB mode
+            rgb_image = image.convert('RGB')
+
            # Convert to JPEG and compress
            buffered = BytesIO()
-            image.save(buffered, format="JPEG", quality=85)
+            rgb_image.save(buffered, format="JPEG", quality=85)
            img_base64 = base64.b64encode(buffered.getvalue()).decode('utf-8')

            if self.verbose:
                print(f"[LOG] 📸 Screenshot taken and converted to base64")

            return img_base64
+        except Exception as e:
+            if self.verbose:
+                print(f"[ERROR] Failed to take screenshot: {str(e)}")
+            return ""

        except Exception as e:
            error_message = sanitize_input_encode(f"Failed to take screenshot: {str(e)}")
--- a/crawl4ai/model_loader.py
+++ b/crawl4ai/model_loader.py
@@ -3,7 +3,7 @@ from pathlib import Path
 import subprocess, os
 import shutil
 import tarfile
-from crawl4ai.config import MODEL_REPO_BRANCH
+from .config import MODEL_REPO_BRANCH
 import argparse
 import urllib.request
 __location__ = os.path.realpath(os.path.join(os.getcwd(), os.path.dirname(__file__)))
--- a/crawl4ai/utils.py
+++ b/crawl4ai/utils.py
@@ -11,6 +11,9 @@ from .prompts import PROMPT_EXTRACT_BLOCKS
 from .config import *
 from pathlib import Path
 from typing import Dict, Any
+from urllib.parse import urljoin
+import requests
+from requests.exceptions import InvalidSchema

 class InvalidCSSSelectorError(Exception):
    pass
@@ -447,6 +450,101 @@ def get_content_of_website_optimized(url: str, html: str, word_count_threshold:
    links = {'internal': [], 'external': []}
    media = {'images': [], 'videos': [], 'audios': []}

+    def process_image(img, url, index, total_images):
+            # Check that the image is displayed and is not inside an undesired HTML element
+            def is_valid_image(img, parent, parent_classes):
+                style = img.get('style', '')
+                src = img.get('src', '')
+                classes_to_check = ['button', 'icon', 'logo']
+                tags_to_check = ['button', 'input']
+                return all([
+                    'display:none' not in style,
+                    src,
+                    not any(s in var for var in [src, img.get('alt', ''), *parent_classes] for s in classes_to_check),
+                    parent.name not in tags_to_check
+                ])
+
+            # Score an image for its usefulness
+            def score_image_for_usefulness(img, base_url, index, images_count):
+                # Function to parse image height/width value and units
+                def parse_dimension(dimension):
+                    if dimension:
+                        match = re.match(r"(\d+)(\D*)", dimension)
+                        if match:
+                            number = int(match.group(1))
+                            unit = match.group(2) or 'px'  # Default unit is 'px' if not specified
+                            return number, unit
+                    return None, None
+
+                # Fetch image file metadata to extract size and extension
+                def fetch_image_file_size(img, base_url):
+                    # Resolve a relative src against the page URL; an absolute (e.g. CDN) URL is kept as-is
+                    img_url = urljoin(base_url, img.get('src'))
+                    try:
+                        response = requests.head(img_url)
+                        if response.status_code == 200:
+                            return response.headers.get('Content-Length', None)
+                        else:
+                            print(f"Failed to retrieve file size for {img_url}")
+                            return None
+                    except InvalidSchema:
+                        return None
+                    except requests.RequestException:
+                        return None
+
+                image_height = img.get('height')
+                height_value, height_unit = parse_dimension(image_height)
+                image_width =  img.get('width')
+                width_value, width_unit = parse_dimension(image_width)
+                image_size = int(fetch_image_file_size(img,base_url) or 0)
+                image_format = os.path.splitext(img.get('src',''))[1].lower()
+                score = 0
+                if height_value:
+                    if height_unit == 'px' and height_value > 150:
+                        score += 1
+                    if height_unit in ['%','vh','vmin','vmax'] and height_value >30:
+                        score += 1
+                if width_value:
+                    if width_unit == 'px' and width_value > 150:
+                        score += 1
+                    if width_unit in ['%','vh','vmin','vmax'] and width_value >30:
+                        score += 1
+                if image_size > 10000:
+                    score += 1
+                if img.get('alt'):
+                    score+=1
+                if image_format in ['.jpg','.png','.webp']:
+                    score+=1
+                if index/images_count<0.5:
+                    score+=1
+                return score
+
+            # Extract meaningful text for images from closest parent
+            def find_closest_parent_with_useful_text(tag):
+                current_tag = tag
+                while current_tag:
+                    current_tag = current_tag.parent
+                    # Get the text content of the parent tag
+                    if current_tag:
+                        text_content = current_tag.get_text(separator=' ',strip=True)
+                        # Check if the text content has at least word_count_threshold
+                        if len(text_content.split()) >= word_count_threshold:
+                            return text_content
+                return None
+
+            if not is_valid_image(img, img.parent, img.parent.get('class', [])):
+                return None
+            score = score_image_for_usefulness(img, url, index, total_images)
+            if score <= IMAGE_SCORE_THRESHOLD:
+                return None
+            return {
+                'src': img.get('src', ''),
+                'alt': img.get('alt', ''),
+                'desc': find_closest_parent_with_useful_text(img),
+                'score': score,
+                'type': 'image'
+            }
+
    def process_element(element: element.PageElement) -> bool:
        try:
            if isinstance(element, NavigableString):
@@ -471,11 +569,6 @@ def get_content_of_website_optimized(url: str, html: str, word_count_threshold:
                keep_element = True

            elif element.name == 'img':
-                media['images'].append({
-                    'src': element.get('src'),
-                    'alt': element.get('alt'),
-                    'type': 'image'
-                })
                return True  # Always keep image elements

            elif element.name in ['video', 'audio']:
@@ -518,6 +611,14 @@ def get_content_of_website_optimized(url: str, html: str, word_count_threshold:
            print('Error processing element:', str(e))
            return False

+    # Process images by filtering them and extracting contextual text from the page
+    imgs = body.find_all('img')
+    media['images'] = [
+        result for result in
+        (process_image(img, url, i, len(imgs)) for i, img in enumerate(imgs))
+        if result is not None
+    ]
+
    process_element(body)

    def flatten_nested_elements(node):
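
Two building blocks of the image scoring in `utils.py` are easy to try in isolation: splitting a `height` or `width` attribute into a number and a unit, and resolving a relative `src` against the page URL. The sketch below reuses the regular expression from the diff; the standalone function and the example URLs are illustrative only.

```python
import re
from urllib.parse import urljoin


def parse_dimension(dimension):
    """Split an HTML dimension such as '150px' into (150, 'px')."""
    if dimension:
        match = re.match(r"(\d+)(\D*)", dimension)
        if match:
            # A bare number like '300' is treated as pixels
            return int(match.group(1)), match.group(2) or "px"
    return None, None


print(parse_dimension("150px"))
print(parse_dimension("50%"))
print(parse_dimension("300"))
print(parse_dimension(None))

# Relative paths are resolved against the page; absolute URLs pass through unchanged
print(urljoin("https://example.com/blog/post", "images/a.png"))
print(urljoin("https://example.com/blog/post", "https://cdn.example.net/a.png"))
```

Values that do not start with a digit, such as `auto`, yield `(None, None)` and therefore earn no points.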
--- a/docs/md/changelog.md
+++ b/docs/md/changelog.md
@@ -1,5 +1,15 @@
 # Changelog

+## [v0.2.75] - 2024-07-19
+
+Minor improvements for a more maintainable codebase:
+
+- 🔄 Fixed a typo in `chunking_strategy.py` (`nl.toknize` → `nl.tokenize`) and converted screenshots to RGB before JPEG encoding in `crawler_strategy.py`
+- 🔄 Added the `.tests/` and `.test_pads/` directories to `.gitignore` to keep our repository clean and organized
+
+These changes may seem small, but they contribute to a more stable and sustainable codebase. By fixing typos and updating our `.gitignore` settings, we're ensuring that our code is easier to maintain and scale in the long run.
+
+
 ## v0.2.74 - 2024-07-08
 A slew of exciting updates to improve the crawler's stability and robustness! 🎉

--- a/docs/md/demo.md
+++ b/docs/md/demo.md
@@ -14,6 +14,7 @@
            <div class="form-group">
                <button class="btn btn-default" type="submit">Submit</button>
            </div>
+
        </fieldset>
    </form>

@@ -93,6 +94,10 @@
        </div>
    </section>

+    <div id="error" class="error-message" style="display: none; margin-top:1em;">
+        <div class="terminal-alert terminal-alert-error"></div>
+    </div>
+
    <script>
        function showTab(tabId) {
            const tabs = document.querySelectorAll('.tab-content');
@@ -162,7 +167,17 @@
                },
                body: JSON.stringify(data)
            })
-            .then(response => response.json())
+            .then(response => {
+                if (!response.ok) {
+                    if (response.status === 429) {
+                        return response.json().then(err => { 
+                            throw Object.assign(new Error('Rate limit exceeded'), { status: 429, details: err });
+                        });
+                    }
+                    throw new Error('Network response was not ok');
+                }
+                return response.json();
+            })
            .then(data => {
                data = data.results[0]; // Only one URL is requested
                document.getElementById('loading').style.display = 'none';
@@ -187,11 +202,29 @@ result = crawler.run(
 print(result)
                `;
                redo(document.getElementById('pythonCode'), pythonCode);
+                document.getElementById('error').style.display = 'none';
            })
            .catch(error => {
                document.getElementById('loading').style.display = 'none';
-                document.getElementById('response').style.display = 'block';
-                document.getElementById('markdownContent').textContent = 'Error: ' + error;
+                document.getElementById('error').style.display = 'block';
+                let errorMessage = 'An unexpected error occurred. Please try again later.';
+                
+                if (error.status === 429) {
+                    const details = error.details;
+                    if (details.retry_after) {
+                        errorMessage = `Rate limit exceeded. Please wait ${parseFloat(details.retry_after).toFixed(1)} seconds before trying again.`;
+                    } else if (details.reset_at) {
+                        const resetTime = new Date(details.reset_at);
+                        const waitTime = Math.ceil((resetTime - new Date()) / 1000);
+                        errorMessage = `Rate limit exceeded. Please try again after ${waitTime} seconds.`;
+                    } else {
+                        errorMessage = `Rate limit exceeded. Please try again later.`;
+                    }
+                } else if (error.message) {
+                    errorMessage = error.message;
+                }
+                
+                document.querySelector('#error .terminal-alert').textContent = errorMessage;
            });
        });
    </script>
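
The 429 handling in the demo form maps `retry_after` and `reset_at` from the response body to a user-facing message. For a script that calls the API instead of using the form, the same mapping might look like the following Python sketch. The function name `rate_limit_message` is hypothetical, and it assumes `reset_at` is a Unix timestamp in seconds, as produced by `time.time()` in `main.py`.

```python
import math
import time


def rate_limit_message(details, now=None):
    """Turn the JSON body of a 429 response into a readable message."""
    now = time.time() if now is None else now
    if details.get("retry_after"):
        wait = float(details["retry_after"])
        return f"Rate limit exceeded. Please wait {wait:.1f} seconds before trying again."
    if details.get("reset_at"):
        # Assumption: reset_at is a Unix timestamp in seconds
        wait = max(0, math.ceil(details["reset_at"] - now))
        return f"Rate limit exceeded. Please try again after {wait} seconds."
    return "Rate limit exceeded. Please try again later."


print(rate_limit_message({"retry_after": 7.3}))
print(rate_limit_message({"reset_at": 1000.0}, now=990.5))
print(rate_limit_message({}))
```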
--- a/docs/md/index.md
+++ b/docs/md/index.md
@@ -1,4 +1,4 @@
-# Crawl4AI v0.2.74
+# Crawl4AI v0.2.75

 Welcome to the official documentation for Crawl4AI! 🕷️🤖 Crawl4AI is an open-source Python library designed to simplify web crawling and extract useful information from web pages. This documentation will guide you through the features, usage, and customization of Crawl4AI.

--- a/main.py
+++ b/main.py
@@ -22,6 +22,15 @@ from typing import List, Optional
 from crawl4ai.web_crawler import WebCrawler
 from crawl4ai.database import get_total_count, clear_db

+import time
+from slowapi import Limiter, _rate_limit_exceeded_handler
+from slowapi.util import get_remote_address
+from slowapi.errors import RateLimitExceeded
+
+# load .env file
+from dotenv import load_dotenv
+load_dotenv()
+
 # Configuration
 __location__ = os.path.realpath(os.path.join(os.getcwd(), os.path.dirname(__file__)))
 MAX_CONCURRENT_REQUESTS = 10  # Adjust this to change the maximum concurrent requests
@@ -30,6 +39,78 @@ lock = asyncio.Lock()

 app = FastAPI()

+# Initialize rate limiter
+def rate_limit_key_func(request: Request):
+    access_token = request.headers.get("access-token")
+    if access_token and access_token == os.environ.get('ACCESS_TOKEN'):
+        return None
+    return get_remote_address(request)
+
+limiter = Limiter(key_func=rate_limit_key_func)
+app.state.limiter = limiter
+
+# Dictionary to store last request times for each client
+last_request_times = {}
+last_rate_limit = {}
+
+
+def get_rate_limit():
+    limit = os.environ.get('ACCESS_PER_MIN', "5")
+    return f"{limit}/minute"
+
+# Custom rate limit exceeded handler
+async def custom_rate_limit_exceeded_handler(request: Request, exc: RateLimitExceeded) -> JSONResponse:
+    if request.client.host not in last_rate_limit or time.time() - last_rate_limit[request.client.host] > 60:
+        last_rate_limit[request.client.host] = time.time()
+    retry_after = 60 - (time.time() - last_rate_limit[request.client.host])
+    reset_at = time.time() + retry_after
+    return JSONResponse(
+        status_code=429,
+        content={
+            "detail": "Rate limit exceeded",
+            "limit": str(exc.limit.limit),
+            "retry_after": retry_after,
+            'reset_at': reset_at,
+            "message": f"You have exceeded the rate limit of {exc.limit.limit}."
+        }
+    )
+    
+app.add_exception_handler(RateLimitExceeded, custom_rate_limit_exceeded_handler)
+
+
+# Middleware for token-based bypass and per-request limit
+class RateLimitMiddleware(BaseHTTPMiddleware):
+    async def dispatch(self, request: Request, call_next):
+        SPAN = int(os.environ.get('ACCESS_TIME_SPAN', 10))
+        access_token = request.headers.get("access-token")
+        if access_token and access_token == os.environ.get('ACCESS_TOKEN'):
+            return await call_next(request)
+        
+        path = request.url.path
+        if path in ["/crawl", "/old"]:
+            client_ip = request.client.host
+            current_time = time.time()
+            
+            # Check time since last request
+            if client_ip in last_request_times:
+                time_since_last_request = current_time - last_request_times[client_ip]
+                if time_since_last_request < SPAN:
+                    return JSONResponse(
+                        status_code=429,
+                        content={
+                            "detail": "Too many requests",
+                            "message": f"Rate limit exceeded. Please wait {SPAN} seconds between requests.",
+                            "retry_after": max(0, SPAN - time_since_last_request),
+                            "reset_at": current_time + max(0, SPAN - time_since_last_request),
+                        }
+                    )
+            
+            last_request_times[client_ip] = current_time
+
+        return await call_next(request)
+
+app.add_middleware(RateLimitMiddleware)
+
 # CORS configuration
 origins = ["*"]  # Allow all origins
 app.add_middleware(
@@ -73,6 +154,7 @@ def read_root():
    return RedirectResponse(url="/mkdocs")

 @app.get("/old", response_class=HTMLResponse)
+@limiter.limit(get_rate_limit())
 async def read_index(request: Request):
    partials_dir = os.path.join(__location__, "pages", "partial")
    partials = {}
@@ -107,6 +189,7 @@ def import_strategy(module_name: str, class_name: str, *args, **kwargs):
        raise HTTPException(status_code=400, detail=f"Class {class_name} not found in {module_name}.")

 @app.post("/crawl")
+@limiter.limit(get_rate_limit())
 async def crawl_urls(crawl_request: CrawlRequest, request: Request):
    logging.debug(f"[LOG] Crawl request for URL: {crawl_request.urls}")
    global current_requests
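
The per-client spacing check in `RateLimitMiddleware` can be exercised without FastAPI. The sketch below is a simplified stand-in, not the middleware itself: the class `MinIntervalLimiter` and its injectable `clock` are hypothetical names introduced so the logic can be tested with fake time.

```python
import time


class MinIntervalLimiter:
    """Allow one request per client every `span` seconds (hypothetical helper)."""

    def __init__(self, span=10.0, clock=time.time):
        self.span = span
        self.clock = clock
        self.last_seen = {}

    def check(self, client_id):
        """Return 0.0 if the request is allowed, else the seconds left to wait."""
        now = self.clock()
        last = self.last_seen.get(client_id)
        if last is not None and now - last < self.span:
            # Rejected requests do not reset the window, as in the middleware
            return self.span - (now - last)
        self.last_seen[client_id] = now
        return 0.0


fake_now = [100.0]
limiter = MinIntervalLimiter(span=10.0, clock=lambda: fake_now[0])
print(limiter.check("1.2.3.4"))  # first request is allowed
fake_now[0] = 104.0
print(limiter.check("1.2.3.4"))  # too soon, 6 seconds left
fake_now[0] = 111.0
print(limiter.check("1.2.3.4"))  # allowed again
```

Like the `last_request_times` dictionary in the diff, this keeps state in process memory, so it would not be shared across multiple worker processes.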
--- a/middlewares.py
+++ b/middlewares.py
--- a/setup.py
+++ b/setup.py
@@ -1,18 +1,18 @@
 from setuptools import setup, find_packages
 import os
 from pathlib import Path
-import subprocess
-from setuptools.command.install import install
+import shutil

 # Create the .crawl4ai folder in the user's home directory if it doesn't exist
 # If the folder already exists, remove the cache folder
-crawl4ai_folder = os.path.join(Path.home(), ".crawl4ai")
-if os.path.exists(f"{crawl4ai_folder}/cache"):
-    subprocess.run(["rm", "-rf", f"{crawl4ai_folder}/cache"])
-os.makedirs(crawl4ai_folder, exist_ok=True)
-os.makedirs(f"{crawl4ai_folder}/cache", exist_ok=True)
+crawl4ai_folder = Path.home() / ".crawl4ai"
+cache_folder = crawl4ai_folder / "cache"

+if cache_folder.exists():
+    shutil.rmtree(cache_folder)

+crawl4ai_folder.mkdir(exist_ok=True)
+cache_folder.mkdir(exist_ok=True)

 # Read the requirements from requirements.txt
 with open("requirements.txt") as f:
Author	SHA1	Message	Date
Aravind Karnam	cf6c835e18	moved score threshold to config.py & replaced the separator for tag.get_text in find_closest_parent_with_useful_text fn from period(.) to space( ) to keep the text more neutral.	2024-07-21 15:18:23 +05:30
Aravind Karnam	e5ecf291f3	Implemented filtering for images and grabbing the contextual text from nearest parent	2024-07-21 15:03:17 +05:30
Aravind Karnam	9d0cafcfa6	fixed import error in model_loader.py	2024-07-21 14:55:58 +05:30
unclecode	7715623430	chore: Fix typos and update .gitignore These changes fix typos in `chunking_strategy.py` and `crawler_strategy.py` to improve code readability. Additionally, the `.test_pads/` directory is removed from the `.gitignore` file to keep the repository clean and organized.	2024-07-19 17:42:39 +08:00
unclecode	f5a4e80e2c	chore: Fix typo in chunking_strategy.py and crawler_strategy.py The commit fixes a typo in the `chunking_strategy.py` file where `nl.toknize.TextTilingTokenizer()` was corrected to `nl.tokenize.TextTilingTokenizer()`. Additionally, in the `crawler_strategy.py` file, the commit converts the screenshot image to RGB mode before saving it as a JPEG. This ensures consistent image quality and compression.	2024-07-19 17:40:31 +08:00
unclecode	8463aabedf	chore: Remove .test_pads/ directory from .gitignore	2024-07-19 17:09:29 +08:00
unclecode	7f30144ef2	chore: Remove .tests/ directory from .gitignore	2024-07-09 15:10:18 +08:00
unclecode	fa5516aad6	chore: Refactor setup.py to use pathlib and shutil for folder creation and removal, to remove cache folder in cross platform manner.	2024-07-09 13:25:00 +08:00
unclecode	ca0336af9e	feat: Add error handling for rate limit exceeded in form submission This commit adds error handling for rate limit exceeded in the form submission process. If the server returns a 429 status code, the client will display an error message indicating the rate limit has been exceeded and provide information on when the user can try again. This improves the user experience by providing clear feedback and guidance when rate limits are reached.	2024-07-08 20:24:00 +08:00
unclecode	65ed1aeade	feat: Add rate limiting functionality with custom handlers	2024-07-08 20:02:12 +08:00
unclecode	4d283ab386	## [v0.2.74] - 2024-07-08 A slew of exciting updates to improve the crawler's stability and robustness! 🎉 - 💻 UTF encoding fix: Resolved the Windows "charmap" error by adding UTF encoding. - 🛡️ Error handling: Implemented MaxRetryError exception handling in LocalSeleniumCrawlerStrategy. - 🧹 Input sanitization: Improved input sanitization and handled encoding issues in LLMExtractionStrategy. - 🚮 Database cleanup: Removed existing database file and initialized a new one.	2024-07-08 16:33:25 +08:00