moved score threshold to config.py & replaced the separator for tag.get_text in find_closest_parent_with_useful_text fn from period(.) to space( ) to keep the text more neutral.

Implemented filtering for images and grabbing the contextual text from nearest parent
fixed import error in model_loader.py
2024-07-21 15:18:23 +05:30 · 2024-07-21 15:03:17 +05:30 · 2024-07-21 14:55:58 +05:30 · 2024-07-19 17:42:39 +08:00 · 2024-07-19 17:40:31 +08:00
9 changed files with 147 additions and 10 deletions
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -1,5 +1,14 @@
 # Changelog
 ## [v0.2.75] - 2024-07-19
 Minor improvements for a more maintainable codebase:
 - 🔄 Fixed typos in `chunking_strategy.py` and `crawler_strategy.py` to improve code readability
 - 🔄 Removed `.test_pads/` directory from `.gitignore` to keep our repository clean and organized
 These changes may seem small, but they contribute to a more stable and sustainable codebase. By fixing typos and updating our `.gitignore` settings, we're ensuring that our code is easier to maintain and scale in the long run.
 ## [v0.2.74] - 2024-07-08
 A slew of exciting updates to improve the crawler's stability and robustness! 🎉
--- a/README.md
+++ b/README.md
@@ -1,4 +1,4 @@
-# Crawl4AI v0.2.74 🕷️🤖
+# Crawl4AI v0.2.75 🕷️🤖
 [![GitHub Stars](https://img.shields.io/github/stars/unclecode/crawl4ai?style=social)](https://github.com/unclecode/crawl4ai/stargazers)
 [![GitHub Forks](https://img.shields.io/github/forks/unclecode/crawl4ai?style=social)](https://github.com/unclecode/crawl4ai/network/members)
--- a/crawl4ai/chunking_strategy.py
+++ b/crawl4ai/chunking_strategy.py
@@ -55,7 +55,7 @@ class TopicSegmentationChunking(ChunkingStrategy):
    def __init__(self, num_keywords=3, **kwargs):
        import nltk as nl
-        self.tokenizer = nl.toknize.TextTilingTokenizer()
+        self.tokenizer = nl.tokenize.TextTilingTokenizer()
        self.num_keywords = num_keywords
    def chunk(self, text: str) -> list:
--- a/crawl4ai/config.py
+++ b/crawl4ai/config.py
@@ -27,3 +27,13 @@ WORD_TOKEN_RATE = 1.3
 # Threshold for the minimum number of word in a HTML tag to be considered 
 MIN_WORD_THRESHOLD = 1
 # Threshold for the Image extraction - Range is 1 to 6
 # Images are scored based on point based system, to filter based on usefulness. Points are assigned
 # to each image based on the following aspects.
 # If either height or width exceeds 150px
 # If image size is greater than 10Kb
 # If alt property is set
 # If image format is in jpg, png or webp
 # If image is in the first half of the total images extracted from the page
 IMAGE_SCORE_THRESHOLD = 2
--- a/crawl4ai/crawler_strategy.py
+++ b/crawl4ai/crawler_strategy.py
@@ -292,15 +292,22 @@ class LocalSeleniumCrawlerStrategy(CrawlerStrategy):
            # Open the screenshot with PIL
            image = Image.open(BytesIO(screenshot))
            # Convert image to RGB mode
            rgb_image = image.convert('RGB')
            # Convert to JPEG and compress
            buffered = BytesIO()
-            image.save(buffered, format="JPEG", quality=85)
+            rgb_image.save(buffered, format="JPEG", quality=85)
            img_base64 = base64.b64encode(buffered.getvalue()).decode('utf-8')
            if self.verbose:
                print(f"[LOG] 📸 Screenshot taken and converted to base64")
            return img_base64
        except Exception as e:
            if self.verbose:
                print(f"[ERROR] Failed to take screenshot: {str(e)}")
            return ""
        except Exception as e:
            error_message = sanitize_input_encode(f"Failed to take screenshot: {str(e)}")
--- a/crawl4ai/model_loader.py
+++ b/crawl4ai/model_loader.py
@@ -3,7 +3,7 @@ from pathlib import Path
 import subprocess, os
 import shutil
 import tarfile
-from crawl4ai.config import MODEL_REPO_BRANCH
+from .model_loader import *
 import argparse
 import urllib.request
 __location__ = os.path.realpath(os.path.join(os.getcwd(), os.path.dirname(__file__)))
--- a/crawl4ai/utils.py
+++ b/crawl4ai/utils.py
@@ -11,6 +11,9 @@ from .prompts import PROMPT_EXTRACT_BLOCKS
 from .config import *
 from pathlib import Path
 from typing import Dict, Any
 from urllib.parse import urljoin
 import requests
 from requests.exceptions import InvalidSchema
 class InvalidCSSSelectorError(Exception):
    pass
@@ -447,6 +450,101 @@ def get_content_of_website_optimized(url: str, html: str, word_count_threshold:
    links = {'internal': [], 'external': []}
    media = {'images': [], 'videos': [], 'audios': []}
    def process_image(img, url, index, total_images):
            #Check if an image has valid display and inside undesired html elements
            def is_valid_image(img, parent, parent_classes):
                style = img.get('style', '')
                src = img.get('src', '')
                classes_to_check = ['button', 'icon', 'logo']
                tags_to_check = ['button', 'input']
                return all([
                    'display:none' not in style,
                    src,
                    not any(s in var for var in [src, img.get('alt', ''), *parent_classes] for s in classes_to_check),
                    parent.name not in tags_to_check
                ])
            #Score an image for it's usefulness
            def score_image_for_usefulness(img, base_url, index, images_count):
                # Function to parse image height/width value and units
                def parse_dimension(dimension):
                    if dimension:
                        match = re.match(r"(\d+)(\D*)", dimension)
                        if match:
                            number = int(match.group(1))
                            unit = match.group(2) or 'px'  # Default unit is 'px' if not specified
                            return number, unit
                    return None, None
                # Fetch image file metadata to extract size and extension
                def fetch_image_file_size(img, base_url):
                    #If src is relative path construct full URL, if not it may be CDN URL
                    img_url = urljoin(base_url,img.get('src'))
                    try:
                        response = requests.head(img_url)
                        if response.status_code == 200:
                            return response.headers.get('Content-Length',None)
                        else:
                            print(f"Failed to retrieve file size for {img_url}")
                            return None
                    except InvalidSchema as e:
                        return None
                    finally:
                        return
                image_height = img.get('height')
                height_value, height_unit = parse_dimension(image_height)
                image_width =  img.get('width')
                width_value, width_unit = parse_dimension(image_width)
                image_size = int(fetch_image_file_size(img,base_url) or 0)
                image_format = os.path.splitext(img.get('src',''))[1].lower()
                score = 0
                if height_value:
                    if height_unit == 'px' and height_value > 150:
                        score += 1
                    if height_unit in ['%','vh','vmin','vmax'] and height_value >30:
                        score += 1
                if width_value:
                    if width_unit == 'px' and width_value > 150:
                        score += 1
                    if width_unit in ['%','vh','vmin','vmax'] and width_value >30:
                        score += 1
                if image_size > 10000:
                    score += 1
                if img.get('alt') != '':
                    score+=1
                if any(image_format==format for format in ['jpg','png','webp']):
                    score+=1
                if index/images_count<0.5:
                    score+=1
                return score
            # Extract meaningful text for images from closest parent
            def find_closest_parent_with_useful_text(tag):
                current_tag = tag
                while current_tag:
                    current_tag = current_tag.parent
                    # Get the text content of the parent tag
                    if current_tag:
                        text_content = current_tag.get_text(separator=' ',strip=True)
                        # Check if the text content has at least word_count_threshold
                        if len(text_content.split()) >= word_count_threshold:
                            return text_content
                return None
            if not is_valid_image(img, img.parent, img.parent.get('class', [])):
                return None
            score = score_image_for_usefulness(img, url, index, total_images)
            if score <= IMAGE_SCORE_THRESHOLD:
                return None
            return {
                'src': img.get('src', ''),
                'alt': img.get('alt', ''),
                'desc': find_closest_parent_with_useful_text(img),
                'score': score,
                'type': 'image'
            }
    def process_element(element: element.PageElement) -> bool:
        try:
            if isinstance(element, NavigableString):
@@ -471,11 +569,6 @@ def get_content_of_website_optimized(url: str, html: str, word_count_threshold:
                keep_element = True
            elif element.name == 'img':
                media['images'].append({
                    'src': element.get('src'),
                    'alt': element.get('alt'),
                    'type': 'image'
                })
                return True  # Always keep image elements
            elif element.name in ['video', 'audio']:
@@ -518,6 +611,14 @@ def get_content_of_website_optimized(url: str, html: str, word_count_threshold:
            print('Error processing element:', str(e))
            return False
    #process images by filtering and extracting contextual text from the page
    imgs = body.find_all('img')
    media['images'] = [
        result for result in
        (process_image(img, url, i, len(imgs)) for i, img in enumerate(imgs))
        if result is not None
    ]
    process_element(body)
    def flatten_nested_elements(node):
--- a/docs/md/changelog.md
+++ b/docs/md/changelog.md
@@ -1,5 +1,15 @@
 # Changelog
 ## [v0.2.75] - 2024-07-19
 Minor improvements for a more maintainable codebase:
 - 🔄 Fixed typos in `chunking_strategy.py` and `crawler_strategy.py` to improve code readability
 - 🔄 Removed `.test_pads/` directory from `.gitignore` to keep our repository clean and organized
 These changes may seem small, but they contribute to a more stable and sustainable codebase. By fixing typos and updating our `.gitignore` settings, we're ensuring that our code is easier to maintain and scale in the long run.
 ## v0.2.74 - 2024-07-08
 A slew of exciting updates to improve the crawler's stability and robustness! 🎉
--- a/docs/md/index.md
+++ b/docs/md/index.md
@@ -1,4 +1,4 @@
-# Crawl4AI v0.2.74
+# Crawl4AI v0.2.75
 Welcome to the official documentation for Crawl4AI! 🕷️🤖 Crawl4AI is an open-source Python library designed to simplify web crawling and extract useful information from web pages. This documentation will guide you through the features, usage, and customization of Crawl4AI.
Author	SHA1	Message	Date
Aravind Karnam	cf6c835e18	moved score threshold to config.py & replaced the separator for tag.get_text in find_closest_parent_with_useful_text fn from period(.) to space( ) to keep the text more neutral.	2024-07-21 15:18:23 +05:30
Aravind Karnam	e5ecf291f3	Implemented filtering for images and grabbing the contextual text from nearest parent	2024-07-21 15:03:17 +05:30
Aravind Karnam	9d0cafcfa6	fixed import error in model_loader.py	2024-07-21 14:55:58 +05:30
unclecode	7715623430	chore: Fix typos and update .gitignore These changes fix typos in `chunking_strategy.py` and `crawler_strategy.py` to improve code readability. Additionally, the `.test_pads/` directory is removed from the `.gitignore` file to keep the repository clean and organized.	2024-07-19 17:42:39 +08:00
unclecode	f5a4e80e2c	chore: Fix typo in chunking_strategy.py and crawler_strategy.py The commit fixes a typo in the `chunking_strategy.py` file where `nl.toknize.TextTilingTokenizer()` was corrected to `nl.tokenize.TextTilingTokenizer()`. Additionally, in the `crawler_strategy.py` file, the commit converts the screenshot image to RGB mode before saving it as a JPEG. This ensures consistent image quality and compression.	2024-07-19 17:40:31 +08:00
`@@ -1,4 +1,4 @@`
	`# Crawl4AI v0.2.74`	`# Crawl4AI v0.2.75`

	`Welcome to the official documentation for Crawl4AI! 🕷️🤖 Crawl4AI is an open-source Python library designed to simplify web crawling and extract useful information from web pages. This documentation will guide you through the features, usage, and customization of Crawl4AI.`	`Welcome to the official documentation for Crawl4AI! 🕷️🤖 Crawl4AI is an open-source Python library designed to simplify web crawling and extract useful information from web pages. This documentation will guide you through the features, usage, and customization of Crawl4AI.`