chore: Update version to 0.2.74 in setup.py

Prepare branch for release 0.2.74
Add UTF encoding to resolve the Windows machine "charmap" error.
2024-07-08 16:30:28 +08:00 · 2024-07-08 16:30:14 +08:00 · 2024-07-08 16:18:07 +08:00 · 2024-07-08 15:59:59 +08:00 · 2024-07-06 14:28:01 +08:00 · 2024-07-06 14:08:30 +08:00
11 changed files with 14 additions and 158 deletions
--- a/.gitignore
+++ b/.gitignore
@@ -165,8 +165,6 @@ Crawl4AI.egg-info/
 Crawl4AI.egg-info/*
 crawler_data.db
 .vscode/
 .tests/
 .test_pads/
 test_pad.py
 test_pad*.py
 .data/
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -1,14 +1,5 @@
 # Changelog
-## [v0.2.75] - 2024-07-19
-Minor improvements for a more maintainable codebase:
-- 🔄 Fixed typos in `chunking_strategy.py` and `crawler_strategy.py` to improve code readability
-- 🔄 Removed `.test_pads/` directory from `.gitignore` to keep our repository clean and organized
-These changes may seem small, but they contribute to a more stable and sustainable codebase. By fixing typos and updating our `.gitignore` settings, we're ensuring that our code is easier to maintain and scale in the long run.
 ## [v0.2.74] - 2024-07-08
 A slew of exciting updates to improve the crawler's stability and robustness! 🎉
--- a/README.md
+++ b/README.md
@@ -1,4 +1,4 @@
-# Crawl4AI v0.2.75 🕷️🤖
+# Crawl4AI v0.2.74 🕷️🤖
 [![GitHub Stars](https://img.shields.io/github/stars/unclecode/crawl4ai?style=social)](https://github.com/unclecode/crawl4ai/stargazers)
 [![GitHub Forks](https://img.shields.io/github/forks/unclecode/crawl4ai?style=social)](https://github.com/unclecode/crawl4ai/network/members)
--- a/crawl4ai/chunking_strategy.py
+++ b/crawl4ai/chunking_strategy.py
@@ -55,7 +55,7 @@ class TopicSegmentationChunking(ChunkingStrategy):
    def __init__(self, num_keywords=3, **kwargs):
        import nltk as nl
-        self.tokenizer = nl.tokenize.TextTilingTokenizer()
+        self.tokenizer = nl.toknize.TextTilingTokenizer()
        self.num_keywords = num_keywords
    def chunk(self, text: str) -> list:
--- a/crawl4ai/crawler_strategy.py
+++ b/crawl4ai/crawler_strategy.py
@@ -292,22 +292,15 @@ class LocalSeleniumCrawlerStrategy(CrawlerStrategy):
            # Open the screenshot with PIL
            image = Image.open(BytesIO(screenshot))
            # Convert image to RGB mode
            rgb_image = image.convert('RGB')
            # Convert to JPEG and compress
            buffered = BytesIO()
-            rgb_image.save(buffered, format="JPEG", quality=85)
+            image.save(buffered, format="JPEG", quality=85)
            img_base64 = base64.b64encode(buffered.getvalue()).decode('utf-8')
            if self.verbose:
                print(f"[LOG] 📸 Screenshot taken and converted to base64")
            return img_base64
        except Exception as e:
            if self.verbose:
                print(f"[ERROR] Failed to take screenshot: {str(e)}")
            return ""
        except Exception as e:
            error_message = sanitize_input_encode(f"Failed to take screenshot: {str(e)}")
--- a/docs/md/changelog.md
+++ b/docs/md/changelog.md
@@ -1,15 +1,5 @@
 # Changelog
-## [v0.2.75] - 2024-07-19
-Minor improvements for a more maintainable codebase:
-- 🔄 Fixed typos in `chunking_strategy.py` and `crawler_strategy.py` to improve code readability
-- 🔄 Removed `.test_pads/` directory from `.gitignore` to keep our repository clean and organized
-These changes may seem small, but they contribute to a more stable and sustainable codebase. By fixing typos and updating our `.gitignore` settings, we're ensuring that our code is easier to maintain and scale in the long run.
 ## v0.2.74 - 2024-07-08
 A slew of exciting updates to improve the crawler's stability and robustness! 🎉
--- a/docs/md/demo.md
+++ b/docs/md/demo.md
@@ -14,7 +14,6 @@
            <div class="form-group">
                <button class="btn btn-default" type="submit">Submit</button>
            </div>
        </fieldset>
    </form>
@@ -94,10 +93,6 @@
        </div>
    </section>
-    <div id="error" class="error-message" style="display: none; margin-top:1em;">
-        <div class="terminal-alert terminal-alert-error"></div>
-    </div>
    <script>
        function showTab(tabId) {
            const tabs = document.querySelectorAll('.tab-content');
@@ -167,17 +162,7 @@
                },
                body: JSON.stringify(data)
            })
-            .then(response => {
+            .then(response => response.json())
-                if (!response.ok) {
-                    if (response.status === 429) {
-                        return response.json().then(err => { 
-                            throw Object.assign(new Error('Rate limit exceeded'), { status: 429, details: err });
-                        });
-                    }
-                    throw new Error('Network response was not ok');
-                }
-                return response.json();
-            })
            .then(data => {
                data = data.results[0]; // Only one URL is requested
                document.getElementById('loading').style.display = 'none';
@@ -202,29 +187,11 @@ result = crawler.run(
 print(result)
                `;
                redo(document.getElementById('pythonCode'), pythonCode);
                document.getElementById('error').style.display = 'none';
            })
            .catch(error => {
                document.getElementById('loading').style.display = 'none';
-                document.getElementById('error').style.display = 'block';
+                document.getElementById('response').style.display = 'block';
-                let errorMessage = 'An unexpected error occurred. Please try again later.';
+                document.getElementById('markdownContent').textContent = 'Error: ' + error;
-                if (error.status === 429) {
-                    const details = error.details;
-                    if (details.retry_after) {
-                        errorMessage = `Rate limit exceeded. Please wait ${parseFloat(details.retry_after).toFixed(1)} seconds before trying again.`;
-                    } else if (details.reset_at) {
-                        const resetTime = new Date(details.reset_at);
-                        const waitTime = Math.ceil((resetTime - new Date()) / 1000);
-                        errorMessage = `Rate limit exceeded. Please try again after ${waitTime} seconds.`;
-                    } else {
-                        errorMessage = `Rate limit exceeded. Please try again later.`;
-                    }
-                } else if (error.message) {
-                    errorMessage = error.message;
-                }
-                document.querySelector('#error .terminal-alert').textContent = errorMessage;
            });
        });
    </script>
--- a/docs/md/index.md
+++ b/docs/md/index.md
@@ -1,4 +1,4 @@
-# Crawl4AI v0.2.75
+# Crawl4AI v0.2.74
 Welcome to the official documentation for Crawl4AI! 🕷️🤖 Crawl4AI is an open-source Python library designed to simplify web crawling and extract useful information from web pages. This documentation will guide you through the features, usage, and customization of Crawl4AI.
--- a/main.py
+++ b/main.py
@@ -22,15 +22,6 @@ from typing import List, Optional
 from crawl4ai.web_crawler import WebCrawler
 from crawl4ai.database import get_total_count, clear_db
 import time
-from slowapi import Limiter, _rate_limit_exceeded_handler
-from slowapi.util import get_remote_address
-from slowapi.errors import RateLimitExceeded
 # load .env file
 from dotenv import load_dotenv
 load_dotenv()
 # Configuration
 __location__ = os.path.realpath(os.path.join(os.getcwd(), os.path.dirname(__file__)))
 MAX_CONCURRENT_REQUESTS = 10  # Adjust this to change the maximum concurrent requests
@@ -39,78 +30,6 @@ lock = asyncio.Lock()
 app = FastAPI()
-# Initialize rate limiter
-def rate_limit_key_func(request: Request):
-    access_token = request.headers.get("access-token")
-    if access_token == os.environ.get('ACCESS_TOKEN'):
-        return None
-    return get_remote_address(request)
-limiter = Limiter(key_func=rate_limit_key_func)
-app.state.limiter = limiter
-# Dictionary to store last request times for each client
-last_request_times = {}
-last_rate_limit = {}
-def get_rate_limit():
-    limit = os.environ.get('ACCESS_PER_MIN', "5")
-    return f"{limit}/minute"
-# Custom rate limit exceeded handler
-async def custom_rate_limit_exceeded_handler(request: Request, exc: RateLimitExceeded) -> JSONResponse:
-    if request.client.host not in last_rate_limit or time.time() - last_rate_limit[request.client.host] > 60:
-        last_rate_limit[request.client.host] = time.time()
-    retry_after = 60 - (time.time() - last_rate_limit[request.client.host])
-    reset_at = time.time() + retry_after
-    return JSONResponse(
-        status_code=429,
-        content={
-            "detail": "Rate limit exceeded",
-            "limit": str(exc.limit.limit),
-            "retry_after": retry_after,
-            'reset_at': reset_at,
-            "message": f"You have exceeded the rate limit of {exc.limit.limit}."
-        }
-    )
-app.add_exception_handler(RateLimitExceeded, custom_rate_limit_exceeded_handler)
-# Middleware for token-based bypass and per-request limit
-class RateLimitMiddleware(BaseHTTPMiddleware):
-    async def dispatch(self, request: Request, call_next):
-        SPAN = int(os.environ.get('ACCESS_TIME_SPAN', 10))
-        access_token = request.headers.get("access-token")
-        if access_token == os.environ.get('ACCESS_TOKEN'):
-            return await call_next(request)
-        path = request.url.path
-        if path in ["/crawl", "/old"]:
-            client_ip = request.client.host
-            current_time = time.time()
-            # Check time since last request
-            if client_ip in last_request_times:
-                time_since_last_request = current_time - last_request_times[client_ip]
-                if time_since_last_request < SPAN:
-                    return JSONResponse(
-                        status_code=429,
-                        content={
-                            "detail": "Too many requests",
-                            "message": "Rate limit exceeded. Please wait 10 seconds between requests.",
-                            "retry_after": max(0, SPAN - time_since_last_request),
-                            "reset_at": current_time + max(0, SPAN - time_since_last_request),
-                        }
-                    )
-            last_request_times[client_ip] = current_time
-        return await call_next(request)
-app.add_middleware(RateLimitMiddleware)
 # CORS configuration
 origins = ["*"]  # Allow all origins
 app.add_middleware(
@@ -154,7 +73,6 @@ def read_root():
    return RedirectResponse(url="/mkdocs")
@app.get("/old", response_class=HTMLResponse)
-@limiter.limit(get_rate_limit())
 async def read_index(request: Request):
    partials_dir = os.path.join(__location__, "pages", "partial")
    partials = {}
@@ -189,7 +107,6 @@ def import_strategy(module_name: str, class_name: str, *args, **kwargs):
        raise HTTPException(status_code=400, detail=f"Class {class_name} not found in {module_name}.")
@app.post("/crawl")
-@limiter.limit(get_rate_limit())
 async def crawl_urls(crawl_request: CrawlRequest, request: Request):
    logging.debug(f"[LOG] Crawl request for URL: {crawl_request.urls}")
    global current_requests
--- a/middlewares.py
+++ b/middlewares.py
--- a/setup.py
+++ b/setup.py
@@ -1,18 +1,18 @@
 from setuptools import setup, find_packages
 import os
 from pathlib import Path
-import shutil
+import subprocess
 from setuptools.command.install import install
 # Create the .crawl4ai folder in the user's home directory if it doesn't exist
 # If the folder already exists, remove the cache folder
-crawl4ai_folder = Path.home() / ".crawl4ai"
+crawl4ai_folder = os.path.join(Path.home(), ".crawl4ai")
-cache_folder = crawl4ai_folder / "cache"
+if os.path.exists(f"{crawl4ai_folder}/cache"):
+    subprocess.run(["rm", "-rf", f"{crawl4ai_folder}/cache"])
+os.makedirs(crawl4ai_folder, exist_ok=True)
+os.makedirs(f"{crawl4ai_folder}/cache", exist_ok=True)
-if cache_folder.exists():
-    shutil.rmtree(cache_folder)
-crawl4ai_folder.mkdir(exist_ok=True)
-cache_folder.mkdir(exist_ok=True)
 # Read the requirements from requirements.txt
 with open("requirements.txt") as f:
Author	SHA1	Message	Date
unclecode	2101540819	chore: Update version to 0.2.74 in setup.py	2024-07-08 16:30:28 +08:00
unclecode	9d98393606	Prepare branch for release 0.2.74	2024-07-08 16:30:14 +08:00
unclecode	6f99368744	Add UTF encoding to resolve the Windows machine "charmap" error.	2024-07-08 16:18:07 +08:00
unclecode	ea2f83ac10	feat: Add delay after fetching URL in crawler hooks This commit adds a delay of 5 seconds after fetching the URL in the `after_get_url` hook of the crawler hooks. The delay is implemented using the `time.sleep()` function. This change ensures that the entire page is fetched before proceeding with further actions.	2024-07-08 15:59:59 +08:00
unclecode	7f41ff4a74	The `after_get_url` hook is executed after getting the URL, allowing for further customization.	2024-07-06 14:28:01 +08:00
unclecode	236bdb4035	feat: Add MaxRetryError exception handling in LocalSeleniumCrawlerStrategy	2024-07-06 14:08:30 +08:00
unclecode	1368248254	feat: Sanitize input and handle encoding issues in LLMExtractionStrategy	2024-07-05 17:59:26 +08:00
unclecode	b0ec54b9e9	feat: Sanitize input and handle encoding issues in LLMExtractionStrategy	2024-07-05 17:37:25 +08:00
unclecode	fb6ed5f000	feat: Sanitize input and handle encoding issues in LLMExtractionStrategy This commit modifies the LLMExtractionStrategy class in `extraction_strategy.py` to sanitize input and handle potential encoding issues. The `sanitize_input_encode` function is introduced in `utils.py` to encode and decode the input text as UTF-8 or ASCII, depending on the encoding issues encountered. If an encoding error occurs, the function falls back to ASCII encoding and logs a warning message. This change improves the robustness of the extraction process and ensures that characters are not lost due to encoding issues.	2024-07-05 17:30:58 +08:00
unclecode	597fe8bdb7	chore: Delete existing database file and initialize new database This commit deletes the existing database file and initializes a new database in the `crawl4ai/database.py` file. The `os.remove()` function is used to delete the file if it exists, and then the `init_db()` function is called to initialize the new database. This change is necessary to start with a clean database state.	2024-07-05 17:04:57 +08:00
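
The commit message for fb6ed5f000 describes `sanitize_input_encode` as a UTF-8 round trip with an ASCII fallback and a logged warning. The function body is not part of this diff, so the following is only a minimal sketch of that described behavior under stated assumptions: the signature, the warning text, and the choice of `errors="ignore"` are hypothetical. Note that with `errors="ignore"` the fallback path drops characters that cannot be encoded.

```python
def sanitize_input_encode(text: str) -> str:
    """Hypothetical sketch of the helper described in commit fb6ed5f000."""
    try:
        # Round-trip through UTF-8; text containing lone surrogates
        # raises UnicodeEncodeError here.
        return text.encode("utf-8").decode("utf-8")
    except UnicodeEncodeError as exc:
        # Assumed fallback: keep only ASCII characters and log a warning.
        print(f"[WARNING] Encoding issue, falling back to ASCII: {exc}")
        return text.encode("ascii", errors="ignore").decode("ascii")
```

In this sketch, well-formed text passes through unchanged, while a string with an unencodable code point is reduced to its ASCII characters.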