Compare commits

..

11 Commits

Author SHA1 Message Date
UncleCode
c2d4784810 fix: resolve merge conflict in DefaultMarkdownGenerator affecting fit_markdown generation 2024-11-28 12:56:31 +08:00
UncleCode
76bea6c577 Merge branch 'main' into 0.3.743 2024-11-28 12:53:30 +08:00
UncleCode
3ff0b0b2c4 feat: update changelog for version 0.3.743 with new features, improvements, and contributor acknowledgments 2024-11-28 12:48:07 +08:00
UncleCode
a1c7dc17ce Merge branch 'next' of https://github.com/unclecode/crawl4ai into next 2024-11-28 12:45:57 +08:00
UncleCode
24723b2f10 Enhance features and documentation
- Updated version to 0.3.743
  - Improved ManagedBrowser configuration with dynamic host/port
  - Implemented fast HTML formatting in web crawler
  - Enhanced markdown generation with a new generator class
  - Improved sanitization and utility functions
  - Added contributor details and pull request acknowledgments
  - Updated documentation for clearer usage scenarios
  - Adjusted tests to reflect class name changes
2024-11-28 12:45:05 +08:00
Hamza Farhan
f998e9e949 Fix: handled the cases where markdown_with_citations, references_markdown, and filtered_html might not be defined. (#293)
Thanks, dear Farhan, for the changes you made in the code. I accepted and merged them into the main branch. Also, I will add your name to our contributor list. Thank you so much.
2024-11-27 19:20:54 +08:00
zhounan
73661f7d1f docs: enhance development installation instructions (#286)
Thanks for your contribution. I'm merging your changes and I'll add your name to our contributor list. Thank you so much.
2024-11-27 15:04:20 +08:00
UncleCode
b5d4db07d1 Merge branch 'main' of https://github.com/unclecode/crawl4ai 2024-11-27 14:55:58 +08:00
UncleCode
c6a022132b docs: update CONTRIBUTORS.md to acknowledge aadityakanjolia4 for fixing 'CustomHTML2Text' bug 2024-11-27 14:55:56 +08:00
unclecode
195c0ccf8a chore: remove deprecated Docker Compose configurations for crawl4ai service 2024-11-24 19:40:27 +08:00
unclecode
b09a86c0c1 chore: remove deprecated Docker Compose configurations for crawl4ai service 2024-11-24 19:40:10 +08:00
13 changed files with 181 additions and 104 deletions

View File

@@ -1,5 +1,53 @@
# Changelog
## [0.3.743] November 27, 2024
Enhance features and documentation
- Updated version to 0.3.743
- Improved ManagedBrowser configuration with dynamic host/port
- Implemented fast HTML formatting in web crawler
- Enhanced markdown generation with a new generator class
- Improved sanitization and utility functions
- Added contributor details and pull request acknowledgments
- Updated documentation for clearer usage scenarios
- Adjusted tests to reflect class name changes
### CONTRIBUTORS.md
Added new contributors and pull request details.
Updated community contributions and acknowledged pull requests.
### crawl4ai/__version__.py
Version update.
Bumped version to 0.3.743.
### crawl4ai/async_crawler_strategy.py
Improved ManagedBrowser configuration.
Enhanced browser initialization with configurable host and debugging port; improved hook execution.
### crawl4ai/async_webcrawler.py
Optimized HTML processing.
Implemented 'fast_format_html' for optimized HTML formatting; applied it when 'prettiify' is enabled.
### crawl4ai/content_scraping_strategy.py
Enhanced markdown generation strategy.
Updated to use DefaultMarkdownGenerator and improved markdown generation with filters option.
### crawl4ai/markdown_generation_strategy.py
Refactored markdown generation class.
Renamed DefaultMarkdownGenerationStrategy to DefaultMarkdownGenerator; added content filter handling.
### crawl4ai/utils.py
Enhanced utility functions.
Improved input sanitization and enhanced HTML formatting method.
### docs/md_v2/advanced/hooks-auth.md
Improved documentation for hooks.
Updated code examples to include cookies in crawler strategy initialization.
### tests/async/test_markdown_genertor.py
Refactored tests to match class renaming.
Updated tests to use renamed DefaultMarkdownGenerator class.
## [0.3.74] November 17, 2024
This changelog details the updates and changes introduced in Crawl4AI version 0.3.74. It's designed to inform developers about new features, modifications to existing components, removals, and other important information.

View File

@@ -10,11 +10,19 @@ We would like to thank the following people for their contributions to Crawl4AI:
## Community Contributors
- [aadityakanjolia4](https://github.com/aadityakanjolia4) - Fix for `CustomHTML2Text` is not defined.
- [FractalMind](https://github.com/FractalMind) - Created the first official Docker Hub image and fixed Dockerfile errors
- [ketonkss4](https://github.com/ketonkss4) - Identified Selenium's new capabilities, helping reduce dependencies
- [jonymusky](https://github.com/jonymusky) - Javascript execution documentation, and wait_for
- [datehoer](https://github.com/datehoer) - Add browser prxy support
## Pull Requests
- [nelzomal](https://github.com/nelzomal) - Enhance development installation instructions [#286](https://github.com/unclecode/crawl4ai/pull/286)
- [HamzaFarhan](https://github.com/HamzaFarhan) - Handled the cases where markdown_with_citations, references_markdown, and filtered_html might not be defined [#293](https://github.com/unclecode/crawl4ai/pull/293)
- [NanmiCoder](https://github.com/NanmiCoder) - fix: crawler strategy exception handling and fixes [#271](https://github.com/unclecode/crawl4ai/pull/271)
## Other Contributors
- [Gokhan](https://github.com/gkhngyk)

View File

@@ -110,7 +110,15 @@ For contributors who plan to modify the source code:
```bash
git clone https://github.com/unclecode/crawl4ai.git
cd crawl4ai
pip install -e .
pip install -e . # Basic installation in editable mode
```
Install optional features:
```bash
pip install -e ".[torch]" # With PyTorch features
pip install -e ".[transformer]" # With Transformer features
pip install -e ".[cosine]" # With cosine similarity features
pip install -e ".[sync]" # With synchronous crawling (Selenium)
pip install -e ".[all]" # Install all optional features
```
## One-Click Deployment 🚀

View File

@@ -1,2 +1,2 @@
# crawl4ai/_version.py
__version__ = "0.3.742"
__version__ = "0.3.743"

View File

@@ -35,13 +35,14 @@ stealth_config = StealthConfig(
class ManagedBrowser:
def __init__(self, browser_type: str = "chromium", user_data_dir: Optional[str] = None, headless: bool = False, logger = None):
def __init__(self, browser_type: str = "chromium", user_data_dir: Optional[str] = None, headless: bool = False, logger = None, host: str = "localhost", debugging_port: int = 9222):
self.browser_type = browser_type
self.user_data_dir = user_data_dir
self.headless = headless
self.browser_process = None
self.temp_dir = None
self.debugging_port = 9222
self.debugging_port = debugging_port
self.host = host
self.logger = logger
self.shutting_down = False
@@ -70,7 +71,7 @@ class ManagedBrowser:
# Monitor browser process output for errors
asyncio.create_task(self._monitor_browser_process())
await asyncio.sleep(2) # Give browser time to start
return f"http://localhost:{self.debugging_port}"
return f"http://{self.host}:{self.debugging_port}"
except Exception as e:
await self.cleanup()
raise Exception(f"Failed to start browser: {e}")
@@ -416,13 +417,13 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
else:
raise ValueError(f"Invalid hook type: {hook_type}")
async def execute_hook(self, hook_type: str, *args):
async def execute_hook(self, hook_type: str, *args, **kwargs):
hook = self.hooks.get(hook_type)
if hook:
if asyncio.iscoroutinefunction(hook):
return await hook(*args)
return await hook(*args, **kwargs)
else:
return hook(*args)
return hook(*args, **kwargs)
return args[0] if args else None
def update_user_agent(self, user_agent: str):
@@ -642,6 +643,7 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
session_id = kwargs.get("session_id")
# Handle page creation differently for managed browser
context = None
if self.use_managed_browser:
if session_id:
# Reuse existing session if available
@@ -760,7 +762,7 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
return response
if not kwargs.get("js_only", False):
await self.execute_hook('before_goto', page)
await self.execute_hook('before_goto', page, context = context)
response = await page.goto(
@@ -773,7 +775,7 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
# response = await page.goto("about:blank")
# await page.evaluate(f"window.location.href = '{url}'")
await self.execute_hook('after_goto', page)
await self.execute_hook('after_goto', page, context = context)
# Get status code and headers
status_code = response.status
@@ -838,7 +840,7 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
# await page.wait_for_timeout(100)
# Check for on execution event
await self.execute_hook('on_execution_started', page)
await self.execute_hook('on_execution_started', page, context = context)
if kwargs.get("simulate_user", False) or kwargs.get("magic", False):
# Simulate user interactions
@@ -924,7 +926,7 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
if kwargs.get("process_iframes", False):
page = await self.process_iframes(page)
await self.execute_hook('before_retrieve_html', page)
await self.execute_hook('before_retrieve_html', page, context = context)
# Check if delay_before_return_html is set then wait for that time
delay_before_return_html = kwargs.get("delay_before_return_html")
if delay_before_return_html:
@@ -935,7 +937,7 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
await self.remove_overlay_elements(page)
html = await page.content()
await self.execute_hook('before_return_html', page, html)
await self.execute_hook('before_return_html', page, html, context = context)
# Check if kwargs has screenshot=True then take screenshot
screenshot_data = None

View File

@@ -25,7 +25,8 @@ from .config import (
from .utils import (
sanitize_input_encode,
InvalidCSSSelectorError,
format_html
format_html,
fast_format_html
)
from urllib.parse import urlparse
import random
@@ -534,16 +535,17 @@ class AsyncWebCrawler:
"timing": time.perf_counter() - t1
}
)
screenshot = None if not screenshot else screenshot
if kwargs.get("prettiify", False):
cleaned_html = fast_format_html(cleaned_html)
return CrawlResult(
url=url,
html=html,
cleaned_html=format_html(cleaned_html),
cleaned_html=cleaned_html,
markdown_v2=markdown_v2,
markdown=markdown,
fit_markdown=fit_markdown,

View File

@@ -10,7 +10,7 @@ from urllib.parse import urljoin
from requests.exceptions import InvalidSchema
# from .content_cleaning_strategy import ContentCleaningStrategy
from .content_filter_strategy import RelevantContentFilter, BM25ContentFilter#, HeuristicContentFilter
from .markdown_generation_strategy import MarkdownGenerationStrategy, DefaultMarkdownGenerationStrategy
from .markdown_generation_strategy import MarkdownGenerationStrategy, DefaultMarkdownGenerator
from .models import MarkdownGenerationResult
from .utils import (
sanitize_input_encode,
@@ -105,21 +105,28 @@ class WebScrapingStrategy(ContentScrapingStrategy):
Returns:
Dict containing markdown content in various formats
"""
markdown_generator: Optional[MarkdownGenerationStrategy] = kwargs.get('markdown_generator', DefaultMarkdownGenerationStrategy())
markdown_generator: Optional[MarkdownGenerationStrategy] = kwargs.get('markdown_generator', DefaultMarkdownGenerator())
if markdown_generator:
try:
if kwargs.get('fit_markdown', False) and not markdown_generator.content_filter:
markdown_generator.content_filter = BM25ContentFilter(
user_query=kwargs.get('fit_markdown_user_query', None),
bm25_threshold=kwargs.get('fit_markdown_bm25_threshold', 1.0)
)
markdown_result: MarkdownGenerationResult = markdown_generator.generate_markdown(
cleaned_html=cleaned_html,
base_url=url,
html2text_options=kwargs.get('html2text', {}),
content_filter=kwargs.get('content_filter', None)
html2text_options=kwargs.get('html2text', {})
)
help_message = """"""
return {
'markdown': markdown_result.raw_markdown,
'fit_markdown': markdown_result.fit_markdown or "Set flag 'fit_markdown' to True to get cleaned HTML content.",
'fit_html': markdown_result.fit_html or "Set flag 'fit_markdown' to True to get cleaned HTML content.",
'fit_markdown': markdown_result.fit_markdown,
'fit_html': markdown_result.fit_html,
'markdown_v2': markdown_result
}
except Exception as e:

View File

@@ -11,6 +11,8 @@ LINK_PATTERN = re.compile(r'!?\[([^\]]+)\]\(([^)]+?)(?:\s+"([^"]*)")?\)')
class MarkdownGenerationStrategy(ABC):
"""Abstract base class for markdown generation strategies."""
def __init__(self, content_filter: Optional[RelevantContentFilter] = None):
self.content_filter = content_filter
@abstractmethod
def generate_markdown(self,
@@ -23,8 +25,10 @@ class MarkdownGenerationStrategy(ABC):
"""Generate markdown from cleaned HTML."""
pass
class DefaultMarkdownGenerationStrategy(MarkdownGenerationStrategy):
class DefaultMarkdownGenerator(MarkdownGenerationStrategy):
"""Default implementation of markdown generation strategy."""
def __init__(self, content_filter: Optional[RelevantContentFilter] = None):
super().__init__(content_filter)
def convert_links_to_citations(self, markdown: str, base_url: str = "") -> Tuple[str, str]:
link_map = {}
@@ -84,14 +88,18 @@ class DefaultMarkdownGenerationStrategy(MarkdownGenerationStrategy):
raw_markdown = raw_markdown.replace(' ```', '```')
# Convert links to citations
markdown_with_citations: str = ""
references_markdown: str = ""
if citations:
markdown_with_citations, references_markdown = self.convert_links_to_citations(
raw_markdown, base_url
)
# Generate fit markdown if content filter is provided
fit_markdown: Optional[str] = None
if content_filter:
fit_markdown: Optional[str] = ""
filtered_html: Optional[str] = ""
if content_filter or self.content_filter:
content_filter = content_filter or self.content_filter
filtered_html = content_filter.filter_content(cleaned_html)
filtered_html = '\n'.join('<div>{}</div>'.format(s) for s in filtered_html)
fit_markdown = h.handle(filtered_html)
@@ -101,7 +109,7 @@ class DefaultMarkdownGenerationStrategy(MarkdownGenerationStrategy):
markdown_with_citations=markdown_with_citations,
references_markdown=references_markdown,
fit_markdown=fit_markdown,
fit_html=filtered_html
fit_html=filtered_html,
)
def fast_urljoin(base: str, url: str) -> str:

View File

@@ -233,12 +233,17 @@ def sanitize_html(html):
def sanitize_input_encode(text: str) -> str:
"""Sanitize input to handle potential encoding issues."""
try:
# Attempt to encode and decode as UTF-8 to handle potential encoding issues
return text.encode('utf-8', errors='ignore').decode('utf-8')
except UnicodeEncodeError as e:
print(f"Warning: Encoding issue detected. Some characters may be lost. Error: {e}")
# Fall back to ASCII if UTF-8 fails
return text.encode('ascii', errors='ignore').decode('ascii')
try:
if not text:
return ''
# Attempt to encode and decode as UTF-8 to handle potential encoding issues
return text.encode('utf-8', errors='ignore').decode('utf-8')
except UnicodeEncodeError as e:
print(f"Warning: Encoding issue detected. Some characters may be lost. Error: {e}")
# Fall back to ASCII if UTF-8 fails
return text.encode('ascii', errors='ignore').decode('ascii')
except Exception as e:
raise ValueError(f"Error sanitizing input: {str(e)}") from e
def escape_json_string(s):
"""
@@ -1079,9 +1084,54 @@ def wrap_text(draw, text, font, max_width):
return '\n'.join(lines)
def format_html(html_string):
soup = BeautifulSoup(html_string, 'html.parser')
soup = BeautifulSoup(html_string, 'lxml.parser')
return soup.prettify()
def fast_format_html(html_string):
"""
A fast HTML formatter that uses string operations instead of parsing.
Args:
html_string (str): The HTML string to format
Returns:
str: The formatted HTML string
"""
# Initialize variables
indent = 0
indent_str = " " # Two spaces for indentation
formatted = []
in_content = False
# Split by < and > to separate tags and content
parts = html_string.replace('>', '>\n').replace('<', '\n<').split('\n')
for part in parts:
if not part.strip():
continue
# Handle closing tags
if part.startswith('</'):
indent -= 1
formatted.append(indent_str * indent + part)
# Handle self-closing tags
elif part.startswith('<') and part.endswith('/>'):
formatted.append(indent_str * indent + part)
# Handle opening tags
elif part.startswith('<'):
formatted.append(indent_str * indent + part)
indent += 1
# Handle content between tags
else:
content = part.strip()
if content:
formatted.append(indent_str * indent + content)
return '\n'.join(formatted)
def normalize_url(href, base_url):
"""Normalize URLs to ensure consistent format"""
from urllib.parse import urljoin, urlparse

View File

@@ -1,27 +0,0 @@
services:
crawl4ai:
image: unclecode/crawl4ai:basic # Pull image from Docker Hub
ports:
- "11235:11235" # FastAPI server
- "8000:8000" # Alternative port
- "9222:9222" # Browser debugging
- "8080:8080" # Additional port
environment:
- CRAWL4AI_API_TOKEN=${CRAWL4AI_API_TOKEN:-} # Optional API token
- OPENAI_API_KEY=${OPENAI_API_KEY:-} # Optional OpenAI API key
- CLAUDE_API_KEY=${CLAUDE_API_KEY:-} # Optional Claude API key
volumes:
- /dev/shm:/dev/shm # Shared memory for browser operations
deploy:
resources:
limits:
memory: 4G
reservations:
memory: 1G
restart: unless-stopped
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:11235/health"]
interval: 30s
timeout: 10s
retries: 3
start_period: 40s

View File

@@ -1,33 +0,0 @@
services:
crawl4ai:
build:
context: .
dockerfile: Dockerfile
args:
PYTHON_VERSION: 3.10
INSTALL_TYPE: all
ENABLE_GPU: false
ports:
- "11235:11235" # FastAPI server
- "8000:8000" # Alternative port
- "9222:9222" # Browser debugging
- "8080:8080" # Additional port
environment:
- CRAWL4AI_API_TOKEN=${CRAWL4AI_API_TOKEN:-} # Optional API token
- OPENAI_API_KEY=${OPENAI_API_KEY:-} # Optional OpenAI API key
- CLAUDE_API_KEY=${CLAUDE_API_KEY:-} # Optional Claude API key
volumes:
- /dev/shm:/dev/shm # Shared memory for browser operations
deploy:
resources:
limits:
memory: 4G
reservations:
memory: 1G
restart: unless-stopped
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:11235/health"]
interval: 30s
timeout: 10s
retries: 3
start_period: 40s

View File

@@ -18,7 +18,7 @@ Let's see how we can customize the AsyncWebCrawler using hooks! In this example,
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.async_crawler_strategy import AsyncPlaywrightCrawlerStrategy
from playwright.async_api import Page, Browser
from playwright.async_api import Page, Browser, BrowserContext
async def on_browser_created(browser: Browser):
print("[HOOK] on_browser_created")
@@ -71,7 +71,11 @@ from crawl4ai.async_crawler_strategy import AsyncPlaywrightCrawlerStrategy
async def main():
print("\n🔗 Using Crawler Hooks: Let's see how we can customize the AsyncWebCrawler using hooks!")
crawler_strategy = AsyncPlaywrightCrawlerStrategy(verbose=True)
initial_cookies = [
{"name": "sessionId", "value": "abc123", "domain": ".example.com"},
{"name": "userId", "value": "12345", "domain": ".example.com"}
]
crawler_strategy = AsyncPlaywrightCrawlerStrategy(verbose=True, cookies=initial_cookies)
crawler_strategy.set_hook('on_browser_created', on_browser_created)
crawler_strategy.set_hook('before_goto', before_goto)
crawler_strategy.set_hook('after_goto', after_goto)

View File

@@ -11,7 +11,7 @@ import asyncio
import os
import time
from typing import Dict, Any
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerationStrategy
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
# Get current directory
__location__ = os.path.realpath(os.path.join(os.getcwd(), os.path.dirname(__file__)))
@@ -41,7 +41,7 @@ def test_basic_markdown_conversion():
with open(__location__ + "/data/wikipedia.html", "r") as f:
cleaned_html = f.read()
generator = DefaultMarkdownGenerationStrategy()
generator = DefaultMarkdownGenerator()
start_time = time.perf_counter()
result = generator.generate_markdown(
@@ -70,7 +70,7 @@ def test_relative_links():
Also an [image](/images/test.png) and another [page](/wiki/Banana).
"""
generator = DefaultMarkdownGenerationStrategy()
generator = DefaultMarkdownGenerator()
result = generator.generate_markdown(
cleaned_html=markdown,
base_url="https://en.wikipedia.org"
@@ -86,7 +86,7 @@ def test_duplicate_links():
Here's a [link](/test) and another [link](/test) and a [different link](/other).
"""
generator = DefaultMarkdownGenerationStrategy()
generator = DefaultMarkdownGenerator()
result = generator.generate_markdown(
cleaned_html=markdown,
base_url="https://example.com"
@@ -102,7 +102,7 @@ def test_link_descriptions():
Here's a [link with title](/test "Test Title") and a [link with description](/other) to test.
"""
generator = DefaultMarkdownGenerationStrategy()
generator = DefaultMarkdownGenerator()
result = generator.generate_markdown(
cleaned_html=markdown,
base_url="https://example.com"
@@ -120,7 +120,7 @@ def test_performance_large_document():
iterations = 5
times = []
generator = DefaultMarkdownGenerationStrategy()
generator = DefaultMarkdownGenerator()
for i in range(iterations):
start_time = time.perf_counter()
@@ -144,7 +144,7 @@ def test_image_links():
And a regular [link](/page).
"""
generator = DefaultMarkdownGenerationStrategy()
generator = DefaultMarkdownGenerator()
result = generator.generate_markdown(
cleaned_html=markdown,
base_url="https://example.com"