Compare commits

..

20 Commits

Author SHA1 Message Date
UncleCode
0d357ab7d2 feat(scraper): Enhance URL filtering and scoring systems
Implement comprehensive URL filtering and scoring capabilities:

Filters:
- Add URLPatternFilter with glob/regex support
- Implement ContentTypeFilter with MIME type checking
- Add DomainFilter for domain control
- Create FilterChain with stats tracking

Scorers:
- Complete KeywordRelevanceScorer implementation
- Add PathDepthScorer for URL structure scoring
- Implement ContentTypeScorer for file type priorities
- Add FreshnessScorer for date-based scoring
- Add DomainAuthorityScorer for domain weighting
- Create CompositeScorer for combined strategies

Features:
- Add statistics tracking for both filters and scorers
- Implement logging support throughout
- Add resource cleanup methods
- Create comprehensive documentation
- Include performance optimizations

Tests and docs included.
Note: Review URL normalization overlap with recent crawler changes.
2024-11-08 19:02:28 +08:00
UncleCode
bae4665949 feat(scraper): Enhance URL filtering and scoring systems
Implement comprehensive URL filtering and scoring capabilities:

Filters:
- Add URLPatternFilter with glob/regex support
- Implement ContentTypeFilter with MIME type checking
- Add DomainFilter for domain control
- Create FilterChain with stats tracking

Scorers:
- Complete KeywordRelevanceScorer implementation
- Add PathDepthScorer for URL structure scoring
- Implement ContentTypeScorer for file type priorities
- Add FreshnessScorer for date-based scoring
- Add DomainAuthorityScorer for domain weighting
- Create CompositeScorer for combined strategies

Features:
- Add statistics tracking for both filters and scorers
- Implement logging support throughout
- Add resource cleanup methods
- Create comprehensive documentation
- Include performance optimizations

Tests and docs included.
Note: Review URL normalization overlap with recent crawler changes.

- Quick Start is created and added
2024-11-08 18:45:12 +08:00
UncleCode
d11c004fbb Enhanced BFS Strategy: Improved monitoring, resource management & configuration
- Added CrawlStats for comprehensive crawl monitoring
- Implemented proper resource cleanup with shutdown mechanism
- Enhanced URL processing with better validation and politeness controls
- Added configuration options (max_concurrent, timeout, external_links)
- Improved error handling with retry logic
- Added domain-specific queues for better performance
- Created comprehensive documentation

Note: URL normalization needs review - potential duplicate processing
with core crawler for internal links. Currently commented out pending
further investigation of edge cases.
2024-11-08 15:57:23 +08:00
UncleCode
3d1c9a8434 Revieweing the BFS strategy. 2024-11-07 18:54:53 +08:00
UncleCode
be472c624c Refactored AsyncWebScraper to include comprehensive error handling and progress tracking capabilities. Introduced a ScrapingProgress data class to monitor processed and failed URLs. Enhanced scraping methods to log errors and track stats throughout the scraping process. 2024-11-06 21:09:47 +08:00
UncleCode
06b21dcc50 Update .gitignore to include new directories for issues and documentation 2024-11-06 18:44:03 +08:00
UncleCode
0f0f60527d Merge pull request #172 from aravindkarnam/scraper
Scraper
2024-11-06 07:00:44 +01:00
Aravind Karnam
8105fd178e Removed stubs for remove_from_future_crawls since the visited set is updated soon as the URL was queued, Removed add_to_retry_queue(url) since retry with exponential backoff with help of tenacity is going to take care of it. 2024-10-17 15:42:43 +05:30
Aravind Karnam
ce7fce4b16 1. Moved to asyncio.wait instead of gather so that results can be yeilded just as they are ready, rather than in batches
2. Moved the visted.add(url), to before the task is put in queue rather than after the crawl is completed. This makes sure that  duplicate crawls doesn't happen when same URL is found at different depth and that get's queued too because the crawl is not yet completed and visted set is not updated.
3. Named the yield_results attribute to stream instead. Since that seems to be popularly used in all other AI libraries for intermediate results.
2024-10-17 12:25:17 +05:30
Aravind Karnam
de28b59aca removed unused imports 2024-10-16 22:36:48 +05:30
Aravind Karnam
04d8b47b92 Exposed min_crawl_delay for BFSScraperStrategy 2024-10-16 22:34:54 +05:30
Aravind Karnam
2943feeecf 1. Added a flag to yield each crawl result,as they become ready along with the final scraper result as another option
2. Removed ascrape_many method, as I'm currently not focusing on it in the first cut of scraper
3. Added some error handling for cases where robots.txt cannot be fetched or parsed.
2024-10-16 22:05:29 +05:30
Aravind Karnam
8a7d29ce85 updated some comments and removed content type checking functionality from core as it's implemented as a filter 2024-10-16 15:59:37 +05:30
aravind
159bd875bd Merge pull request #5 from aravindkarnam/main
Merging 0.3.6
2024-10-16 10:41:22 +05:30
Aravind Karnam
d743adac68 Fixed some bugs in robots.txt processing 2024-10-03 15:58:57 +05:30
Aravind Karnam
7fe220dbd5 1. Introduced a bool flag to ascrape method to switch between sequential and concurrent processing
2. Introduced a dictionary for depth tracking across various tasks
3. Removed redundancy with crawled_urls variable. Instead created a list with visited set variable in returned object.
2024-10-03 11:17:11 +05:30
aravind
65e013d9d1 Merge pull request #3 from aravindkarnam/main
Merging latest changes from main branch
2024-10-03 09:52:12 +05:30
Aravind Karnam
7f3e2e47ed Parallel processing with retry on failure with exponential backoff - Simplified URL validation and normalisation - respecting Robots.txt 2024-09-19 12:34:12 +05:30
aravind
78f26ac263 Merge pull request #2 from aravindkarnam/staging
Staging
2024-09-18 18:16:23 +05:30
Aravind Karnam
44ce12c62c Created scaffolding for Scraper as per the plan. Implemented the ascrape method in bfs_scraper_strategy 2024-09-09 13:13:34 +05:30
275 changed files with 79556 additions and 42087 deletions

12
.gitattributes vendored
View File

@@ -1,12 +0,0 @@
# Documentation
*.html linguist-documentation
docs/* linguist-documentation
docs/examples/* linguist-documentation
docs/md_v2/* linguist-documentation
# Explicitly mark Python as the main language
*.py linguist-detectable=true
*.py linguist-language=Python
# Exclude HTML from language statistics
*.html linguist-detectable=false

26
.gitignore vendored
View File

@@ -199,35 +199,13 @@ test_env/
**/.DS_Store
todo.md
todo_executor.md
git_changes.py
git_changes.md
pypi_build.sh
git_issues.py
git_issues.md
.next/
.tests/
# .issues/
.docs/
.issues/
.gitboss/
todo_executor.md
protect-all-except-feature.sh
manage-collab.sh
publish.sh
combine.sh
combined_output.txt
.local
.scripts
tree.md
tree.md
.scripts
.local
.do
/plans
.codeiumignore
todo/
# windsurf rules
.windsurfrules
.docs/
.issues/

File diff suppressed because it is too large Load Diff

View File

@@ -10,21 +10,11 @@ We would like to thank the following people for their contributions to Crawl4AI:
## Community Contributors
- [aadityakanjolia4](https://github.com/aadityakanjolia4) - Fix for `CustomHTML2Text` is not defined.
- [FractalMind](https://github.com/FractalMind) - Created the first official Docker Hub image and fixed Dockerfile errors
- [ketonkss4](https://github.com/ketonkss4) - Identified Selenium's new capabilities, helping reduce dependencies
- [jonymusky](https://github.com/jonymusky) - Javascript execution documentation, and wait_for
- [datehoer](https://github.com/datehoer) - Add browser prxy support
## Pull Requests
- [dvschuyl](https://github.com/dvschuyl) - AsyncPlaywrightCrawlerStrategy page-evaluate context destroyed by navigation [#304](https://github.com/unclecode/crawl4ai/pull/304)
- [nelzomal](https://github.com/nelzomal) - Enhance development installation instructions [#286](https://github.com/unclecode/crawl4ai/pull/286)
- [HamzaFarhan](https://github.com/HamzaFarhan) - Handled the cases where markdown_with_citations, references_markdown, and filtered_html might not be defined [#293](https://github.com/unclecode/crawl4ai/pull/293)
- [NanmiCoder](https://github.com/NanmiCoder) - fix: crawler strategy exception handling and fixes [#271](https://github.com/unclecode/crawl4ai/pull/271)
- [paulokuong](https://github.com/paulokuong) - fix: RAWL4_AI_BASE_DIRECTORY should be Path object instead of string [#298](https://github.com/unclecode/crawl4ai/pull/298)
## Other Contributors
- [Gokhan](https://github.com/gkhngyk)

View File

@@ -1,136 +0,0 @@
# syntax=docker/dockerfile:1.4
ARG TARGETPLATFORM
ARG BUILDPLATFORM
# Other build arguments
ARG PYTHON_VERSION=3.10
# Base stage with system dependencies
FROM python:${PYTHON_VERSION}-slim as base
# Declare ARG variables again within the build stage
ARG INSTALL_TYPE=all
ARG ENABLE_GPU=false
# Platform-specific labels
LABEL maintainer="unclecode"
LABEL description="🔥🕷️ Crawl4AI: Open-source LLM Friendly Web Crawler & scraper"
LABEL version="1.0"
# Environment setup
ENV PYTHONUNBUFFERED=1 \
PYTHONDONTWRITEBYTECODE=1 \
PIP_NO_CACHE_DIR=1 \
PIP_DISABLE_PIP_VERSION_CHECK=1 \
PIP_DEFAULT_TIMEOUT=100 \
DEBIAN_FRONTEND=noninteractive
# Install system dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
build-essential \
curl \
wget \
gnupg \
git \
cmake \
pkg-config \
python3-dev \
libjpeg-dev \
libpng-dev \
&& rm -rf /var/lib/apt/lists/*
# Playwright system dependencies for Linux
RUN apt-get update && apt-get install -y --no-install-recommends \
libglib2.0-0 \
libnss3 \
libnspr4 \
libatk1.0-0 \
libatk-bridge2.0-0 \
libcups2 \
libdrm2 \
libdbus-1-3 \
libxcb1 \
libxkbcommon0 \
libx11-6 \
libxcomposite1 \
libxdamage1 \
libxext6 \
libxfixes3 \
libxrandr2 \
libgbm1 \
libpango-1.0-0 \
libcairo2 \
libasound2 \
libatspi2.0-0 \
&& rm -rf /var/lib/apt/lists/*
# GPU support if enabled and architecture is supported
RUN if [ "$ENABLE_GPU" = "true" ] && [ "$TARGETPLATFORM" = "linux/amd64" ] ; then \
apt-get update && apt-get install -y --no-install-recommends \
nvidia-cuda-toolkit \
&& rm -rf /var/lib/apt/lists/* ; \
else \
echo "Skipping NVIDIA CUDA Toolkit installation (unsupported platform or GPU disabled)"; \
fi
# Create and set working directory
WORKDIR /app
# Copy the entire project
COPY . .
# Install base requirements
RUN pip install --no-cache-dir -r requirements.txt
# Install required library for FastAPI
RUN pip install fastapi uvicorn psutil
# Install ML dependencies first for better layer caching
RUN if [ "$INSTALL_TYPE" = "all" ] ; then \
pip install --no-cache-dir \
torch \
torchvision \
torchaudio \
scikit-learn \
nltk \
transformers \
tokenizers && \
python -m nltk.downloader punkt stopwords ; \
fi
# Install the package
RUN if [ "$INSTALL_TYPE" = "all" ] ; then \
pip install ".[all]" && \
python -m crawl4ai.model_loader ; \
elif [ "$INSTALL_TYPE" = "torch" ] ; then \
pip install ".[torch]" ; \
elif [ "$INSTALL_TYPE" = "transformer" ] ; then \
pip install ".[transformer]" && \
python -m crawl4ai.model_loader ; \
else \
pip install "." ; \
fi
# Install MkDocs and required plugins
RUN pip install --no-cache-dir \
mkdocs \
mkdocs-material \
mkdocs-terminal \
pymdown-extensions
# Build MkDocs documentation
RUN mkdocs build
# Install Playwright and browsers
RUN if [ "$TARGETPLATFORM" = "linux/amd64" ]; then \
playwright install chromium; \
elif [ "$TARGETPLATFORM" = "linux/arm64" ]; then \
playwright install chromium; \
fi
# Expose port
EXPOSE 8000 11235 9222 8080
# Start the FastAPI server
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "11235"]

View File

@@ -1,2 +1 @@
include requirements.txt
recursive-include crawl4ai/js_snippet *.js
include requirements.txt

View File

@@ -1,46 +0,0 @@
# Mission
![Mission Diagram](./docs/assets/pitch-dark.svg)
### 1. The Data Capitalization Opportunity
We live in an unprecedented era of digital wealth creation. Every day, individuals and enterprises generate massive amounts of valuable digital footprints across various platforms, social media channels, messenger apps, and cloud services. While people can interact with their data within these platforms, there's an immense untapped opportunity to transform this data into true capital assets. Just as physical property became a foundational element of wealth creation, personal and enterprise data has the potential to become a new form of capital on balance sheets.
For individuals, this represents an opportunity to transform their digital activities into valuable assets. For enterprises, their internal communications, team discussions, and collaborative documents contain rich insights that could be structured and valued as intellectual capital. This wealth of information represents an unprecedented opportunity for value creation in the digital age.
### 2. The Potential of Authentic Data
While synthetic data has played a crucial role in AI development, there's an enormous untapped potential in the authentic data generated by individuals and organizations. Every message, document, and interaction contains unique insights and patterns that could enhance AI development. The challenge isn't a lack of data - it's that most authentic human-generated data remains inaccessible for productive use.
By enabling willing participation in data sharing, we can unlock this vast reservoir of authentic human knowledge. This represents an opportunity to enhance AI development with diverse, real-world data that reflects the full spectrum of human experience and knowledge.
## Our Pathway to Data Democracy
### 1. Open-Source Foundation
Our first step is creating an open-source data extraction engine that empowers developers and innovators to build tools for data structuring and organization. This foundation ensures transparency, security, and community-driven development. By making these tools openly available, we enable the technical infrastructure needed for true data ownership and capitalization.
### 2. Data Capitalization Platform
Building on this open-source foundation, we're developing a platform that helps individuals and enterprises transform their digital footprints into structured, valuable assets. This platform will provide the tools and frameworks needed to organize, understand, and value personal and organizational data as true capital assets.
### 3. Creating a Data Marketplace
The final piece is establishing a marketplace where individuals and organizations can willingly share their data assets. This creates opportunities for:
- Individuals to earn equity, revenue, or other forms of value from their data
- Enterprises to access diverse, high-quality data for AI development
- Researchers to work with authentic human-generated data
- Startups to build innovative solutions using real-world data
## Economic Vision: A Shared Data Economy
We envision a future where data becomes a fundamental asset class in a thriving shared economy. This transformation will democratize AI development by enabling willing participation in data sharing, ensuring that the benefits of AI advancement flow back to data creators. Just as property rights revolutionized economic systems, establishing data as a capital asset will create new opportunities for wealth creation and economic participation.
This shared data economy will:
- Enable individuals to capitalize on their digital footprints
- Create new revenue streams for data creators
- Provide AI developers with access to diverse, authentic data
- Foster innovation through broader access to real-world data
- Ensure more equitable distribution of AI's economic benefits
Our vision is to facilitate this transformation from the ground up - starting with open-source tools, progressing to data capitalization platforms, and ultimately creating a thriving marketplace where data becomes a true asset class in a shared economy. This approach ensures that the future of AI is built on a foundation of authentic human knowledge, with benefits flowing back to the individuals and organizations who create and share their valuable data.

779
README.md
View File

@@ -1,78 +1,157 @@
# 🚀🤖 Crawl4AI: Open-source LLM Friendly Web Crawler & Scraper.
<div align="center">
<a href="https://trendshift.io/repositories/11716" target="_blank"><img src="https://trendshift.io/api/badge/repositories/11716" alt="unclecode%2Fcrawl4ai | Trendshift" style="width: 250px; height: 55px;" width="250" height="55"/></a>
# Crawl4AI (Async Version) 🕷️🤖
[![GitHub Stars](https://img.shields.io/github/stars/unclecode/crawl4ai?style=social)](https://github.com/unclecode/crawl4ai/stargazers)
[![GitHub Forks](https://img.shields.io/github/forks/unclecode/crawl4ai?style=social)](https://github.com/unclecode/crawl4ai/network/members)
[![PyPI version](https://badge.fury.io/py/crawl4ai.svg)](https://badge.fury.io/py/crawl4ai)
[![Python Version](https://img.shields.io/pypi/pyversions/crawl4ai)](https://pypi.org/project/crawl4ai/)
[![Downloads](https://static.pepy.tech/badge/crawl4ai/month)](https://pepy.tech/project/crawl4ai)
<!-- [![Documentation Status](https://readthedocs.org/projects/crawl4ai/badge/?version=latest)](https://crawl4ai.readthedocs.io/) -->
[![GitHub Issues](https://img.shields.io/github/issues/unclecode/crawl4ai)](https://github.com/unclecode/crawl4ai/issues)
[![GitHub Pull Requests](https://img.shields.io/github/issues-pr/unclecode/crawl4ai)](https://github.com/unclecode/crawl4ai/pulls)
[![License](https://img.shields.io/github/license/unclecode/crawl4ai)](https://github.com/unclecode/crawl4ai/blob/main/LICENSE)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
[![Security: bandit](https://img.shields.io/badge/security-bandit-yellow.svg)](https://github.com/PyCQA/bandit)
</div>
Crawl4AI simplifies asynchronous web crawling and data extraction, making it accessible for large language models (LLMs) and AI applications. 🆓🌐
Crawl4AI is the #1 trending GitHub repository, actively maintained by a vibrant community. It delivers blazing-fast, AI-ready web crawling tailored for LLMs, AI agents, and data pipelines. Open source, flexible, and built for real-time performance, Crawl4AI empowers developers with unmatched speed, precision, and deployment ease.
> Looking for the synchronous version? Check out [README.sync.md](./README.sync.md). You can also access the previous version in the branch [V0.2.76](https://github.com/unclecode/crawl4ai/blob/v0.2.76).
[✨ Check out latest update v0.4.3b1x](#-recent-updates)
## New update 0.3.6
- 🌐 Multi-browser support (Chromium, Firefox, WebKit)
- 🖼️ Improved image processing with lazy-loading detection
- 🔧 Custom page timeout parameter for better control over crawling behavior
- 🕰️ Enhanced handling of delayed content loading
- 🔑 Custom headers support for LLM interactions
- 🖼️ iframe content extraction for comprehensive page analysis
- ⏱️ Flexible timeout and delayed content retrieval options
🎉 **Version 0.4.3b1 is out!** This release brings exciting new features like a Memory Dispatcher System, Streaming Support, LLM-Powered Markdown Generation, Schema Generation, and Robots.txt Compliance! [Read the release notes →](https://docs.crawl4ai.com/blog)
## Try it Now!
<details>
<summary>🤓 <strong>My Personal Story</strong></summary>
✨ Play around with this [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1REChY6fXQf-EaVYLv0eHEWvzlYxGm0pd?usp=sharing)
My journey with computers started in childhood when my dad, a computer scientist, introduced me to an Amstrad computer. Those early days sparked a fascination with technology, leading me to pursue computer science and specialize in NLP during my postgraduate studies. It was during this time that I first delved into web crawling, building tools to help researchers organize papers and extract information from publications a challenging yet rewarding experience that honed my skills in data extraction.
✨ Visit our [Documentation Website](https://crawl4ai.com/mkdocs/)
Fast forward to 2023, I was working on a tool for a project and needed a crawler to convert a webpage into markdown. While exploring solutions, I found one that claimed to be open-source but required creating an account and generating an API token. Worse, it turned out to be a SaaS model charging $16, and its quality didnt meet my standards. Frustrated, I realized this was a deeper problem. That frustration turned into turbo anger mode, and I decided to build my own solution. In just a few days, I created Crawl4AI. To my surprise, it went viral, earning thousands of GitHub stars and resonating with a global community.
## Features ✨
I made Crawl4AI open-source for two reasons. First, its my way of giving back to the open-source community that has supported me throughout my career. Second, I believe data should be accessible to everyone, not locked behind paywalls or monopolized by a few. Open access to data lays the foundation for the democratization of AI—a vision where individuals can train their own models and take ownership of their information. This library is the first step in a larger journey to create the best open-source data extraction and generation tool the world has ever seen, built collaboratively by a passionate community.
- 🆓 Completely free and open-source
- 🚀 Blazing fast performance, outperforming many paid services
- 🤖 LLM-friendly output formats (JSON, cleaned HTML, markdown)
- 🌍 Supports crawling multiple URLs simultaneously
- 🎨 Extracts and returns all media tags (Images, Audio, and Video)
- 🔗 Extracts all external and internal links
- 📚 Extracts metadata from the page
- 🔄 Custom hooks for authentication, headers, and page modifications before crawling
- 🕵️ User-agent customization
- 🖼️ Takes screenshots of the page
- 📜 Executes multiple custom JavaScripts before crawling
- 📊 Generates structured output without LLM using JsonCssExtractionStrategy
- 📚 Various chunking strategies: topic-based, regex, sentence, and more
- 🧠 Advanced extraction strategies: cosine clustering, LLM, and more
- 🎯 CSS selector support for precise data extraction
- 📝 Passes instructions/keywords to refine extraction
- 🔒 Proxy support for enhanced privacy and access
- 🔄 Session management for complex multi-page crawling scenarios
- 🌐 Asynchronous architecture for improved performance and scalability
Thank you to everyone who has supported this project, used it, and shared feedback. Your encouragement motivates me to dream even bigger. Join us, file issues, submit PRs, or spread the word. Together, we can build a tool that truly empowers people to access their own data and reshape the future of AI.
</details>
## Installation 🛠️
## 🧐 Why Crawl4AI?
Crawl4AI offers flexible installation options to suit various use cases. You can install it as a Python package or use Docker.
1. **Built for LLMs**: Creates smart, concise Markdown optimized for RAG and fine-tuning applications.
2. **Lightning Fast**: Delivers results 6x faster with real-time, cost-efficient performance.
3. **Flexible Browser Control**: Offers session management, proxies, and custom hooks for seamless data access.
4. **Heuristic Intelligence**: Uses advanced algorithms for efficient extraction, reducing reliance on costly models.
5. **Open Source & Deployable**: Fully open-source with no API keys—ready for Docker and cloud integration.
6. **Thriving Community**: Actively maintained by a vibrant community and the #1 trending GitHub repository.
### Using pip 🐍
## 🚀 Quick Start
Choose the installation option that best fits your needs:
#### Basic Installation
For basic web crawling and scraping tasks:
1. Install Crawl4AI:
```bash
# Install the package
pip install -U crawl4ai
# Run post-installation setup
crawl4ai-setup
# Verify your installation
crawl4ai-doctor
pip install crawl4ai
```
If you encounter any browser-related issues, you can install them manually:
By default, this will install the asynchronous version of Crawl4AI, using Playwright for web crawling.
👉 Note: When you install Crawl4AI, the setup script should automatically install and set up Playwright. However, if you encounter any Playwright-related errors, you can manually install it using one of these methods:
1. Through the command line:
```bash
playwright install
```
2. If the above doesn't work, try this more specific command:
```bash
python -m playwright install chromium
```
This second method has proven to be more reliable in some cases.
#### Installation with Synchronous Version
If you need the synchronous version using Selenium:
```bash
python -m playwright install --with-deps chromium
pip install crawl4ai[sync]
```
2. Run a simple web crawl:
#### Development Installation
For contributors who plan to modify the source code:
```bash
git clone https://github.com/unclecode/crawl4ai.git
cd crawl4ai
pip install -e .
```
### Using Docker 🐳
We're in the process of creating Docker images and pushing them to Docker Hub. This will provide an easy way to run Crawl4AI in a containerized environment. Stay tuned for updates!
For more detailed installation instructions and options, please refer to our [Installation Guide](https://crawl4ai.com/mkdocs/installation).
## Quick Start 🚀
```python
import asyncio
from crawl4ai import *
from crawl4ai import AsyncWebCrawler
async def main():
async with AsyncWebCrawler() as crawler:
async with AsyncWebCrawler(verbose=True) as crawler:
result = await crawler.arun(url="https://www.nbcnews.com/business")
print(result.markdown)
if __name__ == "__main__":
asyncio.run(main())
```
## Advanced Usage 🔬
### Executing JavaScript and Using CSS Selectors
```python
import asyncio
from crawl4ai import AsyncWebCrawler
async def main():
async with AsyncWebCrawler(verbose=True) as crawler:
js_code = ["const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More')); loadMoreButton && loadMoreButton.click();"]
result = await crawler.arun(
url="https://www.nbcnews.com/business",
js_code=js_code,
css_selector=".wide-tease-item__description",
bypass_cache=True
)
print(result.extracted_content)
if __name__ == "__main__":
asyncio.run(main())
```
### Using a Proxy
```python
import asyncio
from crawl4ai import AsyncWebCrawler
async def main():
async with AsyncWebCrawler(verbose=True, proxy="http://127.0.0.1:7890") as crawler:
result = await crawler.arun(
url="https://www.nbcnews.com/business",
bypass_cache=True
)
print(result.markdown)
@@ -80,330 +159,86 @@ if __name__ == "__main__":
asyncio.run(main())
```
## ✨ Features
### Extracting Structured Data without LLM
<details>
<summary>📝 <strong>Markdown Generation</strong></summary>
- 🧹 **Clean Markdown**: Generates clean, structured Markdown with accurate formatting.
- 🎯 **Fit Markdown**: Heuristic-based filtering to remove noise and irrelevant parts for AI-friendly processing.
- 🔗 **Citations and References**: Converts page links into a numbered reference list with clean citations.
- 🛠️ **Custom Strategies**: Users can create their own Markdown generation strategies tailored to specific needs.
- 📚 **BM25 Algorithm**: Employs BM25-based filtering for extracting core information and removing irrelevant content.
</details>
<details>
<summary>📊 <strong>Structured Data Extraction</strong></summary>
- 🤖 **LLM-Driven Extraction**: Supports all LLMs (open-source and proprietary) for structured data extraction.
- 🧱 **Chunking Strategies**: Implements chunking (topic-based, regex, sentence-level) for targeted content processing.
- 🌌 **Cosine Similarity**: Find relevant content chunks based on user queries for semantic extraction.
- 🔎 **CSS-Based Extraction**: Fast schema-based data extraction using XPath and CSS selectors.
- 🔧 **Schema Definition**: Define custom schemas for extracting structured JSON from repetitive patterns.
</details>
<details>
<summary>🌐 <strong>Browser Integration</strong></summary>
- 🖥️ **Managed Browser**: Use user-owned browsers with full control, avoiding bot detection.
- 🔄 **Remote Browser Control**: Connect to Chrome Developer Tools Protocol for remote, large-scale data extraction.
- 🔒 **Session Management**: Preserve browser states and reuse them for multi-step crawling.
- 🧩 **Proxy Support**: Seamlessly connect to proxies with authentication for secure access.
- ⚙️ **Full Browser Control**: Modify headers, cookies, user agents, and more for tailored crawling setups.
- 🌍 **Multi-Browser Support**: Compatible with Chromium, Firefox, and WebKit.
- 📐 **Dynamic Viewport Adjustment**: Automatically adjusts the browser viewport to match page content, ensuring complete rendering and capturing of all elements.
</details>
<details>
<summary>🔎 <strong>Crawling & Scraping</strong></summary>
- 🖼️ **Media Support**: Extract images, audio, videos, and responsive image formats like `srcset` and `picture`.
- 🚀 **Dynamic Crawling**: Execute JS and wait for async or sync for dynamic content extraction.
- 📸 **Screenshots**: Capture page screenshots during crawling for debugging or analysis.
- 📂 **Raw Data Crawling**: Directly process raw HTML (`raw:`) or local files (`file://`).
- 🔗 **Comprehensive Link Extraction**: Extracts internal, external links, and embedded iframe content.
- 🛠️ **Customizable Hooks**: Define hooks at every step to customize crawling behavior.
- 💾 **Caching**: Cache data for improved speed and to avoid redundant fetches.
- 📄 **Metadata Extraction**: Retrieve structured metadata from web pages.
- 📡 **IFrame Content Extraction**: Seamless extraction from embedded iframe content.
- 🕵️ **Lazy Load Handling**: Waits for images to fully load, ensuring no content is missed due to lazy loading.
- 🔄 **Full-Page Scanning**: Simulates scrolling to load and capture all dynamic content, perfect for infinite scroll pages.
</details>
<details>
<summary>🚀 <strong>Deployment</strong></summary>
- 🐳 **Dockerized Setup**: Optimized Docker image with API server for easy deployment.
- 🔄 **API Gateway**: One-click deployment with secure token authentication for API-based workflows.
- 🌐 **Scalable Architecture**: Designed for mass-scale production and optimized server performance.
- ⚙️ **DigitalOcean Deployment**: Ready-to-deploy configurations for DigitalOcean and similar platforms.
</details>
<details>
<summary>🎯 <strong>Additional Features</strong></summary>
- 🕶️ **Stealth Mode**: Avoid bot detection by mimicking real users.
- 🏷️ **Tag-Based Content Extraction**: Refine crawling based on custom tags, headers, or metadata.
- 🔗 **Link Analysis**: Extract and analyze all links for detailed data exploration.
- 🛡️ **Error Handling**: Robust error management for seamless execution.
- 🔐 **CORS & Static Serving**: Supports filesystem-based caching and cross-origin requests.
- 📖 **Clear Documentation**: Simplified and updated guides for onboarding and advanced usage.
- 🙌 **Community Recognition**: Acknowledges contributors and pull requests for transparency.
</details>
## Try it Now!
✨ Play around with this [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1SgRPrByQLzjRfwoRNq1wSGE9nYY_EE8C?usp=sharing)
✨ Visit our [Documentation Website](https://docs.crawl4ai.com/)
## Installation 🛠️
Crawl4AI offers flexible installation options to suit various use cases. You can install it as a Python package or use Docker.
<details>
<summary>🐍 <strong>Using pip</strong></summary>
Choose the installation option that best fits your needs:
### Basic Installation
For basic web crawling and scraping tasks:
```bash
pip install crawl4ai
crawl4ai-setup # Setup the browser
```
By default, this will install the asynchronous version of Crawl4AI, using Playwright for web crawling.
👉 **Note**: When you install Crawl4AI, the `crawl4ai-setup` should automatically install and set up Playwright. However, if you encounter any Playwright-related errors, you can manually install it using one of these methods:
1. Through the command line:
```bash
playwright install
```
2. If the above doesn't work, try this more specific command:
```bash
python -m playwright install chromium
```
This second method has proven to be more reliable in some cases.
---
### Installation with Synchronous Version
The sync version is deprecated and will be removed in future versions. If you need the synchronous version using Selenium:
```bash
pip install crawl4ai[sync]
```
---
### Development Installation
For contributors who plan to modify the source code:
```bash
git clone https://github.com/unclecode/crawl4ai.git
cd crawl4ai
pip install -e . # Basic installation in editable mode
```
Install optional features:
```bash
pip install -e ".[torch]" # With PyTorch features
pip install -e ".[transformer]" # With Transformer features
pip install -e ".[cosine]" # With cosine similarity features
pip install -e ".[sync]" # With synchronous crawling (Selenium)
pip install -e ".[all]" # Install all optional features
```
</details>
<details>
<summary>🐳 <strong>Docker Deployment</strong></summary>
> 🚀 **Major Changes Coming!** We're developing a completely new Docker implementation that will make deployment even more efficient and seamless. The current Docker setup is being deprecated in favor of this new solution.
### Current Docker Support
The existing Docker implementation is being deprecated and will be replaced soon. If you still need to use Docker with the current version:
- 📚 [Deprecated Docker Setup](./docs/deprecated/docker-deployment.md) - Instructions for the current Docker implementation
- ⚠️ Note: This setup will be replaced in the next major release
### What's Coming Next?
Our new Docker implementation will bring:
- Improved performance and resource efficiency
- Streamlined deployment process
- Better integration with Crawl4AI features
- Enhanced scalability options
Stay connected with our [GitHub repository](https://github.com/unclecode/crawl4ai) for updates!
</details>
---
### Quick Test
Run a quick test (works for both Docker options):
```python
import requests
# Submit a crawl job
response = requests.post(
"http://localhost:11235/crawl",
json={"urls": "https://example.com", "priority": 10}
)
task_id = response.json()["task_id"]
# Continue polling until the task is complete (status="completed")
result = requests.get(f"http://localhost:11235/task/{task_id}")
```
For more examples, see our [Docker Examples](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/docker_example.py). For advanced configuration, environment variables, and usage examples, see our [Docker Deployment Guide](https://docs.crawl4ai.com/basic/docker-deployment/).
</details>
## 🔬 Advanced Usage Examples 🔬
You can check the project structure in the directory [https://github.com/unclecode/crawl4ai/docs/examples](docs/examples). Over there, you can find a variety of examples; here, some popular examples are shared.
<details>
<summary>📝 <strong>Heuristic Markdown Generation with Clean and Fit Markdown</strong></summary>
The `JsonCssExtractionStrategy` allows for precise extraction of structured data from web pages using CSS selectors.
```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from crawl4ai.content_filter_strategy import PruningContentFilter, BM25ContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
async def main():
browser_config = BrowserConfig(
headless=True,
verbose=True,
)
run_config = CrawlerRunConfig(
cache_mode=CacheMode.ENABLED,
markdown_generator=DefaultMarkdownGenerator(
content_filter=PruningContentFilter(threshold=0.48, threshold_type="fixed", min_word_threshold=0)
),
# markdown_generator=DefaultMarkdownGenerator(
# content_filter=BM25ContentFilter(user_query="WHEN_WE_FOCUS_BASED_ON_A_USER_QUERY", bm25_threshold=1.0)
# ),
)
async with AsyncWebCrawler(config=browser_config) as crawler:
result = await crawler.arun(
url="https://docs.micronaut.io/4.7.6/guide/",
config=run_config
)
print(len(result.markdown))
print(len(result.fit_markdown))
print(len(result.markdown_v2.fit_markdown))
if __name__ == "__main__":
asyncio.run(main())
```
</details>
<details>
<summary>🖥️ <strong>Executing JavaScript & Extract Structured Data without LLMs</strong></summary>
```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
import json
from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
async def main():
async def extract_news_teasers():
schema = {
"name": "KidoCode Courses",
"baseSelector": "section.charge-methodology .w-tab-content > div",
"fields": [
{
"name": "section_title",
"selector": "h3.heading-50",
"type": "text",
},
{
"name": "section_description",
"selector": ".charge-content",
"type": "text",
},
{
"name": "course_name",
"selector": ".text-block-93",
"type": "text",
},
{
"name": "course_description",
"selector": ".course-content-text",
"type": "text",
},
{
"name": "course_icon",
"selector": ".image-92",
"type": "attribute",
"attribute": "src"
}
"name": "News Teaser Extractor",
"baseSelector": ".wide-tease-item__wrapper",
"fields": [
{
"name": "category",
"selector": ".unibrow span[data-testid='unibrow-text']",
"type": "text",
},
{
"name": "headline",
"selector": ".wide-tease-item__headline",
"type": "text",
},
{
"name": "summary",
"selector": ".wide-tease-item__description",
"type": "text",
},
{
"name": "time",
"selector": "[data-testid='wide-tease-date']",
"type": "text",
},
{
"name": "image",
"type": "nested",
"selector": "picture.teasePicture img",
"fields": [
{"name": "src", "type": "attribute", "attribute": "src"},
{"name": "alt", "type": "attribute", "attribute": "alt"},
],
},
{
"name": "link",
"selector": "a[href]",
"type": "attribute",
"attribute": "href",
},
],
}
}
extraction_strategy = JsonCssExtractionStrategy(schema, verbose=True)
browser_config = BrowserConfig(
headless=False,
verbose=True
)
run_config = CrawlerRunConfig(
extraction_strategy=extraction_strategy,
js_code=["""(async () => {const tabs = document.querySelectorAll("section.charge-methodology .tabs-menu-3 > div");for(let tab of tabs) {tab.scrollIntoView();tab.click();await new Promise(r => setTimeout(r, 500));}})();"""],
cache_mode=CacheMode.BYPASS
)
async with AsyncWebCrawler(config=browser_config) as crawler:
async with AsyncWebCrawler(verbose=True) as crawler:
result = await crawler.arun(
url="https://www.kidocode.com/degrees/technology",
config=run_config
url="https://www.nbcnews.com/business",
extraction_strategy=extraction_strategy,
bypass_cache=True,
)
companies = json.loads(result.extracted_content)
print(f"Successfully extracted {len(companies)} companies")
print(json.dumps(companies[0], indent=2))
assert result.success, "Failed to crawl the page"
news_teasers = json.loads(result.extracted_content)
print(f"Successfully extracted {len(news_teasers)} news teasers")
print(json.dumps(news_teasers[0], indent=2))
if __name__ == "__main__":
asyncio.run(main())
asyncio.run(extract_news_teasers())
```
</details>
For more advanced usage examples, check out our [Examples](https://crawl4ai.com/mkdocs/full_details/advanced_jsoncss_extraction.md) section in the documentation.
<details>
<summary>📚 <strong>Extracting Structured Data with LLMs</strong></summary>
### Extracting Structured Data with OpenAI
```python
import os
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from pydantic import BaseModel, Field
@@ -413,26 +248,19 @@ class OpenAIModelFee(BaseModel):
output_fee: str = Field(..., description="Fee for output token for the OpenAI model.")
async def main():
browser_config = BrowserConfig(verbose=True)
run_config = CrawlerRunConfig(
word_count_threshold=1,
extraction_strategy=LLMExtractionStrategy(
# Here you can use any provider that Litellm library supports, for instance: ollama/qwen2
# provider="ollama/qwen2", api_token="no-token",
provider="openai/gpt-4o", api_token=os.getenv('OPENAI_API_KEY'),
schema=OpenAIModelFee.schema(),
extraction_type="schema",
instruction="""From the crawled content, extract all mentioned model names along with their fees for input and output tokens.
Do not miss any models in the entire content. One extracted model JSON format should look like this:
{"model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", "output_fee": "US$30.00 / 1M tokens"}."""
),
cache_mode=CacheMode.BYPASS,
)
async with AsyncWebCrawler(config=browser_config) as crawler:
async with AsyncWebCrawler(verbose=True) as crawler:
result = await crawler.arun(
url='https://openai.com/api/pricing/',
config=run_config
word_count_threshold=1,
extraction_strategy=LLMExtractionStrategy(
provider="openai/gpt-4o", api_token=os.getenv('OPENAI_API_KEY'),
schema=OpenAIModelFee.schema(),
extraction_type="schema",
instruction="""From the crawled content, extract all mentioned model names along with their fees for input and output tokens.
Do not miss any models in the entire content. One extracted model JSON format should look like this:
{"model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", "output_fee": "US$30.00 / 1M tokens"}."""
),
bypass_cache=True,
)
print(result.extracted_content)
@@ -440,143 +268,117 @@ if __name__ == "__main__":
asyncio.run(main())
```
</details>
### Session Management and Dynamic Content Crawling
<details>
<summary>🤖 <strong>Using You own Browswer with Custome User Profile</strong></summary>
Crawl4AI excels at handling complex scenarios, such as crawling multiple pages with dynamic content loaded via JavaScript. Here's an example of crawling GitHub commits across multiple pages:
```python
import os, sys
from pathlib import Path
import asyncio, time
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
import asyncio
import re
from bs4 import BeautifulSoup
from crawl4ai import AsyncWebCrawler
async def test_news_crawl():
# Create a persistent user data directory
user_data_dir = os.path.join(Path.home(), ".crawl4ai", "browser_profile")
os.makedirs(user_data_dir, exist_ok=True)
async def crawl_typescript_commits():
first_commit = ""
async def on_execution_started(page):
nonlocal first_commit
try:
while True:
await page.wait_for_selector('li.Box-sc-g0xbh4-0 h4')
commit = await page.query_selector('li.Box-sc-g0xbh4-0 h4')
commit = await commit.evaluate('(element) => element.textContent')
commit = re.sub(r'\s+', '', commit)
if commit and commit != first_commit:
first_commit = commit
break
await asyncio.sleep(0.5)
except Exception as e:
print(f"Warning: New content didn't appear after JavaScript execution: {e}")
browser_config = BrowserConfig(
verbose=True,
headless=True,
user_data_dir=user_data_dir,
use_persistent_context=True,
)
run_config = CrawlerRunConfig(
cache_mode=CacheMode.BYPASS
)
async with AsyncWebCrawler(config=browser_config) as crawler:
url = "ADDRESS_OF_A_CHALLENGING_WEBSITE"
result = await crawler.arun(
url,
config=run_config,
magic=True,
)
print(f"Successfully crawled {url}")
print(f"Content length: {len(result.markdown)}")
async with AsyncWebCrawler(verbose=True) as crawler:
crawler.crawler_strategy.set_hook('on_execution_started', on_execution_started)
url = "https://github.com/microsoft/TypeScript/commits/main"
session_id = "typescript_commits_session"
all_commits = []
js_next_page = """
const button = document.querySelector('a[data-testid="pagination-next-button"]');
if (button) button.click();
"""
for page in range(3): # Crawl 3 pages
result = await crawler.arun(
url=url,
session_id=session_id,
css_selector="li.Box-sc-g0xbh4-0",
js=js_next_page if page > 0 else None,
bypass_cache=True,
js_only=page > 0
)
assert result.success, f"Failed to crawl page {page + 1}"
soup = BeautifulSoup(result.cleaned_html, 'html.parser')
commits = soup.select("li")
all_commits.extend(commits)
print(f"Page {page + 1}: Found {len(commits)} commits")
await crawler.crawler_strategy.kill_session(session_id)
print(f"Successfully crawled {len(all_commits)} commits across 3 pages")
if __name__ == "__main__":
asyncio.run(crawl_typescript_commits())
```
</details>
This example demonstrates Crawl4AI's ability to handle complex scenarios where content is loaded asynchronously. It crawls multiple pages of GitHub commits, executing JavaScript to load new content and using custom hooks to ensure data is loaded before proceeding.
## ✨ Recent Updates
- **🚀 New Dispatcher System**: Scale to thousands of URLs with intelligent **memory monitoring**, **concurrency control**, and optional **rate limiting**. (See `MemoryAdaptiveDispatcher`, `SemaphoreDispatcher`, `RateLimiter`, `CrawlerMonitor`)
- **⚡ Streaming Mode**: Process results **as they arrive** instead of waiting for an entire batch to complete. (Set `stream=True` in `CrawlerRunConfig`)
- **🤖 Enhanced LLM Integration**:
- **Automatic schema generation**: Create extraction rules from HTML using OpenAI or Ollama, no manual CSS/XPath needed.
- **LLM-powered Markdown filtering**: Refine your markdown output with a new `LLMContentFilter` that understands content relevance.
- **Ollama Support**: Use open-source or self-hosted models for private or cost-effective extraction.
- **🏎️ Faster Scraping Option**: New `LXMLWebScrapingStrategy` offers **10-20x speedup** for large, complex pages (experimental).
- **🤖 robots.txt Compliance**: Respect website rules with `check_robots_txt=True` and efficient local caching.
- **➡️ URL Redirection Tracking**: The `final_url` field now captures the final destination after any redirects.
- **🪞 Improved Mirroring**: The `LXMLWebScrapingStrategy` now has much greater fidelity, allowing for almost pixel-perfect mirroring of websites.
- **📈 Enhanced Monitoring**: Track memory, CPU, and individual crawler status with `CrawlerMonitor`.
- **📝 Improved Documentation**: More examples, clearer explanations, and updated tutorials.
Read the full details in our [0.4.248 Release Notes](https://github.com/unclecode/crawl4ai/blob/main/CHANGELOG.md).
Here's a clear markdown explanation for your users about version numbering:
For more advanced usage examples, check out our [Examples](https://crawl4ai.com/mkdocs/full_details/session_based_crawling.md) section in the documentation.
## Version Numbering in Crawl4AI
## Speed Comparison 🚀
Crawl4AI follows standard Python version numbering conventions (PEP 440) to help users understand the stability and features of each release.
Crawl4AI is designed with speed as a primary focus. Our goal is to provide the fastest possible response with high-quality data extraction, minimizing abstractions between the data and the user.
### Version Numbers Explained
We've conducted a speed comparison between Crawl4AI and Firecrawl, a paid service. The results demonstrate Crawl4AI's superior performance:
Our version numbers follow this pattern: `MAJOR.MINOR.PATCH` (e.g., 0.4.3)
```
Firecrawl:
Time taken: 7.02 seconds
Content length: 42074 characters
Images found: 49
#### Pre-release Versions
We use different suffixes to indicate development stages:
Crawl4AI (simple crawl):
Time taken: 1.60 seconds
Content length: 18238 characters
Images found: 49
- `dev` (0.4.3dev1): Development versions, unstable
- `a` (0.4.3a1): Alpha releases, experimental features
- `b` (0.4.3b1): Beta releases, feature complete but needs testing
- `rc` (0.4.3rc1): Release candidates, potential final version
Crawl4AI (with JavaScript execution):
Time taken: 4.64 seconds
Content length: 40869 characters
Images found: 89
```
#### Installation
- Regular installation (stable version):
```bash
pip install -U crawl4ai
```
As you can see, Crawl4AI outperforms Firecrawl significantly:
- Simple crawl: Crawl4AI is over 4 times faster than Firecrawl.
- With JavaScript execution: Even when executing JavaScript to load more content (doubling the number of images found), Crawl4AI is still faster than Firecrawl's simple crawl.
- Install pre-release versions:
```bash
pip install crawl4ai --pre
```
You can find the full comparison code in our repository at `docs/examples/crawl4ai_vs_firecrawl.py`.
- Install specific version:
```bash
pip install crawl4ai==0.4.3b1
```
## Documentation 📚
#### Why Pre-releases?
We use pre-releases to:
- Test new features in real-world scenarios
- Gather feedback before final releases
- Ensure stability for production users
- Allow early adopters to try new features
For detailed documentation, including installation instructions, advanced features, and API reference, visit our [Documentation Website](https://crawl4ai.com/mkdocs/).
For production environments, we recommend using the stable version. For testing new features, you can opt-in to pre-releases using the `--pre` flag.
## 📖 Documentation & Roadmap
> 🚨 **Documentation Update Alert**: We're undertaking a major documentation overhaul next week to reflect recent updates and improvements. Stay tuned for a more comprehensive and up-to-date guide!
For current documentation, including installation instructions, advanced features, and API reference, visit our [Documentation Website](https://docs.crawl4ai.com/).
To check our development plans and upcoming features, visit our [Roadmap](https://github.com/unclecode/crawl4ai/blob/main/ROADMAP.md).
<details>
<summary>📈 <strong>Development TODOs</strong></summary>
- [x] 0. Graph Crawler: Smart website traversal using graph search algorithms for comprehensive nested page extraction
- [ ] 1. Question-Based Crawler: Natural language driven web discovery and content extraction
- [ ] 2. Knowledge-Optimal Crawler: Smart crawling that maximizes knowledge while minimizing data extraction
- [ ] 3. Agentic Crawler: Autonomous system for complex multi-step crawling operations
- [ ] 4. Automated Schema Generator: Convert natural language to extraction schemas
- [ ] 5. Domain-Specific Scrapers: Pre-configured extractors for common platforms (academic, e-commerce)
- [ ] 6. Web Embedding Index: Semantic search infrastructure for crawled content
- [ ] 7. Interactive Playground: Web UI for testing, comparing strategies with AI assistance
- [ ] 8. Performance Monitor: Real-time insights into crawler operations
- [ ] 9. Cloud Integration: One-click deployment solutions across cloud providers
- [ ] 10. Sponsorship Program: Structured support system with tiered benefits
- [ ] 11. Educational Content: "How to Crawl" video series and interactive tutorials
</details>
## 🤝 Contributing
## Contributing 🤝
We welcome contributions from the open-source community. Check out our [contribution guidelines](https://github.com/unclecode/crawl4ai/blob/main/CONTRIBUTING.md) for more information.
## 📄 License
## License 📄
Crawl4AI is released under the [Apache 2.0 License](https://github.com/unclecode/crawl4ai/blob/main/LICENSE).
## 📧 Contact
## Contact 📧
For questions, suggestions, or feedback, feel free to reach out:
@@ -586,31 +388,6 @@ For questions, suggestions, or feedback, feel free to reach out:
Happy Crawling! 🕸️🚀
## 🗾 Mission
Our mission is to unlock the value of personal and enterprise data by transforming digital footprints into structured, tradeable assets. Crawl4AI empowers individuals and organizations with open-source tools to extract and structure data, fostering a shared data economy.
We envision a future where AI is powered by real human knowledge, ensuring data creators directly benefit from their contributions. By democratizing data and enabling ethical sharing, we are laying the foundation for authentic AI advancement.
<details>
<summary>🔑 <strong>Key Opportunities</strong></summary>
- **Data Capitalization**: Transform digital footprints into measurable, valuable assets.
- **Authentic AI Data**: Provide AI systems with real human insights.
- **Shared Economy**: Create a fair data marketplace that benefits data creators.
</details>
<details>
<summary>🚀 <strong>Development Pathway</strong></summary>
1. **Open-Source Tools**: Community-driven platforms for transparent data extraction.
2. **Digital Asset Structuring**: Tools to organize and value digital knowledge.
3. **Ethical Data Marketplace**: A secure, fair platform for exchanging structured data.
For more details, see our [full mission statement](./MISSION.md).
</details>
## Star History
[![Star History Chart](https://api.star-history.com/svg?repos=unclecode/crawl4ai&type=Date)](https://star-history.com/#unclecode/crawl4ai&Date)
[![Star History Chart](https://api.star-history.com/svg?repos=unclecode/crawl4ai&type=Date)](https://star-history.com/#unclecode/crawl4ai&Date)

244
README.sync.md Normal file
View File

@@ -0,0 +1,244 @@
# Crawl4AI v0.2.77 🕷️🤖
[![GitHub Stars](https://img.shields.io/github/stars/unclecode/crawl4ai?style=social)](https://github.com/unclecode/crawl4ai/stargazers)
[![GitHub Forks](https://img.shields.io/github/forks/unclecode/crawl4ai?style=social)](https://github.com/unclecode/crawl4ai/network/members)
[![GitHub Issues](https://img.shields.io/github/issues/unclecode/crawl4ai)](https://github.com/unclecode/crawl4ai/issues)
[![GitHub Pull Requests](https://img.shields.io/github/issues-pr/unclecode/crawl4ai)](https://github.com/unclecode/crawl4ai/pulls)
[![License](https://img.shields.io/github/license/unclecode/crawl4ai)](https://github.com/unclecode/crawl4ai/blob/main/LICENSE)
Crawl4AI simplifies web crawling and data extraction, making it accessible for large language models (LLMs) and AI applications. 🆓🌐
#### [v0.2.77] - 2024-08-02
Major improvements in functionality, performance, and cross-platform compatibility! 🚀
- 🐳 **Docker enhancements**:
- Significantly improved Dockerfile for easy installation on Linux, Mac, and Windows.
- 🌐 **Official Docker Hub image**:
- Launched our first official image on Docker Hub for streamlined deployment (unclecode/crawl4ai).
- 🔧 **Selenium upgrade**:
- Removed dependency on ChromeDriver, now using Selenium's built-in capabilities for better compatibility.
- 🖼️ **Image description**:
- Implemented ability to generate textual descriptions for extracted images from web pages.
-**Performance boost**:
- Various improvements to enhance overall speed and performance.
## Try it Now!
✨ Play around with this [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1sJPAmeLj5PMrg2VgOwMJ2ubGIcK0cJeX?usp=sharing)
✨ visit our [Documentation Website](https://crawl4ai.com/mkdocs/)
✨ Check [Demo](https://crawl4ai.com/mkdocs/demo)
## Features ✨
- 🆓 Completely free and open-source
- 🤖 LLM-friendly output formats (JSON, cleaned HTML, markdown)
- 🌍 Supports crawling multiple URLs simultaneously
- 🎨 Extracts and returns all media tags (Images, Audio, and Video)
- 🔗 Extracts all external and internal links
- 📚 Extracts metadata from the page
- 🔄 Custom hooks for authentication, headers, and page modifications before crawling
- 🕵️ User-agent customization
- 🖼️ Takes screenshots of the page
- 📜 Executes multiple custom JavaScripts before crawling
- 📚 Various chunking strategies: topic-based, regex, sentence, and more
- 🧠 Advanced extraction strategies: cosine clustering, LLM, and more
- 🎯 CSS selector support
- 📝 Passes instructions/keywords to refine extraction
# Crawl4AI
## 🌟 Shoutout to Contributors of v0.2.77!
A big thank you to the amazing contributors who've made this release possible:
- [@aravindkarnam](https://github.com/aravindkarnam) for the new image description feature
- [@FractalMind](https://github.com/FractalMind) for our official Docker Hub image
- [@ketonkss4](https://github.com/ketonkss4) for helping streamline our Selenium setup
Your contributions are driving Crawl4AI forward! 🚀
## Cool Examples 🚀
### Quick Start
```python
from crawl4ai import WebCrawler
# Create an instance of WebCrawler
crawler = WebCrawler()
# Warm up the crawler (load necessary models)
crawler.warmup()
# Run the crawler on a URL
result = crawler.run(url="https://www.nbcnews.com/business")
# Print the extracted content
print(result.markdown)
```
## How to install 🛠
### Using pip 🐍
```bash
virtualenv venv
source venv/bin/activate
pip install "crawl4ai @ git+https://github.com/unclecode/crawl4ai.git"
```
### Using Docker 🐳
```bash
# For Mac users (M1/M2)
# docker build --platform linux/amd64 -t crawl4ai .
docker build -t crawl4ai .
docker run -d -p 8000:80 crawl4ai
```
### Using Docker Hub 🐳
```bash
docker pull unclecode/crawl4ai:latest
docker run -d -p 8000:80 unclecode/crawl4ai:latest
```
## Speed-First Design 🚀
Perhaps the most important design principle for this library is speed. We need to ensure it can handle many links and resources in parallel as quickly as possible. By combining this speed with fast LLMs like Groq, the results will be truly amazing.
```python
import time
from crawl4ai.web_crawler import WebCrawler
crawler = WebCrawler()
crawler.warmup()
start = time.time()
url = r"https://www.nbcnews.com/business"
result = crawler.run( url, word_count_threshold=10, bypass_cache=True)
end = time.time()
print(f"Time taken: {end - start}")
```
Let's take a look the calculated time for the above code snippet:
```bash
[LOG] 🚀 Crawling done, success: True, time taken: 1.3623387813568115 seconds
[LOG] 🚀 Content extracted, success: True, time taken: 0.05715131759643555 seconds
[LOG] 🚀 Extraction, time taken: 0.05750393867492676 seconds.
Time taken: 1.439958095550537
```
Fetching the content from the page took 1.3623 seconds, and extracting the content took 0.0575 seconds. 🚀
### Extract Structured Data from Web Pages 📊
Crawl all OpenAI models and their fees from the official page.
```python
import os
from crawl4ai import WebCrawler
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from pydantic import BaseModel, Field
class OpenAIModelFee(BaseModel):
model_name: str = Field(..., description="Name of the OpenAI model.")
input_fee: str = Field(..., description="Fee for input token for the OpenAI model.")
output_fee: str = Field(..., description="Fee for output token ßfor the OpenAI model.")
url = 'https://openai.com/api/pricing/'
crawler = WebCrawler()
crawler.warmup()
result = crawler.run(
url=url,
word_count_threshold=1,
extraction_strategy= LLMExtractionStrategy(
provider= "openai/gpt-4o", api_token = os.getenv('OPENAI_API_KEY'),
schema=OpenAIModelFee.schema(),
extraction_type="schema",
instruction="""From the crawled content, extract all mentioned model names along with their fees for input and output tokens.
Do not miss any models in the entire content. One extracted model JSON format should look like this:
{"model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", "output_fee": "US$30.00 / 1M tokens"}."""
),
bypass_cache=True,
)
print(result.extracted_content)
```
### Execute JS, Filter Data with CSS Selector, and Clustering
```python
from crawl4ai import WebCrawler
from crawl4ai.chunking_strategy import CosineStrategy
js_code = ["const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More')); loadMoreButton && loadMoreButton.click();"]
crawler = WebCrawler()
crawler.warmup()
result = crawler.run(
url="https://www.nbcnews.com/business",
js=js_code,
css_selector="p",
extraction_strategy=CosineStrategy(semantic_filter="technology")
)
print(result.extracted_content)
```
### Extract Structured Data from Web Pages With Proxy and BaseUrl
```python
from crawl4ai import WebCrawler
from crawl4ai.extraction_strategy import LLMExtractionStrategy
def create_crawler():
crawler = WebCrawler(verbose=True, proxy="http://127.0.0.1:7890")
crawler.warmup()
return crawler
crawler = create_crawler()
crawler.warmup()
result = crawler.run(
url="https://www.nbcnews.com/business",
extraction_strategy=LLMExtractionStrategy(
provider="openai/gpt-4o",
api_token="sk-",
base_url="https://api.openai.com/v1"
)
)
print(result.markdown)
```
## Documentation 📚
For detailed documentation, including installation instructions, advanced features, and API reference, visit our [Documentation Website](https://crawl4ai.com/mkdocs/).
## Contributing 🤝
We welcome contributions from the open-source community. Check out our [contribution guidelines](https://github.com/unclecode/crawl4ai/blob/main/CONTRIBUTING.md) for more information.
## License 📄
Crawl4AI is released under the [Apache 2.0 License](https://github.com/unclecode/crawl4ai/blob/main/LICENSE).
## Contact 📧
For questions, suggestions, or feedback, feel free to reach out:
- GitHub: [unclecode](https://github.com/unclecode)
- Twitter: [@unclecode](https://twitter.com/unclecode)
- Website: [crawl4ai.com](https://crawl4ai.com)
Happy Crawling! 🕸️🚀
## Star History
[![Star History Chart](https://api.star-history.com/svg?repos=unclecode/crawl4ai&type=Date)](https://star-history.com/#unclecode/crawl4ai&Date)

View File

@@ -1,503 +0,0 @@
# Crawl4AI Strategic Roadmap
```mermaid
%%{init: {'themeVariables': { 'fontSize': '14px'}}}%%
graph TD
subgraph A1[Advanced Crawling Systems 🔧]
A["`
• Graph Crawler ✓
• Question-Based Crawler
• Knowledge-Optimal Crawler
• Agentic Crawler
`"]
end
subgraph A2[Specialized Features 🛠️]
B["`
• Automated Schema Generator
• Domain-Specific Scrapers
`"]
end
subgraph A3[Development Tools 🔨]
C["`
• Interactive Playground
• Performance Monitor
• Cloud Integration
`"]
end
subgraph A4[Community & Growth 🌱]
D["`
• Sponsorship Program
• Educational Content
`"]
end
classDef default fill:#f9f9f9,stroke:#333,stroke-width:2px
classDef section fill:#f0f0f0,stroke:#333,stroke-width:4px,rx:10
class A1,A2,A3,A4 section
%% Layout hints
A1 --> A2[" "]
A3 --> A4[" "]
linkStyle 0,1 stroke:none
```
Crawl4AI is evolving to provide more intelligent, efficient, and versatile web crawling capabilities. This roadmap outlines the key developments and features planned for the project, organized into strategic sections that build upon our current foundation.
## 1. Advanced Crawling Systems 🔧
This section introduces three powerful crawling systems that extend Crawl4AI's capabilities from basic web crawling to intelligent, purpose-driven data extraction.
### 1.1 Question-Based Crawler
The Question-Based Crawler enhances our core engine by enabling automatic discovery and extraction of relevant web content based on natural language questions.
Key Features:
- SerpiAPI integration for intelligent web search
- Relevancy scoring for search results
- Automatic URL discovery and prioritization
- Cross-source validation
```python
from crawl4ai import AsyncWebCrawler
from crawl4ai.discovery import QuestionBasedDiscovery
async with AsyncWebCrawler() as crawler:
discovery = QuestionBasedDiscovery(crawler)
results = await discovery.arun(
question="What are the system requirements for major cloud providers' GPU instances?",
max_urls=5,
relevance_threshold=0.7
)
for result in results:
print(f"Source: {result.url} (Relevance: {result.relevance_score})")
print(f"Content: {result.markdown}\n")
```
### 1.2 Knowledge-Optimal Crawler
An intelligent crawling system that solves the optimization problem of minimizing data extraction while maximizing knowledge acquisition for specific objectives.
Key Features:
- Smart content prioritization
- Minimal data extraction for maximum knowledge
- Probabilistic relevance assessment
- Objective-driven crawling paths
```python
from crawl4ai import AsyncWebCrawler
from crawl4ai.optimization import KnowledgeOptimizer
async with AsyncWebCrawler() as crawler:
optimizer = KnowledgeOptimizer(
objective="Understand GPU instance pricing and limitations across cloud providers",
required_knowledge=[
"pricing structure",
"GPU specifications",
"usage limits",
"availability zones"
],
confidence_threshold=0.85
)
result = await crawler.arun(
urls=[
"https://aws.amazon.com/ec2/pricing/",
"https://cloud.google.com/gpu",
"https://azure.microsoft.com/pricing/"
],
optimizer=optimizer,
optimization_mode="minimal_extraction"
)
print(f"Knowledge Coverage: {result.knowledge_coverage}")
print(f"Data Efficiency: {result.efficiency_ratio}")
print(f"Extracted Content: {result.optimal_content}")
```
### 1.3 Agentic Crawler
An autonomous system capable of understanding complex goals and automatically planning and executing multi-step crawling operations.
Key Features:
- Autonomous goal interpretation
- Dynamic step planning
- Interactive navigation capabilities
- Visual recognition and interaction
- Automatic error recovery
```python
from crawl4ai import AsyncWebCrawler
from crawl4ai.agents import CrawlerAgent
async with AsyncWebCrawler() as crawler:
agent = CrawlerAgent(crawler)
# Automatic planning and execution
result = await agent.arun(
goal="Find research papers about quantum computing published in 2023 with more than 50 citations",
auto_retry=True
)
print("Generated Plan:", result.executed_steps)
print("Extracted Data:", result.data)
# Using custom steps with automatic execution
result = await agent.arun(
goal="Extract conference deadlines from ML conferences",
custom_plan=[
"Navigate to conference page",
"Find important dates section",
"Extract submission deadlines",
"Verify dates are for 2024"
]
)
# Monitoring execution
print("Step Completion:", result.step_status)
print("Execution Time:", result.execution_time)
print("Success Rate:", result.success_rate)
```
# Section 2: Specialized Features 🛠️
This section introduces specialized tools and features that enhance Crawl4AI's capabilities for specific use cases and data extraction needs.
### 2.1 Automated Schema Generator
A system that automatically generates JsonCssExtractionStrategy schemas from natural language descriptions, making structured data extraction accessible to all users.
Key Features:
- Natural language schema generation
- Automatic pattern detection
- Predefined schema templates
- Chrome extension for visual schema building
```python
from crawl4ai import AsyncWebCrawler
from crawl4ai.schema import SchemaGenerator
# Generate schema from natural language description
generator = SchemaGenerator()
schema = await generator.generate(
url="https://news-website.com",
description="For each news article on the page, I need the headline, publication date, and main image"
)
# Use generated schema with crawler
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(
url="https://news-website.com",
extraction_strategy=schema
)
# Example of generated schema:
"""
{
"name": "News Article Extractor",
"baseSelector": "article.news-item",
"fields": [
{
"name": "headline",
"selector": "h2.article-title",
"type": "text"
},
{
"name": "date",
"selector": "span.publish-date",
"type": "text"
},
{
"name": "image",
"selector": "img.article-image",
"type": "attribute",
"attribute": "src"
}
]
}
"""
```
### 2.2 Domain Specific Scrapers
Specialized extraction strategies optimized for common website types and platforms, providing consistent and reliable data extraction without additional configuration.
Key Features:
- Pre-configured extractors for popular platforms
- Academic site specialization (arXiv, NCBI)
- E-commerce standardization
- Documentation site handling
```python
from crawl4ai import AsyncWebCrawler
from crawl4ai.extractors import AcademicExtractor, EcommerceExtractor
async with AsyncWebCrawler() as crawler:
# Academic paper extraction
papers = await crawler.arun(
url="https://arxiv.org/list/cs.AI/recent",
extractor="academic", # Built-in extractor type
site_type="arxiv", # Specific site optimization
extract_fields=[
"title",
"authors",
"abstract",
"citations"
]
)
# E-commerce product data
products = await crawler.arun(
url="https://store.example.com/products",
extractor="ecommerce",
extract_fields=[
"name",
"price",
"availability",
"reviews"
]
)
```
### 2.3 Web Embedding Index
Creates and maintains a semantic search infrastructure for crawled content, enabling efficient retrieval and querying of web content through vector embeddings.
Key Features:
- Automatic embedding generation
- Intelligent content chunking
- Efficient vector storage and indexing
- Semantic search capabilities
```python
from crawl4ai import AsyncWebCrawler
from crawl4ai.indexing import WebIndex
# Initialize and build index
index = WebIndex(model="efficient-mini")
async with AsyncWebCrawler() as crawler:
# Crawl and index content
await index.build(
urls=["https://docs.example.com"],
crawler=crawler,
options={
"chunk_method": "semantic",
"update_policy": "incremental",
"embedding_batch_size": 100
}
)
# Search through indexed content
results = await index.search(
query="How to implement OAuth authentication?",
filters={
"content_type": "technical",
"recency": "6months"
},
top_k=5
)
# Get similar content
similar = await index.find_similar(
url="https://docs.example.com/auth/oauth",
threshold=0.85
)
```
Each of these specialized features builds upon Crawl4AI's core functionality while providing targeted solutions for specific use cases. They can be used independently or combined for more complex data extraction and processing needs.
# Section 3: Development Tools 🔧
This section covers tools designed to enhance the development experience, monitoring, and deployment of Crawl4AI applications.
### 3.1 Crawl4AI Playground 🎮
The Crawl4AI Playground is an interactive web-based development environment that simplifies web scraping experimentation, development, and deployment. With its intuitive interface and AI-powered assistance, users can quickly prototype, test, and deploy web scraping solutions.
#### Key Features 🌟
##### Visual Strategy Builder
- Interactive point-and-click interface for building extraction strategies
- Real-time preview of selected elements
- Side-by-side comparison of different extraction approaches
- Visual validation of CSS selectors and XPath queries
##### AI Assistant Integration
- Strategy recommendations based on target website analysis
- Parameter optimization suggestions
- Best practices guidance for specific use cases
- Automated error detection and resolution
- Performance optimization tips
##### Real-Time Testing & Validation
- Live preview of extraction results
- Side-by-side comparison of multiple strategies
- Performance metrics visualization
- Automatic validation of extracted data
- Error detection and debugging tools
##### Project Management
- Save and organize multiple scraping projects
- Version control for configurations
- Export/import project settings
- Share configurations with team members
- Project templates for common use cases
##### Deployment Pipeline
- One-click deployment to various environments
- Docker container generation
- Cloud deployment templates (AWS, GCP, Azure)
- Scaling configuration management
- Monitoring setup automation
### 3.2 Performance Monitoring System
A comprehensive monitoring solution providing real-time insights into crawler operations, resource usage, and system health through both CLI and GUI interfaces.
Key Features:
- Real-time resource tracking
- Active crawl monitoring
- Performance statistics
- Customizable alerting system
```python
from crawl4ai import AsyncWebCrawler
from crawl4ai.monitor import CrawlMonitor
# Initialize monitoring
monitor = CrawlMonitor()
# Start monitoring with CLI interface
await monitor.start(
mode="cli", # or "gui"
refresh_rate="1s",
metrics={
"resources": ["cpu", "memory", "network"],
"crawls": ["active", "queued", "completed"],
"performance": ["success_rate", "response_times"]
}
)
# Example CLI output:
"""
Crawl4AI Monitor (Live) - Press Q to exit
────────────────────────────────────────
System Usage:
├─ CPU: ███████░░░ 70%
└─ Memory: ████░░░░░ 2.1GB/8GB
Active Crawls:
ID URL Status Progress
001 docs.example.com 🟢 Active 75%
002 api.service.com 🟡 Queue -
Metrics (Last 5min):
├─ Success Rate: 98%
├─ Avg Response: 0.6s
└─ Pages/sec: 8.5
"""
```
### 3.3 Cloud Integration
Streamlined deployment tools for setting up Crawl4AI in various cloud environments, with support for scaling and monitoring.
Key Features:
- One-click deployment solutions
- Auto-scaling configuration
- Load balancing setup
- Cloud-specific optimizations
- Monitoring integration
```python
from crawl4ai import AsyncWebCrawler
from crawl4ai.deploy import CloudDeployer
# Initialize deployer
deployer = CloudDeployer()
# Deploy crawler service
deployment = await deployer.deploy(
service_name="crawler-cluster",
platform="aws", # or "gcp", "azure"
config={
"instance_type": "compute-optimized",
"auto_scaling": {
"min_instances": 2,
"max_instances": 10,
"scale_based_on": "cpu_usage"
},
"region": "us-east-1",
"monitoring": True
}
)
# Get deployment status and endpoints
print(f"Service Status: {deployment.status}")
print(f"API Endpoint: {deployment.endpoint}")
print(f"Monitor URL: {deployment.monitor_url}")
```
These development tools work together to provide a comprehensive environment for developing, testing, monitoring, and deploying Crawl4AI applications. The Playground helps users experiment and generate optimal configurations, the Performance Monitor ensures smooth operation, and the Cloud Integration tools simplify deployment and scaling.
# Section 4: Community & Growth 🌱
This section outlines initiatives designed to build and support the Crawl4AI community, provide educational resources, and ensure sustainable project growth.
### 4.1 Sponsorship Program
A structured program to support ongoing development and maintenance of Crawl4AI while providing valuable benefits to sponsors.
Key Features:
- Multiple sponsorship tiers
- Sponsor recognition system
- Priority support for sponsors
- Early access to new features
- Custom feature development opportunities
Program Structure (not yet finalized):
```
Sponsorship Tiers:
🥉 Bronze Supporter
- GitHub Sponsor badge
- Priority issue response
- Community Discord role
🥈 Silver Supporter
- All Bronze benefits
- Technical support channel
- Vote on roadmap priorities
- Early access to beta features
🥇 Gold Supporter
- All Silver benefits
- Custom feature requests
- Direct developer access
- Private support sessions
💎 Diamond Partner
- All Gold benefits
- Custom development
- On-demand consulting
- Integration support
```
### 4.2 "How to Crawl" Video Series
A comprehensive educational resource teaching users how to effectively use Crawl4AI for various web scraping and data extraction scenarios.
Key Features:
- Step-by-step tutorials
- Real-world use cases
- Best practices
- Integration guides
- Advanced feature deep-dives
These community initiatives are designed to:
- Provide comprehensive learning resources
- Foster a supportive user community
- Ensure sustainable project development
- Share knowledge and best practices
- Create opportunities for collaboration
The combination of structured support through sponsorship, educational content through video series, and interactive learning through the playground creates a robust ecosystem for both new and experienced users of Crawl4AI.

View File

@@ -1,88 +1,30 @@
# __init__.py
from .async_webcrawler import AsyncWebCrawler, CacheMode
from .async_configs import BrowserConfig, CrawlerRunConfig
from .content_scraping_strategy import (
ContentScrapingStrategy,
WebScrapingStrategy,
LXMLWebScrapingStrategy,
)
from .extraction_strategy import (
ExtractionStrategy,
LLMExtractionStrategy,
CosineStrategy,
JsonCssExtractionStrategy,
JsonXPathExtractionStrategy
)
from .chunking_strategy import ChunkingStrategy, RegexChunking
from .markdown_generation_strategy import DefaultMarkdownGenerator
from .content_filter_strategy import PruningContentFilter, BM25ContentFilter, LLMContentFilter
from .models import CrawlResult, MarkdownGenerationResult
from .async_dispatcher import (
MemoryAdaptiveDispatcher,
SemaphoreDispatcher,
RateLimiter,
CrawlerMonitor,
DisplayMode,
BaseDispatcher
)
from .async_webcrawler import AsyncWebCrawler
from .models import CrawlResult
__version__ = "0.3.6"
__all__ = [
"AsyncWebCrawler",
"CrawlResult",
"CacheMode",
"ContentScrapingStrategy",
"WebScrapingStrategy",
"LXMLWebScrapingStrategy",
"BrowserConfig",
"CrawlerRunConfig",
"ExtractionStrategy",
"LLMExtractionStrategy",
"CosineStrategy",
"JsonCssExtractionStrategy",
"JsonXPathExtractionStrategy",
"ChunkingStrategy",
"RegexChunking",
"DefaultMarkdownGenerator",
"PruningContentFilter",
"BM25ContentFilter",
"LLMContentFilter",
"BaseDispatcher",
"MemoryAdaptiveDispatcher",
"SemaphoreDispatcher",
"RateLimiter",
"CrawlerMonitor",
"DisplayMode",
"MarkdownGenerationResult",
]
def is_sync_version_installed():
try:
import selenium
return True
except ImportError:
return False
if is_sync_version_installed():
try:
from .web_crawler import WebCrawler
__all__.append("WebCrawler")
except ImportError:
print(
"Warning: Failed to import WebCrawler even though selenium is installed. This might be due to other missing dependencies."
)
import warnings
print("Warning: Failed to import WebCrawler even though selenium is installed. This might be due to other missing dependencies.")
else:
WebCrawler = None
# import warnings
# print("Warning: Synchronous WebCrawler is not available. Install crawl4ai[sync] for synchronous support. However, please note that the synchronous version will be deprecated soon.")
import warnings
from pydantic import warnings as pydantic_warnings
# Disable all Pydantic warnings
warnings.filterwarnings("ignore", module="pydantic")
# pydantic_warnings.filter_warnings()
import warnings
print("Warning: Synchronous WebCrawler is not available. Install crawl4ai[sync] for synchronous support. However, please note that the synchronous version will be deprecated soon.")

View File

@@ -1,2 +0,0 @@
# crawl4ai/_version.py
__version__ = "0.4.3b1"

View File

@@ -1,715 +0,0 @@
from .config import (
MIN_WORD_THRESHOLD,
IMAGE_DESCRIPTION_MIN_WORD_THRESHOLD,
SCREENSHOT_HEIGHT_TRESHOLD,
PAGE_TIMEOUT,
IMAGE_SCORE_THRESHOLD,
SOCIAL_MEDIA_DOMAINS,
)
from .user_agent_generator import UserAgentGenerator
from .extraction_strategy import ExtractionStrategy
from .chunking_strategy import ChunkingStrategy, RegexChunking
from .markdown_generation_strategy import MarkdownGenerationStrategy
from .content_scraping_strategy import ContentScrapingStrategy, WebScrapingStrategy
from typing import Optional, Union, List
class BrowserConfig:
"""
Configuration class for setting up a browser instance and its context in AsyncPlaywrightCrawlerStrategy.
This class centralizes all parameters that affect browser and context creation. Instead of passing
scattered keyword arguments, users can instantiate and modify this configuration object. The crawler
code will then reference these settings to initialize the browser in a consistent, documented manner.
Attributes:
browser_type (str): The type of browser to launch. Supported values: "chromium", "firefox", "webkit".
Default: "chromium".
headless (bool): Whether to run the browser in headless mode (no visible GUI).
Default: True.
use_managed_browser (bool): Launch the browser using a managed approach (e.g., via CDP), allowing
advanced manipulation. Default: False.
debugging_port (int): Port for the browser debugging protocol. Default: 9222.
use_persistent_context (bool): Use a persistent browser context (like a persistent profile).
Automatically sets use_managed_browser=True. Default: False.
user_data_dir (str or None): Path to a user data directory for persistent sessions. If None, a
temporary directory may be used. Default: None.
chrome_channel (str): The Chrome channel to launch (e.g., "chrome", "msedge"). Only applies if browser_type
is "chromium". Default: "chromium".
channel (str): The channel to launch (e.g., "chromium", "chrome", "msedge"). Only applies if browser_type
is "chromium". Default: "chromium".
proxy (Optional[str]): Proxy server URL (e.g., "http://username:password@proxy:port"). If None, no proxy is used.
Default: None.
proxy_config (dict or None): Detailed proxy configuration, e.g. {"server": "...", "username": "..."}.
If None, no additional proxy config. Default: None.
viewport_width (int): Default viewport width for pages. Default: 1080.
viewport_height (int): Default viewport height for pages. Default: 600.
verbose (bool): Enable verbose logging.
Default: True.
accept_downloads (bool): Whether to allow file downloads. If True, requires a downloads_path.
Default: False.
downloads_path (str or None): Directory to store downloaded files. If None and accept_downloads is True,
a default path will be created. Default: None.
storage_state (str or dict or None): Path or object describing storage state (cookies, localStorage).
Default: None.
ignore_https_errors (bool): Ignore HTTPS certificate errors. Default: True.
java_script_enabled (bool): Enable JavaScript execution in pages. Default: True.
cookies (list): List of cookies to add to the browser context. Each cookie is a dict with fields like
{"name": "...", "value": "...", "url": "..."}.
Default: [].
headers (dict): Extra HTTP headers to apply to all requests in this context.
Default: {}.
user_agent (str): Custom User-Agent string to use. Default: "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
"AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36".
user_agent_mode (str or None): Mode for generating the user agent (e.g., "random"). If None, use the provided
user_agent as-is. Default: None.
user_agent_generator_config (dict or None): Configuration for user agent generation if user_agent_mode is set.
Default: None.
text_mode (bool): If True, disables images and other rich content for potentially faster load times.
Default: False.
light_mode (bool): Disables certain background features for performance gains. Default: False.
extra_args (list): Additional command-line arguments passed to the browser.
Default: [].
"""
def __init__(
self,
browser_type: str = "chromium",
headless: bool = True,
use_managed_browser: bool = False,
use_persistent_context: bool = False,
user_data_dir: str = None,
chrome_channel: str = "chromium",
channel: str = "chromium",
proxy: Optional[str] = None,
proxy_config: dict = None,
viewport_width: int = 1080,
viewport_height: int = 600,
accept_downloads: bool = False,
downloads_path: str = None,
storage_state=None,
ignore_https_errors: bool = True,
java_script_enabled: bool = True,
sleep_on_close: bool = False,
verbose: bool = True,
cookies: list = None,
headers: dict = None,
user_agent: str = (
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:109.0) AppleWebKit/537.36 "
"(KHTML, like Gecko) Chrome/116.0.5845.187 Safari/604.1 Edg/117.0.2045.47"
),
user_agent_mode: str = None,
user_agent_generator_config: dict = None,
text_mode: bool = False,
light_mode: bool = False,
extra_args: list = None,
debugging_port: int = 9222,
):
self.browser_type = browser_type
self.headless = headless
self.use_managed_browser = use_managed_browser
self.use_persistent_context = use_persistent_context
self.user_data_dir = user_data_dir
self.chrome_channel = chrome_channel or self.browser_type or "chromium"
self.channel = channel or self.browser_type or "chromium"
if self.browser_type in ["firefox", "webkit"]:
self.channel = ""
self.chrome_channel = ""
self.proxy = proxy
self.proxy_config = proxy_config
self.viewport_width = viewport_width
self.viewport_height = viewport_height
self.accept_downloads = accept_downloads
self.downloads_path = downloads_path
self.storage_state = storage_state
self.ignore_https_errors = ignore_https_errors
self.java_script_enabled = java_script_enabled
self.cookies = cookies if cookies is not None else []
self.headers = headers if headers is not None else {}
self.user_agent = user_agent
self.user_agent_mode = user_agent_mode
self.user_agent_generator_config = user_agent_generator_config
self.text_mode = text_mode
self.light_mode = light_mode
self.extra_args = extra_args if extra_args is not None else []
self.sleep_on_close = sleep_on_close
self.verbose = verbose
self.debugging_port = debugging_port
user_agenr_generator = UserAgentGenerator()
if self.user_agent_mode != "random" and self.user_agent_generator_config:
self.user_agent = user_agenr_generator.generate(
**(self.user_agent_generator_config or {})
)
elif self.user_agent_mode == "random":
self.user_agent = user_agenr_generator.generate()
else:
pass
self.browser_hint = user_agenr_generator.generate_client_hints(self.user_agent)
self.headers.setdefault("sec-ch-ua", self.browser_hint)
# If persistent context is requested, ensure managed browser is enabled
if self.use_persistent_context:
self.use_managed_browser = True
@staticmethod
def from_kwargs(kwargs: dict) -> "BrowserConfig":
return BrowserConfig(
browser_type=kwargs.get("browser_type", "chromium"),
headless=kwargs.get("headless", True),
use_managed_browser=kwargs.get("use_managed_browser", False),
use_persistent_context=kwargs.get("use_persistent_context", False),
user_data_dir=kwargs.get("user_data_dir"),
chrome_channel=kwargs.get("chrome_channel", "chromium"),
channel=kwargs.get("channel", "chromium"),
proxy=kwargs.get("proxy"),
proxy_config=kwargs.get("proxy_config"),
viewport_width=kwargs.get("viewport_width", 1080),
viewport_height=kwargs.get("viewport_height", 600),
accept_downloads=kwargs.get("accept_downloads", False),
downloads_path=kwargs.get("downloads_path"),
storage_state=kwargs.get("storage_state"),
ignore_https_errors=kwargs.get("ignore_https_errors", True),
java_script_enabled=kwargs.get("java_script_enabled", True),
cookies=kwargs.get("cookies", []),
headers=kwargs.get("headers", {}),
user_agent=kwargs.get(
"user_agent",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
"AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36",
),
user_agent_mode=kwargs.get("user_agent_mode"),
user_agent_generator_config=kwargs.get("user_agent_generator_config"),
text_mode=kwargs.get("text_mode", False),
light_mode=kwargs.get("light_mode", False),
extra_args=kwargs.get("extra_args", []),
)
def to_dict(self):
return {
"browser_type": self.browser_type,
"headless": self.headless,
"use_managed_browser": self.use_managed_browser,
"use_persistent_context": self.use_persistent_context,
"user_data_dir": self.user_data_dir,
"chrome_channel": self.chrome_channel,
"channel": self.channel,
"proxy": self.proxy,
"proxy_config": self.proxy_config,
"viewport_width": self.viewport_width,
"viewport_height": self.viewport_height,
"accept_downloads": self.accept_downloads,
"downloads_path": self.downloads_path,
"storage_state": self.storage_state,
"ignore_https_errors": self.ignore_https_errors,
"java_script_enabled": self.java_script_enabled,
"cookies": self.cookies,
"headers": self.headers,
"user_agent": self.user_agent,
"user_agent_mode": self.user_agent_mode,
"user_agent_generator_config": self.user_agent_generator_config,
"text_mode": self.text_mode,
"light_mode": self.light_mode,
"extra_args": self.extra_args,
"sleep_on_close": self.sleep_on_close,
"verbose": self.verbose,
"debugging_port": self.debugging_port,
}
def clone(self, **kwargs):
"""Create a copy of this configuration with updated values.
Args:
**kwargs: Key-value pairs of configuration options to update
Returns:
BrowserConfig: A new instance with the specified updates
"""
config_dict = self.to_dict()
config_dict.update(kwargs)
return BrowserConfig.from_kwargs(config_dict)
class CrawlerRunConfig:
"""
Configuration class for controlling how the crawler runs each crawl operation.
This includes parameters for content extraction, page manipulation, waiting conditions,
caching, and other runtime behaviors.
This centralizes parameters that were previously scattered as kwargs to `arun()` and related methods.
By using this class, you have a single place to understand and adjust the crawling options.
Attributes:
# Content Processing Parameters
word_count_threshold (int): Minimum word count threshold before processing content.
Default: MIN_WORD_THRESHOLD (typically 200).
extraction_strategy (ExtractionStrategy or None): Strategy to extract structured data from crawled pages.
Default: None (NoExtractionStrategy is used if None).
chunking_strategy (ChunkingStrategy): Strategy to chunk content before extraction.
Default: RegexChunking().
markdown_generator (MarkdownGenerationStrategy): Strategy for generating markdown.
Default: None.
content_filter (RelevantContentFilter or None): Optional filter to prune irrelevant content.
Default: None.
only_text (bool): If True, attempt to extract text-only content where applicable.
Default: False.
css_selector (str or None): CSS selector to extract a specific portion of the page.
Default: None.
excluded_tags (list of str or None): List of HTML tags to exclude from processing.
Default: None.
excluded_selector (str or None): CSS selector to exclude from processing.
Default: None.
keep_data_attributes (bool): If True, retain `data-*` attributes while removing unwanted attributes.
Default: False.
remove_forms (bool): If True, remove all `<form>` elements from the HTML.
Default: False.
prettiify (bool): If True, apply `fast_format_html` to produce prettified HTML output.
Default: False.
parser_type (str): Type of parser to use for HTML parsing.
Default: "lxml".
scraping_strategy (ContentScrapingStrategy): Scraping strategy to use.
Default: WebScrapingStrategy.
proxy_config (dict or None): Detailed proxy configuration, e.g. {"server": "...", "username": "..."}.
If None, no additional proxy config. Default: None.
# Caching Parameters
cache_mode (CacheMode or None): Defines how caching is handled.
If None, defaults to CacheMode.ENABLED internally.
Default: None.
session_id (str or None): Optional session ID to persist the browser context and the created
page instance. If the ID already exists, the crawler does not
create a new page and uses the current page to preserve the state.
bypass_cache (bool): Legacy parameter, if True acts like CacheMode.BYPASS.
Default: False.
disable_cache (bool): Legacy parameter, if True acts like CacheMode.DISABLED.
Default: False.
no_cache_read (bool): Legacy parameter, if True acts like CacheMode.WRITE_ONLY.
Default: False.
no_cache_write (bool): Legacy parameter, if True acts like CacheMode.READ_ONLY.
Default: False.
shared_data (dict or None): Shared data to be passed between hooks.
Default: None.
# Page Navigation and Timing Parameters
wait_until (str): The condition to wait for when navigating, e.g. "domcontentloaded".
Default: "domcontentloaded".
page_timeout (int): Timeout in ms for page operations like navigation.
Default: 60000 (60 seconds).
wait_for (str or None): A CSS selector or JS condition to wait for before extracting content.
Default: None.
wait_for_images (bool): If True, wait for images to load before extracting content.
Default: False.
delay_before_return_html (float): Delay in seconds before retrieving final HTML.
Default: 0.1.
mean_delay (float): Mean base delay between requests when calling arun_many.
Default: 0.1.
max_range (float): Max random additional delay range for requests in arun_many.
Default: 0.3.
semaphore_count (int): Number of concurrent operations allowed.
Default: 5.
# Page Interaction Parameters
js_code (str or list of str or None): JavaScript code/snippets to run on the page.
Default: None.
js_only (bool): If True, indicates subsequent calls are JS-driven updates, not full page loads.
Default: False.
ignore_body_visibility (bool): If True, ignore whether the body is visible before proceeding.
Default: True.
scan_full_page (bool): If True, scroll through the entire page to load all content.
Default: False.
scroll_delay (float): Delay in seconds between scroll steps if scan_full_page is True.
Default: 0.2.
process_iframes (bool): If True, attempts to process and inline iframe content.
Default: False.
remove_overlay_elements (bool): If True, remove overlays/popups before extracting HTML.
Default: False.
simulate_user (bool): If True, simulate user interactions (mouse moves, clicks) for anti-bot measures.
Default: False.
override_navigator (bool): If True, overrides navigator properties for more human-like behavior.
Default: False.
magic (bool): If True, attempts automatic handling of overlays/popups.
Default: False.
adjust_viewport_to_content (bool): If True, adjust viewport according to the page content dimensions.
Default: False.
# Media Handling Parameters
screenshot (bool): Whether to take a screenshot after crawling.
Default: False.
screenshot_wait_for (float or None): Additional wait time before taking a screenshot.
Default: None.
screenshot_height_threshold (int): Threshold for page height to decide screenshot strategy.
Default: SCREENSHOT_HEIGHT_TRESHOLD (from config, e.g. 20000).
pdf (bool): Whether to generate a PDF of the page.
Default: False.
image_description_min_word_threshold (int): Minimum words for image description extraction.
Default: IMAGE_DESCRIPTION_MIN_WORD_THRESHOLD (e.g., 50).
image_score_threshold (int): Minimum score threshold for processing an image.
Default: IMAGE_SCORE_THRESHOLD (e.g., 3).
exclude_external_images (bool): If True, exclude all external images from processing.
Default: False.
# Link and Domain Handling Parameters
exclude_social_media_domains (list of str): List of domains to exclude for social media links.
Default: SOCIAL_MEDIA_DOMAINS (from config).
exclude_external_links (bool): If True, exclude all external links from the results.
Default: False.
exclude_social_media_links (bool): If True, exclude links pointing to social media domains.
Default: False.
exclude_domains (list of str): List of specific domains to exclude from results.
Default: [].
# Debugging and Logging Parameters
verbose (bool): Enable verbose logging.
Default: True.
log_console (bool): If True, log console messages from the page.
Default: False.
# Streaming Parameters
stream (bool): If True, enables streaming of crawled URLs as they are processed when used with arun_many.
Default: False.
# Optional Parameters
stream (bool): If True, stream the page content as it is being loaded.
url: str = None # This is not a compulsory parameter
check_robots_txt (bool): Whether to check robots.txt rules before crawling. Default: False
"""
def __init__(
self,
# Content Processing Parameters
word_count_threshold: int = MIN_WORD_THRESHOLD,
extraction_strategy: ExtractionStrategy = None,
chunking_strategy: ChunkingStrategy = RegexChunking(),
markdown_generator: MarkdownGenerationStrategy = None,
content_filter=None,
only_text: bool = False,
css_selector: str = None,
excluded_tags: list = None,
excluded_selector: str = None,
keep_data_attributes: bool = False,
remove_forms: bool = False,
prettiify: bool = False,
parser_type: str = "lxml",
scraping_strategy: ContentScrapingStrategy = None,
proxy_config: dict = None,
# SSL Parameters
fetch_ssl_certificate: bool = False,
# Caching Parameters
cache_mode=None,
session_id: str = None,
bypass_cache: bool = False,
disable_cache: bool = False,
no_cache_read: bool = False,
no_cache_write: bool = False,
shared_data: dict = None,
# Page Navigation and Timing Parameters
wait_until: str = "domcontentloaded",
page_timeout: int = PAGE_TIMEOUT,
wait_for: str = None,
wait_for_images: bool = False,
delay_before_return_html: float = 0.1,
mean_delay: float = 0.1,
max_range: float = 0.3,
semaphore_count: int = 5,
# Page Interaction Parameters
js_code: Union[str, List[str]] = None,
js_only: bool = False,
ignore_body_visibility: bool = True,
scan_full_page: bool = False,
scroll_delay: float = 0.2,
process_iframes: bool = False,
remove_overlay_elements: bool = False,
simulate_user: bool = False,
override_navigator: bool = False,
magic: bool = False,
adjust_viewport_to_content: bool = False,
# Media Handling Parameters
screenshot: bool = False,
screenshot_wait_for: float = None,
screenshot_height_threshold: int = SCREENSHOT_HEIGHT_TRESHOLD,
pdf: bool = False,
image_description_min_word_threshold: int = IMAGE_DESCRIPTION_MIN_WORD_THRESHOLD,
image_score_threshold: int = IMAGE_SCORE_THRESHOLD,
exclude_external_images: bool = False,
# Link and Domain Handling Parameters
exclude_social_media_domains: list = None,
exclude_external_links: bool = False,
exclude_social_media_links: bool = False,
exclude_domains: list = None,
# Debugging and Logging Parameters
verbose: bool = True,
log_console: bool = False,
# Streaming Parameters
stream: bool = False,
url: str = None,
check_robots_txt: bool = False,
):
self.url = url
# Content Processing Parameters
self.word_count_threshold = word_count_threshold
self.extraction_strategy = extraction_strategy
self.chunking_strategy = chunking_strategy
self.markdown_generator = markdown_generator
self.content_filter = content_filter
self.only_text = only_text
self.css_selector = css_selector
self.excluded_tags = excluded_tags or []
self.excluded_selector = excluded_selector or ""
self.keep_data_attributes = keep_data_attributes
self.remove_forms = remove_forms
self.prettiify = prettiify
self.parser_type = parser_type
self.scraping_strategy = scraping_strategy or WebScrapingStrategy()
self.proxy_config = proxy_config
# SSL Parameters
self.fetch_ssl_certificate = fetch_ssl_certificate
# Caching Parameters
self.cache_mode = cache_mode
self.session_id = session_id
self.bypass_cache = bypass_cache
self.disable_cache = disable_cache
self.no_cache_read = no_cache_read
self.no_cache_write = no_cache_write
self.shared_data = shared_data
# Page Navigation and Timing Parameters
self.wait_until = wait_until
self.page_timeout = page_timeout
self.wait_for = wait_for
self.wait_for_images = wait_for_images
self.delay_before_return_html = delay_before_return_html
self.mean_delay = mean_delay
self.max_range = max_range
self.semaphore_count = semaphore_count
# Page Interaction Parameters
self.js_code = js_code
self.js_only = js_only
self.ignore_body_visibility = ignore_body_visibility
self.scan_full_page = scan_full_page
self.scroll_delay = scroll_delay
self.process_iframes = process_iframes
self.remove_overlay_elements = remove_overlay_elements
self.simulate_user = simulate_user
self.override_navigator = override_navigator
self.magic = magic
self.adjust_viewport_to_content = adjust_viewport_to_content
# Media Handling Parameters
self.screenshot = screenshot
self.screenshot_wait_for = screenshot_wait_for
self.screenshot_height_threshold = screenshot_height_threshold
self.pdf = pdf
self.image_description_min_word_threshold = image_description_min_word_threshold
self.image_score_threshold = image_score_threshold
self.exclude_external_images = exclude_external_images
# Link and Domain Handling Parameters
self.exclude_social_media_domains = (
exclude_social_media_domains or SOCIAL_MEDIA_DOMAINS
)
self.exclude_external_links = exclude_external_links
self.exclude_social_media_links = exclude_social_media_links
self.exclude_domains = exclude_domains or []
# Debugging and Logging Parameters
self.verbose = verbose
self.log_console = log_console
# Streaming Parameters
self.stream = stream
# Robots.txt Handling Parameters
self.check_robots_txt = check_robots_txt
# Validate type of extraction strategy and chunking strategy if they are provided
if self.extraction_strategy is not None and not isinstance(
self.extraction_strategy, ExtractionStrategy
):
raise ValueError(
"extraction_strategy must be an instance of ExtractionStrategy"
)
if self.chunking_strategy is not None and not isinstance(
self.chunking_strategy, ChunkingStrategy
):
raise ValueError(
"chunking_strategy must be an instance of ChunkingStrategy"
)
# Set default chunking strategy if None
if self.chunking_strategy is None:
self.chunking_strategy = RegexChunking()
@staticmethod
def from_kwargs(kwargs: dict) -> "CrawlerRunConfig":
return CrawlerRunConfig(
# Content Processing Parameters
word_count_threshold=kwargs.get("word_count_threshold", 200),
extraction_strategy=kwargs.get("extraction_strategy"),
chunking_strategy=kwargs.get("chunking_strategy", RegexChunking()),
markdown_generator=kwargs.get("markdown_generator"),
content_filter=kwargs.get("content_filter"),
only_text=kwargs.get("only_text", False),
css_selector=kwargs.get("css_selector"),
excluded_tags=kwargs.get("excluded_tags", []),
excluded_selector=kwargs.get("excluded_selector", ""),
keep_data_attributes=kwargs.get("keep_data_attributes", False),
remove_forms=kwargs.get("remove_forms", False),
prettiify=kwargs.get("prettiify", False),
parser_type=kwargs.get("parser_type", "lxml"),
scraping_strategy=kwargs.get("scraping_strategy"),
proxy_config=kwargs.get("proxy_config"),
# SSL Parameters
fetch_ssl_certificate=kwargs.get("fetch_ssl_certificate", False),
# Caching Parameters
cache_mode=kwargs.get("cache_mode"),
session_id=kwargs.get("session_id"),
bypass_cache=kwargs.get("bypass_cache", False),
disable_cache=kwargs.get("disable_cache", False),
no_cache_read=kwargs.get("no_cache_read", False),
no_cache_write=kwargs.get("no_cache_write", False),
shared_data=kwargs.get("shared_data", None),
# Page Navigation and Timing Parameters
wait_until=kwargs.get("wait_until", "domcontentloaded"),
page_timeout=kwargs.get("page_timeout", 60000),
wait_for=kwargs.get("wait_for"),
wait_for_images=kwargs.get("wait_for_images", False),
delay_before_return_html=kwargs.get("delay_before_return_html", 0.1),
mean_delay=kwargs.get("mean_delay", 0.1),
max_range=kwargs.get("max_range", 0.3),
semaphore_count=kwargs.get("semaphore_count", 5),
# Page Interaction Parameters
js_code=kwargs.get("js_code"),
js_only=kwargs.get("js_only", False),
ignore_body_visibility=kwargs.get("ignore_body_visibility", True),
scan_full_page=kwargs.get("scan_full_page", False),
scroll_delay=kwargs.get("scroll_delay", 0.2),
process_iframes=kwargs.get("process_iframes", False),
remove_overlay_elements=kwargs.get("remove_overlay_elements", False),
simulate_user=kwargs.get("simulate_user", False),
override_navigator=kwargs.get("override_navigator", False),
magic=kwargs.get("magic", False),
adjust_viewport_to_content=kwargs.get("adjust_viewport_to_content", False),
# Media Handling Parameters
screenshot=kwargs.get("screenshot", False),
screenshot_wait_for=kwargs.get("screenshot_wait_for"),
screenshot_height_threshold=kwargs.get(
"screenshot_height_threshold", SCREENSHOT_HEIGHT_TRESHOLD
),
pdf=kwargs.get("pdf", False),
image_description_min_word_threshold=kwargs.get(
"image_description_min_word_threshold",
IMAGE_DESCRIPTION_MIN_WORD_THRESHOLD,
),
image_score_threshold=kwargs.get(
"image_score_threshold", IMAGE_SCORE_THRESHOLD
),
exclude_external_images=kwargs.get("exclude_external_images", False),
# Link and Domain Handling Parameters
exclude_social_media_domains=kwargs.get(
"exclude_social_media_domains", SOCIAL_MEDIA_DOMAINS
),
exclude_external_links=kwargs.get("exclude_external_links", False),
exclude_social_media_links=kwargs.get("exclude_social_media_links", False),
exclude_domains=kwargs.get("exclude_domains", []),
# Debugging and Logging Parameters
verbose=kwargs.get("verbose", True),
log_console=kwargs.get("log_console", False),
# Streaming Parameters
stream=kwargs.get("stream", False),
url=kwargs.get("url"),
check_robots_txt=kwargs.get("check_robots_txt", False),
)
# Create a funciton returns dict of the object
def to_dict(self):
return {
"word_count_threshold": self.word_count_threshold,
"extraction_strategy": self.extraction_strategy,
"chunking_strategy": self.chunking_strategy,
"markdown_generator": self.markdown_generator,
"content_filter": self.content_filter,
"only_text": self.only_text,
"css_selector": self.css_selector,
"excluded_tags": self.excluded_tags,
"excluded_selector": self.excluded_selector,
"keep_data_attributes": self.keep_data_attributes,
"remove_forms": self.remove_forms,
"prettiify": self.prettiify,
"parser_type": self.parser_type,
"scraping_strategy": self.scraping_strategy,
"proxy_config": self.proxy_config,
"fetch_ssl_certificate": self.fetch_ssl_certificate,
"cache_mode": self.cache_mode,
"session_id": self.session_id,
"bypass_cache": self.bypass_cache,
"disable_cache": self.disable_cache,
"no_cache_read": self.no_cache_read,
"no_cache_write": self.no_cache_write,
"shared_data": self.shared_data,
"wait_until": self.wait_until,
"page_timeout": self.page_timeout,
"wait_for": self.wait_for,
"wait_for_images": self.wait_for_images,
"delay_before_return_html": self.delay_before_return_html,
"mean_delay": self.mean_delay,
"max_range": self.max_range,
"semaphore_count": self.semaphore_count,
"js_code": self.js_code,
"js_only": self.js_only,
"ignore_body_visibility": self.ignore_body_visibility,
"scan_full_page": self.scan_full_page,
"scroll_delay": self.scroll_delay,
"process_iframes": self.process_iframes,
"remove_overlay_elements": self.remove_overlay_elements,
"simulate_user": self.simulate_user,
"override_navigator": self.override_navigator,
"magic": self.magic,
"adjust_viewport_to_content": self.adjust_viewport_to_content,
"screenshot": self.screenshot,
"screenshot_wait_for": self.screenshot_wait_for,
"screenshot_height_threshold": self.screenshot_height_threshold,
"pdf": self.pdf,
"image_description_min_word_threshold": self.image_description_min_word_threshold,
"image_score_threshold": self.image_score_threshold,
"exclude_external_images": self.exclude_external_images,
"exclude_social_media_domains": self.exclude_social_media_domains,
"exclude_external_links": self.exclude_external_links,
"exclude_social_media_links": self.exclude_social_media_links,
"exclude_domains": self.exclude_domains,
"verbose": self.verbose,
"log_console": self.log_console,
"stream": self.stream,
"url": self.url,
"check_robots_txt": self.check_robots_txt,
}
def clone(self, **kwargs):
"""Create a copy of this configuration with updated values.
Args:
**kwargs: Key-value pairs of configuration options to update
Returns:
CrawlerRunConfig: A new instance with the specified updates
Example:
```python
# Create a new config with streaming enabled
stream_config = config.clone(stream=True)
# Create a new config with multiple updates
new_config = config.clone(
stream=True,
cache_mode=CacheMode.BYPASS,
verbose=True
)
```
"""
config_dict = self.to_dict()
config_dict.update(kwargs)
return CrawlerRunConfig.from_kwargs(config_dict)

File diff suppressed because it is too large Load Diff

View File

@@ -2,232 +2,19 @@ import os
from pathlib import Path
import aiosqlite
import asyncio
from typing import Optional, Dict
from contextlib import asynccontextmanager
import logging
import json # Added for serialization/deserialization
from .utils import ensure_content_dirs, generate_content_hash
from .models import CrawlResult, MarkdownGenerationResult
import aiofiles
from .version_manager import VersionManager
from .async_logger import AsyncLogger
from .utils import get_error_context, create_box_message
from typing import Optional, Tuple
# Set up logging
# logging.basicConfig(level=logging.INFO)
# logger = logging.getLogger(__name__)
# logger.setLevel(logging.INFO)
base_directory = DB_PATH = os.path.join(
os.getenv("CRAWL4_AI_BASE_DIRECTORY", Path.home()), ".crawl4ai"
)
DB_PATH = os.path.join(Path.home(), ".crawl4ai")
os.makedirs(DB_PATH, exist_ok=True)
DB_PATH = os.path.join(base_directory, "crawl4ai.db")
DB_PATH = os.path.join(DB_PATH, "crawl4ai.db")
class AsyncDatabaseManager:
def __init__(self, pool_size: int = 10, max_retries: int = 3):
def __init__(self):
self.db_path = DB_PATH
self.content_paths = ensure_content_dirs(os.path.dirname(DB_PATH))
self.pool_size = pool_size
self.max_retries = max_retries
self.connection_pool: Dict[int, aiosqlite.Connection] = {}
self.pool_lock = asyncio.Lock()
self.init_lock = asyncio.Lock()
self.connection_semaphore = asyncio.Semaphore(pool_size)
self._initialized = False
self.version_manager = VersionManager()
self.logger = AsyncLogger(
log_file=os.path.join(base_directory, ".crawl4ai", "crawler_db.log"),
verbose=False,
tag_width=10,
)
async def initialize(self):
"""Initialize the database and connection pool"""
try:
self.logger.info("Initializing database", tag="INIT")
# Ensure the database file exists
os.makedirs(os.path.dirname(self.db_path), exist_ok=True)
# Check if version update is needed
needs_update = self.version_manager.needs_update()
# Always ensure base table exists
await self.ainit_db()
# Verify the table exists
async with aiosqlite.connect(self.db_path, timeout=30.0) as db:
async with db.execute(
"SELECT name FROM sqlite_master WHERE type='table' AND name='crawled_data'"
) as cursor:
result = await cursor.fetchone()
if not result:
raise Exception("crawled_data table was not created")
# If version changed or fresh install, run updates
if needs_update:
self.logger.info("New version detected, running updates", tag="INIT")
await self.update_db_schema()
from .migrations import (
run_migration,
) # Import here to avoid circular imports
await run_migration()
self.version_manager.update_version() # Update stored version after successful migration
self.logger.success(
"Version update completed successfully", tag="COMPLETE"
)
else:
self.logger.success(
"Database initialization completed successfully", tag="COMPLETE"
)
except Exception as e:
self.logger.error(
message="Database initialization error: {error}",
tag="ERROR",
params={"error": str(e)},
)
self.logger.info(
message="Database will be initialized on first use", tag="INIT"
)
raise
async def cleanup(self):
"""Cleanup connections when shutting down"""
async with self.pool_lock:
for conn in self.connection_pool.values():
await conn.close()
self.connection_pool.clear()
@asynccontextmanager
async def get_connection(self):
"""Connection pool manager with enhanced error handling"""
if not self._initialized:
async with self.init_lock:
if not self._initialized:
try:
await self.initialize()
self._initialized = True
except Exception as e:
import sys
error_context = get_error_context(sys.exc_info())
self.logger.error(
message="Database initialization failed:\n{error}\n\nContext:\n{context}\n\nTraceback:\n{traceback}",
tag="ERROR",
force_verbose=True,
params={
"error": str(e),
"context": error_context["code_context"],
"traceback": error_context["full_traceback"],
},
)
raise
await self.connection_semaphore.acquire()
task_id = id(asyncio.current_task())
try:
async with self.pool_lock:
if task_id not in self.connection_pool:
try:
conn = await aiosqlite.connect(self.db_path, timeout=30.0)
await conn.execute("PRAGMA journal_mode = WAL")
await conn.execute("PRAGMA busy_timeout = 5000")
# Verify database structure
async with conn.execute(
"PRAGMA table_info(crawled_data)"
) as cursor:
columns = await cursor.fetchall()
column_names = [col[1] for col in columns]
expected_columns = {
"url",
"html",
"cleaned_html",
"markdown",
"extracted_content",
"success",
"media",
"links",
"metadata",
"screenshot",
"response_headers",
"downloaded_files",
}
missing_columns = expected_columns - set(column_names)
if missing_columns:
raise ValueError(
f"Database missing columns: {missing_columns}"
)
self.connection_pool[task_id] = conn
except Exception as e:
import sys
error_context = get_error_context(sys.exc_info())
error_message = (
f"Unexpected error in db get_connection at line {error_context['line_no']} "
f"in {error_context['function']} ({error_context['filename']}):\n"
f"Error: {str(e)}\n\n"
f"Code context:\n{error_context['code_context']}"
)
self.logger.error(
message=create_box_message(error_message, type="error"),
)
raise
yield self.connection_pool[task_id]
except Exception as e:
import sys
error_context = get_error_context(sys.exc_info())
error_message = (
f"Unexpected error in db get_connection at line {error_context['line_no']} "
f"in {error_context['function']} ({error_context['filename']}):\n"
f"Error: {str(e)}\n\n"
f"Code context:\n{error_context['code_context']}"
)
self.logger.error(
message=create_box_message(error_message, type="error"),
)
raise
finally:
async with self.pool_lock:
if task_id in self.connection_pool:
await self.connection_pool[task_id].close()
del self.connection_pool[task_id]
self.connection_semaphore.release()
async def execute_with_retry(self, operation, *args):
"""Execute database operations with retry logic"""
for attempt in range(self.max_retries):
try:
async with self.get_connection() as db:
result = await operation(db, *args)
await db.commit()
return result
except Exception as e:
if attempt == self.max_retries - 1:
self.logger.error(
message="Operation failed after {retries} attempts: {error}",
tag="ERROR",
force_verbose=True,
params={"retries": self.max_retries, "error": str(e)},
)
raise
await asyncio.sleep(1 * (attempt + 1)) # Exponential backoff
async def ainit_db(self):
"""Initialize database schema"""
async with aiosqlite.connect(self.db_path, timeout=30.0) as db:
await db.execute(
"""
async with aiosqlite.connect(self.db_path) as db:
await db.execute('''
CREATE TABLE IF NOT EXISTS crawled_data (
url TEXT PRIMARY KEY,
html TEXT,
@@ -238,321 +25,90 @@ class AsyncDatabaseManager:
media TEXT DEFAULT "{}",
links TEXT DEFAULT "{}",
metadata TEXT DEFAULT "{}",
screenshot TEXT DEFAULT "",
response_headers TEXT DEFAULT "{}",
downloaded_files TEXT DEFAULT "{}" -- New column added
screenshot TEXT DEFAULT ""
)
"""
)
''')
await db.commit()
await self.update_db_schema()
async def update_db_schema(self):
"""Update database schema if needed"""
async with aiosqlite.connect(self.db_path, timeout=30.0) as db:
async with aiosqlite.connect(self.db_path) as db:
# Check if the 'media' column exists
cursor = await db.execute("PRAGMA table_info(crawled_data)")
columns = await cursor.fetchall()
column_names = [column[1] for column in columns]
# List of new columns to add
new_columns = [
"media",
"links",
"metadata",
"screenshot",
"response_headers",
"downloaded_files",
]
for column in new_columns:
if 'media' not in column_names:
await self.aalter_db_add_column('media')
# Check for other missing columns and add them if necessary
for column in ['links', 'metadata', 'screenshot']:
if column not in column_names:
await self.aalter_db_add_column(column, db)
await db.commit()
async def aalter_db_add_column(self, new_column: str, db):
"""Add new column to the database"""
if new_column == "response_headers":
await db.execute(
f'ALTER TABLE crawled_data ADD COLUMN {new_column} TEXT DEFAULT "{{}}"'
)
else:
await db.execute(
f'ALTER TABLE crawled_data ADD COLUMN {new_column} TEXT DEFAULT ""'
)
self.logger.info(
message="Added column '{column}' to the database",
tag="INIT",
params={"column": new_column},
)
async def aget_cached_url(self, url: str) -> Optional[CrawlResult]:
"""Retrieve cached URL data as CrawlResult"""
async def _get(db):
async with db.execute(
"SELECT * FROM crawled_data WHERE url = ?", (url,)
) as cursor:
row = await cursor.fetchone()
if not row:
return None
# Get column names
columns = [description[0] for description in cursor.description]
# Create dict from row data
row_dict = dict(zip(columns, row))
# Load content from files using stored hashes
content_fields = {
"html": row_dict["html"],
"cleaned_html": row_dict["cleaned_html"],
"markdown": row_dict["markdown"],
"extracted_content": row_dict["extracted_content"],
"screenshot": row_dict["screenshot"],
"screenshots": row_dict["screenshot"],
}
for field, hash_value in content_fields.items():
if hash_value:
content = await self._load_content(
hash_value,
field.split("_")[0], # Get content type from field name
)
row_dict[field] = content or ""
else:
row_dict[field] = ""
# Parse JSON fields
json_fields = [
"media",
"links",
"metadata",
"response_headers",
"markdown",
]
for field in json_fields:
try:
row_dict[field] = (
json.loads(row_dict[field]) if row_dict[field] else {}
)
except json.JSONDecodeError:
# Very UGLY, never mention it to me please
if field == "markdown" and isinstance(row_dict[field], str):
row_dict[field] = row_dict[field]
else:
row_dict[field] = {}
if isinstance(row_dict["markdown"], Dict):
row_dict["markdown_v2"] = row_dict["markdown"]
if row_dict["markdown"].get("raw_markdown"):
row_dict["markdown"] = row_dict["markdown"]["raw_markdown"]
# Parse downloaded_files
try:
row_dict["downloaded_files"] = (
json.loads(row_dict["downloaded_files"])
if row_dict["downloaded_files"]
else []
)
except json.JSONDecodeError:
row_dict["downloaded_files"] = []
# Remove any fields not in CrawlResult model
valid_fields = CrawlResult.__annotations__.keys()
filtered_dict = {k: v for k, v in row_dict.items() if k in valid_fields}
return CrawlResult(**filtered_dict)
await self.aalter_db_add_column(column)
async def aalter_db_add_column(self, new_column: str):
try:
return await self.execute_with_retry(_get)
async with aiosqlite.connect(self.db_path) as db:
await db.execute(f'ALTER TABLE crawled_data ADD COLUMN {new_column} TEXT DEFAULT ""')
await db.commit()
print(f"Added column '{new_column}' to the database.")
except Exception as e:
self.logger.error(
message="Error retrieving cached URL: {error}",
tag="ERROR",
force_verbose=True,
params={"error": str(e)},
)
print(f"Error altering database to add {new_column} column: {e}")
async def aget_cached_url(self, url: str) -> Optional[Tuple[str, str, str, str, str, str, str, bool, str]]:
try:
async with aiosqlite.connect(self.db_path) as db:
async with db.execute('SELECT url, html, cleaned_html, markdown, extracted_content, success, media, links, metadata, screenshot FROM crawled_data WHERE url = ?', (url,)) as cursor:
return await cursor.fetchone()
except Exception as e:
print(f"Error retrieving cached URL: {e}")
return None
async def acache_url(self, result: CrawlResult):
"""Cache CrawlResult data"""
# Store content files and get hashes
content_map = {
"html": (result.html, "html"),
"cleaned_html": (result.cleaned_html or "", "cleaned"),
"markdown": None,
"extracted_content": (result.extracted_content or "", "extracted"),
"screenshot": (result.screenshot or "", "screenshots"),
}
async def acache_url(self, url: str, html: str, cleaned_html: str, markdown: str, extracted_content: str, success: bool, media: str = "{}", links: str = "{}", metadata: str = "{}", screenshot: str = ""):
try:
if isinstance(result.markdown, MarkdownGenerationResult):
content_map["markdown"] = (
result.markdown.model_dump_json(),
"markdown",
)
elif hasattr(result, "markdown_v2"):
content_map["markdown"] = (
result.markdown_v2.model_dump_json(),
"markdown",
)
elif isinstance(result.markdown, str):
markdown_result = MarkdownGenerationResult(raw_markdown=result.markdown)
content_map["markdown"] = (
markdown_result.model_dump_json(),
"markdown",
)
else:
content_map["markdown"] = (
MarkdownGenerationResult().model_dump_json(),
"markdown",
)
async with aiosqlite.connect(self.db_path) as db:
await db.execute('''
INSERT INTO crawled_data (url, html, cleaned_html, markdown, extracted_content, success, media, links, metadata, screenshot)
VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
ON CONFLICT(url) DO UPDATE SET
html = excluded.html,
cleaned_html = excluded.cleaned_html,
markdown = excluded.markdown,
extracted_content = excluded.extracted_content,
success = excluded.success,
media = excluded.media,
links = excluded.links,
metadata = excluded.metadata,
screenshot = excluded.screenshot
''', (url, html, cleaned_html, markdown, extracted_content, success, media, links, metadata, screenshot))
await db.commit()
except Exception as e:
self.logger.warning(
message=f"Error processing markdown content: {str(e)}", tag="WARNING"
)
# Fallback to empty markdown result
content_map["markdown"] = (
MarkdownGenerationResult().model_dump_json(),
"markdown",
)
content_hashes = {}
for field, (content, content_type) in content_map.items():
content_hashes[field] = await self._store_content(content, content_type)
async def _cache(db):
await db.execute(
"""
INSERT INTO crawled_data (
url, html, cleaned_html, markdown,
extracted_content, success, media, links, metadata,
screenshot, response_headers, downloaded_files
)
VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
ON CONFLICT(url) DO UPDATE SET
html = excluded.html,
cleaned_html = excluded.cleaned_html,
markdown = excluded.markdown,
extracted_content = excluded.extracted_content,
success = excluded.success,
media = excluded.media,
links = excluded.links,
metadata = excluded.metadata,
screenshot = excluded.screenshot,
response_headers = excluded.response_headers,
downloaded_files = excluded.downloaded_files
""",
(
result.url,
content_hashes["html"],
content_hashes["cleaned_html"],
content_hashes["markdown"],
content_hashes["extracted_content"],
result.success,
json.dumps(result.media),
json.dumps(result.links),
json.dumps(result.metadata or {}),
content_hashes["screenshot"],
json.dumps(result.response_headers or {}),
json.dumps(result.downloaded_files or []),
),
)
try:
await self.execute_with_retry(_cache)
except Exception as e:
self.logger.error(
message="Error caching URL: {error}",
tag="ERROR",
force_verbose=True,
params={"error": str(e)},
)
print(f"Error caching URL: {e}")
async def aget_total_count(self) -> int:
"""Get total number of cached URLs"""
async def _count(db):
async with db.execute("SELECT COUNT(*) FROM crawled_data") as cursor:
result = await cursor.fetchone()
return result[0] if result else 0
try:
return await self.execute_with_retry(_count)
async with aiosqlite.connect(self.db_path) as db:
async with db.execute('SELECT COUNT(*) FROM crawled_data') as cursor:
result = await cursor.fetchone()
return result[0] if result else 0
except Exception as e:
self.logger.error(
message="Error getting total count: {error}",
tag="ERROR",
force_verbose=True,
params={"error": str(e)},
)
print(f"Error getting total count: {e}")
return 0
async def aclear_db(self):
"""Clear all data from the database"""
async def _clear(db):
await db.execute("DELETE FROM crawled_data")
try:
await self.execute_with_retry(_clear)
async with aiosqlite.connect(self.db_path) as db:
await db.execute('DELETE FROM crawled_data')
await db.commit()
except Exception as e:
self.logger.error(
message="Error clearing database: {error}",
tag="ERROR",
force_verbose=True,
params={"error": str(e)},
)
print(f"Error clearing database: {e}")
async def aflush_db(self):
"""Drop the entire table"""
async def _flush(db):
await db.execute("DROP TABLE IF EXISTS crawled_data")
try:
await self.execute_with_retry(_flush)
async with aiosqlite.connect(self.db_path) as db:
await db.execute('DROP TABLE IF EXISTS crawled_data')
await db.commit()
except Exception as e:
self.logger.error(
message="Error flushing database: {error}",
tag="ERROR",
force_verbose=True,
params={"error": str(e)},
)
print(f"Error flushing database: {e}")
async def _store_content(self, content: str, content_type: str) -> str:
"""Store content in filesystem and return hash"""
if not content:
return ""
content_hash = generate_content_hash(content)
file_path = os.path.join(self.content_paths[content_type], content_hash)
# Only write if file doesn't exist
if not os.path.exists(file_path):
async with aiofiles.open(file_path, "w", encoding="utf-8") as f:
await f.write(content)
return content_hash
async def _load_content(
self, content_hash: str, content_type: str
) -> Optional[str]:
"""Load content from filesystem by hash"""
if not content_hash:
return None
file_path = os.path.join(self.content_paths[content_type], content_hash)
try:
async with aiofiles.open(file_path, "r", encoding="utf-8") as f:
return await f.read()
except:
self.logger.error(
message="Failed to load content: {file_path}",
tag="ERROR",
force_verbose=True,
params={"file_path": file_path},
)
return None
# Create a singleton instance
async_db_manager = AsyncDatabaseManager()
async_db_manager = AsyncDatabaseManager()

View File

@@ -1,647 +0,0 @@
from typing import Dict, Optional, List, Tuple
from .async_configs import CrawlerRunConfig
from .models import (
CrawlResult,
CrawlerTaskResult,
CrawlStatus,
DisplayMode,
CrawlStats,
DomainState,
)
from rich.live import Live
from rich.table import Table
from rich.console import Console
from rich import box
from datetime import datetime, timedelta
from collections.abc import AsyncGenerator
import time
import psutil
import asyncio
import uuid
from urllib.parse import urlparse
import random
from abc import ABC, abstractmethod
class RateLimiter:
def __init__(
self,
base_delay: Tuple[float, float] = (1.0, 3.0),
max_delay: float = 60.0,
max_retries: int = 3,
rate_limit_codes: List[int] = None,
):
self.base_delay = base_delay
self.max_delay = max_delay
self.max_retries = max_retries
self.rate_limit_codes = rate_limit_codes or [429, 503]
self.domains: Dict[str, DomainState] = {}
def get_domain(self, url: str) -> str:
return urlparse(url).netloc
async def wait_if_needed(self, url: str) -> None:
domain = self.get_domain(url)
state = self.domains.get(domain)
if not state:
self.domains[domain] = DomainState()
state = self.domains[domain]
now = time.time()
if state.last_request_time:
wait_time = max(0, state.current_delay - (now - state.last_request_time))
if wait_time > 0:
await asyncio.sleep(wait_time)
# Random delay within base range if no current delay
if state.current_delay == 0:
state.current_delay = random.uniform(*self.base_delay)
state.last_request_time = time.time()
def update_delay(self, url: str, status_code: int) -> bool:
domain = self.get_domain(url)
state = self.domains[domain]
if status_code in self.rate_limit_codes:
state.fail_count += 1
if state.fail_count > self.max_retries:
return False
# Exponential backoff with random jitter
state.current_delay = min(
state.current_delay * 2 * random.uniform(0.75, 1.25), self.max_delay
)
else:
# Gradually reduce delay on success
state.current_delay = max(
random.uniform(*self.base_delay), state.current_delay * 0.75
)
state.fail_count = 0
return True
class CrawlerMonitor:
def __init__(
self,
max_visible_rows: int = 15,
display_mode: DisplayMode = DisplayMode.DETAILED,
):
self.console = Console()
self.max_visible_rows = max_visible_rows
self.display_mode = display_mode
self.stats: Dict[str, CrawlStats] = {}
self.process = psutil.Process()
self.start_time = datetime.now()
self.live = Live(self._create_table(), refresh_per_second=2)
def start(self):
self.live.start()
def stop(self):
self.live.stop()
def add_task(self, task_id: str, url: str):
self.stats[task_id] = CrawlStats(
task_id=task_id, url=url, status=CrawlStatus.QUEUED
)
self.live.update(self._create_table())
def update_task(self, task_id: str, **kwargs):
if task_id in self.stats:
for key, value in kwargs.items():
setattr(self.stats[task_id], key, value)
self.live.update(self._create_table())
def _create_aggregated_table(self) -> Table:
"""Creates a compact table showing only aggregated statistics"""
table = Table(
box=box.ROUNDED,
title="Crawler Status Overview",
title_style="bold magenta",
header_style="bold blue",
show_lines=True,
)
# Calculate statistics
total_tasks = len(self.stats)
queued = sum(
1 for stat in self.stats.values() if stat.status == CrawlStatus.QUEUED
)
in_progress = sum(
1 for stat in self.stats.values() if stat.status == CrawlStatus.IN_PROGRESS
)
completed = sum(
1 for stat in self.stats.values() if stat.status == CrawlStatus.COMPLETED
)
failed = sum(
1 for stat in self.stats.values() if stat.status == CrawlStatus.FAILED
)
# Memory statistics
current_memory = self.process.memory_info().rss / (1024 * 1024)
total_task_memory = sum(stat.memory_usage for stat in self.stats.values())
peak_memory = max(
(stat.peak_memory for stat in self.stats.values()), default=0.0
)
# Duration
duration = datetime.now() - self.start_time
# Create status row
table.add_column("Status", style="bold cyan")
table.add_column("Count", justify="right")
table.add_column("Percentage", justify="right")
table.add_row("Total Tasks", str(total_tasks), "100%")
table.add_row(
"[yellow]In Queue[/yellow]",
str(queued),
f"{(queued/total_tasks*100):.1f}%" if total_tasks > 0 else "0%",
)
table.add_row(
"[blue]In Progress[/blue]",
str(in_progress),
f"{(in_progress/total_tasks*100):.1f}%" if total_tasks > 0 else "0%",
)
table.add_row(
"[green]Completed[/green]",
str(completed),
f"{(completed/total_tasks*100):.1f}%" if total_tasks > 0 else "0%",
)
table.add_row(
"[red]Failed[/red]",
str(failed),
f"{(failed/total_tasks*100):.1f}%" if total_tasks > 0 else "0%",
)
# Add memory information
table.add_section()
table.add_row(
"[magenta]Current Memory[/magenta]", f"{current_memory:.1f} MB", ""
)
table.add_row(
"[magenta]Total Task Memory[/magenta]", f"{total_task_memory:.1f} MB", ""
)
table.add_row(
"[magenta]Peak Task Memory[/magenta]", f"{peak_memory:.1f} MB", ""
)
table.add_row(
"[yellow]Runtime[/yellow]",
str(timedelta(seconds=int(duration.total_seconds()))),
"",
)
return table
def _create_detailed_table(self) -> Table:
table = Table(
box=box.ROUNDED,
title="Crawler Performance Monitor",
title_style="bold magenta",
header_style="bold blue",
)
# Add columns
table.add_column("Task ID", style="cyan", no_wrap=True)
table.add_column("URL", style="cyan", no_wrap=True)
table.add_column("Status", style="bold")
table.add_column("Memory (MB)", justify="right")
table.add_column("Peak (MB)", justify="right")
table.add_column("Duration", justify="right")
table.add_column("Info", style="italic")
# Add summary row
total_memory = sum(stat.memory_usage for stat in self.stats.values())
active_count = sum(
1 for stat in self.stats.values() if stat.status == CrawlStatus.IN_PROGRESS
)
completed_count = sum(
1 for stat in self.stats.values() if stat.status == CrawlStatus.COMPLETED
)
failed_count = sum(
1 for stat in self.stats.values() if stat.status == CrawlStatus.FAILED
)
table.add_row(
"[bold yellow]SUMMARY",
f"Total: {len(self.stats)}",
f"Active: {active_count}",
f"{total_memory:.1f}",
f"{self.process.memory_info().rss / (1024 * 1024):.1f}",
str(
timedelta(
seconds=int((datetime.now() - self.start_time).total_seconds())
)
),
f"{completed_count}{failed_count}",
style="bold",
)
table.add_section()
# Add rows for each task
visible_stats = sorted(
self.stats.values(),
key=lambda x: (
x.status != CrawlStatus.IN_PROGRESS,
x.status != CrawlStatus.QUEUED,
x.end_time or datetime.max,
),
)[: self.max_visible_rows]
for stat in visible_stats:
status_style = {
CrawlStatus.QUEUED: "white",
CrawlStatus.IN_PROGRESS: "yellow",
CrawlStatus.COMPLETED: "green",
CrawlStatus.FAILED: "red",
}[stat.status]
table.add_row(
stat.task_id[:8], # Show first 8 chars of task ID
stat.url[:40] + "..." if len(stat.url) > 40 else stat.url,
f"[{status_style}]{stat.status.value}[/{status_style}]",
f"{stat.memory_usage:.1f}",
f"{stat.peak_memory:.1f}",
stat.duration,
stat.error_message[:40] if stat.error_message else "",
)
return table
def _create_table(self) -> Table:
"""Creates the appropriate table based on display mode"""
if self.display_mode == DisplayMode.AGGREGATED:
return self._create_aggregated_table()
return self._create_detailed_table()
class BaseDispatcher(ABC):
def __init__(
self,
rate_limiter: Optional[RateLimiter] = None,
monitor: Optional[CrawlerMonitor] = None,
):
self.crawler = None
self._domain_last_hit: Dict[str, float] = {}
self.concurrent_sessions = 0
self.rate_limiter = rate_limiter
self.monitor = monitor
@abstractmethod
async def crawl_url(
self,
url: str,
config: CrawlerRunConfig,
task_id: str,
monitor: Optional[CrawlerMonitor] = None,
) -> CrawlerTaskResult:
pass
@abstractmethod
async def run_urls(
self,
urls: List[str],
crawler: "AsyncWebCrawler", # noqa: F821
config: CrawlerRunConfig,
monitor: Optional[CrawlerMonitor] = None,
) -> List[CrawlerTaskResult]:
pass
class MemoryAdaptiveDispatcher(BaseDispatcher):
def __init__(
self,
memory_threshold_percent: float = 90.0,
check_interval: float = 1.0,
max_session_permit: int = 20,
memory_wait_timeout: float = 300.0, # 5 minutes default timeout
rate_limiter: Optional[RateLimiter] = None,
monitor: Optional[CrawlerMonitor] = None,
):
super().__init__(rate_limiter, monitor)
self.memory_threshold_percent = memory_threshold_percent
self.check_interval = check_interval
self.max_session_permit = max_session_permit
self.memory_wait_timeout = memory_wait_timeout
self.result_queue = asyncio.Queue() # Queue for storing results
async def crawl_url(
self,
url: str,
config: CrawlerRunConfig,
task_id: str,
) -> CrawlerTaskResult:
start_time = datetime.now()
error_message = ""
memory_usage = peak_memory = 0.0
try:
if self.monitor:
self.monitor.update_task(
task_id, status=CrawlStatus.IN_PROGRESS, start_time=start_time
)
self.concurrent_sessions += 1
if self.rate_limiter:
await self.rate_limiter.wait_if_needed(url)
process = psutil.Process()
start_memory = process.memory_info().rss / (1024 * 1024)
result = await self.crawler.arun(url, config=config, session_id=task_id)
end_memory = process.memory_info().rss / (1024 * 1024)
memory_usage = peak_memory = end_memory - start_memory
if self.rate_limiter and result.status_code:
if not self.rate_limiter.update_delay(url, result.status_code):
error_message = f"Rate limit retry count exceeded for domain {urlparse(url).netloc}"
if self.monitor:
self.monitor.update_task(task_id, status=CrawlStatus.FAILED)
result = CrawlerTaskResult(
task_id=task_id,
url=url,
result=result,
memory_usage=memory_usage,
peak_memory=peak_memory,
start_time=start_time,
end_time=datetime.now(),
error_message=error_message,
)
await self.result_queue.put(result)
return result
if not result.success:
error_message = result.error_message
if self.monitor:
self.monitor.update_task(task_id, status=CrawlStatus.FAILED)
elif self.monitor:
self.monitor.update_task(task_id, status=CrawlStatus.COMPLETED)
except Exception as e:
error_message = str(e)
if self.monitor:
self.monitor.update_task(task_id, status=CrawlStatus.FAILED)
result = CrawlResult(
url=url, html="", metadata={}, success=False, error_message=str(e)
)
finally:
end_time = datetime.now()
if self.monitor:
self.monitor.update_task(
task_id,
end_time=end_time,
memory_usage=memory_usage,
peak_memory=peak_memory,
error_message=error_message,
)
self.concurrent_sessions -= 1
return CrawlerTaskResult(
task_id=task_id,
url=url,
result=result,
memory_usage=memory_usage,
peak_memory=peak_memory,
start_time=start_time,
end_time=end_time,
error_message=error_message,
)
async def run_urls(
self,
urls: List[str],
crawler: "AsyncWebCrawler", # noqa: F821
config: CrawlerRunConfig,
) -> List[CrawlerTaskResult]:
self.crawler = crawler
if self.monitor:
self.monitor.start()
try:
pending_tasks = []
active_tasks = []
task_queue = []
for url in urls:
task_id = str(uuid.uuid4())
if self.monitor:
self.monitor.add_task(task_id, url)
task_queue.append((url, task_id))
while task_queue or active_tasks:
wait_start_time = time.time()
while len(active_tasks) < self.max_session_permit and task_queue:
if psutil.virtual_memory().percent >= self.memory_threshold_percent:
# Check if we've exceeded the timeout
if time.time() - wait_start_time > self.memory_wait_timeout:
raise MemoryError(
f"Memory usage above threshold ({self.memory_threshold_percent}%) for more than {self.memory_wait_timeout} seconds"
)
await asyncio.sleep(self.check_interval)
continue
url, task_id = task_queue.pop(0)
task = asyncio.create_task(self.crawl_url(url, config, task_id))
active_tasks.append(task)
if not active_tasks:
await asyncio.sleep(self.check_interval)
continue
done, pending = await asyncio.wait(
active_tasks, return_when=asyncio.FIRST_COMPLETED
)
pending_tasks.extend(done)
active_tasks = list(pending)
return await asyncio.gather(*pending_tasks)
finally:
if self.monitor:
self.monitor.stop()
async def run_urls_stream(
self,
urls: List[str],
crawler: "AsyncWebCrawler",
config: CrawlerRunConfig,
) -> AsyncGenerator[CrawlerTaskResult, None]:
self.crawler = crawler
if self.monitor:
self.monitor.start()
try:
active_tasks = []
task_queue = []
completed_count = 0
total_urls = len(urls)
# Initialize task queue
for url in urls:
task_id = str(uuid.uuid4())
if self.monitor:
self.monitor.add_task(task_id, url)
task_queue.append((url, task_id))
while completed_count < total_urls:
# Start new tasks if memory permits
while len(active_tasks) < self.max_session_permit and task_queue:
if psutil.virtual_memory().percent >= self.memory_threshold_percent:
await asyncio.sleep(self.check_interval)
continue
url, task_id = task_queue.pop(0)
task = asyncio.create_task(self.crawl_url(url, config, task_id))
active_tasks.append(task)
if not active_tasks and not task_queue:
break
# Wait for any task to complete and yield results
if active_tasks:
done, pending = await asyncio.wait(
active_tasks,
timeout=0.1,
return_when=asyncio.FIRST_COMPLETED
)
for completed_task in done:
result = await completed_task
completed_count += 1
yield result
active_tasks = list(pending)
else:
await asyncio.sleep(self.check_interval)
finally:
if self.monitor:
self.monitor.stop()
class SemaphoreDispatcher(BaseDispatcher):
def __init__(
self,
semaphore_count: int = 5,
max_session_permit: int = 20,
rate_limiter: Optional[RateLimiter] = None,
monitor: Optional[CrawlerMonitor] = None,
):
super().__init__(rate_limiter, monitor)
self.semaphore_count = semaphore_count
self.max_session_permit = max_session_permit
async def crawl_url(
self,
url: str,
config: CrawlerRunConfig,
task_id: str,
semaphore: asyncio.Semaphore = None,
) -> CrawlerTaskResult:
start_time = datetime.now()
error_message = ""
memory_usage = peak_memory = 0.0
try:
if self.monitor:
self.monitor.update_task(
task_id, status=CrawlStatus.IN_PROGRESS, start_time=start_time
)
if self.rate_limiter:
await self.rate_limiter.wait_if_needed(url)
async with semaphore:
process = psutil.Process()
start_memory = process.memory_info().rss / (1024 * 1024)
result = await self.crawler.arun(url, config=config, session_id=task_id)
end_memory = process.memory_info().rss / (1024 * 1024)
memory_usage = peak_memory = end_memory - start_memory
if self.rate_limiter and result.status_code:
if not self.rate_limiter.update_delay(url, result.status_code):
error_message = f"Rate limit retry count exceeded for domain {urlparse(url).netloc}"
if self.monitor:
self.monitor.update_task(task_id, status=CrawlStatus.FAILED)
return CrawlerTaskResult(
task_id=task_id,
url=url,
result=result,
memory_usage=memory_usage,
peak_memory=peak_memory,
start_time=start_time,
end_time=datetime.now(),
error_message=error_message,
)
if not result.success:
error_message = result.error_message
if self.monitor:
self.monitor.update_task(task_id, status=CrawlStatus.FAILED)
elif self.monitor:
self.monitor.update_task(task_id, status=CrawlStatus.COMPLETED)
except Exception as e:
error_message = str(e)
if self.monitor:
self.monitor.update_task(task_id, status=CrawlStatus.FAILED)
result = CrawlResult(
url=url, html="", metadata={}, success=False, error_message=str(e)
)
finally:
end_time = datetime.now()
if self.monitor:
self.monitor.update_task(
task_id,
end_time=end_time,
memory_usage=memory_usage,
peak_memory=peak_memory,
error_message=error_message,
)
return CrawlerTaskResult(
task_id=task_id,
url=url,
result=result,
memory_usage=memory_usage,
peak_memory=peak_memory,
start_time=start_time,
end_time=end_time,
error_message=error_message,
)
async def run_urls(
self,
crawler: "AsyncWebCrawler", # noqa: F821
urls: List[str],
config: CrawlerRunConfig,
) -> List[CrawlerTaskResult]:
self.crawler = crawler
if self.monitor:
self.monitor.start()
try:
semaphore = asyncio.Semaphore(self.semaphore_count)
tasks = []
for url in urls:
task_id = str(uuid.uuid4())
if self.monitor:
self.monitor.add_task(task_id, url)
task = asyncio.create_task(
self.crawl_url(url, config, task_id, semaphore)
)
tasks.append(task)
return await asyncio.gather(*tasks, return_exceptions=True)
finally:
if self.monitor:
self.monitor.stop()

View File

@@ -1,588 +0,0 @@
from typing import Dict, Optional, List, Tuple
from .async_configs import CrawlerRunConfig
from .models import (
CrawlResult,
CrawlerTaskResult,
CrawlStatus,
DisplayMode,
CrawlStats,
DomainState,
)
from rich.live import Live
from rich.table import Table
from rich.console import Console
from rich import box
from datetime import datetime, timedelta
import time
import psutil
import asyncio
import uuid
from urllib.parse import urlparse
import random
from abc import ABC, abstractmethod
class RateLimiter:
def __init__(
self,
base_delay: Tuple[float, float] = (1.0, 3.0),
max_delay: float = 60.0,
max_retries: int = 3,
rate_limit_codes: List[int] = None,
):
self.base_delay = base_delay
self.max_delay = max_delay
self.max_retries = max_retries
self.rate_limit_codes = rate_limit_codes or [429, 503]
self.domains: Dict[str, DomainState] = {}
def get_domain(self, url: str) -> str:
return urlparse(url).netloc
async def wait_if_needed(self, url: str) -> None:
domain = self.get_domain(url)
state = self.domains.get(domain)
if not state:
self.domains[domain] = DomainState()
state = self.domains[domain]
now = time.time()
if state.last_request_time:
wait_time = max(0, state.current_delay - (now - state.last_request_time))
if wait_time > 0:
await asyncio.sleep(wait_time)
# Random delay within base range if no current delay
if state.current_delay == 0:
state.current_delay = random.uniform(*self.base_delay)
state.last_request_time = time.time()
def update_delay(self, url: str, status_code: int) -> bool:
domain = self.get_domain(url)
state = self.domains[domain]
if status_code in self.rate_limit_codes:
state.fail_count += 1
if state.fail_count > self.max_retries:
return False
# Exponential backoff with random jitter
state.current_delay = min(
state.current_delay * 2 * random.uniform(0.75, 1.25), self.max_delay
)
else:
# Gradually reduce delay on success
state.current_delay = max(
random.uniform(*self.base_delay), state.current_delay * 0.75
)
state.fail_count = 0
return True
class CrawlerMonitor:
def __init__(
self,
max_visible_rows: int = 15,
display_mode: DisplayMode = DisplayMode.DETAILED,
):
self.console = Console()
self.max_visible_rows = max_visible_rows
self.display_mode = display_mode
self.stats: Dict[str, CrawlStats] = {}
self.process = psutil.Process()
self.start_time = datetime.now()
self.live = Live(self._create_table(), refresh_per_second=2)
def start(self):
self.live.start()
def stop(self):
self.live.stop()
def add_task(self, task_id: str, url: str):
self.stats[task_id] = CrawlStats(
task_id=task_id, url=url, status=CrawlStatus.QUEUED
)
self.live.update(self._create_table())
def update_task(self, task_id: str, **kwargs):
if task_id in self.stats:
for key, value in kwargs.items():
setattr(self.stats[task_id], key, value)
self.live.update(self._create_table())
def _create_aggregated_table(self) -> Table:
"""Creates a compact table showing only aggregated statistics"""
table = Table(
box=box.ROUNDED,
title="Crawler Status Overview",
title_style="bold magenta",
header_style="bold blue",
show_lines=True,
)
# Calculate statistics
total_tasks = len(self.stats)
queued = sum(
1 for stat in self.stats.values() if stat.status == CrawlStatus.QUEUED
)
in_progress = sum(
1 for stat in self.stats.values() if stat.status == CrawlStatus.IN_PROGRESS
)
completed = sum(
1 for stat in self.stats.values() if stat.status == CrawlStatus.COMPLETED
)
failed = sum(
1 for stat in self.stats.values() if stat.status == CrawlStatus.FAILED
)
# Memory statistics
current_memory = self.process.memory_info().rss / (1024 * 1024)
total_task_memory = sum(stat.memory_usage for stat in self.stats.values())
peak_memory = max(
(stat.peak_memory for stat in self.stats.values()), default=0.0
)
# Duration
duration = datetime.now() - self.start_time
# Create status row
table.add_column("Status", style="bold cyan")
table.add_column("Count", justify="right")
table.add_column("Percentage", justify="right")
table.add_row("Total Tasks", str(total_tasks), "100%")
table.add_row(
"[yellow]In Queue[/yellow]",
str(queued),
f"{(queued/total_tasks*100):.1f}%" if total_tasks > 0 else "0%",
)
table.add_row(
"[blue]In Progress[/blue]",
str(in_progress),
f"{(in_progress/total_tasks*100):.1f}%" if total_tasks > 0 else "0%",
)
table.add_row(
"[green]Completed[/green]",
str(completed),
f"{(completed/total_tasks*100):.1f}%" if total_tasks > 0 else "0%",
)
table.add_row(
"[red]Failed[/red]",
str(failed),
f"{(failed/total_tasks*100):.1f}%" if total_tasks > 0 else "0%",
)
# Add memory information
table.add_section()
table.add_row(
"[magenta]Current Memory[/magenta]", f"{current_memory:.1f} MB", ""
)
table.add_row(
"[magenta]Total Task Memory[/magenta]", f"{total_task_memory:.1f} MB", ""
)
table.add_row(
"[magenta]Peak Task Memory[/magenta]", f"{peak_memory:.1f} MB", ""
)
table.add_row(
"[yellow]Runtime[/yellow]",
str(timedelta(seconds=int(duration.total_seconds()))),
"",
)
return table
def _create_detailed_table(self) -> Table:
table = Table(
box=box.ROUNDED,
title="Crawler Performance Monitor",
title_style="bold magenta",
header_style="bold blue",
)
# Add columns
table.add_column("Task ID", style="cyan", no_wrap=True)
table.add_column("URL", style="cyan", no_wrap=True)
table.add_column("Status", style="bold")
table.add_column("Memory (MB)", justify="right")
table.add_column("Peak (MB)", justify="right")
table.add_column("Duration", justify="right")
table.add_column("Info", style="italic")
# Add summary row
total_memory = sum(stat.memory_usage for stat in self.stats.values())
active_count = sum(
1 for stat in self.stats.values() if stat.status == CrawlStatus.IN_PROGRESS
)
completed_count = sum(
1 for stat in self.stats.values() if stat.status == CrawlStatus.COMPLETED
)
failed_count = sum(
1 for stat in self.stats.values() if stat.status == CrawlStatus.FAILED
)
table.add_row(
"[bold yellow]SUMMARY",
f"Total: {len(self.stats)}",
f"Active: {active_count}",
f"{total_memory:.1f}",
f"{self.process.memory_info().rss / (1024 * 1024):.1f}",
str(
timedelta(
seconds=int((datetime.now() - self.start_time).total_seconds())
)
),
f"{completed_count}{failed_count}",
style="bold",
)
table.add_section()
# Add rows for each task
visible_stats = sorted(
self.stats.values(),
key=lambda x: (
x.status != CrawlStatus.IN_PROGRESS,
x.status != CrawlStatus.QUEUED,
x.end_time or datetime.max,
),
)[: self.max_visible_rows]
for stat in visible_stats:
status_style = {
CrawlStatus.QUEUED: "white",
CrawlStatus.IN_PROGRESS: "yellow",
CrawlStatus.COMPLETED: "green",
CrawlStatus.FAILED: "red",
}[stat.status]
table.add_row(
stat.task_id[:8], # Show first 8 chars of task ID
stat.url[:40] + "..." if len(stat.url) > 40 else stat.url,
f"[{status_style}]{stat.status.value}[/{status_style}]",
f"{stat.memory_usage:.1f}",
f"{stat.peak_memory:.1f}",
stat.duration,
stat.error_message[:40] if stat.error_message else "",
)
return table
def _create_table(self) -> Table:
"""Creates the appropriate table based on display mode"""
if self.display_mode == DisplayMode.AGGREGATED:
return self._create_aggregated_table()
return self._create_detailed_table()
class BaseDispatcher(ABC):
def __init__(
self,
rate_limiter: Optional[RateLimiter] = None,
monitor: Optional[CrawlerMonitor] = None,
):
self.crawler = None
self._domain_last_hit: Dict[str, float] = {}
self.concurrent_sessions = 0
self.rate_limiter = rate_limiter
self.monitor = monitor
@abstractmethod
async def crawl_url(
self,
url: str,
config: CrawlerRunConfig,
task_id: str,
monitor: Optional[CrawlerMonitor] = None,
) -> CrawlerTaskResult:
pass
@abstractmethod
async def run_urls(
self,
urls: List[str],
crawler: "AsyncWebCrawler", # noqa: F821
config: CrawlerRunConfig,
monitor: Optional[CrawlerMonitor] = None,
) -> List[CrawlerTaskResult]:
pass
class MemoryAdaptiveDispatcher(BaseDispatcher):
def __init__(
self,
memory_threshold_percent: float = 90.0,
check_interval: float = 1.0,
max_session_permit: int = 20,
memory_wait_timeout: float = 300.0, # 5 minutes default timeout
rate_limiter: Optional[RateLimiter] = None,
monitor: Optional[CrawlerMonitor] = None,
):
super().__init__(rate_limiter, monitor)
self.memory_threshold_percent = memory_threshold_percent
self.check_interval = check_interval
self.max_session_permit = max_session_permit
self.memory_wait_timeout = memory_wait_timeout
async def crawl_url(
self,
url: str,
config: CrawlerRunConfig,
task_id: str,
) -> CrawlerTaskResult:
start_time = datetime.now()
error_message = ""
memory_usage = peak_memory = 0.0
try:
if self.monitor:
self.monitor.update_task(
task_id, status=CrawlStatus.IN_PROGRESS, start_time=start_time
)
self.concurrent_sessions += 1
if self.rate_limiter:
await self.rate_limiter.wait_if_needed(url)
process = psutil.Process()
start_memory = process.memory_info().rss / (1024 * 1024)
result = await self.crawler.arun(url, config=config, session_id=task_id)
end_memory = process.memory_info().rss / (1024 * 1024)
memory_usage = peak_memory = end_memory - start_memory
if self.rate_limiter and result.status_code:
if not self.rate_limiter.update_delay(url, result.status_code):
error_message = f"Rate limit retry count exceeded for domain {urlparse(url).netloc}"
if self.monitor:
self.monitor.update_task(task_id, status=CrawlStatus.FAILED)
return CrawlerTaskResult(
task_id=task_id,
url=url,
result=result,
memory_usage=memory_usage,
peak_memory=peak_memory,
start_time=start_time,
end_time=datetime.now(),
error_message=error_message,
)
if not result.success:
error_message = result.error_message
if self.monitor:
self.monitor.update_task(task_id, status=CrawlStatus.FAILED)
elif self.monitor:
self.monitor.update_task(task_id, status=CrawlStatus.COMPLETED)
except Exception as e:
error_message = str(e)
if self.monitor:
self.monitor.update_task(task_id, status=CrawlStatus.FAILED)
result = CrawlResult(
url=url, html="", metadata={}, success=False, error_message=str(e)
)
finally:
end_time = datetime.now()
if self.monitor:
self.monitor.update_task(
task_id,
end_time=end_time,
memory_usage=memory_usage,
peak_memory=peak_memory,
error_message=error_message,
)
self.concurrent_sessions -= 1
return CrawlerTaskResult(
task_id=task_id,
url=url,
result=result,
memory_usage=memory_usage,
peak_memory=peak_memory,
start_time=start_time,
end_time=end_time,
error_message=error_message,
)
async def run_urls(
self,
urls: List[str],
crawler: "AsyncWebCrawler", # noqa: F821
config: CrawlerRunConfig,
) -> List[CrawlerTaskResult]:
self.crawler = crawler
if self.monitor:
self.monitor.start()
try:
pending_tasks = []
active_tasks = []
task_queue = []
for url in urls:
task_id = str(uuid.uuid4())
if self.monitor:
self.monitor.add_task(task_id, url)
task_queue.append((url, task_id))
while task_queue or active_tasks:
wait_start_time = time.time()
while len(active_tasks) < self.max_session_permit and task_queue:
if psutil.virtual_memory().percent >= self.memory_threshold_percent:
# Check if we've exceeded the timeout
if time.time() - wait_start_time > self.memory_wait_timeout:
raise MemoryError(
f"Memory usage above threshold ({self.memory_threshold_percent}%) for more than {self.memory_wait_timeout} seconds"
)
await asyncio.sleep(self.check_interval)
continue
url, task_id = task_queue.pop(0)
task = asyncio.create_task(self.crawl_url(url, config, task_id))
active_tasks.append(task)
if not active_tasks:
await asyncio.sleep(self.check_interval)
continue
done, pending = await asyncio.wait(
active_tasks, return_when=asyncio.FIRST_COMPLETED
)
pending_tasks.extend(done)
active_tasks = list(pending)
return await asyncio.gather(*pending_tasks)
finally:
if self.monitor:
self.monitor.stop()
class SemaphoreDispatcher(BaseDispatcher):
def __init__(
self,
semaphore_count: int = 5,
max_session_permit: int = 20,
rate_limiter: Optional[RateLimiter] = None,
monitor: Optional[CrawlerMonitor] = None,
):
super().__init__(rate_limiter, monitor)
self.semaphore_count = semaphore_count
self.max_session_permit = max_session_permit
async def crawl_url(
self,
url: str,
config: CrawlerRunConfig,
task_id: str,
semaphore: asyncio.Semaphore = None,
) -> CrawlerTaskResult:
start_time = datetime.now()
error_message = ""
memory_usage = peak_memory = 0.0
try:
if self.monitor:
self.monitor.update_task(
task_id, status=CrawlStatus.IN_PROGRESS, start_time=start_time
)
if self.rate_limiter:
await self.rate_limiter.wait_if_needed(url)
async with semaphore:
process = psutil.Process()
start_memory = process.memory_info().rss / (1024 * 1024)
result = await self.crawler.arun(url, config=config, session_id=task_id)
end_memory = process.memory_info().rss / (1024 * 1024)
memory_usage = peak_memory = end_memory - start_memory
if self.rate_limiter and result.status_code:
if not self.rate_limiter.update_delay(url, result.status_code):
error_message = f"Rate limit retry count exceeded for domain {urlparse(url).netloc}"
if self.monitor:
self.monitor.update_task(task_id, status=CrawlStatus.FAILED)
return CrawlerTaskResult(
task_id=task_id,
url=url,
result=result,
memory_usage=memory_usage,
peak_memory=peak_memory,
start_time=start_time,
end_time=datetime.now(),
error_message=error_message,
)
if not result.success:
error_message = result.error_message
if self.monitor:
self.monitor.update_task(task_id, status=CrawlStatus.FAILED)
elif self.monitor:
self.monitor.update_task(task_id, status=CrawlStatus.COMPLETED)
except Exception as e:
error_message = str(e)
if self.monitor:
self.monitor.update_task(task_id, status=CrawlStatus.FAILED)
result = CrawlResult(
url=url, html="", metadata={}, success=False, error_message=str(e)
)
finally:
end_time = datetime.now()
if self.monitor:
self.monitor.update_task(
task_id,
end_time=end_time,
memory_usage=memory_usage,
peak_memory=peak_memory,
error_message=error_message,
)
return CrawlerTaskResult(
task_id=task_id,
url=url,
result=result,
memory_usage=memory_usage,
peak_memory=peak_memory,
start_time=start_time,
end_time=end_time,
error_message=error_message,
)
async def run_urls(
self,
crawler: "AsyncWebCrawler", # noqa: F821
urls: List[str],
config: CrawlerRunConfig,
) -> List[CrawlerTaskResult]:
self.crawler = crawler
if self.monitor:
self.monitor.start()
try:
semaphore = asyncio.Semaphore(self.semaphore_count)
tasks = []
for url in urls:
task_id = str(uuid.uuid4())
if self.monitor:
self.monitor.add_task(task_id, url)
task = asyncio.create_task(
self.crawl_url(url, config, task_id, semaphore)
)
tasks.append(task)
return await asyncio.gather(*tasks, return_exceptions=True)
finally:
if self.monitor:
self.monitor.stop()

View File

@@ -1,227 +0,0 @@
from enum import Enum
from typing import Optional, Dict, Any
from colorama import Fore, Style, init
import os
from datetime import datetime
class LogLevel(Enum):
DEBUG = 1
INFO = 2
SUCCESS = 3
WARNING = 4
ERROR = 5
class AsyncLogger:
"""
Asynchronous logger with support for colored console output and file logging.
Supports templated messages with colored components.
"""
DEFAULT_ICONS = {
"INIT": "",
"READY": "",
"FETCH": "",
"SCRAPE": "",
"EXTRACT": "",
"COMPLETE": "",
"ERROR": "×",
"DEBUG": "",
"INFO": "",
"WARNING": "",
}
DEFAULT_COLORS = {
LogLevel.DEBUG: Fore.LIGHTBLACK_EX,
LogLevel.INFO: Fore.CYAN,
LogLevel.SUCCESS: Fore.GREEN,
LogLevel.WARNING: Fore.YELLOW,
LogLevel.ERROR: Fore.RED,
}
def __init__(
self,
log_file: Optional[str] = None,
log_level: LogLevel = LogLevel.DEBUG,
tag_width: int = 10,
icons: Optional[Dict[str, str]] = None,
colors: Optional[Dict[LogLevel, str]] = None,
verbose: bool = True,
):
"""
Initialize the logger.
Args:
log_file: Optional file path for logging
log_level: Minimum log level to display
tag_width: Width for tag formatting
icons: Custom icons for different tags
colors: Custom colors for different log levels
verbose: Whether to output to console
"""
init() # Initialize colorama
self.log_file = log_file
self.log_level = log_level
self.tag_width = tag_width
self.icons = icons or self.DEFAULT_ICONS
self.colors = colors or self.DEFAULT_COLORS
self.verbose = verbose
# Create log file directory if needed
if log_file:
os.makedirs(os.path.dirname(os.path.abspath(log_file)), exist_ok=True)
def _format_tag(self, tag: str) -> str:
"""Format a tag with consistent width."""
return f"[{tag}]".ljust(self.tag_width, ".")
def _get_icon(self, tag: str) -> str:
"""Get the icon for a tag, defaulting to info icon if not found."""
return self.icons.get(tag, self.icons["INFO"])
def _write_to_file(self, message: str):
"""Write a message to the log file if configured."""
if self.log_file:
timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S.%f")[:-3]
with open(self.log_file, "a", encoding="utf-8") as f:
# Strip ANSI color codes for file output
clean_message = message.replace(Fore.RESET, "").replace(
Style.RESET_ALL, ""
)
for color in vars(Fore).values():
if isinstance(color, str):
clean_message = clean_message.replace(color, "")
f.write(f"[{timestamp}] {clean_message}\n")
def _log(
self,
level: LogLevel,
message: str,
tag: str,
params: Optional[Dict[str, Any]] = None,
colors: Optional[Dict[str, str]] = None,
base_color: Optional[str] = None,
**kwargs,
):
"""
Core logging method that handles message formatting and output.
Args:
level: Log level for this message
message: Message template string
tag: Tag for the message
params: Parameters to format into the message
colors: Color overrides for specific parameters
base_color: Base color for the entire message
"""
if level.value < self.log_level.value:
return
# Format the message with parameters if provided
if params:
try:
# First format the message with raw parameters
formatted_message = message.format(**params)
# Then apply colors if specified
if colors:
for key, color in colors.items():
# Find the formatted value in the message and wrap it with color
if key in params:
value_str = str(params[key])
formatted_message = formatted_message.replace(
value_str, f"{color}{value_str}{Style.RESET_ALL}"
)
except KeyError as e:
formatted_message = (
f"LOGGING ERROR: Missing parameter {e} in message template"
)
level = LogLevel.ERROR
else:
formatted_message = message
# Construct the full log line
color = base_color or self.colors[level]
log_line = f"{color}{self._format_tag(tag)} {self._get_icon(tag)} {formatted_message}{Style.RESET_ALL}"
# Output to console if verbose
if self.verbose or kwargs.get("force_verbose", False):
print(log_line)
# Write to file if configured
self._write_to_file(log_line)
def debug(self, message: str, tag: str = "DEBUG", **kwargs):
"""Log a debug message."""
self._log(LogLevel.DEBUG, message, tag, **kwargs)
def info(self, message: str, tag: str = "INFO", **kwargs):
"""Log an info message."""
self._log(LogLevel.INFO, message, tag, **kwargs)
def success(self, message: str, tag: str = "SUCCESS", **kwargs):
"""Log a success message."""
self._log(LogLevel.SUCCESS, message, tag, **kwargs)
def warning(self, message: str, tag: str = "WARNING", **kwargs):
"""Log a warning message."""
self._log(LogLevel.WARNING, message, tag, **kwargs)
def error(self, message: str, tag: str = "ERROR", **kwargs):
"""Log an error message."""
self._log(LogLevel.ERROR, message, tag, **kwargs)
def url_status(
self,
url: str,
success: bool,
timing: float,
tag: str = "FETCH",
url_length: int = 50,
):
"""
Convenience method for logging URL fetch status.
Args:
url: The URL being processed
success: Whether the operation was successful
timing: Time taken for the operation
tag: Tag for the message
url_length: Maximum length for URL in log
"""
self._log(
level=LogLevel.SUCCESS if success else LogLevel.ERROR,
message="{url:.{url_length}}... | Status: {status} | Time: {timing:.2f}s",
tag=tag,
params={
"url": url,
"url_length": url_length,
"status": success,
"timing": timing,
},
colors={
"status": Fore.GREEN if success else Fore.RED,
"timing": Fore.YELLOW,
},
)
def error_status(
self, url: str, error: str, tag: str = "ERROR", url_length: int = 50
):
"""
Convenience method for logging error status.
Args:
url: The URL being processed
error: Error message
tag: Tag for the message
url_length: Maximum length for URL in log
"""
self._log(
level=LogLevel.ERROR,
message="{url:.{url_length}}... | Error: {error}",
tag=tag,
params={"url": url, "url_length": url_length, "error": error},
)

File diff suppressed because it is too large Load Diff

View File

@@ -1,117 +0,0 @@
from enum import Enum
class CacheMode(Enum):
"""
Defines the caching behavior for web crawling operations.
Modes:
- ENABLED: Normal caching behavior (read and write)
- DISABLED: No caching at all
- READ_ONLY: Only read from cache, don't write
- WRITE_ONLY: Only write to cache, don't read
- BYPASS: Bypass cache for this operation
"""
ENABLED = "enabled"
DISABLED = "disabled"
READ_ONLY = "read_only"
WRITE_ONLY = "write_only"
BYPASS = "bypass"
class CacheContext:
"""
Encapsulates cache-related decisions and URL handling.
This class centralizes all cache-related logic and URL type checking,
making the caching behavior more predictable and maintainable.
Attributes:
url (str): The URL being processed.
cache_mode (CacheMode): The cache mode for the current operation.
always_bypass (bool): If True, bypasses caching for this operation.
is_cacheable (bool): True if the URL is cacheable, False otherwise.
is_web_url (bool): True if the URL is a web URL, False otherwise.
is_local_file (bool): True if the URL is a local file, False otherwise.
is_raw_html (bool): True if the URL is raw HTML, False otherwise.
_url_display (str): The display name for the URL (web, local file, or raw HTML).
"""
def __init__(self, url: str, cache_mode: CacheMode, always_bypass: bool = False):
"""
Initializes the CacheContext with the provided URL and cache mode.
Args:
url (str): The URL being processed.
cache_mode (CacheMode): The cache mode for the current operation.
always_bypass (bool): If True, bypasses caching for this operation.
"""
self.url = url
self.cache_mode = cache_mode
self.always_bypass = always_bypass
self.is_cacheable = url.startswith(("http://", "https://", "file://"))
self.is_web_url = url.startswith(("http://", "https://"))
self.is_local_file = url.startswith("file://")
self.is_raw_html = url.startswith("raw:")
self._url_display = url if not self.is_raw_html else "Raw HTML"
def should_read(self) -> bool:
"""
Determines if cache should be read based on context.
How it works:
1. If always_bypass is True or is_cacheable is False, return False.
2. If cache_mode is ENABLED or READ_ONLY, return True.
Returns:
bool: True if cache should be read, False otherwise.
"""
if self.always_bypass or not self.is_cacheable:
return False
return self.cache_mode in [CacheMode.ENABLED, CacheMode.READ_ONLY]
def should_write(self) -> bool:
"""
Determines if cache should be written based on context.
How it works:
1. If always_bypass is True or is_cacheable is False, return False.
2. If cache_mode is ENABLED or WRITE_ONLY, return True.
Returns:
bool: True if cache should be written, False otherwise.
"""
if self.always_bypass or not self.is_cacheable:
return False
return self.cache_mode in [CacheMode.ENABLED, CacheMode.WRITE_ONLY]
@property
def display_url(self) -> str:
"""Returns the URL in display format."""
return self._url_display
def _legacy_to_cache_mode(
disable_cache: bool = False,
bypass_cache: bool = False,
no_cache_read: bool = False,
no_cache_write: bool = False,
) -> CacheMode:
"""
Converts legacy cache parameters to the new CacheMode enum.
This is an internal function to help transition from the old boolean flags
to the new CacheMode system.
"""
if disable_cache:
return CacheMode.DISABLED
if bypass_cache:
return CacheMode.BYPASS
if no_cache_read and no_cache_write:
return CacheMode.DISABLED
if no_cache_read:
return CacheMode.WRITE_ONLY
if no_cache_write:
return CacheMode.READ_ONLY
return CacheMode.ENABLED

View File

@@ -3,53 +3,23 @@ import re
from collections import Counter
import string
from .model_loader import load_nltk_punkt
from .utils import *
# Define the abstract base class for chunking strategies
class ChunkingStrategy(ABC):
"""
Abstract base class for chunking strategies.
"""
@abstractmethod
def chunk(self, text: str) -> list:
"""
Abstract method to chunk the given text.
Args:
text (str): The text to chunk.
Returns:
list: A list of chunks.
"""
pass
# Create an identity chunking strategy f(x) = [x]
class IdentityChunking(ChunkingStrategy):
"""
Chunking strategy that returns the input text as a single chunk.
"""
def chunk(self, text: str) -> list:
return [text]
# Regex-based chunking
class RegexChunking(ChunkingStrategy):
"""
Chunking strategy that splits text based on regular expression patterns.
"""
def __init__(self, patterns=None, **kwargs):
"""
Initialize the RegexChunking object.
Args:
patterns (list): A list of regular expression patterns to split text.
"""
if patterns is None:
patterns = [r"\n\n"] # Default split pattern
patterns = [r'\n\n'] # Default split pattern
self.patterns = patterns
def chunk(self, text: str) -> list:
@@ -60,19 +30,12 @@ class RegexChunking(ChunkingStrategy):
new_paragraphs.extend(re.split(pattern, paragraph))
paragraphs = new_paragraphs
return paragraphs
# NLP-based sentence chunking
# NLP-based sentence chunking
class NlpSentenceChunking(ChunkingStrategy):
"""
Chunking strategy that splits text into sentences using NLTK's sentence tokenizer.
"""
def __init__(self, **kwargs):
"""
Initialize the NlpSentenceChunking object.
"""
load_nltk_punkt()
pass
def chunk(self, text: str) -> list:
# Improved regex for sentence splitting
@@ -80,34 +43,18 @@ class NlpSentenceChunking(ChunkingStrategy):
# r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<![A-Z][A-Z]\.)(?<![A-Za-z]\.)(?<=\.|\?|\!|\n)\s'
# )
# sentences = sentence_endings.split(text)
# sens = [sent.strip() for sent in sentences if sent]
# sens = [sent.strip() for sent in sentences if sent]
from nltk.tokenize import sent_tokenize
sentences = sent_tokenize(text)
sens = [sent.strip() for sent in sentences]
sens = [sent.strip() for sent in sentences]
return list(set(sens))
# Topic-based segmentation using TextTiling
class TopicSegmentationChunking(ChunkingStrategy):
"""
Chunking strategy that segments text into topics using NLTK's TextTilingTokenizer.
How it works:
1. Segment the text into topics using TextTilingTokenizer
2. Extract keywords for each topic segment
"""
def __init__(self, num_keywords=3, **kwargs):
"""
Initialize the TopicSegmentationChunking object.
Args:
num_keywords (int): The number of keywords to extract for each topic segment.
"""
import nltk as nl
self.tokenizer = nl.tokenize.TextTilingTokenizer()
self.num_keywords = num_keywords
@@ -119,14 +66,8 @@ class TopicSegmentationChunking(ChunkingStrategy):
def extract_keywords(self, text: str) -> list:
# Tokenize and remove stopwords and punctuation
import nltk as nl
tokens = nl.toknize.word_tokenize(text)
tokens = [
token.lower()
for token in tokens
if token not in nl.corpus.stopwords.words("english")
and token not in string.punctuation
]
tokens = [token.lower() for token in tokens if token not in nl.corpus.stopwords.words('english') and token not in string.punctuation]
# Calculate frequency distribution
freq_dist = Counter(tokens)
@@ -137,120 +78,29 @@ class TopicSegmentationChunking(ChunkingStrategy):
# Segment the text into topics
segments = self.chunk(text)
# Extract keywords for each topic segment
segments_with_topics = [
(segment, self.extract_keywords(segment)) for segment in segments
]
segments_with_topics = [(segment, self.extract_keywords(segment)) for segment in segments]
return segments_with_topics
# Fixed-length word chunks
class FixedLengthWordChunking(ChunkingStrategy):
"""
Chunking strategy that splits text into fixed-length word chunks.
How it works:
1. Split the text into words
2. Create chunks of fixed length
3. Return the list of chunks
"""
def __init__(self, chunk_size=100, **kwargs):
"""
Initialize the fixed-length word chunking strategy with the given chunk size.
Args:
chunk_size (int): The size of each chunk in words.
"""
self.chunk_size = chunk_size
def chunk(self, text: str) -> list:
words = text.split()
return [
" ".join(words[i : i + self.chunk_size])
for i in range(0, len(words), self.chunk_size)
]
return [' '.join(words[i:i + self.chunk_size]) for i in range(0, len(words), self.chunk_size)]
# Sliding window chunking
class SlidingWindowChunking(ChunkingStrategy):
"""
Chunking strategy that splits text into overlapping word chunks.
How it works:
1. Split the text into words
2. Create chunks of fixed length
3. Return the list of chunks
"""
def __init__(self, window_size=100, step=50, **kwargs):
"""
Initialize the sliding window chunking strategy with the given window size and
step size.
Args:
window_size (int): The size of the sliding window in words.
step (int): The step size for sliding the window in words.
"""
self.window_size = window_size
self.step = step
def chunk(self, text: str) -> list:
words = text.split()
chunks = []
if len(words) <= self.window_size:
return [text]
for i in range(0, len(words) - self.window_size + 1, self.step):
chunk = " ".join(words[i : i + self.window_size])
chunks.append(chunk)
# Handle the last chunk if it doesn't align perfectly
if i + self.window_size < len(words):
chunks.append(" ".join(words[-self.window_size :]))
for i in range(0, len(words), self.step):
chunks.append(' '.join(words[i:i + self.window_size]))
return chunks
class OverlappingWindowChunking(ChunkingStrategy):
"""
Chunking strategy that splits text into overlapping word chunks.
How it works:
1. Split the text into words using whitespace
2. Create chunks of fixed length equal to the window size
3. Slide the window by the overlap size
4. Return the list of chunks
"""
def __init__(self, window_size=1000, overlap=100, **kwargs):
"""
Initialize the overlapping window chunking strategy with the given window size and
overlap size.
Args:
window_size (int): The size of the window in words.
overlap (int): The size of the overlap between consecutive chunks in words.
"""
self.window_size = window_size
self.overlap = overlap
def chunk(self, text: str) -> list:
words = text.split()
chunks = []
if len(words) <= self.window_size:
return [text]
start = 0
while start < len(words):
end = start + self.window_size
chunk = " ".join(words[start:end])
chunks.append(chunk)
if end >= len(words):
break
start = end - self.overlap
return chunks

View File

@@ -1,123 +0,0 @@
import click
import sys
import asyncio
from typing import List
from .docs_manager import DocsManager
from .async_logger import AsyncLogger
logger = AsyncLogger(verbose=True)
docs_manager = DocsManager(logger)
def print_table(headers: List[str], rows: List[List[str]], padding: int = 2):
"""Print formatted table with headers and rows"""
widths = [max(len(str(cell)) for cell in col) for col in zip(headers, *rows)]
border = "+" + "+".join("-" * (w + 2 * padding) for w in widths) + "+"
def format_row(row):
return (
"|"
+ "|".join(
f"{' ' * padding}{str(cell):<{w}}{' ' * padding}"
for cell, w in zip(row, widths)
)
+ "|"
)
click.echo(border)
click.echo(format_row(headers))
click.echo(border)
for row in rows:
click.echo(format_row(row))
click.echo(border)
@click.group()
def cli():
"""Crawl4AI Command Line Interface"""
pass
@cli.group()
def docs():
"""Documentation operations"""
pass
@docs.command()
@click.argument("sections", nargs=-1)
@click.option(
"--mode", type=click.Choice(["extended", "condensed"]), default="extended"
)
def combine(sections: tuple, mode: str):
"""Combine documentation sections"""
try:
asyncio.run(docs_manager.ensure_docs_exist())
click.echo(docs_manager.generate(sections, mode))
except Exception as e:
logger.error(str(e), tag="ERROR")
sys.exit(1)
@docs.command()
@click.argument("query")
@click.option("--top-k", "-k", default=5)
@click.option("--build-index", is_flag=True, help="Build index if missing")
def search(query: str, top_k: int, build_index: bool):
"""Search documentation"""
try:
result = docs_manager.search(query, top_k)
if result == "No search index available. Call build_search_index() first.":
if build_index or click.confirm("No search index found. Build it now?"):
asyncio.run(docs_manager.llm_text.generate_index_files())
result = docs_manager.search(query, top_k)
click.echo(result)
except Exception as e:
click.echo(f"Error: {str(e)}", err=True)
sys.exit(1)
@docs.command()
def update():
"""Update docs from GitHub"""
try:
asyncio.run(docs_manager.fetch_docs())
click.echo("Documentation updated successfully")
except Exception as e:
click.echo(f"Error: {str(e)}", err=True)
sys.exit(1)
@docs.command()
@click.option("--force-facts", is_flag=True, help="Force regenerate fact files")
@click.option("--clear-cache", is_flag=True, help="Clear BM25 cache")
def index(force_facts: bool, clear_cache: bool):
"""Build or rebuild search indexes"""
try:
asyncio.run(docs_manager.ensure_docs_exist())
asyncio.run(
docs_manager.llm_text.generate_index_files(
force_generate_facts=force_facts, clear_bm25_cache=clear_cache
)
)
click.echo("Search indexes built successfully")
except Exception as e:
click.echo(f"Error: {str(e)}", err=True)
sys.exit(1)
# Add docs list command
@docs.command()
def list():
"""List available documentation sections"""
try:
sections = docs_manager.list()
print_table(["Sections"], [[section] for section in sections])
except Exception as e:
click.echo(f"Error: {str(e)}", err=True)
sys.exit(1)
if __name__ == "__main__":
cli()

View File

@@ -4,68 +4,31 @@ from dotenv import load_dotenv
load_dotenv() # Load environment variables from .env file
# Default provider, ONLY used when the extraction strategy is LLMExtractionStrategy
DEFAULT_PROVIDER = "openai/gpt-4o-mini"
DEFAULT_PROVIDER = "openai/gpt-4-turbo"
MODEL_REPO_BRANCH = "new-release-0.0.2"
# Provider-model dictionary, ONLY used when the extraction strategy is LLMExtractionStrategy
PROVIDER_MODELS = {
"ollama/llama3": "no-token-needed", # Any model from Ollama no need for API token
"ollama/llama3": "no-token-needed", # Any model from Ollama no need for API token
"groq/llama3-70b-8192": os.getenv("GROQ_API_KEY"),
"groq/llama3-8b-8192": os.getenv("GROQ_API_KEY"),
"openai/gpt-4o-mini": os.getenv("OPENAI_API_KEY"),
"openai/gpt-3.5-turbo": os.getenv("OPENAI_API_KEY"),
"openai/gpt-4-turbo": os.getenv("OPENAI_API_KEY"),
"openai/gpt-4o": os.getenv("OPENAI_API_KEY"),
"openai/o1-mini": os.getenv("OPENAI_API_KEY"),
"openai/o1-preview": os.getenv("OPENAI_API_KEY"),
"anthropic/claude-3-haiku-20240307": os.getenv("ANTHROPIC_API_KEY"),
"anthropic/claude-3-opus-20240229": os.getenv("ANTHROPIC_API_KEY"),
"anthropic/claude-3-sonnet-20240229": os.getenv("ANTHROPIC_API_KEY"),
"anthropic/claude-3-5-sonnet-20240620": os.getenv("ANTHROPIC_API_KEY"),
}
# Chunk token threshold
CHUNK_TOKEN_THRESHOLD = 2**11 # 2048 tokens
CHUNK_TOKEN_THRESHOLD = 500
OVERLAP_RATE = 0.1
WORD_TOKEN_RATE = 1.3
# Threshold for the minimum number of word in a HTML tag to be considered
# Threshold for the minimum number of word in a HTML tag to be considered
MIN_WORD_THRESHOLD = 1
IMAGE_DESCRIPTION_MIN_WORD_THRESHOLD = 1
IMPORTANT_ATTRS = ["src", "href", "alt", "title", "width", "height"]
ONLY_TEXT_ELIGIBLE_TAGS = [
"b",
"i",
"u",
"span",
"del",
"ins",
"sub",
"sup",
"strong",
"em",
"code",
"kbd",
"var",
"s",
"q",
"abbr",
"cite",
"dfn",
"time",
"small",
"mark",
]
SOCIAL_MEDIA_DOMAINS = [
"facebook.com",
"twitter.com",
"x.com",
"linkedin.com",
"instagram.com",
"pinterest.com",
"tiktok.com",
"snapchat.com",
"reddit.com",
]
# Threshold for the Image extraction - Range is 1 to 6
# Images are scored based on point based system, to filter based on usefulness. Points are assigned
# to each image based on the following aspects.
@@ -75,12 +38,3 @@ SOCIAL_MEDIA_DOMAINS = [
# If image format is in jpg, png or webp
# If image is in the first half of the total images extracted from the page
IMAGE_SCORE_THRESHOLD = 2
MAX_METRICS_HISTORY = 1000
NEED_MIGRATION = True
URL_LOG_SHORTEN_LENGTH = 30
SHOW_DEPRECATION_WARNINGS = True
SCREENSHOT_HEIGHT_TRESHOLD = 10000
PAGE_TIMEOUT = 60000
DOWNLOAD_PAGE_TIMEOUT = 60000

View File

@@ -1,999 +0,0 @@
import re
import time
from bs4 import BeautifulSoup, Tag
from typing import List, Tuple, Dict, Optional
from rank_bm25 import BM25Okapi
from collections import deque
from bs4 import NavigableString, Comment
from .utils import clean_tokens, perform_completion_with_backoff, escape_json_string, sanitize_html, get_home_folder, extract_xml_data
from abc import ABC, abstractmethod
import math
from snowballstemmer import stemmer
from .config import DEFAULT_PROVIDER, OVERLAP_RATE, WORD_TOKEN_RATE
from .models import TokenUsage
from .prompts import PROMPT_FILTER_CONTENT
import os
import json
import hashlib
from pathlib import Path
from concurrent.futures import ThreadPoolExecutor, as_completed
from .async_logger import AsyncLogger, LogLevel
from colorama import Fore, Style, init
class RelevantContentFilter(ABC):
"""Abstract base class for content filtering strategies"""
def __init__(self, user_query: str = None):
self.user_query = user_query
self.included_tags = {
# Primary structure
"article",
"main",
"section",
"div",
# List structures
"ul",
"ol",
"li",
"dl",
"dt",
"dd",
# Text content
"p",
"span",
"blockquote",
"pre",
"code",
# Headers
"h1",
"h2",
"h3",
"h4",
"h5",
"h6",
# Tables
"table",
"thead",
"tbody",
"tr",
"td",
"th",
# Other semantic elements
"figure",
"figcaption",
"details",
"summary",
# Text formatting
"em",
"strong",
"b",
"i",
"mark",
"small",
# Rich content
"time",
"address",
"cite",
"q",
}
self.excluded_tags = {
"nav",
"footer",
"header",
"aside",
"script",
"style",
"form",
"iframe",
"noscript",
}
self.header_tags = {"h1", "h2", "h3", "h4", "h5", "h6"}
self.negative_patterns = re.compile(
r"nav|footer|header|sidebar|ads|comment|promo|advert|social|share", re.I
)
self.min_word_count = 2
@abstractmethod
def filter_content(self, html: str) -> List[str]:
"""Abstract method to be implemented by specific filtering strategies"""
pass
def extract_page_query(self, soup: BeautifulSoup, body: Tag) -> str:
"""Common method to extract page metadata with fallbacks"""
if self.user_query:
return self.user_query
query_parts = []
# Title
try:
title = soup.title.string
if title:
query_parts.append(title)
except Exception:
pass
if soup.find("h1"):
query_parts.append(soup.find("h1").get_text())
# Meta tags
temp = ""
for meta_name in ["keywords", "description"]:
meta = soup.find("meta", attrs={"name": meta_name})
if meta and meta.get("content"):
query_parts.append(meta["content"])
temp += meta["content"]
# If still empty, grab first significant paragraph
if not temp:
# Find the first tag P thatits text contains more than 50 characters
for p in body.find_all("p"):
if len(p.get_text()) > 150:
query_parts.append(p.get_text()[:150])
break
return " ".join(filter(None, query_parts))
def extract_text_chunks(
self, body: Tag, min_word_threshold: int = None
) -> List[Tuple[str, str]]:
"""
Extracts text chunks from a BeautifulSoup body element while preserving order.
Returns list of tuples (text, tag_name) for classification.
Args:
body: BeautifulSoup Tag object representing the body element
Returns:
List of (text, tag_name) tuples
"""
# Tags to ignore - inline elements that shouldn't break text flow
INLINE_TAGS = {
"a",
"abbr",
"acronym",
"b",
"bdo",
"big",
"br",
"button",
"cite",
"code",
"dfn",
"em",
"i",
"img",
"input",
"kbd",
"label",
"map",
"object",
"q",
"samp",
"script",
"select",
"small",
"span",
"strong",
"sub",
"sup",
"textarea",
"time",
"tt",
"var",
}
# Tags that typically contain meaningful headers
HEADER_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6", "header"}
chunks = []
current_text = []
chunk_index = 0
def should_break_chunk(tag: Tag) -> bool:
"""Determine if a tag should cause a break in the current text chunk"""
return tag.name not in INLINE_TAGS and not (
tag.name == "p" and len(current_text) == 0
)
# Use deque for efficient push/pop operations
stack = deque([(body, False)])
while stack:
element, visited = stack.pop()
if visited:
# End of block element - flush accumulated text
if current_text and should_break_chunk(element):
text = " ".join("".join(current_text).split())
if text:
tag_type = (
"header" if element.name in HEADER_TAGS else "content"
)
chunks.append((chunk_index, text, tag_type, element))
chunk_index += 1
current_text = []
continue
if isinstance(element, NavigableString):
if str(element).strip():
current_text.append(str(element).strip())
continue
# Pre-allocate children to avoid multiple list operations
children = list(element.children)
if not children:
continue
# Mark block for revisit after processing children
stack.append((element, True))
# Add children in reverse order for correct processing
for child in reversed(children):
if isinstance(child, (Tag, NavigableString)):
stack.append((child, False))
# Handle any remaining text
if current_text:
text = " ".join("".join(current_text).split())
if text:
chunks.append((chunk_index, text, "content", body))
if min_word_threshold:
chunks = [
chunk for chunk in chunks if len(chunk[1].split()) >= min_word_threshold
]
return chunks
def _deprecated_extract_text_chunks(
self, soup: BeautifulSoup
) -> List[Tuple[int, str, Tag]]:
"""Common method for extracting text chunks"""
_text_cache = {}
def fast_text(element: Tag) -> str:
elem_id = id(element)
if elem_id in _text_cache:
return _text_cache[elem_id]
texts = []
for content in element.contents:
if isinstance(content, str):
text = content.strip()
if text:
texts.append(text)
result = " ".join(texts)
_text_cache[elem_id] = result
return result
candidates = []
index = 0
def dfs(element):
nonlocal index
if isinstance(element, Tag):
if element.name in self.included_tags:
if not self.is_excluded(element):
text = fast_text(element)
word_count = len(text.split())
# Headers pass through with adjusted minimum
if element.name in self.header_tags:
if word_count >= 3: # Minimal sanity check for headers
candidates.append((index, text, element))
index += 1
# Regular content uses standard minimum
elif word_count >= self.min_word_count:
candidates.append((index, text, element))
index += 1
for child in element.children:
dfs(child)
dfs(soup.body if soup.body else soup)
return candidates
def is_excluded(self, tag: Tag) -> bool:
"""Common method for exclusion logic"""
if tag.name in self.excluded_tags:
return True
class_id = " ".join(
filter(None, [" ".join(tag.get("class", [])), tag.get("id", "")])
)
return bool(self.negative_patterns.search(class_id))
def clean_element(self, tag: Tag) -> str:
"""Common method for cleaning HTML elements with minimal overhead"""
if not tag or not isinstance(tag, Tag):
return ""
unwanted_tags = {"script", "style", "aside", "form", "iframe", "noscript"}
unwanted_attrs = {
"style",
"onclick",
"onmouseover",
"align",
"bgcolor",
"class",
"id",
}
# Use string builder pattern for better performance
builder = []
def render_tag(elem):
if not isinstance(elem, Tag):
if isinstance(elem, str):
builder.append(elem.strip())
return
if elem.name in unwanted_tags:
return
# Start tag
builder.append(f"<{elem.name}")
# Add cleaned attributes
attrs = {k: v for k, v in elem.attrs.items() if k not in unwanted_attrs}
for key, value in attrs.items():
builder.append(f' {key}="{value}"')
builder.append(">")
# Process children
for child in elem.children:
render_tag(child)
# Close tag
builder.append(f"</{elem.name}>")
try:
render_tag(tag)
return "".join(builder)
except Exception:
return str(tag) # Fallback to original if anything fails
class BM25ContentFilter(RelevantContentFilter):
"""
Content filtering using BM25 algorithm with priority tag handling.
How it works:
1. Extracts page metadata with fallbacks.
2. Extracts text chunks from the body element.
3. Tokenizes the corpus and query.
4. Applies BM25 algorithm to calculate scores for each chunk.
5. Filters out chunks below the threshold.
6. Sorts chunks by score in descending order.
7. Returns the top N chunks.
Attributes:
user_query (str): User query for filtering (optional).
bm25_threshold (float): BM25 threshold for filtering (default: 1.0).
language (str): Language for stemming (default: 'english').
Methods:
filter_content(self, html: str, min_word_threshold: int = None)
"""
def __init__(
self,
user_query: str = None,
bm25_threshold: float = 1.0,
language: str = "english",
):
"""
Initializes the BM25ContentFilter class, if not provided, falls back to page metadata.
Note:
If no query is given and no page metadata is available, then it tries to pick up the first significant paragraph.
Args:
user_query (str): User query for filtering (optional).
bm25_threshold (float): BM25 threshold for filtering (default: 1.0).
language (str): Language for stemming (default: 'english').
"""
super().__init__(user_query=user_query)
self.bm25_threshold = bm25_threshold
self.priority_tags = {
"h1": 5.0,
"h2": 4.0,
"h3": 3.0,
"title": 4.0,
"strong": 2.0,
"b": 1.5,
"em": 1.5,
"blockquote": 2.0,
"code": 2.0,
"pre": 1.5,
"th": 1.5, # Table headers
}
self.stemmer = stemmer(language)
def filter_content(self, html: str, min_word_threshold: int = None) -> List[str]:
"""
Implements content filtering using BM25 algorithm with priority tag handling.
Note:
This method implements the filtering logic for the BM25ContentFilter class.
It takes HTML content as input and returns a list of filtered text chunks.
Args:
html (str): HTML content to be filtered.
min_word_threshold (int): Minimum word threshold for filtering (optional).
Returns:
List[str]: List of filtered text chunks.
"""
if not html or not isinstance(html, str):
return []
soup = BeautifulSoup(html, "lxml")
# Check if body is present
if not soup.body:
# Wrap in body tag if missing
soup = BeautifulSoup(f"<body>{html}</body>", "lxml")
body = soup.find("body")
query = self.extract_page_query(soup, body)
if not query:
return []
# return [self.clean_element(soup)]
candidates = self.extract_text_chunks(body, min_word_threshold)
if not candidates:
return []
# Tokenize corpus
# tokenized_corpus = [chunk.lower().split() for _, chunk, _, _ in candidates]
# tokenized_query = query.lower().split()
# tokenized_corpus = [[ps.stem(word) for word in chunk.lower().split()]
# for _, chunk, _, _ in candidates]
# tokenized_query = [ps.stem(word) for word in query.lower().split()]
tokenized_corpus = [
[self.stemmer.stemWord(word) for word in chunk.lower().split()]
for _, chunk, _, _ in candidates
]
tokenized_query = [
self.stemmer.stemWord(word) for word in query.lower().split()
]
# tokenized_corpus = [[self.stemmer.stemWord(word) for word in tokenize_text(chunk.lower())]
# for _, chunk, _, _ in candidates]
# tokenized_query = [self.stemmer.stemWord(word) for word in tokenize_text(query.lower())]
# Clean from stop words and noise
tokenized_corpus = [clean_tokens(tokens) for tokens in tokenized_corpus]
tokenized_query = clean_tokens(tokenized_query)
bm25 = BM25Okapi(tokenized_corpus)
scores = bm25.get_scores(tokenized_query)
# Adjust scores with tag weights
adjusted_candidates = []
for score, (index, chunk, tag_type, tag) in zip(scores, candidates):
tag_weight = self.priority_tags.get(tag.name, 1.0)
adjusted_score = score * tag_weight
adjusted_candidates.append((adjusted_score, index, chunk, tag))
# Filter candidates by threshold
selected_candidates = [
(index, chunk, tag)
for adjusted_score, index, chunk, tag in adjusted_candidates
if adjusted_score >= self.bm25_threshold
]
if not selected_candidates:
return []
# Sort selected candidates by original document order
selected_candidates.sort(key=lambda x: x[0])
return [self.clean_element(tag) for _, _, tag in selected_candidates]
class PruningContentFilter(RelevantContentFilter):
"""
Content filtering using pruning algorithm with dynamic threshold.
How it works:
1. Extracts page metadata with fallbacks.
2. Extracts text chunks from the body element.
3. Applies pruning algorithm to calculate scores for each chunk.
4. Filters out chunks below the threshold.
5. Sorts chunks by score in descending order.
6. Returns the top N chunks.
Attributes:
user_query (str): User query for filtering (optional), if not provided, falls back to page metadata.
min_word_threshold (int): Minimum word threshold for filtering (optional).
threshold_type (str): Threshold type for dynamic threshold (default: 'fixed').
threshold (float): Fixed threshold value (default: 0.48).
Methods:
filter_content(self, html: str, min_word_threshold: int = None):
"""
def __init__(
self,
user_query: str = None,
min_word_threshold: int = None,
threshold_type: str = "fixed",
threshold: float = 0.48,
):
"""
Initializes the PruningContentFilter class, if not provided, falls back to page metadata.
Note:
If no query is given and no page metadata is available, then it tries to pick up the first significant paragraph.
Args:
user_query (str): User query for filtering (optional).
min_word_threshold (int): Minimum word threshold for filtering (optional).
threshold_type (str): Threshold type for dynamic threshold (default: 'fixed').
threshold (float): Fixed threshold value (default: 0.48).
"""
super().__init__(None)
self.min_word_threshold = min_word_threshold
self.threshold_type = threshold_type
self.threshold = threshold
# Add tag importance for dynamic threshold
self.tag_importance = {
"article": 1.5,
"main": 1.4,
"section": 1.3,
"p": 1.2,
"h1": 1.4,
"h2": 1.3,
"h3": 1.2,
"div": 0.7,
"span": 0.6,
}
# Metric configuration
self.metric_config = {
"text_density": True,
"link_density": True,
"tag_weight": True,
"class_id_weight": True,
"text_length": True,
}
self.metric_weights = {
"text_density": 0.4,
"link_density": 0.2,
"tag_weight": 0.2,
"class_id_weight": 0.1,
"text_length": 0.1,
}
self.tag_weights = {
"div": 0.5,
"p": 1.0,
"article": 1.5,
"section": 1.0,
"span": 0.3,
"li": 0.5,
"ul": 0.5,
"ol": 0.5,
"h1": 1.2,
"h2": 1.1,
"h3": 1.0,
"h4": 0.9,
"h5": 0.8,
"h6": 0.7,
}
def filter_content(self, html: str, min_word_threshold: int = None) -> List[str]:
"""
Implements content filtering using pruning algorithm with dynamic threshold.
Note:
This method implements the filtering logic for the PruningContentFilter class.
It takes HTML content as input and returns a list of filtered text chunks.
Args:
html (str): HTML content to be filtered.
min_word_threshold (int): Minimum word threshold for filtering (optional).
Returns:
List[str]: List of filtered text chunks.
"""
if not html or not isinstance(html, str):
return []
soup = BeautifulSoup(html, "lxml")
if not soup.body:
soup = BeautifulSoup(f"<body>{html}</body>", "lxml")
# Remove comments and unwanted tags
self._remove_comments(soup)
self._remove_unwanted_tags(soup)
# Prune tree starting from body
body = soup.find("body")
self._prune_tree(body)
# Extract remaining content as list of HTML strings
content_blocks = []
for element in body.children:
if isinstance(element, str) or not hasattr(element, "name"):
continue
if len(element.get_text(strip=True)) > 0:
content_blocks.append(str(element))
return content_blocks
def _remove_comments(self, soup):
"""Removes HTML comments"""
for element in soup(text=lambda text: isinstance(text, Comment)):
element.extract()
def _remove_unwanted_tags(self, soup):
"""Removes unwanted tags"""
for tag in self.excluded_tags:
for element in soup.find_all(tag):
element.decompose()
def _prune_tree(self, node):
"""
Prunes the tree starting from the given node.
Args:
node (Tag): The node from which the pruning starts.
"""
if not node or not hasattr(node, "name") or node.name is None:
return
text_len = len(node.get_text(strip=True))
tag_len = len(node.encode_contents().decode("utf-8"))
link_text_len = sum(
len(s.strip())
for s in (a.string for a in node.find_all("a", recursive=False))
if s
)
metrics = {
"node": node,
"tag_name": node.name,
"text_len": text_len,
"tag_len": tag_len,
"link_text_len": link_text_len,
}
score = self._compute_composite_score(metrics, text_len, tag_len, link_text_len)
if self.threshold_type == "fixed":
should_remove = score < self.threshold
else: # dynamic
tag_importance = self.tag_importance.get(node.name, 0.7)
text_ratio = text_len / tag_len if tag_len > 0 else 0
link_ratio = link_text_len / text_len if text_len > 0 else 1
threshold = self.threshold # base threshold
if tag_importance > 1:
threshold *= 0.8
if text_ratio > 0.4:
threshold *= 0.9
if link_ratio > 0.6:
threshold *= 1.2
should_remove = score < threshold
if should_remove:
node.decompose()
else:
children = [child for child in node.children if hasattr(child, "name")]
for child in children:
self._prune_tree(child)
def _compute_composite_score(self, metrics, text_len, tag_len, link_text_len):
"""Computes the composite score"""
if self.min_word_threshold:
# Get raw text from metrics node - avoid extra processing
text = metrics["node"].get_text(strip=True)
word_count = text.count(" ") + 1
if word_count < self.min_word_threshold:
return -1.0 # Guaranteed removal
score = 0.0
total_weight = 0.0
if self.metric_config["text_density"]:
density = text_len / tag_len if tag_len > 0 else 0
score += self.metric_weights["text_density"] * density
total_weight += self.metric_weights["text_density"]
if self.metric_config["link_density"]:
density = 1 - (link_text_len / text_len if text_len > 0 else 0)
score += self.metric_weights["link_density"] * density
total_weight += self.metric_weights["link_density"]
if self.metric_config["tag_weight"]:
tag_score = self.tag_weights.get(metrics["tag_name"], 0.5)
score += self.metric_weights["tag_weight"] * tag_score
total_weight += self.metric_weights["tag_weight"]
if self.metric_config["class_id_weight"]:
class_score = self._compute_class_id_weight(metrics["node"])
score += self.metric_weights["class_id_weight"] * max(0, class_score)
total_weight += self.metric_weights["class_id_weight"]
if self.metric_config["text_length"]:
score += self.metric_weights["text_length"] * math.log(text_len + 1)
total_weight += self.metric_weights["text_length"]
return score / total_weight if total_weight > 0 else 0
def _compute_class_id_weight(self, node):
"""Computes the class ID weight"""
class_id_score = 0
if "class" in node.attrs:
classes = " ".join(node["class"])
if self.negative_patterns.match(classes):
class_id_score -= 0.5
if "id" in node.attrs:
element_id = node["id"]
if self.negative_patterns.match(element_id):
class_id_score -= 0.5
return class_id_score
class LLMContentFilter(RelevantContentFilter):
"""Content filtering using LLMs to generate relevant markdown."""
def __init__(
self,
provider: str = DEFAULT_PROVIDER,
api_token: Optional[str] = None,
instruction: str = None,
chunk_token_threshold: int = int(1e9),
overlap_rate: float = OVERLAP_RATE,
word_token_rate: float = WORD_TOKEN_RATE,
base_url: Optional[str] = None,
api_base: Optional[str] = None,
extra_args: Dict = None,
verbose: bool = False,
logger: Optional[AsyncLogger] = None,
):
super().__init__(None)
self.provider = provider
self.api_token = (
api_token
or PROVIDER_MODELS.get(provider, "no-token")
or os.getenv("OPENAI_API_KEY")
)
self.instruction = instruction
self.chunk_token_threshold = chunk_token_threshold
self.overlap_rate = overlap_rate
self.word_token_rate = word_token_rate
self.base_url = base_url
self.api_base = api_base or base_url
self.extra_args = extra_args or {}
self.verbose = verbose
# Setup logger with custom styling for LLM operations
if logger:
self.logger = logger
elif verbose:
self.logger = AsyncLogger(
verbose=True,
icons={
**AsyncLogger.DEFAULT_ICONS,
"LLM": "", # Star for LLM operations
"CHUNK": "", # Diamond for chunks
"CACHE": "", # Lightning for cache operations
},
colors={
**AsyncLogger.DEFAULT_COLORS,
LogLevel.INFO: Fore.MAGENTA + Style.DIM, # Dimmed purple for LLM ops
}
)
else:
self.logger = None
self.usages = []
self.total_usage = TokenUsage()
def _get_cache_key(self, html: str, instruction: str) -> str:
"""Generate a unique cache key based on HTML and instruction"""
content = f"{html}{instruction}"
return hashlib.md5(content.encode()).hexdigest()
def _merge_chunks(self, text: str) -> List[str]:
"""Split text into chunks with overlap"""
# Calculate tokens and sections
total_tokens = len(text.split()) * self.word_token_rate
num_sections = max(1, math.floor(total_tokens / self.chunk_token_threshold))
adjusted_chunk_threshold = total_tokens / num_sections
# Split into words
words = text.split()
chunks = []
current_chunk = []
current_token_count = 0
for word in words:
word_tokens = len(word) * self.word_token_rate
if current_token_count + word_tokens <= adjusted_chunk_threshold:
current_chunk.append(word)
current_token_count += word_tokens
else:
# Add overlap if not the last chunk
if chunks and self.overlap_rate > 0:
overlap_size = int(len(current_chunk) * self.overlap_rate)
current_chunk.extend(current_chunk[-overlap_size:])
chunks.append(" ".join(current_chunk))
current_chunk = [word]
current_token_count = word_tokens
if current_chunk:
chunks.append(" ".join(current_chunk))
return chunks
def filter_content(self, html: str, ignore_cache: bool = False) -> List[str]:
if not html or not isinstance(html, str):
return []
if self.logger:
self.logger.info(
"Starting LLM content filtering process",
tag="LLM",
params={"provider": self.provider},
colors={"provider": Fore.CYAN}
)
# Cache handling
cache_dir = Path(get_home_folder()) / "llm_cache" / "content_filter"
cache_dir.mkdir(parents=True, exist_ok=True)
cache_key = self._get_cache_key(html, self.instruction or "")
cache_file = cache_dir / f"{cache_key}.json"
if not ignore_cache and cache_file.exists():
if self.logger:
self.logger.info("Found cached result", tag="CACHE")
try:
with cache_file.open('r') as f:
cached_data = json.load(f)
usage = TokenUsage(**cached_data['usage'])
self.usages.append(usage)
self.total_usage.completion_tokens += usage.completion_tokens
self.total_usage.prompt_tokens += usage.prompt_tokens
self.total_usage.total_tokens += usage.total_tokens
return cached_data['blocks']
except Exception as e:
if self.logger:
self.logger.error(f"Cache read error: {str(e)}", tag="CACHE")
# Split into chunks
html_chunks = self._merge_chunks(html)
if self.logger:
self.logger.info(
"Split content into {chunk_count} chunks",
tag="CHUNK",
params={"chunk_count": len(html_chunks)},
colors={"chunk_count": Fore.YELLOW}
)
extracted_content = []
start_time = time.time()
# Process chunks in parallel
with ThreadPoolExecutor(max_workers=4) as executor:
futures = []
for i, chunk in enumerate(html_chunks):
if self.logger:
self.logger.debug(
"Processing chunk {chunk_num}/{total_chunks}",
tag="CHUNK",
params={
"chunk_num": i + 1,
"total_chunks": len(html_chunks)
}
)
prompt_variables = {
"HTML": escape_json_string(sanitize_html(chunk)),
"REQUEST": self.instruction or "Convert this HTML into clean, relevant markdown, removing any noise or irrelevant content."
}
prompt = PROMPT_FILTER_CONTENT
for var, value in prompt_variables.items():
prompt = prompt.replace("{" + var + "}", value)
future = executor.submit(
perform_completion_with_backoff,
self.provider,
prompt,
self.api_token,
base_url=self.api_base,
extra_args=self.extra_args
)
futures.append((i, future))
# Collect results in order
ordered_results = []
for i, future in sorted(futures):
try:
response = future.result()
# Track usage
usage = TokenUsage(
completion_tokens=response.usage.completion_tokens,
prompt_tokens=response.usage.prompt_tokens,
total_tokens=response.usage.total_tokens,
completion_tokens_details=response.usage.completion_tokens_details.__dict__
if response.usage.completion_tokens_details else {},
prompt_tokens_details=response.usage.prompt_tokens_details.__dict__
if response.usage.prompt_tokens_details else {},
)
self.usages.append(usage)
self.total_usage.completion_tokens += usage.completion_tokens
self.total_usage.prompt_tokens += usage.prompt_tokens
self.total_usage.total_tokens += usage.total_tokens
blocks = extract_xml_data(["content"], response.choices[0].message.content)["content"]
if blocks:
ordered_results.append(blocks)
if self.logger:
self.logger.success(
"Successfully processed chunk {chunk_num}",
tag="CHUNK",
params={"chunk_num": i + 1}
)
except Exception as e:
if self.logger:
self.logger.error(
"Error processing chunk {chunk_num}: {error}",
tag="CHUNK",
params={
"chunk_num": i + 1,
"error": str(e)
}
)
end_time = time.time()
if self.logger:
self.logger.success(
"Completed processing in {time:.2f}s",
tag="LLM",
params={"time": end_time - start_time},
colors={"time": Fore.YELLOW}
)
result = ordered_results if ordered_results else []
# Cache the final result
cache_data = {
'blocks': result,
'usage': self.total_usage.__dict__
}
with cache_file.open('w') as f:
json.dump(cache_data, f)
if self.logger:
self.logger.info("Cached results for future use", tag="CACHE")
return result
def show_usage(self) -> None:
"""Print usage statistics"""
print("\n=== Token Usage Summary ===")
print(f"{'Type':<15} {'Count':>12}")
print("-" * 30)
print(f"{'Completion':<15} {self.total_usage.completion_tokens:>12,}")
print(f"{'Prompt':<15} {self.total_usage.prompt_tokens:>12,}")
print(f"{'Total':<15} {self.total_usage.total_tokens:>12,}")
if self.usages:
print("\n=== Usage History ===")
print(f"{'Request #':<10} {'Completion':>12} {'Prompt':>12} {'Total':>12}")
print("-" * 48)
for i, usage in enumerate(self.usages, 1):
print(
f"{i:<10} {usage.completion_tokens:>12,} "
f"{usage.prompt_tokens:>12,} {usage.total_tokens:>12,}"
)

File diff suppressed because it is too large Load Diff

View File

@@ -0,0 +1,301 @@
from abc import ABC, abstractmethod
from typing import Dict, Any
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor
import asyncio, requests, re, os
from .config import *
from bs4 import element, NavigableString, Comment
from urllib.parse import urljoin
from requests.exceptions import InvalidSchema
from .utils import (
sanitize_input_encode,
sanitize_html,
extract_metadata,
InvalidCSSSelectorError,
CustomHTML2Text
)
class ContentScrappingStrategy(ABC):
@abstractmethod
def scrap(self, url: str, html: str, **kwargs) -> Dict[str, Any]:
pass
@abstractmethod
async def ascrap(self, url: str, html: str, **kwargs) -> Dict[str, Any]:
pass
class WebScrappingStrategy(ContentScrappingStrategy):
def scrap(self, url: str, html: str, **kwargs) -> Dict[str, Any]:
return self._get_content_of_website_optimized(url, html, is_async=False, **kwargs)
async def ascrap(self, url: str, html: str, **kwargs) -> Dict[str, Any]:
return await asyncio.to_thread(self._get_content_of_website_optimized, url, html, **kwargs)
def _get_content_of_website_optimized(self, url: str, html: str, word_count_threshold: int = MIN_WORD_THRESHOLD, css_selector: str = None, **kwargs) -> Dict[str, Any]:
if not html:
return None
soup = BeautifulSoup(html, 'html.parser')
body = soup.body
image_description_min_word_threshold = kwargs.get('image_description_min_word_threshold', IMAGE_DESCRIPTION_MIN_WORD_THRESHOLD)
for tag in kwargs.get('excluded_tags', []) or []:
for el in body.select(tag):
el.decompose()
if css_selector:
selected_elements = body.select(css_selector)
if not selected_elements:
return {
'markdown': '',
'cleaned_html': '',
'success': True,
'media': {'images': [], 'videos': [], 'audios': []},
'links': {'internal': [], 'external': []},
'metadata': {},
'message': f"No elements found for CSS selector: {css_selector}"
}
# raise InvalidCSSSelectorError(f"Invalid CSS selector, No elements found for CSS selector: {css_selector}")
body = soup.new_tag('div')
for el in selected_elements:
body.append(el)
links = {'internal': [], 'external': []}
media = {'images': [], 'videos': [], 'audios': []}
# Extract meaningful text for media files from closest parent
def find_closest_parent_with_useful_text(tag):
current_tag = tag
while current_tag:
current_tag = current_tag.parent
# Get the text content of the parent tag
if current_tag:
text_content = current_tag.get_text(separator=' ',strip=True)
# Check if the text content has at least word_count_threshold
if len(text_content.split()) >= image_description_min_word_threshold:
return text_content
return None
def process_image(img, url, index, total_images):
#Check if an image has valid display and inside undesired html elements
def is_valid_image(img, parent, parent_classes):
style = img.get('style', '')
src = img.get('src', '')
classes_to_check = ['button', 'icon', 'logo']
tags_to_check = ['button', 'input']
return all([
'display:none' not in style,
src,
not any(s in var for var in [src, img.get('alt', ''), *parent_classes] for s in classes_to_check),
parent.name not in tags_to_check
])
#Score an image for it's usefulness
def score_image_for_usefulness(img, base_url, index, images_count):
# Function to parse image height/width value and units
def parse_dimension(dimension):
if dimension:
match = re.match(r"(\d+)(\D*)", dimension)
if match:
number = int(match.group(1))
unit = match.group(2) or 'px' # Default unit is 'px' if not specified
return number, unit
return None, None
# Fetch image file metadata to extract size and extension
def fetch_image_file_size(img, base_url):
#If src is relative path construct full URL, if not it may be CDN URL
img_url = urljoin(base_url,img.get('src'))
try:
response = requests.head(img_url)
if response.status_code == 200:
return response.headers.get('Content-Length',None)
else:
print(f"Failed to retrieve file size for {img_url}")
return None
except InvalidSchema as e:
return None
finally:
return
image_height = img.get('height')
height_value, height_unit = parse_dimension(image_height)
image_width = img.get('width')
width_value, width_unit = parse_dimension(image_width)
image_size = 0 #int(fetch_image_file_size(img,base_url) or 0)
image_format = os.path.splitext(img.get('src',''))[1].lower()
# Remove . from format
image_format = image_format.strip('.').split('?')[0]
score = 0
if height_value:
if height_unit == 'px' and height_value > 150:
score += 1
if height_unit in ['%','vh','vmin','vmax'] and height_value >30:
score += 1
if width_value:
if width_unit == 'px' and width_value > 150:
score += 1
if width_unit in ['%','vh','vmin','vmax'] and width_value >30:
score += 1
if image_size > 10000:
score += 1
if img.get('alt') != '':
score+=1
if any(image_format==format for format in ['jpg','png','webp']):
score+=1
if index/images_count<0.5:
score+=1
return score
if not is_valid_image(img, img.parent, img.parent.get('class', [])):
return None
score = score_image_for_usefulness(img, url, index, total_images)
if score <= IMAGE_SCORE_THRESHOLD:
return None
return {
'src': img.get('src', ''),
'data-src': img.get('data-src', ''),
'alt': img.get('alt', ''),
'desc': find_closest_parent_with_useful_text(img),
'score': score,
'type': 'image'
}
def process_element(element: element.PageElement) -> bool:
try:
if isinstance(element, NavigableString):
if isinstance(element, Comment):
element.extract()
return False
# if element.name == 'img':
# process_image(element, url, 0, 1)
# return True
if element.name in ['script', 'style', 'link', 'meta', 'noscript']:
element.decompose()
return False
keep_element = False
if element.name == 'a' and element.get('href'):
href = element['href']
url_base = url.split('/')[2]
link_data = {'href': href, 'text': element.get_text()}
if href.startswith('http') and url_base not in href:
links['external'].append(link_data)
else:
links['internal'].append(link_data)
keep_element = True
elif element.name == 'img':
return True # Always keep image elements
elif element.name in ['video', 'audio']:
media[f"{element.name}s"].append({
'src': element.get('src'),
'alt': element.get('alt'),
'type': element.name,
'description': find_closest_parent_with_useful_text(element)
})
source_tags = element.find_all('source')
for source_tag in source_tags:
media[f"{element.name}s"].append({
'src': source_tag.get('src'),
'alt': element.get('alt'),
'type': element.name,
'description': find_closest_parent_with_useful_text(element)
})
return True # Always keep video and audio elements
if element.name != 'pre':
if element.name in ['b', 'i', 'u', 'span', 'del', 'ins', 'sub', 'sup', 'strong', 'em', 'code', 'kbd', 'var', 's', 'q', 'abbr', 'cite', 'dfn', 'time', 'small', 'mark']:
if kwargs.get('only_text', False):
element.replace_with(element.get_text())
else:
element.unwrap()
elif element.name != 'img':
element.attrs = {}
# Process children
for child in list(element.children):
if isinstance(child, NavigableString) and not isinstance(child, Comment):
if len(child.strip()) > 0:
keep_element = True
else:
if process_element(child):
keep_element = True
# Check word count
if not keep_element:
word_count = len(element.get_text(strip=True).split())
keep_element = word_count >= word_count_threshold
if not keep_element:
element.decompose()
return keep_element
except Exception as e:
print('Error processing element:', str(e))
return False
#process images by filtering and extracting contextual text from the page
# imgs = body.find_all('img')
# media['images'] = [
# result for result in
# (process_image(img, url, i, len(imgs)) for i, img in enumerate(imgs))
# if result is not None
# ]
process_element(body)
# # Process images using ThreadPoolExecutor
imgs = body.find_all('img')
with ThreadPoolExecutor() as executor:
image_results = list(executor.map(process_image, imgs, [url]*len(imgs), range(len(imgs)), [len(imgs)]*len(imgs)))
media['images'] = [result for result in image_results if result is not None]
def flatten_nested_elements(node):
if isinstance(node, NavigableString):
return node
if len(node.contents) == 1 and isinstance(node.contents[0], element.Tag) and node.contents[0].name == node.name:
return flatten_nested_elements(node.contents[0])
node.contents = [flatten_nested_elements(child) for child in node.contents]
return node
body = flatten_nested_elements(body)
base64_pattern = re.compile(r'data:image/[^;]+;base64,([^"]+)')
for img in imgs:
src = img.get('src', '')
if base64_pattern.match(src):
# Replace base64 data with empty string
img['src'] = base64_pattern.sub('', src)
cleaned_html = str(body).replace('\n\n', '\n').replace(' ', ' ')
h = CustomHTML2Text()
h.ignore_links = True
h.body_width = 0
try:
markdown = h.handle(cleaned_html)
except Exception as e:
markdown = h.handle(sanitize_html(cleaned_html))
markdown = markdown.replace(' ```', '```')
try:
meta = extract_metadata(html, soup)
except Exception as e:
print('Error extracting metadata:', str(e))
meta = {}
cleaned_html = sanitize_html(cleaned_html)
return {
'markdown': markdown,
'cleaned_html': cleaned_html,
'success': True,
'media': media,
'links': links,
'metadata': meta
}

View File

@@ -15,53 +15,54 @@ import logging, time
import base64
from PIL import Image, ImageDraw, ImageFont
from io import BytesIO
from typing import Callable
from typing import List, Callable
import requests
import os
from pathlib import Path
from .utils import *
logger = logging.getLogger("selenium.webdriver.remote.remote_connection")
logger = logging.getLogger('selenium.webdriver.remote.remote_connection')
logger.setLevel(logging.WARNING)
logger_driver = logging.getLogger("selenium.webdriver.common.service")
logger_driver = logging.getLogger('selenium.webdriver.common.service')
logger_driver.setLevel(logging.WARNING)
urllib3_logger = logging.getLogger("urllib3.connectionpool")
urllib3_logger = logging.getLogger('urllib3.connectionpool')
urllib3_logger.setLevel(logging.WARNING)
# Disable http.client logging
http_client_logger = logging.getLogger("http.client")
http_client_logger = logging.getLogger('http.client')
http_client_logger.setLevel(logging.WARNING)
# Disable driver_finder and service logging
driver_finder_logger = logging.getLogger("selenium.webdriver.common.driver_finder")
driver_finder_logger = logging.getLogger('selenium.webdriver.common.driver_finder')
driver_finder_logger.setLevel(logging.WARNING)
class CrawlerStrategy(ABC):
@abstractmethod
def crawl(self, url: str, **kwargs) -> str:
pass
@abstractmethod
def take_screenshot(self, save_path: str):
pass
@abstractmethod
def update_user_agent(self, user_agent: str):
pass
@abstractmethod
def set_hook(self, hook_type: str, hook: Callable):
pass
class CloudCrawlerStrategy(CrawlerStrategy):
def __init__(self, use_cached_html=False):
def __init__(self, use_cached_html = False):
super().__init__()
self.use_cached_html = use_cached_html
def crawl(self, url: str) -> str:
data = {
"urls": [url],
@@ -75,7 +76,6 @@ class CloudCrawlerStrategy(CrawlerStrategy):
html = response["results"][0]["html"]
return sanitize_input_encode(html)
class LocalSeleniumCrawlerStrategy(CrawlerStrategy):
def __init__(self, use_cached_html=False, js_code=None, **kwargs):
super().__init__()
@@ -87,25 +87,20 @@ class LocalSeleniumCrawlerStrategy(CrawlerStrategy):
if kwargs.get("user_agent"):
self.options.add_argument("--user-agent=" + kwargs.get("user_agent"))
else:
user_agent = kwargs.get(
"user_agent",
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
)
user_agent = kwargs.get("user_agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36")
self.options.add_argument(f"--user-agent={user_agent}")
self.options.add_argument(
"user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
)
self.options.add_argument("user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36")
self.options.headless = kwargs.get("headless", True)
if self.options.headless:
self.options.add_argument("--headless")
self.options.add_argument("--disable-gpu")
self.options.add_argument("--disable-gpu")
self.options.add_argument("--window-size=1920,1080")
self.options.add_argument("--no-sandbox")
self.options.add_argument("--disable-dev-shm-usage")
self.options.add_argument("--disable-blink-features=AutomationControlled")
self.options.add_argument("--disable-blink-features=AutomationControlled")
# self.options.add_argument("--disable-dev-shm-usage")
self.options.add_argument("--disable-gpu")
# self.options.add_argument("--disable-extensions")
@@ -125,45 +120,48 @@ class LocalSeleniumCrawlerStrategy(CrawlerStrategy):
self.use_cached_html = use_cached_html
self.js_code = js_code
self.verbose = kwargs.get("verbose", False)
# Hooks
self.hooks = {
"on_driver_created": None,
"on_user_agent_updated": None,
"before_get_url": None,
"after_get_url": None,
"before_return_html": None,
'on_driver_created': None,
'on_user_agent_updated': None,
'before_get_url': None,
'after_get_url': None,
'before_return_html': None
}
# chromedriver_autoinstaller.install()
# import chromedriver_autoinstaller
# crawl4ai_folder = os.path.join(os.getenv("CRAWL4_AI_BASE_DIRECTORY", Path.home()), ".crawl4ai")
# crawl4ai_folder = os.path.join(Path.home(), ".crawl4ai")
# driver = webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()), options=self.options)
# chromedriver_path = chromedriver_autoinstaller.install()
# chromedriver_path = chromedriver_autoinstaller.utils.download_chromedriver()
# self.service = Service(chromedriver_autoinstaller.install())
# chromedriver_path = ChromeDriverManager().install()
# self.service = Service(chromedriver_path)
# self.service.log_path = "NUL"
# self.driver = webdriver.Chrome(service=self.service, options=self.options)
# Use selenium-manager (built into Selenium 4.10.0+)
self.service = Service()
self.driver = webdriver.Chrome(options=self.options)
self.driver = self.execute_hook("on_driver_created", self.driver)
self.driver = self.execute_hook('on_driver_created', self.driver)
if kwargs.get("cookies"):
for cookie in kwargs.get("cookies"):
self.driver.add_cookie(cookie)
def set_hook(self, hook_type: str, hook: Callable):
if hook_type in self.hooks:
self.hooks[hook_type] = hook
else:
raise ValueError(f"Invalid hook type: {hook_type}")
def execute_hook(self, hook_type: str, *args):
hook = self.hooks.get(hook_type)
if hook:
@@ -172,9 +170,7 @@ class LocalSeleniumCrawlerStrategy(CrawlerStrategy):
if isinstance(result, webdriver.Chrome):
return result
else:
raise TypeError(
f"Hook {hook_type} must return an instance of webdriver.Chrome or None."
)
raise TypeError(f"Hook {hook_type} must return an instance of webdriver.Chrome or None.")
# If the hook returns None or there is no hook, return self.driver
return self.driver
@@ -182,77 +178,60 @@ class LocalSeleniumCrawlerStrategy(CrawlerStrategy):
self.options.add_argument(f"user-agent={user_agent}")
self.driver.quit()
self.driver = webdriver.Chrome(service=self.service, options=self.options)
self.driver = self.execute_hook("on_user_agent_updated", self.driver)
self.driver = self.execute_hook('on_user_agent_updated', self.driver)
def set_custom_headers(self, headers: dict):
# Enable Network domain for sending headers
self.driver.execute_cdp_cmd("Network.enable", {})
self.driver.execute_cdp_cmd('Network.enable', {})
# Set extra HTTP headers
self.driver.execute_cdp_cmd("Network.setExtraHTTPHeaders", {"headers": headers})
self.driver.execute_cdp_cmd('Network.setExtraHTTPHeaders', {'headers': headers})
def _ensure_page_load(self, max_checks=6, check_interval=0.01):
def _ensure_page_load(self, max_checks=6, check_interval=0.01):
initial_length = len(self.driver.page_source)
for ix in range(max_checks):
# print(f"Checking page load: {ix}")
time.sleep(check_interval)
current_length = len(self.driver.page_source)
if current_length != initial_length:
break
return self.driver.page_source
def crawl(self, url: str, **kwargs) -> str:
# Create md5 hash of the URL
import hashlib
url_hash = hashlib.md5(url.encode()).hexdigest()
if self.use_cached_html:
cache_file_path = os.path.join(
os.getenv("CRAWL4_AI_BASE_DIRECTORY", Path.home()),
".crawl4ai",
"cache",
url_hash,
)
cache_file_path = os.path.join(Path.home(), ".crawl4ai", "cache", url_hash)
if os.path.exists(cache_file_path):
with open(cache_file_path, "r") as f:
return sanitize_input_encode(f.read())
try:
self.driver = self.execute_hook("before_get_url", self.driver)
self.driver = self.execute_hook('before_get_url', self.driver)
if self.verbose:
print(f"[LOG] 🕸️ Crawling {url} using LocalSeleniumCrawlerStrategy...")
self.driver.get(url) # <html><head></head><body></body></html>
self.driver.get(url) #<html><head></head><body></body></html>
WebDriverWait(self.driver, 20).until(
lambda d: d.execute_script("return document.readyState") == "complete"
lambda d: d.execute_script('return document.readyState') == 'complete'
)
WebDriverWait(self.driver, 10).until(
EC.presence_of_all_elements_located((By.TAG_NAME, "body"))
)
self.driver.execute_script(
"window.scrollTo(0, document.body.scrollHeight);"
)
self.driver = self.execute_hook("after_get_url", self.driver)
html = sanitize_input_encode(
self._ensure_page_load()
) # self.driver.page_source
can_not_be_done_headless = (
False # Look at my creativity for naming variables
)
self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
self.driver = self.execute_hook('after_get_url', self.driver)
html = sanitize_input_encode(self._ensure_page_load()) # self.driver.page_source
can_not_be_done_headless = False # Look at my creativity for naming variables
# TODO: Very ugly approach, but promise to change it!
if (
kwargs.get("bypass_headless", False)
or html == "<html><head></head><body></body></html>"
):
print(
"[LOG] 🙌 Page could not be loaded in headless mode. Trying non-headless mode..."
)
if kwargs.get('bypass_headless', False) or html == "<html><head></head><body></body></html>":
print("[LOG] 🙌 Page could not be loaded in headless mode. Trying non-headless mode...")
can_not_be_done_headless = True
options = Options()
options.headless = False
@@ -260,31 +239,27 @@ class LocalSeleniumCrawlerStrategy(CrawlerStrategy):
options.add_argument("--window-size=5,5")
driver = webdriver.Chrome(service=self.service, options=options)
driver.get(url)
self.driver = self.execute_hook("after_get_url", driver)
self.driver = self.execute_hook('after_get_url', driver)
html = sanitize_input_encode(driver.page_source)
driver.quit()
# Execute JS code if provided
self.js_code = kwargs.get("js_code", self.js_code)
if self.js_code and type(self.js_code) == str:
self.driver.execute_script(self.js_code)
# Optionally, wait for some condition after executing the JS code
WebDriverWait(self.driver, 10).until(
lambda driver: driver.execute_script("return document.readyState")
== "complete"
lambda driver: driver.execute_script("return document.readyState") == "complete"
)
elif self.js_code and type(self.js_code) == list:
for js in self.js_code:
self.driver.execute_script(js)
WebDriverWait(self.driver, 10).until(
lambda driver: driver.execute_script(
"return document.readyState"
)
== "complete"
lambda driver: driver.execute_script("return document.readyState") == "complete"
)
# Optionally, wait for some condition after executing the JS code : Contributed by (https://github.com/jonymusky)
wait_for = kwargs.get("wait_for", False)
wait_for = kwargs.get('wait_for', False)
if wait_for:
if callable(wait_for):
print("[LOG] 🔄 Waiting for condition...")
@@ -293,37 +268,32 @@ class LocalSeleniumCrawlerStrategy(CrawlerStrategy):
print("[LOG] 🔄 Waiting for condition...")
WebDriverWait(self.driver, 20).until(
EC.presence_of_element_located((By.CSS_SELECTOR, wait_for))
)
)
if not can_not_be_done_headless:
html = sanitize_input_encode(self.driver.page_source)
self.driver = self.execute_hook("before_return_html", self.driver, html)
self.driver = self.execute_hook('before_return_html', self.driver, html)
# Store in cache
cache_file_path = os.path.join(
os.getenv("CRAWL4_AI_BASE_DIRECTORY", Path.home()),
".crawl4ai",
"cache",
url_hash,
)
cache_file_path = os.path.join(Path.home(), ".crawl4ai", "cache", url_hash)
with open(cache_file_path, "w", encoding="utf-8") as f:
f.write(html)
if self.verbose:
print(f"[LOG] ✅ Crawled {url} successfully!")
return html
except InvalidArgumentException as e:
if not hasattr(e, "msg"):
except InvalidArgumentException:
if not hasattr(e, 'msg'):
e.msg = sanitize_input_encode(str(e))
raise InvalidArgumentException(f"Failed to crawl {url}: {e.msg}")
except WebDriverException as e:
# If e does nlt have msg attribute create it and set it to str(e)
if not hasattr(e, "msg"):
if not hasattr(e, 'msg'):
e.msg = sanitize_input_encode(str(e))
raise WebDriverException(f"Failed to crawl {url}: {e.msg}")
raise WebDriverException(f"Failed to crawl {url}: {e.msg}")
except Exception as e:
if not hasattr(e, "msg"):
if not hasattr(e, 'msg'):
e.msg = sanitize_input_encode(str(e))
raise Exception(f"Failed to crawl {url}: {e.msg}")
@@ -331,9 +301,7 @@ class LocalSeleniumCrawlerStrategy(CrawlerStrategy):
try:
# Get the dimensions of the page
total_width = self.driver.execute_script("return document.body.scrollWidth")
total_height = self.driver.execute_script(
"return document.body.scrollHeight"
)
total_height = self.driver.execute_script("return document.body.scrollHeight")
# Set the window size to the dimensions of the page
self.driver.set_window_size(total_width, total_height)
@@ -345,27 +313,25 @@ class LocalSeleniumCrawlerStrategy(CrawlerStrategy):
image = Image.open(BytesIO(screenshot))
# Convert image to RGB mode (this will handle both RGB and RGBA images)
rgb_image = image.convert("RGB")
rgb_image = image.convert('RGB')
# Convert to JPEG and compress
buffered = BytesIO()
rgb_image.save(buffered, format="JPEG", quality=85)
img_base64 = base64.b64encode(buffered.getvalue()).decode("utf-8")
img_base64 = base64.b64encode(buffered.getvalue()).decode('utf-8')
if self.verbose:
print("[LOG] 📸 Screenshot taken and converted to base64")
print(f"[LOG] 📸 Screenshot taken and converted to base64")
return img_base64
except Exception as e:
error_message = sanitize_input_encode(
f"Failed to take screenshot: {str(e)}"
)
error_message = sanitize_input_encode(f"Failed to take screenshot: {str(e)}")
print(error_message)
# Generate an image with black background
img = Image.new("RGB", (800, 600), color="black")
img = Image.new('RGB', (800, 600), color='black')
draw = ImageDraw.Draw(img)
# Load a font
try:
font = ImageFont.truetype("arial.ttf", 40)
@@ -379,16 +345,16 @@ class LocalSeleniumCrawlerStrategy(CrawlerStrategy):
# Calculate text position
text_position = (10, 10)
# Draw the text on the image
draw.text(text_position, wrapped_text, fill=text_color, font=font)
# Convert to base64
buffered = BytesIO()
img.save(buffered, format="JPEG")
img_base64 = base64.b64encode(buffered.getvalue()).decode("utf-8")
img_base64 = base64.b64encode(buffered.getvalue()).decode('utf-8')
return img_base64
def quit(self):
self.driver.quit()

View File

@@ -3,17 +3,15 @@ from pathlib import Path
import sqlite3
from typing import Optional, Tuple
DB_PATH = os.path.join(os.getenv("CRAWL4_AI_BASE_DIRECTORY", Path.home()), ".crawl4ai")
DB_PATH = os.path.join(Path.home(), ".crawl4ai")
os.makedirs(DB_PATH, exist_ok=True)
DB_PATH = os.path.join(DB_PATH, "crawl4ai.db")
def init_db():
global DB_PATH
conn = sqlite3.connect(DB_PATH)
cursor = conn.cursor()
cursor.execute(
"""
cursor.execute('''
CREATE TABLE IF NOT EXISTS crawled_data (
url TEXT PRIMARY KEY,
html TEXT,
@@ -26,42 +24,31 @@ def init_db():
metadata TEXT DEFAULT "{}",
screenshot TEXT DEFAULT ""
)
"""
)
''')
conn.commit()
conn.close()
def alter_db_add_screenshot(new_column: str = "media"):
check_db_path()
try:
conn = sqlite3.connect(DB_PATH)
cursor = conn.cursor()
cursor.execute(
f'ALTER TABLE crawled_data ADD COLUMN {new_column} TEXT DEFAULT ""'
)
cursor.execute(f'ALTER TABLE crawled_data ADD COLUMN {new_column} TEXT DEFAULT ""')
conn.commit()
conn.close()
except Exception as e:
print(f"Error altering database to add screenshot column: {e}")
def check_db_path():
if not DB_PATH:
raise ValueError("Database path is not set or is empty.")
def get_cached_url(
url: str,
) -> Optional[Tuple[str, str, str, str, str, str, str, bool, str]]:
def get_cached_url(url: str) -> Optional[Tuple[str, str, str, str, str, str, str, bool, str]]:
check_db_path()
try:
conn = sqlite3.connect(DB_PATH)
cursor = conn.cursor()
cursor.execute(
"SELECT url, html, cleaned_html, markdown, extracted_content, success, media, links, metadata, screenshot FROM crawled_data WHERE url = ?",
(url,),
)
cursor.execute('SELECT url, html, cleaned_html, markdown, extracted_content, success, media, links, metadata, screenshot FROM crawled_data WHERE url = ?', (url,))
result = cursor.fetchone()
conn.close()
return result
@@ -69,25 +56,12 @@ def get_cached_url(
print(f"Error retrieving cached URL: {e}")
return None
def cache_url(
url: str,
html: str,
cleaned_html: str,
markdown: str,
extracted_content: str,
success: bool,
media: str = "{}",
links: str = "{}",
metadata: str = "{}",
screenshot: str = "",
):
def cache_url(url: str, html: str, cleaned_html: str, markdown: str, extracted_content: str, success: bool, media : str = "{}", links : str = "{}", metadata : str = "{}", screenshot: str = ""):
check_db_path()
try:
conn = sqlite3.connect(DB_PATH)
cursor = conn.cursor()
cursor.execute(
"""
cursor.execute('''
INSERT INTO crawled_data (url, html, cleaned_html, markdown, extracted_content, success, media, links, metadata, screenshot)
VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
ON CONFLICT(url) DO UPDATE SET
@@ -100,32 +74,18 @@ def cache_url(
links = excluded.links,
metadata = excluded.metadata,
screenshot = excluded.screenshot
""",
(
url,
html,
cleaned_html,
markdown,
extracted_content,
success,
media,
links,
metadata,
screenshot,
),
)
''', (url, html, cleaned_html, markdown, extracted_content, success, media, links, metadata, screenshot))
conn.commit()
conn.close()
except Exception as e:
print(f"Error caching URL: {e}")
def get_total_count() -> int:
check_db_path()
try:
conn = sqlite3.connect(DB_PATH)
cursor = conn.cursor()
cursor.execute("SELECT COUNT(*) FROM crawled_data")
cursor.execute('SELECT COUNT(*) FROM crawled_data')
result = cursor.fetchone()
conn.close()
return result[0]
@@ -133,48 +93,43 @@ def get_total_count() -> int:
print(f"Error getting total count: {e}")
return 0
def clear_db():
check_db_path()
try:
conn = sqlite3.connect(DB_PATH)
cursor = conn.cursor()
cursor.execute("DELETE FROM crawled_data")
cursor.execute('DELETE FROM crawled_data')
conn.commit()
conn.close()
except Exception as e:
print(f"Error clearing database: {e}")
def flush_db():
check_db_path()
try:
conn = sqlite3.connect(DB_PATH)
cursor = conn.cursor()
cursor.execute("DROP TABLE crawled_data")
cursor.execute('DROP TABLE crawled_data')
conn.commit()
conn.close()
except Exception as e:
print(f"Error flushing database: {e}")
def update_existing_records(new_column: str = "media", default_value: str = "{}"):
check_db_path()
try:
conn = sqlite3.connect(DB_PATH)
cursor = conn.cursor()
cursor.execute(
f'UPDATE crawled_data SET {new_column} = "{default_value}" WHERE screenshot IS NULL'
)
cursor.execute(f'UPDATE crawled_data SET {new_column} = "{default_value}" WHERE screenshot IS NULL')
conn.commit()
conn.close()
except Exception as e:
print(f"Error updating existing records: {e}")
if __name__ == "__main__":
# Delete the existing database file
if os.path.exists(DB_PATH):
os.remove(DB_PATH)
init_db()
init_db()
# alter_db_add_screenshot("COL_NAME")

View File

@@ -1,75 +0,0 @@
import requests
import shutil
from pathlib import Path
from crawl4ai.async_logger import AsyncLogger
from crawl4ai.llmtxt import AsyncLLMTextManager
class DocsManager:
def __init__(self, logger=None):
self.docs_dir = Path.home() / ".crawl4ai" / "docs"
self.local_docs = Path(__file__).parent.parent / "docs" / "llm.txt"
self.docs_dir.mkdir(parents=True, exist_ok=True)
self.logger = logger or AsyncLogger(verbose=True)
self.llm_text = AsyncLLMTextManager(self.docs_dir, self.logger)
async def ensure_docs_exist(self):
"""Fetch docs if not present"""
if not any(self.docs_dir.iterdir()):
await self.fetch_docs()
async def fetch_docs(self) -> bool:
"""Copy from local docs or download from GitHub"""
try:
# Try local first
if self.local_docs.exists() and (
any(self.local_docs.glob("*.md"))
or any(self.local_docs.glob("*.tokens"))
):
# Empty the local docs directory
for file_path in self.docs_dir.glob("*.md"):
file_path.unlink()
# for file_path in self.docs_dir.glob("*.tokens"):
# file_path.unlink()
for file_path in self.local_docs.glob("*.md"):
shutil.copy2(file_path, self.docs_dir / file_path.name)
# for file_path in self.local_docs.glob("*.tokens"):
# shutil.copy2(file_path, self.docs_dir / file_path.name)
return True
# Fallback to GitHub
response = requests.get(
"https://api.github.com/repos/unclecode/crawl4ai/contents/docs/llm.txt",
headers={"Accept": "application/vnd.github.v3+json"},
)
response.raise_for_status()
for item in response.json():
if item["type"] == "file" and item["name"].endswith(".md"):
content = requests.get(item["download_url"]).text
with open(self.docs_dir / item["name"], "w", encoding="utf-8") as f:
f.write(content)
return True
except Exception as e:
self.logger.error(f"Failed to fetch docs: {str(e)}")
raise
def list(self) -> list[str]:
"""List available topics"""
names = [file_path.stem for file_path in self.docs_dir.glob("*.md")]
# Remove [0-9]+_ prefix
names = [name.split("_", 1)[1] if name[0].isdigit() else name for name in names]
# Exclude those end with .xs.md and .q.md
names = [
name
for name in names
if not name.endswith(".xs") and not name.endswith(".q")
]
return names
def generate(self, sections, mode="extended"):
return self.llm_text.generate(sections, mode)
def search(self, query: str, top_k: int = 5):
return self.llm_text.search(query, top_k)

File diff suppressed because it is too large Load Diff

File diff suppressed because it is too large Load Diff

View File

@@ -1,3 +0,0 @@
from .cli import main
main()

View File

@@ -1,3 +0,0 @@
class OutCallback:
def __call__(self, s: str) -> None:
...

View File

@@ -1,330 +0,0 @@
import argparse
import sys
from . import HTML2Text, __version__, config
def main() -> None:
baseurl = ""
class bcolors:
HEADER = "\033[95m"
OKBLUE = "\033[94m"
OKGREEN = "\033[92m"
WARNING = "\033[93m"
FAIL = "\033[91m"
ENDC = "\033[0m"
BOLD = "\033[1m"
UNDERLINE = "\033[4m"
p = argparse.ArgumentParser()
p.add_argument(
"--default-image-alt",
dest="default_image_alt",
default=config.DEFAULT_IMAGE_ALT,
help="The default alt string for images with missing ones",
)
p.add_argument(
"--pad-tables",
dest="pad_tables",
action="store_true",
default=config.PAD_TABLES,
help="pad the cells to equal column width in tables",
)
p.add_argument(
"--no-wrap-links",
dest="wrap_links",
action="store_false",
default=config.WRAP_LINKS,
help="don't wrap links during conversion",
)
p.add_argument(
"--wrap-list-items",
dest="wrap_list_items",
action="store_true",
default=config.WRAP_LIST_ITEMS,
help="wrap list items during conversion",
)
p.add_argument(
"--wrap-tables",
dest="wrap_tables",
action="store_true",
default=config.WRAP_TABLES,
help="wrap tables",
)
p.add_argument(
"--ignore-emphasis",
dest="ignore_emphasis",
action="store_true",
default=config.IGNORE_EMPHASIS,
help="don't include any formatting for emphasis",
)
p.add_argument(
"--reference-links",
dest="inline_links",
action="store_false",
default=config.INLINE_LINKS,
help="use reference style links instead of inline links",
)
p.add_argument(
"--ignore-links",
dest="ignore_links",
action="store_true",
default=config.IGNORE_ANCHORS,
help="don't include any formatting for links",
)
p.add_argument(
"--ignore-mailto-links",
action="store_true",
dest="ignore_mailto_links",
default=config.IGNORE_MAILTO_LINKS,
help="don't include mailto: links",
)
p.add_argument(
"--protect-links",
dest="protect_links",
action="store_true",
default=config.PROTECT_LINKS,
help="protect links from line breaks surrounding them with angle brackets",
)
p.add_argument(
"--ignore-images",
dest="ignore_images",
action="store_true",
default=config.IGNORE_IMAGES,
help="don't include any formatting for images",
)
p.add_argument(
"--images-as-html",
dest="images_as_html",
action="store_true",
default=config.IMAGES_AS_HTML,
help=(
"Always write image tags as raw html; preserves `height`, `width` and "
"`alt` if possible."
),
)
p.add_argument(
"--images-to-alt",
dest="images_to_alt",
action="store_true",
default=config.IMAGES_TO_ALT,
help="Discard image data, only keep alt text",
)
p.add_argument(
"--images-with-size",
dest="images_with_size",
action="store_true",
default=config.IMAGES_WITH_SIZE,
help=(
"Write image tags with height and width attrs as raw html to retain "
"dimensions"
),
)
p.add_argument(
"-g",
"--google-doc",
action="store_true",
dest="google_doc",
default=False,
help="convert an html-exported Google Document",
)
p.add_argument(
"-d",
"--dash-unordered-list",
action="store_true",
dest="ul_style_dash",
default=False,
help="use a dash rather than a star for unordered list items",
)
p.add_argument(
"-e",
"--asterisk-emphasis",
action="store_true",
dest="em_style_asterisk",
default=False,
help="use an asterisk rather than an underscore for emphasized text",
)
p.add_argument(
"-b",
"--body-width",
dest="body_width",
type=int,
default=config.BODY_WIDTH,
help="number of characters per output line, 0 for no wrap",
)
p.add_argument(
"-i",
"--google-list-indent",
dest="list_indent",
type=int,
default=config.GOOGLE_LIST_INDENT,
help="number of pixels Google indents nested lists",
)
p.add_argument(
"-s",
"--hide-strikethrough",
action="store_true",
dest="hide_strikethrough",
default=False,
help="hide strike-through text. only relevant when -g is " "specified as well",
)
p.add_argument(
"--escape-all",
action="store_true",
dest="escape_snob",
default=False,
help=(
"Escape all special characters. Output is less readable, but avoids "
"corner case formatting issues."
),
)
p.add_argument(
"--bypass-tables",
action="store_true",
dest="bypass_tables",
default=config.BYPASS_TABLES,
help="Format tables in HTML rather than Markdown syntax.",
)
p.add_argument(
"--ignore-tables",
action="store_true",
dest="ignore_tables",
default=config.IGNORE_TABLES,
help="Ignore table-related tags (table, th, td, tr) " "while keeping rows.",
)
p.add_argument(
"--single-line-break",
action="store_true",
dest="single_line_break",
default=config.SINGLE_LINE_BREAK,
help=(
"Use a single line break after a block element rather than two line "
"breaks. NOTE: Requires --body-width=0"
),
)
p.add_argument(
"--unicode-snob",
action="store_true",
dest="unicode_snob",
default=config.UNICODE_SNOB,
help="Use unicode throughout document",
)
p.add_argument(
"--no-automatic-links",
action="store_false",
dest="use_automatic_links",
default=config.USE_AUTOMATIC_LINKS,
help="Do not use automatic links wherever applicable",
)
p.add_argument(
"--no-skip-internal-links",
action="store_false",
dest="skip_internal_links",
default=config.SKIP_INTERNAL_LINKS,
help="Do not skip internal links",
)
p.add_argument(
"--links-after-para",
action="store_true",
dest="links_each_paragraph",
default=config.LINKS_EACH_PARAGRAPH,
help="Put links after each paragraph instead of document",
)
p.add_argument(
"--mark-code",
action="store_true",
dest="mark_code",
default=config.MARK_CODE,
help="Mark program code blocks with [code]...[/code]",
)
p.add_argument(
"--decode-errors",
dest="decode_errors",
default=config.DECODE_ERRORS,
help=(
"What to do in case of decode errors.'ignore', 'strict' and 'replace' are "
"acceptable values"
),
)
p.add_argument(
"--open-quote",
dest="open_quote",
default=config.OPEN_QUOTE,
help="The character used to open quotes",
)
p.add_argument(
"--close-quote",
dest="close_quote",
default=config.CLOSE_QUOTE,
help="The character used to close quotes",
)
p.add_argument(
"--version", action="version", version=".".join(map(str, __version__))
)
p.add_argument("filename", nargs="?")
p.add_argument("encoding", nargs="?", default="utf-8")
p.add_argument(
"--include-sup-sub",
dest="include_sup_sub",
action="store_true",
default=config.INCLUDE_SUP_SUB,
help="Include the sup and sub tags",
)
args = p.parse_args()
if args.filename and args.filename != "-":
with open(args.filename, "rb") as fp:
data = fp.read()
else:
data = sys.stdin.buffer.read()
try:
html = data.decode(args.encoding, args.decode_errors)
except UnicodeDecodeError as err:
warning = bcolors.WARNING + "Warning:" + bcolors.ENDC
warning += " Use the " + bcolors.OKGREEN
warning += "--decode-errors=ignore" + bcolors.ENDC + " flag."
print(warning)
raise err
h = HTML2Text(baseurl=baseurl)
# handle options
if args.ul_style_dash:
h.ul_item_mark = "-"
if args.em_style_asterisk:
h.emphasis_mark = "*"
h.strong_mark = "__"
h.body_width = args.body_width
h.google_list_indent = args.list_indent
h.ignore_emphasis = args.ignore_emphasis
h.ignore_links = args.ignore_links
h.ignore_mailto_links = args.ignore_mailto_links
h.protect_links = args.protect_links
h.ignore_images = args.ignore_images
h.images_as_html = args.images_as_html
h.images_to_alt = args.images_to_alt
h.images_with_size = args.images_with_size
h.google_doc = args.google_doc
h.hide_strikethrough = args.hide_strikethrough
h.escape_snob = args.escape_snob
h.bypass_tables = args.bypass_tables
h.ignore_tables = args.ignore_tables
h.single_line_break = args.single_line_break
h.inline_links = args.inline_links
h.unicode_snob = args.unicode_snob
h.use_automatic_links = args.use_automatic_links
h.skip_internal_links = args.skip_internal_links
h.links_each_paragraph = args.links_each_paragraph
h.mark_code = args.mark_code
h.wrap_links = args.wrap_links
h.wrap_list_items = args.wrap_list_items
h.wrap_tables = args.wrap_tables
h.pad_tables = args.pad_tables
h.default_image_alt = args.default_image_alt
h.open_quote = args.open_quote
h.close_quote = args.close_quote
h.include_sup_sub = args.include_sup_sub
sys.stdout.write(h.handle(html))

View File

@@ -1,172 +0,0 @@
import re
# Use Unicode characters instead of their ascii pseudo-replacements
UNICODE_SNOB = False
# Marker to use for marking tables for padding post processing
TABLE_MARKER_FOR_PAD = "special_marker_for_table_padding"
# Escape all special characters. Output is less readable, but avoids
# corner case formatting issues.
ESCAPE_SNOB = False
ESCAPE_BACKSLASH = False
ESCAPE_DOT = False
ESCAPE_PLUS = False
ESCAPE_DASH = False
# Put the links after each paragraph instead of at the end.
LINKS_EACH_PARAGRAPH = False
# Wrap long lines at position. 0 for no wrapping.
BODY_WIDTH = 78
# Don't show internal links (href="#local-anchor") -- corresponding link
# targets won't be visible in the plain text file anyway.
SKIP_INTERNAL_LINKS = True
# Use inline, rather than reference, formatting for images and links
INLINE_LINKS = True
# Protect links from line breaks surrounding them with angle brackets (in
# addition to their square brackets)
PROTECT_LINKS = False
# WRAP_LINKS = True
WRAP_LINKS = True
# Wrap list items.
WRAP_LIST_ITEMS = False
# Wrap tables
WRAP_TABLES = False
# Number of pixels Google indents nested lists
GOOGLE_LIST_INDENT = 36
# Values Google and others may use to indicate bold text
BOLD_TEXT_STYLE_VALUES = ("bold", "700", "800", "900")
IGNORE_ANCHORS = False
IGNORE_MAILTO_LINKS = False
IGNORE_IMAGES = False
IMAGES_AS_HTML = False
IMAGES_TO_ALT = False
IMAGES_WITH_SIZE = False
IGNORE_EMPHASIS = False
MARK_CODE = False
DECODE_ERRORS = "strict"
DEFAULT_IMAGE_ALT = ""
PAD_TABLES = False
# Convert links with same href and text to <href> format
# if they are absolute links
USE_AUTOMATIC_LINKS = True
# For checking space-only lines on line 771
RE_SPACE = re.compile(r"\s\+")
RE_ORDERED_LIST_MATCHER = re.compile(r"\d+\.\s")
RE_UNORDERED_LIST_MATCHER = re.compile(r"[-\*\+]\s")
RE_MD_CHARS_MATCHER = re.compile(r"([\\\[\]\(\)])")
RE_MD_CHARS_MATCHER_ALL = re.compile(r"([`\*_{}\[\]\(\)#!])")
# to find links in the text
RE_LINK = re.compile(r"(\[.*?\] ?\(.*?\))|(\[.*?\]:.*?)")
# to find table separators
RE_TABLE = re.compile(r" \| ")
RE_MD_DOT_MATCHER = re.compile(
r"""
^ # start of line
(\s*\d+) # optional whitespace and a number
(\.) # dot
(?=\s) # lookahead assert whitespace
""",
re.MULTILINE | re.VERBOSE,
)
RE_MD_PLUS_MATCHER = re.compile(
r"""
^
(\s*)
(\+)
(?=\s)
""",
flags=re.MULTILINE | re.VERBOSE,
)
RE_MD_DASH_MATCHER = re.compile(
r"""
^
(\s*)
(-)
(?=\s|\-) # followed by whitespace (bullet list, or spaced out hr)
# or another dash (header or hr)
""",
flags=re.MULTILINE | re.VERBOSE,
)
RE_SLASH_CHARS = r"\`*_{}[]()#+-.!"
RE_MD_BACKSLASH_MATCHER = re.compile(
r"""
(\\) # match one slash
(?=[%s]) # followed by a char that requires escaping
"""
% re.escape(RE_SLASH_CHARS),
flags=re.VERBOSE,
)
UNIFIABLE = {
"rsquo": "'",
"lsquo": "'",
"rdquo": '"',
"ldquo": '"',
"copy": "(C)",
"mdash": "--",
"nbsp": " ",
"rarr": "->",
"larr": "<-",
"middot": "*",
"ndash": "-",
"oelig": "oe",
"aelig": "ae",
"agrave": "a",
"aacute": "a",
"acirc": "a",
"atilde": "a",
"auml": "a",
"aring": "a",
"egrave": "e",
"eacute": "e",
"ecirc": "e",
"euml": "e",
"igrave": "i",
"iacute": "i",
"icirc": "i",
"iuml": "i",
"ograve": "o",
"oacute": "o",
"ocirc": "o",
"otilde": "o",
"ouml": "o",
"ugrave": "u",
"uacute": "u",
"ucirc": "u",
"uuml": "u",
"lrm": "",
"rlm": "",
}
# Format tables in HTML rather than Markdown syntax
BYPASS_TABLES = False
# Ignore table-related tags (table, th, td, tr) while keeping rows
IGNORE_TABLES = False
# Use a single line break after a block element rather than two line breaks.
# NOTE: Requires body width setting to be 0.
SINGLE_LINE_BREAK = False
# Use double quotation marks when converting the <q> tag.
OPEN_QUOTE = '"'
CLOSE_QUOTE = '"'
# Include the <sup> and <sub> tags
INCLUDE_SUP_SUB = False

View File

@@ -1,18 +0,0 @@
from typing import Dict, Optional
class AnchorElement:
__slots__ = ["attrs", "count", "outcount"]
def __init__(self, attrs: Dict[str, Optional[str]], count: int, outcount: int):
self.attrs = attrs
self.count = count
self.outcount = outcount
class ListElement:
__slots__ = ["name", "num"]
def __init__(self, name: str, num: int):
self.name = name
self.num = num

View File

@@ -1,304 +0,0 @@
import html.entities
from typing import Dict, List, Optional
from . import config
unifiable_n = {
html.entities.name2codepoint[k]: v
for k, v in config.UNIFIABLE.items()
if k != "nbsp"
}
def hn(tag: str) -> int:
if tag[0] == "h" and len(tag) == 2:
n = tag[1]
if "0" < n <= "9":
return int(n)
return 0
def dumb_property_dict(style: str) -> Dict[str, str]:
"""
:returns: A hash of css attributes
"""
return {
x.strip().lower(): y.strip().lower()
for x, y in [z.split(":", 1) for z in style.split(";") if ":" in z]
}
def dumb_css_parser(data: str) -> Dict[str, Dict[str, str]]:
"""
:type data: str
:returns: A hash of css selectors, each of which contains a hash of
css attributes.
:rtype: dict
"""
# remove @import sentences
data += ";"
importIndex = data.find("@import")
while importIndex != -1:
data = data[0:importIndex] + data[data.find(";", importIndex) + 1 :]
importIndex = data.find("@import")
# parse the css. reverted from dictionary comprehension in order to
# support older pythons
pairs = [x.split("{") for x in data.split("}") if "{" in x.strip()]
try:
elements = {a.strip(): dumb_property_dict(b) for a, b in pairs}
except ValueError:
elements = {} # not that important
return elements
def element_style(
attrs: Dict[str, Optional[str]],
style_def: Dict[str, Dict[str, str]],
parent_style: Dict[str, str],
) -> Dict[str, str]:
"""
:type attrs: dict
:type style_def: dict
:type style_def: dict
:returns: A hash of the 'final' style attributes of the element
:rtype: dict
"""
style = parent_style.copy()
if "class" in attrs:
assert attrs["class"] is not None
for css_class in attrs["class"].split():
css_style = style_def.get("." + css_class, {})
style.update(css_style)
if "style" in attrs:
assert attrs["style"] is not None
immediate_style = dumb_property_dict(attrs["style"])
style.update(immediate_style)
return style
def google_list_style(style: Dict[str, str]) -> str:
"""
Finds out whether this is an ordered or unordered list
:type style: dict
:rtype: str
"""
if "list-style-type" in style:
list_style = style["list-style-type"]
if list_style in ["disc", "circle", "square", "none"]:
return "ul"
return "ol"
def google_has_height(style: Dict[str, str]) -> bool:
"""
Check if the style of the element has the 'height' attribute
explicitly defined
:type style: dict
:rtype: bool
"""
return "height" in style
def google_text_emphasis(style: Dict[str, str]) -> List[str]:
"""
:type style: dict
:returns: A list of all emphasis modifiers of the element
:rtype: list
"""
emphasis = []
if "text-decoration" in style:
emphasis.append(style["text-decoration"])
if "font-style" in style:
emphasis.append(style["font-style"])
if "font-weight" in style:
emphasis.append(style["font-weight"])
return emphasis
def google_fixed_width_font(style: Dict[str, str]) -> bool:
"""
Check if the css of the current element defines a fixed width font
:type style: dict
:rtype: bool
"""
font_family = ""
if "font-family" in style:
font_family = style["font-family"]
return "courier new" == font_family or "consolas" == font_family
def list_numbering_start(attrs: Dict[str, Optional[str]]) -> int:
"""
Extract numbering from list element attributes
:type attrs: dict
:rtype: int or None
"""
if "start" in attrs:
assert attrs["start"] is not None
try:
return int(attrs["start"]) - 1
except ValueError:
pass
return 0
def skipwrap(
para: str, wrap_links: bool, wrap_list_items: bool, wrap_tables: bool
) -> bool:
# If it appears to contain a link
# don't wrap
if not wrap_links and config.RE_LINK.search(para):
return True
# If the text begins with four spaces or one tab, it's a code block;
# don't wrap
if para[0:4] == " " or para[0] == "\t":
return True
# If the text begins with only two "--", possibly preceded by
# whitespace, that's an emdash; so wrap.
stripped = para.lstrip()
if stripped[0:2] == "--" and len(stripped) > 2 and stripped[2] != "-":
return False
# I'm not sure what this is for; I thought it was to detect lists,
# but there's a <br>-inside-<span> case in one of the tests that
# also depends upon it.
if stripped[0:1] in ("-", "*") and not stripped[0:2] == "**":
return not wrap_list_items
# If text contains a pipe character it is likely a table
if not wrap_tables and config.RE_TABLE.search(para):
return True
# If the text begins with a single -, *, or +, followed by a space,
# or an integer, followed by a ., followed by a space (in either
# case optionally proceeded by whitespace), it's a list; don't wrap.
return bool(
config.RE_ORDERED_LIST_MATCHER.match(stripped)
or config.RE_UNORDERED_LIST_MATCHER.match(stripped)
)
def escape_md(text: str) -> str:
"""
Escapes markdown-sensitive characters within other markdown
constructs.
"""
return config.RE_MD_CHARS_MATCHER.sub(r"\\\1", text)
def escape_md_section(
text: str,
escape_backslash: bool = True,
snob: bool = False,
escape_dot: bool = True,
escape_plus: bool = True,
escape_dash: bool = True,
) -> str:
"""
Escapes markdown-sensitive characters across whole document sections.
Each escaping operation can be controlled individually.
"""
if escape_backslash:
text = config.RE_MD_BACKSLASH_MATCHER.sub(r"\\\1", text)
if snob:
text = config.RE_MD_CHARS_MATCHER_ALL.sub(r"\\\1", text)
if escape_dot:
text = config.RE_MD_DOT_MATCHER.sub(r"\1\\\2", text)
if escape_plus:
text = config.RE_MD_PLUS_MATCHER.sub(r"\1\\\2", text)
if escape_dash:
text = config.RE_MD_DASH_MATCHER.sub(r"\1\\\2", text)
return text
def reformat_table(lines: List[str], right_margin: int) -> List[str]:
"""
Given the lines of a table
padds the cells and returns the new lines
"""
# find the maximum width of the columns
max_width = [len(x.rstrip()) + right_margin for x in lines[0].split("|")]
max_cols = len(max_width)
for line in lines:
cols = [x.rstrip() for x in line.split("|")]
num_cols = len(cols)
# don't drop any data if colspan attributes result in unequal lengths
if num_cols < max_cols:
cols += [""] * (max_cols - num_cols)
elif max_cols < num_cols:
max_width += [len(x) + right_margin for x in cols[-(num_cols - max_cols) :]]
max_cols = num_cols
max_width = [
max(len(x) + right_margin, old_len) for x, old_len in zip(cols, max_width)
]
# reformat
new_lines = []
for line in lines:
cols = [x.rstrip() for x in line.split("|")]
if set(line.strip()) == set("-|"):
filler = "-"
new_cols = [
x.rstrip() + (filler * (M - len(x.rstrip())))
for x, M in zip(cols, max_width)
]
new_lines.append("|-" + "|".join(new_cols) + "|")
else:
filler = " "
new_cols = [
x.rstrip() + (filler * (M - len(x.rstrip())))
for x, M in zip(cols, max_width)
]
new_lines.append("| " + "|".join(new_cols) + "|")
return new_lines
def pad_tables_in_text(text: str, right_margin: int = 1) -> str:
"""
Provide padding for tables in the text
"""
lines = text.split("\n")
table_buffer = [] # type: List[str]
table_started = False
new_lines = []
for line in lines:
# Toggle table started
if config.TABLE_MARKER_FOR_PAD in line:
table_started = not table_started
if not table_started:
table = reformat_table(table_buffer, right_margin)
new_lines.extend(table)
table_buffer = []
new_lines.append("")
continue
# Process lines
if table_started:
table_buffer.append(line)
else:
new_lines.append(line)
return "\n".join(new_lines)

View File

@@ -1,109 +0,0 @@
import subprocess
import sys
import asyncio
from .async_logger import AsyncLogger, LogLevel
# Initialize logger
logger = AsyncLogger(log_level=LogLevel.DEBUG, verbose=True)
def post_install():
"""Run all post-installation tasks"""
logger.info("Running post-installation setup...", tag="INIT")
install_playwright()
run_migration()
logger.success("Post-installation setup completed!", tag="COMPLETE")
def install_playwright():
logger.info("Installing Playwright browsers...", tag="INIT")
try:
# subprocess.check_call([sys.executable, "-m", "playwright", "install", "--with-deps", "--force", "chrome"])
subprocess.check_call(
[
sys.executable,
"-m",
"playwright",
"install",
"--with-deps",
"--force",
"chromium",
]
)
logger.success(
"Playwright installation completed successfully.", tag="COMPLETE"
)
except subprocess.CalledProcessError:
# logger.error(f"Error during Playwright installation: {e}", tag="ERROR")
logger.warning(
f"Please run '{sys.executable} -m playwright install --with-deps' manually after the installation."
)
except Exception:
# logger.error(f"Unexpected error during Playwright installation: {e}", tag="ERROR")
logger.warning(
f"Please run '{sys.executable} -m playwright install --with-deps' manually after the installation."
)
def run_migration():
"""Initialize database during installation"""
try:
logger.info("Starting database initialization...", tag="INIT")
from crawl4ai.async_database import async_db_manager
asyncio.run(async_db_manager.initialize())
logger.success(
"Database initialization completed successfully.", tag="COMPLETE"
)
except ImportError:
logger.warning("Database module not found. Will initialize on first use.")
except Exception as e:
logger.warning(f"Database initialization failed: {e}")
logger.warning("Database will be initialized on first use")
async def run_doctor():
"""Test if Crawl4AI is working properly"""
logger.info("Running Crawl4AI health check...", tag="INIT")
try:
from .async_webcrawler import (
AsyncWebCrawler,
BrowserConfig,
CrawlerRunConfig,
CacheMode,
)
browser_config = BrowserConfig(
headless=True,
browser_type="chromium",
ignore_https_errors=True,
light_mode=True,
viewport_width=1280,
viewport_height=720,
)
run_config = CrawlerRunConfig(
cache_mode=CacheMode.BYPASS,
screenshot=True,
)
async with AsyncWebCrawler(config=browser_config) as crawler:
logger.info("Testing crawling capabilities...", tag="TEST")
result = await crawler.arun(url="https://crawl4ai.com", config=run_config)
if result and result.markdown:
logger.success("✅ Crawling test passed!", tag="COMPLETE")
return True
else:
raise Exception("Failed to get content")
except Exception as e:
logger.error(f"❌ Test failed: {e}", tag="ERROR")
return False
def doctor():
"""Entry point for the doctor command"""
import asyncio
return asyncio.run(run_doctor())

View File

@@ -1,18 +0,0 @@
import os
# Create a function get name of a js script, then load from the CURRENT folder of this script and return its content as string, make sure its error free
def load_js_script(script_name):
# Get the path of the current script
current_script_path = os.path.dirname(os.path.realpath(__file__))
# Get the path of the script to load
script_path = os.path.join(current_script_path, script_name + ".js")
# Check if the script exists
if not os.path.exists(script_path):
raise ValueError(
f"Script {script_name} not found in the folder {current_script_path}"
)
# Load the content of the script
with open(script_path, "r") as f:
script_content = f.read()
return script_content

View File

@@ -1,25 +0,0 @@
// Pass the Permissions Test.
const originalQuery = window.navigator.permissions.query;
window.navigator.permissions.query = (parameters) =>
parameters.name === "notifications"
? Promise.resolve({ state: Notification.permission })
: originalQuery(parameters);
Object.defineProperty(navigator, "webdriver", {
get: () => undefined,
});
window.navigator.chrome = {
runtime: {},
// Add other properties if necessary
};
Object.defineProperty(navigator, "plugins", {
get: () => [1, 2, 3, 4, 5],
});
Object.defineProperty(navigator, "languages", {
get: () => ["en-US", "en"],
});
Object.defineProperty(document, "hidden", {
get: () => false,
});
Object.defineProperty(document, "visibilityState", {
get: () => "visible",
});

View File

@@ -1,119 +0,0 @@
async () => {
// Function to check if element is visible
const isVisible = (elem) => {
const style = window.getComputedStyle(elem);
return style.display !== "none" && style.visibility !== "hidden" && style.opacity !== "0";
};
// Common selectors for popups and overlays
const commonSelectors = [
// Close buttons first
'button[class*="close" i]',
'button[class*="dismiss" i]',
'button[aria-label*="close" i]',
'button[title*="close" i]',
'a[class*="close" i]',
'span[class*="close" i]',
// Cookie notices
'[class*="cookie-banner" i]',
'[id*="cookie-banner" i]',
'[class*="cookie-consent" i]',
'[id*="cookie-consent" i]',
// Newsletter/subscription dialogs
'[class*="newsletter" i]',
'[class*="subscribe" i]',
// Generic popups/modals
'[class*="popup" i]',
'[class*="modal" i]',
'[class*="overlay" i]',
'[class*="dialog" i]',
'[role="dialog"]',
'[role="alertdialog"]',
];
// Try to click close buttons first
for (const selector of commonSelectors.slice(0, 6)) {
const closeButtons = document.querySelectorAll(selector);
for (const button of closeButtons) {
if (isVisible(button)) {
try {
button.click();
await new Promise((resolve) => setTimeout(resolve, 100));
} catch (e) {
console.log("Error clicking button:", e);
}
}
}
}
// Remove remaining overlay elements
const removeOverlays = () => {
// Find elements with high z-index
const allElements = document.querySelectorAll("*");
for (const elem of allElements) {
const style = window.getComputedStyle(elem);
const zIndex = parseInt(style.zIndex);
const position = style.position;
if (
isVisible(elem) &&
(zIndex > 999 || position === "fixed" || position === "absolute") &&
(elem.offsetWidth > window.innerWidth * 0.5 ||
elem.offsetHeight > window.innerHeight * 0.5 ||
style.backgroundColor.includes("rgba") ||
parseFloat(style.opacity) < 1)
) {
elem.remove();
}
}
// Remove elements matching common selectors
for (const selector of commonSelectors) {
const elements = document.querySelectorAll(selector);
elements.forEach((elem) => {
if (isVisible(elem)) {
elem.remove();
}
});
}
};
// Remove overlay elements
removeOverlays();
// Remove any fixed/sticky position elements at the top/bottom
const removeFixedElements = () => {
const elements = document.querySelectorAll("*");
elements.forEach((elem) => {
const style = window.getComputedStyle(elem);
if ((style.position === "fixed" || style.position === "sticky") && isVisible(elem)) {
elem.remove();
}
});
};
removeFixedElements();
// Remove empty block elements as: div, p, span, etc.
const removeEmptyBlockElements = () => {
const blockElements = document.querySelectorAll(
"div, p, span, section, article, header, footer, aside, nav, main, ul, ol, li, dl, dt, dd, h1, h2, h3, h4, h5, h6"
);
blockElements.forEach((elem) => {
if (elem.innerText.trim() === "") {
elem.remove();
}
});
};
// Remove margin-right and padding-right from body (often added by modal scripts)
document.body.style.marginRight = "0px";
document.body.style.paddingRight = "0px";
document.body.style.overflow = "auto";
// Wait a bit for any animations to complete
await new Promise((resolve) => setTimeout(resolve, 100));
};

View File

@@ -1,54 +0,0 @@
() => {
return new Promise((resolve) => {
const filterImage = (img) => {
// Filter out images that are too small
if (img.width < 100 && img.height < 100) return false;
// Filter out images that are not visible
const rect = img.getBoundingClientRect();
if (rect.width === 0 || rect.height === 0) return false;
// Filter out images with certain class names (e.g., icons, thumbnails)
if (img.classList.contains("icon") || img.classList.contains("thumbnail")) return false;
// Filter out images with certain patterns in their src (e.g., placeholder images)
if (img.src.includes("placeholder") || img.src.includes("icon")) return false;
return true;
};
const images = Array.from(document.querySelectorAll("img")).filter(filterImage);
let imagesLeft = images.length;
if (imagesLeft === 0) {
resolve();
return;
}
const checkImage = (img) => {
if (img.complete && img.naturalWidth !== 0) {
img.setAttribute("width", img.naturalWidth);
img.setAttribute("height", img.naturalHeight);
imagesLeft--;
if (imagesLeft === 0) resolve();
}
};
images.forEach((img) => {
checkImage(img);
if (!img.complete) {
img.onload = () => {
checkImage(img);
};
img.onerror = () => {
imagesLeft--;
if (imagesLeft === 0) resolve();
};
}
});
// Fallback timeout of 5 seconds
// setTimeout(() => resolve(), 5000);
resolve();
});
};

View File

@@ -1,546 +0,0 @@
import os
from pathlib import Path
import re
from typing import Dict, List, Tuple, Optional, Any
import json
from tqdm import tqdm
import time
import psutil
import numpy as np
from rank_bm25 import BM25Okapi
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from litellm import batch_completion
from .async_logger import AsyncLogger
import litellm
import pickle
import hashlib # <--- ADDED for file-hash
import glob
litellm.set_verbose = False
def _compute_file_hash(file_path: Path) -> str:
"""Compute MD5 hash for the file's entire content."""
hash_md5 = hashlib.md5()
with file_path.open("rb") as f:
for chunk in iter(lambda: f.read(4096), b""):
hash_md5.update(chunk)
return hash_md5.hexdigest()
class AsyncLLMTextManager:
def __init__(
self,
docs_dir: Path,
logger: Optional[AsyncLogger] = None,
max_concurrent_calls: int = 5,
batch_size: int = 3,
) -> None:
self.docs_dir = docs_dir
self.logger = logger
self.max_concurrent_calls = max_concurrent_calls
self.batch_size = batch_size
self.bm25_index = None
self.document_map: Dict[str, Any] = {}
self.tokenized_facts: List[str] = []
self.bm25_index_file = self.docs_dir / "bm25_index.pkl"
async def _process_document_batch(self, doc_batch: List[Path]) -> None:
"""Process a batch of documents in parallel"""
contents = []
for file_path in doc_batch:
try:
with open(file_path, "r", encoding="utf-8") as f:
contents.append(f.read())
except Exception as e:
self.logger.error(f"Error reading {file_path}: {str(e)}")
contents.append("") # Add empty content to maintain batch alignment
prompt = """Given a documentation file, generate a list of atomic facts where each fact:
1. Represents a single piece of knowledge
2. Contains variations in terminology for the same concept
3. References relevant code patterns if they exist
4. Is written in a way that would match natural language queries
Each fact should follow this format:
<main_concept>: <fact_statement> | <related_terms> | <code_reference>
Example Facts:
browser_config: Configure headless mode and browser type for AsyncWebCrawler | headless, browser_type, chromium, firefox | BrowserConfig(browser_type="chromium", headless=True)
redis_connection: Redis client connection requires host and port configuration | redis setup, redis client, connection params | Redis(host='localhost', port=6379, db=0)
pandas_filtering: Filter DataFrame rows using boolean conditions | dataframe filter, query, boolean indexing | df[df['column'] > 5]
Wrap your response in <index>...</index> tags.
"""
# Prepare messages for batch processing
messages_list = [
[
{
"role": "user",
"content": f"{prompt}\n\nGenerate index for this documentation:\n\n{content}",
}
]
for content in contents
if content
]
try:
responses = batch_completion(
model="anthropic/claude-3-5-sonnet-latest",
messages=messages_list,
logger_fn=None,
)
# Process responses and save index files
for response, file_path in zip(responses, doc_batch):
try:
index_content_match = re.search(
r"<index>(.*?)</index>",
response.choices[0].message.content,
re.DOTALL,
)
if not index_content_match:
self.logger.warning(
f"No <index>...</index> content found for {file_path}"
)
continue
index_content = re.sub(
r"\n\s*\n", "\n", index_content_match.group(1)
).strip()
if index_content:
index_file = file_path.with_suffix(".q.md")
with open(index_file, "w", encoding="utf-8") as f:
f.write(index_content)
self.logger.info(f"Created index file: {index_file}")
else:
self.logger.warning(
f"No index content found in response for {file_path}"
)
except Exception as e:
self.logger.error(
f"Error processing response for {file_path}: {str(e)}"
)
except Exception as e:
self.logger.error(f"Error in batch completion: {str(e)}")
def _validate_fact_line(self, line: str) -> Tuple[bool, Optional[str]]:
if "|" not in line:
return False, "Missing separator '|'"
parts = [p.strip() for p in line.split("|")]
if len(parts) != 3:
return False, f"Expected 3 parts, got {len(parts)}"
concept_part = parts[0]
if ":" not in concept_part:
return False, "Missing ':' in concept definition"
return True, None
def _load_or_create_token_cache(self, fact_file: Path) -> Dict:
"""
Load token cache from .q.tokens if present and matching file hash.
Otherwise return a new structure with updated file-hash.
"""
cache_file = fact_file.with_suffix(".q.tokens")
current_hash = _compute_file_hash(fact_file)
if cache_file.exists():
try:
with open(cache_file, "r") as f:
cache = json.load(f)
# If the hash matches, return it directly
if cache.get("content_hash") == current_hash:
return cache
# Otherwise, we signal that it's changed
self.logger.info(f"Hash changed for {fact_file}, reindex needed.")
except json.JSONDecodeError:
self.logger.warning(f"Corrupt token cache for {fact_file}, rebuilding.")
except Exception as e:
self.logger.warning(f"Error reading cache for {fact_file}: {str(e)}")
# Return a fresh cache
return {"facts": {}, "content_hash": current_hash}
def _save_token_cache(self, fact_file: Path, cache: Dict) -> None:
cache_file = fact_file.with_suffix(".q.tokens")
# Always ensure we're saving the correct file-hash
cache["content_hash"] = _compute_file_hash(fact_file)
with open(cache_file, "w") as f:
json.dump(cache, f)
def preprocess_text(self, text: str) -> List[str]:
parts = [x.strip() for x in text.split("|")] if "|" in text else [text]
# Remove : after the first word of parts[0]
parts[0] = re.sub(r"^(.*?):", r"\1", parts[0])
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english")) - {
"how",
"what",
"when",
"where",
"why",
"which",
}
tokens = []
for part in parts:
if "(" in part and ")" in part:
code_tokens = re.findall(
r'[\w_]+(?=\()|[\w_]+(?==[\'"]{1}[\w_]+[\'"]{1})', part
)
tokens.extend(code_tokens)
words = word_tokenize(part.lower())
tokens.extend(
[
lemmatizer.lemmatize(token)
for token in words
if token not in stop_words
]
)
return tokens
def maybe_load_bm25_index(self, clear_cache=False) -> bool:
"""
Load existing BM25 index from disk, if present and clear_cache=False.
"""
if not clear_cache and os.path.exists(self.bm25_index_file):
self.logger.info("Loading existing BM25 index from disk.")
with open(self.bm25_index_file, "rb") as f:
data = pickle.load(f)
self.tokenized_facts = data["tokenized_facts"]
self.bm25_index = data["bm25_index"]
return True
return False
def build_search_index(self, clear_cache=False) -> None:
"""
Checks for new or modified .q.md files by comparing file-hash.
If none need reindexing and clear_cache is False, loads existing index if available.
Otherwise, reindexes only changed/new files and merges or creates a new index.
"""
# If clear_cache is True, we skip partial logic: rebuild everything from scratch
if clear_cache:
self.logger.info("Clearing cache and rebuilding full search index.")
if self.bm25_index_file.exists():
self.bm25_index_file.unlink()
process = psutil.Process()
self.logger.info("Checking which .q.md files need (re)indexing...")
# Gather all .q.md files
q_files = [
self.docs_dir / f for f in os.listdir(self.docs_dir) if f.endswith(".q.md")
]
# We'll store known (unchanged) facts in these lists
existing_facts: List[str] = []
existing_tokens: List[List[str]] = []
# Keep track of invalid lines for logging
invalid_lines = []
needSet = [] # files that must be (re)indexed
for qf in q_files:
token_cache_file = qf.with_suffix(".q.tokens")
# If no .q.tokens or clear_cache is True → definitely reindex
if clear_cache or not token_cache_file.exists():
needSet.append(qf)
continue
# Otherwise, load the existing cache and compare hash
cache = self._load_or_create_token_cache(qf)
# If the .q.tokens was out of date (i.e. changed hash), we reindex
if len(cache["facts"]) == 0 or cache.get(
"content_hash"
) != _compute_file_hash(qf):
needSet.append(qf)
else:
# File is unchanged → retrieve cached token data
for line, cache_data in cache["facts"].items():
existing_facts.append(line)
existing_tokens.append(cache_data["tokens"])
self.document_map[line] = qf # track the doc for that fact
if not needSet and not clear_cache:
# If no file needs reindexing, try loading existing index
if self.maybe_load_bm25_index(clear_cache=False):
self.logger.info(
"No new/changed .q.md files found. Using existing BM25 index."
)
return
else:
# If there's no existing index, we must build a fresh index from the old caches
self.logger.info(
"No existing BM25 index found. Building from cached facts."
)
if existing_facts:
self.logger.info(
f"Building BM25 index with {len(existing_facts)} cached facts."
)
self.bm25_index = BM25Okapi(existing_tokens)
self.tokenized_facts = existing_facts
with open(self.bm25_index_file, "wb") as f:
pickle.dump(
{
"bm25_index": self.bm25_index,
"tokenized_facts": self.tokenized_facts,
},
f,
)
else:
self.logger.warning("No facts found at all. Index remains empty.")
return
# ----------------------------------------------------- /Users/unclecode/.crawl4ai/docs/14_proxy_security.q.q.tokens '/Users/unclecode/.crawl4ai/docs/14_proxy_security.q.md'
# If we reach here, we have new or changed .q.md files
# We'll parse them, reindex them, and then combine with existing_facts
# -----------------------------------------------------
self.logger.info(f"{len(needSet)} file(s) need reindexing. Parsing now...")
# 1) Parse the new or changed .q.md files
new_facts = []
new_tokens = []
with tqdm(total=len(needSet), desc="Indexing changed files") as file_pbar:
for file in needSet:
# We'll build up a fresh cache
fresh_cache = {"facts": {}, "content_hash": _compute_file_hash(file)}
try:
with open(file, "r", encoding="utf-8") as f_obj:
content = f_obj.read().strip()
lines = [l.strip() for l in content.split("\n") if l.strip()]
for line in lines:
is_valid, error = self._validate_fact_line(line)
if not is_valid:
invalid_lines.append((file, line, error))
continue
tokens = self.preprocess_text(line)
fresh_cache["facts"][line] = {
"tokens": tokens,
"added": time.time(),
}
new_facts.append(line)
new_tokens.append(tokens)
self.document_map[line] = file
# Save the new .q.tokens with updated hash
self._save_token_cache(file, fresh_cache)
mem_usage = process.memory_info().rss / 1024 / 1024
self.logger.debug(
f"Memory usage after {file.name}: {mem_usage:.2f}MB"
)
except Exception as e:
self.logger.error(f"Error processing {file}: {str(e)}")
file_pbar.update(1)
if invalid_lines:
self.logger.warning(f"Found {len(invalid_lines)} invalid fact lines:")
for file, line, error in invalid_lines:
self.logger.warning(f"{file}: {error} in line: {line[:50]}...")
# 2) Merge newly tokenized facts with the existing ones
all_facts = existing_facts + new_facts
all_tokens = existing_tokens + new_tokens
# 3) Build BM25 index from combined facts
self.logger.info(
f"Building BM25 index with {len(all_facts)} total facts (old + new)."
)
self.bm25_index = BM25Okapi(all_tokens)
self.tokenized_facts = all_facts
# 4) Save the updated BM25 index to disk
with open(self.bm25_index_file, "wb") as f:
pickle.dump(
{
"bm25_index": self.bm25_index,
"tokenized_facts": self.tokenized_facts,
},
f,
)
final_mem = process.memory_info().rss / 1024 / 1024
self.logger.info(f"Search index updated. Final memory usage: {final_mem:.2f}MB")
async def generate_index_files(
self, force_generate_facts: bool = False, clear_bm25_cache: bool = False
) -> None:
"""
Generate index files for all documents in parallel batches
Args:
force_generate_facts (bool): If True, regenerate indexes even if they exist
clear_bm25_cache (bool): If True, clear existing BM25 index cache
"""
self.logger.info("Starting index generation for documentation files.")
md_files = [
self.docs_dir / f
for f in os.listdir(self.docs_dir)
if f.endswith(".md") and not any(f.endswith(x) for x in [".q.md", ".xs.md"])
]
# Filter out files that already have .q files unless force=True
if not force_generate_facts:
md_files = [
f
for f in md_files
if not (self.docs_dir / f.name.replace(".md", ".q.md")).exists()
]
if not md_files:
self.logger.info("All index files exist. Use force=True to regenerate.")
else:
# Process documents in batches
for i in range(0, len(md_files), self.batch_size):
batch = md_files[i : i + self.batch_size]
self.logger.info(
f"Processing batch {i//self.batch_size + 1}/{(len(md_files)//self.batch_size) + 1}"
)
await self._process_document_batch(batch)
self.logger.info("Index generation complete, building/updating search index.")
self.build_search_index(clear_cache=clear_bm25_cache)
def generate(self, sections: List[str], mode: str = "extended") -> str:
# Get all markdown files
all_files = glob.glob(str(self.docs_dir / "[0-9]*.md")) + glob.glob(
str(self.docs_dir / "[0-9]*.xs.md")
)
# Extract base names without extensions
base_docs = {
Path(f).name.split(".")[0]
for f in all_files
if not Path(f).name.endswith(".q.md")
}
# Filter by sections if provided
if sections:
base_docs = {
doc
for doc in base_docs
if any(section.lower() in doc.lower() for section in sections)
}
# Get file paths based on mode
files = []
for doc in sorted(
base_docs,
key=lambda x: int(x.split("_")[0]) if x.split("_")[0].isdigit() else 999999,
):
if mode == "condensed":
xs_file = self.docs_dir / f"{doc}.xs.md"
regular_file = self.docs_dir / f"{doc}.md"
files.append(str(xs_file if xs_file.exists() else regular_file))
else:
files.append(str(self.docs_dir / f"{doc}.md"))
# Read and format content
content = []
for file in files:
try:
with open(file, "r", encoding="utf-8") as f:
fname = Path(file).name
content.append(f"{'#'*20}\n# {fname}\n{'#'*20}\n\n{f.read()}")
except Exception as e:
self.logger.error(f"Error reading {file}: {str(e)}")
return "\n\n---\n\n".join(content) if content else ""
def search(self, query: str, top_k: int = 5) -> str:
if not self.bm25_index:
return "No search index available. Call build_search_index() first."
query_tokens = self.preprocess_text(query)
doc_scores = self.bm25_index.get_scores(query_tokens)
mean_score = np.mean(doc_scores)
std_score = np.std(doc_scores)
score_threshold = mean_score + (0.25 * std_score)
file_data = self._aggregate_search_scores(
doc_scores=doc_scores,
score_threshold=score_threshold,
query_tokens=query_tokens,
)
ranked_files = sorted(
file_data.items(),
key=lambda x: (
x[1]["code_match_score"] * 2.0
+ x[1]["match_count"] * 1.5
+ x[1]["total_score"]
),
reverse=True,
)[:top_k]
results = []
for file, _ in ranked_files:
main_doc = str(file).replace(".q.md", ".md")
if os.path.exists(self.docs_dir / main_doc):
with open(self.docs_dir / main_doc, "r", encoding="utf-8") as f:
only_file_name = main_doc.split("/")[-1]
content = ["#" * 20, f"# {only_file_name}", "#" * 20, "", f.read()]
results.append("\n".join(content))
return "\n\n---\n\n".join(results)
def _aggregate_search_scores(
self, doc_scores: List[float], score_threshold: float, query_tokens: List[str]
) -> Dict:
file_data = {}
for idx, score in enumerate(doc_scores):
if score <= score_threshold:
continue
fact = self.tokenized_facts[idx]
file_path = self.document_map[fact]
if file_path not in file_data:
file_data[file_path] = {
"total_score": 0,
"match_count": 0,
"code_match_score": 0,
"matched_facts": [],
}
components = fact.split("|") if "|" in fact else [fact]
code_match_score = 0
if len(components) == 3:
code_ref = components[2].strip()
code_tokens = self.preprocess_text(code_ref)
code_match_score = len(set(query_tokens) & set(code_tokens)) / len(
query_tokens
)
file_data[file_path]["total_score"] += score
file_data[file_path]["match_count"] += 1
file_data[file_path]["code_match_score"] = max(
file_data[file_path]["code_match_score"], code_match_score
)
file_data[file_path]["matched_facts"].append(fact)
return file_data
def refresh_index(self) -> None:
"""Convenience method for a full rebuild."""
self.build_search_index(clear_cache=True)

View File

@@ -1,253 +0,0 @@
from abc import ABC, abstractmethod
from typing import Optional, Dict, Any, Tuple
from .models import MarkdownGenerationResult
from .html2text import CustomHTML2Text
from .content_filter_strategy import RelevantContentFilter
import re
from urllib.parse import urljoin
# Pre-compile the regex pattern
LINK_PATTERN = re.compile(r'!?\[([^\]]+)\]\(([^)]+?)(?:\s+"([^"]*)")?\)')
def fast_urljoin(base: str, url: str) -> str:
"""Fast URL joining for common cases."""
if url.startswith(("http://", "https://", "mailto:", "//")):
return url
if url.startswith("/"):
# Handle absolute paths
if base.endswith("/"):
return base[:-1] + url
return base + url
return urljoin(base, url)
class MarkdownGenerationStrategy(ABC):
"""Abstract base class for markdown generation strategies."""
def __init__(
self,
content_filter: Optional[RelevantContentFilter] = None,
options: Optional[Dict[str, Any]] = None,
):
self.content_filter = content_filter
self.options = options or {}
@abstractmethod
def generate_markdown(
self,
cleaned_html: str,
base_url: str = "",
html2text_options: Optional[Dict[str, Any]] = None,
content_filter: Optional[RelevantContentFilter] = None,
citations: bool = True,
**kwargs,
) -> MarkdownGenerationResult:
"""Generate markdown from cleaned HTML."""
pass
class DefaultMarkdownGenerator(MarkdownGenerationStrategy):
"""
Default implementation of markdown generation strategy.
How it works:
1. Generate raw markdown from cleaned HTML.
2. Convert links to citations.
3. Generate fit markdown if content filter is provided.
4. Return MarkdownGenerationResult.
Args:
content_filter (Optional[RelevantContentFilter]): Content filter for generating fit markdown.
options (Optional[Dict[str, Any]]): Additional options for markdown generation. Defaults to None.
Returns:
MarkdownGenerationResult: Result containing raw markdown, fit markdown, fit HTML, and references markdown.
"""
def __init__(
self,
content_filter: Optional[RelevantContentFilter] = None,
options: Optional[Dict[str, Any]] = None,
):
super().__init__(content_filter, options)
def convert_links_to_citations(
self, markdown: str, base_url: str = ""
) -> Tuple[str, str]:
"""
Convert links in markdown to citations.
How it works:
1. Find all links in the markdown.
2. Convert links to citations.
3. Return converted markdown and references markdown.
Note:
This function uses a regex pattern to find links in markdown.
Args:
markdown (str): Markdown text.
base_url (str): Base URL for URL joins.
Returns:
Tuple[str, str]: Converted markdown and references markdown.
"""
link_map = {}
url_cache = {} # Cache for URL joins
parts = []
last_end = 0
counter = 1
for match in LINK_PATTERN.finditer(markdown):
parts.append(markdown[last_end : match.start()])
text, url, title = match.groups()
# Use cached URL if available, otherwise compute and cache
if base_url and not url.startswith(("http://", "https://", "mailto:")):
if url not in url_cache:
url_cache[url] = fast_urljoin(base_url, url)
url = url_cache[url]
if url not in link_map:
desc = []
if title:
desc.append(title)
if text and text != title:
desc.append(text)
link_map[url] = (counter, ": " + " - ".join(desc) if desc else "")
counter += 1
num = link_map[url][0]
parts.append(
f"{text}{num}"
if not match.group(0).startswith("!")
else f"![{text}{num}⟩]"
)
last_end = match.end()
parts.append(markdown[last_end:])
converted_text = "".join(parts)
# Pre-build reference strings
references = ["\n\n## References\n\n"]
references.extend(
f"{num}{url}{desc}\n"
for url, (num, desc) in sorted(link_map.items(), key=lambda x: x[1][0])
)
return converted_text, "".join(references)
def generate_markdown(
self,
cleaned_html: str,
base_url: str = "",
html2text_options: Optional[Dict[str, Any]] = None,
options: Optional[Dict[str, Any]] = None,
content_filter: Optional[RelevantContentFilter] = None,
citations: bool = True,
**kwargs,
) -> MarkdownGenerationResult:
"""
Generate markdown with citations from cleaned HTML.
How it works:
1. Generate raw markdown from cleaned HTML.
2. Convert links to citations.
3. Generate fit markdown if content filter is provided.
4. Return MarkdownGenerationResult.
Args:
cleaned_html (str): Cleaned HTML content.
base_url (str): Base URL for URL joins.
html2text_options (Optional[Dict[str, Any]]): HTML2Text options.
options (Optional[Dict[str, Any]]): Additional options for markdown generation.
content_filter (Optional[RelevantContentFilter]): Content filter for generating fit markdown.
citations (bool): Whether to generate citations.
Returns:
MarkdownGenerationResult: Result containing raw markdown, fit markdown, fit HTML, and references markdown.
"""
try:
# Initialize HTML2Text with default options for better conversion
h = CustomHTML2Text(baseurl=base_url)
default_options = {
"body_width": 0, # Disable text wrapping
"ignore_emphasis": False,
"ignore_links": False,
"ignore_images": False,
"protect_links": True,
"single_line_break": True,
"mark_code": True,
"escape_snob": False,
}
# Update with custom options if provided
if html2text_options:
default_options.update(html2text_options)
elif options:
default_options.update(options)
elif self.options:
default_options.update(self.options)
h.update_params(**default_options)
# Ensure we have valid input
if not cleaned_html:
cleaned_html = ""
elif not isinstance(cleaned_html, str):
cleaned_html = str(cleaned_html)
# Generate raw markdown
try:
raw_markdown = h.handle(cleaned_html)
except Exception as e:
raw_markdown = f"Error converting HTML to markdown: {str(e)}"
raw_markdown = raw_markdown.replace(" ```", "```")
# Convert links to citations
markdown_with_citations: str = raw_markdown
references_markdown: str = ""
if citations:
try:
(
markdown_with_citations,
references_markdown,
) = self.convert_links_to_citations(raw_markdown, base_url)
except Exception as e:
markdown_with_citations = raw_markdown
references_markdown = f"Error generating citations: {str(e)}"
# Generate fit markdown if content filter is provided
fit_markdown: Optional[str] = ""
filtered_html: Optional[str] = ""
if content_filter or self.content_filter:
try:
content_filter = content_filter or self.content_filter
filtered_html = content_filter.filter_content(cleaned_html)
filtered_html = "\n".join(
"<div>{}</div>".format(s) for s in filtered_html
)
fit_markdown = h.handle(filtered_html)
except Exception as e:
fit_markdown = f"Error generating fit markdown: {str(e)}"
filtered_html = ""
return MarkdownGenerationResult(
raw_markdown=raw_markdown or "",
markdown_with_citations=markdown_with_citations or "",
references_markdown=references_markdown or "",
fit_markdown=fit_markdown or "",
fit_html=filtered_html or "",
)
except Exception as e:
# If anything fails, return empty strings with error message
error_msg = f"Error in markdown generation: {str(e)}"
return MarkdownGenerationResult(
raw_markdown=error_msg,
markdown_with_citations=error_msg,
references_markdown="",
fit_markdown="",
fit_html="",
)

View File

@@ -1,194 +0,0 @@
import os
import asyncio
from pathlib import Path
import aiosqlite
from typing import Optional
import xxhash
import aiofiles
import shutil
from datetime import datetime
from .async_logger import AsyncLogger, LogLevel
# Initialize logger
logger = AsyncLogger(log_level=LogLevel.DEBUG, verbose=True)
# logging.basicConfig(level=logging.INFO)
# logger = logging.getLogger(__name__)
class DatabaseMigration:
def __init__(self, db_path: str):
self.db_path = db_path
self.content_paths = self._ensure_content_dirs(os.path.dirname(db_path))
def _ensure_content_dirs(self, base_path: str) -> dict:
dirs = {
"html": "html_content",
"cleaned": "cleaned_html",
"markdown": "markdown_content",
"extracted": "extracted_content",
"screenshots": "screenshots",
}
content_paths = {}
for key, dirname in dirs.items():
path = os.path.join(base_path, dirname)
os.makedirs(path, exist_ok=True)
content_paths[key] = path
return content_paths
def _generate_content_hash(self, content: str) -> str:
x = xxhash.xxh64()
x.update(content.encode())
content_hash = x.hexdigest()
return content_hash
# return hashlib.sha256(content.encode()).hexdigest()
async def _store_content(self, content: str, content_type: str) -> str:
if not content:
return ""
content_hash = self._generate_content_hash(content)
file_path = os.path.join(self.content_paths[content_type], content_hash)
if not os.path.exists(file_path):
async with aiofiles.open(file_path, "w", encoding="utf-8") as f:
await f.write(content)
return content_hash
async def migrate_database(self):
"""Migrate existing database to file-based storage"""
# logger.info("Starting database migration...")
logger.info("Starting database migration...", tag="INIT")
try:
async with aiosqlite.connect(self.db_path) as db:
# Get all rows
async with db.execute(
"""SELECT url, html, cleaned_html, markdown,
extracted_content, screenshot FROM crawled_data"""
) as cursor:
rows = await cursor.fetchall()
migrated_count = 0
for row in rows:
(
url,
html,
cleaned_html,
markdown,
extracted_content,
screenshot,
) = row
# Store content in files and get hashes
html_hash = await self._store_content(html, "html")
cleaned_hash = await self._store_content(cleaned_html, "cleaned")
markdown_hash = await self._store_content(markdown, "markdown")
extracted_hash = await self._store_content(
extracted_content, "extracted"
)
screenshot_hash = await self._store_content(
screenshot, "screenshots"
)
# Update database with hashes
await db.execute(
"""
UPDATE crawled_data
SET html = ?,
cleaned_html = ?,
markdown = ?,
extracted_content = ?,
screenshot = ?
WHERE url = ?
""",
(
html_hash,
cleaned_hash,
markdown_hash,
extracted_hash,
screenshot_hash,
url,
),
)
migrated_count += 1
if migrated_count % 100 == 0:
logger.info(f"Migrated {migrated_count} records...", tag="INIT")
await db.commit()
logger.success(
f"Migration completed. {migrated_count} records processed.",
tag="COMPLETE",
)
except Exception as e:
# logger.error(f"Migration failed: {e}")
logger.error(
message="Migration failed: {error}",
tag="ERROR",
params={"error": str(e)},
)
raise e
async def backup_database(db_path: str) -> str:
"""Create backup of existing database"""
if not os.path.exists(db_path):
logger.info("No existing database found. Skipping backup.", tag="INIT")
return None
# Create backup with timestamp
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
backup_path = f"{db_path}.backup_{timestamp}"
try:
# Wait for any potential write operations to finish
await asyncio.sleep(1)
# Create backup
shutil.copy2(db_path, backup_path)
logger.info(f"Database backup created at: {backup_path}", tag="COMPLETE")
return backup_path
except Exception as e:
# logger.error(f"Backup failed: {e}")
logger.error(
message="Migration failed: {error}", tag="ERROR", params={"error": str(e)}
)
raise e
async def run_migration(db_path: Optional[str] = None):
"""Run database migration"""
if db_path is None:
db_path = os.path.join(Path.home(), ".crawl4ai", "crawl4ai.db")
if not os.path.exists(db_path):
logger.info("No existing database found. Skipping migration.", tag="INIT")
return
# Create backup first
backup_path = await backup_database(db_path)
if not backup_path:
return
migration = DatabaseMigration(db_path)
await migration.migrate_database()
def main():
"""CLI entry point for migration"""
import argparse
parser = argparse.ArgumentParser(
description="Migrate Crawl4AI database to file-based storage"
)
parser.add_argument("--db-path", help="Custom database path")
args = parser.parse_args()
asyncio.run(run_migration(args.db_path))
if __name__ == "__main__":
main()

View File

@@ -2,125 +2,101 @@ from functools import lru_cache
from pathlib import Path
import subprocess, os
import shutil
import tarfile
from .model_loader import *
import argparse
import urllib.request
from crawl4ai.config import MODEL_REPO_BRANCH
__location__ = os.path.realpath(os.path.join(os.getcwd(), os.path.dirname(__file__)))
@lru_cache()
def get_available_memory(device):
import torch
if device.type == "cuda":
if device.type == 'cuda':
return torch.cuda.get_device_properties(device).total_memory
elif device.type == "mps":
return 48 * 1024**3 # Assuming 8GB for MPS, as a conservative estimate
elif device.type == 'mps':
return 48 * 1024 ** 3 # Assuming 8GB for MPS, as a conservative estimate
else:
return 0
@lru_cache()
def calculate_batch_size(device):
available_memory = get_available_memory(device)
if device.type == "cpu":
if device.type == 'cpu':
return 16
elif device.type in ["cuda", "mps"]:
elif device.type in ['cuda', 'mps']:
# Adjust these thresholds based on your model size and available memory
if available_memory >= 31 * 1024**3: # > 32GB
if available_memory >= 31 * 1024 ** 3: # > 32GB
return 256
elif available_memory >= 15 * 1024**3: # > 16GB to 32GB
elif available_memory >= 15 * 1024 ** 3: # > 16GB to 32GB
return 128
elif available_memory >= 8 * 1024**3: # 8GB to 16GB
elif available_memory >= 8 * 1024 ** 3: # 8GB to 16GB
return 64
else:
return 32
else:
return 16 # Default batch size
return 16 # Default batch size
@lru_cache()
def get_device():
import torch
if torch.cuda.is_available():
device = torch.device("cuda")
device = torch.device('cuda')
elif torch.backends.mps.is_available():
device = torch.device("mps")
device = torch.device('mps')
else:
device = torch.device("cpu")
return device
device = torch.device('cpu')
return device
def set_model_device(model):
device = get_device()
model.to(device)
model.to(device)
return model, device
@lru_cache()
def get_home_folder():
home_folder = os.path.join(
os.getenv("CRAWL4_AI_BASE_DIRECTORY", Path.home()), ".crawl4ai"
)
home_folder = os.path.join(Path.home(), ".crawl4ai")
os.makedirs(home_folder, exist_ok=True)
os.makedirs(f"{home_folder}/cache", exist_ok=True)
os.makedirs(f"{home_folder}/models", exist_ok=True)
return home_folder
return home_folder
@lru_cache()
def load_bert_base_uncased():
from transformers import BertTokenizer, BertModel
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased", resume_download=None)
model = BertModel.from_pretrained("bert-base-uncased", resume_download=None)
from transformers import BertTokenizer, BertModel, AutoTokenizer, AutoModel
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', resume_download=None)
model = BertModel.from_pretrained('bert-base-uncased', resume_download=None)
model.eval()
model, device = set_model_device(model)
return tokenizer, model
@lru_cache()
def load_HF_embedding_model(model_name="BAAI/bge-small-en-v1.5") -> tuple:
"""Load the Hugging Face model for embedding.
Args:
model_name (str, optional): The model name to load. Defaults to "BAAI/bge-small-en-v1.5".
Returns:
tuple: The tokenizer and model.
"""
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained(model_name, resume_download=None)
model = AutoModel.from_pretrained(model_name, resume_download=None)
def load_bge_small_en_v1_5():
from transformers import BertTokenizer, BertModel, AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained('BAAI/bge-small-en-v1.5', resume_download=None)
model = AutoModel.from_pretrained('BAAI/bge-small-en-v1.5', resume_download=None)
model.eval()
model, device = set_model_device(model)
return tokenizer, model
@lru_cache()
def load_text_classifier():
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from transformers import pipeline
import torch
tokenizer = AutoTokenizer.from_pretrained(
"dstefa/roberta-base_topic_classification_nyt_news"
)
model = AutoModelForSequenceClassification.from_pretrained(
"dstefa/roberta-base_topic_classification_nyt_news"
)
tokenizer = AutoTokenizer.from_pretrained("dstefa/roberta-base_topic_classification_nyt_news")
model = AutoModelForSequenceClassification.from_pretrained("dstefa/roberta-base_topic_classification_nyt_news")
model.eval()
model, device = set_model_device(model)
pipe = pipeline("text-classification", model=model, tokenizer=tokenizer)
return pipe
@lru_cache()
def load_text_multilabel_classifier():
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import numpy as np
from scipy.special import expit
import torch
@@ -132,27 +108,18 @@ def load_text_multilabel_classifier():
# else:
# device = torch.device("cpu")
# # return load_spacy_model(), torch.device("cpu")
MODEL = "cardiffnlp/tweet-topic-21-multi"
tokenizer = AutoTokenizer.from_pretrained(MODEL, resume_download=None)
model = AutoModelForSequenceClassification.from_pretrained(
MODEL, resume_download=None
)
model = AutoModelForSequenceClassification.from_pretrained(MODEL, resume_download=None)
model.eval()
model, device = set_model_device(model)
class_mapping = model.config.id2label
def _classifier(texts, threshold=0.5, max_length=64):
tokens = tokenizer(
texts,
return_tensors="pt",
padding=True,
truncation=True,
max_length=max_length,
)
tokens = {
key: val.to(device) for key, val in tokens.items()
} # Move tokens to the selected device
tokens = tokenizer(texts, return_tensors='pt', padding=True, truncation=True, max_length=max_length)
tokens = {key: val.to(device) for key, val in tokens.items()} # Move tokens to the selected device
with torch.no_grad():
output = model(**tokens)
@@ -163,41 +130,35 @@ def load_text_multilabel_classifier():
batch_labels = []
for prediction in predictions:
labels = [
class_mapping[i] for i, value in enumerate(prediction) if value == 1
]
labels = [class_mapping[i] for i, value in enumerate(prediction) if value == 1]
batch_labels.append(labels)
return batch_labels
return _classifier, device
@lru_cache()
def load_nltk_punkt():
import nltk
try:
nltk.data.find("tokenizers/punkt")
nltk.data.find('tokenizers/punkt')
except LookupError:
nltk.download("punkt")
return nltk.data.find("tokenizers/punkt")
nltk.download('punkt')
return nltk.data.find('tokenizers/punkt')
@lru_cache()
def load_spacy_model():
import spacy
name = "models/reuters"
home_folder = get_home_folder()
model_folder = Path(home_folder) / name
# Check if the model directory already exists
if not (model_folder.exists() and any(model_folder.iterdir())):
repo_url = "https://github.com/unclecode/crawl4ai.git"
branch = MODEL_REPO_BRANCH
branch = MODEL_REPO_BRANCH
repo_folder = Path(home_folder) / "crawl4ai"
print("[LOG] ⏬ Downloading Spacy model for the first time...")
# Remove existing repo folder if it exists
@@ -207,9 +168,7 @@ def load_spacy_model():
if model_folder.exists():
shutil.rmtree(model_folder)
except PermissionError:
print(
"[WARNING] Unable to remove existing folders. Please manually delete the following folders and try again:"
)
print("[WARNING] Unable to remove existing folders. Please manually delete the following folders and try again:")
print(f"- {repo_folder}")
print(f"- {model_folder}")
return None
@@ -220,7 +179,7 @@ def load_spacy_model():
["git", "clone", "-b", branch, repo_url, str(repo_folder)],
stdout=subprocess.DEVNULL,
stderr=subprocess.DEVNULL,
check=True,
check=True
)
# Create the models directory if it doesn't exist
@@ -248,7 +207,6 @@ def load_spacy_model():
print(f"Error loading spacy model: {e}")
return None
def download_all_models(remove_existing=False):
"""Download all models required for Crawl4AI."""
if remove_existing:
@@ -277,20 +235,14 @@ def download_all_models(remove_existing=False):
load_nltk_punkt()
print("[LOG] ✅ All models downloaded successfully.")
def main():
print("[LOG] Welcome to the Crawl4AI Model Downloader!")
print("[LOG] This script will download all the models required for Crawl4AI.")
parser = argparse.ArgumentParser(description="Crawl4AI Model Downloader")
parser.add_argument(
"--remove-existing",
action="store_true",
help="Remove existing models before downloading",
)
parser.add_argument('--remove-existing', action='store_true', help="Remove existing models before downloading")
args = parser.parse_args()
download_all_models(remove_existing=args.remove_existing)
if __name__ == "__main__":
main()

View File

@@ -1,100 +1,10 @@
from pydantic import BaseModel, HttpUrl
from typing import List, Dict, Optional, Callable, Awaitable, Union, Any
from enum import Enum
from dataclasses import dataclass
from .ssl_certificate import SSLCertificate
from datetime import datetime
from datetime import timedelta
###############################
# Dispatcher Models
###############################
@dataclass
class DomainState:
last_request_time: float = 0
current_delay: float = 0
fail_count: int = 0
@dataclass
class CrawlerTaskResult:
task_id: str
url: str
result: "CrawlResult"
memory_usage: float
peak_memory: float
start_time: datetime
end_time: datetime
error_message: str = ""
class CrawlStatus(Enum):
QUEUED = "QUEUED"
IN_PROGRESS = "IN_PROGRESS"
COMPLETED = "COMPLETED"
FAILED = "FAILED"
@dataclass
class CrawlStats:
task_id: str
url: str
status: CrawlStatus
start_time: Optional[datetime] = None
end_time: Optional[datetime] = None
memory_usage: float = 0.0
peak_memory: float = 0.0
error_message: str = ""
@property
def duration(self) -> str:
if not self.start_time:
return "0:00"
end = self.end_time or datetime.now()
duration = end - self.start_time
return str(timedelta(seconds=int(duration.total_seconds())))
class DisplayMode(Enum):
DETAILED = "DETAILED"
AGGREGATED = "AGGREGATED"
###############################
# Crawler Models
###############################
@dataclass
class TokenUsage:
completion_tokens: int = 0
prompt_tokens: int = 0
total_tokens: int = 0
completion_tokens_details: Optional[dict] = None
prompt_tokens_details: Optional[dict] = None
from typing import List, Dict, Optional
class UrlModel(BaseModel):
url: HttpUrl
forced: bool = False
class MarkdownGenerationResult(BaseModel):
raw_markdown: str
markdown_with_citations: str
references_markdown: str
fit_markdown: Optional[str] = None
fit_html: Optional[str] = None
class DispatchResult(BaseModel):
task_id: str
memory_usage: float
peak_memory: float
start_time: datetime
end_time: datetime
error_message: str = ""
class CrawlResult(BaseModel):
url: str
html: str
@@ -102,81 +12,11 @@ class CrawlResult(BaseModel):
cleaned_html: Optional[str] = None
media: Dict[str, List[Dict]] = {}
links: Dict[str, List[Dict]] = {}
downloaded_files: Optional[List[str]] = None
screenshot: Optional[str] = None
pdf: Optional[bytes] = None
markdown: Optional[Union[str, MarkdownGenerationResult]] = None
markdown_v2: Optional[MarkdownGenerationResult] = None
fit_markdown: Optional[str] = None
fit_html: Optional[str] = None
markdown: Optional[str] = None
extracted_content: Optional[str] = None
metadata: Optional[dict] = None
error_message: Optional[str] = None
session_id: Optional[str] = None
response_headers: Optional[dict] = None
status_code: Optional[int] = None
ssl_certificate: Optional[SSLCertificate] = None
dispatch_result: Optional[DispatchResult] = None
redirected_url: Optional[str] = None
class Config:
arbitrary_types_allowed = True
class AsyncCrawlResponse(BaseModel):
html: str
response_headers: Dict[str, str]
status_code: int
screenshot: Optional[str] = None
pdf_data: Optional[bytes] = None
get_delayed_content: Optional[Callable[[Optional[float]], Awaitable[str]]] = None
downloaded_files: Optional[List[str]] = None
ssl_certificate: Optional[SSLCertificate] = None
final_url: Optional[str] = None
class Config:
arbitrary_types_allowed = True
###############################
# Scraping Models
###############################
class MediaItem(BaseModel):
src: Optional[str] = ""
alt: Optional[str] = ""
desc: Optional[str] = ""
score: Optional[int] = 0
type: str = "image"
group_id: Optional[int] = 0
format: Optional[str] = None
width: Optional[int] = None
class Link(BaseModel):
href: Optional[str] = ""
text: Optional[str] = ""
title: Optional[str] = ""
base_domain: Optional[str] = ""
class Media(BaseModel):
images: List[MediaItem] = []
videos: List[
MediaItem
] = [] # Using MediaItem model for now, can be extended with Video model if needed
audios: List[
MediaItem
] = [] # Using MediaItem model for now, can be extended with Audio model if needed
class Links(BaseModel):
internal: List[Link] = []
external: List[Link] = []
class ScrapingResult(BaseModel):
cleaned_html: str
success: bool
media: Media = Media()
links: Links = Links()
metadata: Dict[str, Any] = {}
status_code: Optional[int] = None

View File

@@ -0,0 +1,25 @@
{
"_name_or_path": "sentence-transformers/all-MiniLM-L6-v2",
"architectures": [
"BertModel"
],
"attention_probs_dropout_prob": 0.1,
"classifier_dropout": null,
"gradient_checkpointing": false,
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 384,
"initializer_range": 0.02,
"intermediate_size": 1536,
"layer_norm_eps": 1e-12,
"max_position_embeddings": 512,
"model_type": "bert",
"num_attention_heads": 12,
"num_hidden_layers": 6,
"pad_token_id": 0,
"position_embedding_type": "absolute",
"transformers_version": "4.27.4",
"type_vocab_size": 2,
"use_cache": true,
"vocab_size": 30522
}

Binary file not shown.

View File

@@ -0,0 +1,7 @@
{
"cls_token": "[CLS]",
"mask_token": "[MASK]",
"pad_token": "[PAD]",
"sep_token": "[SEP]",
"unk_token": "[UNK]"
}

File diff suppressed because it is too large Load Diff

View File

@@ -0,0 +1,15 @@
{
"cls_token": "[CLS]",
"do_basic_tokenize": true,
"do_lower_case": true,
"mask_token": "[MASK]",
"model_max_length": 512,
"never_split": null,
"pad_token": "[PAD]",
"sep_token": "[SEP]",
"special_tokens_map_file": "/Users/hammad/.cache/huggingface/hub/models--sentence-transformers--all-MiniLM-L6-v2/snapshots/7dbbc90392e2f80f3d3c277d6e90027e55de9125/special_tokens_map.json",
"strip_accents": null,
"tokenize_chinese_chars": true,
"tokenizer_class": "BertTokenizer",
"unk_token": "[UNK]"
}

30522
crawl4ai/models/onnx/vocab.txt Normal file

File diff suppressed because it is too large Load Diff

View File

@@ -202,808 +202,3 @@ Avoid Common Mistakes:
Result
Output the final list of JSON objects, wrapped in <blocks>...</blocks> XML tags. Make sure to close the tag properly."""
PROMPT_FILTER_CONTENT = """Your task is to filter and convert HTML content into clean, focused markdown that's optimized for use with LLMs and information retrieval systems.
INPUT HTML:
<|HTML_CONTENT_START|>
{HTML}
<|HTML_CONTENT_END|>
SPECIFIC INSTRUCTION:
<|USER_INSTRUCTION_START|>
{REQUEST}
<|USER_INSTRUCTION_END|>
TASK DETAILS:
1. Content Selection
- DO: Keep essential information, main content, key details
- DO: Preserve hierarchical structure using markdown headers
- DO: Keep code blocks, tables, key lists
- DON'T: Include navigation menus, ads, footers, cookie notices
- DON'T: Keep social media widgets, sidebars, related content
2. Content Transformation
- DO: Use proper markdown syntax (#, ##, **, `, etc)
- DO: Convert tables to markdown tables
- DO: Preserve code formatting with ```language blocks
- DO: Maintain link texts but remove tracking parameters
- DON'T: Include HTML tags in output
- DON'T: Keep class names, ids, or other HTML attributes
3. Content Organization
- DO: Maintain logical flow of information
- DO: Group related content under appropriate headers
- DO: Use consistent header levels
- DON'T: Fragment related content
- DON'T: Duplicate information
Example Input:
<div class="main-content"><h1>Setup Guide</h1><p>Follow these steps...</p></div>
<div class="sidebar">Related articles...</div>
Example Output:
# Setup Guide
Follow these steps...
IMPORTANT: If specific instruction is provided above, prioritize those requirements over these general guidelines.
OUTPUT FORMAT:
Wrap your response in <content> tags. Use proper markdown throughout.
<content>
[Your markdown content here]
</content>
Begin filtering now."""
JSON_SCHEMA_BUILDER= """
# HTML Schema Generation Instructions
You are a specialized model designed to analyze HTML patterns and generate extraction schemas. Your primary job is to create structured JSON schemas that can be used to extract data from HTML in a consistent and reliable way. When presented with HTML content, you must analyze its structure and generate a schema that captures all relevant data points.
## Your Core Responsibilities:
1. Analyze HTML structure to identify repeating patterns and important data points
2. Generate valid JSON schemas following the specified format
3. Create appropriate selectors that will work reliably for data extraction
4. Name fields meaningfully based on their content and purpose
5. Handle both specific user requests and autonomous pattern detection
## Available Schema Types You Can Generate:
<schema_types>
1. Basic Single-Level Schema
- Use for simple, flat data structures
- Example: Product cards, user profiles
- Direct field extractions
2. Nested Object Schema
- Use for hierarchical data
- Example: Articles with author details
- Contains objects within objects
3. List Schema
- Use for repeating elements
- Example: Comment sections, product lists
- Handles arrays of similar items
4. Complex Nested Lists
- Use for multi-level data
- Example: Categories with subcategories
- Multiple levels of nesting
5. Transformation Schema
- Use for data requiring processing
- Supports regex and text transformations
- Special attribute handling
</schema_types>
<schema_structure>
Your output must always be a JSON object with this structure:
{
"name": "Descriptive name of the pattern",
"baseSelector": "CSS selector for the repeating element",
"fields": [
{
"name": "field_name",
"selector": "CSS selector",
"type": "text|attribute|nested|list|regex",
"attribute": "attribute_name", // Optional
"transform": "transformation_type", // Optional
"pattern": "regex_pattern", // Optional
"fields": [] // For nested/list types
}
]
}
</schema_structure>
<type_definitions>
Available field types:
- text: Direct text extraction
- attribute: HTML attribute extraction
- nested: Object containing other fields
- list: Array of similar items
- regex: Pattern-based extraction
</type_definitions>
<behavior_rules>
1. When given a specific query:
- Focus on extracting requested data points
- Use most specific selectors possible
- Include all fields mentioned in the query
2. When no query is provided:
- Identify main content areas
- Extract all meaningful data points
- Use semantic structure to determine importance
- Include prices, dates, titles, and other common data types
3. Always:
- Use reliable CSS selectors
- Handle dynamic class names appropriately
- Create descriptive field names
- Follow consistent naming conventions
</behavior_rules>
<examples>
1. Basic Product Card Example:
<html>
<div class="product-card" data-cat-id="electronics" data-subcat-id="laptops">
<h2 class="product-title">Gaming Laptop</h2>
<span class="price">$999.99</span>
<img src="laptop.jpg" alt="Gaming Laptop">
</div>
</html>
Generated Schema:
{
"name": "Product Cards",
"baseSelector": ".product-card",
"baseFields": [
{"name": "data_cat_id", "type": "attribute", "attribute": "data-cat-id"},
{"name": "data_subcat_id", "type": "attribute", "attribute": "data-subcat-id"}
],
"fields": [
{
"name": "title",
"selector": ".product-title",
"type": "text"
},
{
"name": "price",
"selector": ".price",
"type": "text"
},
{
"name": "image_url",
"selector": "img",
"type": "attribute",
"attribute": "src"
}
]
}
2. Article with Author Details Example:
<html>
<article>
<h1>The Future of AI</h1>
<div class="author-info">
<span class="author-name">Dr. Smith</span>
<img src="author.jpg" alt="Dr. Smith">
</div>
</article>
</html>
Generated Schema:
{
"name": "Article Details",
"baseSelector": "article",
"fields": [
{
"name": "title",
"selector": "h1",
"type": "text"
},
{
"name": "author",
"type": "nested",
"selector": ".author-info",
"fields": [
{
"name": "name",
"selector": ".author-name",
"type": "text"
},
{
"name": "avatar",
"selector": "img",
"type": "attribute",
"attribute": "src"
}
]
}
]
}
3. Comments Section Example:
<html>
<div class="comments-container">
<div class="comment" data-user-id="123">
<div class="user-name">John123</div>
<p class="comment-text">Great article!</p>
</div>
<div class="comment" data-user-id="456">
<div class="user-name">Alice456</div>
<p class="comment-text">Thanks for sharing.</p>
</div>
</div>
</html>
Generated Schema:
{
"name": "Comment Section",
"baseSelector": ".comments-container",
"baseFields": [
{"name": "data_user_id", "type": "attribute", "attribute": "data-user-id"}
],
"fields": [
{
"name": "comments",
"type": "list",
"selector": ".comment",
"fields": [
{
"name": "user",
"selector": ".user-name",
"type": "text"
},
{
"name": "content",
"selector": ".comment-text",
"type": "text"
}
]
}
]
}
4. E-commerce Categories Example:
<html>
<div class="category-section" data-category="electronics">
<h2>Electronics</h2>
<div class="subcategory">
<h3>Laptops</h3>
<div class="product">
<span class="product-name">MacBook Pro</span>
<span class="price">$1299</span>
</div>
<div class="product">
<span class="product-name">Dell XPS</span>
<span class="price">$999</span>
</div>
</div>
</div>
</html>
Generated Schema:
{
"name": "E-commerce Categories",
"baseSelector": ".category-section",
"baseFields": [
{"name": "data_category", "type": "attribute", "attribute": "data-category"}
],
"fields": [
{
"name": "category_name",
"selector": "h2",
"type": "text"
},
{
"name": "subcategories",
"type": "nested_list",
"selector": ".subcategory",
"fields": [
{
"name": "name",
"selector": "h3",
"type": "text"
},
{
"name": "products",
"type": "list",
"selector": ".product",
"fields": [
{
"name": "name",
"selector": ".product-name",
"type": "text"
},
{
"name": "price",
"selector": ".price",
"type": "text"
}
]
}
]
}
]
}
5. Job Listings with Transformations Example:
<html>
<div class="job-post">
<h3 class="job-title">Senior Developer</h3>
<span class="salary-text">Salary: $120,000/year</span>
<span class="location"> New York, NY </span>
</div>
</html>
Generated Schema:
{
"name": "Job Listings",
"baseSelector": ".job-post",
"fields": [
{
"name": "title",
"selector": ".job-title",
"type": "text",
"transform": "uppercase"
},
{
"name": "salary",
"selector": ".salary-text",
"type": "regex",
"pattern": "\\$([\\d,]+)"
},
{
"name": "location",
"selector": ".location",
"type": "text",
"transform": "strip"
}
]
}
6. Skyscanner Place Card Example:
<html>
<div class="PlaceCard_descriptionContainer__M2NjN" data-testid="description-container">
<div class="PlaceCard_nameContainer__ZjZmY" tabindex="0" role="link">
<div class="PlaceCard_nameContent__ODUwZ">
<span class="BpkText_bpk-text__MjhhY BpkText_bpk-text--heading-4__Y2FlY">Doha</span>
</div>
<span class="BpkText_bpk-text__MjhhY BpkText_bpk-text--heading-4__Y2FlY PlaceCard_subName__NTVkY">Qatar</span>
</div>
<span class="PlaceCard_advertLabel__YTM0N">Sunny days and the warmest welcome awaits</span>
<a class="BpkLink_bpk-link__MmQwY PlaceCard_descriptionLink__NzYwN" href="/flights/del/doha/" data-testid="flights-link">
<div class="PriceDescription_container__NjEzM">
<span class="BpkText_bpk-text--heading-5__MTRjZ">₹17,559</span>
</div>
</a>
</div>
</html>
Generated Schema:
{
"name": "Skyscanner Place Cards",
"baseSelector": "div[class^='PlaceCard_descriptionContainer__']",
"baseFields": [
{"name": "data_testid", "type": "attribute", "attribute": "data-testid"}
],
"fields": [
{
"name": "city_name",
"selector": "div[class^='PlaceCard_nameContent__'] .BpkText_bpk-text--heading-4__",
"type": "text"
},
{
"name": "country_name",
"selector": "span[class*='PlaceCard_subName__']",
"type": "text"
},
{
"name": "description",
"selector": "span[class*='PlaceCard_advertLabel__']",
"type": "text"
},
{
"name": "flight_price",
"selector": "a[data-testid='flights-link'] .BpkText_bpk-text--heading-5__",
"type": "text"
},
{
"name": "flight_url",
"selector": "a[data-testid='flights-link']",
"type": "attribute",
"attribute": "href"
}
]
}
</examples>
<output_requirements>
Your output must:
1. Be valid JSON only
2. Include no explanatory text
3. Follow the exact schema structure provided
4. Use appropriate field types
5. Include all required fields
6. Use valid CSS selectors
</output_requirements>
"""
JSON_SCHEMA_BUILDER_XPATH = """
# HTML Schema Generation Instructions
You are a specialized model designed to analyze HTML patterns and generate extraction schemas. Your primary job is to create structured JSON schemas that can be used to extract data from HTML in a consistent and reliable way. When presented with HTML content, you must analyze its structure and generate a schema that captures all relevant data points.
## Your Core Responsibilities:
1. Analyze HTML structure to identify repeating patterns and important data points
2. Generate valid JSON schemas following the specified format
3. Create appropriate XPath selectors that will work reliably for data extraction
4. Name fields meaningfully based on their content and purpose
5. Handle both specific user requests and autonomous pattern detection
## Available Schema Types You Can Generate:
<schema_types>
1. Basic Single-Level Schema
- Use for simple, flat data structures
- Example: Product cards, user profiles
- Direct field extractions
2. Nested Object Schema
- Use for hierarchical data
- Example: Articles with author details
- Contains objects within objects
3. List Schema
- Use for repeating elements
- Example: Comment sections, product lists
- Handles arrays of similar items
4. Complex Nested Lists
- Use for multi-level data
- Example: Categories with subcategories
- Multiple levels of nesting
5. Transformation Schema
- Use for data requiring processing
- Supports regex and text transformations
- Special attribute handling
</schema_types>
<schema_structure>
Your output must always be a JSON object with this structure:
{
"name": "Descriptive name of the pattern",
"baseSelector": "XPath selector for the repeating element",
"fields": [
{
"name": "field_name",
"selector": "XPath selector",
"type": "text|attribute|nested|list|regex",
"attribute": "attribute_name", // Optional
"transform": "transformation_type", // Optional
"pattern": "regex_pattern", // Optional
"fields": [] // For nested/list types
}
]
}
</schema_structure>
<type_definitions>
Available field types:
- text: Direct text extraction
- attribute: HTML attribute extraction
- nested: Object containing other fields
- list: Array of similar items
- regex: Pattern-based extraction
</type_definitions>
<behavior_rules>
1. When given a specific query:
- Focus on extracting requested data points
- Use most specific selectors possible
- Include all fields mentioned in the query
2. When no query is provided:
- Identify main content areas
- Extract all meaningful data points
- Use semantic structure to determine importance
- Include prices, dates, titles, and other common data types
3. Always:
- Use reliable XPath selectors
- Handle dynamic element IDs appropriately
- Create descriptive field names
- Follow consistent naming conventions
</behavior_rules>
<examples>
1. Basic Product Card Example:
<html>
<div class="product-card" data-cat-id="electronics" data-subcat-id="laptops">
<h2 class="product-title">Gaming Laptop</h2>
<span class="price">$999.99</span>
<img src="laptop.jpg" alt="Gaming Laptop">
</div>
</html>
Generated Schema:
{
"name": "Product Cards",
"baseSelector": "//div[@class='product-card']",
"baseFields": [
{"name": "data_cat_id", "type": "attribute", "attribute": "data-cat-id"},
{"name": "data_subcat_id", "type": "attribute", "attribute": "data-subcat-id"}
],
"fields": [
{
"name": "title",
"selector": ".//h2[@class='product-title']",
"type": "text"
},
{
"name": "price",
"selector": ".//span[@class='price']",
"type": "text"
},
{
"name": "image_url",
"selector": ".//img",
"type": "attribute",
"attribute": "src"
}
]
}
2. Article with Author Details Example:
<html>
<article>
<h1>The Future of AI</h1>
<div class="author-info">
<span class="author-name">Dr. Smith</span>
<img src="author.jpg" alt="Dr. Smith">
</div>
</article>
</html>
Generated Schema:
{
"name": "Article Details",
"baseSelector": "//article",
"fields": [
{
"name": "title",
"selector": ".//h1",
"type": "text"
},
{
"name": "author",
"type": "nested",
"selector": ".//div[@class='author-info']",
"fields": [
{
"name": "name",
"selector": ".//span[@class='author-name']",
"type": "text"
},
{
"name": "avatar",
"selector": ".//img",
"type": "attribute",
"attribute": "src"
}
]
}
]
}
3. Comments Section Example:
<html>
<div class="comments-container">
<div class="comment" data-user-id="123">
<div class="user-name">John123</div>
<p class="comment-text">Great article!</p>
</div>
<div class="comment" data-user-id="456">
<div class="user-name">Alice456</div>
<p class="comment-text">Thanks for sharing.</p>
</div>
</div>
</html>
Generated Schema:
{
"name": "Comment Section",
"baseSelector": "//div[@class='comments-container']",
"fields": [
{
"name": "comments",
"type": "list",
"selector": ".//div[@class='comment']",
"baseFields": [
{"name": "data_user_id", "type": "attribute", "attribute": "data-user-id"}
],
"fields": [
{
"name": "user",
"selector": ".//div[@class='user-name']",
"type": "text"
},
{
"name": "content",
"selector": ".//p[@class='comment-text']",
"type": "text"
}
]
}
]
}
4. E-commerce Categories Example:
<html>
<div class="category-section" data-category="electronics">
<h2>Electronics</h2>
<div class="subcategory">
<h3>Laptops</h3>
<div class="product">
<span class="product-name">MacBook Pro</span>
<span class="price">$1299</span>
</div>
<div class="product">
<span class="product-name">Dell XPS</span>
<span class="price">$999</span>
</div>
</div>
</div>
</html>
Generated Schema:
{
"name": "E-commerce Categories",
"baseSelector": "//div[@class='category-section']",
"baseFields": [
{"name": "data_category", "type": "attribute", "attribute": "data-category"}
],
"fields": [
{
"name": "category_name",
"selector": ".//h2",
"type": "text"
},
{
"name": "subcategories",
"type": "nested_list",
"selector": ".//div[@class='subcategory']",
"fields": [
{
"name": "name",
"selector": ".//h3",
"type": "text"
},
{
"name": "products",
"type": "list",
"selector": ".//div[@class='product']",
"fields": [
{
"name": "name",
"selector": ".//span[@class='product-name']",
"type": "text"
},
{
"name": "price",
"selector": ".//span[@class='price']",
"type": "text"
}
]
}
]
}
]
}
5. Job Listings with Transformations Example:
<html>
<div class="job-post">
<h3 class="job-title">Senior Developer</h3>
<span class="salary-text">Salary: $120,000/year</span>
<span class="location"> New York, NY </span>
</div>
</html>
Generated Schema:
{
"name": "Job Listings",
"baseSelector": "//div[@class='job-post']",
"fields": [
{
"name": "title",
"selector": ".//h3[@class='job-title']",
"type": "text",
"transform": "uppercase"
},
{
"name": "salary",
"selector": ".//span[@class='salary-text']",
"type": "regex",
"pattern": "\\$([\\d,]+)"
},
{
"name": "location",
"selector": ".//span[@class='location']",
"type": "text",
"transform": "strip"
}
]
}
6. Skyscanner Place Card Example:
<html>
<div class="PlaceCard_descriptionContainer__M2NjN" data-testid="description-container">
<div class="PlaceCard_nameContainer__ZjZmY" tabindex="0" role="link">
<div class="PlaceCard_nameContent__ODUwZ">
<span class="BpkText_bpk-text__MjhhY BpkText_bpk-text--heading-4__Y2FlY">Doha</span>
</div>
<span class="BpkText_bpk-text__MjhhY BpkText_bpk-text--heading-4__Y2FlY PlaceCard_subName__NTVkY">Qatar</span>
</div>
<span class="PlaceCard_advertLabel__YTM0N">Sunny days and the warmest welcome awaits</span>
<a class="BpkLink_bpk-link__MmQwY PlaceCard_descriptionLink__NzYwN" href="/flights/del/doha/" data-testid="flights-link">
<div class="PriceDescription_container__NjEzM">
<span class="BpkText_bpk-text--heading-5__MTRjZ">₹17,559</span>
</div>
</a>
</div>
</html>
Generated Schema:
{
"name": "Skyscanner Place Cards",
"baseSelector": "//div[contains(@class, 'PlaceCard_descriptionContainer__')]",
"baseFields": [
{"name": "data_testid", "type": "attribute", "attribute": "data-testid"}
],
"fields": [
{
"name": "city_name",
"selector": ".//div[contains(@class, 'PlaceCard_nameContent__')]//span[contains(@class, 'BpkText_bpk-text--heading-4__')]",
"type": "text"
},
{
"name": "country_name",
"selector": ".//span[contains(@class, 'PlaceCard_subName__')]",
"type": "text"
},
{
"name": "description",
"selector": ".//span[contains(@class, 'PlaceCard_advertLabel__')]",
"type": "text"
},
{
"name": "flight_price",
"selector": ".//a[@data-testid='flights-link']//span[contains(@class, 'BpkText_bpk-text--heading-5__')]",
"type": "text"
},
{
"name": "flight_url",
"selector": ".//a[@data-testid='flights-link']",
"type": "attribute",
"attribute": "href"
}
]
}
</examples>
<output_requirements>
Your output must:
1. Be valid JSON only
2. Include no explanatory text
3. Follow the exact schema structure provided
4. Use appropriate field types
5. Include all required fields
6. Use valid XPath selectors
</output_requirements>
"""

View File

@@ -0,0 +1,3 @@
from .async_web_scraper import AsyncWebScraper
from .bfs_scraper_strategy import BFSScraperStrategy
from .filters import URLFilter, FilterChain, URLPatternFilter, ContentTypeFilter

View File

@@ -0,0 +1,123 @@
from typing import Union, AsyncGenerator, Optional
from .scraper_strategy import ScraperStrategy
from .models import ScraperResult, CrawlResult
from ..async_webcrawler import AsyncWebCrawler
import logging
from dataclasses import dataclass
from contextlib import asynccontextmanager
@dataclass
class ScrapingProgress:
"""Tracks the progress of a scraping operation."""
processed_urls: int = 0
failed_urls: int = 0
current_url: Optional[str] = None
class AsyncWebScraper:
"""
A high-level web scraper that combines an async crawler with a scraping strategy.
Args:
crawler (AsyncWebCrawler): The async web crawler implementation
strategy (ScraperStrategy): The scraping strategy to use
logger (Optional[logging.Logger]): Custom logger for the scraper
"""
def __init__(
self,
crawler: AsyncWebCrawler,
strategy: ScraperStrategy,
logger: Optional[logging.Logger] = None
):
if not isinstance(crawler, AsyncWebCrawler):
raise TypeError("crawler must be an instance of AsyncWebCrawler")
if not isinstance(strategy, ScraperStrategy):
raise TypeError("strategy must be an instance of ScraperStrategy")
self.crawler = crawler
self.strategy = strategy
self.logger = logger or logging.getLogger(__name__)
self._progress = ScrapingProgress()
@property
def progress(self) -> ScrapingProgress:
"""Get current scraping progress."""
return self._progress
@asynccontextmanager
async def _error_handling_context(self, url: str):
"""Context manager for handling errors during scraping."""
try:
yield
except Exception as e:
self.logger.error(f"Error scraping {url}: {str(e)}")
self._progress.failed_urls += 1
raise
async def ascrape(
self,
url: str,
parallel_processing: bool = True,
stream: bool = False
) -> Union[AsyncGenerator[CrawlResult, None], ScraperResult]:
"""
Scrape a website starting from the given URL.
Args:
url: Starting URL for scraping
parallel_processing: Whether to process URLs in parallel
stream: If True, yield results as they come; if False, collect all results
Returns:
Either an async generator yielding CrawlResults or a final ScraperResult
"""
self._progress = ScrapingProgress() # Reset progress
async with self._error_handling_context(url):
if stream:
return self._ascrape_yielding(url, parallel_processing)
return await self._ascrape_collecting(url, parallel_processing)
async def _ascrape_yielding(
self,
url: str,
parallel_processing: bool
) -> AsyncGenerator[CrawlResult, None]:
"""Stream scraping results as they become available."""
try:
result_generator = self.strategy.ascrape(url, self.crawler, parallel_processing)
async for res in result_generator:
self._progress.processed_urls += 1
self._progress.current_url = res.url
yield res
except Exception as e:
self.logger.error(f"Error in streaming scrape: {str(e)}")
raise
async def _ascrape_collecting(
self,
url: str,
parallel_processing: bool
) -> ScraperResult:
"""Collect all scraping results before returning."""
extracted_data = {}
try:
result_generator = self.strategy.ascrape(url, self.crawler, parallel_processing)
async for res in result_generator:
self._progress.processed_urls += 1
self._progress.current_url = res.url
extracted_data[res.url] = res
return ScraperResult(
url=url,
crawled_urls=list(extracted_data.keys()),
extracted_data=extracted_data,
stats={
'processed_urls': self._progress.processed_urls,
'failed_urls': self._progress.failed_urls
}
)
except Exception as e:
self.logger.error(f"Error in collecting scrape: {str(e)}")
raise

View File

@@ -0,0 +1,327 @@
from abc import ABC, abstractmethod
from typing import Union, AsyncGenerator, Optional, Dict, Set
from dataclasses import dataclass
from datetime import datetime
import asyncio
import logging
from urllib.parse import urljoin, urlparse, urlunparse
from urllib.robotparser import RobotFileParser
import validators
import time
from aiolimiter import AsyncLimiter
from tenacity import retry, stop_after_attempt, wait_exponential
from collections import defaultdict
from .models import ScraperResult, CrawlResult
from .filters import FilterChain
from .scorers import URLScorer
from ..async_webcrawler import AsyncWebCrawler
@dataclass
class CrawlStats:
"""Statistics for the crawling process"""
start_time: datetime
urls_processed: int = 0
urls_failed: int = 0
urls_skipped: int = 0
total_depth_reached: int = 0
current_depth: int = 0
robots_blocked: int = 0
class ScraperStrategy(ABC):
"""Base class for scraping strategies"""
@abstractmethod
async def ascrape(
self,
url: str,
crawler: AsyncWebCrawler,
parallel_processing: bool = True,
stream: bool = False
) -> Union[AsyncGenerator[CrawlResult, None], ScraperResult]:
"""Abstract method for scraping implementation"""
pass
@abstractmethod
async def can_process_url(self, url: str) -> bool:
"""Check if URL can be processed based on strategy rules"""
pass
@abstractmethod
async def shutdown(self):
"""Clean up resources used by the strategy"""
pass
class BFSScraperStrategy(ScraperStrategy):
"""Breadth-First Search scraping strategy with politeness controls"""
def __init__(
self,
max_depth: int,
filter_chain: FilterChain,
url_scorer: URLScorer,
max_concurrent: int = 5,
min_crawl_delay: int = 1,
timeout: int = 30,
logger: Optional[logging.Logger] = None
):
self.max_depth = max_depth
self.filter_chain = filter_chain
self.url_scorer = url_scorer
self.max_concurrent = max_concurrent
self.min_crawl_delay = min_crawl_delay
self.timeout = timeout
self.logger = logger or logging.getLogger(__name__)
# Crawl control
self.stats = CrawlStats(start_time=datetime.now())
self._cancel_event = asyncio.Event()
self.process_external_links = False
# Rate limiting and politeness
self.rate_limiter = AsyncLimiter(1, 1)
self.last_crawl_time = defaultdict(float)
self.robot_parsers: Dict[str, RobotFileParser] = {}
self.domain_queues: Dict[str, asyncio.Queue] = defaultdict(asyncio.Queue)
async def can_process_url(self, url: str) -> bool:
"""Check if URL can be processed based on robots.txt and filters
This is our gatekeeper method that determines if a URL should be processed. It:
- Validates URL format using the validators library
- Checks robots.txt permissions for the domain
- Applies custom filters from the filter chain
- Updates statistics for blocked URLs
- Returns False early if any check fails
"""
if not validators.url(url):
self.logger.warning(f"Invalid URL: {url}")
return False
robot_parser = await self._get_robot_parser(url)
if robot_parser and not robot_parser.can_fetch("*", url):
self.stats.robots_blocked += 1
self.logger.info(f"Blocked by robots.txt: {url}")
return False
return self.filter_chain.apply(url)
async def _get_robot_parser(self, url: str) -> Optional[RobotFileParser]:
"""Get or create robots.txt parser for domain.
This is our robots.txt manager that:
- Uses domain-level caching of robot parsers
- Creates and caches new parsers as needed
- Handles failed robots.txt fetches gracefully
- Returns None if robots.txt can't be fetched, allowing crawling to proceed
"""
domain = urlparse(url).netloc
if domain not in self.robot_parsers:
parser = RobotFileParser()
try:
robots_url = f"{urlparse(url).scheme}://{domain}/robots.txt"
parser.set_url(robots_url)
parser.read()
self.robot_parsers[domain] = parser
except Exception as e:
self.logger.warning(f"Error fetching robots.txt for {domain}: {e}")
return None
return self.robot_parsers[domain]
@retry(stop=stop_after_attempt(3),
wait=wait_exponential(multiplier=1, min=4, max=10))
async def _crawl_with_retry(
self,
crawler: AsyncWebCrawler,
url: str
) -> CrawlResult:
"""Crawl URL with retry logic"""
try:
async with asyncio.timeout(self.timeout):
return await crawler.arun(url)
except asyncio.TimeoutError:
self.logger.error(f"Timeout crawling {url}")
raise
async def process_url(
self,
url: str,
depth: int,
crawler: AsyncWebCrawler,
queue: asyncio.PriorityQueue,
visited: Set[str],
depths: Dict[str, int]
) -> Optional[CrawlResult]:
"""Process a single URL and extract links.
This is our main URL processing workhorse that:
- Checks for cancellation
- Validates URLs through can_process_url
- Implements politeness delays per domain
- Applies rate limiting
- Handles crawling with retries
- Updates various statistics
- Processes extracted links
- Returns the crawl result or None on failure
"""
if self._cancel_event.is_set():
return None
if not await self.can_process_url(url):
self.stats.urls_skipped += 1
return None
# Politeness delay
domain = urlparse(url).netloc
time_since_last = time.time() - self.last_crawl_time[domain]
if time_since_last < self.min_crawl_delay:
await asyncio.sleep(self.min_crawl_delay - time_since_last)
self.last_crawl_time[domain] = time.time()
# Crawl with rate limiting
try:
async with self.rate_limiter:
result = await self._crawl_with_retry(crawler, url)
self.stats.urls_processed += 1
except Exception as e:
self.logger.error(f"Error crawling {url}: {e}")
self.stats.urls_failed += 1
return None
# Process links
await self._process_links(result, url, depth, queue, visited, depths)
return result
async def _process_links(
self,
result: CrawlResult,
source_url: str,
depth: int,
queue: asyncio.PriorityQueue,
visited: Set[str],
depths: Dict[str, int]
):
"""Process extracted links from crawl result.
This is our link processor that:
Handles both internal and external links
Normalizes URLs (removes fragments)
Checks depth limits
Scores URLs for priority
Updates depth tracking
Adds valid URLs to the queue
Updates maximum depth statistics
"""
links_ro_process = result.links["internal"]
if self.process_external_links:
links_ro_process += result.links["external"]
for link_type in links_ro_process:
for link in result.links[link_type]:
url = link['href']
# url = urljoin(source_url, link['href'])
# url = urlunparse(urlparse(url)._replace(fragment=""))
if url not in visited and await self.can_process_url(url):
new_depth = depths[source_url] + 1
if new_depth <= self.max_depth:
score = self.url_scorer.score(url)
await queue.put((score, new_depth, url))
depths[url] = new_depth
self.stats.total_depth_reached = max(
self.stats.total_depth_reached,
new_depth
)
async def ascrape(
self,
start_url: str,
crawler: AsyncWebCrawler,
parallel_processing: bool = True
) -> AsyncGenerator[CrawlResult, None]:
"""Implement BFS crawling strategy"""
# Initialize crawl state
"""
queue: A priority queue where items are tuples of (score, depth, url)
Score: Determines crawling priority (lower = higher priority)
Depth: Current distance from start_url
URL: The actual URL to crawl
visited: Keeps track of URLs we've already seen to avoid cycles
depths: Maps URLs to their depths from the start URL
pending_tasks: Tracks currently running crawl tasks
"""
queue = asyncio.PriorityQueue()
await queue.put((0, 0, start_url))
visited: Set[str] = set()
depths = {start_url: 0}
pending_tasks = set()
try:
while (not queue.empty() or pending_tasks) and not self._cancel_event.is_set():
"""
This sets up our main control loop which:
- Continues while there are URLs to process (not queue.empty())
- Or while there are tasks still running (pending_tasks)
- Can be interrupted via cancellation (not self._cancel_event.is_set())
"""
# Start new tasks up to max_concurrent
while not queue.empty() and len(pending_tasks) < self.max_concurrent:
"""
This section manages task creation:
Checks if we can start more tasks (under max_concurrent limit)
Gets the next URL from the priority queue
Marks URLs as visited immediately to prevent duplicates
Updates current depth in stats
Either:
Creates a new async task (parallel mode)
Processes URL directly (sequential mode)
"""
_, depth, url = await queue.get()
if url not in visited:
visited.add(url)
self.stats.current_depth = depth
if parallel_processing:
task = asyncio.create_task(
self.process_url(url, depth, crawler, queue, visited, depths)
)
pending_tasks.add(task)
else:
result = await self.process_url(
url, depth, crawler, queue, visited, depths
)
if result:
yield result
# Process completed tasks
"""
This section manages completed tasks:
Waits for any task to complete using asyncio.wait
Uses FIRST_COMPLETED to handle results as soon as they're ready
Yields successful results to the caller
Updates pending_tasks to remove completed ones
"""
if pending_tasks:
done, pending_tasks = await asyncio.wait(
pending_tasks,
return_when=asyncio.FIRST_COMPLETED
)
for task in done:
result = await task
if result:
yield result
except Exception as e:
self.logger.error(f"Error in crawl process: {e}")
raise
finally:
# Clean up any remaining tasks
for task in pending_tasks:
task.cancel()
self.stats.end_time = datetime.now()
async def shutdown(self):
"""Clean up resources and stop crawling"""
self._cancel_event.set()
# Clear caches and close connections
self.robot_parsers.clear()
self.domain_queues.clear()

205
crawl4ai/scraper/filters.py Normal file
View File

@@ -0,0 +1,205 @@
# from .url_filter import URLFilter, FilterChain
# from .content_type_filter import ContentTypeFilter
# from .url_pattern_filter import URLPatternFilter
from abc import ABC, abstractmethod
from typing import List, Pattern, Set, Union
import re
from urllib.parse import urlparse
import mimetypes
import logging
from dataclasses import dataclass
import fnmatch
@dataclass
class FilterStats:
"""Statistics for filter applications"""
total_urls: int = 0
rejected_urls: int = 0
passed_urls: int = 0
class URLFilter(ABC):
"""Base class for URL filters"""
def __init__(self, name: str = None):
self.name = name or self.__class__.__name__
self.stats = FilterStats()
self.logger = logging.getLogger(f"urlfilter.{self.name}")
@abstractmethod
def apply(self, url: str) -> bool:
"""Apply the filter to a URL"""
pass
def _update_stats(self, passed: bool):
"""Update filter statistics"""
self.stats.total_urls += 1
if passed:
self.stats.passed_urls += 1
else:
self.stats.rejected_urls += 1
class FilterChain:
"""Chain of URL filters."""
def __init__(self, filters: List[URLFilter] = None):
self.filters = filters or []
self.stats = FilterStats()
self.logger = logging.getLogger("urlfilter.chain")
def add_filter(self, filter_: URLFilter) -> 'FilterChain':
"""Add a filter to the chain"""
self.filters.append(filter_)
return self # Enable method chaining
def apply(self, url: str) -> bool:
"""Apply all filters in the chain"""
self.stats.total_urls += 1
for filter_ in self.filters:
if not filter_.apply(url):
self.stats.rejected_urls += 1
self.logger.debug(f"URL {url} rejected by {filter_.name}")
return False
self.stats.passed_urls += 1
return True
class URLPatternFilter(URLFilter):
"""Filter URLs based on glob patterns or regex.
pattern_filter = URLPatternFilter([
"*.example.com/*", # Glob pattern
"*/article/*", # Path pattern
re.compile(r"blog-\d+") # Regex pattern
])
- Supports glob patterns and regex
- Multiple patterns per filter
- Pattern pre-compilation for performance
"""
def __init__(self, patterns: Union[str, Pattern, List[Union[str, Pattern]]],
use_glob: bool = True):
super().__init__()
self.patterns = [patterns] if isinstance(patterns, (str, Pattern)) else patterns
self.use_glob = use_glob
self._compiled_patterns = []
for pattern in self.patterns:
if isinstance(pattern, str) and use_glob:
self._compiled_patterns.append(self._glob_to_regex(pattern))
else:
self._compiled_patterns.append(re.compile(pattern) if isinstance(pattern, str) else pattern)
def _glob_to_regex(self, pattern: str) -> Pattern:
"""Convert glob pattern to regex"""
return re.compile(fnmatch.translate(pattern))
def apply(self, url: str) -> bool:
"""Check if URL matches any of the patterns"""
matches = any(pattern.search(url) for pattern in self._compiled_patterns)
self._update_stats(matches)
return matches
class ContentTypeFilter(URLFilter):
"""Filter URLs based on expected content type.
content_filter = ContentTypeFilter([
"text/html",
"application/pdf"
], check_extension=True)
- Filter by MIME types
- Extension checking
- Support for multiple content types
"""
def __init__(self, allowed_types: Union[str, List[str]],
check_extension: bool = True):
super().__init__()
self.allowed_types = [allowed_types] if isinstance(allowed_types, str) else allowed_types
self.check_extension = check_extension
self._normalize_types()
def _normalize_types(self):
"""Normalize content type strings"""
self.allowed_types = [t.lower() for t in self.allowed_types]
def _check_extension(self, url: str) -> bool:
"""Check URL's file extension"""
ext = urlparse(url).path.split('.')[-1].lower() if '.' in urlparse(url).path else ''
if not ext:
return True # No extension, might be dynamic content
guessed_type = mimetypes.guess_type(url)[0]
return any(allowed in (guessed_type or '').lower() for allowed in self.allowed_types)
def apply(self, url: str) -> bool:
"""Check if URL's content type is allowed"""
result = True
if self.check_extension:
result = self._check_extension(url)
self._update_stats(result)
return result
class DomainFilter(URLFilter):
"""Filter URLs based on allowed/blocked domains.
domain_filter = DomainFilter(
allowed_domains=["example.com", "blog.example.com"],
blocked_domains=["ads.example.com"]
)
- Allow/block specific domains
- Subdomain support
- Efficient domain matching
"""
def __init__(self, allowed_domains: Union[str, List[str]] = None,
blocked_domains: Union[str, List[str]] = None):
super().__init__()
self.allowed_domains = set(self._normalize_domains(allowed_domains)) if allowed_domains else None
self.blocked_domains = set(self._normalize_domains(blocked_domains)) if blocked_domains else set()
def _normalize_domains(self, domains: Union[str, List[str]]) -> List[str]:
"""Normalize domain strings"""
if isinstance(domains, str):
domains = [domains]
return [d.lower().strip() for d in domains]
def _extract_domain(self, url: str) -> str:
"""Extract domain from URL"""
return urlparse(url).netloc.lower()
def apply(self, url: str) -> bool:
"""Check if URL's domain is allowed"""
domain = self._extract_domain(url)
if domain in self.blocked_domains:
self._update_stats(False)
return False
if self.allowed_domains is not None and domain not in self.allowed_domains:
self._update_stats(False)
return False
self._update_stats(True)
return True
# Example usage:
def create_common_filter_chain() -> FilterChain:
"""Create a commonly used filter chain"""
return FilterChain([
URLPatternFilter([
"*.html", "*.htm", # HTML files
"*/article/*", "*/blog/*" # Common content paths
]),
ContentTypeFilter([
"text/html",
"application/xhtml+xml"
]),
DomainFilter(
blocked_domains=["ads.*", "analytics.*"]
)
])

View File

@@ -0,0 +1,8 @@
from pydantic import BaseModel
from typing import List, Dict
from ..models import CrawlResult
class ScraperResult(BaseModel):
url: str
crawled_urls: List[str]
extracted_data: Dict[str,CrawlResult]

268
crawl4ai/scraper/scorers.py Normal file
View File

@@ -0,0 +1,268 @@
# from .url_scorer import URLScorer
# from .keyword_relevance_scorer import KeywordRelevanceScorer
from abc import ABC, abstractmethod
from typing import List, Dict, Optional, Union
from dataclasses import dataclass
from urllib.parse import urlparse, unquote
import re
from collections import defaultdict
import math
import logging
@dataclass
class ScoringStats:
"""Statistics for URL scoring"""
urls_scored: int = 0
total_score: float = 0.0
min_score: float = float('inf')
max_score: float = float('-inf')
def update(self, score: float):
"""Update scoring statistics"""
self.urls_scored += 1
self.total_score += score
self.min_score = min(self.min_score, score)
self.max_score = max(self.max_score, score)
@property
def average_score(self) -> float:
"""Calculate average score"""
return self.total_score / self.urls_scored if self.urls_scored > 0 else 0.0
class URLScorer(ABC):
"""Base class for URL scoring strategies"""
def __init__(self, weight: float = 1.0, name: str = None):
self.weight = weight
self.name = name or self.__class__.__name__
self.stats = ScoringStats()
self.logger = logging.getLogger(f"urlscorer.{self.name}")
@abstractmethod
def _calculate_score(self, url: str) -> float:
"""Calculate the raw score for a URL"""
pass
def score(self, url: str) -> float:
"""Calculate the weighted score for a URL"""
raw_score = self._calculate_score(url)
weighted_score = raw_score * self.weight
self.stats.update(weighted_score)
return weighted_score
class CompositeScorer(URLScorer):
"""Combines multiple scorers with weights"""
def __init__(self, scorers: List[URLScorer], normalize: bool = True):
super().__init__(name="CompositeScorer")
self.scorers = scorers
self.normalize = normalize
def _calculate_score(self, url: str) -> float:
scores = [scorer.score(url) for scorer in self.scorers]
total_score = sum(scores)
if self.normalize and scores:
total_score /= len(scores)
return total_score
class KeywordRelevanceScorer(URLScorer):
"""Score URLs based on keyword relevance.
keyword_scorer = KeywordRelevanceScorer(
keywords=["python", "programming"],
weight=1.0,
case_sensitive=False
)
- Score based on keyword matches
- Case sensitivity options
- Weighted scoring
"""
def __init__(self, keywords: List[str], weight: float = 1.0,
case_sensitive: bool = False):
super().__init__(weight=weight)
self.keywords = keywords
self.case_sensitive = case_sensitive
self._compile_keywords()
def _compile_keywords(self):
"""Prepare keywords for matching"""
flags = 0 if self.case_sensitive else re.IGNORECASE
self.patterns = [re.compile(re.escape(k), flags) for k in self.keywords]
def _calculate_score(self, url: str) -> float:
"""Calculate score based on keyword matches"""
decoded_url = unquote(url)
total_matches = sum(
1 for pattern in self.patterns
if pattern.search(decoded_url)
)
# Normalize score between 0 and 1
return total_matches / len(self.patterns) if self.patterns else 0.0
class PathDepthScorer(URLScorer):
"""Score URLs based on their path depth.
path_scorer = PathDepthScorer(
optimal_depth=3, # Preferred URL depth
weight=0.7
)
- Score based on URL path depth
- Configurable optimal depth
- Diminishing returns for deeper paths
"""
def __init__(self, optimal_depth: int = 3, weight: float = 1.0):
super().__init__(weight=weight)
self.optimal_depth = optimal_depth
def _calculate_score(self, url: str) -> float:
"""Calculate score based on path depth"""
path = urlparse(url).path
depth = len([x for x in path.split('/') if x])
# Score decreases as we move away from optimal depth
distance_from_optimal = abs(depth - self.optimal_depth)
return 1.0 / (1.0 + distance_from_optimal)
class ContentTypeScorer(URLScorer):
"""Score URLs based on content type preferences.
content_scorer = ContentTypeScorer({
r'\.html$': 1.0,
r'\.pdf$': 0.8,
r'\.xml$': 0.6
})
- Score based on file types
- Configurable type weights
- Pattern matching support
"""
def __init__(self, type_weights: Dict[str, float], weight: float = 1.0):
super().__init__(weight=weight)
self.type_weights = type_weights
self._compile_patterns()
def _compile_patterns(self):
"""Prepare content type patterns"""
self.patterns = {
re.compile(pattern): weight
for pattern, weight in self.type_weights.items()
}
def _calculate_score(self, url: str) -> float:
"""Calculate score based on content type matching"""
for pattern, weight in self.patterns.items():
if pattern.search(url):
return weight
return 0.0
class FreshnessScorer(URLScorer):
"""Score URLs based on freshness indicators.
freshness_scorer = FreshnessScorer(weight=0.9)
Score based on date indicators in URLs
Multiple date format support
Recency weighting"""
def __init__(self, weight: float = 1.0):
super().__init__(weight=weight)
self.date_patterns = [
r'/(\d{4})/(\d{2})/(\d{2})/', # yyyy/mm/dd
r'(\d{4})[-_](\d{2})[-_](\d{2})', # yyyy-mm-dd
r'/(\d{4})/', # year only
]
self._compile_patterns()
def _compile_patterns(self):
"""Prepare date patterns"""
self.compiled_patterns = [re.compile(p) for p in self.date_patterns]
def _calculate_score(self, url: str) -> float:
"""Calculate score based on date indicators"""
for pattern in self.compiled_patterns:
if match := pattern.search(url):
year = int(match.group(1))
# Score higher for more recent years
return 1.0 - (2024 - year) * 0.1
return 0.5 # Default score for URLs without dates
class DomainAuthorityScorer(URLScorer):
"""Score URLs based on domain authority.
authority_scorer = DomainAuthorityScorer({
"python.org": 1.0,
"github.com": 0.9,
"medium.com": 0.7
})
Score based on domain importance
Configurable domain weights
Default weight for unknown domains"""
def __init__(self, domain_weights: Dict[str, float],
default_weight: float = 0.5, weight: float = 1.0):
super().__init__(weight=weight)
self.domain_weights = domain_weights
self.default_weight = default_weight
def _calculate_score(self, url: str) -> float:
"""Calculate score based on domain authority"""
domain = urlparse(url).netloc.lower()
return self.domain_weights.get(domain, self.default_weight)
def create_balanced_scorer() -> CompositeScorer:
"""Create a balanced composite scorer"""
return CompositeScorer([
KeywordRelevanceScorer(
keywords=["article", "blog", "news", "research"],
weight=1.0
),
PathDepthScorer(
optimal_depth=3,
weight=0.7
),
ContentTypeScorer(
type_weights={
r'\.html?$': 1.0,
r'\.pdf$': 0.8,
r'\.xml$': 0.6
},
weight=0.8
),
FreshnessScorer(
weight=0.9
)
])
# Example Usage:
"""
# Create a composite scorer
scorer = CompositeScorer([
KeywordRelevanceScorer(["python", "programming"], weight=1.0),
PathDepthScorer(optimal_depth=2, weight=0.7),
FreshnessScorer(weight=0.8),
DomainAuthorityScorer(
domain_weights={
"python.org": 1.0,
"github.com": 0.9,
"medium.com": 0.7
},
weight=0.9
)
])
# Score a URL
score = scorer.score("https://python.org/article/2024/01/new-features")
# Access statistics
print(f"Average score: {scorer.stats.average_score}")
print(f"URLs scored: {scorer.stats.urls_scored}")
"""

View File

@@ -0,0 +1,26 @@
from abc import ABC, abstractmethod
from .models import ScraperResult, CrawlResult
from ..models import CrawlResult
from ..async_webcrawler import AsyncWebCrawler
from typing import Union, AsyncGenerator
class ScraperStrategy(ABC):
@abstractmethod
async def ascrape(self, url: str, crawler: AsyncWebCrawler, parallel_processing: bool = True, stream: bool = False) -> Union[AsyncGenerator[CrawlResult, None], ScraperResult]:
"""Scrape the given URL using the specified crawler.
Args:
url (str): The starting URL for the scrape.
crawler (AsyncWebCrawler): The web crawler instance.
parallel_processing (bool): Whether to use parallel processing. Defaults to True.
stream (bool): If True, yields individual crawl results as they are ready;
if False, accumulates results and returns a final ScraperResult.
Yields:
CrawlResult: Individual crawl results if stream is True.
Returns:
ScraperResult: A summary of the scrape results containing the final extracted data
and the list of crawled URLs if stream is False.
"""
pass

View File

@@ -1,184 +0,0 @@
"""SSL Certificate class for handling certificate operations."""
import ssl
import socket
import base64
import json
from typing import Dict, Any, Optional
from urllib.parse import urlparse
import OpenSSL.crypto
from pathlib import Path
class SSLCertificate:
"""
A class representing an SSL certificate with methods to export in various formats.
Attributes:
cert_info (Dict[str, Any]): The certificate information.
Methods:
from_url(url: str, timeout: int = 10) -> Optional['SSLCertificate']: Create SSLCertificate instance from a URL.
from_file(file_path: str) -> Optional['SSLCertificate']: Create SSLCertificate instance from a file.
from_binary(binary_data: bytes) -> Optional['SSLCertificate']: Create SSLCertificate instance from binary data.
export_as_pem() -> str: Export the certificate as PEM format.
export_as_der() -> bytes: Export the certificate as DER format.
export_as_json() -> Dict[str, Any]: Export the certificate as JSON format.
export_as_text() -> str: Export the certificate as text format.
"""
def __init__(self, cert_info: Dict[str, Any]):
self._cert_info = self._decode_cert_data(cert_info)
@staticmethod
def from_url(url: str, timeout: int = 10) -> Optional["SSLCertificate"]:
"""
Create SSLCertificate instance from a URL.
Args:
url (str): URL of the website.
timeout (int): Timeout for the connection (default: 10).
Returns:
Optional[SSLCertificate]: SSLCertificate instance if successful, None otherwise.
"""
try:
hostname = urlparse(url).netloc
if ":" in hostname:
hostname = hostname.split(":")[0]
context = ssl.create_default_context()
with socket.create_connection((hostname, 443), timeout=timeout) as sock:
with context.wrap_socket(sock, server_hostname=hostname) as ssock:
cert_binary = ssock.getpeercert(binary_form=True)
x509 = OpenSSL.crypto.load_certificate(
OpenSSL.crypto.FILETYPE_ASN1, cert_binary
)
cert_info = {
"subject": dict(x509.get_subject().get_components()),
"issuer": dict(x509.get_issuer().get_components()),
"version": x509.get_version(),
"serial_number": hex(x509.get_serial_number()),
"not_before": x509.get_notBefore(),
"not_after": x509.get_notAfter(),
"fingerprint": x509.digest("sha256").hex(),
"signature_algorithm": x509.get_signature_algorithm(),
"raw_cert": base64.b64encode(cert_binary),
}
# Add extensions
extensions = []
for i in range(x509.get_extension_count()):
ext = x509.get_extension(i)
extensions.append(
{"name": ext.get_short_name(), "value": str(ext)}
)
cert_info["extensions"] = extensions
return SSLCertificate(cert_info)
except Exception:
return None
@staticmethod
def _decode_cert_data(data: Any) -> Any:
"""Helper method to decode bytes in certificate data."""
if isinstance(data, bytes):
return data.decode("utf-8")
elif isinstance(data, dict):
return {
(
k.decode("utf-8") if isinstance(k, bytes) else k
): SSLCertificate._decode_cert_data(v)
for k, v in data.items()
}
elif isinstance(data, list):
return [SSLCertificate._decode_cert_data(item) for item in data]
return data
def to_json(self, filepath: Optional[str] = None) -> Optional[str]:
"""
Export certificate as JSON.
Args:
filepath (Optional[str]): Path to save the JSON file (default: None).
Returns:
Optional[str]: JSON string if successful, None otherwise.
"""
json_str = json.dumps(self._cert_info, indent=2, ensure_ascii=False)
if filepath:
Path(filepath).write_text(json_str, encoding="utf-8")
return None
return json_str
def to_pem(self, filepath: Optional[str] = None) -> Optional[str]:
"""
Export certificate as PEM.
Args:
filepath (Optional[str]): Path to save the PEM file (default: None).
Returns:
Optional[str]: PEM string if successful, None otherwise.
"""
try:
x509 = OpenSSL.crypto.load_certificate(
OpenSSL.crypto.FILETYPE_ASN1,
base64.b64decode(self._cert_info["raw_cert"]),
)
pem_data = OpenSSL.crypto.dump_certificate(
OpenSSL.crypto.FILETYPE_PEM, x509
).decode("utf-8")
if filepath:
Path(filepath).write_text(pem_data, encoding="utf-8")
return None
return pem_data
except Exception:
return None
def to_der(self, filepath: Optional[str] = None) -> Optional[bytes]:
"""
Export certificate as DER.
Args:
filepath (Optional[str]): Path to save the DER file (default: None).
Returns:
Optional[bytes]: DER bytes if successful, None otherwise.
"""
try:
der_data = base64.b64decode(self._cert_info["raw_cert"])
if filepath:
Path(filepath).write_bytes(der_data)
return None
return der_data
except Exception:
return None
@property
def issuer(self) -> Dict[str, str]:
"""Get certificate issuer information."""
return self._cert_info.get("issuer", {})
@property
def subject(self) -> Dict[str, str]:
"""Get certificate subject information."""
return self._cert_info.get("subject", {})
@property
def valid_from(self) -> str:
"""Get certificate validity start date."""
return self._cert_info.get("not_before", "")
@property
def valid_until(self) -> str:
"""Get certificate validity end date."""
return self._cert_info.get("not_after", "")
@property
def fingerprint(self) -> str:
"""Get certificate fingerprint."""
return self._cert_info.get("fingerprint", "")

146
crawl4ai/train.py Normal file
View File

@@ -0,0 +1,146 @@
import spacy
from spacy.training import Example
import random
import nltk
from nltk.corpus import reuters
import torch
def save_spacy_model_as_torch(nlp, model_dir="models/reuters"):
# Extract the TextCategorizer component
textcat = nlp.get_pipe("textcat_multilabel")
# Convert the weights to a PyTorch state dictionary
state_dict = {name: torch.tensor(param.data) for name, param in textcat.model.named_parameters()}
# Save the state dictionary
torch.save(state_dict, f"{model_dir}/model_weights.pth")
# Extract and save the vocabulary
vocab = extract_vocab(nlp)
with open(f"{model_dir}/vocab.txt", "w") as vocab_file:
for word, idx in vocab.items():
vocab_file.write(f"{word}\t{idx}\n")
print(f"Model weights and vocabulary saved to: {model_dir}")
def extract_vocab(nlp):
# Extract vocabulary from the SpaCy model
vocab = {word: i for i, word in enumerate(nlp.vocab.strings)}
return vocab
nlp = spacy.load("models/reuters")
save_spacy_model_as_torch(nlp, model_dir="models")
def train_and_save_reuters_model(model_dir="models/reuters"):
# Ensure the Reuters corpus is downloaded
nltk.download('reuters')
nltk.download('punkt')
if not reuters.fileids():
print("Reuters corpus not found.")
return
# Load a blank English spaCy model
nlp = spacy.blank("en")
# Create a TextCategorizer with the ensemble model for multi-label classification
textcat = nlp.add_pipe("textcat_multilabel")
# Add labels to text classifier
for label in reuters.categories():
textcat.add_label(label)
# Prepare training data
train_examples = []
for fileid in reuters.fileids():
categories = reuters.categories(fileid)
text = reuters.raw(fileid)
cats = {label: label in categories for label in reuters.categories()}
# Prepare spacy Example objects
doc = nlp.make_doc(text)
example = Example.from_dict(doc, {'cats': cats})
train_examples.append(example)
# Initialize the text categorizer with the example objects
nlp.initialize(lambda: train_examples)
# Train the model
random.seed(1)
spacy.util.fix_random_seed(1)
for i in range(5): # Adjust iterations for better accuracy
random.shuffle(train_examples)
losses = {}
# Create batches of data
batches = spacy.util.minibatch(train_examples, size=8)
for batch in batches:
nlp.update(batch, drop=0.2, losses=losses)
print(f"Losses at iteration {i}: {losses}")
# Save the trained model
nlp.to_disk(model_dir)
print(f"Model saved to: {model_dir}")
def train_model(model_dir, additional_epochs=0):
# Load the model if it exists, otherwise start with a blank model
try:
nlp = spacy.load(model_dir)
print("Model loaded from disk.")
except IOError:
print("No existing model found. Starting with a new model.")
nlp = spacy.blank("en")
textcat = nlp.add_pipe("textcat_multilabel")
for label in reuters.categories():
textcat.add_label(label)
# Prepare training data
train_examples = []
for fileid in reuters.fileids():
categories = reuters.categories(fileid)
text = reuters.raw(fileid)
cats = {label: label in categories for label in reuters.categories()}
doc = nlp.make_doc(text)
example = Example.from_dict(doc, {'cats': cats})
train_examples.append(example)
# Initialize the model if it was newly created
if 'textcat_multilabel' not in nlp.pipe_names:
nlp.initialize(lambda: train_examples)
else:
print("Continuing training with existing model.")
# Train the model
random.seed(1)
spacy.util.fix_random_seed(1)
num_epochs = 5 + additional_epochs
for i in range(num_epochs):
random.shuffle(train_examples)
losses = {}
batches = spacy.util.minibatch(train_examples, size=8)
for batch in batches:
nlp.update(batch, drop=0.2, losses=losses)
print(f"Losses at iteration {i}: {losses}")
# Save the trained model
nlp.to_disk(model_dir)
print(f"Model saved to: {model_dir}")
def load_model_and_predict(model_dir, text, tok_k = 3):
# Load the trained model from the specified directory
nlp = spacy.load(model_dir)
# Process the text with the loaded model
doc = nlp(text)
# gee top 3 categories
top_categories = sorted(doc.cats.items(), key=lambda x: x[1], reverse=True)[:tok_k]
print(f"Top {tok_k} categories:")
return top_categories
if __name__ == "__main__":
train_and_save_reuters_model()
train_model("models/reuters", additional_epochs=5)
model_directory = "reuters_model_10"
print(reuters.categories())
example_text = "Apple Inc. is reportedly buying a startup for $1 billion"
r =load_model_and_predict(model_directory, example_text)
print(r)

View File

@@ -1,299 +0,0 @@
import random
from typing import Optional, Literal, List, Dict, Tuple
import re
class UserAgentGenerator:
"""
Generate random user agents with specified constraints.
Attributes:
desktop_platforms (dict): A dictionary of possible desktop platforms and their corresponding user agent strings.
mobile_platforms (dict): A dictionary of possible mobile platforms and their corresponding user agent strings.
browser_combinations (dict): A dictionary of possible browser combinations and their corresponding user agent strings.
rendering_engines (dict): A dictionary of possible rendering engines and their corresponding user agent strings.
chrome_versions (list): A list of possible Chrome browser versions.
firefox_versions (list): A list of possible Firefox browser versions.
edge_versions (list): A list of possible Edge browser versions.
safari_versions (list): A list of possible Safari browser versions.
ios_versions (list): A list of possible iOS browser versions.
android_versions (list): A list of possible Android browser versions.
Methods:
generate_user_agent(
platform: Literal["desktop", "mobile"] = "desktop",
browser: str = "chrome",
rendering_engine: str = "chrome_webkit",
chrome_version: Optional[str] = None,
firefox_version: Optional[str] = None,
edge_version: Optional[str] = None,
safari_version: Optional[str] = None,
ios_version: Optional[str] = None,
android_version: Optional[str] = None
): Generates a random user agent string based on the specified parameters.
"""
def __init__(self):
# Previous platform definitions remain the same...
self.desktop_platforms = {
"windows": {
"10_64": "(Windows NT 10.0; Win64; x64)",
"10_32": "(Windows NT 10.0; WOW64)",
},
"macos": {
"intel": "(Macintosh; Intel Mac OS X 10_15_7)",
"newer": "(Macintosh; Intel Mac OS X 10.15; rv:109.0)",
},
"linux": {
"generic": "(X11; Linux x86_64)",
"ubuntu": "(X11; Ubuntu; Linux x86_64)",
"chrome_os": "(X11; CrOS x86_64 14541.0.0)",
},
}
self.mobile_platforms = {
"android": {
"samsung": "(Linux; Android 13; SM-S901B)",
"pixel": "(Linux; Android 12; Pixel 6)",
"oneplus": "(Linux; Android 13; OnePlus 9 Pro)",
"xiaomi": "(Linux; Android 12; M2102J20SG)",
},
"ios": {
"iphone": "(iPhone; CPU iPhone OS 16_5 like Mac OS X)",
"ipad": "(iPad; CPU OS 16_5 like Mac OS X)",
},
}
# Browser Combinations
self.browser_combinations = {
1: [["chrome"], ["firefox"], ["safari"], ["edge"]],
2: [["gecko", "firefox"], ["chrome", "safari"], ["webkit", "safari"]],
3: [["chrome", "safari", "edge"], ["webkit", "chrome", "safari"]],
}
# Rendering Engines with versions
self.rendering_engines = {
"chrome_webkit": "AppleWebKit/537.36",
"safari_webkit": "AppleWebKit/605.1.15",
"gecko": [ # Added Gecko versions
"Gecko/20100101",
"Gecko/20100101", # Firefox usually uses this constant version
"Gecko/2010010",
],
}
# Browser Versions
self.chrome_versions = [
"Chrome/119.0.6045.199",
"Chrome/118.0.5993.117",
"Chrome/117.0.5938.149",
"Chrome/116.0.5845.187",
"Chrome/115.0.5790.171",
]
self.edge_versions = [
"Edg/119.0.2151.97",
"Edg/118.0.2088.76",
"Edg/117.0.2045.47",
"Edg/116.0.1938.81",
"Edg/115.0.1901.203",
]
self.safari_versions = [
"Safari/537.36", # For Chrome-based
"Safari/605.1.15",
"Safari/604.1",
"Safari/602.1",
"Safari/601.5.17",
]
# Added Firefox versions
self.firefox_versions = [
"Firefox/119.0",
"Firefox/118.0.2",
"Firefox/117.0.1",
"Firefox/116.0",
"Firefox/115.0.3",
"Firefox/114.0.2",
"Firefox/113.0.1",
"Firefox/112.0",
"Firefox/111.0.1",
"Firefox/110.0",
]
def get_browser_stack(self, num_browsers: int = 1) -> List[str]:
"""
Get a valid combination of browser versions.
How it works:
1. Check if the number of browsers is supported.
2. Randomly choose a combination of browsers.
3. Iterate through the combination and add browser versions.
4. Return the browser stack.
Args:
num_browsers: Number of browser specifications (1-3)
Returns:
List[str]: A list of browser versions.
"""
if num_browsers not in self.browser_combinations:
raise ValueError(f"Unsupported number of browsers: {num_browsers}")
combination = random.choice(self.browser_combinations[num_browsers])
browser_stack = []
for browser in combination:
if browser == "chrome":
browser_stack.append(random.choice(self.chrome_versions))
elif browser == "firefox":
browser_stack.append(random.choice(self.firefox_versions))
elif browser == "safari":
browser_stack.append(random.choice(self.safari_versions))
elif browser == "edge":
browser_stack.append(random.choice(self.edge_versions))
elif browser == "gecko":
browser_stack.append(random.choice(self.rendering_engines["gecko"]))
elif browser == "webkit":
browser_stack.append(self.rendering_engines["chrome_webkit"])
return browser_stack
def generate(
self,
device_type: Optional[Literal["desktop", "mobile"]] = None,
os_type: Optional[str] = None,
device_brand: Optional[str] = None,
browser_type: Optional[Literal["chrome", "edge", "safari", "firefox"]] = None,
num_browsers: int = 3,
) -> str:
"""
Generate a random user agent with specified constraints.
Args:
device_type: 'desktop' or 'mobile'
os_type: 'windows', 'macos', 'linux', 'android', 'ios'
device_brand: Specific device brand
browser_type: 'chrome', 'edge', 'safari', or 'firefox'
num_browsers: Number of browser specifications (1-3)
"""
# Get platform string
platform = self.get_random_platform(device_type, os_type, device_brand)
# Start with Mozilla
components = ["Mozilla/5.0", platform]
# Add browser stack
browser_stack = self.get_browser_stack(num_browsers)
# Add appropriate legacy token based on browser stack
if "Firefox" in str(browser_stack):
components.append(random.choice(self.rendering_engines["gecko"]))
elif "Chrome" in str(browser_stack) or "Safari" in str(browser_stack):
components.append(self.rendering_engines["chrome_webkit"])
components.append("(KHTML, like Gecko)")
# Add browser versions
components.extend(browser_stack)
return " ".join(components)
def generate_with_client_hints(self, **kwargs) -> Tuple[str, str]:
"""Generate both user agent and matching client hints"""
user_agent = self.generate(**kwargs)
client_hints = self.generate_client_hints(user_agent)
return user_agent, client_hints
def get_random_platform(self, device_type, os_type, device_brand):
"""Helper method to get random platform based on constraints"""
platforms = (
self.desktop_platforms
if device_type == "desktop"
else self.mobile_platforms
if device_type == "mobile"
else {**self.desktop_platforms, **self.mobile_platforms}
)
if os_type:
for platform_group in [self.desktop_platforms, self.mobile_platforms]:
if os_type in platform_group:
platforms = {os_type: platform_group[os_type]}
break
os_key = random.choice(list(platforms.keys()))
if device_brand and device_brand in platforms[os_key]:
return platforms[os_key][device_brand]
return random.choice(list(platforms[os_key].values()))
def parse_user_agent(self, user_agent: str) -> Dict[str, str]:
"""Parse a user agent string to extract browser and version information"""
browsers = {
"chrome": r"Chrome/(\d+)",
"edge": r"Edg/(\d+)",
"safari": r"Version/(\d+)",
"firefox": r"Firefox/(\d+)",
}
result = {}
for browser, pattern in browsers.items():
match = re.search(pattern, user_agent)
if match:
result[browser] = match.group(1)
return result
def generate_client_hints(self, user_agent: str) -> str:
"""Generate Sec-CH-UA header value based on user agent string"""
browsers = self.parse_user_agent(user_agent)
# Client hints components
hints = []
# Handle different browser combinations
if "chrome" in browsers:
hints.append(f'"Chromium";v="{browsers["chrome"]}"')
hints.append('"Not_A Brand";v="8"')
if "edge" in browsers:
hints.append(f'"Microsoft Edge";v="{browsers["edge"]}"')
else:
hints.append(f'"Google Chrome";v="{browsers["chrome"]}"')
elif "firefox" in browsers:
# Firefox doesn't typically send Sec-CH-UA
return '""'
elif "safari" in browsers:
# Safari's format for client hints
hints.append(f'"Safari";v="{browsers["safari"]}"')
hints.append('"Not_A Brand";v="8"')
return ", ".join(hints)
# Example usage:
if __name__ == "__main__":
generator = UserAgentGenerator()
print(generator.generate())
print("\nSingle browser (Chrome):")
print(generator.generate(num_browsers=1, browser_type="chrome"))
print("\nTwo browsers (Gecko/Firefox):")
print(generator.generate(num_browsers=2))
print("\nThree browsers (Chrome/Safari/Edge):")
print(generator.generate(num_browsers=3))
print("\nFirefox on Linux:")
print(
generator.generate(
device_type="desktop",
os_type="linux",
browser_type="firefox",
num_browsers=2,
)
)
print("\nChrome/Safari/Edge on Windows:")
print(generator.generate(device_type="desktop", os_type="windows", num_browsers=3))

File diff suppressed because it is too large Load Diff

View File

@@ -1,29 +0,0 @@
# version_manager.py
from pathlib import Path
from packaging import version
from . import __version__
class VersionManager:
def __init__(self):
self.home_dir = Path.home() / ".crawl4ai"
self.version_file = self.home_dir / "version.txt"
def get_installed_version(self):
"""Get the version recorded in home directory"""
if not self.version_file.exists():
return None
try:
return version.parse(self.version_file.read_text().strip())
except:
return None
def update_version(self):
"""Update the version file to current library version"""
self.version_file.write_text(__version__.__version__)
def needs_update(self):
"""Check if database needs update based on version"""
installed = self.get_installed_version()
current = version.parse(__version__.__version__)
return installed is None or installed < current

View File

@@ -0,0 +1,357 @@
import os, time
os.environ["TOKENIZERS_PARALLELISM"] = "false"
from pathlib import Path
from .models import UrlModel, CrawlResult
from .database import init_db, get_cached_url, cache_url, DB_PATH, flush_db
from .utils import *
from .chunking_strategy import *
from .extraction_strategy import *
from .crawler_strategy import *
from typing import List
from concurrent.futures import ThreadPoolExecutor
from .config import *
class WebCrawler:
def __init__(
self,
# db_path: str = None,
crawler_strategy: CrawlerStrategy = None,
always_by_pass_cache: bool = False,
verbose: bool = False,
):
# self.db_path = db_path
self.crawler_strategy = crawler_strategy or LocalSeleniumCrawlerStrategy(verbose=verbose)
self.always_by_pass_cache = always_by_pass_cache
# Create the .crawl4ai folder in the user's home directory if it doesn't exist
self.crawl4ai_folder = os.path.join(Path.home(), ".crawl4ai")
os.makedirs(self.crawl4ai_folder, exist_ok=True)
os.makedirs(f"{self.crawl4ai_folder}/cache", exist_ok=True)
# If db_path is not provided, use the default path
# if not db_path:
# self.db_path = f"{self.crawl4ai_folder}/crawl4ai.db"
# flush_db()
init_db()
self.ready = False
def warmup(self):
print("[LOG] 🌤️ Warming up the WebCrawler")
result = self.run(
url='https://crawl4ai.uccode.io/',
word_count_threshold=5,
extraction_strategy= NoExtractionStrategy(),
bypass_cache=False,
verbose = False
)
self.ready = True
print("[LOG] 🌞 WebCrawler is ready to crawl")
def fetch_page(
self,
url_model: UrlModel,
provider: str = DEFAULT_PROVIDER,
api_token: str = None,
extract_blocks_flag: bool = True,
word_count_threshold=MIN_WORD_THRESHOLD,
css_selector: str = None,
screenshot: bool = False,
use_cached_html: bool = False,
extraction_strategy: ExtractionStrategy = None,
chunking_strategy: ChunkingStrategy = RegexChunking(),
**kwargs,
) -> CrawlResult:
return self.run(
url_model.url,
word_count_threshold,
extraction_strategy or NoExtractionStrategy(),
chunking_strategy,
bypass_cache=url_model.forced,
css_selector=css_selector,
screenshot=screenshot,
**kwargs,
)
pass
def run_old(
self,
url: str,
word_count_threshold=MIN_WORD_THRESHOLD,
extraction_strategy: ExtractionStrategy = None,
chunking_strategy: ChunkingStrategy = RegexChunking(),
bypass_cache: bool = False,
css_selector: str = None,
screenshot: bool = False,
user_agent: str = None,
verbose=True,
**kwargs,
) -> CrawlResult:
if user_agent:
self.crawler_strategy.update_user_agent(user_agent)
extraction_strategy = extraction_strategy or NoExtractionStrategy()
extraction_strategy.verbose = verbose
# Check if extraction strategy is an instance of ExtractionStrategy if not raise an error
if not isinstance(extraction_strategy, ExtractionStrategy):
raise ValueError("Unsupported extraction strategy")
if not isinstance(chunking_strategy, ChunkingStrategy):
raise ValueError("Unsupported chunking strategy")
# make sure word_count_threshold is not lesser than MIN_WORD_THRESHOLD
if word_count_threshold < MIN_WORD_THRESHOLD:
word_count_threshold = MIN_WORD_THRESHOLD
# Check cache first
if not bypass_cache and not self.always_by_pass_cache:
cached = get_cached_url(url)
if cached:
return CrawlResult(
**{
"url": cached[0],
"html": cached[1],
"cleaned_html": cached[2],
"markdown": cached[3],
"extracted_content": cached[4],
"success": cached[5],
"media": json.loads(cached[6] or "{}"),
"links": json.loads(cached[7] or "{}"),
"metadata": json.loads(cached[8] or "{}"), # "metadata": "{}
"screenshot": cached[9],
"error_message": "",
}
)
# Initialize WebDriver for crawling
t = time.time()
if kwargs.get("js", None):
self.crawler_strategy.js_code = kwargs.get("js")
html = self.crawler_strategy.crawl(url)
base64_image = None
if screenshot:
base64_image = self.crawler_strategy.take_screenshot()
success = True
error_message = ""
# Extract content from HTML
try:
result = get_content_of_website(url, html, word_count_threshold, css_selector=css_selector)
metadata = extract_metadata(html)
if result is None:
raise ValueError(f"Failed to extract content from the website: {url}")
except InvalidCSSSelectorError as e:
raise ValueError(str(e))
cleaned_html = result.get("cleaned_html", "")
markdown = result.get("markdown", "")
media = result.get("media", [])
links = result.get("links", [])
# Print a profession LOG style message, show time taken and say crawling is done
if verbose:
print(
f"[LOG] 🚀 Crawling done for {url}, success: {success}, time taken: {time.time() - t} seconds"
)
extracted_content = []
if verbose:
print(f"[LOG] 🔥 Extracting semantic blocks for {url}, Strategy: {extraction_strategy.name}")
t = time.time()
# Split markdown into sections
sections = chunking_strategy.chunk(markdown)
# sections = merge_chunks_based_on_token_threshold(sections, CHUNK_TOKEN_THRESHOLD)
extracted_content = extraction_strategy.run(
url, sections,
)
extracted_content = json.dumps(extracted_content)
if verbose:
print(
f"[LOG] 🚀 Extraction done for {url}, time taken: {time.time() - t} seconds."
)
# Cache the result
cleaned_html = beautify_html(cleaned_html)
cache_url(
url,
html,
cleaned_html,
markdown,
extracted_content,
success,
json.dumps(media),
json.dumps(links),
json.dumps(metadata),
screenshot=base64_image,
)
return CrawlResult(
url=url,
html=html,
cleaned_html=cleaned_html,
markdown=markdown,
media=media,
links=links,
metadata=metadata,
screenshot=base64_image,
extracted_content=extracted_content,
success=success,
error_message=error_message,
)
def fetch_pages(
self,
url_models: List[UrlModel],
provider: str = DEFAULT_PROVIDER,
api_token: str = None,
extract_blocks_flag: bool = True,
word_count_threshold=MIN_WORD_THRESHOLD,
use_cached_html: bool = False,
css_selector: str = None,
screenshot: bool = False,
extraction_strategy: ExtractionStrategy = None,
chunking_strategy: ChunkingStrategy = RegexChunking(),
**kwargs,
) -> List[CrawlResult]:
extraction_strategy = extraction_strategy or NoExtractionStrategy()
def fetch_page_wrapper(url_model, *args, **kwargs):
return self.fetch_page(url_model, *args, **kwargs)
with ThreadPoolExecutor() as executor:
results = list(
executor.map(
fetch_page_wrapper,
url_models,
[provider] * len(url_models),
[api_token] * len(url_models),
[extract_blocks_flag] * len(url_models),
[word_count_threshold] * len(url_models),
[css_selector] * len(url_models),
[screenshot] * len(url_models),
[use_cached_html] * len(url_models),
[extraction_strategy] * len(url_models),
[chunking_strategy] * len(url_models),
*[kwargs] * len(url_models),
)
)
return results
def run(
self,
url: str,
word_count_threshold=MIN_WORD_THRESHOLD,
extraction_strategy: ExtractionStrategy = None,
chunking_strategy: ChunkingStrategy = RegexChunking(),
bypass_cache: bool = False,
css_selector: str = None,
screenshot: bool = False,
user_agent: str = None,
verbose=True,
**kwargs,
) -> CrawlResult:
extraction_strategy = extraction_strategy or NoExtractionStrategy()
extraction_strategy.verbose = verbose
if not isinstance(extraction_strategy, ExtractionStrategy):
raise ValueError("Unsupported extraction strategy")
if not isinstance(chunking_strategy, ChunkingStrategy):
raise ValueError("Unsupported chunking strategy")
if word_count_threshold < MIN_WORD_THRESHOLD:
word_count_threshold = MIN_WORD_THRESHOLD
# Check cache first
cached = None
extracted_content = None
if not bypass_cache and not self.always_by_pass_cache:
cached = get_cached_url(url)
if cached:
html = cached[1]
extracted_content = cached[2]
if screenshot:
screenshot = cached[9]
else:
if user_agent:
self.crawler_strategy.update_user_agent(user_agent)
html = self.crawler_strategy.crawl(url)
if screenshot:
screenshot = self.crawler_strategy.take_screenshot()
return self.process_html(url, html, extracted_content, word_count_threshold, extraction_strategy, chunking_strategy, css_selector, screenshot, verbose, bool(cached), **kwargs)
def process_html(
self,
url: str,
html: str,
extracted_content: str,
word_count_threshold: int,
extraction_strategy: ExtractionStrategy,
chunking_strategy: ChunkingStrategy,
css_selector: str,
screenshot: bool,
verbose: bool,
is_cached: bool,
**kwargs,
) -> CrawlResult:
t = time.time()
# Extract content from HTML
try:
result = get_content_of_website(url, html, word_count_threshold, css_selector=css_selector)
metadata = extract_metadata(html)
if result is None:
raise ValueError(f"Failed to extract content from the website: {url}")
except InvalidCSSSelectorError as e:
raise ValueError(str(e))
cleaned_html = result.get("cleaned_html", "")
markdown = result.get("markdown", "")
media = result.get("media", [])
links = result.get("links", [])
if verbose:
print(f"[LOG] 🚀 Crawling done for {url}, success: True, time taken: {time.time() - t} seconds")
if extracted_content is None:
if verbose:
print(f"[LOG] 🔥 Extracting semantic blocks for {url}, Strategy: {extraction_strategy.name}")
sections = chunking_strategy.chunk(markdown)
extracted_content = extraction_strategy.run(url, sections)
extracted_content = json.dumps(extracted_content)
if verbose:
print(f"[LOG] 🚀 Extraction done for {url}, time taken: {time.time() - t} seconds.")
screenshot = None if not screenshot else screenshot
if not is_cached:
cache_url(
url,
html,
cleaned_html,
markdown,
extracted_content,
True,
json.dumps(media),
json.dumps(links),
json.dumps(metadata),
screenshot=screenshot,
)
return CrawlResult(
url=url,
html=html,
cleaned_html=cleaned_html,
markdown=markdown,
media=media,
links=links,
metadata=metadata,
screenshot=screenshot,
extracted_content=extracted_content,
success=True,
error_message="",
)

View File

@@ -1,58 +1,43 @@
import os, time
os.environ["TOKENIZERS_PARALLELISM"] = "false"
from pathlib import Path
from .models import UrlModel, CrawlResult
from .database import init_db, get_cached_url, cache_url
from .database import init_db, get_cached_url, cache_url, DB_PATH, flush_db
from .utils import *
from .chunking_strategy import *
from .extraction_strategy import *
from .crawler_strategy import *
from typing import List
from concurrent.futures import ThreadPoolExecutor
from .content_scraping_strategy import WebScrapingStrategy
from .config import *
import warnings
import json
warnings.filterwarnings(
"ignore",
message='Field "model_name" has conflict with protected namespace "model_".',
)
warnings.filterwarnings("ignore", message='Field "model_name" has conflict with protected namespace "model_".')
class WebCrawler:
def __init__(
self,
crawler_strategy: CrawlerStrategy = None,
always_by_pass_cache: bool = False,
verbose: bool = False,
):
self.crawler_strategy = crawler_strategy or LocalSeleniumCrawlerStrategy(
verbose=verbose
)
def __init__(self, crawler_strategy: CrawlerStrategy = None, always_by_pass_cache: bool = False, verbose: bool = False):
self.crawler_strategy = crawler_strategy or LocalSeleniumCrawlerStrategy(verbose=verbose)
self.always_by_pass_cache = always_by_pass_cache
self.crawl4ai_folder = os.path.join(
os.getenv("CRAWL4_AI_BASE_DIRECTORY", Path.home()), ".crawl4ai"
)
self.crawl4ai_folder = os.path.join(Path.home(), ".crawl4ai")
os.makedirs(self.crawl4ai_folder, exist_ok=True)
os.makedirs(f"{self.crawl4ai_folder}/cache", exist_ok=True)
init_db()
self.ready = False
def warmup(self):
print("[LOG] 🌤️ Warming up the WebCrawler")
self.run(
url="https://google.com/",
url='https://google.com/',
word_count_threshold=5,
extraction_strategy=NoExtractionStrategy(),
bypass_cache=False,
verbose=False,
verbose=False
)
self.ready = True
print("[LOG] 🌞 WebCrawler is ready to crawl")
def fetch_page(
self,
url_model: UrlModel,
@@ -94,7 +79,6 @@ class WebCrawler:
**kwargs,
) -> List[CrawlResult]:
extraction_strategy = extraction_strategy or NoExtractionStrategy()
def fetch_page_wrapper(url_model, *args, **kwargs):
return self.fetch_page(url_model, *args, **kwargs)
@@ -119,176 +103,136 @@ class WebCrawler:
return results
def run(
self,
url: str,
word_count_threshold=MIN_WORD_THRESHOLD,
extraction_strategy: ExtractionStrategy = None,
chunking_strategy: ChunkingStrategy = RegexChunking(),
bypass_cache: bool = False,
css_selector: str = None,
screenshot: bool = False,
user_agent: str = None,
verbose=True,
**kwargs,
) -> CrawlResult:
try:
extraction_strategy = extraction_strategy or NoExtractionStrategy()
extraction_strategy.verbose = verbose
if not isinstance(extraction_strategy, ExtractionStrategy):
raise ValueError("Unsupported extraction strategy")
if not isinstance(chunking_strategy, ChunkingStrategy):
raise ValueError("Unsupported chunking strategy")
self,
url: str,
word_count_threshold=MIN_WORD_THRESHOLD,
extraction_strategy: ExtractionStrategy = None,
chunking_strategy: ChunkingStrategy = RegexChunking(),
bypass_cache: bool = False,
css_selector: str = None,
screenshot: bool = False,
user_agent: str = None,
verbose=True,
**kwargs,
) -> CrawlResult:
try:
extraction_strategy = extraction_strategy or NoExtractionStrategy()
extraction_strategy.verbose = verbose
if not isinstance(extraction_strategy, ExtractionStrategy):
raise ValueError("Unsupported extraction strategy")
if not isinstance(chunking_strategy, ChunkingStrategy):
raise ValueError("Unsupported chunking strategy")
word_count_threshold = max(word_count_threshold, MIN_WORD_THRESHOLD)
word_count_threshold = max(word_count_threshold, MIN_WORD_THRESHOLD)
cached = None
screenshot_data = None
extracted_content = None
if not bypass_cache and not self.always_by_pass_cache:
cached = get_cached_url(url)
if kwargs.get("warmup", True) and not self.ready:
return None
if cached:
html = sanitize_input_encode(cached[1])
extracted_content = sanitize_input_encode(cached[4])
if screenshot:
screenshot_data = cached[9]
if not screenshot_data:
cached = None
if not cached or not html:
if user_agent:
self.crawler_strategy.update_user_agent(user_agent)
t1 = time.time()
html = sanitize_input_encode(self.crawler_strategy.crawl(url, **kwargs))
t2 = time.time()
if verbose:
print(f"[LOG] 🚀 Crawling done for {url}, success: {bool(html)}, time taken: {t2 - t1:.2f} seconds")
if screenshot:
screenshot_data = self.crawler_strategy.take_screenshot()
cached = None
screenshot_data = None
extracted_content = None
if not bypass_cache and not self.always_by_pass_cache:
cached = get_cached_url(url)
if kwargs.get("warmup", True) and not self.ready:
return None
if cached:
html = sanitize_input_encode(cached[1])
extracted_content = sanitize_input_encode(cached[4])
if screenshot:
screenshot_data = cached[9]
if not screenshot_data:
cached = None
if not cached or not html:
if user_agent:
self.crawler_strategy.update_user_agent(user_agent)
t1 = time.time()
html = sanitize_input_encode(self.crawler_strategy.crawl(url, **kwargs))
t2 = time.time()
if verbose:
print(
f"[LOG] 🚀 Crawling done for {url}, success: {bool(html)}, time taken: {t2 - t1:.2f} seconds"
)
if screenshot:
screenshot_data = self.crawler_strategy.take_screenshot()
crawl_result = self.process_html(
url,
html,
extracted_content,
word_count_threshold,
extraction_strategy,
chunking_strategy,
css_selector,
screenshot_data,
verbose,
bool(cached),
**kwargs,
)
crawl_result.success = bool(html)
return crawl_result
except Exception as e:
if not hasattr(e, "msg"):
e.msg = str(e)
print(f"[ERROR] 🚫 Failed to crawl {url}, error: {e.msg}")
return CrawlResult(url=url, html="", success=False, error_message=e.msg)
crawl_result = self.process_html(url, html, extracted_content, word_count_threshold, extraction_strategy, chunking_strategy, css_selector, screenshot_data, verbose, bool(cached), **kwargs)
crawl_result.success = bool(html)
return crawl_result
except Exception as e:
if not hasattr(e, "msg"):
e.msg = str(e)
print(f"[ERROR] 🚫 Failed to crawl {url}, error: {e.msg}")
return CrawlResult(url=url, html="", success=False, error_message=e.msg)
def process_html(
self,
url: str,
html: str,
extracted_content: str,
word_count_threshold: int,
extraction_strategy: ExtractionStrategy,
chunking_strategy: ChunkingStrategy,
css_selector: str,
screenshot: bool,
verbose: bool,
is_cached: bool,
**kwargs,
) -> CrawlResult:
t = time.time()
# Extract content from HTML
try:
t1 = time.time()
scrapping_strategy = WebScrapingStrategy()
extra_params = {
k: v
for k, v in kwargs.items()
if k not in ["only_text", "image_description_min_word_threshold"]
}
result = scrapping_strategy.scrap(
url,
html,
word_count_threshold=word_count_threshold,
css_selector=css_selector,
only_text=kwargs.get("only_text", False),
image_description_min_word_threshold=kwargs.get(
"image_description_min_word_threshold",
IMAGE_DESCRIPTION_MIN_WORD_THRESHOLD,
),
**extra_params,
)
self,
url: str,
html: str,
extracted_content: str,
word_count_threshold: int,
extraction_strategy: ExtractionStrategy,
chunking_strategy: ChunkingStrategy,
css_selector: str,
screenshot: bool,
verbose: bool,
is_cached: bool,
**kwargs,
) -> CrawlResult:
t = time.time()
# Extract content from HTML
try:
t1 = time.time()
result = get_content_of_website_optimized(url, html, word_count_threshold, css_selector=css_selector, only_text=kwargs.get("only_text", False))
if verbose:
print(f"[LOG] 🚀 Content extracted for {url}, success: True, time taken: {time.time() - t1:.2f} seconds")
if result is None:
raise ValueError(f"Failed to extract content from the website: {url}")
except InvalidCSSSelectorError as e:
raise ValueError(str(e))
cleaned_html = sanitize_input_encode(result.get("cleaned_html", ""))
markdown = sanitize_input_encode(result.get("markdown", ""))
media = result.get("media", [])
links = result.get("links", [])
metadata = result.get("metadata", {})
if extracted_content is None:
if verbose:
print(f"[LOG] 🔥 Extracting semantic blocks for {url}, Strategy: {extraction_strategy.name}")
# result = get_content_of_website_optimized(url, html, word_count_threshold, css_selector=css_selector, only_text=kwargs.get("only_text", False))
if verbose:
print(
f"[LOG] 🚀 Content extracted for {url}, success: True, time taken: {time.time() - t1:.2f} seconds"
)
sections = chunking_strategy.chunk(markdown)
extracted_content = extraction_strategy.run(url, sections)
extracted_content = json.dumps(extracted_content, indent=4, default=str, ensure_ascii=False)
if result is None:
raise ValueError(f"Failed to extract content from the website: {url}")
except InvalidCSSSelectorError as e:
raise ValueError(str(e))
cleaned_html = sanitize_input_encode(result.get("cleaned_html", ""))
markdown = sanitize_input_encode(result.get("markdown", ""))
media = result.get("media", [])
links = result.get("links", [])
metadata = result.get("metadata", {})
if extracted_content is None:
if verbose:
print(
f"[LOG] 🔥 Extracting semantic blocks for {url}, Strategy: {extraction_strategy.name}"
)
sections = chunking_strategy.chunk(markdown)
extracted_content = extraction_strategy.run(url, sections)
extracted_content = json.dumps(
extracted_content, indent=4, default=str, ensure_ascii=False
)
if verbose:
print(
f"[LOG] 🚀 Extraction done for {url}, time taken: {time.time() - t:.2f} seconds."
)
screenshot = None if not screenshot else screenshot
if not is_cached:
cache_url(
url,
html,
cleaned_html,
markdown,
extracted_content,
True,
json.dumps(media),
json.dumps(links),
json.dumps(metadata),
if verbose:
print(f"[LOG] 🚀 Extraction done for {url}, time taken: {time.time() - t:.2f} seconds.")
screenshot = None if not screenshot else screenshot
if not is_cached:
cache_url(
url,
html,
cleaned_html,
markdown,
extracted_content,
True,
json.dumps(media),
json.dumps(links),
json.dumps(metadata),
screenshot=screenshot,
)
return CrawlResult(
url=url,
html=html,
cleaned_html=format_html(cleaned_html),
markdown=markdown,
media=media,
links=links,
metadata=metadata,
screenshot=screenshot,
)
return CrawlResult(
url=url,
html=html,
cleaned_html=format_html(cleaned_html),
markdown=markdown,
media=media,
links=links,
metadata=metadata,
screenshot=screenshot,
extracted_content=extracted_content,
success=True,
error_message="",
)
extracted_content=extracted_content,
success=True,
error_message="",
)

View File

@@ -1,67 +0,0 @@
services:
# Local build services for different platforms
crawl4ai-amd64:
build:
context: .
dockerfile: Dockerfile
args:
PYTHON_VERSION: "3.10"
INSTALL_TYPE: ${INSTALL_TYPE:-basic}
ENABLE_GPU: false
platforms:
- linux/amd64
profiles: ["local-amd64"]
extends: &base-config
file: docker-compose.yml
service: base-config
crawl4ai-arm64:
build:
context: .
dockerfile: Dockerfile
args:
PYTHON_VERSION: "3.10"
INSTALL_TYPE: ${INSTALL_TYPE:-basic}
ENABLE_GPU: false
platforms:
- linux/arm64
profiles: ["local-arm64"]
extends: *base-config
# Hub services for different platforms and versions
crawl4ai-hub-amd64:
image: unclecode/crawl4ai:${VERSION:-basic}-amd64
profiles: ["hub-amd64"]
extends: *base-config
crawl4ai-hub-arm64:
image: unclecode/crawl4ai:${VERSION:-basic}-arm64
profiles: ["hub-arm64"]
extends: *base-config
# Base configuration to be extended
base-config:
ports:
- "11235:11235"
- "8000:8000"
- "9222:9222"
- "8080:8080"
environment:
- CRAWL4AI_API_TOKEN=${CRAWL4AI_API_TOKEN:-}
- OPENAI_API_KEY=${OPENAI_API_KEY:-}
- CLAUDE_API_KEY=${CLAUDE_API_KEY:-}
volumes:
- /dev/shm:/dev/shm
deploy:
resources:
limits:
memory: 4G
reservations:
memory: 1G
restart: unless-stopped
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:11235/health"]
interval: 30s
timeout: 10s
retries: 3
start_period: 40s

BIN
docs/.DS_Store vendored Normal file

Binary file not shown.

Binary file not shown.

Before

Width:  |  Height:  |  Size: 33 KiB

View File

@@ -1,64 +0,0 @@
<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 800 500">
<!-- Background -->
<rect width="800" height="500" fill="#1a1a1a"/>
<!-- Opportunities Section -->
<g transform="translate(50,50)">
<!-- Opportunity 1 Box -->
<rect x="0" y="0" width="300" height="150" rx="10" fill="#1a2d3d" stroke="#64b5f6" stroke-width="2"/>
<text x="150" y="30" text-anchor="middle" font-family="Arial" font-weight="bold" font-size="16" fill="#64b5f6">Data Capitalization Opportunity</text>
<text x="150" y="60" text-anchor="middle" font-family="Arial" font-size="12" fill="#e0e0e0">
<tspan x="150" dy="0">Transform digital footprints into assets</tspan>
<tspan x="150" dy="20">Personal data as capital</tspan>
<tspan x="150" dy="20">Enterprise knowledge valuation</tspan>
<tspan x="150" dy="20">New form of wealth creation</tspan>
</text>
<!-- Opportunity 2 Box -->
<rect x="0" y="200" width="300" height="150" rx="10" fill="#1a2d1a" stroke="#81c784" stroke-width="2"/>
<text x="150" y="230" text-anchor="middle" font-family="Arial" font-weight="bold" font-size="16" fill="#81c784">Authentic Data Potential</text>
<text x="150" y="260" text-anchor="middle" font-family="Arial" font-size="12" fill="#e0e0e0">
<tspan x="150" dy="0">Vast reservoir of real insights</tspan>
<tspan x="150" dy="20">Enhanced AI development</tspan>
<tspan x="150" dy="20">Diverse human knowledge</tspan>
<tspan x="150" dy="20">Willing participation model</tspan>
</text>
</g>
<!-- Development Pathway -->
<g transform="translate(450,50)">
<!-- Step 1 Box -->
<rect x="0" y="0" width="300" height="100" rx="10" fill="#2d1a2d" stroke="#ce93d8" stroke-width="2"/>
<text x="150" y="35" text-anchor="middle" font-family="Arial" font-weight="bold" font-size="16" fill="#ce93d8">1. Open-Source Foundation</text>
<text x="150" y="65" text-anchor="middle" font-family="Arial" font-size="12" fill="#e0e0e0">Data extraction engine &amp; community development</text>
<!-- Step 2 Box -->
<rect x="0" y="125" width="300" height="100" rx="10" fill="#2d1a2d" stroke="#ce93d8" stroke-width="2"/>
<text x="150" y="160" text-anchor="middle" font-family="Arial" font-weight="bold" font-size="16" fill="#ce93d8">2. Data Capitalization Platform</text>
<text x="150" y="190" text-anchor="middle" font-family="Arial" font-size="12" fill="#e0e0e0">Tools to structure &amp; value digital assets</text>
<!-- Step 3 Box -->
<rect x="0" y="250" width="300" height="100" rx="10" fill="#2d1a2d" stroke="#ce93d8" stroke-width="2"/>
<text x="150" y="285" text-anchor="middle" font-family="Arial" font-weight="bold" font-size="16" fill="#ce93d8">3. Shared Data Marketplace</text>
<text x="150" y="315" text-anchor="middle" font-family="Arial" font-size="12" fill="#e0e0e0">Economic platform for data exchange</text>
</g>
<!-- Connecting Arrows -->
<g transform="translate(400,125)">
<path d="M-20,0 L40,0" stroke="#666" stroke-width="2" marker-end="url(#arrowhead)"/>
<path d="M-20,200 L40,200" stroke="#666" stroke-width="2" marker-end="url(#arrowhead)"/>
</g>
<!-- Arrow Marker -->
<defs>
<marker id="arrowhead" markerWidth="10" markerHeight="7" refX="9" refY="3.5" orient="auto">
<polygon points="0 0, 10 3.5, 0 7" fill="#666"/>
</marker>
</defs>
<!-- Vision Box at Bottom -->
<g transform="translate(200,420)">
<rect x="0" y="0" width="400" height="60" rx="10" fill="#2d2613" stroke="#ffd54f" stroke-width="2"/>
<text x="200" y="35" text-anchor="middle" font-family="Arial" font-weight="bold" font-size="16" fill="#ffd54f">Economic Vision: Shared Data Economy</text>
</g>
</svg>

Before

Width:  |  Height:  |  Size: 3.8 KiB

View File

@@ -0,0 +1,12 @@
{
"RegexChunking": "### RegexChunking\n\n`RegexChunking` is a text chunking strategy that splits a given text into smaller parts using regular expressions.\nThis is useful for preparing large texts for processing by language models, ensuring they are divided into manageable segments.\n\n#### Constructor Parameters:\n- `patterns` (list, optional): A list of regular expression patterns used to split the text. Default is to split by double newlines (`['\\n\\n']`).\n\n#### Example usage:\n```python\nchunker = RegexChunking(patterns=[r'\\n\\n', r'\\. '])\nchunks = chunker.chunk(\"This is a sample text. It will be split into chunks.\")\n```",
"NlpSentenceChunking": "### NlpSentenceChunking\n\n`NlpSentenceChunking` uses a natural language processing model to chunk a given text into sentences. This approach leverages SpaCy to accurately split text based on sentence boundaries.\n\n#### Constructor Parameters:\n- None.\n\n#### Example usage:\n```python\nchunker = NlpSentenceChunking()\nchunks = chunker.chunk(\"This is a sample text. It will be split into sentences.\")\n```",
"TopicSegmentationChunking": "### TopicSegmentationChunking\n\n`TopicSegmentationChunking` uses the TextTiling algorithm to segment a given text into topic-based chunks. This method identifies thematic boundaries in the text.\n\n#### Constructor Parameters:\n- `num_keywords` (int, optional): The number of keywords to extract for each topic segment. Default is `3`.\n\n#### Example usage:\n```python\nchunker = TopicSegmentationChunking(num_keywords=3)\nchunks = chunker.chunk(\"This is a sample text. It will be split into topic-based segments.\")\n```",
"FixedLengthWordChunking": "### FixedLengthWordChunking\n\n`FixedLengthWordChunking` splits a given text into chunks of fixed length, based on the number of words.\n\n#### Constructor Parameters:\n- `chunk_size` (int, optional): The number of words in each chunk. Default is `100`.\n\n#### Example usage:\n```python\nchunker = FixedLengthWordChunking(chunk_size=100)\nchunks = chunker.chunk(\"This is a sample text. It will be split into fixed-length word chunks.\")\n```",
"SlidingWindowChunking": "### SlidingWindowChunking\n\n`SlidingWindowChunking` uses a sliding window approach to chunk a given text. Each chunk has a fixed length, and the window slides by a specified step size.\n\n#### Constructor Parameters:\n- `window_size` (int, optional): The number of words in each chunk. Default is `100`.\n- `step` (int, optional): The number of words to slide the window. Default is `50`.\n\n#### Example usage:\n```python\nchunker = SlidingWindowChunking(window_size=100, step=50)\nchunks = chunker.chunk(\"This is a sample text. It will be split using a sliding window approach.\")\n```"
}

View File

@@ -1,189 +0,0 @@
# 🐳 Using Docker (Legacy)
Crawl4AI is available as Docker images for easy deployment. You can either pull directly from Docker Hub (recommended) or build from the repository.
---
<details>
<summary>🐳 <strong>Option 1: Docker Hub (Recommended)</strong></summary>
Choose the appropriate image based on your platform and needs:
### For AMD64 (Regular Linux/Windows):
```bash
# Basic version (recommended)
docker pull unclecode/crawl4ai:basic-amd64
docker run -p 11235:11235 unclecode/crawl4ai:basic-amd64
# Full ML/LLM support
docker pull unclecode/crawl4ai:all-amd64
docker run -p 11235:11235 unclecode/crawl4ai:all-amd64
# With GPU support
docker pull unclecode/crawl4ai:gpu-amd64
docker run -p 11235:11235 unclecode/crawl4ai:gpu-amd64
```
### For ARM64 (M1/M2 Macs, ARM servers):
```bash
# Basic version (recommended)
docker pull unclecode/crawl4ai:basic-arm64
docker run -p 11235:11235 unclecode/crawl4ai:basic-arm64
# Full ML/LLM support
docker pull unclecode/crawl4ai:all-arm64
docker run -p 11235:11235 unclecode/crawl4ai:all-arm64
# With GPU support
docker pull unclecode/crawl4ai:gpu-arm64
docker run -p 11235:11235 unclecode/crawl4ai:gpu-arm64
```
Need more memory? Add `--shm-size`:
```bash
docker run --shm-size=2gb -p 11235:11235 unclecode/crawl4ai:basic-amd64
```
Test the installation:
```bash
curl http://localhost:11235/health
```
### For Raspberry Pi (32-bit) (coming soon):
```bash
# Pull and run basic version (recommended for Raspberry Pi)
docker pull unclecode/crawl4ai:basic-armv7
docker run -p 11235:11235 unclecode/crawl4ai:basic-armv7
# With increased shared memory if needed
docker run --shm-size=2gb -p 11235:11235 unclecode/crawl4ai:basic-armv7
```
Note: Due to hardware constraints, only the basic version is recommended for Raspberry Pi.
</details>
<details>
<summary>🐳 <strong>Option 2: Build from Repository</strong></summary>
Build the image locally based on your platform:
```bash
# Clone the repository
git clone https://github.com/unclecode/crawl4ai.git
cd crawl4ai
# For AMD64 (Regular Linux/Windows)
docker build --platform linux/amd64 \
--tag crawl4ai:local \
--build-arg INSTALL_TYPE=basic \
.
# For ARM64 (M1/M2 Macs, ARM servers)
docker build --platform linux/arm64 \
--tag crawl4ai:local \
--build-arg INSTALL_TYPE=basic \
.
```
Build options:
- INSTALL_TYPE=basic (default): Basic crawling features
- INSTALL_TYPE=all: Full ML/LLM support
- ENABLE_GPU=true: Add GPU support
Example with all options:
```bash
docker build --platform linux/amd64 \
--tag crawl4ai:local \
--build-arg INSTALL_TYPE=all \
--build-arg ENABLE_GPU=true \
.
```
Run your local build:
```bash
# Regular run
docker run -p 11235:11235 crawl4ai:local
# With increased shared memory
docker run --shm-size=2gb -p 11235:11235 crawl4ai:local
```
Test the installation:
```bash
curl http://localhost:11235/health
```
</details>
<details>
<summary>🐳 <strong>Option 3: Using Docker Compose</strong></summary>
Docker Compose provides a more structured way to run Crawl4AI, especially when dealing with environment variables and multiple configurations.
```bash
# Clone the repository
git clone https://github.com/unclecode/crawl4ai.git
cd crawl4ai
```
### For AMD64 (Regular Linux/Windows):
```bash
# Build and run locally
docker-compose --profile local-amd64 up
# Run from Docker Hub
VERSION=basic docker-compose --profile hub-amd64 up # Basic version
VERSION=all docker-compose --profile hub-amd64 up # Full ML/LLM support
VERSION=gpu docker-compose --profile hub-amd64 up # GPU support
```
### For ARM64 (M1/M2 Macs, ARM servers):
```bash
# Build and run locally
docker-compose --profile local-arm64 up
# Run from Docker Hub
VERSION=basic docker-compose --profile hub-arm64 up # Basic version
VERSION=all docker-compose --profile hub-arm64 up # Full ML/LLM support
VERSION=gpu docker-compose --profile hub-arm64 up # GPU support
```
Environment variables (optional):
```bash
# Create a .env file
CRAWL4AI_API_TOKEN=your_token
OPENAI_API_KEY=your_openai_key
CLAUDE_API_KEY=your_claude_key
```
The compose file includes:
- Memory management (4GB limit, 1GB reserved)
- Shared memory volume for browser support
- Health checks
- Auto-restart policy
- All necessary port mappings
Test the installation:
```bash
curl http://localhost:11235/health
```
</details>
<details>
<summary>🚀 <strong>One-Click Deployment</strong></summary>
Deploy your own instance of Crawl4AI with one click:
[![DigitalOcean Referral Badge](https://web-platforms.sfo2.cdn.digitaloceanspaces.com/WWW/Badge%203.svg)](https://www.digitalocean.com/?repo=https://github.com/unclecode/crawl4ai/tree/0.3.74&refcode=a0780f1bdb3d&utm_campaign=Referral_Invite&utm_medium=Referral_Program&utm_source=badge)
> 💡 **Recommended specs**: 4GB RAM minimum. Select "professional-xs" or higher when deploying for stable operation.
The deploy will:
- Set up a Docker container with Crawl4AI
- Configure Playwright and all dependencies
- Start the FastAPI server on port `11235`
- Set up health checks and auto-deployment
</details>

View File

@@ -1,110 +0,0 @@
"""
This example demonstrates how to use JSON CSS extraction to scrape product information
from Amazon search results. It shows how to extract structured data like product titles,
prices, ratings, and other details using CSS selectors.
"""
from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig
import json
async def extract_amazon_products():
# Initialize browser config
browser_config = BrowserConfig(browser_type="chromium", headless=True)
# Initialize crawler config with JSON CSS extraction strategy
crawler_config = CrawlerRunConfig(
extraction_strategy=JsonCssExtractionStrategy(
schema={
"name": "Amazon Product Search Results",
"baseSelector": "[data-component-type='s-search-result']",
"fields": [
{
"name": "asin",
"selector": "",
"type": "attribute",
"attribute": "data-asin",
},
{"name": "title", "selector": "h2 a span", "type": "text"},
{
"name": "url",
"selector": "h2 a",
"type": "attribute",
"attribute": "href",
},
{
"name": "image",
"selector": ".s-image",
"type": "attribute",
"attribute": "src",
},
{
"name": "rating",
"selector": ".a-icon-star-small .a-icon-alt",
"type": "text",
},
{
"name": "reviews_count",
"selector": "[data-csa-c-func-deps='aui-da-a-popover'] ~ span span",
"type": "text",
},
{
"name": "price",
"selector": ".a-price .a-offscreen",
"type": "text",
},
{
"name": "original_price",
"selector": ".a-price.a-text-price .a-offscreen",
"type": "text",
},
{
"name": "sponsored",
"selector": ".puis-sponsored-label-text",
"type": "exists",
},
{
"name": "delivery_info",
"selector": "[data-cy='delivery-recipe'] .a-color-base",
"type": "text",
"multiple": True,
},
],
}
)
)
# Example search URL (you should replace with your actual Amazon URL)
url = "https://www.amazon.com/s?k=Samsung+Galaxy+Tab"
# Use context manager for proper resource handling
async with AsyncWebCrawler(config=browser_config) as crawler:
# Extract the data
result = await crawler.arun(url=url, config=crawler_config)
# Process and print the results
if result and result.extracted_content:
# Parse the JSON string into a list of products
products = json.loads(result.extracted_content)
# Process each product in the list
for product in products:
print("\nProduct Details:")
print(f"ASIN: {product.get('asin')}")
print(f"Title: {product.get('title')}")
print(f"Price: {product.get('price')}")
print(f"Original Price: {product.get('original_price')}")
print(f"Rating: {product.get('rating')}")
print(f"Reviews: {product.get('reviews_count')}")
print(f"Sponsored: {'Yes' if product.get('sponsored') else 'No'}")
if product.get("delivery_info"):
print(f"Delivery: {' '.join(product['delivery_info'])}")
print("-" * 80)
if __name__ == "__main__":
import asyncio
asyncio.run(extract_amazon_products())

View File

@@ -1,150 +0,0 @@
"""
This example demonstrates how to use JSON CSS extraction to scrape product information
from Amazon search results. It shows how to extract structured data like product titles,
prices, ratings, and other details using CSS selectors.
"""
from crawl4ai import AsyncWebCrawler, CacheMode
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig
import json
from playwright.async_api import Page, BrowserContext
async def extract_amazon_products():
# Initialize browser config
browser_config = BrowserConfig(
# browser_type="chromium",
headless=True
)
# Initialize crawler config with JSON CSS extraction strategy nav-search-submit-button
crawler_config = CrawlerRunConfig(
cache_mode=CacheMode.BYPASS,
extraction_strategy=JsonCssExtractionStrategy(
schema={
"name": "Amazon Product Search Results",
"baseSelector": "[data-component-type='s-search-result']",
"fields": [
{
"name": "asin",
"selector": "",
"type": "attribute",
"attribute": "data-asin",
},
{"name": "title", "selector": "h2 a span", "type": "text"},
{
"name": "url",
"selector": "h2 a",
"type": "attribute",
"attribute": "href",
},
{
"name": "image",
"selector": ".s-image",
"type": "attribute",
"attribute": "src",
},
{
"name": "rating",
"selector": ".a-icon-star-small .a-icon-alt",
"type": "text",
},
{
"name": "reviews_count",
"selector": "[data-csa-c-func-deps='aui-da-a-popover'] ~ span span",
"type": "text",
},
{
"name": "price",
"selector": ".a-price .a-offscreen",
"type": "text",
},
{
"name": "original_price",
"selector": ".a-price.a-text-price .a-offscreen",
"type": "text",
},
{
"name": "sponsored",
"selector": ".puis-sponsored-label-text",
"type": "exists",
},
{
"name": "delivery_info",
"selector": "[data-cy='delivery-recipe'] .a-color-base",
"type": "text",
"multiple": True,
},
],
}
),
)
url = "https://www.amazon.com/"
async def after_goto(
page: Page, context: BrowserContext, url: str, response: dict, **kwargs
):
"""Hook called after navigating to each URL"""
print(f"[HOOK] after_goto - Successfully loaded: {url}")
try:
# Wait for search box to be available
search_box = await page.wait_for_selector(
"#twotabsearchtextbox", timeout=1000
)
# Type the search query
await search_box.fill("Samsung Galaxy Tab")
# Get the search button and prepare for navigation
search_button = await page.wait_for_selector(
"#nav-search-submit-button", timeout=1000
)
# Click with navigation waiting
await search_button.click()
# Wait for search results to load
await page.wait_for_selector(
'[data-component-type="s-search-result"]', timeout=10000
)
print("[HOOK] Search completed and results loaded!")
except Exception as e:
print(f"[HOOK] Error during search operation: {str(e)}")
return page
# Use context manager for proper resource handling
async with AsyncWebCrawler(config=browser_config) as crawler:
crawler.crawler_strategy.set_hook("after_goto", after_goto)
# Extract the data
result = await crawler.arun(url=url, config=crawler_config)
# Process and print the results
if result and result.extracted_content:
# Parse the JSON string into a list of products
products = json.loads(result.extracted_content)
# Process each product in the list
for product in products:
print("\nProduct Details:")
print(f"ASIN: {product.get('asin')}")
print(f"Title: {product.get('title')}")
print(f"Price: {product.get('price')}")
print(f"Original Price: {product.get('original_price')}")
print(f"Rating: {product.get('rating')}")
print(f"Reviews: {product.get('reviews_count')}")
print(f"Sponsored: {'Yes' if product.get('sponsored') else 'No'}")
if product.get("delivery_info"):
print(f"Delivery: {' '.join(product['delivery_info'])}")
print("-" * 80)
if __name__ == "__main__":
import asyncio
asyncio.run(extract_amazon_products())

View File

@@ -1,126 +0,0 @@
"""
This example demonstrates how to use JSON CSS extraction to scrape product information
from Amazon search results. It shows how to extract structured data like product titles,
prices, ratings, and other details using CSS selectors.
"""
from crawl4ai import AsyncWebCrawler, CacheMode
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig
import json
async def extract_amazon_products():
# Initialize browser config
browser_config = BrowserConfig(
# browser_type="chromium",
headless=True
)
js_code_to_search = """
const task = async () => {
document.querySelector('#twotabsearchtextbox').value = 'Samsung Galaxy Tab';
document.querySelector('#nav-search-submit-button').click();
}
await task();
"""
js_code_to_search_sync = """
document.querySelector('#twotabsearchtextbox').value = 'Samsung Galaxy Tab';
document.querySelector('#nav-search-submit-button').click();
"""
crawler_config = CrawlerRunConfig(
cache_mode=CacheMode.BYPASS,
js_code=js_code_to_search,
wait_for='css:[data-component-type="s-search-result"]',
extraction_strategy=JsonCssExtractionStrategy(
schema={
"name": "Amazon Product Search Results",
"baseSelector": "[data-component-type='s-search-result']",
"fields": [
{
"name": "asin",
"selector": "",
"type": "attribute",
"attribute": "data-asin",
},
{"name": "title", "selector": "h2 a span", "type": "text"},
{
"name": "url",
"selector": "h2 a",
"type": "attribute",
"attribute": "href",
},
{
"name": "image",
"selector": ".s-image",
"type": "attribute",
"attribute": "src",
},
{
"name": "rating",
"selector": ".a-icon-star-small .a-icon-alt",
"type": "text",
},
{
"name": "reviews_count",
"selector": "[data-csa-c-func-deps='aui-da-a-popover'] ~ span span",
"type": "text",
},
{
"name": "price",
"selector": ".a-price .a-offscreen",
"type": "text",
},
{
"name": "original_price",
"selector": ".a-price.a-text-price .a-offscreen",
"type": "text",
},
{
"name": "sponsored",
"selector": ".puis-sponsored-label-text",
"type": "exists",
},
{
"name": "delivery_info",
"selector": "[data-cy='delivery-recipe'] .a-color-base",
"type": "text",
"multiple": True,
},
],
}
),
)
# Example search URL (you should replace with your actual Amazon URL)
url = "https://www.amazon.com/"
# Use context manager for proper resource handling
async with AsyncWebCrawler(config=browser_config) as crawler:
# Extract the data
result = await crawler.arun(url=url, config=crawler_config)
# Process and print the results
if result and result.extracted_content:
# Parse the JSON string into a list of products
products = json.loads(result.extracted_content)
# Process each product in the list
for product in products:
print("\nProduct Details:")
print(f"ASIN: {product.get('asin')}")
print(f"Title: {product.get('title')}")
print(f"Price: {product.get('price')}")
print(f"Original Price: {product.get('original_price')}")
print(f"Rating: {product.get('rating')}")
print(f"Reviews: {product.get('reviews_count')}")
print(f"Sponsored: {'Yes' if product.get('sponsored') else 'No'}")
if product.get("delivery_info"):
print(f"Delivery: {' '.join(product['delivery_info'])}")
print("-" * 80)
if __name__ == "__main__":
import asyncio
asyncio.run(extract_amazon_products())

View File

@@ -1,16 +1,12 @@
# File: async_webcrawler_multiple_urls_example.py
import os, sys
# append 2 parent directories to sys.path to import crawl4ai
parent_dir = os.path.dirname(
os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
)
parent_dir = os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
sys.path.append(parent_dir)
import asyncio
from crawl4ai import AsyncWebCrawler
async def main():
# Initialize the AsyncWebCrawler
async with AsyncWebCrawler(verbose=True) as crawler:
@@ -20,7 +16,7 @@ async def main():
"https://python.org",
"https://github.com",
"https://stackoverflow.com",
"https://news.ycombinator.com",
"https://news.ycombinator.com"
]
# Set up crawling parameters
@@ -31,7 +27,7 @@ async def main():
urls=urls,
word_count_threshold=word_count_threshold,
bypass_cache=True,
verbose=True,
verbose=True
)
# Process the results
@@ -40,9 +36,7 @@ async def main():
print(f"Successfully crawled: {result.url}")
print(f"Title: {result.metadata.get('title', 'N/A')}")
print(f"Word count: {len(result.markdown.split())}")
print(
f"Number of links: {len(result.links.get('internal', [])) + len(result.links.get('external', []))}"
)
print(f"Number of links: {len(result.links.get('internal', [])) + len(result.links.get('external', []))}")
print(f"Number of images: {len(result.media.get('images', []))}")
print("---")
else:
@@ -50,6 +44,5 @@ async def main():
print(f"Error: {result.error_message}")
print("---")
if __name__ == "__main__":
asyncio.run(main())
asyncio.run(main())

View File

@@ -1,126 +0,0 @@
"""
This example demonstrates optimal browser usage patterns in Crawl4AI:
1. Sequential crawling with session reuse
2. Parallel crawling with browser instance reuse
3. Performance optimization settings
"""
import asyncio
from typing import List
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
async def crawl_sequential(urls: List[str]):
"""
Sequential crawling using session reuse - most efficient for moderate workloads
"""
print("\n=== Sequential Crawling with Session Reuse ===")
# Configure browser with optimized settings
browser_config = BrowserConfig(
headless=True,
browser_args=[
"--disable-gpu", # Disable GPU acceleration
"--disable-dev-shm-usage", # Disable /dev/shm usage
"--no-sandbox", # Required for Docker
],
viewport={
"width": 800,
"height": 600,
}, # Smaller viewport for better performance
)
# Configure crawl settings
crawl_config = CrawlerRunConfig(
markdown_generator=DefaultMarkdownGenerator(
# content_filter=PruningContentFilter(), In case you need fit_markdown
),
)
# Create single crawler instance
crawler = AsyncWebCrawler(config=browser_config)
await crawler.start()
try:
session_id = "session1" # Use same session for all URLs
for url in urls:
result = await crawler.arun(
url=url,
config=crawl_config,
session_id=session_id, # Reuse same browser tab
)
if result.success:
print(f"Successfully crawled {url}")
print(f"Content length: {len(result.markdown_v2.raw_markdown)}")
finally:
await crawler.close()
async def crawl_parallel(urls: List[str], max_concurrent: int = 3):
"""
Parallel crawling while reusing browser instance - best for large workloads
"""
print("\n=== Parallel Crawling with Browser Reuse ===")
browser_config = BrowserConfig(
headless=True,
browser_args=["--disable-gpu", "--disable-dev-shm-usage", "--no-sandbox"],
viewport={"width": 800, "height": 600},
)
crawl_config = CrawlerRunConfig(
markdown_generator=DefaultMarkdownGenerator(
# content_filter=PruningContentFilter(), In case you need fit_markdown
),
)
# Create single crawler instance for all parallel tasks
crawler = AsyncWebCrawler(config=browser_config)
await crawler.start()
try:
# Create tasks in batches to control concurrency
for i in range(0, len(urls), max_concurrent):
batch = urls[i : i + max_concurrent]
tasks = []
for j, url in enumerate(batch):
session_id = (
f"parallel_session_{j}" # Different session per concurrent task
)
task = crawler.arun(url=url, config=crawl_config, session_id=session_id)
tasks.append(task)
# Wait for batch to complete
results = await asyncio.gather(*tasks, return_exceptions=True)
# Process results
for url, result in zip(batch, results):
if isinstance(result, Exception):
print(f"Error crawling {url}: {str(result)}")
elif result.success:
print(f"Successfully crawled {url}")
print(f"Content length: {len(result.markdown_v2.raw_markdown)}")
finally:
await crawler.close()
async def main():
# Example URLs
urls = [
"https://example.com/page1",
"https://example.com/page2",
"https://example.com/page3",
"https://example.com/page4",
]
# Demo sequential crawling
await crawl_sequential(urls)
# Demo parallel crawling
await crawl_parallel(urls, max_concurrent=2)
if __name__ == "__main__":
asyncio.run(main())

View File

@@ -1,32 +1,31 @@
import os, time
# append the path to the root of the project
import sys
import asyncio
sys.path.append(os.path.join(os.path.dirname(__file__), "..", ".."))
sys.path.append(os.path.join(os.path.dirname(__file__), '..', '..'))
from firecrawl import FirecrawlApp
from crawl4ai import AsyncWebCrawler
__data__ = os.path.join(os.path.dirname(__file__), "..", "..") + "/.data"
__data__ = os.path.join(os.path.dirname(__file__), '..', '..') + '/.data'
async def compare():
app = FirecrawlApp(api_key=os.environ["FIRECRAWL_API_KEY"])
app = FirecrawlApp(api_key=os.environ['FIRECRAWL_API_KEY'])
# Tet Firecrawl with a simple crawl
start = time.time()
scrape_status = app.scrape_url(
"https://www.nbcnews.com/business", params={"formats": ["markdown", "html"]}
'https://www.nbcnews.com/business',
params={'formats': ['markdown', 'html']}
)
end = time.time()
print(f"Time taken: {end - start} seconds")
print(len(scrape_status["markdown"]))
print(len(scrape_status['markdown']))
# save the markdown content with provider name
with open(f"{__data__}/firecrawl_simple.md", "w") as f:
f.write(scrape_status["markdown"])
f.write(scrape_status['markdown'])
# Count how many "cldnry.s-nbcnews.com" are in the markdown
print(scrape_status["markdown"].count("cldnry.s-nbcnews.com"))
print(scrape_status['markdown'].count("cldnry.s-nbcnews.com"))
async with AsyncWebCrawler() as crawler:
start = time.time()
@@ -34,13 +33,13 @@ async def compare():
url="https://www.nbcnews.com/business",
# js_code=["const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More')); loadMoreButton && loadMoreButton.click();"],
word_count_threshold=0,
bypass_cache=True,
verbose=False,
bypass_cache=True,
verbose=False
)
end = time.time()
print(f"Time taken: {end - start} seconds")
print(len(result.markdown))
# save the markdown content with provider name
# save the markdown content with provider name
with open(f"{__data__}/crawl4ai_simple.md", "w") as f:
f.write(result.markdown)
# count how many "cldnry.s-nbcnews.com" are in the markdown
@@ -49,12 +48,10 @@ async def compare():
start = time.time()
result = await crawler.arun(
url="https://www.nbcnews.com/business",
js_code=[
"const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More')); loadMoreButton && loadMoreButton.click();"
],
js_code=["const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More')); loadMoreButton && loadMoreButton.click();"],
word_count_threshold=0,
bypass_cache=True,
verbose=False,
bypass_cache=True,
verbose=False
)
end = time.time()
print(f"Time taken: {end - start} seconds")
@@ -64,7 +61,7 @@ async def compare():
f.write(result.markdown)
# count how many "cldnry.s-nbcnews.com" are in the markdown
print(result.markdown.count("cldnry.s-nbcnews.com"))
if __name__ == "__main__":
asyncio.run(compare())

View File

@@ -1,136 +0,0 @@
import asyncio
import time
from rich import print
from rich.table import Table
from crawl4ai import (
AsyncWebCrawler,
BrowserConfig,
CrawlerRunConfig,
MemoryAdaptiveDispatcher,
SemaphoreDispatcher,
RateLimiter,
CrawlerMonitor,
DisplayMode,
CacheMode,
LXMLWebScrapingStrategy,
)
async def memory_adaptive(urls, browser_config, run_config):
"""Memory adaptive crawler with monitoring"""
start = time.perf_counter()
async with AsyncWebCrawler(config=browser_config) as crawler:
dispatcher = MemoryAdaptiveDispatcher(
memory_threshold_percent=70.0,
max_session_permit=10,
monitor=CrawlerMonitor(
max_visible_rows=15, display_mode=DisplayMode.DETAILED
),
)
results = await crawler.arun_many(
urls, config=run_config, dispatcher=dispatcher
)
duration = time.perf_counter() - start
return len(results), duration
async def memory_adaptive_with_rate_limit(urls, browser_config, run_config):
"""Memory adaptive crawler with rate limiting"""
start = time.perf_counter()
async with AsyncWebCrawler(config=browser_config) as crawler:
dispatcher = MemoryAdaptiveDispatcher(
memory_threshold_percent=70.0,
max_session_permit=10,
rate_limiter=RateLimiter(
base_delay=(1.0, 2.0), max_delay=30.0, max_retries=2
),
monitor=CrawlerMonitor(
max_visible_rows=15, display_mode=DisplayMode.DETAILED
),
)
results = await crawler.arun_many(
urls, config=run_config, dispatcher=dispatcher
)
duration = time.perf_counter() - start
return len(results), duration
async def semaphore(urls, browser_config, run_config):
"""Basic semaphore crawler"""
start = time.perf_counter()
async with AsyncWebCrawler(config=browser_config) as crawler:
dispatcher = SemaphoreDispatcher(
semaphore_count=5,
monitor=CrawlerMonitor(
max_visible_rows=15, display_mode=DisplayMode.DETAILED
),
)
results = await crawler.arun_many(
urls, config=run_config, dispatcher=dispatcher
)
duration = time.perf_counter() - start
return len(results), duration
async def semaphore_with_rate_limit(urls, browser_config, run_config):
"""Semaphore crawler with rate limiting"""
start = time.perf_counter()
async with AsyncWebCrawler(config=browser_config) as crawler:
dispatcher = SemaphoreDispatcher(
semaphore_count=5,
rate_limiter=RateLimiter(
base_delay=(1.0, 2.0), max_delay=30.0, max_retries=2
),
monitor=CrawlerMonitor(
max_visible_rows=15, display_mode=DisplayMode.DETAILED
),
)
results = await crawler.arun_many(
urls, config=run_config, dispatcher=dispatcher
)
duration = time.perf_counter() - start
return len(results), duration
def create_performance_table(results):
"""Creates a rich table showing performance results"""
table = Table(title="Crawler Strategy Performance Comparison")
table.add_column("Strategy", style="cyan")
table.add_column("URLs Crawled", justify="right", style="green")
table.add_column("Time (seconds)", justify="right", style="yellow")
table.add_column("URLs/second", justify="right", style="magenta")
sorted_results = sorted(results.items(), key=lambda x: x[1][1])
for strategy, (urls_crawled, duration) in sorted_results:
urls_per_second = urls_crawled / duration
table.add_row(
strategy, str(urls_crawled), f"{duration:.2f}", f"{urls_per_second:.2f}"
)
return table
async def main():
urls = [f"https://example.com/page{i}" for i in range(1, 20)]
browser_config = BrowserConfig(headless=True, verbose=False)
run_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS, scraping_strategy=LXMLWebScrapingStrategy())
results = {
"Memory Adaptive": await memory_adaptive(urls, browser_config, run_config),
"Memory Adaptive + Rate Limit": await memory_adaptive_with_rate_limit(
urls, browser_config, run_config
),
"Semaphore": await semaphore(urls, browser_config, run_config),
"Semaphore + Rate Limit": await semaphore_with_rate_limit(
urls, browser_config, run_config
),
}
table = create_performance_table(results)
print("\nPerformance Summary:")
print(table)
if __name__ == "__main__":
asyncio.run(main())

View File

@@ -1,372 +0,0 @@
import requests
import json
import time
import sys
import base64
import os
from typing import Dict, Any
class Crawl4AiTester:
def __init__(self, base_url: str = "http://localhost:11235", api_token: str = None):
self.base_url = base_url
self.api_token = (
api_token or os.getenv("CRAWL4AI_API_TOKEN") or "test_api_code"
) # Check environment variable as fallback
self.headers = (
{"Authorization": f"Bearer {self.api_token}"} if self.api_token else {}
)
def submit_and_wait(
self, request_data: Dict[str, Any], timeout: int = 300
) -> Dict[str, Any]:
# Submit crawl job
response = requests.post(
f"{self.base_url}/crawl", json=request_data, headers=self.headers
)
if response.status_code == 403:
raise Exception("API token is invalid or missing")
task_id = response.json()["task_id"]
print(f"Task ID: {task_id}")
# Poll for result
start_time = time.time()
while True:
if time.time() - start_time > timeout:
raise TimeoutError(
f"Task {task_id} did not complete within {timeout} seconds"
)
result = requests.get(
f"{self.base_url}/task/{task_id}", headers=self.headers
)
status = result.json()
if status["status"] == "failed":
print("Task failed:", status.get("error"))
raise Exception(f"Task failed: {status.get('error')}")
if status["status"] == "completed":
return status
time.sleep(2)
def submit_sync(self, request_data: Dict[str, Any]) -> Dict[str, Any]:
response = requests.post(
f"{self.base_url}/crawl_sync",
json=request_data,
headers=self.headers,
timeout=60,
)
if response.status_code == 408:
raise TimeoutError("Task did not complete within server timeout")
response.raise_for_status()
return response.json()
def crawl_direct(self, request_data: Dict[str, Any]) -> Dict[str, Any]:
"""Directly crawl without using task queue"""
response = requests.post(
f"{self.base_url}/crawl_direct", json=request_data, headers=self.headers
)
response.raise_for_status()
return response.json()
def test_docker_deployment(version="basic"):
tester = Crawl4AiTester(
base_url="http://localhost:11235",
# base_url="https://api.crawl4ai.com" # just for example
# api_token="test" # just for example
)
print(f"Testing Crawl4AI Docker {version} version")
# Health check with timeout and retry
max_retries = 5
for i in range(max_retries):
try:
health = requests.get(f"{tester.base_url}/health", timeout=10)
print("Health check:", health.json())
break
except requests.exceptions.RequestException:
if i == max_retries - 1:
print(f"Failed to connect after {max_retries} attempts")
sys.exit(1)
print(f"Waiting for service to start (attempt {i+1}/{max_retries})...")
time.sleep(5)
# Test cases based on version
test_basic_crawl_direct(tester)
test_basic_crawl(tester)
test_basic_crawl(tester)
test_basic_crawl_sync(tester)
if version in ["full", "transformer"]:
test_cosine_extraction(tester)
test_js_execution(tester)
test_css_selector(tester)
test_structured_extraction(tester)
test_llm_extraction(tester)
test_llm_with_ollama(tester)
test_screenshot(tester)
def test_basic_crawl(tester: Crawl4AiTester):
print("\n=== Testing Basic Crawl ===")
request = {
"urls": "https://www.nbcnews.com/business",
"priority": 10,
"session_id": "test",
}
result = tester.submit_and_wait(request)
print(f"Basic crawl result length: {len(result['result']['markdown'])}")
assert result["result"]["success"]
assert len(result["result"]["markdown"]) > 0
def test_basic_crawl_sync(tester: Crawl4AiTester):
print("\n=== Testing Basic Crawl (Sync) ===")
request = {
"urls": "https://www.nbcnews.com/business",
"priority": 10,
"session_id": "test",
}
result = tester.submit_sync(request)
print(f"Basic crawl result length: {len(result['result']['markdown'])}")
assert result["status"] == "completed"
assert result["result"]["success"]
assert len(result["result"]["markdown"]) > 0
def test_basic_crawl_direct(tester: Crawl4AiTester):
print("\n=== Testing Basic Crawl (Direct) ===")
request = {
"urls": "https://www.nbcnews.com/business",
"priority": 10,
# "session_id": "test"
"cache_mode": "bypass", # or "enabled", "disabled", "read_only", "write_only"
}
result = tester.crawl_direct(request)
print(f"Basic crawl result length: {len(result['result']['markdown'])}")
assert result["result"]["success"]
assert len(result["result"]["markdown"]) > 0
def test_js_execution(tester: Crawl4AiTester):
print("\n=== Testing JS Execution ===")
request = {
"urls": "https://www.nbcnews.com/business",
"priority": 8,
"js_code": [
"const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More')); loadMoreButton && loadMoreButton.click();"
],
"wait_for": "article.tease-card:nth-child(10)",
"crawler_params": {"headless": True},
}
result = tester.submit_and_wait(request)
print(f"JS execution result length: {len(result['result']['markdown'])}")
assert result["result"]["success"]
def test_css_selector(tester: Crawl4AiTester):
print("\n=== Testing CSS Selector ===")
request = {
"urls": "https://www.nbcnews.com/business",
"priority": 7,
"css_selector": ".wide-tease-item__description",
"crawler_params": {"headless": True},
"extra": {"word_count_threshold": 10},
}
result = tester.submit_and_wait(request)
print(f"CSS selector result length: {len(result['result']['markdown'])}")
assert result["result"]["success"]
def test_structured_extraction(tester: Crawl4AiTester):
print("\n=== Testing Structured Extraction ===")
schema = {
"name": "Coinbase Crypto Prices",
"baseSelector": ".cds-tableRow-t45thuk",
"fields": [
{
"name": "crypto",
"selector": "td:nth-child(1) h2",
"type": "text",
},
{
"name": "symbol",
"selector": "td:nth-child(1) p",
"type": "text",
},
{
"name": "price",
"selector": "td:nth-child(2)",
"type": "text",
},
],
}
request = {
"urls": "https://www.coinbase.com/explore",
"priority": 9,
"extraction_config": {"type": "json_css", "params": {"schema": schema}},
}
result = tester.submit_and_wait(request)
extracted = json.loads(result["result"]["extracted_content"])
print(f"Extracted {len(extracted)} items")
print("Sample item:", json.dumps(extracted[0], indent=2))
assert result["result"]["success"]
assert len(extracted) > 0
def test_llm_extraction(tester: Crawl4AiTester):
print("\n=== Testing LLM Extraction ===")
schema = {
"type": "object",
"properties": {
"model_name": {
"type": "string",
"description": "Name of the OpenAI model.",
},
"input_fee": {
"type": "string",
"description": "Fee for input token for the OpenAI model.",
},
"output_fee": {
"type": "string",
"description": "Fee for output token for the OpenAI model.",
},
},
"required": ["model_name", "input_fee", "output_fee"],
}
request = {
"urls": "https://openai.com/api/pricing",
"priority": 8,
"extraction_config": {
"type": "llm",
"params": {
"provider": "openai/gpt-4o-mini",
"api_token": os.getenv("OPENAI_API_KEY"),
"schema": schema,
"extraction_type": "schema",
"instruction": """From the crawled content, extract all mentioned model names along with their fees for input and output tokens.""",
},
},
"crawler_params": {"word_count_threshold": 1},
}
try:
result = tester.submit_and_wait(request)
extracted = json.loads(result["result"]["extracted_content"])
print(f"Extracted {len(extracted)} model pricing entries")
print("Sample entry:", json.dumps(extracted[0], indent=2))
assert result["result"]["success"]
except Exception as e:
print(f"LLM extraction test failed (might be due to missing API key): {str(e)}")
def test_llm_with_ollama(tester: Crawl4AiTester):
print("\n=== Testing LLM with Ollama ===")
schema = {
"type": "object",
"properties": {
"article_title": {
"type": "string",
"description": "The main title of the news article",
},
"summary": {
"type": "string",
"description": "A brief summary of the article content",
},
"main_topics": {
"type": "array",
"items": {"type": "string"},
"description": "Main topics or themes discussed in the article",
},
},
}
request = {
"urls": "https://www.nbcnews.com/business",
"priority": 8,
"extraction_config": {
"type": "llm",
"params": {
"provider": "ollama/llama2",
"schema": schema,
"extraction_type": "schema",
"instruction": "Extract the main article information including title, summary, and main topics.",
},
},
"extra": {"word_count_threshold": 1},
"crawler_params": {"verbose": True},
}
try:
result = tester.submit_and_wait(request)
extracted = json.loads(result["result"]["extracted_content"])
print("Extracted content:", json.dumps(extracted, indent=2))
assert result["result"]["success"]
except Exception as e:
print(f"Ollama extraction test failed: {str(e)}")
def test_cosine_extraction(tester: Crawl4AiTester):
print("\n=== Testing Cosine Extraction ===")
request = {
"urls": "https://www.nbcnews.com/business",
"priority": 8,
"extraction_config": {
"type": "cosine",
"params": {
"semantic_filter": "business finance economy",
"word_count_threshold": 10,
"max_dist": 0.2,
"top_k": 3,
},
},
}
try:
result = tester.submit_and_wait(request)
extracted = json.loads(result["result"]["extracted_content"])
print(f"Extracted {len(extracted)} text clusters")
print("First cluster tags:", extracted[0]["tags"])
assert result["result"]["success"]
except Exception as e:
print(f"Cosine extraction test failed: {str(e)}")
def test_screenshot(tester: Crawl4AiTester):
print("\n=== Testing Screenshot ===")
request = {
"urls": "https://www.nbcnews.com/business",
"priority": 5,
"screenshot": True,
"crawler_params": {"headless": True},
}
result = tester.submit_and_wait(request)
print("Screenshot captured:", bool(result["result"]["screenshot"]))
if result["result"]["screenshot"]:
# Save screenshot
screenshot_data = base64.b64decode(result["result"]["screenshot"])
with open("test_screenshot.jpg", "wb") as f:
f.write(screenshot_data)
print("Screenshot saved as test_screenshot.jpg")
assert result["result"]["success"]
if __name__ == "__main__":
version = sys.argv[1] if len(sys.argv) > 1 else "basic"
# version = "full"
test_docker_deployment(version)

View File

@@ -1,127 +0,0 @@
"""
Example demonstrating different extraction strategies with various input formats.
This example shows how to:
1. Use different input formats (markdown, HTML, fit_markdown)
2. Work with JSON-based extractors (CSS and XPath)
3. Use LLM-based extraction with different input formats
4. Configure browser and crawler settings properly
"""
import asyncio
import os
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from crawl4ai.extraction_strategy import (
LLMExtractionStrategy,
JsonCssExtractionStrategy,
JsonXPathExtractionStrategy,
)
from crawl4ai.content_filter_strategy import PruningContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
async def run_extraction(crawler: AsyncWebCrawler, url: str, strategy, name: str):
"""Helper function to run extraction with proper configuration"""
try:
# Configure the crawler run settings
config = CrawlerRunConfig(
cache_mode=CacheMode.BYPASS,
extraction_strategy=strategy,
markdown_generator=DefaultMarkdownGenerator(
content_filter=PruningContentFilter() # For fit_markdown support
),
)
# Run the crawler
result = await crawler.arun(url=url, config=config)
if result.success:
print(f"\n=== {name} Results ===")
print(f"Extracted Content: {result.extracted_content}")
print(f"Raw Markdown Length: {len(result.markdown_v2.raw_markdown)}")
print(
f"Citations Markdown Length: {len(result.markdown_v2.markdown_with_citations)}"
)
else:
print(f"Error in {name}: Crawl failed")
except Exception as e:
print(f"Error in {name}: {str(e)}")
async def main():
# Example URL (replace with actual URL)
url = "https://example.com/product-page"
# Configure browser settings
browser_config = BrowserConfig(headless=True, verbose=True)
# Initialize extraction strategies
# 1. LLM Extraction with different input formats
markdown_strategy = LLMExtractionStrategy(
provider="openai/gpt-4o-mini",
api_token=os.getenv("OPENAI_API_KEY"),
instruction="Extract product information including name, price, and description",
)
html_strategy = LLMExtractionStrategy(
input_format="html",
provider="openai/gpt-4o-mini",
api_token=os.getenv("OPENAI_API_KEY"),
instruction="Extract product information from HTML including structured data",
)
fit_markdown_strategy = LLMExtractionStrategy(
input_format="fit_markdown",
provider="openai/gpt-4o-mini",
api_token=os.getenv("OPENAI_API_KEY"),
instruction="Extract product information from cleaned markdown",
)
# 2. JSON CSS Extraction (automatically uses HTML input)
css_schema = {
"baseSelector": ".product",
"fields": [
{"name": "title", "selector": "h1.product-title", "type": "text"},
{"name": "price", "selector": ".price", "type": "text"},
{"name": "description", "selector": ".description", "type": "text"},
],
}
css_strategy = JsonCssExtractionStrategy(schema=css_schema)
# 3. JSON XPath Extraction (automatically uses HTML input)
xpath_schema = {
"baseSelector": "//div[@class='product']",
"fields": [
{
"name": "title",
"selector": ".//h1[@class='product-title']/text()",
"type": "text",
},
{
"name": "price",
"selector": ".//span[@class='price']/text()",
"type": "text",
},
{
"name": "description",
"selector": ".//div[@class='description']/text()",
"type": "text",
},
],
}
xpath_strategy = JsonXPathExtractionStrategy(schema=xpath_schema)
# Use context manager for proper resource handling
async with AsyncWebCrawler(config=browser_config) as crawler:
# Run all strategies
await run_extraction(crawler, url, markdown_strategy, "Markdown LLM")
await run_extraction(crawler, url, html_strategy, "HTML LLM")
await run_extraction(crawler, url, fit_markdown_strategy, "Fit Markdown LLM")
await run_extraction(crawler, url, css_strategy, "CSS Extraction")
await run_extraction(crawler, url, xpath_strategy, "XPath Extraction")
if __name__ == "__main__":
asyncio.run(main())

View File

@@ -1,58 +0,0 @@
# Capturing Full-Page Screenshots and PDFs from Massive Webpages with Crawl4AI
When dealing with very long web pages, traditional full-page screenshots can be slow or fail entirely. For large pages (like extensive Wikipedia articles), generating a single massive screenshot often leads to delays, memory issues, or style differences.
**The New Approach:**
Weve introduced a new feature that effortlessly handles even the biggest pages by first exporting them as a PDF, then converting that PDF into a high-quality image. This approach leverages the browsers built-in PDF rendering, making it both stable and efficient for very long content. You also have the option to directly save the PDF for your own usage—no need for multiple passes or complex stitching logic.
**Key Benefits:**
- **Reliability:** The PDF export never times out and works regardless of page length.
- **Versatility:** Get both the PDF and a screenshot in one crawl, without reloading or reprocessing.
- **Performance:** Skips manual scrolling and stitching images, reducing complexity and runtime.
**Simple Example:**
```python
import os, sys
import asyncio
from crawl4ai import AsyncWebCrawler, CacheMode
# Adjust paths as needed
parent_dir = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
sys.path.append(parent_dir)
__location__ = os.path.realpath(os.path.join(os.getcwd(), os.path.dirname(__file__)))
async def main():
async with AsyncWebCrawler() as crawler:
# Request both PDF and screenshot
result = await crawler.arun(
url='https://en.wikipedia.org/wiki/List_of_common_misconceptions',
cache_mode=CacheMode.BYPASS,
pdf=True,
screenshot=True
)
if result.success:
# Save screenshot
if result.screenshot:
from base64 import b64decode
with open(os.path.join(__location__, "screenshot.png"), "wb") as f:
f.write(b64decode(result.screenshot))
# Save PDF
if result.pdf:
pdf_bytes = b64decode(result.pdf)
with open(os.path.join(__location__, "page.pdf"), "wb") as f:
f.write(pdf_bytes)
if __name__ == "__main__":
asyncio.run(main())
```
**What Happens Under the Hood:**
- Crawl4AI navigates to the target page.
- If `pdf=True`, it exports the current page as a full PDF, capturing all of its content no matter the length.
- If `screenshot=True`, and a PDF is already available, it directly converts the first page of that PDF to an image for you—no repeated loading or scrolling.
- Finally, you get your PDF and/or screenshot ready to use.
**Conclusion:**
With this feature, Crawl4AI becomes even more robust and versatile for large-scale content extraction. Whether you need a PDF snapshot or a quick screenshot, you now have a reliable solution for even the most extensive webpages.

View File

@@ -1,23 +0,0 @@
import asyncio
from crawl4ai import *
async def main():
browser_config = BrowserConfig(headless=True, verbose=True)
async with AsyncWebCrawler(config=browser_config) as crawler:
crawler_config = CrawlerRunConfig(
cache_mode=CacheMode.BYPASS,
markdown_generator=DefaultMarkdownGenerator(
content_filter=PruningContentFilter(
threshold=0.48, threshold_type="fixed", min_word_threshold=0
)
),
)
result = await crawler.arun(
url="https://www.helloworld.org", config=crawler_config
)
print(result.markdown_v2.raw_markdown[:500])
if __name__ == "__main__":
asyncio.run(main())

View File

@@ -1,118 +0,0 @@
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from playwright.async_api import Page, BrowserContext
async def main():
print("🔗 Hooks Example: Demonstrating different hook use cases")
# Configure browser settings
browser_config = BrowserConfig(headless=True)
# Configure crawler settings
crawler_run_config = CrawlerRunConfig(
js_code="window.scrollTo(0, document.body.scrollHeight);",
wait_for="body",
cache_mode=CacheMode.BYPASS,
)
# Create crawler instance
crawler = AsyncWebCrawler(config=browser_config)
# Define and set hook functions
async def on_browser_created(browser, context: BrowserContext, **kwargs):
"""Hook called after the browser is created"""
print("[HOOK] on_browser_created - Browser is ready!")
# Example: Set a cookie that will be used for all requests
return browser
async def on_page_context_created(page: Page, context: BrowserContext, **kwargs):
"""Hook called after a new page and context are created"""
print("[HOOK] on_page_context_created - New page created!")
# Example: Set default viewport size
await context.add_cookies(
[
{
"name": "session_id",
"value": "example_session",
"domain": ".example.com",
"path": "/",
}
]
)
await page.set_viewport_size({"width": 1080, "height": 800})
return page
async def on_user_agent_updated(
page: Page, context: BrowserContext, user_agent: str, **kwargs
):
"""Hook called when the user agent is updated"""
print(f"[HOOK] on_user_agent_updated - New user agent: {user_agent}")
return page
async def on_execution_started(page: Page, context: BrowserContext, **kwargs):
"""Hook called after custom JavaScript execution"""
print("[HOOK] on_execution_started - Custom JS executed!")
return page
async def before_goto(page: Page, context: BrowserContext, url: str, **kwargs):
"""Hook called before navigating to each URL"""
print(f"[HOOK] before_goto - About to visit: {url}")
# Example: Add custom headers for the request
await page.set_extra_http_headers({"Custom-Header": "my-value"})
return page
async def after_goto(
page: Page, context: BrowserContext, url: str, response: dict, **kwargs
):
"""Hook called after navigating to each URL"""
print(f"[HOOK] after_goto - Successfully loaded: {url}")
# Example: Wait for a specific element to be loaded
try:
await page.wait_for_selector(".content", timeout=1000)
print("Content element found!")
except:
print("Content element not found, continuing anyway")
return page
async def before_retrieve_html(page: Page, context: BrowserContext, **kwargs):
"""Hook called before retrieving the HTML content"""
print("[HOOK] before_retrieve_html - About to get HTML content")
# Example: Scroll to bottom to trigger lazy loading
await page.evaluate("window.scrollTo(0, document.body.scrollHeight);")
return page
async def before_return_html(
page: Page, context: BrowserContext, html: str, **kwargs
):
"""Hook called before returning the HTML content"""
print(f"[HOOK] before_return_html - Got HTML content (length: {len(html)})")
# Example: You could modify the HTML content here if needed
return page
# Set all the hooks
crawler.crawler_strategy.set_hook("on_browser_created", on_browser_created)
crawler.crawler_strategy.set_hook(
"on_page_context_created", on_page_context_created
)
crawler.crawler_strategy.set_hook("on_user_agent_updated", on_user_agent_updated)
crawler.crawler_strategy.set_hook("on_execution_started", on_execution_started)
crawler.crawler_strategy.set_hook("before_goto", before_goto)
crawler.crawler_strategy.set_hook("after_goto", after_goto)
crawler.crawler_strategy.set_hook("before_retrieve_html", before_retrieve_html)
crawler.crawler_strategy.set_hook("before_return_html", before_return_html)
await crawler.start()
# Example usage: crawl a simple website
url = "https://example.com"
result = await crawler.arun(url, config=crawler_run_config)
print(f"\nCrawled URL: {result.url}")
print(f"HTML length: {len(result.html)}")
await crawler.close()
if __name__ == "__main__":
import asyncio
asyncio.run(main())

View File

@@ -1,7 +1,6 @@
import asyncio
from crawl4ai import AsyncWebCrawler, AsyncPlaywrightCrawlerStrategy
async def main():
# Example 1: Setting language when creating the crawler
crawler1 = AsyncWebCrawler(
@@ -10,15 +9,11 @@ async def main():
)
)
result1 = await crawler1.arun("https://www.example.com")
print(
"Example 1 result:", result1.extracted_content[:100]
) # Print first 100 characters
print("Example 1 result:", result1.extracted_content[:100]) # Print first 100 characters
# Example 2: Setting language before crawling
crawler2 = AsyncWebCrawler()
crawler2.crawler_strategy.headers[
"Accept-Language"
] = "es-ES,es;q=0.9,en-US;q=0.8,en;q=0.7"
crawler2.crawler_strategy.headers["Accept-Language"] = "es-ES,es;q=0.9,en-US;q=0.8,en;q=0.7"
result2 = await crawler2.arun("https://www.example.com")
print("Example 2 result:", result2.extracted_content[:100])
@@ -26,7 +21,7 @@ async def main():
crawler3 = AsyncWebCrawler()
result3 = await crawler3.arun(
"https://www.example.com",
headers={"Accept-Language": "de-DE,de;q=0.9,en-US;q=0.8,en;q=0.7"},
headers={"Accept-Language": "de-DE,de;q=0.9,en-US;q=0.8,en;q=0.7"}
)
print("Example 3 result:", result3.extracted_content[:100])
@@ -36,15 +31,15 @@ async def main():
("https://www.example.org", "es-ES,es;q=0.9"),
("https://www.example.net", "de-DE,de;q=0.9"),
]
crawler4 = AsyncWebCrawler()
results = await asyncio.gather(
*[crawler4.arun(url, headers={"Accept-Language": lang}) for url, lang in urls]
)
results = await asyncio.gather(*[
crawler4.arun(url, headers={"Accept-Language": lang})
for url, lang in urls
])
for url, result in zip([u for u, _ in urls], results):
print(f"Result for {url}:", result.extracted_content[:100])
if __name__ == "__main__":
asyncio.run(main())
asyncio.run(main())

View File

@@ -1,46 +1,41 @@
import os
import time
from crawl4ai.web_crawler import WebCrawler
from crawl4ai.chunking_strategy import *
from crawl4ai.extraction_strategy import *
from crawl4ai.crawler_strategy import *
import asyncio
url = r'https://openai.com/api/pricing/'
crawler = WebCrawler()
crawler.warmup()
from pydantic import BaseModel, Field
url = r"https://openai.com/api/pricing/"
class OpenAIModelFee(BaseModel):
model_name: str = Field(..., description="Name of the OpenAI model.")
input_fee: str = Field(..., description="Fee for input token for the OpenAI model.")
output_fee: str = Field(
..., description="Fee for output token for the OpenAI model."
)
output_fee: str = Field(..., description="Fee for output token for the OpenAI model.")
result = crawler.run(
url=url,
word_count_threshold=1,
extraction_strategy= LLMExtractionStrategy(
# provider= "openai/gpt-4o", api_token = os.getenv('OPENAI_API_KEY'),
provider= "groq/llama-3.1-70b-versatile", api_token = os.getenv('GROQ_API_KEY'),
schema=OpenAIModelFee.model_json_schema(),
extraction_type="schema",
instruction="From the crawled content, extract all mentioned model names along with their "\
"fees for input and output tokens. Make sure not to miss anything in the entire content. "\
'One extracted model JSON format should look like this: '\
'{ "model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", "output_fee": "US$30.00 / 1M tokens" }'
),
bypass_cache=True,
)
from crawl4ai import AsyncWebCrawler
model_fees = json.loads(result.extracted_content)
print(len(model_fees))
async def main():
# Use AsyncWebCrawler
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(
url=url,
word_count_threshold=1,
extraction_strategy=LLMExtractionStrategy(
# provider= "openai/gpt-4o", api_token = os.getenv('OPENAI_API_KEY'),
provider="groq/llama-3.1-70b-versatile",
api_token=os.getenv("GROQ_API_KEY"),
schema=OpenAIModelFee.model_json_schema(),
extraction_type="schema",
instruction="From the crawled content, extract all mentioned model names along with their "
"fees for input and output tokens. Make sure not to miss anything in the entire content. "
"One extracted model JSON format should look like this: "
'{ "model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", "output_fee": "US$30.00 / 1M tokens" }',
),
)
print("Success:", result.success)
model_fees = json.loads(result.extracted_content)
print(len(model_fees))
with open(".data/data.json", "w", encoding="utf-8") as f:
f.write(result.extracted_content)
asyncio.run(main())
with open(".data/data.json", "w", encoding="utf-8") as f:
f.write(result.extracted_content)

View File

@@ -1,87 +0,0 @@
import os
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from crawl4ai.content_filter_strategy import LLMContentFilter
async def test_llm_filter():
# Create an HTML source that needs intelligent filtering
url = "https://docs.python.org/3/tutorial/classes.html"
browser_config = BrowserConfig(
headless=True,
verbose=True
)
# run_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
run_config = CrawlerRunConfig(cache_mode=CacheMode.ENABLED)
async with AsyncWebCrawler(config=browser_config) as crawler:
# First get the raw HTML
result = await crawler.arun(url, config=run_config)
html = result.cleaned_html
# Initialize LLM filter with focused instruction
filter = LLMContentFilter(
provider="openai/gpt-4o",
api_token=os.getenv('OPENAI_API_KEY'),
instruction="""
Focus on extracting the core educational content about Python classes.
Include:
- Key concepts and their explanations
- Important code examples
- Essential technical details
Exclude:
- Navigation elements
- Sidebars
- Footer content
- Version information
- Any non-essential UI elements
Format the output as clean markdown with proper code blocks and headers.
""",
verbose=True
)
filter = LLMContentFilter(
provider="openai/gpt-4o",
api_token=os.getenv('OPENAI_API_KEY'),
chunk_token_threshold=2 ** 12 * 2, # 2048 * 2
instruction="""
Extract the main educational content while preserving its original wording and substance completely. Your task is to:
1. Maintain the exact language and terminology used in the main content
2. Keep all technical explanations, examples, and educational content intact
3. Preserve the original flow and structure of the core content
4. Remove only clearly irrelevant elements like:
- Navigation menus
- Advertisement sections
- Cookie notices
- Footers with site information
- Sidebars with external links
- Any UI elements that don't contribute to learning
The goal is to create a clean markdown version that reads exactly like the original article,
keeping all valuable content but free from distracting elements. Imagine you're creating
a perfect reading experience where nothing valuable is lost, but all noise is removed.
""",
verbose=True
)
# Apply filtering
filtered_content = filter.filter_content(html, ignore_cache = True)
# Show results
print("\nFiltered Content Length:", len(filtered_content))
print("\nFirst 500 chars of filtered content:")
if filtered_content:
print(filtered_content[0][:500])
# Save on disc the markdown version
with open("filtered_content.md", "w", encoding="utf-8") as f:
f.write("\n".join(filtered_content))
# Show token usage
filter.show_usage()
if __name__ == "__main__":
asyncio.run(test_llm_filter())

File diff suppressed because one or more lines are too long

View File

@@ -1,618 +0,0 @@
import os, sys
sys.path.append(
os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
)
import asyncio
import time
import json
import re
from typing import Dict
from bs4 import BeautifulSoup
from pydantic import BaseModel, Field
from crawl4ai import AsyncWebCrawler, CacheMode, BrowserConfig, CrawlerRunConfig
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
from crawl4ai.content_filter_strategy import PruningContentFilter
from crawl4ai.extraction_strategy import (
JsonCssExtractionStrategy,
LLMExtractionStrategy,
)
__location__ = os.path.realpath(os.path.join(os.getcwd(), os.path.dirname(__file__)))
print("Crawl4AI: Advanced Web Crawling and Data Extraction")
print("GitHub Repository: https://github.com/unclecode/crawl4ai")
print("Twitter: @unclecode")
print("Website: https://crawl4ai.com")
# Basic Example - Simple Crawl
async def simple_crawl():
print("\n--- Basic Usage ---")
browser_config = BrowserConfig(headless=True)
crawler_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
async with AsyncWebCrawler(config=browser_config) as crawler:
result = await crawler.arun(
url="https://www.nbcnews.com/business", config=crawler_config
)
print(result.markdown[:500])
async def clean_content():
crawler_config = CrawlerRunConfig(
cache_mode=CacheMode.BYPASS,
excluded_tags=["nav", "footer", "aside"],
remove_overlay_elements=True,
markdown_generator=DefaultMarkdownGenerator(
content_filter=PruningContentFilter(
threshold=0.48, threshold_type="fixed", min_word_threshold=0
),
options={"ignore_links": True},
),
)
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(
url="https://en.wikipedia.org/wiki/Apple",
config=crawler_config,
)
full_markdown_length = len(result.markdown_v2.raw_markdown)
fit_markdown_length = len(result.markdown_v2.fit_markdown)
print(f"Full Markdown Length: {full_markdown_length}")
print(f"Fit Markdown Length: {fit_markdown_length}")
async def link_analysis():
crawler_config = CrawlerRunConfig(
cache_mode=CacheMode.ENABLED,
exclude_external_links=True,
exclude_social_media_links=True,
)
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(
url="https://www.nbcnews.com/business",
config=crawler_config,
)
print(f"Found {len(result.links['internal'])} internal links")
print(f"Found {len(result.links['external'])} external links")
for link in result.links["internal"][:5]:
print(f"Href: {link['href']}\nText: {link['text']}\n")
# JavaScript Execution Example
async def simple_example_with_running_js_code():
print("\n--- Executing JavaScript and Using CSS Selectors ---")
browser_config = BrowserConfig(headless=True, java_script_enabled=True)
crawler_config = CrawlerRunConfig(
cache_mode=CacheMode.BYPASS,
js_code="const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More')); loadMoreButton && loadMoreButton.click();",
# wait_for="() => { return Array.from(document.querySelectorAll('article.tease-card')).length > 10; }"
)
async with AsyncWebCrawler(config=browser_config) as crawler:
result = await crawler.arun(
url="https://www.nbcnews.com/business", config=crawler_config
)
print(result.markdown[:500])
# CSS Selector Example
async def simple_example_with_css_selector():
print("\n--- Using CSS Selectors ---")
browser_config = BrowserConfig(headless=True)
crawler_config = CrawlerRunConfig(
cache_mode=CacheMode.BYPASS, css_selector=".wide-tease-item__description"
)
async with AsyncWebCrawler(config=browser_config) as crawler:
result = await crawler.arun(
url="https://www.nbcnews.com/business", config=crawler_config
)
print(result.markdown[:500])
async def media_handling():
crawler_config = CrawlerRunConfig(
cache_mode=CacheMode.BYPASS, exclude_external_images=True, screenshot=True
)
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(
url="https://www.nbcnews.com/business", config=crawler_config
)
for img in result.media["images"][:5]:
print(f"Image URL: {img['src']}, Alt: {img['alt']}, Score: {img['score']}")
async def custom_hook_workflow(verbose=True):
async with AsyncWebCrawler() as crawler:
# Set a 'before_goto' hook to run custom code just before navigation
crawler.crawler_strategy.set_hook(
"before_goto",
lambda page, context: print("[Hook] Preparing to navigate..."),
)
# Perform the crawl operation
result = await crawler.arun(url="https://crawl4ai.com")
print(result.markdown_v2.raw_markdown[:500].replace("\n", " -- "))
# Proxy Example
async def use_proxy():
print("\n--- Using a Proxy ---")
browser_config = BrowserConfig(
headless=True,
proxy_config={
"server": "http://proxy.example.com:8080",
"username": "username",
"password": "password",
},
)
crawler_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
async with AsyncWebCrawler(config=browser_config) as crawler:
result = await crawler.arun(
url="https://www.nbcnews.com/business", config=crawler_config
)
if result.success:
print(result.markdown[:500])
# Screenshot Example
async def capture_and_save_screenshot(url: str, output_path: str):
browser_config = BrowserConfig(headless=True)
crawler_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS, screenshot=True)
async with AsyncWebCrawler(config=browser_config) as crawler:
result = await crawler.arun(url=url, config=crawler_config)
if result.success and result.screenshot:
import base64
screenshot_data = base64.b64decode(result.screenshot)
with open(output_path, "wb") as f:
f.write(screenshot_data)
print(f"Screenshot saved successfully to {output_path}")
else:
print("Failed to capture screenshot")
# LLM Extraction Example
class OpenAIModelFee(BaseModel):
model_name: str = Field(..., description="Name of the OpenAI model.")
input_fee: str = Field(..., description="Fee for input token for the OpenAI model.")
output_fee: str = Field(
..., description="Fee for output token for the OpenAI model."
)
async def extract_structured_data_using_llm(
provider: str, api_token: str = None, extra_headers: Dict[str, str] = None
):
print(f"\n--- Extracting Structured Data with {provider} ---")
if api_token is None and provider != "ollama":
print(f"API token is required for {provider}. Skipping this example.")
return
browser_config = BrowserConfig(headless=True)
extra_args = {"temperature": 0, "top_p": 0.9, "max_tokens": 2000}
if extra_headers:
extra_args["extra_headers"] = extra_headers
crawler_config = CrawlerRunConfig(
cache_mode=CacheMode.BYPASS,
word_count_threshold=1,
page_timeout=80000,
extraction_strategy=LLMExtractionStrategy(
provider=provider,
api_token=api_token,
schema=OpenAIModelFee.model_json_schema(),
extraction_type="schema",
instruction="""From the crawled content, extract all mentioned model names along with their fees for input and output tokens.
Do not miss any models in the entire content.""",
extra_args=extra_args,
),
)
async with AsyncWebCrawler(config=browser_config) as crawler:
result = await crawler.arun(
url="https://openai.com/api/pricing/", config=crawler_config
)
print(result.extracted_content)
# CSS Extraction Example
async def extract_structured_data_using_css_extractor():
print("\n--- Using JsonCssExtractionStrategy for Fast Structured Output ---")
schema = {
"name": "KidoCode Courses",
"baseSelector": "section.charge-methodology .framework-collection-item.w-dyn-item",
"fields": [
{
"name": "section_title",
"selector": "h3.heading-50",
"type": "text",
},
{
"name": "section_description",
"selector": ".charge-content",
"type": "text",
},
{
"name": "course_name",
"selector": ".text-block-93",
"type": "text",
},
{
"name": "course_description",
"selector": ".course-content-text",
"type": "text",
},
{
"name": "course_icon",
"selector": ".image-92",
"type": "attribute",
"attribute": "src",
},
],
}
browser_config = BrowserConfig(headless=True, java_script_enabled=True)
js_click_tabs = """
(async () => {
const tabs = document.querySelectorAll("section.charge-methodology .tabs-menu-3 > div");
for(let tab of tabs) {
tab.scrollIntoView();
tab.click();
await new Promise(r => setTimeout(r, 500));
}
})();
"""
crawler_config = CrawlerRunConfig(
cache_mode=CacheMode.BYPASS,
extraction_strategy=JsonCssExtractionStrategy(schema),
js_code=[js_click_tabs],
delay_before_return_html=1
)
async with AsyncWebCrawler(config=browser_config) as crawler:
result = await crawler.arun(
url="https://www.kidocode.com/degrees/technology", config=crawler_config
)
companies = json.loads(result.extracted_content)
print(f"Successfully extracted {len(companies)} companies")
print(json.dumps(companies[0], indent=2))
# Dynamic Content Examples - Method 1
async def crawl_dynamic_content_pages_method_1():
print("\n--- Advanced Multi-Page Crawling with JavaScript Execution ---")
first_commit = ""
async def on_execution_started(page, **kwargs):
nonlocal first_commit
try:
while True:
await page.wait_for_selector("li.Box-sc-g0xbh4-0 h4")
commit = await page.query_selector("li.Box-sc-g0xbh4-0 h4")
commit = await commit.evaluate("(element) => element.textContent")
commit = re.sub(r"\s+", "", commit)
if commit and commit != first_commit:
first_commit = commit
break
await asyncio.sleep(0.5)
except Exception as e:
print(f"Warning: New content didn't appear after JavaScript execution: {e}")
browser_config = BrowserConfig(headless=False, java_script_enabled=True)
async with AsyncWebCrawler(config=browser_config) as crawler:
crawler.crawler_strategy.set_hook("on_execution_started", on_execution_started)
url = "https://github.com/microsoft/TypeScript/commits/main"
session_id = "typescript_commits_session"
all_commits = []
js_next_page = """
const button = document.querySelector('a[data-testid="pagination-next-button"]');
if (button) button.click();
"""
for page in range(3):
crawler_config = CrawlerRunConfig(
cache_mode=CacheMode.BYPASS,
css_selector="li.Box-sc-g0xbh4-0",
js_code=js_next_page if page > 0 else None,
js_only=page > 0,
session_id=session_id,
)
result = await crawler.arun(url=url, config=crawler_config)
assert result.success, f"Failed to crawl page {page + 1}"
soup = BeautifulSoup(result.cleaned_html, "html.parser")
commits = soup.select("li")
all_commits.extend(commits)
print(f"Page {page + 1}: Found {len(commits)} commits")
print(f"Successfully crawled {len(all_commits)} commits across 3 pages")
# Dynamic Content Examples - Method 2
async def crawl_dynamic_content_pages_method_2():
print("\n--- Advanced Multi-Page Crawling with JavaScript Execution ---")
browser_config = BrowserConfig(headless=False, java_script_enabled=True)
js_next_page_and_wait = """
(async () => {
const getCurrentCommit = () => {
const commits = document.querySelectorAll('li.Box-sc-g0xbh4-0 h4');
return commits.length > 0 ? commits[0].textContent.trim() : null;
};
const initialCommit = getCurrentCommit();
const button = document.querySelector('a[data-testid="pagination-next-button"]');
if (button) button.click();
while (true) {
await new Promise(resolve => setTimeout(resolve, 100));
const newCommit = getCurrentCommit();
if (newCommit && newCommit !== initialCommit) {
break;
}
}
})();
"""
schema = {
"name": "Commit Extractor",
"baseSelector": "li.Box-sc-g0xbh4-0",
"fields": [
{
"name": "title",
"selector": "h4.markdown-title",
"type": "text",
"transform": "strip",
},
],
}
async with AsyncWebCrawler(config=browser_config) as crawler:
url = "https://github.com/microsoft/TypeScript/commits/main"
session_id = "typescript_commits_session"
all_commits = []
extraction_strategy = JsonCssExtractionStrategy(schema)
for page in range(3):
crawler_config = CrawlerRunConfig(
cache_mode=CacheMode.BYPASS,
css_selector="li.Box-sc-g0xbh4-0",
extraction_strategy=extraction_strategy,
js_code=js_next_page_and_wait if page > 0 else None,
js_only=page > 0,
session_id=session_id,
)
result = await crawler.arun(url=url, config=crawler_config)
assert result.success, f"Failed to crawl page {page + 1}"
commits = json.loads(result.extracted_content)
all_commits.extend(commits)
print(f"Page {page + 1}: Found {len(commits)} commits")
print(f"Successfully crawled {len(all_commits)} commits across 3 pages")
async def cosine_similarity_extraction():
crawl_config = CrawlerRunConfig(
cache_mode=CacheMode.BYPASS,
extraction_strategy=CosineStrategy(
word_count_threshold=10,
max_dist=0.2, # Maximum distance between two words
linkage_method="ward", # Linkage method for hierarchical clustering (ward, complete, average, single)
top_k=3, # Number of top keywords to extract
sim_threshold=0.3, # Similarity threshold for clustering
semantic_filter="McDonald's economic impact, American consumer trends", # Keywords to filter the content semantically using embeddings
verbose=True,
),
)
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(
url="https://www.nbcnews.com/business/consumer/how-mcdonalds-e-coli-crisis-inflation-politics-reflect-american-story-rcna177156",
config=crawl_config,
)
print(json.loads(result.extracted_content)[:5])
# Browser Comparison
async def crawl_custom_browser_type():
print("\n--- Browser Comparison ---")
# Firefox
browser_config_firefox = BrowserConfig(browser_type="firefox", headless=True)
start = time.time()
async with AsyncWebCrawler(config=browser_config_firefox) as crawler:
result = await crawler.arun(
url="https://www.example.com",
config=CrawlerRunConfig(cache_mode=CacheMode.BYPASS),
)
print("Firefox:", time.time() - start)
print(result.markdown[:500])
# WebKit
browser_config_webkit = BrowserConfig(browser_type="webkit", headless=True)
start = time.time()
async with AsyncWebCrawler(config=browser_config_webkit) as crawler:
result = await crawler.arun(
url="https://www.example.com",
config=CrawlerRunConfig(cache_mode=CacheMode.BYPASS),
)
print("WebKit:", time.time() - start)
print(result.markdown[:500])
# Chromium (default)
browser_config_chromium = BrowserConfig(browser_type="chromium", headless=True)
start = time.time()
async with AsyncWebCrawler(config=browser_config_chromium) as crawler:
result = await crawler.arun(
url="https://www.example.com",
config=CrawlerRunConfig(cache_mode=CacheMode.BYPASS),
)
print("Chromium:", time.time() - start)
print(result.markdown[:500])
# Anti-Bot and User Simulation
async def crawl_with_user_simulation():
browser_config = BrowserConfig(
headless=True,
user_agent_mode="random",
user_agent_generator_config={"device_type": "mobile", "os_type": "android"},
)
crawler_config = CrawlerRunConfig(
cache_mode=CacheMode.BYPASS,
magic=True,
simulate_user=True,
override_navigator=True,
)
async with AsyncWebCrawler(config=browser_config) as crawler:
result = await crawler.arun(url="YOUR-URL-HERE", config=crawler_config)
print(result.markdown)
async def ssl_certification():
# Configure crawler to fetch SSL certificate
config = CrawlerRunConfig(
fetch_ssl_certificate=True,
cache_mode=CacheMode.BYPASS, # Bypass cache to always get fresh certificates
)
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(url="https://example.com", config=config)
if result.success and result.ssl_certificate:
cert = result.ssl_certificate
# 1. Access certificate properties directly
print("\nCertificate Information:")
print(f"Issuer: {cert.issuer.get('CN', '')}")
print(f"Valid until: {cert.valid_until}")
print(f"Fingerprint: {cert.fingerprint}")
# 2. Export certificate in different formats
cert.to_json(os.path.join(tmp_dir, "certificate.json")) # For analysis
print("\nCertificate exported to:")
print(f"- JSON: {os.path.join(tmp_dir, 'certificate.json')}")
pem_data = cert.to_pem(
os.path.join(tmp_dir, "certificate.pem")
) # For web servers
print(f"- PEM: {os.path.join(tmp_dir, 'certificate.pem')}")
der_data = cert.to_der(
os.path.join(tmp_dir, "certificate.der")
) # For Java apps
print(f"- DER: {os.path.join(tmp_dir, 'certificate.der')}")
# Speed Comparison
async def speed_comparison():
print("\n--- Speed Comparison ---")
# Firecrawl comparison
from firecrawl import FirecrawlApp
app = FirecrawlApp(api_key=os.environ["FIRECRAWL_API_KEY"])
start = time.time()
scrape_status = app.scrape_url(
"https://www.nbcnews.com/business", params={"formats": ["markdown", "html"]}
)
end = time.time()
print("Firecrawl:")
print(f"Time taken: {end - start:.2f} seconds")
print(f"Content length: {len(scrape_status['markdown'])} characters")
print(f"Images found: {scrape_status['markdown'].count('cldnry.s-nbcnews.com')}")
print()
# Crawl4AI comparisons
browser_config = BrowserConfig(headless=True)
# Simple crawl
async with AsyncWebCrawler(config=browser_config) as crawler:
start = time.time()
result = await crawler.arun(
url="https://www.nbcnews.com/business",
config=CrawlerRunConfig(
cache_mode=CacheMode.BYPASS, word_count_threshold=0
),
)
end = time.time()
print("Crawl4AI (simple crawl):")
print(f"Time taken: {end - start:.2f} seconds")
print(f"Content length: {len(result.markdown)} characters")
print(f"Images found: {result.markdown.count('cldnry.s-nbcnews.com')}")
print()
# Advanced filtering
start = time.time()
result = await crawler.arun(
url="https://www.nbcnews.com/business",
config=CrawlerRunConfig(
cache_mode=CacheMode.BYPASS,
word_count_threshold=0,
markdown_generator=DefaultMarkdownGenerator(
content_filter=PruningContentFilter(
threshold=0.48, threshold_type="fixed", min_word_threshold=0
)
),
),
)
end = time.time()
print("Crawl4AI (Markdown Plus):")
print(f"Time taken: {end - start:.2f} seconds")
print(f"Content length: {len(result.markdown_v2.raw_markdown)} characters")
print(f"Fit Markdown: {len(result.markdown_v2.fit_markdown)} characters")
print(f"Images found: {result.markdown.count('cldnry.s-nbcnews.com')}")
print()
# Main execution
async def main():
# Basic examples
await simple_crawl()
await simple_example_with_running_js_code()
await simple_example_with_css_selector()
# Advanced examples
await extract_structured_data_using_css_extractor()
await extract_structured_data_using_llm(
"openai/gpt-4o", os.getenv("OPENAI_API_KEY")
)
await crawl_dynamic_content_pages_method_1()
await crawl_dynamic_content_pages_method_2()
# Browser comparisons
await crawl_custom_browser_type()
# Screenshot example
await capture_and_save_screenshot(
"https://www.example.com",
os.path.join(__location__, "tmp/example_screenshot.jpg")
)
if __name__ == "__main__":
asyncio.run(main())

View File

@@ -1,10 +1,6 @@
import os, sys
# append parent directory to system path
sys.path.append(
os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
)
os.environ["FIRECRAWL_API_KEY"] = "fc-84b370ccfad44beabc686b38f1769692"
sys.path.append(os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))); os.environ['FIRECRAWL_API_KEY'] = "fc-84b370ccfad44beabc686b38f1769692";
import asyncio
# import nest_asyncio
@@ -14,12 +10,10 @@ import time
import json
import os
import re
from typing import Dict, List
from typing import Dict
from bs4 import BeautifulSoup
from pydantic import BaseModel, Field
from crawl4ai import AsyncWebCrawler, CacheMode
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
from crawl4ai.content_filter_strategy import PruningContentFilter
from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import (
JsonCssExtractionStrategy,
LLMExtractionStrategy,
@@ -36,12 +30,9 @@ print("Website: https://crawl4ai.com")
async def simple_crawl():
print("\n--- Basic Usage ---")
async with AsyncWebCrawler(verbose=True) as crawler:
result = await crawler.arun(
url="https://www.nbcnews.com/business", cache_mode=CacheMode.BYPASS
)
result = await crawler.arun(url="https://www.nbcnews.com/business")
print(result.markdown[:500]) # Print first 500 characters
async def simple_example_with_running_js_code():
print("\n--- Executing JavaScript and Using CSS Selectors ---")
# New code to handle the wait_for parameter
@@ -60,59 +51,55 @@ async def simple_example_with_running_js_code():
url="https://www.nbcnews.com/business",
js_code=js_code,
# wait_for=wait_for,
cache_mode=CacheMode.BYPASS,
bypass_cache=True,
)
print(result.markdown[:500]) # Print first 500 characters
async def simple_example_with_css_selector():
print("\n--- Using CSS Selectors ---")
async with AsyncWebCrawler(verbose=True) as crawler:
result = await crawler.arun(
url="https://www.nbcnews.com/business",
css_selector=".wide-tease-item__description",
cache_mode=CacheMode.BYPASS,
bypass_cache=True,
)
print(result.markdown[:500]) # Print first 500 characters
async def use_proxy():
print("\n--- Using a Proxy ---")
print(
"Note: Replace 'http://your-proxy-url:port' with a working proxy to run this example."
)
# Uncomment and modify the following lines to use a proxy
async with AsyncWebCrawler(
verbose=True, proxy="http://your-proxy-url:port"
) as crawler:
result = await crawler.arun(
url="https://www.nbcnews.com/business", cache_mode=CacheMode.BYPASS
)
if result.success:
print(result.markdown[:500]) # Print first 500 characters
# async with AsyncWebCrawler(verbose=True, proxy="http://your-proxy-url:port") as crawler:
# result = await crawler.arun(
# url="https://www.nbcnews.com/business",
# bypass_cache=True
# )
# print(result.markdown[:500]) # Print first 500 characters
async def capture_and_save_screenshot(url: str, output_path: str):
async with AsyncWebCrawler(verbose=True) as crawler:
result = await crawler.arun(
url=url, screenshot=True, cache_mode=CacheMode.BYPASS
url=url,
screenshot=True,
bypass_cache=True
)
if result.success and result.screenshot:
import base64
# Decode the base64 screenshot data
screenshot_data = base64.b64decode(result.screenshot)
# Save the screenshot as a JPEG file
with open(output_path, "wb") as f:
with open(output_path, 'wb') as f:
f.write(screenshot_data)
print(f"Screenshot saved successfully to {output_path}")
else:
print("Failed to capture screenshot")
class OpenAIModelFee(BaseModel):
model_name: str = Field(..., description="Name of the OpenAI model.")
input_fee: str = Field(..., description="Fee for input token for the OpenAI model.")
@@ -120,23 +107,14 @@ class OpenAIModelFee(BaseModel):
..., description="Fee for output token for the OpenAI model."
)
async def extract_structured_data_using_llm(
provider: str, api_token: str = None, extra_headers: Dict[str, str] = None
):
async def extract_structured_data_using_llm(provider: str, api_token: str = None, extra_headers: Dict[str, str] = None):
print(f"\n--- Extracting Structured Data with {provider} ---")
if api_token is None and provider != "ollama":
print(f"API token is required for {provider}. Skipping this example.")
return
# extra_args = {}
extra_args = {
"temperature": 0,
"top_p": 0.9,
"max_tokens": 2000,
# any other supported parameters for litellm
}
extra_args = {}
if extra_headers:
extra_args["extra_headers"] = extra_headers
@@ -147,80 +125,55 @@ async def extract_structured_data_using_llm(
extraction_strategy=LLMExtractionStrategy(
provider=provider,
api_token=api_token,
schema=OpenAIModelFee.model_json_schema(),
schema=OpenAIModelFee.schema(),
extraction_type="schema",
instruction="""From the crawled content, extract all mentioned model names along with their fees for input and output tokens.
Do not miss any models in the entire content. One extracted model JSON format should look like this:
{"model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", "output_fee": "US$30.00 / 1M tokens"}.""",
extra_args=extra_args,
extra_args=extra_args
),
cache_mode=CacheMode.BYPASS,
bypass_cache=True,
)
print(result.extracted_content)
async def extract_structured_data_using_css_extractor():
print("\n--- Using JsonCssExtractionStrategy for Fast Structured Output ---")
schema = {
"name": "KidoCode Courses",
"baseSelector": "section.charge-methodology .w-tab-content > div",
"name": "Coinbase Crypto Prices",
"baseSelector": ".cds-tableRow-t45thuk",
"fields": [
{
"name": "section_title",
"selector": "h3.heading-50",
"name": "crypto",
"selector": "td:nth-child(1) h2",
"type": "text",
},
{
"name": "section_description",
"selector": ".charge-content",
"name": "symbol",
"selector": "td:nth-child(1) p",
"type": "text",
},
{
"name": "course_name",
"selector": ".text-block-93",
"name": "price",
"selector": "td:nth-child(2)",
"type": "text",
},
{
"name": "course_description",
"selector": ".course-content-text",
"type": "text",
},
{
"name": "course_icon",
"selector": ".image-92",
"type": "attribute",
"attribute": "src",
},
}
],
}
async with AsyncWebCrawler(headless=True, verbose=True) as crawler:
# Create the JavaScript that handles clicking multiple times
js_click_tabs = """
(async () => {
const tabs = document.querySelectorAll("section.charge-methodology .tabs-menu-3 > div");
for(let tab of tabs) {
// scroll to the tab
tab.scrollIntoView();
tab.click();
// Wait for content to load and animations to complete
await new Promise(r => setTimeout(r, 500));
}
})();
"""
extraction_strategy = JsonCssExtractionStrategy(schema, verbose=True)
async with AsyncWebCrawler(verbose=True) as crawler:
result = await crawler.arun(
url="https://www.kidocode.com/degrees/technology",
extraction_strategy=JsonCssExtractionStrategy(schema, verbose=True),
js_code=[js_click_tabs],
cache_mode=CacheMode.BYPASS,
url="https://www.coinbase.com/explore",
extraction_strategy=extraction_strategy,
bypass_cache=True,
)
companies = json.loads(result.extracted_content)
print(f"Successfully extracted {len(companies)} companies")
print(json.dumps(companies[0], indent=2))
assert result.success, "Failed to crawl the page"
news_teasers = json.loads(result.extracted_content)
print(f"Successfully extracted {len(news_teasers)} news teasers")
print(json.dumps(news_teasers[0], indent=2))
# Advanced Session-Based Crawling with Dynamic Content 🔄
async def crawl_dynamic_content_pages_method_1():
@@ -250,10 +203,8 @@ async def crawl_dynamic_content_pages_method_1():
all_commits = []
js_next_page = """
(() => {
const button = document.querySelector('a[data-testid="pagination-next-button"]');
if (button) button.click();
})();
const button = document.querySelector('a[data-testid="pagination-next-button"]');
if (button) button.click();
"""
for page in range(3): # Crawl 3 pages
@@ -262,7 +213,7 @@ async def crawl_dynamic_content_pages_method_1():
session_id=session_id,
css_selector="li.Box-sc-g0xbh4-0",
js=js_next_page if page > 0 else None,
cache_mode=CacheMode.BYPASS,
bypass_cache=True,
js_only=page > 0,
headless=False,
)
@@ -278,7 +229,6 @@ async def crawl_dynamic_content_pages_method_1():
await crawler.crawler_strategy.kill_session(session_id)
print(f"Successfully crawled {len(all_commits)} commits across 3 pages")
async def crawl_dynamic_content_pages_method_2():
print("\n--- Advanced Multi-Page Crawling with JavaScript Execution ---")
@@ -332,7 +282,7 @@ async def crawl_dynamic_content_pages_method_2():
extraction_strategy=extraction_strategy,
js_code=js_next_page_and_wait if page > 0 else None,
js_only=page > 0,
cache_mode=CacheMode.BYPASS,
bypass_cache=True,
headless=False,
)
@@ -346,11 +296,8 @@ async def crawl_dynamic_content_pages_method_2():
await crawler.crawler_strategy.kill_session(session_id)
print(f"Successfully crawled {len(all_commits)} commits across 3 pages")
async def crawl_dynamic_content_pages_method_3():
print(
"\n--- Advanced Multi-Page Crawling with JavaScript Execution using `wait_for` ---"
)
print("\n--- Advanced Multi-Page Crawling with JavaScript Execution using `wait_for` ---")
async with AsyncWebCrawler(verbose=True) as crawler:
url = "https://github.com/microsoft/TypeScript/commits/main"
@@ -372,7 +319,7 @@ async def crawl_dynamic_content_pages_method_3():
const firstCommit = commits[0].textContent.trim();
return firstCommit !== window.firstCommit;
}"""
schema = {
"name": "Commit Extractor",
"baseSelector": "li.Box-sc-g0xbh4-0",
@@ -396,7 +343,7 @@ async def crawl_dynamic_content_pages_method_3():
js_code=js_next_page if page > 0 else None,
wait_for=wait_for if page > 0 else None,
js_only=page > 0,
cache_mode=CacheMode.BYPASS,
bypass_cache=True,
headless=False,
)
@@ -410,54 +357,28 @@ async def crawl_dynamic_content_pages_method_3():
await crawler.crawler_strategy.kill_session(session_id)
print(f"Successfully crawled {len(all_commits)} commits across 3 pages")
async def crawl_custom_browser_type():
# Use Firefox
start = time.time()
async with AsyncWebCrawler(
browser_type="firefox", verbose=True, headless=True
) as crawler:
result = await crawler.arun(
url="https://www.example.com", cache_mode=CacheMode.BYPASS
)
async with AsyncWebCrawler(browser_type="firefox", verbose=True, headless = True) as crawler:
result = await crawler.arun(url="https://www.example.com", bypass_cache=True)
print(result.markdown[:500])
print("Time taken: ", time.time() - start)
# Use WebKit
start = time.time()
async with AsyncWebCrawler(
browser_type="webkit", verbose=True, headless=True
) as crawler:
result = await crawler.arun(
url="https://www.example.com", cache_mode=CacheMode.BYPASS
)
async with AsyncWebCrawler(browser_type="webkit", verbose=True, headless = True) as crawler:
result = await crawler.arun(url="https://www.example.com", bypass_cache=True)
print(result.markdown[:500])
print("Time taken: ", time.time() - start)
# Use Chromium (default)
start = time.time()
async with AsyncWebCrawler(verbose=True, headless=True) as crawler:
result = await crawler.arun(
url="https://www.example.com", cache_mode=CacheMode.BYPASS
)
async with AsyncWebCrawler(verbose=True, headless = True) as crawler:
result = await crawler.arun(url="https://www.example.com", bypass_cache=True)
print(result.markdown[:500])
print("Time taken: ", time.time() - start)
async def crawl_with_user_simultion():
async with AsyncWebCrawler(verbose=True, headless=True) as crawler:
url = "YOUR-URL-HERE"
result = await crawler.arun(
url=url,
cache_mode=CacheMode.BYPASS,
magic=True, # Automatically detects and removes overlays, popups, and other elements that block content
# simulate_user = True,# Causes a series of random mouse movements and clicks to simulate user interaction
# override_navigator = True # Overrides the navigator object to make it look like a real user
)
print(result.markdown)
async def speed_comparison():
# print("\n--- Speed Comparison ---")
# print("Firecrawl (simulated):")
@@ -467,18 +388,18 @@ async def speed_comparison():
# print()
# Simulated Firecrawl performance
from firecrawl import FirecrawlApp
app = FirecrawlApp(api_key=os.environ["FIRECRAWL_API_KEY"])
app = FirecrawlApp(api_key=os.environ['FIRECRAWL_API_KEY'])
start = time.time()
scrape_status = app.scrape_url(
"https://www.nbcnews.com/business", params={"formats": ["markdown", "html"]}
'https://www.nbcnews.com/business',
params={'formats': ['markdown', 'html']}
)
end = time.time()
print("Firecrawl:")
print("Firecrawl (simulated):")
print(f"Time taken: {end - start:.2f} seconds")
print(f"Content length: {len(scrape_status['markdown'])} characters")
print(f"Images found: {scrape_status['markdown'].count('cldnry.s-nbcnews.com')}")
print()
print()
async with AsyncWebCrawler() as crawler:
# Crawl4AI simple crawl
@@ -486,7 +407,7 @@ async def speed_comparison():
result = await crawler.arun(
url="https://www.nbcnews.com/business",
word_count_threshold=0,
cache_mode=CacheMode.BYPASS,
bypass_cache=True,
verbose=False,
)
end = time.time()
@@ -496,28 +417,6 @@ async def speed_comparison():
print(f"Images found: {result.markdown.count('cldnry.s-nbcnews.com')}")
print()
# Crawl4AI with advanced content filtering
start = time.time()
result = await crawler.arun(
url="https://www.nbcnews.com/business",
word_count_threshold=0,
markdown_generator=DefaultMarkdownGenerator(
content_filter=PruningContentFilter(
threshold=0.48, threshold_type="fixed", min_word_threshold=0
)
# content_filter=BM25ContentFilter(user_query=None, bm25_threshold=1.0)
),
cache_mode=CacheMode.BYPASS,
verbose=False,
)
end = time.time()
print("Crawl4AI (Markdown Plus):")
print(f"Time taken: {end - start:.2f} seconds")
print(f"Content length: {len(result.markdown_v2.raw_markdown)} characters")
print(f"Fit Markdown: {len(result.markdown_v2.fit_markdown)} characters")
print(f"Images found: {result.markdown.count('cldnry.s-nbcnews.com')}")
print()
# Crawl4AI with JavaScript execution
start = time.time()
result = await crawler.arun(
@@ -526,20 +425,13 @@ async def speed_comparison():
"const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More')); loadMoreButton && loadMoreButton.click();"
],
word_count_threshold=0,
cache_mode=CacheMode.BYPASS,
markdown_generator=DefaultMarkdownGenerator(
content_filter=PruningContentFilter(
threshold=0.48, threshold_type="fixed", min_word_threshold=0
)
# content_filter=BM25ContentFilter(user_query=None, bm25_threshold=1.0)
),
bypass_cache=True,
verbose=False,
)
end = time.time()
print("Crawl4AI (with JavaScript execution):")
print(f"Time taken: {end - start:.2f} seconds")
print(f"Content length: {len(result.markdown)} characters")
print(f"Fit Markdown: {len(result.markdown_v2.fit_markdown)} characters")
print(f"Images found: {result.markdown.count('cldnry.s-nbcnews.com')}")
print("\nNote on Speed Comparison:")
@@ -552,123 +444,34 @@ async def speed_comparison():
print("If you run these tests in an environment with better network conditions,")
print("you may observe an even more significant speed advantage for Crawl4AI.")
async def generate_knowledge_graph():
class Entity(BaseModel):
name: str
description: str
class Relationship(BaseModel):
entity1: Entity
entity2: Entity
description: str
relation_type: str
class KnowledgeGraph(BaseModel):
entities: List[Entity]
relationships: List[Relationship]
extraction_strategy = LLMExtractionStrategy(
provider="openai/gpt-4o-mini", # Or any other provider, including Ollama and open source models
api_token=os.getenv("OPENAI_API_KEY"), # In case of Ollama just pass "no-token"
schema=KnowledgeGraph.model_json_schema(),
extraction_type="schema",
instruction="""Extract entities and relationships from the given text.""",
)
async with AsyncWebCrawler() as crawler:
url = "https://paulgraham.com/love.html"
result = await crawler.arun(
url=url,
cache_mode=CacheMode.BYPASS,
extraction_strategy=extraction_strategy,
# magic=True
)
# print(result.extracted_content)
with open(os.path.join(__location__, "kb.json"), "w") as f:
f.write(result.extracted_content)
async def fit_markdown_remove_overlay():
async with AsyncWebCrawler(
headless=True, # Set to False to see what is happening
verbose=True,
user_agent_mode="random",
user_agent_generator_config={"device_type": "mobile", "os_type": "android"},
) as crawler:
result = await crawler.arun(
url="https://www.kidocode.com/degrees/technology",
cache_mode=CacheMode.BYPASS,
markdown_generator=DefaultMarkdownGenerator(
content_filter=PruningContentFilter(
threshold=0.48, threshold_type="fixed", min_word_threshold=0
),
options={"ignore_links": True},
),
# markdown_generator=DefaultMarkdownGenerator(
# content_filter=BM25ContentFilter(user_query="", bm25_threshold=1.0),
# options={
# "ignore_links": True
# }
# ),
)
if result.success:
print(len(result.markdown_v2.raw_markdown))
print(len(result.markdown_v2.markdown_with_citations))
print(len(result.markdown_v2.fit_markdown))
# Save clean html
with open(os.path.join(__location__, "output/cleaned_html.html"), "w") as f:
f.write(result.cleaned_html)
with open(
os.path.join(__location__, "output/output_raw_markdown.md"), "w"
) as f:
f.write(result.markdown_v2.raw_markdown)
with open(
os.path.join(__location__, "output/output_markdown_with_citations.md"),
"w",
) as f:
f.write(result.markdown_v2.markdown_with_citations)
with open(
os.path.join(__location__, "output/output_fit_markdown.md"), "w"
) as f:
f.write(result.markdown_v2.fit_markdown)
print("Done")
async def main():
# await extract_structured_data_using_llm("openai/gpt-4o", os.getenv("OPENAI_API_KEY"))
# await simple_crawl()
# await simple_example_with_running_js_code()
# await simple_example_with_css_selector()
# # await use_proxy()
# await capture_and_save_screenshot("https://www.example.com", os.path.join(__location__, "tmp/example_screenshot.jpg"))
# await extract_structured_data_using_css_extractor()
await simple_crawl()
await simple_example_with_running_js_code()
await simple_example_with_css_selector()
await use_proxy()
await capture_and_save_screenshot("https://www.example.com", os.path.join(__location__, "tmp/example_screenshot.jpg"))
await extract_structured_data_using_css_extractor()
# LLM extraction examples
# await extract_structured_data_using_llm()
# await extract_structured_data_using_llm("huggingface/meta-llama/Meta-Llama-3.1-8B-Instruct", os.getenv("HUGGINGFACE_API_KEY"))
# await extract_structured_data_using_llm("ollama/llama3.2")
await extract_structured_data_using_llm()
await extract_structured_data_using_llm("huggingface/meta-llama/Meta-Llama-3.1-8B-Instruct", os.getenv("HUGGINGFACE_API_KEY"))
await extract_structured_data_using_llm("openai/gpt-4", os.getenv("OPENAI_API_KEY"))
await extract_structured_data_using_llm("ollama/llama3.2")
# You always can pass custom headers to the extraction strategy
# custom_headers = {
# "Authorization": "Bearer your-custom-token",
# "X-Custom-Header": "Some-Value"
# }
# await extract_structured_data_using_llm(extra_headers=custom_headers)
custom_headers = {
"Authorization": "Bearer your-custom-token",
"X-Custom-Header": "Some-Value"
}
await extract_structured_data_using_llm(extra_headers=custom_headers)
# await crawl_dynamic_content_pages_method_1()
# await crawl_dynamic_content_pages_method_2()
await crawl_dynamic_content_pages_method_3()
# await crawl_custom_browser_type()
# await speed_comparison()
await crawl_custom_browser_type()
await speed_comparison()
if __name__ == "__main__":

View File

@@ -10,17 +10,15 @@ from functools import lru_cache
console = Console()
@lru_cache()
def create_crawler():
crawler = WebCrawler(verbose=True)
crawler.warmup()
return crawler
def print_result(result):
# Print each key in one line and just the first 10 characters of each one's value and three dots
console.print("\t[bold]Result:[/bold]")
console.print(f"\t[bold]Result:[/bold]")
for key, value in result.model_dump().items():
if isinstance(value, str) and value:
console.print(f"\t{key}: [green]{value[:20]}...[/green]")
@@ -35,27 +33,18 @@ def cprint(message, press_any_key=False):
console.print("Press any key to continue...", style="")
input()
def basic_usage(crawler):
cprint(
"🛠️ [bold cyan]Basic Usage: Simply provide a URL and let Crawl4ai do the magic![/bold cyan]"
)
result = crawler.run(url="https://www.nbcnews.com/business", only_text=True)
cprint("🛠️ [bold cyan]Basic Usage: Simply provide a URL and let Crawl4ai do the magic![/bold cyan]")
result = crawler.run(url="https://www.nbcnews.com/business", only_text = True)
cprint("[LOG] 📦 [bold yellow]Basic crawl result:[/bold yellow]")
print_result(result)
def basic_usage_some_params(crawler):
cprint(
"🛠️ [bold cyan]Basic Usage: Simply provide a URL and let Crawl4ai do the magic![/bold cyan]"
)
result = crawler.run(
url="https://www.nbcnews.com/business", word_count_threshold=1, only_text=True
)
cprint("🛠️ [bold cyan]Basic Usage: Simply provide a URL and let Crawl4ai do the magic![/bold cyan]")
result = crawler.run(url="https://www.nbcnews.com/business", word_count_threshold=1, only_text = True)
cprint("[LOG] 📦 [bold yellow]Basic crawl result:[/bold yellow]")
print_result(result)
def screenshot_usage(crawler):
cprint("\n📸 [bold cyan]Let's take a screenshot of the page![/bold cyan]")
result = crawler.run(url="https://www.nbcnews.com/business", screenshot=True)
@@ -66,23 +55,16 @@ def screenshot_usage(crawler):
cprint("Screenshot saved to 'screenshot.png'!")
print_result(result)
def understanding_parameters(crawler):
cprint(
"\n🧠 [bold cyan]Understanding 'bypass_cache' and 'include_raw_html' parameters:[/bold cyan]"
)
cprint(
"By default, Crawl4ai caches the results of your crawls. This means that subsequent crawls of the same URL will be much faster! Let's see this in action."
)
cprint("\n🧠 [bold cyan]Understanding 'bypass_cache' and 'include_raw_html' parameters:[/bold cyan]")
cprint("By default, Crawl4ai caches the results of your crawls. This means that subsequent crawls of the same URL will be much faster! Let's see this in action.")
# First crawl (reads from cache)
cprint("1⃣ First crawl (caches the result):", True)
start_time = time.time()
result = crawler.run(url="https://www.nbcnews.com/business")
end_time = time.time()
cprint(
f"[LOG] 📦 [bold yellow]First crawl took {end_time - start_time} seconds and result (from cache):[/bold yellow]"
)
cprint(f"[LOG] 📦 [bold yellow]First crawl took {end_time - start_time} seconds and result (from cache):[/bold yellow]")
print_result(result)
# Force to crawl again
@@ -90,232 +72,169 @@ def understanding_parameters(crawler):
start_time = time.time()
result = crawler.run(url="https://www.nbcnews.com/business", bypass_cache=True)
end_time = time.time()
cprint(
f"[LOG] 📦 [bold yellow]Second crawl took {end_time - start_time} seconds and result (forced to crawl):[/bold yellow]"
)
cprint(f"[LOG] 📦 [bold yellow]Second crawl took {end_time - start_time} seconds and result (forced to crawl):[/bold yellow]")
print_result(result)
def add_chunking_strategy(crawler):
# Adding a chunking strategy: RegexChunking
cprint(
"\n🧩 [bold cyan]Let's add a chunking strategy: RegexChunking![/bold cyan]",
True,
)
cprint(
"RegexChunking is a simple chunking strategy that splits the text based on a given regex pattern. Let's see it in action!"
)
cprint("\n🧩 [bold cyan]Let's add a chunking strategy: RegexChunking![/bold cyan]", True)
cprint("RegexChunking is a simple chunking strategy that splits the text based on a given regex pattern. Let's see it in action!")
result = crawler.run(
url="https://www.nbcnews.com/business",
chunking_strategy=RegexChunking(patterns=["\n\n"]),
chunking_strategy=RegexChunking(patterns=["\n\n"])
)
cprint("[LOG] 📦 [bold yellow]RegexChunking result:[/bold yellow]")
print_result(result)
# Adding another chunking strategy: NlpSentenceChunking
cprint(
"\n🔍 [bold cyan]Time to explore another chunking strategy: NlpSentenceChunking![/bold cyan]",
True,
)
cprint(
"NlpSentenceChunking uses NLP techniques to split the text into sentences. Let's see how it performs!"
)
cprint("\n🔍 [bold cyan]Time to explore another chunking strategy: NlpSentenceChunking![/bold cyan]", True)
cprint("NlpSentenceChunking uses NLP techniques to split the text into sentences. Let's see how it performs!")
result = crawler.run(
url="https://www.nbcnews.com/business", chunking_strategy=NlpSentenceChunking()
url="https://www.nbcnews.com/business",
chunking_strategy=NlpSentenceChunking()
)
cprint("[LOG] 📦 [bold yellow]NlpSentenceChunking result:[/bold yellow]")
print_result(result)
def add_extraction_strategy(crawler):
# Adding an extraction strategy: CosineStrategy
cprint(
"\n🧠 [bold cyan]Let's get smarter with an extraction strategy: CosineStrategy![/bold cyan]",
True,
)
cprint(
"CosineStrategy uses cosine similarity to extract semantically similar blocks of text. Let's see it in action!"
)
cprint("\n🧠 [bold cyan]Let's get smarter with an extraction strategy: CosineStrategy![/bold cyan]", True)
cprint("CosineStrategy uses cosine similarity to extract semantically similar blocks of text. Let's see it in action!")
result = crawler.run(
url="https://www.nbcnews.com/business",
extraction_strategy=CosineStrategy(
word_count_threshold=10,
max_dist=0.2,
linkage_method="ward",
top_k=3,
sim_threshold=0.3,
verbose=True,
),
extraction_strategy=CosineStrategy(word_count_threshold=10, max_dist=0.2, linkage_method="ward", top_k=3, sim_threshold = 0.3, verbose=True)
)
cprint("[LOG] 📦 [bold yellow]CosineStrategy result:[/bold yellow]")
print_result(result)
# Using semantic_filter with CosineStrategy
cprint(
"You can pass other parameters like 'semantic_filter' to the CosineStrategy to extract semantically similar blocks of text. Let's see it in action!"
)
cprint("You can pass other parameters like 'semantic_filter' to the CosineStrategy to extract semantically similar blocks of text. Let's see it in action!")
result = crawler.run(
url="https://www.nbcnews.com/business",
extraction_strategy=CosineStrategy(
semantic_filter="inflation rent prices",
),
)
cprint(
"[LOG] 📦 [bold yellow]CosineStrategy result with semantic filter:[/bold yellow]"
)
)
cprint("[LOG] 📦 [bold yellow]CosineStrategy result with semantic filter:[/bold yellow]")
print_result(result)
def add_llm_extraction_strategy(crawler):
# Adding an LLM extraction strategy without instructions
cprint(
"\n🤖 [bold cyan]Time to bring in the big guns: LLMExtractionStrategy without instructions![/bold cyan]",
True,
)
cprint(
"LLMExtractionStrategy uses a large language model to extract relevant information from the web page. Let's see it in action!"
)
cprint("\n🤖 [bold cyan]Time to bring in the big guns: LLMExtractionStrategy without instructions![/bold cyan]", True)
cprint("LLMExtractionStrategy uses a large language model to extract relevant information from the web page. Let's see it in action!")
result = crawler.run(
url="https://www.nbcnews.com/business",
extraction_strategy=LLMExtractionStrategy(
provider="openai/gpt-4o", api_token=os.getenv("OPENAI_API_KEY")
),
)
cprint(
"[LOG] 📦 [bold yellow]LLMExtractionStrategy (no instructions) result:[/bold yellow]"
extraction_strategy=LLMExtractionStrategy(provider="openai/gpt-4o", api_token=os.getenv('OPENAI_API_KEY'))
)
cprint("[LOG] 📦 [bold yellow]LLMExtractionStrategy (no instructions) result:[/bold yellow]")
print_result(result)
# Adding an LLM extraction strategy with instructions
cprint(
"\n📜 [bold cyan]Let's make it even more interesting: LLMExtractionStrategy with instructions![/bold cyan]",
True,
)
cprint(
"Let's say we are only interested in financial news. Let's see how LLMExtractionStrategy performs with instructions!"
)
cprint("\n📜 [bold cyan]Let's make it even more interesting: LLMExtractionStrategy with instructions![/bold cyan]", True)
cprint("Let's say we are only interested in financial news. Let's see how LLMExtractionStrategy performs with instructions!")
result = crawler.run(
url="https://www.nbcnews.com/business",
extraction_strategy=LLMExtractionStrategy(
provider="openai/gpt-4o",
api_token=os.getenv("OPENAI_API_KEY"),
instruction="I am interested in only financial news",
),
)
cprint(
"[LOG] 📦 [bold yellow]LLMExtractionStrategy (with instructions) result:[/bold yellow]"
api_token=os.getenv('OPENAI_API_KEY'),
instruction="I am interested in only financial news"
)
)
cprint("[LOG] 📦 [bold yellow]LLMExtractionStrategy (with instructions) result:[/bold yellow]")
print_result(result)
result = crawler.run(
url="https://www.nbcnews.com/business",
extraction_strategy=LLMExtractionStrategy(
provider="openai/gpt-4o",
api_token=os.getenv("OPENAI_API_KEY"),
instruction="Extract only content related to technology",
),
)
cprint(
"[LOG] 📦 [bold yellow]LLMExtractionStrategy (with technology instruction) result:[/bold yellow]"
api_token=os.getenv('OPENAI_API_KEY'),
instruction="Extract only content related to technology"
)
)
cprint("[LOG] 📦 [bold yellow]LLMExtractionStrategy (with technology instruction) result:[/bold yellow]")
print_result(result)
def targeted_extraction(crawler):
# Using a CSS selector to extract only H2 tags
cprint(
"\n🎯 [bold cyan]Targeted extraction: Let's use a CSS selector to extract only H2 tags![/bold cyan]",
True,
cprint("\n🎯 [bold cyan]Targeted extraction: Let's use a CSS selector to extract only H2 tags![/bold cyan]", True)
result = crawler.run(
url="https://www.nbcnews.com/business",
css_selector="h2"
)
result = crawler.run(url="https://www.nbcnews.com/business", css_selector="h2")
cprint("[LOG] 📦 [bold yellow]CSS Selector (H2 tags) result:[/bold yellow]")
print_result(result)
def interactive_extraction(crawler):
# Passing JavaScript code to interact with the page
cprint(
"\n🖱️ [bold cyan]Let's get interactive: Passing JavaScript code to click 'Load More' button![/bold cyan]",
True,
)
cprint(
"In this example we try to click the 'Load More' button on the page using JavaScript code."
)
cprint("\n🖱️ [bold cyan]Let's get interactive: Passing JavaScript code to click 'Load More' button![/bold cyan]", True)
cprint("In this example we try to click the 'Load More' button on the page using JavaScript code.")
js_code = """
const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More'));
loadMoreButton && loadMoreButton.click();
"""
# crawler_strategy = LocalSeleniumCrawlerStrategy(js_code=js_code)
# crawler = WebCrawler(crawler_strategy=crawler_strategy, always_by_pass_cache=True)
result = crawler.run(url="https://www.nbcnews.com/business", js=js_code)
cprint(
"[LOG] 📦 [bold yellow]JavaScript Code (Load More button) result:[/bold yellow]"
result = crawler.run(
url="https://www.nbcnews.com/business",
js = js_code
)
cprint("[LOG] 📦 [bold yellow]JavaScript Code (Load More button) result:[/bold yellow]")
print_result(result)
def multiple_scrip(crawler):
# Passing JavaScript code to interact with the page
cprint(
"\n🖱️ [bold cyan]Let's get interactive: Passing JavaScript code to click 'Load More' button![/bold cyan]",
True,
)
cprint(
"In this example we try to click the 'Load More' button on the page using JavaScript code."
)
js_code = [
"""
cprint("\n🖱️ [bold cyan]Let's get interactive: Passing JavaScript code to click 'Load More' button![/bold cyan]", True)
cprint("In this example we try to click the 'Load More' button on the page using JavaScript code.")
js_code = ["""
const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More'));
loadMoreButton && loadMoreButton.click();
"""
] * 2
"""] * 2
# crawler_strategy = LocalSeleniumCrawlerStrategy(js_code=js_code)
# crawler = WebCrawler(crawler_strategy=crawler_strategy, always_by_pass_cache=True)
result = crawler.run(url="https://www.nbcnews.com/business", js=js_code)
cprint(
"[LOG] 📦 [bold yellow]JavaScript Code (Load More button) result:[/bold yellow]"
result = crawler.run(
url="https://www.nbcnews.com/business",
js = js_code
)
cprint("[LOG] 📦 [bold yellow]JavaScript Code (Load More button) result:[/bold yellow]")
print_result(result)
def using_crawler_hooks(crawler):
# Example usage of the hooks for authentication and setting a cookie
def on_driver_created(driver):
print("[HOOK] on_driver_created")
# Example customization: maximize the window
driver.maximize_window()
# Example customization: logging in to a hypothetical website
driver.get("https://example.com/login")
driver.get('https://example.com/login')
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
WebDriverWait(driver, 10).until(
EC.presence_of_element_located((By.NAME, "username"))
EC.presence_of_element_located((By.NAME, 'username'))
)
driver.find_element(By.NAME, "username").send_keys("testuser")
driver.find_element(By.NAME, "password").send_keys("password123")
driver.find_element(By.NAME, "login").click()
driver.find_element(By.NAME, 'username').send_keys('testuser')
driver.find_element(By.NAME, 'password').send_keys('password123')
driver.find_element(By.NAME, 'login').click()
WebDriverWait(driver, 10).until(
EC.presence_of_element_located((By.ID, "welcome"))
EC.presence_of_element_located((By.ID, 'welcome'))
)
# Add a custom cookie
driver.add_cookie({"name": "test_cookie", "value": "cookie_value"})
return driver
driver.add_cookie({'name': 'test_cookie', 'value': 'cookie_value'})
return driver
def before_get_url(driver):
print("[HOOK] before_get_url")
# Example customization: add a custom header
# Enable Network domain for sending headers
driver.execute_cdp_cmd("Network.enable", {})
driver.execute_cdp_cmd('Network.enable', {})
# Add a custom header
driver.execute_cdp_cmd(
"Network.setExtraHTTPHeaders", {"headers": {"X-Test-Header": "test"}}
)
driver.execute_cdp_cmd('Network.setExtraHTTPHeaders', {'headers': {'X-Test-Header': 'test'}})
return driver
def after_get_url(driver):
print("[HOOK] after_get_url")
# Example customization: log the URL
@@ -327,59 +246,48 @@ def using_crawler_hooks(crawler):
# Example customization: log the HTML
print(len(html))
return driver
cprint(
"\n🔗 [bold cyan]Using Crawler Hooks: Let's see how we can customize the crawler using hooks![/bold cyan]",
True,
)
cprint("\n🔗 [bold cyan]Using Crawler Hooks: Let's see how we can customize the crawler using hooks![/bold cyan]", True)
crawler_strategy = LocalSeleniumCrawlerStrategy(verbose=True)
crawler_strategy.set_hook("on_driver_created", on_driver_created)
crawler_strategy.set_hook("before_get_url", before_get_url)
crawler_strategy.set_hook("after_get_url", after_get_url)
crawler_strategy.set_hook("before_return_html", before_return_html)
crawler_strategy.set_hook('on_driver_created', on_driver_created)
crawler_strategy.set_hook('before_get_url', before_get_url)
crawler_strategy.set_hook('after_get_url', after_get_url)
crawler_strategy.set_hook('before_return_html', before_return_html)
crawler = WebCrawler(verbose=True, crawler_strategy=crawler_strategy)
crawler.warmup()
crawler.warmup()
result = crawler.run(url="https://example.com")
cprint("[LOG] 📦 [bold yellow]Crawler Hooks result:[/bold yellow]")
print_result(result=result)
print_result(result= result)
def using_crawler_hooks_dleay_example(crawler):
def delay(driver):
print("Delaying for 5 seconds...")
time.sleep(5)
print("Resuming...")
def create_crawler():
crawler_strategy = LocalSeleniumCrawlerStrategy(verbose=True)
crawler_strategy.set_hook("after_get_url", delay)
crawler_strategy.set_hook('after_get_url', delay)
crawler = WebCrawler(verbose=True, crawler_strategy=crawler_strategy)
crawler.warmup()
return crawler
cprint(
"\n🔗 [bold cyan]Using Crawler Hooks: Let's add a delay after fetching the url to make sure entire page is fetched.[/bold cyan]"
)
cprint("\n🔗 [bold cyan]Using Crawler Hooks: Let's add a delay after fetching the url to make sure entire page is fetched.[/bold cyan]")
crawler = create_crawler()
result = crawler.run(url="https://google.com", bypass_cache=True)
result = crawler.run(url="https://google.com", bypass_cache=True)
cprint("[LOG] 📦 [bold yellow]Crawler Hooks result:[/bold yellow]")
print_result(result)
def main():
cprint(
"🌟 [bold green]Welcome to the Crawl4ai Quickstart Guide! Let's dive into some web crawling fun! 🌐[/bold green]"
)
cprint(
"⛳️ [bold cyan]First Step: Create an instance of WebCrawler and call the `warmup()` function.[/bold cyan]"
)
cprint(
"If this is the first time you're running Crawl4ai, this might take a few seconds to load required model files."
)
cprint("🌟 [bold green]Welcome to the Crawl4ai Quickstart Guide! Let's dive into some web crawling fun! 🌐[/bold green]")
cprint("⛳️ [bold cyan]First Step: Create an instance of WebCrawler and call the `warmup()` function.[/bold cyan]")
cprint("If this is the first time you're running Crawl4ai, this might take a few seconds to load required model files.")
crawler = create_crawler()
@@ -387,7 +295,7 @@ def main():
basic_usage(crawler)
# basic_usage_some_params(crawler)
understanding_parameters(crawler)
crawler.always_by_pass_cache = True
screenshot_usage(crawler)
add_chunking_strategy(crawler)
@@ -397,10 +305,8 @@ def main():
interactive_extraction(crawler)
multiple_scrip(crawler)
cprint(
"\n🎉 [bold green]Congratulations! You've made it through the Crawl4ai Quickstart Guide! Now go forth and crawl the web like a pro! 🕸️[/bold green]"
)
cprint("\n🎉 [bold green]Congratulations! You've made it through the Crawl4ai Quickstart Guide! Now go forth and crawl the web like a pro! 🕸️[/bold green]")
if __name__ == "__main__":
main()

View File

@@ -1,735 +0,0 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "6yLvrXn7yZQI"
},
"source": [
"# Crawl4AI: Advanced Web Crawling and Data Extraction\n",
"\n",
"Welcome to this interactive notebook showcasing Crawl4AI, an advanced asynchronous web crawling and data extraction library.\n",
"\n",
"- GitHub Repository: [https://github.com/unclecode/crawl4ai](https://github.com/unclecode/crawl4ai)\n",
"- Twitter: [@unclecode](https://twitter.com/unclecode)\n",
"- Website: [https://crawl4ai.com](https://crawl4ai.com)\n",
"\n",
"Let's explore the powerful features of Crawl4AI!"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "KIn_9nxFyZQK"
},
"source": [
"## Installation\n",
"\n",
"First, let's install Crawl4AI from GitHub:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "mSnaxLf3zMog"
},
"outputs": [],
"source": [
"!sudo apt-get update && sudo apt-get install -y libwoff1 libopus0 libwebp6 libwebpdemux2 libenchant1c2a libgudev-1.0-0 libsecret-1-0 libhyphen0 libgdk-pixbuf2.0-0 libegl1 libnotify4 libxslt1.1 libevent-2.1-7 libgles2 libvpx6 libxcomposite1 libatk1.0-0 libatk-bridge2.0-0 libepoxy0 libgtk-3-0 libharfbuzz-icu0"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "xlXqaRtayZQK"
},
"outputs": [],
"source": [
"!pip install crawl4ai\n",
"!pip install nest-asyncio\n",
"!playwright install"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "qKCE7TI7yZQL"
},
"source": [
"Now, let's import the necessary libraries:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"id": "I67tr7aAyZQL"
},
"outputs": [],
"source": [
"import asyncio\n",
"import nest_asyncio\n",
"from crawl4ai import AsyncWebCrawler\n",
"from crawl4ai.extraction_strategy import JsonCssExtractionStrategy, LLMExtractionStrategy\n",
"import json\n",
"import time\n",
"from pydantic import BaseModel, Field\n",
"\n",
"nest_asyncio.apply()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "h7yR_Rt_yZQM"
},
"source": [
"## Basic Usage\n",
"\n",
"Let's start with a simple crawl example:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "yBh6hf4WyZQM",
"outputId": "0f83af5c-abba-4175-ed95-70b7512e6bcc"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[LOG] 🌤️ Warming up the AsyncWebCrawler\n",
"[LOG] 🌞 AsyncWebCrawler is ready to crawl\n",
"[LOG] 🚀 Content extracted for https://www.nbcnews.com/business, success: True, time taken: 0.05 seconds\n",
"[LOG] 🚀 Extraction done for https://www.nbcnews.com/business, time taken: 0.05 seconds.\n",
"18102\n"
]
}
],
"source": [
"async def simple_crawl():\n",
" async with AsyncWebCrawler(verbose=True) as crawler:\n",
" result = await crawler.arun(url=\"https://www.nbcnews.com/business\")\n",
" print(len(result.markdown))\n",
"await simple_crawl()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "9rtkgHI28uI4"
},
"source": [
"💡 By default, **Crawl4AI** caches the result of every URL, so the next time you call it, youll get an instant result. But if you want to bypass the cache, just set `bypass_cache=True`."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "MzZ0zlJ9yZQM"
},
"source": [
"## Advanced Features\n",
"\n",
"### Executing JavaScript and Using CSS Selectors"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "gHStF86xyZQM",
"outputId": "34d0fb6d-4dec-4677-f76e-85a1f082829b"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[LOG] 🌤️ Warming up the AsyncWebCrawler\n",
"[LOG] 🌞 AsyncWebCrawler is ready to crawl\n",
"[LOG] 🕸️ Crawling https://www.nbcnews.com/business using AsyncPlaywrightCrawlerStrategy...\n",
"[LOG] ✅ Crawled https://www.nbcnews.com/business successfully!\n",
"[LOG] 🚀 Crawling done for https://www.nbcnews.com/business, success: True, time taken: 6.06 seconds\n",
"[LOG] 🚀 Content extracted for https://www.nbcnews.com/business, success: True, time taken: 0.10 seconds\n",
"[LOG] 🔥 Extracting semantic blocks for https://www.nbcnews.com/business, Strategy: AsyncWebCrawler\n",
"[LOG] 🚀 Extraction done for https://www.nbcnews.com/business, time taken: 0.11 seconds.\n",
"41135\n"
]
}
],
"source": [
"async def js_and_css():\n",
" async with AsyncWebCrawler(verbose=True) as crawler:\n",
" js_code = [\"const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More')); loadMoreButton && loadMoreButton.click();\"]\n",
" result = await crawler.arun(\n",
" url=\"https://www.nbcnews.com/business\",\n",
" js_code=js_code,\n",
" # css_selector=\"YOUR_CSS_SELECTOR_HERE\",\n",
" bypass_cache=True\n",
" )\n",
" print(len(result.markdown))\n",
"\n",
"await js_and_css()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "cqE_W4coyZQM"
},
"source": [
"### Using a Proxy\n",
"\n",
"Note: You'll need to replace the proxy URL with a working proxy for this example to run successfully."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "QjAyiAGqyZQM"
},
"outputs": [],
"source": [
"async def use_proxy():\n",
" async with AsyncWebCrawler(verbose=True, proxy=\"http://your-proxy-url:port\") as crawler:\n",
" result = await crawler.arun(\n",
" url=\"https://www.nbcnews.com/business\",\n",
" bypass_cache=True\n",
" )\n",
" print(result.markdown[:500]) # Print first 500 characters\n",
"\n",
"# Uncomment the following line to run the proxy example\n",
"# await use_proxy()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "XTZ88lbayZQN"
},
"source": [
"### Extracting Structured Data with OpenAI\n",
"\n",
"Note: You'll need to set your OpenAI API key as an environment variable for this example to work."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "fIOlDayYyZQN",
"outputId": "cb8359cc-dee0-4762-9698-5dfdcee055b8"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[LOG] 🌤️ Warming up the AsyncWebCrawler\n",
"[LOG] 🌞 AsyncWebCrawler is ready to crawl\n",
"[LOG] 🕸️ Crawling https://openai.com/api/pricing/ using AsyncPlaywrightCrawlerStrategy...\n",
"[LOG] ✅ Crawled https://openai.com/api/pricing/ successfully!\n",
"[LOG] 🚀 Crawling done for https://openai.com/api/pricing/, success: True, time taken: 3.77 seconds\n",
"[LOG] 🚀 Content extracted for https://openai.com/api/pricing/, success: True, time taken: 0.21 seconds\n",
"[LOG] 🔥 Extracting semantic blocks for https://openai.com/api/pricing/, Strategy: AsyncWebCrawler\n",
"[LOG] Call LLM for https://openai.com/api/pricing/ - block index: 0\n",
"[LOG] Call LLM for https://openai.com/api/pricing/ - block index: 1\n",
"[LOG] Call LLM for https://openai.com/api/pricing/ - block index: 2\n",
"[LOG] Call LLM for https://openai.com/api/pricing/ - block index: 3\n",
"[LOG] Extracted 4 blocks from URL: https://openai.com/api/pricing/ block index: 3\n",
"[LOG] Call LLM for https://openai.com/api/pricing/ - block index: 4\n",
"[LOG] Extracted 5 blocks from URL: https://openai.com/api/pricing/ block index: 0\n",
"[LOG] Extracted 1 blocks from URL: https://openai.com/api/pricing/ block index: 4\n",
"[LOG] Extracted 8 blocks from URL: https://openai.com/api/pricing/ block index: 1\n",
"[LOG] Extracted 12 blocks from URL: https://openai.com/api/pricing/ block index: 2\n",
"[LOG] 🚀 Extraction done for https://openai.com/api/pricing/, time taken: 8.55 seconds.\n",
"5029\n"
]
}
],
"source": [
"import os\n",
"from google.colab import userdata\n",
"os.environ['OPENAI_API_KEY'] = userdata.get('OPENAI_API_KEY')\n",
"\n",
"class OpenAIModelFee(BaseModel):\n",
" model_name: str = Field(..., description=\"Name of the OpenAI model.\")\n",
" input_fee: str = Field(..., description=\"Fee for input token for the OpenAI model.\")\n",
" output_fee: str = Field(..., description=\"Fee for output token for the OpenAI model.\")\n",
"\n",
"async def extract_openai_fees():\n",
" async with AsyncWebCrawler(verbose=True) as crawler:\n",
" result = await crawler.arun(\n",
" url='https://openai.com/api/pricing/',\n",
" word_count_threshold=1,\n",
" extraction_strategy=LLMExtractionStrategy(\n",
" provider=\"openai/gpt-4o\", api_token=os.getenv('OPENAI_API_KEY'),\n",
" schema=OpenAIModelFee.schema(),\n",
" extraction_type=\"schema\",\n",
" instruction=\"\"\"From the crawled content, extract all mentioned model names along with their fees for input and output tokens.\n",
" Do not miss any models in the entire content. One extracted model JSON format should look like this:\n",
" {\"model_name\": \"GPT-4\", \"input_fee\": \"US$10.00 / 1M tokens\", \"output_fee\": \"US$30.00 / 1M tokens\"}.\"\"\"\n",
" ),\n",
" bypass_cache=True,\n",
" )\n",
" print(len(result.extracted_content))\n",
"\n",
"# Uncomment the following line to run the OpenAI extraction example\n",
"await extract_openai_fees()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "BypA5YxEyZQN"
},
"source": [
"### Advanced Multi-Page Crawling with JavaScript Execution"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "tfkcVQ0b7mw-"
},
"source": [
"## Advanced Multi-Page Crawling with JavaScript Execution\n",
"\n",
"This example demonstrates Crawl4AI's ability to handle complex crawling scenarios, specifically extracting commits from multiple pages of a GitHub repository. The challenge here is that clicking the \"Next\" button doesn't load a new page, but instead uses asynchronous JavaScript to update the content. This is a common hurdle in modern web crawling.\n",
"\n",
"To overcome this, we use Crawl4AI's custom JavaScript execution to simulate clicking the \"Next\" button, and implement a custom hook to detect when new data has loaded. Our strategy involves comparing the first commit's text before and after \"clicking\" Next, waiting until it changes to confirm new data has rendered. This showcases Crawl4AI's flexibility in handling dynamic content and its ability to implement custom logic for even the most challenging crawling tasks."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "qUBKGpn3yZQN",
"outputId": "3e555b6a-ed33-42f4-cce9-499a923fbe17"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[LOG] 🌤️ Warming up the AsyncWebCrawler\n",
"[LOG] 🌞 AsyncWebCrawler is ready to crawl\n",
"[LOG] 🕸️ Crawling https://github.com/microsoft/TypeScript/commits/main using AsyncPlaywrightCrawlerStrategy...\n",
"[LOG] ✅ Crawled https://github.com/microsoft/TypeScript/commits/main successfully!\n",
"[LOG] 🚀 Crawling done for https://github.com/microsoft/TypeScript/commits/main, success: True, time taken: 5.16 seconds\n",
"[LOG] 🚀 Content extracted for https://github.com/microsoft/TypeScript/commits/main, success: True, time taken: 0.28 seconds\n",
"[LOG] 🔥 Extracting semantic blocks for https://github.com/microsoft/TypeScript/commits/main, Strategy: AsyncWebCrawler\n",
"[LOG] 🚀 Extraction done for https://github.com/microsoft/TypeScript/commits/main, time taken: 0.28 seconds.\n",
"Page 1: Found 35 commits\n",
"[LOG] 🕸️ Crawling https://github.com/microsoft/TypeScript/commits/main using AsyncPlaywrightCrawlerStrategy...\n",
"[LOG] ✅ Crawled https://github.com/microsoft/TypeScript/commits/main successfully!\n",
"[LOG] 🚀 Crawling done for https://github.com/microsoft/TypeScript/commits/main, success: True, time taken: 0.78 seconds\n",
"[LOG] 🚀 Content extracted for https://github.com/microsoft/TypeScript/commits/main, success: True, time taken: 0.90 seconds\n",
"[LOG] 🔥 Extracting semantic blocks for https://github.com/microsoft/TypeScript/commits/main, Strategy: AsyncWebCrawler\n",
"[LOG] 🚀 Extraction done for https://github.com/microsoft/TypeScript/commits/main, time taken: 0.90 seconds.\n",
"Page 2: Found 35 commits\n",
"[LOG] 🕸️ Crawling https://github.com/microsoft/TypeScript/commits/main using AsyncPlaywrightCrawlerStrategy...\n",
"[LOG] ✅ Crawled https://github.com/microsoft/TypeScript/commits/main successfully!\n",
"[LOG] 🚀 Crawling done for https://github.com/microsoft/TypeScript/commits/main, success: True, time taken: 2.00 seconds\n",
"[LOG] 🚀 Content extracted for https://github.com/microsoft/TypeScript/commits/main, success: True, time taken: 0.74 seconds\n",
"[LOG] 🔥 Extracting semantic blocks for https://github.com/microsoft/TypeScript/commits/main, Strategy: AsyncWebCrawler\n",
"[LOG] 🚀 Extraction done for https://github.com/microsoft/TypeScript/commits/main, time taken: 0.75 seconds.\n",
"Page 3: Found 35 commits\n",
"Successfully crawled 105 commits across 3 pages\n"
]
}
],
"source": [
"import re\n",
"from bs4 import BeautifulSoup\n",
"\n",
"async def crawl_typescript_commits():\n",
" first_commit = \"\"\n",
" async def on_execution_started(page):\n",
" nonlocal first_commit\n",
" try:\n",
" while True:\n",
" await page.wait_for_selector('li.Box-sc-g0xbh4-0 h4')\n",
" commit = await page.query_selector('li.Box-sc-g0xbh4-0 h4')\n",
" commit = await commit.evaluate('(element) => element.textContent')\n",
" commit = re.sub(r'\\s+', '', commit)\n",
" if commit and commit != first_commit:\n",
" first_commit = commit\n",
" break\n",
" await asyncio.sleep(0.5)\n",
" except Exception as e:\n",
" print(f\"Warning: New content didn't appear after JavaScript execution: {e}\")\n",
"\n",
" async with AsyncWebCrawler(verbose=True) as crawler:\n",
" crawler.crawler_strategy.set_hook('on_execution_started', on_execution_started)\n",
"\n",
" url = \"https://github.com/microsoft/TypeScript/commits/main\"\n",
" session_id = \"typescript_commits_session\"\n",
" all_commits = []\n",
"\n",
" js_next_page = \"\"\"\n",
" const button = document.querySelector('a[data-testid=\"pagination-next-button\"]');\n",
" if (button) button.click();\n",
" \"\"\"\n",
"\n",
" for page in range(3): # Crawl 3 pages\n",
" result = await crawler.arun(\n",
" url=url,\n",
" session_id=session_id,\n",
" css_selector=\"li.Box-sc-g0xbh4-0\",\n",
" js=js_next_page if page > 0 else None,\n",
" bypass_cache=True,\n",
" js_only=page > 0\n",
" )\n",
"\n",
" assert result.success, f\"Failed to crawl page {page + 1}\"\n",
"\n",
" soup = BeautifulSoup(result.cleaned_html, 'html.parser')\n",
" commits = soup.select(\"li\")\n",
" all_commits.extend(commits)\n",
"\n",
" print(f\"Page {page + 1}: Found {len(commits)} commits\")\n",
"\n",
" await crawler.crawler_strategy.kill_session(session_id)\n",
" print(f\"Successfully crawled {len(all_commits)} commits across 3 pages\")\n",
"\n",
"await crawl_typescript_commits()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "EJRnYsp6yZQN"
},
"source": [
"### Using JsonCssExtractionStrategy for Fast Structured Output"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "1ZMqIzB_8SYp"
},
"source": [
"The JsonCssExtractionStrategy is a powerful feature of Crawl4AI that allows for precise, structured data extraction from web pages. Here's how it works:\n",
"\n",
"1. You define a schema that describes the pattern of data you're interested in extracting.\n",
"2. The schema includes a base selector that identifies repeating elements on the page.\n",
"3. Within the schema, you define fields, each with its own selector and type.\n",
"4. These field selectors are applied within the context of each base selector element.\n",
"5. The strategy supports nested structures, lists within lists, and various data types.\n",
"6. You can even include computed fields for more complex data manipulation.\n",
"\n",
"This approach allows for highly flexible and precise data extraction, transforming semi-structured web content into clean, structured JSON data. It's particularly useful for extracting consistent data patterns from pages like product listings, news articles, or search results.\n",
"\n",
"For more details and advanced usage, check out the full documentation on the Crawl4AI website."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "trCMR2T9yZQN",
"outputId": "718d36f4-cccf-40f4-8d8c-c3ba73524d16"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[LOG] 🌤️ Warming up the AsyncWebCrawler\n",
"[LOG] 🌞 AsyncWebCrawler is ready to crawl\n",
"[LOG] 🕸️ Crawling https://www.nbcnews.com/business using AsyncPlaywrightCrawlerStrategy...\n",
"[LOG] ✅ Crawled https://www.nbcnews.com/business successfully!\n",
"[LOG] 🚀 Crawling done for https://www.nbcnews.com/business, success: True, time taken: 7.00 seconds\n",
"[LOG] 🚀 Content extracted for https://www.nbcnews.com/business, success: True, time taken: 0.32 seconds\n",
"[LOG] 🔥 Extracting semantic blocks for https://www.nbcnews.com/business, Strategy: AsyncWebCrawler\n",
"[LOG] 🚀 Extraction done for https://www.nbcnews.com/business, time taken: 0.48 seconds.\n",
"Successfully extracted 11 news teasers\n",
"{\n",
" \"category\": \"Business News\",\n",
" \"headline\": \"NBC ripped up its Olympics playbook for 2024 \\u2014 so far, the new strategy paid off\",\n",
" \"summary\": \"The Olympics have long been key to NBCUniversal. Paris marked the 18th Olympic Games broadcast by NBC in the U.S.\",\n",
" \"time\": \"13h ago\",\n",
" \"image\": {\n",
" \"src\": \"https://media-cldnry.s-nbcnews.com/image/upload/t_focal-200x100,f_auto,q_auto:best/rockcms/2024-09/240903-nbc-olympics-ch-1344-c7a486.jpg\",\n",
" \"alt\": \"Mike Tirico.\"\n",
" },\n",
" \"link\": \"https://www.nbcnews.com/business\"\n",
"}\n"
]
}
],
"source": [
"async def extract_news_teasers():\n",
" schema = {\n",
" \"name\": \"News Teaser Extractor\",\n",
" \"baseSelector\": \".wide-tease-item__wrapper\",\n",
" \"fields\": [\n",
" {\n",
" \"name\": \"category\",\n",
" \"selector\": \".unibrow span[data-testid='unibrow-text']\",\n",
" \"type\": \"text\",\n",
" },\n",
" {\n",
" \"name\": \"headline\",\n",
" \"selector\": \".wide-tease-item__headline\",\n",
" \"type\": \"text\",\n",
" },\n",
" {\n",
" \"name\": \"summary\",\n",
" \"selector\": \".wide-tease-item__description\",\n",
" \"type\": \"text\",\n",
" },\n",
" {\n",
" \"name\": \"time\",\n",
" \"selector\": \"[data-testid='wide-tease-date']\",\n",
" \"type\": \"text\",\n",
" },\n",
" {\n",
" \"name\": \"image\",\n",
" \"type\": \"nested\",\n",
" \"selector\": \"picture.teasePicture img\",\n",
" \"fields\": [\n",
" {\"name\": \"src\", \"type\": \"attribute\", \"attribute\": \"src\"},\n",
" {\"name\": \"alt\", \"type\": \"attribute\", \"attribute\": \"alt\"},\n",
" ],\n",
" },\n",
" {\n",
" \"name\": \"link\",\n",
" \"selector\": \"a[href]\",\n",
" \"type\": \"attribute\",\n",
" \"attribute\": \"href\",\n",
" },\n",
" ],\n",
" }\n",
"\n",
" extraction_strategy = JsonCssExtractionStrategy(schema, verbose=True)\n",
"\n",
" async with AsyncWebCrawler(verbose=True) as crawler:\n",
" result = await crawler.arun(\n",
" url=\"https://www.nbcnews.com/business\",\n",
" extraction_strategy=extraction_strategy,\n",
" bypass_cache=True,\n",
" )\n",
"\n",
" assert result.success, \"Failed to crawl the page\"\n",
"\n",
" news_teasers = json.loads(result.extracted_content)\n",
" print(f\"Successfully extracted {len(news_teasers)} news teasers\")\n",
" print(json.dumps(news_teasers[0], indent=2))\n",
"\n",
"await extract_news_teasers()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "FnyVhJaByZQN"
},
"source": [
"## Speed Comparison\n",
"\n",
"Let's compare the speed of Crawl4AI with Firecrawl, a paid service. Note that we can't run Firecrawl in this Colab environment, so we'll simulate its performance based on previously recorded data."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "agDD186f3wig"
},
"source": [
"💡 **Note on Speed Comparison:**\n",
"\n",
"The speed test conducted here is running on Google Colab, where the internet speed and performance can vary and may not reflect optimal conditions. When we call Firecrawl's API, we're seeing its best performance, while Crawl4AI's performance is limited by Colab's network speed.\n",
"\n",
"For a more accurate comparison, it's recommended to run these tests on your own servers or computers with a stable and fast internet connection. Despite these limitations, Crawl4AI still demonstrates faster performance in this environment.\n",
"\n",
"If you run these tests locally, you may observe an even more significant speed advantage for Crawl4AI compared to other services."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "F7KwHv8G1LbY"
},
"outputs": [],
"source": [
"!pip install firecrawl"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "91813zILyZQN",
"outputId": "663223db-ab89-4976-b233-05ceca62b19b"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Firecrawl (simulated):\n",
"Time taken: 4.38 seconds\n",
"Content length: 41967 characters\n",
"Images found: 49\n",
"\n",
"Crawl4AI (simple crawl):\n",
"Time taken: 4.22 seconds\n",
"Content length: 18221 characters\n",
"Images found: 49\n",
"\n",
"Crawl4AI (with JavaScript execution):\n",
"Time taken: 9.13 seconds\n",
"Content length: 34243 characters\n",
"Images found: 89\n"
]
}
],
"source": [
"import os\n",
"from google.colab import userdata\n",
"os.environ['FIRECRAWL_API_KEY'] = userdata.get('FIRECRAWL_API_KEY')\n",
"import time\n",
"from firecrawl import FirecrawlApp\n",
"\n",
"async def speed_comparison():\n",
" # Simulated Firecrawl performance\n",
" app = FirecrawlApp(api_key=os.environ['FIRECRAWL_API_KEY'])\n",
" start = time.time()\n",
" scrape_status = app.scrape_url(\n",
" 'https://www.nbcnews.com/business',\n",
" params={'formats': ['markdown', 'html']}\n",
" )\n",
" end = time.time()\n",
" print(\"Firecrawl (simulated):\")\n",
" print(f\"Time taken: {end - start:.2f} seconds\")\n",
" print(f\"Content length: {len(scrape_status['markdown'])} characters\")\n",
" print(f\"Images found: {scrape_status['markdown'].count('cldnry.s-nbcnews.com')}\")\n",
" print()\n",
"\n",
" async with AsyncWebCrawler() as crawler:\n",
" # Crawl4AI simple crawl\n",
" start = time.time()\n",
" result = await crawler.arun(\n",
" url=\"https://www.nbcnews.com/business\",\n",
" word_count_threshold=0,\n",
" bypass_cache=True,\n",
" verbose=False\n",
" )\n",
" end = time.time()\n",
" print(\"Crawl4AI (simple crawl):\")\n",
" print(f\"Time taken: {end - start:.2f} seconds\")\n",
" print(f\"Content length: {len(result.markdown)} characters\")\n",
" print(f\"Images found: {result.markdown.count('cldnry.s-nbcnews.com')}\")\n",
" print()\n",
"\n",
" # Crawl4AI with JavaScript execution\n",
" start = time.time()\n",
" result = await crawler.arun(\n",
" url=\"https://www.nbcnews.com/business\",\n",
" js_code=[\"const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More')); loadMoreButton && loadMoreButton.click();\"],\n",
" word_count_threshold=0,\n",
" bypass_cache=True,\n",
" verbose=False\n",
" )\n",
" end = time.time()\n",
" print(\"Crawl4AI (with JavaScript execution):\")\n",
" print(f\"Time taken: {end - start:.2f} seconds\")\n",
" print(f\"Content length: {len(result.markdown)} characters\")\n",
" print(f\"Images found: {result.markdown.count('cldnry.s-nbcnews.com')}\")\n",
"\n",
"await speed_comparison()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "OBFFYVJIyZQN"
},
"source": [
"If you run on a local machine with a proper internet speed:\n",
"- Simple crawl: Crawl4AI is typically over 3-4 times faster than Firecrawl.\n",
"- With JavaScript execution: Even when executing JavaScript to load more content (potentially doubling the number of images found), Crawl4AI is still faster than Firecrawl's simple crawl.\n",
"\n",
"Please note that actual performance may vary depending on network conditions and the specific content being crawled."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "A6_1RK1_yZQO"
},
"source": [
"## Conclusion\n",
"\n",
"In this notebook, we've explored the powerful features of Crawl4AI, including:\n",
"\n",
"1. Basic crawling\n",
"2. JavaScript execution and CSS selector usage\n",
"3. Proxy support\n",
"4. Structured data extraction with OpenAI\n",
"5. Advanced multi-page crawling with JavaScript execution\n",
"6. Fast structured output using JsonCssExtractionStrategy\n",
"7. Speed comparison with other services\n",
"\n",
"Crawl4AI offers a fast, flexible, and powerful solution for web crawling and data extraction tasks. Its asynchronous architecture and advanced features make it suitable for a wide range of applications, from simple web scraping to complex, multi-page data extraction scenarios.\n",
"\n",
"For more information and advanced usage, please visit the [Crawl4AI documentation](https://docs.crawl4ai.com/).\n",
"\n",
"Happy crawling!"
]
}
],
"metadata": {
"colab": {
"provenance": []
},
"kernelspec": {
"display_name": "venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.13"
}
},
"nbformat": 4,
"nbformat_minor": 0
}

View File

@@ -11,9 +11,7 @@ from groq import Groq
# Import threadpools to run the crawl_url function in a separate thread
from concurrent.futures import ThreadPoolExecutor
client = AsyncOpenAI(
base_url="https://api.groq.com/openai/v1", api_key=os.getenv("GROQ_API_KEY")
)
client = AsyncOpenAI(base_url="https://api.groq.com/openai/v1", api_key=os.getenv("GROQ_API_KEY"))
# Instrument the OpenAI client
cl.instrument_openai()
@@ -27,39 +25,41 @@ settings = {
"presence_penalty": 0,
}
def extract_urls(text):
url_pattern = re.compile(r"(https?://\S+)")
url_pattern = re.compile(r'(https?://\S+)')
return url_pattern.findall(text)
def crawl_url(url):
data = {
"urls": [url],
"include_raw_html": True,
"word_count_threshold": 10,
"extraction_strategy": "NoExtractionStrategy",
"chunking_strategy": "RegexChunking",
"chunking_strategy": "RegexChunking"
}
response = requests.post("https://crawl4ai.com/crawl", json=data)
response_data = response.json()
response_data = response_data["results"][0]
return response_data["markdown"]
response_data = response_data['results'][0]
return response_data['markdown']
@cl.on_chat_start
async def on_chat_start():
cl.user_session.set("session", {"history": [], "context": {}})
await cl.Message(content="Welcome to the chat! How can I assist you today?").send()
cl.user_session.set("session", {
"history": [],
"context": {}
})
await cl.Message(
content="Welcome to the chat! How can I assist you today?"
).send()
@cl.on_message
async def on_message(message: cl.Message):
user_session = cl.user_session.get("session")
# Extract URLs from the user's message
urls = extract_urls(message.content)
futures = []
with ThreadPoolExecutor() as executor:
for url in urls:
@@ -69,9 +69,16 @@ async def on_message(message: cl.Message):
for url, result in zip(urls, results):
ref_number = f"REF_{len(user_session['context']) + 1}"
user_session["context"][ref_number] = {"url": url, "content": result}
user_session["context"][ref_number] = {
"url": url,
"content": result
}
user_session["history"].append({"role": "user", "content": message.content})
user_session["history"].append({
"role": "user",
"content": message.content
})
# Create a system message that includes the context
context_messages = [
@@ -88,17 +95,26 @@ async def on_message(message: cl.Message):
"If not, there is no need to add a references section. "
"At the end of your response, provide a reference section listing the URLs and their REF numbers only if sources from the appendices were used.\n\n"
"\n\n".join(context_messages)
),
)
}
else:
system_message = {"role": "system", "content": "You are a helpful assistant."}
system_message = {
"role": "system",
"content": "You are a helpful assistant."
}
msg = cl.Message(content="")
await msg.send()
# Get response from the LLM
stream = await client.chat.completions.create(
messages=[system_message, *user_session["history"]], stream=True, **settings
messages=[
system_message,
*user_session["history"]
],
stream=True,
**settings
)
assistant_response = ""
@@ -108,7 +124,10 @@ async def on_message(message: cl.Message):
await msg.stream_token(token)
# Add assistant message to the history
user_session["history"].append({"role": "assistant", "content": assistant_response})
user_session["history"].append({
"role": "assistant",
"content": assistant_response
})
await msg.update()
# Append the reference section to the assistant's response
@@ -135,11 +154,10 @@ async def on_audio_chunk(chunk: cl.AudioChunk):
pass
@cl.step(type="tool")
async def speech_to_text(audio_file):
cli = Groq()
response = await client.audio.transcriptions.create(
model="whisper-large-v3", file=audio_file
)
@@ -154,19 +172,24 @@ async def on_audio_end(elements: list[ElementBased]):
audio_buffer.seek(0) # Move the file pointer to the beginning
audio_file = audio_buffer.read()
audio_mime_type: str = cl.user_session.get("audio_mime_type")
start_time = time.time()
whisper_input = (audio_buffer.name, audio_file, audio_mime_type)
transcription = await speech_to_text(whisper_input)
end_time = time.time()
print(f"Transcription took {end_time - start_time} seconds")
user_msg = cl.Message(author="You", type="user_message", content=transcription)
user_msg = cl.Message(
author="You",
type="user_message",
content=transcription
)
await user_msg.send()
await on_message(user_msg)
if __name__ == "__main__":
from chainlit.cli import run_chainlit
run_chainlit(__file__)

View File

@@ -1,3 +1,4 @@
import requests, base64, os
data = {
@@ -5,50 +6,59 @@ data = {
"screenshot": True,
}
response = requests.post("https://crawl4ai.com/crawl", json=data)
result = response.json()["results"][0]
response = requests.post("https://crawl4ai.com/crawl", json=data)
result = response.json()['results'][0]
print(result.keys())
# dict_keys(['url', 'html', 'success', 'cleaned_html', 'media',
# 'links', 'screenshot', 'markdown', 'extracted_content',
# dict_keys(['url', 'html', 'success', 'cleaned_html', 'media',
# 'links', 'screenshot', 'markdown', 'extracted_content',
# 'metadata', 'error_message'])
with open("screenshot.png", "wb") as f:
f.write(base64.b64decode(result["screenshot"]))
f.write(base64.b64decode(result['screenshot']))
# Example of filtering the content using CSS selectors
data = {
"urls": ["https://www.nbcnews.com/business"],
"urls": [
"https://www.nbcnews.com/business"
],
"css_selector": "article",
"screenshot": True,
}
# Example of executing a JS script on the page before extracting the content
data = {
"urls": ["https://www.nbcnews.com/business"],
"urls": [
"https://www.nbcnews.com/business"
],
"screenshot": True,
"js": [
"""
'js' : ["""
const loadMoreButton = Array.from(document.querySelectorAll('button')).
find(button => button.textContent.includes('Load More'));
loadMoreButton && loadMoreButton.click();
"""
],
"""]
}
# Example of using a custom extraction strategy
data = {
"urls": ["https://www.nbcnews.com/business"],
"urls": [
"https://www.nbcnews.com/business"
],
"extraction_strategy": "CosineStrategy",
"extraction_strategy_args": {"semantic_filter": "inflation rent prices"},
"extraction_strategy_args": {
"semantic_filter": "inflation rent prices"
},
}
# Example of using LLM to extract content
data = {
"urls": ["https://www.nbcnews.com/business"],
"urls": [
"https://www.nbcnews.com/business"
],
"extraction_strategy": "LLMExtractionStrategy",
"extraction_strategy_args": {
"provider": "groq/llama3-8b-8192",
"api_token": os.environ.get("GROQ_API_KEY"),
"instruction": """I am interested in only financial news,
and translate them in French.""",
and translate them in French."""
},
}

View File

@@ -1,135 +0,0 @@
import time, re
from crawl4ai.content_scraping_strategy import WebScrapingStrategy, LXMLWebScrapingStrategy
import time
import functools
from collections import defaultdict
class TimingStats:
def __init__(self):
self.stats = defaultdict(lambda: defaultdict(lambda: {"calls": 0, "total_time": 0}))
def add(self, strategy_name, func_name, elapsed):
self.stats[strategy_name][func_name]["calls"] += 1
self.stats[strategy_name][func_name]["total_time"] += elapsed
def report(self):
for strategy_name, funcs in self.stats.items():
print(f"\n{strategy_name} Timing Breakdown:")
print("-" * 60)
print(f"{'Function':<30} {'Calls':<10} {'Total(s)':<10} {'Avg(ms)':<10}")
print("-" * 60)
for func, data in sorted(funcs.items(), key=lambda x: x[1]["total_time"], reverse=True):
avg_ms = (data["total_time"] / data["calls"]) * 1000
print(f"{func:<30} {data['calls']:<10} {data['total_time']:<10.3f} {avg_ms:<10.2f}")
timing_stats = TimingStats()
# Modify timing decorator
def timing_decorator(strategy_name):
def decorator(func):
@functools.wraps(func)
def wrapper(*args, **kwargs):
start = time.time()
result = func(*args, **kwargs)
elapsed = time.time() - start
timing_stats.add(strategy_name, func.__name__, elapsed)
return result
return wrapper
return decorator
# Modified decorator application
def apply_decorators(cls, method_name, strategy_name):
try:
original_method = getattr(cls, method_name)
decorated_method = timing_decorator(strategy_name)(original_method)
setattr(cls, method_name, decorated_method)
except AttributeError:
print(f"Method {method_name} not found in class {cls.__name__}.")
# Apply to key methods
methods_to_profile = [
'_scrap',
# 'process_element',
'_process_element',
'process_image',
]
# Apply decorators to both strategies
for strategy, name in [(WebScrapingStrategy, "Original"), (LXMLWebScrapingStrategy, "LXML")]:
for method in methods_to_profile:
apply_decorators(strategy, method, name)
def generate_large_html(n_elements=1000):
html = ['<!DOCTYPE html><html><head></head><body>']
for i in range(n_elements):
html.append(f'''
<div class="article">
<h2>Heading {i}</h2>
<div>
<div>
<p>This is paragraph {i} with some content and a <a href="http://example.com/{i}">link</a></p>
</div>
</div>
<img src="image{i}.jpg" alt="Image {i}">
<ul>
<li>List item {i}.1</li>
<li>List item {i}.2</li>
</ul>
</div>
''')
html.append('</body></html>')
return ''.join(html)
def test_scraping():
# Initialize both scrapers
original_scraper = WebScrapingStrategy()
selected_scraper = LXMLWebScrapingStrategy()
# Generate test HTML
print("Generating HTML...")
html = generate_large_html(5000)
print(f"HTML Size: {len(html)/1024:.2f} KB")
# Time the scraping
print("\nStarting scrape...")
start_time = time.time()
kwargs = {
"url": "http://example.com",
"html": html,
"word_count_threshold": 5,
"keep_data_attributes": True
}
t1 = time.perf_counter()
result_selected = selected_scraper.scrap(**kwargs)
t2 = time.perf_counter()
result_original = original_scraper.scrap(**kwargs)
t3 = time.perf_counter()
elapsed = t3 - start_time
print(f"\nScraping completed in {elapsed:.2f} seconds")
timing_stats.report()
# Print stats of LXML output
print("\nLXML Output:")
print(f"\nExtracted links: {len(result_selected['links']['internal']) + len(result_selected['links']['external'])}")
print(f"Extracted images: {len(result_selected['media']['images'])}")
print(f"Clean HTML size: {len(result_selected['cleaned_html'])/1024:.2f} KB")
print(f"Scraping time: {t2 - t1:.2f} seconds")
# Print stats of original output
print("\nOriginal Output:")
print(f"\nExtracted links: {len(result_original['links']['internal']) + len(result_original['links']['external'])}")
print(f"Extracted images: {len(result_original['media']['images'])}")
print(f"Clean HTML size: {len(result_original['cleaned_html'])/1024:.2f} KB")
print(f"Scraping time: {t3 - t1:.2f} seconds")
if __name__ == "__main__":
test_scraping()

View File

@@ -1,51 +0,0 @@
"""Example showing how to work with SSL certificates in Crawl4AI."""
import asyncio
import os
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
# Create tmp directory if it doesn't exist
parent_dir = os.path.dirname(
os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
)
tmp_dir = os.path.join(parent_dir, "tmp")
os.makedirs(tmp_dir, exist_ok=True)
async def main():
# Configure crawler to fetch SSL certificate
config = CrawlerRunConfig(
fetch_ssl_certificate=True,
cache_mode=CacheMode.BYPASS, # Bypass cache to always get fresh certificates
)
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(url="https://example.com", config=config)
if result.success and result.ssl_certificate:
cert = result.ssl_certificate
# 1. Access certificate properties directly
print("\nCertificate Information:")
print(f"Issuer: {cert.issuer.get('CN', '')}")
print(f"Valid until: {cert.valid_until}")
print(f"Fingerprint: {cert.fingerprint}")
# 2. Export certificate in different formats
cert.to_json(os.path.join(tmp_dir, "certificate.json")) # For analysis
print("\nCertificate exported to:")
print(f"- JSON: {os.path.join(tmp_dir, 'certificate.json')}")
pem_data = cert.to_pem(
os.path.join(tmp_dir, "certificate.pem")
) # For web servers
print(f"- PEM: {os.path.join(tmp_dir, 'certificate.pem')}")
der_data = cert.to_der(
os.path.join(tmp_dir, "certificate.der")
) # For Java apps
print(f"- DER: {os.path.join(tmp_dir, 'certificate.der')}")
if __name__ == "__main__":
asyncio.run(main())

View File

@@ -1,225 +0,0 @@
### Using `storage_state` to Pre-Load Cookies and LocalStorage
Crawl4ais `AsyncWebCrawler` lets you preserve and reuse session data, including cookies and localStorage, across multiple runs. By providing a `storage_state`, you can start your crawls already “logged in” or with any other necessary session data—no need to repeat the login flow every time.
#### What is `storage_state`?
`storage_state` can be:
- A dictionary containing cookies and localStorage data.
- A path to a JSON file that holds this information.
When you pass `storage_state` to the crawler, it applies these cookies and localStorage entries before loading any pages. This means your crawler effectively starts in a known authenticated or pre-configured state.
#### Example Structure
Heres an example storage state:
```json
{
"cookies": [
{
"name": "session",
"value": "abcd1234",
"domain": "example.com",
"path": "/",
"expires": 1675363572.037711,
"httpOnly": false,
"secure": false,
"sameSite": "None"
}
],
"origins": [
{
"origin": "https://example.com",
"localStorage": [
{ "name": "token", "value": "my_auth_token" },
{ "name": "refreshToken", "value": "my_refresh_token" }
]
}
]
}
```
This JSON sets a `session` cookie and two localStorage entries (`token` and `refreshToken`) for `https://example.com`.
---
### Passing `storage_state` as a Dictionary
You can directly provide the data as a dictionary:
```python
import asyncio
from crawl4ai import AsyncWebCrawler
async def main():
storage_dict = {
"cookies": [
{
"name": "session",
"value": "abcd1234",
"domain": "example.com",
"path": "/",
"expires": 1675363572.037711,
"httpOnly": False,
"secure": False,
"sameSite": "None"
}
],
"origins": [
{
"origin": "https://example.com",
"localStorage": [
{"name": "token", "value": "my_auth_token"},
{"name": "refreshToken", "value": "my_refresh_token"}
]
}
]
}
async with AsyncWebCrawler(
headless=True,
storage_state=storage_dict
) as crawler:
result = await crawler.arun(url='https://example.com/protected')
if result.success:
print("Crawl succeeded with pre-loaded session data!")
print("Page HTML length:", len(result.html))
if __name__ == "__main__":
asyncio.run(main())
```
---
### Passing `storage_state` as a File
If you prefer a file-based approach, save the JSON above to `mystate.json` and reference it:
```python
import asyncio
from crawl4ai import AsyncWebCrawler
async def main():
async with AsyncWebCrawler(
headless=True,
storage_state="mystate.json" # Uses a JSON file instead of a dictionary
) as crawler:
result = await crawler.arun(url='https://example.com/protected')
if result.success:
print("Crawl succeeded with pre-loaded session data!")
print("Page HTML length:", len(result.html))
if __name__ == "__main__":
asyncio.run(main())
```
---
### Using `storage_state` to Avoid Repeated Logins (Sign In Once, Use Later)
A common scenario is when you need to log in to a site (entering username/password, etc.) to access protected pages. Doing so every crawl is cumbersome. Instead, you can:
1. Perform the login once in a hook.
2. After login completes, export the resulting `storage_state` to a file.
3. On subsequent runs, provide that `storage_state` to skip the login step.
**Step-by-Step Example:**
**First Run (Perform Login and Save State):**
```python
import asyncio
from crawl4ai import AsyncWebCrawler, CacheMode
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
async def on_browser_created_hook(browser):
# Access the default context and create a page
context = browser.contexts[0]
page = await context.new_page()
# Navigate to the login page
await page.goto("https://example.com/login", wait_until="domcontentloaded")
# Fill in credentials and submit
await page.fill("input[name='username']", "myuser")
await page.fill("input[name='password']", "mypassword")
await page.click("button[type='submit']")
await page.wait_for_load_state("networkidle")
# Now the site sets tokens in localStorage and cookies
# Export this state to a file so we can reuse it
await context.storage_state(path="my_storage_state.json")
await page.close()
async def main():
# First run: perform login and export the storage_state
async with AsyncWebCrawler(
headless=True,
verbose=True,
hooks={"on_browser_created": on_browser_created_hook},
use_persistent_context=True,
user_data_dir="./my_user_data"
) as crawler:
# After on_browser_created_hook runs, we have storage_state saved to my_storage_state.json
result = await crawler.arun(
url='https://example.com/protected-page',
cache_mode=CacheMode.BYPASS,
markdown_generator=DefaultMarkdownGenerator(options={"ignore_links": True}),
)
print("First run result success:", result.success)
if result.success:
print("Protected page HTML length:", len(result.html))
if __name__ == "__main__":
asyncio.run(main())
```
**Second Run (Reuse Saved State, No Login Needed):**
```python
import asyncio
from crawl4ai import AsyncWebCrawler, CacheMode
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
async def main():
# Second run: no need to hook on_browser_created this time.
# Just provide the previously saved storage state.
async with AsyncWebCrawler(
headless=True,
verbose=True,
use_persistent_context=True,
user_data_dir="./my_user_data",
storage_state="my_storage_state.json" # Reuse previously exported state
) as crawler:
# Now the crawler starts already logged in
result = await crawler.arun(
url='https://example.com/protected-page',
cache_mode=CacheMode.BYPASS,
markdown_generator=DefaultMarkdownGenerator(options={"ignore_links": True}),
)
print("Second run result success:", result.success)
if result.success:
print("Protected page HTML length:", len(result.html))
if __name__ == "__main__":
asyncio.run(main())
```
**Whats Happening Here?**
- During the first run, the `on_browser_created_hook` logs into the site.
- After logging in, the crawler exports the current session (cookies, localStorage, etc.) to `my_storage_state.json`.
- On subsequent runs, passing `storage_state="my_storage_state.json"` starts the browser context with these tokens already in place, skipping the login steps.
**Sign Out Scenario:**
If the website allows you to sign out by clearing tokens or by navigating to a sign-out URL, you can also run a script that uses `on_browser_created_hook` or `arun` to simulate signing out, then export the resulting `storage_state` again. That would give you a baseline “logged out” state to start fresh from next time.
---
### Conclusion
By using `storage_state`, you can skip repetitive actions, like logging in, and jump straight into crawling protected content. Whether you provide a file path or a dictionary, this powerful feature helps maintain state between crawls, simplifying your data extraction pipelines.

View File

@@ -1,41 +1,39 @@
import os
import time
import json
from crawl4ai.web_crawler import WebCrawler
from crawl4ai.chunking_strategy import *
from crawl4ai.extraction_strategy import *
from crawl4ai.crawler_strategy import *
url = r"https://marketplace.visualstudio.com/items?itemName=Unclecode.groqopilot"
url = r'https://marketplace.visualstudio.com/items?itemName=Unclecode.groqopilot'
crawler = WebCrawler()
crawler.warmup()
from pydantic import BaseModel, Field
class PageSummary(BaseModel):
title: str = Field(..., description="Title of the page.")
summary: str = Field(..., description="Summary of the page.")
brief_summary: str = Field(..., description="Brief summary of the page.")
keywords: list = Field(..., description="Keywords assigned to the page.")
result = crawler.run(
url=url,
word_count_threshold=1,
extraction_strategy=LLMExtractionStrategy(
provider="openai/gpt-4o",
api_token=os.getenv("OPENAI_API_KEY"),
extraction_strategy= LLMExtractionStrategy(
provider= "openai/gpt-4o", api_token = os.getenv('OPENAI_API_KEY'),
schema=PageSummary.model_json_schema(),
extraction_type="schema",
apply_chunking=False,
instruction="From the crawled content, extract the following details: "
"1. Title of the page "
"2. Summary of the page, which is a detailed summary "
"3. Brief summary of the page, which is a paragraph text "
"4. Keywords assigned to the page, which is a list of keywords. "
"The extracted JSON format should look like this: "
'{ "title": "Page Title", "summary": "Detailed summary of the page.", "brief_summary": "Brief summary in a paragraph.", "keywords": ["keyword1", "keyword2", "keyword3"] }',
apply_chunking =False,
instruction="From the crawled content, extract the following details: "\
"1. Title of the page "\
"2. Summary of the page, which is a detailed summary "\
"3. Brief summary of the page, which is a paragraph text "\
"4. Keywords assigned to the page, which is a list of keywords. "\
'The extracted JSON format should look like this: '\
'{ "title": "Page Title", "summary": "Detailed summary of the page.", "brief_summary": "Brief summary in a paragraph.", "keywords": ["keyword1", "keyword2", "keyword3"] }'
),
bypass_cache=True,
)

Some files were not shown because too many files have changed in this diff Show More