🚀🤖 Crawl4AI: Open-Source LLM-Friendly Web Crawler & Scraper

🚀 Crawl4AI Cloud API — Closed Beta (Launching Soon)

Reliable, large-scale web extraction, built to be far more cost-effective than existing solutions.

👉 Apply here for early access
We’ll be onboarding in phases and working closely with early users. Limited slots.

Crawl4AI is the #1 trending GitHub repository, actively maintained by a vibrant community. It delivers blazing-fast, AI-ready web crawling tailored for large language models, AI agents, and data pipelines. Fully open source, flexible, and built for real-time performance, Crawl4AI empowers developers with unmatched speed, precision, and deployment ease.

Enjoy using Crawl4AI? Consider becoming a sponsor to support ongoing development and community growth!

🆕 AI Assistant Skill Now Available!

🤖 Crawl4AI Skill for Claude & AI Assistants

Supercharge your AI coding assistant with complete Crawl4AI knowledge! Download our comprehensive skill package that includes:

📚 Complete SDK reference (23K+ words)
🚀 Ready-to-use extraction scripts
⚡ Schema generation for efficient scraping
🔧 Version 0.7.4 compatible

📦 Download Skill Package

Works with Claude, Cursor, Windsurf, and other AI coding assistants. Import the .zip file into your AI assistant's skill/knowledge system.

🎯 New: Adaptive Web Crawling

Crawl4AI now features adaptive crawling that knows when to stop. Using information-foraging algorithms, it determines when enough information has been gathered to answer your query.

Learn more about Adaptive Crawling →
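
To make the "knows when to stop" idea concrete, here is a toy saturation rule in plain Python. It is not Crawl4AI's algorithm, which this page does not describe: the function name `pages_until_saturated`, the `min_gain` and `patience` parameters, and the sample texts are all hypothetical. The sketch stops reading pages once several consecutive pages contribute almost no unseen terms.

```python
from typing import Iterable


def pages_until_saturated(
    pages: Iterable[str], min_gain: float = 0.1, patience: int = 2
) -> int:
    """Return how many pages were read before information gain flattened out.

    Toy rule (an assumption, not Crawl4AI's method): stop after `patience`
    consecutive pages whose share of previously unseen terms is below `min_gain`.
    """
    seen: set[str] = set()
    low_streak = 0
    used = 0
    for text in pages:
        terms = set(text.lower().split())
        new_terms = terms - seen
        # Fraction of this page's vocabulary that we had not seen before.
        gain = len(new_terms) / len(terms) if terms else 0.0
        seen |= terms
        used += 1
        low_streak = low_streak + 1 if gain < min_gain else 0
        if low_streak >= patience:
            break
    return used


# Hypothetical page texts standing in for crawled content.
pages = [
    "crawl4ai generates clean markdown",
    "markdown output suits rag pipelines",
    "clean markdown suits rag",
    "crawl4ai generates markdown output",
    "completely unrelated new content here",
]
used = pages_until_saturated(pages)
print(f"stopped after {used} of {len(pages)} pages")
```

A real crawler would weigh relevance to the query rather than raw vocabulary, but the shape of the loop (measure marginal gain, stop when it stays low) is the same.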

Quick Start

Here's a quick example of a single crawl using Crawl4AI's asynchronous API:

import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    # Create an instance of AsyncWebCrawler
    async with AsyncWebCrawler() as crawler:
        # Run the crawler on a URL
        result = await crawler.arun(url="https://crawl4ai.com")

        # Print the extracted content
        print(result.markdown)

# Run the async main function
asyncio.run(main())
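
Because the crawl call is a coroutine, several crawls can be awaited concurrently. The sketch below shows that pattern with bounded concurrency using only the standard library. `fake_arun` is a hypothetical stub standing in for `crawler.arun(url=...)` so the example runs offline, and `crawl_all` and `max_concurrency` are illustrative names, not part of Crawl4AI.

```python
import asyncio


async def fake_arun(url: str) -> str:
    """Hypothetical stand-in for `crawler.arun(url=...)`; returns fake Markdown."""
    await asyncio.sleep(0.01)  # simulate network latency
    return f"# Page at {url}"


async def crawl_all(urls: list[str], max_concurrency: int = 2) -> dict[str, str]:
    """Crawl every URL concurrently, with at most `max_concurrency` in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(url: str) -> str:
        async with semaphore:
            return await fake_arun(url)

    # gather() preserves input order, so results line up with `urls`.
    pages = await asyncio.gather(*(bounded(u) for u in urls))
    return dict(zip(urls, pages))


urls = ["https://example.com/a", "https://example.com/b", "https://example.com/c"]
results = asyncio.run(crawl_all(urls))
for url, markdown in results.items():
    print(url, "->", markdown)
```

In real use, the stub would be replaced by a call on a shared crawler instance; the semaphore keeps the number of simultaneous page loads under control.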

What Does Crawl4AI Do?

Crawl4AI is a feature-rich crawler and scraper built around five goals:

1. Generate Clean Markdown: Perfect for RAG pipelines or direct ingestion into LLMs.
2. Structured Extraction: Parse repeated patterns with CSS, XPath, or LLM-based extraction.
3. Advanced Browser Control: Hooks, proxies, stealth modes, and session reuse for fine-grained control.
4. High Performance: Parallel crawling, chunk-based extraction, real-time use cases.
5. Open Source: No forced API keys, no paywalls—everyone can access their data.
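
As an illustration of item 2, "parse repeated patterns" means describing one repeated block plus the fields inside it, then applying that description to every match. The sketch below shows the idea with the standard library's `xml.etree.ElementTree` and its XPath subset. It does not use Crawl4AI's extraction strategies; the sample markup, the `SCHEMA` layout, and the `extract` helper are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample markup standing in for a crawled page.
HTML = """
<ul>
  <li class="product"><span class="name">Widget</span><span class="price">9.99</span></li>
  <li class="product"><span class="name">Gadget</span><span class="price">19.50</span></li>
</ul>
"""

# Illustrative schema: one path for the repeated block, one relative path per field.
SCHEMA = {
    "base": ".//li[@class='product']",
    "fields": {
        "name": "span[@class='name']",
        "price": "span[@class='price']",
    },
}


def extract(markup: str, schema: dict) -> list[dict]:
    """Return one dict per element matching the base path."""
    root = ET.fromstring(markup)
    rows = []
    for node in root.findall(schema["base"]):
        row = {}
        for field, path in schema["fields"].items():
            match = node.find(path)
            # Missing or empty fields become None instead of raising.
            row[field] = match.text.strip() if match is not None and match.text else None
        rows.append(row)
    return rows


records = extract(HTML, SCHEMA)
print(records)
```

Keeping the selectors in data rather than code means the same `extract` function can serve any page with a repeating layout; only the schema changes.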

Core Philosophies:

Democratize Data: Free to use, transparent, and highly configurable.
LLM Friendly: Minimally processed, well-structured text, images, and metadata, so AI models can easily consume it.

Documentation Structure

To help you get started, we’ve organized our docs into clear sections:

Setup & Installation
Basic instructions to install Crawl4AI via pip or Docker.
Quick Start
A hands-on introduction showing how to do your first crawl, generate Markdown, and do a simple extraction.
Core
Deeper guides on single-page crawling, advanced browser/crawler parameters, content filtering, and caching.
Advanced
Explore link & media handling, lazy loading, hooking & authentication, proxies, session management, and more.
Extraction
Detailed references for no-LLM (CSS, XPath) vs. LLM-based strategies, chunking, and clustering approaches.
API Reference
Find the technical specifics of each class and method, including AsyncWebCrawler, arun(), and CrawlResult.

Throughout these sections, you’ll find code samples you can copy and paste into your environment. If something is missing or unclear, open an issue or submit a PR.

How You Can Support

Star & Fork: If you find Crawl4AI helpful, star the repo on GitHub or fork it to add your own features.
File Issues: Encounter a bug or missing feature? Let us know by filing an issue, so we can improve.
Pull Requests: Whether it’s a small fix, a big feature, or better docs—contributions are always welcome.
Join Discord: Come chat about web scraping, crawling tips, or AI workflows with the community.
Spread the Word: Mention Crawl4AI in your blog posts, talks, or on social media.

Our mission: to empower everyone—students, researchers, entrepreneurs, data scientists—to access, parse, and shape the world’s data with speed, cost-efficiency, and creative freedom.

Thank you for joining me on this journey. Let’s keep building an open, democratic approach to data extraction and AI together.

Happy Crawling!
— Unclecode, Founder & Maintainer of Crawl4AI
