Crawl4AI v0.2.6 🕷️🤖
Crawl4AI simplifies web crawling and data extraction, making it accessible for large language models (LLMs) and AI applications. 🆓🌐
Try it Now!
✨ Visit our Documentation Website
Features ✨
- 🆓 Completely free and open-source
- 🤖 LLM-friendly output formats (JSON, cleaned HTML, markdown)
- 🌍 Supports crawling multiple URLs simultaneously
- 🎨 Extracts and returns all media tags (Images, Audio, and Video)
- 🔗 Extracts all external and internal links
- 📚 Extracts metadata from the page
- 🔄 Custom hooks for authentication, headers, and page modifications before crawling
- 🕵️ User-agent customization
- 🖼️ Takes screenshots of the page
- 📜 Executes multiple custom JavaScript snippets before crawling
- 📚 Various chunking strategies: topic-based, regex, sentence, and more
- 🧠 Advanced extraction strategies: cosine clustering, LLM, and more
- 🎯 CSS selector support
- 📝 Passes instructions/keywords to refine extraction
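To make the chunking feature above more concrete, here is a minimal standard-library sketch of what regex and sentence chunking can look like. The function names `regex_chunks` and `sentence_chunks` are hypothetical and are not part of Crawl4AI; the library's own chunking classes may behave differently.

```python
import re


def regex_chunks(text, pattern=r"\n\s*\n"):
    """Split text on a regex (blank lines by default) and drop empty chunks."""
    return [chunk.strip() for chunk in re.split(pattern, text) if chunk.strip()]


def sentence_chunks(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


# Illustrative input only.
sample = "First paragraph. It has two sentences.\n\nSecond paragraph!"
print(regex_chunks(sample))     # two paragraph-level chunks
print(sentence_chunks(sample))  # three sentence-level chunks
```

Smaller chunks tend to fit more easily into an LLM context window, while paragraph-level chunks keep more surrounding context; which trade-off is right depends on the extraction task.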
Cool Examples 🚀
Quick Start
from crawl4ai import WebCrawler
# Create an instance of WebCrawler
crawler = WebCrawler()
# Warm up the crawler (load necessary models)
crawler.warmup()
# Run the crawler on a URL
result = crawler.run(url="https://www.nbcnews.com/business")
# Print the extracted content
print(result.markdown)
Extract Structured Data from Web Pages 📊
Extract all OpenAI model names and their fees from the official pricing page.
import os
from crawl4ai import WebCrawler
from crawl4ai.extraction_strategy import LLMExtractionStrategy
url = 'https://openai.com/api/pricing/'
crawler = WebCrawler()
crawler.warmup()
result = crawler.run(
    url=url,
    extraction_strategy=LLMExtractionStrategy(
        provider="openai/gpt-4",
        api_token=os.getenv('OPENAI_API_KEY'),
        instruction="Extract all model names and their fees for input and output tokens."
    ),
)
print(result.extracted_content)
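The example above only prints the raw result. A small post-processing step might look like the sketch below, which assumes `result.extracted_content` is a JSON-encoded string; that assumption is not stated in this README, so check it against the documentation. The helper name `parse_extracted` and the field names in the sample payload are hypothetical placeholders.

```python
import json


def parse_extracted(extracted_content):
    """Decode an extraction result into a list; return [] if it is empty or not valid JSON."""
    if not extracted_content:
        return []
    try:
        data = json.loads(extracted_content)
    except (TypeError, ValueError):
        return []
    # Normalize a single object into a one-element list.
    return data if isinstance(data, list) else [data]


# Placeholder payload for illustration; not real crawl output or real pricing.
sample = '[{"model": "example-model", "input_fee": "N/A", "output_fee": "N/A"}]'
for row in parse_extracted(sample):
    print(row["model"], row["input_fee"], row["output_fee"])
```

In a real run, `parse_extracted(result.extracted_content)` would take the place of the sample string.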
Execute JS, Filter Data with CSS Selector, and Clustering
from crawl4ai import WebCrawler
from crawl4ai.extraction_strategy import CosineStrategy
js_code = ["const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More')); loadMoreButton && loadMoreButton.click();"]
crawler = WebCrawler()
crawler.warmup()
result = crawler.run(
    url="https://www.nbcnews.com/business",
    js=js_code,
    css_selector="p",
    extraction_strategy=CosineStrategy(semantic_filter="technology")
)
print(result.extracted_content)
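For intuition about what a semantic filter such as `semantic_filter="technology"` does, the sketch below scores paragraphs against a query with cosine similarity over simple word counts. This is a deliberately simplified stand-in: it swaps in bag-of-words vectors, performs filtering only (no clustering), and is not how `CosineStrategy` is implemented. All function names and the threshold value are hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_by_topic(paragraphs, query, threshold=0.1):
    """Keep paragraphs whose similarity to the query meets the threshold."""
    q = bag_of_words(query)
    return [p for p in paragraphs if cosine_similarity(bag_of_words(p), q) >= threshold]


# Illustrative input only.
paragraphs = [
    "New technology startups raised funding this quarter.",
    "The bakery on Main Street sells fresh bread.",
]
print(filter_by_topic(paragraphs, "technology"))  # keeps only the first paragraph
```

An embedding-based approach would also match paragraphs that discuss the topic without using the exact keyword, which word counts cannot do.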
Documentation 📚
For detailed documentation, including installation instructions, advanced features, and API reference, visit our Documentation Website.
Contributing 🤝
We welcome contributions from the open-source community. Check out our contribution guidelines for more information.
License 📄
Crawl4AI is released under the Apache 2.0 License.
Contact 📧
For questions, suggestions, or feedback, feel free to reach out:
- GitHub: unclecode
- Twitter: @unclecode
- Website: crawl4ai.com
Happy Crawling! 🕸️🚀
Description
🚀🤖 Crawl4AI: Open-source LLM Friendly Web Crawler & Scraper. Don't be shy, join here: https://discord.gg/jP8KfhDhyN