ayrisdev/crawl4ai

Fork 0

Go to file

Unclecode 48c27899b7 Merge branch 'main' of https://github.com/unclecode/crawl4ai

2024-06-22 12:38:14 +00:00

crawl4ai

chore:

2024-06-22 20:36:01 +08:00

docs

chore:

2024-06-22 20:36:01 +08:00

pages

issue 19 is resolved

2024-06-22 17:18:00 +08:00

tests

- Test all methods

2024-05-14 21:27:41 +08:00

.env.txt

chore: Update environment variable usage in config files

2024-05-09 22:37:01 +08:00

.gitignore

Update .gitignore

2024-06-22 18:12:12 +08:00

CHANGELOG.md

issue 19 is resolved

2024-06-22 17:18:00 +08:00

docker-compose.yml

Initial Commit

2024-05-09 19:10:25 +08:00

Dockerfile

issue 19 is resolved

2024-06-22 17:18:00 +08:00

Dockerfile_mac

chore: Update Dockerfile to install chromium-chromedriver and spacy library

2024-05-18 09:16:52 +00:00

LICENSE

Initial Commit

2024-05-09 19:10:25 +08:00

main.py

chore:

2024-06-22 20:36:01 +08:00

mkdocs.yml

chore:

2024-06-22 20:36:01 +08:00

README.md

issue 19 is resolved

2024-06-22 17:18:00 +08:00

requirements.crawl.txt

Update for v0.2.2

2024-06-02 15:40:18 +08:00

requirements.txt

chore: Update configuration values for chunk token threshold, overlap rate, and minimum word threshold. Create a new example for LLMExtraction Strategy, update Dockerfile, and README

2024-06-19 18:32:20 +08:00

setup.py

issue 19 is resolved

2024-06-22 17:18:00 +08:00

README.md

Crawl4AI v0.2.6 🕷️🤖

Crawl4AI simplifies web crawling and data extraction, making it accessible for large language models (LLMs) and AI applications. 🆓🌐

Try it Now!

Use as REST API:
Use as Python library:

✨ visit our Documentation Website

Features ✨

🆓 Completely free and open-source
🤖 LLM-friendly output formats (JSON, cleaned HTML, markdown)
🌍 Supports crawling multiple URLs simultaneously
🎨 Extracts and returns all media tags (Images, Audio, and Video)
🔗 Extracts all external and internal links
📚 Extracts metadata from the page
🔄 Custom hooks for authentication, headers, and page modifications before crawling
🕵️ User-agent customization
🖼️ Takes screenshots of the page
📜 Executes multiple custom JavaScripts before crawling
📚 Various chunking strategies: topic-based, regex, sentence, and more
🧠 Advanced extraction strategies: cosine clustering, LLM, and more
🎯 CSS selector support
📝 Passes instructions/keywords to refine extraction

Cool Examples 🚀

Quick Start

from crawl4ai import WebCrawler

# Create an instance of WebCrawler
crawler = WebCrawler()

# Warm up the crawler (load necessary models)
crawler.warmup()

# Run the crawler on a URL
result = crawler.run(url="https://www.nbcnews.com/business")

# Print the extracted content
print(result.markdown)

Extract Structured Data from Web Pages 📊

Crawl all OpenAI models and their fees from the official page.

import os
from crawl4ai import WebCrawler
from crawl4ai.extraction_strategy import LLMExtractionStrategy

url = 'https://openai.com/api/pricing/'
crawler = WebCrawler()
crawler.warmup()

result = crawler.run(
    url=url,
    extraction_strategy=LLMExtractionStrategy(
        provider="openai/gpt-4",
        api_token=os.getenv('OPENAI_API_KEY'),
        instruction="Extract all model names and their fees for input and output tokens."
    ),
)

print(result.extracted_content)

Execute JS, Filter Data with CSS Selector, and Clustering

from crawl4ai import WebCrawler
from crawl4ai.chunking_strategy import CosineStrategy

js_code = ["const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More')); loadMoreButton && loadMoreButton.click();"]

crawler = WebCrawler()
crawler.warmup()

result = crawler.run(
    url="https://www.nbcnews.com/business",
    js=js_code,
    css_selector="p",
    extraction_strategy=CosineStrategy(semantic_filter="technology")
)

print(result.extracted_content)

Documentation 📚

For detailed documentation, including installation instructions, advanced features, and API reference, visit our Documentation Website.

Contributing 🤝

We welcome contributions from the open-source community. Check out our contribution guidelines for more information.

License 📄

Crawl4AI is released under the Apache 2.0 License.

Contact 📧

For questions, suggestions, or feedback, feel free to reach out:

GitHub: unclecode
Twitter: @unclecode
Website: crawl4ai.com

Happy Crawling! 🕸️🚀