2025 feb alpha 1 (#685)

* spelling change in prompt

* gpt-4o-mini support

* Remove leading Y before here

* prompt spell correction

* (Docs) Fix numbered list end-of-line formatting

Added the missing "two spaces" to add a line break

* fix: access downloads_path through browser_config in _handle_download method - Fixes #585

* crawl

* fix: https://github.com/unclecode/crawl4ai/issues/592

* fix: https://github.com/unclecode/crawl4ai/issues/583

* Docs update: https://github.com/unclecode/crawl4ai/issues/649

* fix: https://github.com/unclecode/crawl4ai/issues/570

* Docs: Updated the content-selection example to reflect new changes in the YC newsfeed CSS

* Refactor: Removed old filters and replaced them with optimised filters

* fix: Updated imports to match the new filter names

* Tests: For deep crawl filters

* Refactor: Removed old scorers and replaced them with optimised ones; fixed imports for all filters and scorers

* fix: Await filters that are async in nature, e.g. content relevance and SEO filters

* fix: https://github.com/unclecode/crawl4ai/issues/592

* fix: https://github.com/unclecode/crawl4ai/issues/715

---------

Co-authored-by: DarshanTank <darshan.tank@gnani.ai>
Co-authored-by: Tuhin Mallick <tuhin.mllk@gmail.com>
Co-authored-by: Serhat Soydan <ssoydan@gmail.com>
Co-authored-by: cardit1 <maneesh@cardit.in>
Co-authored-by: Tautik Agrahari <tautikagrahari@gmail.com>
Author: Aravind
Date: 2025-02-19 11:43:17 +05:30
Committed by: GitHub
Parent: c171891999
Commit: dad592c801
19 changed files with 833 additions and 1350 deletions


@@ -7,8 +7,8 @@ Crawl4AI offers multiple power-user features that go beyond simple crawling. Thi
 2. **Capturing PDFs & Screenshots**
 3. **Handling SSL Certificates**
 4. **Custom Headers**
-5. **Session Persistence & Local Storage**
-6. **Robots.txt Compliance**
+5. **Session Persistence & Local Storage**  
+6. **Robots.txt Compliance**  
 > **Prerequisites**
 > - You have a basic grasp of [AsyncWebCrawler Basics](../core/simple-crawling.md)


@@ -168,10 +168,10 @@ async def main():
"name": "News Items",
"baseSelector": "tr.athing",
"fields": [
{"name": "title", "selector": "a.storylink", "type": "text"},
{"name": "title", "selector": "span.titleline a", "type": "text"},
{
"name": "link",
"selector": "a.storylink",
"selector": "span.titleline a",
"type": "attribute",
"attribute": "href"
}
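For reference, the corrected selectors can be captured in a plain Python schema dict. This is a sketch of the extraction schema the hunk above converges on; the surrounding crawl4ai usage (passing the dict to `JsonCssExtractionStrategy`) is assumed, not shown:

```python
# Sketch of the updated Hacker News extraction schema. HN now wraps story
# titles in <span class="titleline">, so the legacy a.storylink selector
# matches nothing on the current page.
schema = {
    "name": "News Items",
    "baseSelector": "tr.athing",
    "fields": [
        {"name": "title", "selector": "span.titleline a", "type": "text"},
        {
            "name": "link",
            "selector": "span.titleline a",
            "type": "attribute",
            "attribute": "href",
        },
    ],
}

# No field should still reference the retired selector.
assert all(f["selector"] == "span.titleline a" for f in schema["fields"])
```

Passing this dict to `JsonCssExtractionStrategy(schema)` is the intended use per the docs page being patched.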


@@ -135,14 +135,14 @@ html = "<div class='product'><h2>Gaming Laptop</h2><span class='price'>$999.99</
 # Using OpenAI (requires API token)
 schema = JsonCssExtractionStrategy.generate_schema(
     html,
-    llm_provider="openai/gpt-4o",  # Default provider
+    provider="openai/gpt-4o",  # Default provider
     api_token="your-openai-token"  # Required for OpenAI
 )

 # Or using Ollama (open source, no token needed)
 schema = JsonCssExtractionStrategy.generate_schema(
     html,
-    llm_provider="ollama/llama3.3",  # Open source alternative
+    provider="ollama/llama3.3",  # Open source alternative
     api_token=None  # Not needed for Ollama
 )
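The docs fix above is purely a keyword-argument rename. As a self-contained illustration — using a hypothetical stand-in function, not crawl4ai's real implementation — the corrected signature accepts `provider` and rejects the old `llm_provider` keyword:

```python
# Hypothetical stand-in mirroring generate_schema's keyword interface,
# for illustration only; the real method lives on JsonCssExtractionStrategy.
def generate_schema(html, provider="openai/gpt-4o", api_token=None, schema_type="css"):
    # A real implementation would call the LLM; we just echo the arguments.
    return {"provider": provider, "schema_type": schema_type}

# New keyword works as documented.
result = generate_schema("<div>...</div>", provider="ollama/llama3.3")
assert result["provider"] == "ollama/llama3.3"

# The legacy keyword from the pre-fix docs is no longer accepted.
try:
    generate_schema("<div>...</div>", llm_provider="ollama/llama3.3")
except TypeError:
    pass  # expected: unexpected keyword argument
else:
    raise AssertionError("expected TypeError for legacy keyword")
```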


@@ -434,7 +434,7 @@ html = """
 css_schema = JsonCssExtractionStrategy.generate_schema(
     html,
     schema_type="css",  # This is the default
-    llm_provider="openai/gpt-4o",  # Default provider
+    provider="openai/gpt-4o",  # Default provider
     api_token="your-openai-token"  # Required for OpenAI
 )
@@ -442,7 +442,7 @@ css_schema = JsonCssExtractionStrategy.generate_schema(
 xpath_schema = JsonXPathExtractionStrategy.generate_schema(
     html,
     schema_type="xpath",
-    llm_provider="ollama/llama3.3",  # Open source alternative
+    provider="ollama/llama3.3",  # Open source alternative
     api_token=None  # Not needed for Ollama
 )