2025 feb alpha 1 (#685)
* spelling change in prompt
* gpt-4o-mini support
* Remove leading Y before here
* prompt spell correction
* (Docs) Fix numbered list end-of-line formatting: added the missing "two spaces" to add a line break
* fix: access downloads_path through browser_config in _handle_download method - Fixes #585
* crawl
* fix: https://github.com/unclecode/crawl4ai/issues/592
* fix: https://github.com/unclecode/crawl4ai/issues/583
* Docs update: https://github.com/unclecode/crawl4ai/issues/649
* fix: https://github.com/unclecode/crawl4ai/issues/570
* Docs: updated example for content-selection to reflect new changes in yc newsfeed css
* Refactor: Removed old filters and replaced with optimised filters
* fix: Fixed imports as per the new names of filters
* Tests: For deep crawl filters
* Refactor: Remove old scorers and replace with optimised ones: Fix imports for all filters and scorers.
* fix: awaiting on filters that are async in nature, e.g. content relevance and seo filters
* fix: https://github.com/unclecode/crawl4ai/issues/592
* fix: https://github.com/unclecode/crawl4ai/issues/715

---------

Co-authored-by: DarshanTank <darshan.tank@gnani.ai>
Co-authored-by: Tuhin Mallick <tuhin.mllk@gmail.com>
Co-authored-by: Serhat Soydan <ssoydan@gmail.com>
Co-authored-by: cardit1 <maneesh@cardit.in>
Co-authored-by: Tautik Agrahari <tautikagrahari@gmail.com>
@@ -7,8 +7,8 @@ Crawl4AI offers multiple power-user features that go beyond simple crawling. Thi
 2. **Capturing PDFs & Screenshots**
 3. **Handling SSL Certificates**
 4. **Custom Headers**
-5. **Session Persistence & Local Storage**
-6. **Robots.txt Compliance**
+5. **Session Persistence & Local Storage**
+6. **Robots.txt Compliance**

 > **Prerequisites**
 > - You have a basic grasp of [AsyncWebCrawler Basics](../core/simple-crawling.md)

@@ -168,10 +168,10 @@ async def main():
         "name": "News Items",
         "baseSelector": "tr.athing",
         "fields": [
-            {"name": "title", "selector": "a.storylink", "type": "text"},
+            {"name": "title", "selector": "span.titleline a", "type": "text"},
             {
                 "name": "link",
-                "selector": "a.storylink",
+                "selector": "span.titleline a",
                 "type": "attribute",
                 "attribute": "href"
             }

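To make the selector change in the hunk above concrete, here is a minimal standard-library sketch of what `tr.athing` plus `span.titleline a` would pick out. It does not use Crawl4AI at all: `SAMPLE`, `TitlelineParser`, and `extract_news_items` are hypothetical names, the HTML fragment is an assumed approximation of the news-feed markup, and the sketch takes only the first link per row.

```python
from html.parser import HTMLParser

# Hypothetical fragment shaped like the markup the new selectors target.
SAMPLE = """
<table>
  <tr class="athing"><td><span class="titleline"><a href="https://example.com/a">First story</a></span></td></tr>
  <tr class="athing"><td><span class="titleline"><a href="https://example.com/b">Second story</a></span></td></tr>
</table>
"""


class TitlelineParser(HTMLParser):
    """Collect text and href of the first <a> under span.titleline in each tr.athing."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._in_row = False    # inside <tr class="athing">
        self._in_span = False   # inside <span class="titleline">, link not yet taken
        self._current = None    # item being filled while inside the target <a>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "tr" and "athing" in classes:
            self._in_row = True
        elif tag == "span" and self._in_row and "titleline" in classes:
            self._in_span = True
        elif tag == "a" and self._in_span and self._current is None:
            self._current = {"title": "", "link": attrs.get("href")}

    def handle_data(self, data):
        # Only accumulate text while inside the target link.
        if self._current is not None:
            self._current["title"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            self._current["title"] = self._current["title"].strip()
            self.items.append(self._current)
            self._current = None
            self._in_span = False  # ignore any further links in this row
        elif tag == "span":
            self._in_span = False
        elif tag == "tr":
            self._in_row = False


def extract_news_items(html):
    """Return a list of {"title", "link"} dicts, one per matching row."""
    parser = TitlelineParser()
    parser.feed(html)
    return parser.items


items = extract_news_items(SAMPLE)
```

Against markup that still used `a.storylink` without a `span.titleline` wrapper, this sketch would return an empty list, which is the kind of silent miss the updated schema is meant to avoid.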
@@ -135,14 +135,14 @@ html = "<div class='product'><h2>Gaming Laptop</h2><span class='price'>$999.99</
 # Using OpenAI (requires API token)
 schema = JsonCssExtractionStrategy.generate_schema(
     html,
-    llm_provider="openai/gpt-4o",  # Default provider
+    provider="openai/gpt-4o",  # Default provider
     api_token="your-openai-token"  # Required for OpenAI
 )

 # Or using Ollama (open source, no token needed)
 schema = JsonCssExtractionStrategy.generate_schema(
     html,
-    llm_provider="ollama/llama3.3",  # Open source alternative
+    provider="ollama/llama3.3",  # Open source alternative
     api_token=None  # Not needed for Ollama
 )

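The hunks here rename the `llm_provider` keyword to `provider` in the documented `generate_schema` calls. For callers that still build their arguments with the old name, a small adapter along the following lines might ease the transition. This is only a sketch: `migrate_generate_schema_kwargs` is a hypothetical helper, not part of Crawl4AI, and it only rewrites a plain dict of keyword arguments.

```python
import warnings


def migrate_generate_schema_kwargs(kwargs):
    """Return a copy of kwargs with the legacy `llm_provider` key renamed to `provider`.

    Hypothetical helper for illustration; the input dict is left unchanged.
    """
    migrated = dict(kwargs)
    if "llm_provider" in migrated:
        if "provider" in migrated:
            raise TypeError(
                "pass either 'provider' or the legacy 'llm_provider', not both"
            )
        warnings.warn(
            "'llm_provider' was renamed to 'provider'",
            DeprecationWarning,
            stacklevel=2,
        )
        migrated["provider"] = migrated.pop("llm_provider")
    return migrated


# Legacy-style arguments, as the docs showed them before this change.
legacy = {"llm_provider": "ollama/llama3.3", "api_token": None}

with warnings.catch_warnings():
    warnings.simplefilter("ignore", DeprecationWarning)
    updated = migrate_generate_schema_kwargs(legacy)
```

The resulting dict could then be unpacked into the call as `generate_schema(html, **updated)`, assuming the installed version accepts `provider` as shown in the diff.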
@@ -434,7 +434,7 @@ html = """
 css_schema = JsonCssExtractionStrategy.generate_schema(
     html,
     schema_type="css",  # This is the default
-    llm_provider="openai/gpt-4o",  # Default provider
+    provider="openai/gpt-4o",  # Default provider
     api_token="your-openai-token"  # Required for OpenAI
 )

@@ -442,7 +442,7 @@ css_schema = JsonCssExtractionStrategy.generate_schema(
 xpath_schema = JsonXPathExtractionStrategy.generate_schema(
     html,
     schema_type="xpath",
-    llm_provider="ollama/llama3.3",  # Open source alternative
+    provider="ollama/llama3.3",  # Open source alternative
     api_token=None  # Not needed for Ollama
 )
