- User agent

- Extract Links
- Extract Metadata
- Update Readme
- Update REST API document
This commit is contained in:
unclecode
2024-06-08 17:59:42 +08:00
parent 9c34b30723
commit b3a0edaa6d
12 changed files with 155 additions and 75 deletions


@@ -14,6 +14,9 @@ Crawl4AI has one clear task: to simplify crawling and extract useful information
### v0.2.3
- 🎨 Extract and return all media tags (Images, Audio, and Video). Check `result.media`
- 🔗 Extract all external and internal links. Check `result.links`
- 📚 Extract metadata from the page. Check `result.metadata`
- 🕵️ Support `user_agent` parameter to set the user agent for the HTTP requests.
- 🖼️ Take [screenshots](#taking-screenshots) of the page.
### v0.2.2
@@ -32,7 +35,7 @@ Crawl4AI has one clear task: to simplify crawling and extract useful information
## Power and Simplicity of Crawl4AI 🚀
The easiest way! If you don't want to install any library, you can use the REST API on my server. But remember, this is just a simple server, and I may improve its capacity if I see demand.
The easiest way! If you don't want to install any library, you can use the REST API on my server. But remember, this is just a simple server, and I may improve its capacity if I see demand. You can find all examples of the REST API in this Colab notebook. [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1zODYjhemJ5bUmYceWpVoBMVpd0ofzNBZ?usp=sharing)
```python
import requests
@@ -41,7 +44,6 @@ data = {
    "urls": [
        "https://www.nbcnews.com/business"
    ],
    "word_count_threshold": 5,
    "screenshot": True
}
```
@@ -242,8 +244,12 @@ To use the REST API, send a POST request to `http://localhost:8000/crawl` with t
      "url": "https://www.nbcnews.com/business",
      "extracted_content": "...",
      "html": "...",
      "cleaned_html": "...",
      "markdown": "...",
      "metadata": {...}
      "media": {...},
      "links": {...},
      "metadata": {...},
      "screenshots": "..."
    }
  ]
}
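Putting the request body and the documented response shape together, here is a minimal sketch. The response below is a hard-coded stand-in in the format shown above, not a live server reply, and the field values are illustrative:

```python
# Request body for POST /crawl, as documented above.
request_body = {
    "urls": ["https://www.nbcnews.com/business"],
    "screenshot": True,
}

# Stand-in for the JSON the endpoint is documented to return.
response_json = {
    "results": [
        {
            "url": "https://www.nbcnews.com/business",
            "markdown": "# Business news ...",
            "media": {"images": [], "audio": [], "video": []},
            "links": {"external": [], "internal": []},
            "metadata": {},
            "screenshots": "...",
        }
    ]
}

# Unpack each crawled page from the "results" array.
for result in response_json["results"]:
    internal = result["links"]["internal"]
    print(result["url"], "->", len(internal), "internal links")
```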
@@ -282,6 +288,24 @@ Crawl result without raw HTML content:
```python
result = crawler.run(url="https://www.nbcnews.com/business", include_raw_html=False)
```
### Result Structure
The result object contains the following fields:
```python
from typing import Dict, List, Optional

from pydantic import BaseModel

class CrawlResult(BaseModel):
    url: str
    html: str
    success: bool
    cleaned_html: Optional[str] = None
    media: Dict[str, List[Dict]] = {}  # Media tags in the page {"images": [], "audio": [], "video": []}
    links: Dict[str, List[Dict]] = {}  # Links in the page {"external": [], "internal": []}
    screenshot: Optional[str] = None  # Base64-encoded screenshot
    markdown: Optional[str] = None
    extracted_content: Optional[str] = None
    metadata: Optional[dict] = None
    error_message: Optional[str] = None
```
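Consuming a result then looks like the sketch below. `ResultStub` is a dataclass stand-in that mirrors the relevant `CrawlResult` field names (so the example runs without the library installed), and `summarize` is a hypothetical helper, not part of the API:

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class ResultStub:
    """Stand-in with the same field names as CrawlResult above."""
    url: str
    success: bool
    media: Dict[str, List[dict]] = field(default_factory=dict)
    links: Dict[str, List[dict]] = field(default_factory=dict)
    error_message: Optional[str] = None

def summarize(result) -> str:
    """Hypothetical helper: one-line summary of a crawl result."""
    if not result.success:
        return f"failed: {result.error_message}"
    images = len(result.media.get("images", []))
    links = sum(len(v) for v in result.links.values())
    return f"{result.url}: {images} images, {links} links"

print(summarize(ResultStub(
    url="https://example.com",
    success=True,
    media={"images": [{}, {}]},
    links={"internal": [{}], "external": []},
)))
```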
### Taking Screenshots
@@ -401,6 +425,7 @@ result = crawler.run(url="https://www.nbcnews.com/business")
| `extraction_strategy` | The strategy to use for extracting content from the HTML (e.g., "CosineStrategy"). | No | `NoExtractionStrategy` |
| `chunking_strategy` | The strategy to use for chunking the text before processing (e.g., "RegexChunking"). | No | `RegexChunking` |
| `css_selector` | The CSS selector to target specific parts of the HTML for extraction. | No | `None` |
| `user_agent` | The user agent to use for the HTTP requests. | No | `Mozilla/5.0` |
| `verbose` | Whether to enable verbose logging. | No | `true` |
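The parameters above can be passed to `run()` as keyword arguments. A minimal sketch follows; the selector and user-agent strings are illustrative assumptions, not recommended values:

```python
# Documented run() parameters collected as keyword arguments.
run_kwargs = {
    "url": "https://www.nbcnews.com/business",
    "css_selector": "article",  # assumption: target article blocks only
    "user_agent": "Mozilla/5.0 (compatible; MyCrawler/1.0)",  # illustrative UA string
    "verbose": True,
}
# result = crawler.run(**run_kwargs)  # with an initialized crawler instance
```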
## Chunking Strategies 📚