Update README file

This commit is contained in:
unclecode
2024-05-09 22:48:42 +08:00
parent a8e7218769
commit f74f4e88c0

README.md

@@ -46,9 +46,53 @@ pip install -e .
```python
from crawl4ai.web_crawler import WebCrawler
from crawl4ai.models import UrlModel
import os

crawler = WebCrawler(db_path='crawler_data.db')
```
a. Fetch a single page:
```python
single_url = UrlModel(url='https://kidocode.com', forced=False)
result = crawler.fetch_page(
    single_url,
    provider="openai/gpt-3.5-turbo",
    api_token=os.getenv('OPENAI_API_KEY'),
    extract_blocks_flag=True,
    word_count_threshold=5  # Minimum word count for an HTML tag to be considered a worthy block
)
print(result.model_dump())
```
b. Fetch multiple pages:
```python
urls = [
    UrlModel(url='http://example.com', forced=False),
    UrlModel(url='http://example.org', forced=False)
]
results = crawler.fetch_pages(
    urls,
    provider="openai/gpt-3.5-turbo",
    api_token=os.getenv('OPENAI_API_KEY'),
    extract_blocks_flag=True,
    word_count_threshold=5
)
for res in results:
    print(res.model_dump())
```
The response model is a `CrawlResult` object that contains the following attributes:
```python
class CrawlResult(BaseModel):
    url: str
    html: str
    success: bool
    cleaned_html: str = None
    markdown: str = None
    parsed_json: str = None
    error_message: str = None
```
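Since `fetch_page` and `fetch_pages` return objects whose `model_dump()` yields exactly these fields, a small helper can flatten a dump for downstream use. This is an illustrative sketch, not part of the Crawl4AI API; `summarize_result` is a hypothetical name operating on the `model_dump()` dict:

```python
def summarize_result(result: dict) -> dict:
    """Reduce a CrawlResult dump to the fields most pipelines need.

    Expects the dict shape produced by CrawlResult.model_dump().
    """
    if not result.get("success"):
        # Surface the failure instead of silently passing empty content on.
        return {"url": result.get("url"), "error": result.get("error_message")}
    return {
        "url": result["url"],
        "markdown": result.get("markdown"),
        "parsed_json": result.get("parsed_json"),
    }

# Stubbed successful dump standing in for fetch_page() output:
dump = {"url": "http://example.com", "html": "<html></html>", "success": True,
        "cleaned_html": "<p>hi</p>", "markdown": "hi", "parsed_json": "[]",
        "error_message": None}
print(summarize_result(dump))
# → {'url': 'http://example.com', 'markdown': 'hi', 'parsed_json': '[]'}
```

Checking `success` first matters because failed crawls still return a `CrawlResult`, just with the content fields left as `None`.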
### Running Crawl4AI as a Local Server 🚀
@@ -82,20 +126,13 @@ docker run -d -p 8000:80 crawl4ai
6. Access the application at `http://localhost:8000`.
- CURL Example:
```sh
curl -X POST -H "Content-Type: application/json" -d '{"urls":["https://techcrunch.com/"],"provider_model":"openai/gpt-3.5-turbo","api_token":"your_api_token","include_raw_html":true,"forced":false,"extract_blocks":true,"word_count_threshold":10}' http://localhost:8000/crawl
```
**Set the api_token to your OpenAI API key or any other provider you are using.**
- Python Example:
```python
import requests
import os
# ...
```
@@ -133,58 +170,9 @@ The response from the server includes the parsed JSON, cleaned HTML, and markdow
Make sure to replace `"http://localhost:8000/crawl"` with the appropriate server URL if your Crawl4AI server is running on a different host or port.
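As a self-contained counterpart to the CURL example, the same request can be issued with `requests`. This is a sketch assuming the local server from the Docker setup above; the payload mirrors the CURL body, and `crawl` is an illustrative wrapper name, not a library function:

```python
import os

import requests

# Payload mirroring the CURL example above.
payload = {
    "urls": ["https://techcrunch.com/"],
    "provider_model": "openai/gpt-3.5-turbo",
    "api_token": os.getenv("OPENAI_API_KEY", "your_api_token"),
    "include_raw_html": True,
    "forced": False,
    "extract_blocks": True,
    "word_count_threshold": 10,
}

def crawl(server_url="http://localhost:8000/crawl"):
    # POST the JSON body and return the decoded JSON response.
    resp = requests.post(server_url, json=payload, timeout=120)
    resp.raise_for_status()
    return resp.json()

if __name__ == "__main__":
    print(crawl())
```

Using `json=payload` lets `requests` serialize the body and set the `Content-Type: application/json` header for you, matching what the CURL command does by hand.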
Choose the approach that best suits your needs. If you want to integrate Crawl4AI into your existing Python projects, installing it as a library is the way to go. If you prefer to run Crawl4AI as a standalone service and interact with it via API endpoints, running it as a local server using Docker is the recommended approach.
**Make sure to check `config.py` to set the required environment variables.**
That's it! You can now integrate Crawl4AI into your Python projects and leverage its web crawling capabilities. 🎉