chore: Update pip installation command and requirements, add new dependencies

unclecode
2024-05-17 16:53:03 +08:00
parent d7b37e849d
commit 1cc67df301
5 changed files with 46 additions and 60 deletions

View File

@@ -22,32 +22,26 @@ Crawl4AI has one clear task: to simplify crawling and extract useful information
## Power and Simplicity of Crawl4AI 🚀
Crawl4AI makes even complex web crawling tasks simple and intuitive. To show this simplicity, take a look at the first example:
**Example Task:**
```python
from crawl4ai import WebCrawler
# Create the WebCrawler instance
crawler = WebCrawler()
# Run the crawler on a URL
result = crawler.run(url="https://www.example.com")
print(result) # {url, html, markdown, extracted_content, metadata}
```
Now let's try a complex task. Below is an example of how you can execute JavaScript, filter data using keywords, and use a CSS selector to extract specific content—all in one go!
1. Instantiate a WebCrawler object.
2. Execute custom JavaScript to click a "Load More" button.
-3. Filter the data to include only content related to "technology".
+3. Extract semantic chunks of content and filter the data to include only content related to technology.
4. Use a CSS selector to extract only paragraphs (`<p>` tags).
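The numbered steps above map onto a single `crawler.run(...)` call. Here is a minimal sketch of just the parameters; the keyword names (`js`, `semantic_filter`, `css_selector`) are assumptions about this version's API rather than something this page confirms:

```python
# Sketch of the parameters for the complex crawl described above.
# NOTE: the keyword names (js, semantic_filter, css_selector) are
# assumptions, not confirmed by this page.

# Step 2: custom JavaScript that clicks a "Load More" button.
js_code = (
    "const btn = Array.from(document.querySelectorAll('button'))"
    ".find(b => b.textContent.includes('Load More'));"
    "if (btn) btn.click();"
)

run_params = {
    "url": "https://www.example.com",  # step 1 target (placeholder URL)
    "js": js_code,                     # step 2: execute custom JavaScript
    "semantic_filter": "technology",   # step 3: keyword-based filtering
    "css_selector": "p",               # step 4: keep only <p> tags
}

# With crawl4ai installed, the call would then be roughly:
# result = crawler.run(**run_params)
```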
**Example Code:**
First, install the package:
```bash
virtualenv venv
source venv/bin/activate
# Install Crawl4AI
pip install git+https://github.com/unclecode/crawl4ai.git
```
Run the following command to load the required models. This is optional, but it will boost the performance and speed of the crawler. You need to do this only once.
```bash
crawl4ai-download-models
```
Now, you can run the following code:
```python
# Import necessary modules
from crawl4ai import WebCrawler
@@ -137,7 +131,7 @@ To install Crawl4AI as a library, follow these steps:
```bash
virtualenv venv
source venv/bin/activate
-pip install git+https://github.com/unclecode/crawl4ai.git
+pip install "crawl4ai[all] @ git+https://github.com/unclecode/crawl4ai.git"
```
💡 It's better to run the following CLI command to load the required models. This is optional, but it will boost the performance and speed of the crawler. You only need to do this once.
@@ -150,12 +144,12 @@ virtualenv venv
source venv/bin/activate
git clone https://github.com/unclecode/crawl4ai.git
cd crawl4ai
-pip install -e .
+pip install -e .[all]
```
3. Use Docker to run the local server:
```bash
docker build -t crawl4ai .
# For Mac users
# docker build --platform linux/amd64 -t crawl4ai .
docker run -d -p 8000:80 crawl4ai
```
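Once the container is up, the crawl endpoint is reachable on the host's port 8000 (the `-p 8000:80` mapping above). As a sketch, here is how a request body could be assembled with the parameter names documented on this page; no request is actually sent:

```python
import json

# The docker run above maps host port 8000 to the container's port 80.
base_url = "http://localhost:8000"

# Minimal request body using the documented parameter names.
payload = {
    "urls": ["https://www.nbcnews.com/business"],
    "include_raw_html": False,
    "bypass_cache": False,
    "word_count_threshold": 5,
}
body = json.dumps(payload)

# With the container running, the request itself would be roughly:
# import requests
# response = requests.post(f"{base_url}/crawl", data=body,
#                          headers={"Content-Type": "application/json"})
```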
@@ -349,7 +343,7 @@ result = crawler.run(url="https://www.nbcnews.com/business")
| `include_raw_html` | Whether to include the raw HTML content in the response. | No | `false` |
| `bypass_cache` | Whether to force a fresh crawl even if the URL has been previously crawled. | No | `false` |
| `word_count_threshold`| The minimum number of words a block must contain to be considered meaningful (minimum value is 5). | No | `5` |
-| `extraction_strategy` | The strategy to use for extracting content from the HTML (e.g., "CosineStrategy"). | No | `CosineStrategy` |
+| `extraction_strategy` | The strategy to use for extracting content from the HTML (e.g., "CosineStrategy"). | No | `NoExtractionStrategy` |
| `chunking_strategy` | The strategy to use for chunking the text before processing (e.g., "RegexChunking"). | No | `RegexChunking` |
| `css_selector` | The CSS selector to target specific parts of the HTML for extraction. | No | `None` |
| `verbose` | Whether to enable verbose logging. | No | `true` |
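As a quick sanity check, the defaults in the table can be captured in a small helper that merges user overrides onto them. This is a sketch: only the names and defaults listed above are used, and the helper itself (`build_request`) is hypothetical, not part of the API:

```python
# Defaults exactly as documented in the parameter table above.
CRAWL_DEFAULTS = {
    "include_raw_html": False,
    "bypass_cache": False,
    "word_count_threshold": 5,
    "extraction_strategy": "NoExtractionStrategy",
    "chunking_strategy": "RegexChunking",
    "css_selector": None,
    "verbose": True,
}

def build_request(urls, **overrides):
    """Hypothetical helper: merge user overrides onto the documented defaults."""
    unknown = set(overrides) - set(CRAWL_DEFAULTS)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    # The table states the minimum word_count_threshold is 5.
    if overrides.get("word_count_threshold", 5) < 5:
        raise ValueError("word_count_threshold minimum is 5")
    return {"urls": list(urls), **CRAWL_DEFAULTS, **overrides}

req = build_request(["https://www.example.com"], css_selector="p")
```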

View File

@@ -170,22 +170,11 @@ def main():
cprint("If this is the first time you're running Crawl4ai, this might take a few seconds to load required model files.")
crawler = create_crawler()
-cprint("For the rest of this guide, I set crawler.always_by_pass_cache to True to force the crawler to bypass the cache. This is to ensure that we get fresh results for each run.")
-crawler.always_by_pass_cache = True
-cprint("\n🔍 [bold cyan]Time to explore another chunking strategy: NlpSentenceChunking![/bold cyan]", True)
-cprint("NlpSentenceChunking uses NLP techniques to split the text into sentences. Let's see how it performs!")
-result = crawler.run(
-url="https://www.nbcnews.com/business",
-chunking_strategy=NlpSentenceChunking()
-)
-cprint("[LOG] 📦 [bold yellow]NlpSentenceChunking result:[/bold yellow]")
-print_result(result)
basic_usage(crawler)
understanding_parameters(crawler)
crawler.always_by_pass_cache = True
add_chunking_strategy(crawler)
add_extraction_strategy(crawler)
add_llm_extraction_strategy(crawler)

View File

@@ -69,9 +69,12 @@ axios
// Handle crawl button click
document.getElementById("crawl-btn").addEventListener("click", () => {
// validate input to have both URL and API token
-if (!document.getElementById("url-input").value || !document.getElementById("token-input").value) {
-alert("Please enter both URL(s) and API token.");
-return;
+// if selected extraction strategy is LLMExtractionStrategy, then API token is required
+if (document.getElementById("extraction-strategy-select").value === "LLMExtractionStrategy") {
+if (!document.getElementById("url-input").value || !document.getElementById("token-input").value) {
+alert("Please enter both URL(s) and API token.");
+return;
+}
}
const selectedProviderModel = document.getElementById("provider-model-select").value;
@@ -87,8 +90,6 @@ document.getElementById("crawl-btn").addEventListener("click", () => {
const urls = urlsInput.split(",").map((url) => url.trim());
const data = {
urls: urls,
-provider_model: selectedProviderModel,
-api_token: apiToken,
include_raw_html: true,
bypass_cache: bypassCache,
extract_blocks: extractBlocks,
@@ -112,8 +113,8 @@ document.getElementById("crawl-btn").addEventListener("click", () => {
localStorage.setItem("api_token", document.getElementById("token-input").value);
document.getElementById("loading").classList.remove("hidden");
-document.getElementById("result").classList.add("hidden");
-document.getElementById("code_help").classList.add("hidden");
+document.getElementById("result").style.visibility = "hidden";
+document.getElementById("code_help").style.visibility = "hidden";
axios
.post("/crawl", data)
@@ -128,18 +129,20 @@ document.getElementById("crawl-btn").addEventListener("click", () => {
const extractionStrategy = data.extraction_strategy;
const isLLMExtraction = extractionStrategy === "LLMExtractionStrategy";
+// REMOVE API TOKEN FROM CODE EXAMPLES
+data.extraction_strategy_args.api_token = "your_api_token";
document.getElementById(
"curl-code"
).textContent = `curl -X POST -H "Content-Type: application/json" -d '${JSON.stringify({
...data,
-api_token: isLLMExtraction ? "your_api_token" : undefined,
-})}' http://crawl4ai.uccode.io/crawl`;
+}, null, 2)}' http://crawl4ai.com/crawl`;
document.getElementById("python-code").textContent = `import requests\n\ndata = ${JSON.stringify(
{ ...data, api_token: isLLMExtraction ? "your_api_token" : undefined },
null,
2
-)}\n\nresponse = requests.post("http://crawl4ai.uccode.io/crawl", json=data) # OR local host if your run locally \nprint(response.json())`;
+)}\n\nresponse = requests.post("http://crawl4ai.com/crawl", json=data) # OR local host if your run locally \nprint(response.json())`;
document.getElementById(
"nodejs-code"
@@ -147,7 +150,7 @@ document.getElementById("crawl-btn").addEventListener("click", () => {
{ ...data, api_token: isLLMExtraction ? "your_api_token" : undefined },
null,
2
-)};\n\naxios.post("http://crawl4ai.uccode.io/crawl", data) // OR local host if your run locally \n .then(response => console.log(response.data))\n .catch(error => console.error(error));`;
+)};\n\naxios.post("http://crawl4ai.com/crawl", data) // OR local host if your run locally \n .then(response => console.log(response.data))\n .catch(error => console.error(error));`;
document.getElementById(
"library-code"
@@ -169,8 +172,8 @@ document.getElementById("crawl-btn").addEventListener("click", () => {
document.getElementById("loading").classList.add("hidden");
-document.getElementById("result").classList.remove("hidden");
-document.getElementById("code_help").classList.remove("hidden");
+document.getElementById("result").style.visibility = "visible";
+document.getElementById("code_help").style.visibility = "visible";
// increment the total count
document.getElementById("total-count").textContent =

View File

@@ -29,7 +29,7 @@
class="bg-zinc-800 p-4 rounded mt-2 text-zinc-100"
><code>virtualenv venv
source venv/bin/activate
-pip install git+https://github.com/unclecode/crawl4ai.git
+pip install "crawl4ai[all] @ git+https://github.com/unclecode/crawl4ai.git"
</code></pre>
</li>
<li class="mb-4">
@@ -46,7 +46,7 @@ pip install git+https://github.com/unclecode/crawl4ai.git
source venv/bin/activate
git clone https://github.com/unclecode/crawl4ai.git
cd crawl4ai
-pip install -e .
+pip install -e .[all]
</code></pre>
</li>
<li class="">

View File

@@ -46,9 +46,9 @@
id="extraction-strategy-select"
class="border border-zinc-700 rounded px-4 py-1 bg-zinc-900 text-zinc-300"
>
+<option value="NoExtractionStrategy" selected>NoExtractionStrategy</option>
<option value="CosineStrategy">CosineStrategy</option>
<option value="LLMExtractionStrategy">LLMExtractionStrategy</option>
-<option value="NoExtractionStrategy">NoExtractionStrategy</option>
</select>
</div>
<div class="flex flex-col">
@@ -99,7 +99,7 @@
</div>
<div class="flex gap-2">
<!-- Add two textarea one for getting Keyword Filter and another one Instruction, make both grow whole with-->
-<div id = "semantic_filter_div" class="flex flex-col flex-1">
+<div id = "semantic_filter_div" class="flex flex-col flex-1 hidden">
<label for="keyword-filter" class="text-lime-500 font-bold text-xs">Keyword Filter</label>
<textarea
id="semantic_filter"
@@ -131,10 +131,10 @@
</div>
</div>
-<div id="loading" class="hidden">
-<p class="text-white">Loading... Please wait.</p>
-</div>
<div id="result" class="flex-1">
+<div id="loading" class="hidden">
+<p class="text-white">Loading... Please wait.</p>
+</div>
<div class="tab-buttons flex gap-2">
<button class="tab-btn px-4 py-1 text-sm bg-zinc-700 rounded-t text-lime-500" data-tab="json">
JSON
@@ -181,19 +181,19 @@
</button> -->
</div>
<div class="tab-content result bg-zinc-900 p-2 rounded h-full border border-zinc-700 text-sm">
-<pre class="h-full flex relative">
+<pre class="h-full flex relative overflow-x-auto">
<code id="curl-code" class="language-bash"></code>
<button class="absolute top-2 right-2 bg-zinc-700 text-white px-2 py-1 rounded copy-btn" data-target="curl-code">Copy</button>
</pre>
-<pre class="hidden h-full flex relative">
+<pre class="hidden h-full flex relative overflow-x-auto">
<code id="python-code" class="language-python"></code>
<button class="absolute top-2 right-2 bg-zinc-700 text-white px-2 py-1 rounded copy-btn" data-target="python-code">Copy</button>
</pre>
-<pre class="hidden h-full flex relative">
+<pre class="hidden h-full flex relative overflow-x-auto">
<code id="nodejs-code" class="language-javascript"></code>
<button class="absolute top-2 right-2 bg-zinc-700 text-white px-2 py-1 rounded copy-btn" data-target="nodejs-code">Copy</button>
</pre>
-<pre class="hidden h-full flex relative">
+<pre class="hidden h-full flex relative overflow-x-auto">
<code id="library-code" class="language-python"></code>
<button class="absolute top-2 right-2 bg-zinc-700 text-white px-2 py-1 rounded copy-btn" data-target="library-code">Copy</button>
</pre>