From 6a6365ae0ac93b8a5e474364c6e7fa6cdd4b44f5 Mon Sep 17 00:00:00 2001
From: unclecode
Date: Thu, 16 May 2024 18:10:55 +0800
Subject: [PATCH] Refactor code to exclude the extraction of semantical blocks of text from the HTML

---
 README.md | 1 -
 1 file changed, 1 deletion(-)

diff --git a/README.md b/README.md
index 6740b4e2..183b7ba5 100644
--- a/README.md
+++ b/README.md
@@ -322,7 +322,6 @@ result = crawler.run(url="https://www.nbcnews.com/business")
 | `urls` | A list of URLs to crawl and extract data from. | Yes | - |
 | `include_raw_html` | Whether to include the raw HTML content in the response. | No | `false` |
 | `bypass_cache` | Whether to force a fresh crawl even if the URL has been previously crawled. | No | `false` |
-| `extract_blocks` | Whether to extract semantical blocks of text from the HTML. | No | `true` |
 | `word_count_threshold`| The minimum number of words a block must contain to be considered meaningful (minimum value is 5). | No | `5` |
 | `extraction_strategy` | The strategy to use for extracting content from the HTML (e.g., "CosineStrategy"). | No | `CosineStrategy` |
 | `chunking_strategy` | The strategy to use for chunking the text before processing (e.g., "RegexChunking"). | No | `RegexChunking` |