chore: Update NlpSentenceChunking constructor parameters to None
The `NlpSentenceChunking` constructor no longer takes parameters, which simplifies use of the class. Callers no longer need to specify the SpaCy model for sentence detection, so the code is more concise and easier to understand.
@@ -258,6 +258,8 @@ result = crawler.run(
 
 ### Extraction strategy: CosineStrategy
 
+So far, the extracted content is just the result of chunking. To extract meaningful content, you can use extraction strategies. These strategies cluster consecutive chunks into meaningful blocks, keeping the same order as the text in the HTML. This approach is well suited to RAG applications and semantic search queries.
+
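Editor's note: the clustering idea described in the added paragraph can be illustrated with a minimal sketch. It assumes a plain bag-of-words cosine similarity and a hypothetical `threshold` value, and it does not claim to reproduce how the library's `CosineStrategy` works internally. All function names here are illustrative.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercased word counts; a crude stand-in for a real embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster_consecutive(chunks, threshold=0.2):
    """Merge a chunk into the previous block when it resembles that block's last chunk.

    Order is preserved because only neighbouring chunks are ever compared.
    The threshold is an arbitrary value chosen for this illustration.
    """
    blocks = []
    for chunk in chunks:
        if blocks and cosine(bag_of_words(blocks[-1][-1]), bag_of_words(chunk)) >= threshold:
            blocks[-1].append(chunk)
        else:
            blocks.append([chunk])
    return [" ".join(block) for block in blocks]


chunks = [
    "Cats sleep most of the day.",
    "Most cats sleep in warm spots.",
    "Stock prices fell sharply today.",
]
blocks = cluster_consecutive(chunks)
print(blocks)
```

The first two chunks share enough words to merge into one block, while the third starts a new block, so the output keeps the original order of the text.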
 Using CosineStrategy:
 ```python
 result = crawler.run(
@@ -368,11 +370,11 @@ chunks = chunker.chunk("This is a sample text. It will be split into chunks.")
 `NlpSentenceChunking` uses a natural language processing model to chunk a given text into sentences. This approach leverages SpaCy to accurately split text based on sentence boundaries.
 
 **Constructor Parameters:**
-- `model` (str, optional): The SpaCy model to use for sentence detection. Default is `'en_core_web_sm'`.
+- None.
 
 **Example usage:**
 ```python
-chunker = NlpSentenceChunking(model='en_core_web_sm')
+chunker = NlpSentenceChunking()
 chunks = chunker.chunk("This is a sample text. It will be split into sentences.")
 ```
 
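Editor's note: to show the call shape of a chunker whose constructor takes no arguments, here is a minimal stand-in. `SentenceChunkerSketch` is a hypothetical class, not the library's `NlpSentenceChunking`, and it swaps the NLP model for a simple regular expression so that it runs with the standard library only.

```python
import re


class SentenceChunkerSketch:
    """Hypothetical stand-in for a parameterless sentence chunker.

    A regex replaces the NLP model here, so results on abbreviations or
    unusual punctuation will be worse than a real sentence detector.
    """

    def __init__(self):
        # No arguments: the splitting rule is fixed inside the class.
        self._boundary = re.compile(r"(?<=[.!?])\s+")

    def chunk(self, text):
        # Split on whitespace that follows sentence-ending punctuation,
        # then drop any empty pieces.
        return [s.strip() for s in self._boundary.split(text) if s.strip()]


chunker = SentenceChunkerSketch()
chunks = chunker.chunk("This is a sample text. It will be split into sentences.")
print(chunks)
```

Fixing the splitting rule inside the class is what lets the constructor stay empty; the trade-off is that callers cannot swap in a different model without changing the class.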
@@ -1,7 +1,7 @@
 {
 "RegexChunking": "### RegexChunking\n\n`RegexChunking` is a text chunking strategy that splits a given text into smaller parts using regular expressions.\nThis is useful for preparing large texts for processing by language models, ensuring they are divided into manageable segments.\n\n#### Constructor Parameters:\n- `patterns` (list, optional): A list of regular expression patterns used to split the text. Default is to split by double newlines (`['\\n\\n']`).\n\n#### Example usage:\n```python\nchunker = RegexChunking(patterns=[r'\\n\\n', r'\\. '])\nchunks = chunker.chunk(\"This is a sample text. It will be split into chunks.\")\n```",
 
-"NlpSentenceChunking": "### NlpSentenceChunking\n\n`NlpSentenceChunking` uses a natural language processing model to chunk a given text into sentences. This approach leverages SpaCy to accurately split text based on sentence boundaries.\n\n#### Constructor Parameters:\n- `model` (str, optional): The SpaCy model to use for sentence detection. Default is `'en_core_web_sm'`.\n\n#### Example usage:\n```python\nchunker = NlpSentenceChunking(model='en_core_web_sm')\nchunks = chunker.chunk(\"This is a sample text. It will be split into sentences.\")\n```",
+"NlpSentenceChunking": "### NlpSentenceChunking\n\n`NlpSentenceChunking` uses a natural language processing model to chunk a given text into sentences. This approach leverages SpaCy to accurately split text based on sentence boundaries.\n\n#### Constructor Parameters:\n- None.\n\n#### Example usage:\n```python\nchunker = NlpSentenceChunking()\nchunks = chunker.chunk(\"This is a sample text. It will be split into sentences.\")\n```",
 
 "TopicSegmentationChunking": "### TopicSegmentationChunking\n\n`TopicSegmentationChunking` uses the TextTiling algorithm to segment a given text into topic-based chunks. This method identifies thematic boundaries in the text.\n\n#### Constructor Parameters:\n- `num_keywords` (int, optional): The number of keywords to extract for each topic segment. Default is `3`.\n\n#### Example usage:\n```python\nchunker = TopicSegmentationChunking(num_keywords=3)\nchunks = chunker.chunk(\"This is a sample text. It will be split into topic-based segments.\")\n```",
 
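Editor's note: because the same documentation text is repeated in Markdown, JSON, and HTML, a small check might help catch a stale mention of the removed parameter. The sketch below assumes the JSON file maps strategy names to Markdown strings, as the hunk above suggests. The `raw` value is a trimmed, hypothetical copy of the updated entry, and `stale_lines` is an illustrative helper.

```python
import json

# Trimmed, hypothetical copy of the updated entry (the real value is longer).
raw = r'''
{
  "NlpSentenceChunking": "### NlpSentenceChunking\n\n#### Constructor Parameters:\n- None.\n\n#### Example usage:\nchunker = NlpSentenceChunking()"
}
'''

docs = json.loads(raw)


def stale_lines(markdown):
    """Return lines that still document or pass the removed `model` parameter."""
    return [
        line
        for line in markdown.splitlines()
        if "`model`" in line or "model=" in line
    ]


# Map each strategy name to any leftover references found in its text.
leftovers = {name: stale_lines(text) for name, text in docs.items()}
print(leftovers)
```

The check looks for the backticked parameter name and the keyword-argument form rather than the bare word "model", since the description itself legitimately mentions a natural language processing model.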
@@ -236,12 +236,11 @@ chunks = chunker.chunk("This is a sample text. It will be split into chunks.")
 <h4>Constructor Parameters:</h4>
 <ul>
 <li>
-<code>model</code> (str, optional): The SpaCy model to use for sentence detection. Default is
-<code>'en_core_web_sm'</code>.
+None.
 </li>
 </ul>
 <h4>Example usage:</h4>
-<pre><code class="language-python">chunker = NlpSentenceChunking(model='en_core_web_sm')
+<pre><code class="language-python">chunker = NlpSentenceChunking()
 chunks = chunker.chunk("This is a sample text. It will be split into sentences.")
 </code></pre>
 </div>