Enhance Crawl4AI with CLI and documentation updates - Implemented Command-Line Interface (CLI) in `crawl4ai/cli.py` - Added chunking strategies and their documentation in `llm.txt`
# Chunking Strategies
Chunking strategies divide large texts into manageable parts, enabling effective content processing and extraction. They are foundational to cosine similarity-based extraction, which lets users retrieve only the chunks most relevant to a given query, and they integrate directly into RAG (Retrieval-Augmented Generation) systems for structured, scalable workflows.
### Why Use Chunking?
1. **Cosine Similarity and Query Relevance**: Prepares chunks for semantic similarity analysis.
2. **RAG System Integration**: Seamlessly processes and stores chunks for retrieval.
3. **Structured Processing**: Allows for diverse segmentation methods, such as sentence-based, topic-based, or windowed approaches.
### Methods of Chunking
#### 1. Regex-Based Chunking
Splits text on regular-expression patterns, useful for coarse segmentation such as paragraph breaks.

**Code Example**:
```python
import re

class RegexChunking:
    def __init__(self, patterns=None):
        self.patterns = patterns or [r'\n\n']  # Default: split on blank lines (paragraphs)

    def chunk(self, text):
        paragraphs = [text]
        for pattern in self.patterns:
            paragraphs = [seg for p in paragraphs for seg in re.split(pattern, p)]
        return paragraphs

# Example Usage
text = """This is the first paragraph.

This is the second paragraph."""
chunker = RegexChunking()
print(chunker.chunk(text))
```
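Because `patterns` accepts any list of regular expressions, the same splitting loop can segment on custom boundaries. A minimal standalone sketch (the heading pattern and `regex_chunk` helper here are illustrative, not part of Crawl4AI):

```python
import re

# Standalone variant of the splitting loop above: apply each pattern in turn.
def regex_chunk(text, patterns):
    chunks = [text]
    for pattern in patterns:
        chunks = [seg for c in chunks for seg in re.split(pattern, c)]
    return chunks

# Split on Markdown-style headings instead of blank lines.
doc = "# Intro\nSome intro text.\n# Details\nMore detailed text."
print(regex_chunk(doc, [r'\n(?=# )']))
```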
#### 2. Sentence-Based Chunking
Divides text into sentences using NLP tools, ideal for extracting self-contained statements.

**Code Example**:
```python
from nltk.tokenize import sent_tokenize
# Requires the NLTK sentence model: nltk.download('punkt')

class NlpSentenceChunking:
    def chunk(self, text):
        sentences = sent_tokenize(text)
        return [sentence.strip() for sentence in sentences]

# Example Usage
text = "This is sentence one. This is sentence two."
chunker = NlpSentenceChunking()
print(chunker.chunk(text))
```
#### 3. Topic-Based Segmentation
Uses algorithms such as TextTiling to produce topic-coherent chunks.

**Code Example**:
```python
from nltk.tokenize import TextTilingTokenizer
# Requires the NLTK stopwords corpus: nltk.download('stopwords')

class TopicSegmentationChunking:
    def __init__(self):
        self.tokenizer = TextTilingTokenizer()

    def chunk(self, text):
        return self.tokenizer.tokenize(text)

# Example Usage
# Note: TextTiling expects paragraph-structured input (blank lines between
# paragraphs) and works best on longer documents; very short texts may raise
# an error because there is too little material to score topic boundaries.
text = """This is an introduction.

This is a detailed discussion on the topic."""
chunker = TopicSegmentationChunking()
print(chunker.chunk(text))
```
#### 4. Fixed-Length Word Chunking
Segments text into chunks of a fixed word count.

**Code Example**:
```python
class FixedLengthWordChunking:
    def __init__(self, chunk_size=100):
        self.chunk_size = chunk_size

    def chunk(self, text):
        words = text.split()
        return [' '.join(words[i:i + self.chunk_size])
                for i in range(0, len(words), self.chunk_size)]

# Example Usage
text = "This is a long text with many words to be chunked into fixed sizes."
chunker = FixedLengthWordChunking(chunk_size=5)
print(chunker.chunk(text))
```
#### 5. Sliding Window Chunking
Generates overlapping chunks, preserving context across chunk boundaries.

**Code Example**:
```python
class SlidingWindowChunking:
    def __init__(self, window_size=100, step=50):
        self.window_size = window_size
        self.step = step

    def chunk(self, text):
        words = text.split()
        if len(words) <= self.window_size:
            return [text]  # Shorter than one window: return the text whole
        chunks = []
        # Note: trailing words are dropped when (len(words) - window_size)
        # is not a multiple of step.
        for i in range(0, len(words) - self.window_size + 1, self.step):
            chunks.append(' '.join(words[i:i + self.window_size]))
        return chunks

# Example Usage
text = "This is a long text to demonstrate sliding window chunking."
chunker = SlidingWindowChunking(window_size=5, step=2)
print(chunker.chunk(text))
```
### Combining Chunking with Cosine Similarity
To improve the relevance of extracted content, chunking strategies can be paired with cosine similarity. An example workflow, reusing the `SlidingWindowChunking` class from above:

**Code Example**:
```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

class CosineSimilarityExtractor:
    def __init__(self, query):
        self.query = query
        self.vectorizer = TfidfVectorizer()

    def find_relevant_chunks(self, chunks):
        # Vectorize the query together with the chunks so they share one vocabulary
        vectors = self.vectorizer.fit_transform([self.query] + chunks)
        similarities = cosine_similarity(vectors[0:1], vectors[1:]).flatten()
        return [(chunks[i], similarities[i]) for i in range(len(chunks))]

# Example Workflow
text = """This is a sample document. It has multiple sentences.
We are testing chunking and similarity."""

chunker = SlidingWindowChunking(window_size=5, step=3)
chunks = chunker.chunk(text)
query = "testing chunking"
extractor = CosineSimilarityExtractor(query)
relevant_chunks = extractor.find_relevant_chunks(chunks)

print(relevant_chunks)
```
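The extractor above returns every chunk with its score. For RAG-style retrieval you usually want only the top-k matches; a minimal sketch of that extension (the `top_k_chunks` helper is illustrative, not part of Crawl4AI):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def top_k_chunks(query, chunks, k=2):
    """Return the k chunks most similar to the query, highest score first."""
    vectorizer = TfidfVectorizer()
    vectors = vectorizer.fit_transform([query] + chunks)
    similarities = cosine_similarity(vectors[0:1], vectors[1:]).flatten()
    ranked = sorted(zip(chunks, similarities), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Example Usage
chunks = [
    "Chunking splits documents into smaller parts.",
    "The weather today is sunny.",
    "Cosine similarity ranks chunks against a query.",
]
for chunk, score in top_k_chunks("rank chunks by similarity", chunks):
    print(f"{score:.2f}  {chunk}")
```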