Enhance crawler capabilities and documentation
- Add llm.txt generator - Added SSL certificate extraction in AsyncWebCrawler. - Introduced new content filters and chunking strategies for more robust data extraction. - Updated documentation.
docs/llm.txt/5_markdown_generation.xs.md (new file, 87 lines)
```markdown

# Chunking Strategies

> Break large texts into manageable chunks for relevance and retrieval workflows.

Chunking enables segmentation for similarity-based retrieval and integration into RAG (retrieval-augmented generation) pipelines.

## Why Use Chunking?

- Prepare text for cosine similarity scoring
- Integrate into RAG systems
- Support multiple segmentation methods (regex, sentences, topics, fixed-length, sliding windows)

## Methods of Chunking

- **Regex-Based Chunking**: Splits text on patterns (e.g., `\n\n`)

```python
import re

class RegexChunking:
    def __init__(self, patterns=None):
        # Avoid a mutable default argument; split on blank lines by default.
        self.patterns = patterns if patterns is not None else [r'\n\n']

    def chunk(self, text):
        parts = [text]
        for pattern in self.patterns:
            parts = [seg for part in parts for seg in re.split(pattern, part)]
        return parts
```
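
As a quick sanity check, the splitting loop above can be exercised inline (the sample text here is made up for illustration):

```python
import re

# Mirror of RegexChunking.chunk: start with the whole text, then split
# every part on each pattern in turn.
text = "para one\n\npara two\n\npara three"
parts = [text]
for pattern in [r"\n\n"]:
    parts = [seg for part in parts for seg in re.split(pattern, part)]
print(parts)  # → ['para one', 'para two', 'para three']
```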

- **Sentence-Based Chunking**: Uses NLP (e.g., `nltk.sent_tokenize`) for sentence-level chunks

```python
from nltk.tokenize import sent_tokenize

class NlpSentenceChunking:
    def chunk(self, text):
        # Requires the NLTK 'punkt' tokenizer models:
        # run nltk.download('punkt') once before first use.
        return sent_tokenize(text)
```

- **Topic-Based Segmentation**: Leverages `TextTilingTokenizer` for topic-level segments

```python
from nltk.tokenize import TextTilingTokenizer

class TopicSegmentationChunking:
    def __init__(self):
        self.tokenizer = TextTilingTokenizer()

    def chunk(self, text):
        # TextTiling expects longer, multi-paragraph input (paragraphs
        # separated by blank lines) and the NLTK 'stopwords' corpus;
        # it raises on very short texts.
        return self.tokenizer.tokenize(text)
```

- **Fixed-Length Word Chunking**: Chunks by a fixed number of words

```python
class FixedLengthWordChunking:
    def __init__(self, chunk_size=100):
        self.chunk_size = chunk_size

    def chunk(self, text):
        words = text.split()
        return [' '.join(words[i:i + self.chunk_size])
                for i in range(0, len(words), self.chunk_size)]
```
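
For example, with a made-up 7-word input and `chunk_size=3`, the logic above (inlined here so the snippet runs standalone) leaves the remainder in a short final chunk:

```python
# Inlined version of FixedLengthWordChunking.chunk with chunk_size=3.
words = "alpha beta gamma delta epsilon zeta eta".split()
chunk_size = 3
chunks = [" ".join(words[i:i + chunk_size])
          for i in range(0, len(words), chunk_size)]
print(chunks)  # → ['alpha beta gamma', 'delta epsilon zeta', 'eta']
```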

- **Sliding Window Chunking**: Overlapping chunks for context retention

```python
class SlidingWindowChunking:
    def __init__(self, window_size=100, step=50):
        self.window_size = window_size
        self.step = step

    def chunk(self, text):
        words = text.split()
        return [' '.join(words[i:i + self.window_size])
                for i in range(0, max(len(words) - self.window_size + 1, 1), self.step)]
```
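
To see the overlap concretely, here is the same computation inlined on a made-up 10-word text with `window_size=5` and `step=3`: consecutive windows share `window_size - step = 2` words, and trailing words past the last full window are dropped:

```python
words = "one two three four five six seven eight nine ten".split()
window_size, step = 5, 3
# Windows start at i = 0, 3; each spans 5 words, so they overlap by 2.
chunks = [
    " ".join(words[i:i + window_size])
    for i in range(0, max(len(words) - window_size + 1, 1), step)
]
print(chunks)
# → ['one two three four five', 'four five six seven eight']
```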

## Combining Chunking with Cosine Similarity

- Extract relevant chunks based on a query

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

class CosineSimilarityExtractor:
    def __init__(self, query):
        self.query = query
        self.vectorizer = TfidfVectorizer()

    def find_relevant_chunks(self, chunks):
        # Vectorize the query together with the chunks so they share a vocabulary.
        X = self.vectorizer.fit_transform([self.query] + chunks)
        sims = cosine_similarity(X[0:1], X[1:]).flatten()
        return list(zip(chunks, sims))
```
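
Putting the pieces together, a minimal end-to-end sketch (sample text and query are made up for illustration) chunks on blank lines and ranks the chunks by TF-IDF cosine similarity, inlining the logic of the classes above so it runs standalone:

```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

text = (
    "Crawlers fetch pages and render them as markdown.\n\n"
    "Chunking breaks long documents into smaller pieces.\n\n"
    "Cosine similarity scores each piece against a query."
)
query = "break documents into pieces"

# Chunk on blank lines, then rank each chunk against the query.
chunks = [c for c in re.split(r"\n\n", text) if c.strip()]
X = TfidfVectorizer().fit_transform([query] + chunks)
sims = cosine_similarity(X[0:1], X[1:]).flatten()
ranked = sorted(zip(chunks, sims), key=lambda pair: pair[1], reverse=True)
print(ranked[0][0])  # the chunking sentence ranks first
```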

## Optional

- [chuncking_strategies.py](https://github.com/unclecode/crawl4ai/blob/main/crawl4ai/chuncking_strategies.py)
```