Files
crawl4ai/docs/llm.txt/6_chunking_strategies.q.md
UncleCode d5ed451299 Enhance crawler capabilities and documentation
- Add llm.txt generator
  - Added SSL certificate extraction in AsyncWebCrawler.
  - Introduced new content filters and chunking strategies for more robust data extraction.
  - Updated documentation.
2024-12-25 21:34:31 +08:00

1.7 KiB

chunking_overview: Chunking strategies divide large texts into manageable parts for content processing and extraction | text segmentation, content division, document splitting | None cosine_similarity_integration: Chunking prepares text segments for semantic similarity analysis using cosine similarity | semantic search, relevance matching | from sklearn.metrics.pairwise import cosine_similarity rag_integration: Chunks can be integrated into RAG (Retrieval-Augmented Generation) systems for structured workflows | retrieval augmented generation, RAG pipeline | None regex_chunking: Split text using regular expression patterns for basic segmentation | regex splitting, pattern-based chunking | RegexChunking(patterns=[r'\n\n']) sentence_chunking: Divide text into individual sentences using NLP tools | sentence tokenization, NLP chunking | from nltk.tokenize import sent_tokenize topic_chunking: Create topic-coherent chunks using TextTiling algorithm | topic segmentation, TextTiling | from nltk.tokenize import TextTilingTokenizer fixed_length_chunking: Segment text into chunks with fixed word count | word-based chunking, fixed size segments | FixedLengthWordChunking(chunk_size=100) sliding_window_chunking: Generate overlapping chunks for better context preservation | overlapping segments, windowed chunking | SlidingWindowChunking(window_size=100, step=50) cosine_similarity_extraction: Extract relevant chunks using TF-IDF and cosine similarity comparison | similarity search, relevance extraction | from sklearn.feature_extraction.text import TfidfVectorizer chunking_workflow: Combine chunking with cosine similarity for enhanced content retrieval | content extraction, similarity workflow | CosineSimilarityExtractor(query).find_relevant_chunks(chunks)