
Chunking Strategies
Chunking strategies divide large texts into manageable parts for effective content processing and extraction. They are foundational to cosine similarity-based extraction, which retrieves only the chunks most relevant to a given query, and they feed directly into RAG (Retrieval-Augmented Generation) systems for structured, scalable workflows.
Why Use Chunking?
1. Cosine Similarity and Query Relevance: prepares chunks for semantic similarity analysis, so only the most relevant segments are returned for a query.
2. RAG System Integration: chunks can be processed and stored for retrieval (a minimal sketch follows this list).
3. Structured Processing: supports diverse segmentation methods, such as sentence-based, topic-based, or windowed approaches.
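As a minimal sketch of the RAG integration point above: the snippet below ranks chunks with TF-IDF (standing in for a real embedding model) and assembles the top-scoring chunks into a prompt for a downstream generator. The retrieve_top_k and build_prompt helpers are illustrative assumptions, not part of any specific RAG framework.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def retrieve_top_k(query, chunks, k=3):
    # Rank chunks by cosine similarity to the query and keep the top k
    vectorizer = TfidfVectorizer()
    vectors = vectorizer.fit_transform([query] + chunks)
    scores = cosine_similarity(vectors[0:1], vectors[1:]).flatten()
    ranked = sorted(zip(chunks, scores), key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

def build_prompt(query, context_chunks):
    # Hypothetical helper: concatenate retrieved chunks as context for a generator
    context = "\n".join(context_chunks)
    return f"Context:\n{context}\n\nQuestion: {query}"

# Example Usage
chunks = ["Crawling collects raw pages.", "Chunking splits text into parts.", "Similarity ranks the parts."]
print(build_prompt("How does chunking work?", retrieve_top_k("chunking text", chunks, k=2)))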
Methods of Chunking
1. Regex-Based Chunking
Splits text based on regular expression patterns, useful for coarse segmentation.
Code Example:
import re

class RegexChunking:
    def __init__(self, patterns=None):
        self.patterns = patterns or [r'\n\n']  # Default pattern: split on blank lines (paragraph breaks)

    def chunk(self, text):
        paragraphs = [text]
        for pattern in self.patterns:
            # Split every current segment on the pattern and flatten the result
            paragraphs = [seg for p in paragraphs for seg in re.split(pattern, p)]
        return paragraphs

# Example Usage
text = """This is the first paragraph.

This is the second paragraph."""
chunker = RegexChunking()
print(chunker.chunk(text))
2. Sentence-Based Chunking
Divides text into sentences using NLP tools, ideal for extracting meaningful statements.
Code Example:
import nltk
from nltk.tokenize import sent_tokenize

nltk.download('punkt', quiet=True)  # Sentence tokenizer models (one-time download)

class NlpSentenceChunking:
    def chunk(self, text):
        # Split into sentences using NLTK's pretrained Punkt tokenizer
        sentences = sent_tokenize(text)
        return [sentence.strip() for sentence in sentences]

# Example Usage
text = "This is sentence one. This is sentence two."
chunker = NlpSentenceChunking()
print(chunker.chunk(text))
3. Topic-Based Segmentation
Uses algorithms like TextTiling to create topic-coherent chunks; most effective on longer, multi-paragraph documents.
Code Example:
import nltk
from nltk.tokenize import TextTilingTokenizer

nltk.download('stopwords', quiet=True)  # TextTiling relies on NLTK's stopword list

class TopicSegmentationChunking:
    def __init__(self):
        self.tokenizer = TextTilingTokenizer()

    def chunk(self, text):
        # TextTiling expects a reasonably long document with blank-line
        # paragraph breaks; very short inputs may raise a ValueError.
        return self.tokenizer.tokenize(text)

# Example Usage (substitute a long, multi-paragraph document for meaningful results)
text = """This is an introduction.

This is a detailed discussion on the topic."""
chunker = TopicSegmentationChunking()
print(chunker.chunk(text))
4. Fixed-Length Word Chunking
Segments text into chunks of a fixed word count.
Code Example:
class FixedLengthWordChunking:
    def __init__(self, chunk_size=100):
        self.chunk_size = chunk_size

    def chunk(self, text):
        # Split on whitespace and regroup into chunks of chunk_size words
        words = text.split()
        return [' '.join(words[i:i + self.chunk_size])
                for i in range(0, len(words), self.chunk_size)]

# Example Usage
text = "This is a long text with many words to be chunked into fixed sizes."
chunker = FixedLengthWordChunking(chunk_size=5)
print(chunker.chunk(text))
5. Sliding Window Chunking
Generates overlapping chunks for better contextual coherence.
Code Example:
class SlidingWindowChunking:
    def __init__(self, window_size=100, step=50):
        self.window_size = window_size
        self.step = step

    def chunk(self, text):
        words = text.split()
        # Texts shorter than one window are returned as a single chunk
        if len(words) <= self.window_size:
            return [' '.join(words)]
        chunks = []
        # Note: trailing words past the last full window are not emitted
        for i in range(0, len(words) - self.window_size + 1, self.step):
            chunks.append(' '.join(words[i:i + self.window_size]))
        return chunks

# Example Usage
text = "This is a long text to demonstrate sliding window chunking."
chunker = SlidingWindowChunking(window_size=5, step=2)
print(chunker.chunk(text))
Combining Chunking with Cosine Similarity
To enhance the relevance of extracted content, chunking strategies can be paired with cosine similarity techniques. Here’s an example workflow:
Code Example:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

class CosineSimilarityExtractor:
    def __init__(self, query):
        self.query = query
        self.vectorizer = TfidfVectorizer()

    def find_relevant_chunks(self, chunks):
        # Vectorize the query together with the chunks so they share one vocabulary
        vectors = self.vectorizer.fit_transform([self.query] + chunks)
        # Compare the query vector (row 0) against every chunk vector
        similarities = cosine_similarity(vectors[0:1], vectors[1:]).flatten()
        return [(chunks[i], similarities[i]) for i in range(len(chunks))]

# Example Workflow (reuses SlidingWindowChunking from above)
text = """This is a sample document. It has multiple sentences.
We are testing chunking and similarity."""
chunker = SlidingWindowChunking(window_size=5, step=3)
chunks = chunker.chunk(text)
query = "testing chunking"
extractor = CosineSimilarityExtractor(query)
relevant_chunks = extractor.find_relevant_chunks(chunks)
print(relevant_chunks)
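In practice you usually want only the best matches rather than every (chunk, score) pair; a short follow-up, assuming the relevant_chunks list from the workflow above:

# Keep the three highest-scoring chunks (the cutoff of 3 is arbitrary)
top_chunks = sorted(relevant_chunks, key=lambda pair: pair[1], reverse=True)[:3]
print(top_chunks)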