Text Splitters
The SentenceSplitter class is a simple wrapper around the sentence_splitter library, providing an easy way to split text into overlapping chunks of sentences.
Usage
from mainframe_orchestra.tools import SentenceSplitter

splitter = SentenceSplitter(language='en')
chunks = splitter.split_text_by_sentences(
    text="Your long text here...",
    chunk_size=5,
    overlap=1
)
Parameters
- language: language code used for sentence segmentation (e.g., 'en').
- text: the input text to split.
- chunk_size: the number of sentences per chunk.
- overlap: the number of sentences shared between consecutive chunks.
Return Value
Returns a list of strings, where each string is a chunk of text containing the specified number of sentences.
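For example, with chunk_size=2 and overlap=1, consecutive chunks share one sentence (illustrative values; the exact chunk boundaries and whitespace handling depend on the implementation):

from mainframe_orchestra.tools import SentenceSplitter

splitter = SentenceSplitter(language='en')
chunks = splitter.split_text_by_sentences(
    text="One. Two. Three. Four.",
    chunk_size=2,
    overlap=1
)
# Roughly: ["One. Two.", "Two. Three.", "Three. Four."]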
Note
For more advanced usage or language-specific options, consider using the sentence_splitter library directly.
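For instance, plain sentence segmentation without chunking looks like this (a minimal sketch using the sentence_splitter package's documented interface, aliased here to avoid clashing with Orchestra's class of the same name):

from sentence_splitter import SentenceSplitter as RawSentenceSplitter

raw_splitter = RawSentenceSplitter(language='en')
sentences = raw_splitter.split(text="This is a paragraph. It contains several sentences.")
# -> ['This is a paragraph.', 'It contains several sentences.']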
SemanticSplitter
The SemanticSplitter class provides a sophisticated method for splitting text into semantically coherent chunks. This tool is particularly useful for processing large texts, preparing data for summarization, or creating more manageable segments for further NLP tasks. It uses sentence embeddings and community detection algorithms to group similar sentences together.
Class Methods
chunk_text()
This is the main static method of the SemanticSplitter class. It takes a text input and returns a list of semantically coherent chunks.
@staticmethod
def chunk_text(
    text: str,
    rearrange: bool = False,
    embedding_provider: str = "openai",
    embedding_model: str = "text-embedding-3-small"
) -> List[str]:
Parameters:
- text: the input text to chunk.
- rearrange: if True, sentences may be regrouped by semantic similarity rather than kept in their original order (default: False).
- embedding_provider: the embedding service to use (default: "openai").
- embedding_model: the name of the embedding model (default: "text-embedding-3-small").
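Putting the defaults together, a minimal call looks like this (assuming SemanticSplitter is exported from mainframe_orchestra.tools alongside SentenceSplitter):

from mainframe_orchestra.tools import SemanticSplitter

chunks = SemanticSplitter.chunk_text("Your long text here...")
for chunk in chunks:
    print(chunk)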
Usage Notes
To use the SemanticSplitter class, you need to have the necessary API keys set up in your environment variables for the chosen embedding provider.
The class uses the sentence_splitter library for initial text segmentation and the igraph and leidenalg libraries for community detection. Make sure these dependencies are installed in your environment.
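A quick pre-flight check might look like this (illustrative; OPENAI_API_KEY is the conventional variable name for the default OpenAI provider, and the graph libraries are installed as the python-igraph and leidenalg packages):

import os

# Confirm the community-detection dependencies are importable.
import igraph
import leidenalg

# Fail early if the key for the default embedding provider is missing.
if not os.environ.get("OPENAI_API_KEY"):
    raise RuntimeError("Set OPENAI_API_KEY before calling SemanticSplitter.chunk_text")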
The SemanticSplitter process involves creating sentence segments, embedding them, detecting communities using graph algorithms, and finally creating chunks from these communities.
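As a rough illustration of that pipeline, here is a minimal sketch (not the library's internal code; the embeddings are taken as given, and the 0.5 similarity threshold is an assumption):

import igraph as ig
import leidenalg
import numpy as np

def sketch_chunks(sentences: list, embeddings: np.ndarray, threshold: float = 0.5) -> list:
    # Cosine similarity between every pair of sentence embeddings.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = normed @ normed.T

    # Build a weighted graph: vertices are sentences, edges connect
    # sufficiently similar pairs.
    edges, weights = [], []
    for i in range(len(sentences)):
        for j in range(i + 1, len(sentences)):
            if sim[i, j] > threshold:
                edges.append((i, j))
                weights.append(float(sim[i, j]))
    graph = ig.Graph(n=len(sentences), edges=edges)
    graph.es["weight"] = weights

    # Leiden community detection groups semantically related sentences.
    partition = leidenalg.find_partition(
        graph, leidenalg.ModularityVertexPartition, weights="weight"
    )

    # Each community becomes one chunk, preserving original sentence order.
    return [" ".join(sentences[i] for i in sorted(community)) for community in partition]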
The class includes a method to split oversized communities, which can help manage very large chunks of text and ensure more balanced output.
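One simple capping strategy looks like this (a sketch; the library's actual splitting logic may differ):

def split_oversized(community: list, max_size: int = 20) -> list:
    # Break a list of sentence indices into slices of at most max_size.
    return [community[i:i + max_size] for i in range(0, len(community), max_size)]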
Basic error handling is implemented, such as returning a single community for very short inputs.
The SemanticSplitter can be particularly useful when working with large text strings that need to be processed in smaller, semantically coherent chunks. This can improve the performance of downstream NLP tasks such as summarization, question-answering, or topic modeling.
Advanced Usage
The SemanticSplitter offers several parameters that can be tuned for optimal performance:
- The rearrange parameter (default False) can be enabled to regroup sentences by semantic similarity instead of preserving their original order.
- Different embedding providers and models can be used by specifying the embedding_provider and embedding_model parameters.
# Example of advanced usage with custom parameters
chunks = SemanticSplitter.chunk_text(
    text=large_text,
    rearrange=True,
    embedding_provider="cohere",
    embedding_model="embed-english-v3.0"
)
This advanced usage demonstrates how to customize the chunking process for specific needs, such as using a different embedding provider or adjusting the chunking parameters.
Performance Considerations
The SemanticSplitter can be computationally intensive, especially for very large texts. The performance is primarily affected by:
- The length of the input text (number of sentences).
- The chosen embedding model and provider.
For extremely large texts, consider breaking the text into individual documents or subsections before applying the SemanticSplitter.
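For example, a straightforward pre-split on blank lines (a sketch; very_large_text stands in for your document, and the right delimiter depends on its structure):

sections = [s for s in very_large_text.split("\n\n") if s.strip()]
all_chunks = []
for section in sections:
    all_chunks.extend(SemanticSplitter.chunk_text(section))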
Use with Knowledgebases
When building agent knowledge bases, articles and other large documents can be fed through the SemanticSplitter and the resulting semantically coherent chunks loaded into the knowledge base, giving retrieval smaller, self-contained units to work with.
By leveraging the SemanticSplitter tool, agents can efficiently process and analyze large volumes of text, breaking them down into manageable, semantically coherent chunks for further processing or analysis.
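As a sketch of that flow, with article_text and knowledge_base as stand-ins (the add method is hypothetical; substitute your knowledge base's actual API):

chunks = SemanticSplitter.chunk_text(article_text)
for chunk in chunks:
    knowledge_base.add(chunk)  # hypothetical API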
Example Usage with Different Embedding Providers
Here's how to use the SemanticSplitter with different embedding providers:
text = "This is a test text to demonstrate the semantic splitter. It should be split into meaningful chunks based on the content and similarity threshold. There are many different embedding providers and models available." # Using OpenAI (default) chunks = SemanticSplitter.chunk_text(text) # Using Cohere chunks = SemanticSplitter.chunk_text(text, embedding_provider="cohere", embedding_model="embed-english-v3.0") # Using Mistral chunks = SemanticSplitter.chunk_text(text, embedding_provider="mistral", embedding_model="mistral-embed")
These examples demonstrate how to use different embedding providers and models with the SemanticSplitter.