Chunk Split Component

The Chunk Split component provides intelligent text splitting capabilities for processing large documents. It supports multiple splitting strategies, customizable chunk sizes, and advanced splitting configurations while maintaining context and semantic meaning.

Chunk Split Architecture

[Diagram: Chunk Split workflow and architecture]

Configuration Parameters

Required Parameters

  • input: Text content to split
  • splitterType: Type of splitting algorithm
    • character
    • token
    • sentence
    • paragraph
    • semantic
    • code

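To make the strategies above concrete, here is a minimal sketch of the sentence strategy: break on end-of-sentence punctuation followed by whitespace. This is illustrative only; the component's actual sentence splitter (and its default regex) may differ, and sentenceSplitRegex can override it.

```python
import re

def sentence_split(text):
    """Naive sentence splitter: break after ., !, or ? followed by
    whitespace. A sketch of the 'sentence' splitterType, not the
    component's implementation."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]
```

A lookbehind split keeps the punctuation attached to the sentence it ends, which matters when chunks are later re-joined or displayed.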
Optional Parameters

  • chunkSize: Maximum size of each chunk (default: 1000)
  • chunkOverlap: Amount of content shared between consecutive chunks to preserve context (default: 200)
  • separators: Custom text separators
  • codeLanguage: Programming language for code splitting
  • embeddings: Embedding configuration for semantic splitting
  • breakpointThresholdType: Type of threshold for splits
    • tokens
    • characters
    • sentences
    • semantic_similarity
  • breakpointThresholdAmount: Numeric threshold used with breakpointThresholdType
  • numberOfChunks: Target number of chunks
  • sentenceSplitRegex: Custom regex for sentence splitting
  • bufferSize: Memory buffer size for large texts
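The interaction of chunkSize and chunkOverlap can be sketched as a sliding window: each chunk starts chunkSize − chunkOverlap characters after the previous one, so consecutive chunks share chunkOverlap characters of context. The function below is an illustration of the character strategy, not the component's actual implementation; the metadata keys mirror the Output Format section.

```python
def character_split(text, chunk_size=1000, chunk_overlap=200):
    """Fixed-size character splitting with overlap.

    Sketch of the 'character' splitterType using the chunkSize and
    chunkOverlap parameters; the real component may differ.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunkOverlap must be smaller than chunkSize")
    step = chunk_size - chunk_overlap  # window advance per chunk
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        chunks.append({
            "text": chunk,
            "index": len(chunks),
            "metadata": {"start_char": start,
                         "end_char": start + len(chunk)},
        })
        if start + chunk_size >= len(text):
            break  # last window already reached the end of the text
    return chunks
```

Requiring chunkOverlap < chunkSize guarantees the window always advances, so the loop terminates even on very long inputs.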

Output Format

{
  "data": {
    "chunks": [
      {
        "text": string,
        "index": number,
        "metadata": {
          "start_char": number,
          "end_char": number,
          "tokens": number,
          "embedding": array (optional)
        }
      }
    ],
    "statistics": {
      "total_chunks": number,
      "average_chunk_size": number,
      "overlap_percentage": number,
      "processing_time": number
    },
    "analysis": {
      "semantic_coherence": number,
      "context_preservation": number,
      "chunk_distribution": {
        "min_size": number,
        "max_size": number,
        "std_dev": number
      }
    }
  }
}
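The statistics block above can be derived from the chunk list itself. The sketch below computes it from chunk dicts shaped like the Output Format; the interpretation of overlap_percentage (share of emitted characters that are duplicated across chunks) is an assumption, since the document does not define it precisely.

```python
def chunk_statistics(chunks, original_length):
    """Compute the 'statistics' block from a list of chunk dicts.

    Assumes overlap_percentage means: duplicated characters as a
    percentage of all emitted characters. Illustrative only.
    """
    sizes = [len(c["text"]) for c in chunks]
    total_emitted = sum(sizes)
    # Characters emitted beyond the original length are overlap copies.
    overlap = max(total_emitted - original_length, 0)
    return {
        "total_chunks": len(chunks),
        "average_chunk_size": total_emitted / len(chunks),
        "overlap_percentage": (100.0 * overlap / total_emitted
                               if total_emitted else 0.0),
    }
```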

Features

  • Multiple splitting strategies
  • Semantic preservation
  • Code-aware splitting
  • Custom separators
  • Overlap control
  • Memory efficiency
  • Statistical analysis
  • Embedding support

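Code-aware splitting differs from plain character splitting by breaking at structural boundaries rather than arbitrary offsets. The sketch below splits Python source at top-level def/class boundaries and merges small pieces; real implementations use language-specific separators selected via codeLanguage, so treat this as an assumption-laden illustration.

```python
import re

def code_split(source, max_chunk=400):
    """Split source code at top-level Python definition boundaries,
    merging adjacent pieces up to max_chunk characters.

    A sketch of code-aware splitting; separators here are
    Python-specific and hypothetical.
    """
    # Zero-width split: keep 'def '/'class ' with the body that follows.
    pieces = re.split(r"(?m)^(?=def |class )", source)
    chunks, current = [], ""
    for piece in pieces:
        if current and len(current) + len(piece) > max_chunk:
            chunks.append(current)
            current = ""
        current += piece
    if current:
        chunks.append(current)
    return [c for c in chunks if c.strip()]
```

Splitting on a lookahead keeps each definition intact, so no chunk ever starts in the middle of a function body.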
Note: Choose appropriate chunk sizes based on your embedding model's token limits. Consider memory usage for large documents.

Tip: Use semantic splitting for natural language content and code-aware splitting for source code files.
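To illustrate the semantic strategy the tip recommends: group consecutive sentences and start a new chunk whenever similarity between neighbouring sentence embeddings drops below a threshold (playing the role of breakpointThresholdAmount). The embed callable, cosine helper, and threshold semantics below are assumptions for the sketch, not the component's actual algorithm.

```python
def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def semantic_split(sentences, embed, threshold=0.5):
    """Start a new chunk when similarity between consecutive sentence
    embeddings falls below `threshold`. `embed` maps a sentence to a
    vector; both are hypothetical stand-ins for the embeddings config.
    """
    chunks, current = [], [sentences[0]]
    prev_vec = embed(sentences[0])
    for sent in sentences[1:]:
        vec = embed(sent)
        if cosine(prev_vec, vec) < threshold:
            # Topic shift detected: close the current chunk.
            chunks.append(" ".join(current))
            current = []
        current.append(sent)
        prev_vec = vec
    chunks.append(" ".join(current))
    return chunks
```

With a toy embedding that separates two topics, sentences about the same topic stay in one chunk while a topic change forces a breakpoint.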