What is document chunking in RAG?

Document chunking is the process of dividing large documents into smaller, semantically meaningful segments. This is crucial for RAG systems because LLMs have limited context windows (typically 4K-128K tokens) and retrieval systems work better with focused, relevant chunks rather than entire documents. Good chunking balances chunk size (context) with retrieval precision (signal-to-noise ratio).

What's the difference between fixed-size and recursive chunking?

Fixed-size chunking splits text at fixed character intervals (e.g., every 500 characters), which is simple but can awkwardly split sentences or words. Recursive chunking intelligently splits using a hierarchy of separators (newlines, paragraphs, sentences, spaces), preserving semantic meaning. Recursive chunking is LangChain's default and recommended for most use cases.

What is the optimal chunk size for RAG?

The optimal chunk size typically ranges from 256-512 tokens. Smaller chunks (256) provide precise retrieval but may lack context. Larger chunks (512-1024) preserve more context but introduce noise. The best size depends on your embedding model's optimal input size and use case. Start with 512 tokens and adjust based on retrieval metrics.

How does semantic chunking improve RAG performance?

Semantic chunking groups related sentences based on embedding similarity rather than arbitrary size limits. This preserves topical coherence and improves retrieval accuracy. It uses sentence embeddings to detect natural breakpoints where topic shifts occur, ensuring each chunk represents a complete thought or concept.

Should I use chunk overlap in my RAG system?

Yes, chunk overlap of 10-20% of chunk size is strongly recommended. Overlap ensures that information split across chunk boundaries is preserved in at least one chunk. For example, with 512-token chunks, use 50-100 token overlap. This improves retrieval recall and prevents loss of context at boundaries.

What is late chunking in RAG?

Late chunking generates chunk embeddings after processing a larger context (often an entire section or document) with a long-context embedding model. Each token representation can incorporate broader context, so chunk embeddings become less ambiguous (e.g., for pronouns, references, and header-dependent content). It's especially useful for long technical docs, but requires a model/tooling that supports token-level pooling or similar.

What is contextual retrieval and how does it relate to chunking?

Contextual retrieval improves retrieval by enriching each chunk with document or section context before indexing (for example, prepending a short context string like the heading path or an LLM-generated summary). This complements any chunking strategy and often boosts accuracy because chunks become self-contained and less ambiguous at retrieval time.

Document Chunking for RAG: 9 Strategies, Chunk Size & Overlap (2026)

TL;DR — Best Document Chunking Strategy for RAG (2026 Update)

We tested 9 chunking strategies. Plus: 3 newer 2026 upgrades worth trying (contextual retrieval, late chunking, cross-granularity).

✅ Semantic Chunking - Best accuracy (up to ~70% lift in our benchmark vs naive baselines). Groups sentences by meaning. Use for: knowledge bases, technical docs. Cost: Higher compute.
✅ Recursive Chunking - Best balance. Preserves structure (paragraphs → sentences). Use for: most RAG applications. LangChain default.
✅ Token-based Chunking - Best for LLM compatibility. Exact token counts. Use for: tight context windows, cost optimization.
✅ Fixed-size Chunking - Fastest but breaks sentences. Use for: prototyping only, not production.
✅ Hybrid Chunking - Best for complex docs. Combines multiple methods. Use for: multi-format documents (PDFs, code, tables).

Practical starting defaults (re-validated Feb 2026):

Chunk size: 256-512 tokens (sweet spot for most cases)
Overlap: 10-20% (50-100 tokens for 512-token chunks)
Best for accuracy: Semantic chunking (70% better retrieval)
Best for speed: Recursive chunking (LangChain default)
Best for cost: Token-based with caching

2026 upgrades that frequently beat “just tweak chunk_size/overlap”:

Contextual retrieval (contextualize each chunk before embedding)
Late chunking (contextual chunk embeddings from long-context embedding models)
Cross-granularity retrieval (sentence-atomic + query-time assembly)

Quick decision guide:

Need highest accuracy → Semantic chunking
General purpose RAG → Recursive chunking
Tight budget/API limits → Token-based chunking
Just starting out → RecursiveCharacterTextSplitter (LangChain)

Try interactive chunking tool → | Jump to code examples

Have you ever built a Retrieval-Augmented Generation (RAG) system that performed below expectations? You integrate a state-of-the-art LLM and craft meticulous prompts, yet the outputs are frustratingly mediocre—lacking context or, worse, factually incorrect.

We often rush to blame the retrieval algorithms or the embedding models. But what if the real culprit is hiding in plain sight, right at the beginning of the pipeline? I'm referring to document chunking.

Get your RAG chunking strategy wrong, and you're feeding your LLM a diet of fragmented, incoherent information. It's the classic 'garbage in, garbage out' problem. No matter how sophisticated your model is, it cannot synthesize accurate insights from garbled text. The quality of your text chunking doesn't just set a baseline for your RAG system's performance; it defines the upper limit.

In this guide, we will move beyond dense theory and dive straight into practical, code-driven implementation. We'll explore a range of chunking strategies, complete with examples and field-tested advice, to help you build a rock-solid foundation for any RAG application.

Want to pressure-test the setup, not just read about it? Run your sample text through RAG Chunk Lab to compare splits visually, then sanity-check the downstream model budget with the AI Token Calculator.

If your chunking pipeline touches OCR or image-heavy docs: use AI Image Pricing to estimate screenshot/PDF vision cost, and use the GPT-4o mini Token Calculator when you want a fast estimate for a low-cost default extraction or summarization pass.

Why Document Chunking is Crucial for RAG

So, why do we need to chunk documents for RAG in the first place? It boils down to two fundamental constraints:

Finite LLM Context Windows: Large Language Models have a limited context window—the maximum amount of text they can process at once. Document chunking breaks down massive texts into bite-sized pieces that fit within this window.
The Signal-to-Noise Problem: When a user asks a question, you want to retrieve the most relevant information possible. If your chunks are too large, they might contain the right answer buried amidst a sea of irrelevant text (noise). This dilutes the core signal, confusing the retriever and leading to poor retrieval accuracy.

The art of document chunking is striking the perfect balance: each chunk must be small enough to be focused, yet large enough to retain its semantic meaning. The two most important parameters you can adjust are chunk_size and chunk_overlap. Think of chunk_overlap as a safety net; by including a small piece of the previous chunk at the beginning of the next one, you ensure that a complete thought or sentence isn't awkwardly sliced in two.

9 Document Chunking Strategies with Python Code

Want to experiment with different chunking strategies without writing code? Try our RAG Chunk Lab - an interactive tool that lets you visualize and compare chunking strategies in real-time with your own text.

Basic Text Chunking Methods

Fixed-Size Chunking

This is the brute-force approach: chop the text every n characters, regardless of words or sentences. It's simple and fast, but it's also a sledgehammer. You'll often end up splitting sentences or even words right down the middle, destroying semantic meaning.

Core Idea: Split text by a fixed number of characters (chunk_size).
Use Cases: Best for unstructured plain text or as a preliminary preprocessing step where semantic integrity isn't a top priority.

from langchain_text_splitters import CharacterTextSplitter

sample_text = (
    "LangChain was created by Harrison Chase in 2022. It provides a framework for developing applications "
    "powered by language models. The library is known for its modularity and ease of use. "
    "One of its key components is the TextSplitter class, which helps in document chunking."
)

text_splitter = CharacterTextSplitter(
    separator=" ",  # Split by space
    chunk_size=100,  # Increase chunk size
    chunk_overlap=20,  # Adjust overlap
    length_function=len,
)

docs = text_splitter.create_documents([sample_text])
for i, doc in enumerate(docs):
    print(f"--- Chunk {i+1} ---")
    print(doc.page_content)

Recursive Character Chunking: The Go-To Strategy

This is the go-to method for most use cases and LangChain's default recommendation for a reason. Instead of a blind chop, it intelligently splits text using a prioritized list of separators, typically ["\n\n", "\n", " ", ""]. It tries to split by paragraphs first, then sentences, then words. This hierarchical approach does a much better job of keeping related content together.

Core Idea: Recursively split text using a hierarchical list of separators.
Use Cases: The preferred general-purpose strategy for the vast majority of text types.

from langchain_text_splitters import RecursiveCharacterTextSplitter

# Using the same sample_text from above
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=100,
    chunk_overlap=20,  # Default separators are ["\n\n", "\n", " ", ""]
)
docs = text_splitter.create_documents([sample_text])
for i, doc in enumerate(docs):
    print(f"--- Chunk {i+1} ---")
    print(doc.page_content)

Parameter Tuning Guide: For both fixed-size and recursive chunking, setting chunk_size and chunk_overlap is crucial:

chunk_size: This is a balancing act. Too small, and your chunks won't have enough context. Too large, and you introduce noise and increase API costs. A good starting point often aligns with your embedding model's optimal input size, typically 256, 512, or 1024 tokens.
chunk_overlap: This prevents jarring cuts between chunks. By letting each chunk share a small bit of text with its neighbor (a common rule of thumb is 10-20% of the chunk_size), you create a smoother transition and reduce the risk of splitting a key sentence in half.

Token-Based Chunking (Exact Token Budgets)

When you have strict limits (embedding max tokens, LLM context windows, or hard cost ceilings), splitting by characters is guesswork. Token-based chunking splits by the same tokenizer your models use, so you can reliably stay within token budgets.

Core Idea: Split by token count, not characters.
Use Cases: Tight context windows, stable latency/cost, multilingual text where character counts are misleading.

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # Pick the tokenizer that matches your embedding/LLM model.

def chunk_by_tokens(text: str, chunk_size: int = 512, overlap: int = 64) -> list[str]:
    tokens = enc.encode(text)
    chunks = []
    start = 0
    while start < len(tokens):
        end = min(start + chunk_size, len(tokens))
        chunks.append(enc.decode(tokens[start:end]))
        if end == len(tokens):
            break
        start = max(0, end - overlap)
    return chunks

for i, chunk in enumerate(chunk_by_tokens(sample_text, chunk_size=64, overlap=8), 1):
    print(f"--- Chunk {i} ---")
    print(chunk)

Tip: If you're using a Hugging Face embedding model, use its AutoTokenizer to measure tokens, so your chunk sizes match the model's real limits.

Sentence-Based Chunking for Semantic Integrity

This approach treats sentences as the fundamental building blocks, grouping them into chunks. This guarantees that you never slice a sentence in half, preserving a basic level of semantic integrity.

Core Idea: Split text into sentences, then group sentences into chunks.
Use Cases: Scenarios requiring high sentence integrity, such as legal documents or news articles.

import nltk

try:
    nltk.data.find('tokenizers/punkt')
except nltk.downloader.DownloadError:
    nltk.download('punkt')

from nltk.tokenize import sent_tokenize

def chunk_by_sentences(text, max_chars=500, overlap_sentences=1):
    sentences = sent_tokenize(text)
    chunks = []
    current_chunk = ""
    for i, sentence in enumerate(sentences):
        if len(current_chunk) + len(sentence) <= max_chars:
            current_chunk += " " + sentence
        else:
            chunks.append(current_chunk.strip())
            # Create overlap
            start_index = max(0, i - overlap_sentences)
            current_chunk = " ".join(sentences[start_index:i+1])
    if current_chunk:
        chunks.append(current_chunk.strip())
    return chunks

long_text = "This is the first sentence. This is the second sentence, which is a bit longer. Now we have a third one. The fourth sentence follows. Finally, the fifth sentence concludes this paragraph."
chunks = chunk_by_sentences(long_text, max_chars=100)
for i, chunk in enumerate(chunks):
    print(f"--- Chunk {i+1} ---")
    print(chunk)

A Quick Word on Multilingual Content: Be aware that many standard libraries, like NLTK, are often optimized for English. The default sentence tokenizer might struggle with languages that use different punctuation (like 。 in Chinese). When working with non-English text, always ensure you're using language-specific models or regex patterns to split sentences correctly.

Structure-Aware Chunking: Using Document Format

Why guess where the logical breaks are when the document tells you? Structure-aware chunking leverages the document's built-in formatting—like Markdown headers or HTML tags—to create highly logical, contextually rich chunks. This is often the easiest win for improving chunking quality.

Markdown and HTML Chunking

Core Idea: Define chunk boundaries based on Markdown heading levels or HTML tags.
Use Cases: Well-formatted Markdown or HTML documents.

from langchain_text_splitters import MarkdownHeaderTextSplitter

markdown_document = """
# Chapter 1: The Beginning
## Section 1.1: The Old World
This is the story of a time long past.
## Section 1.2: A New Hope
A new hero emerges.
# Chapter 2: The Journey
## Section 2.1: The Call to Adventure
The hero receives a mysterious call.
"""

headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
]

markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
md_header_splits = markdown_splitter.split_text(markdown_document)

for split in md_header_splits:
    print(f"Metadata: {split.metadata}")
    print(split.page_content)
    print("-" * 20)

Special case: Conversations and transcripts. If you're chunking chat logs or meeting transcripts, preserve speaker turns, timestamps, and topic shifts. Treat each turn (or a small sliding window of turns) as an atomic unit, and store speaker, time, and conversation_id as metadata for filtering at retrieval time.

Advanced Methods: Semantic Chunking and Late Chunking

These advanced methods move beyond physical structure: either splitting based on meaning, or improving chunk embeddings using broader context.

Semantic Chunking for Thematic Cohesion

Core Idea: Calculate the vector similarity between adjacent sentences. When the similarity drops significantly—indicating a topic change—create a new chunk.
Use Cases: Knowledge bases and research papers where semantic cohesion within chunks is critical for retrieval accuracy.

import os
from langchain_experimental.text_splitter import SemanticChunker
from langchain_huggingface import HuggingFaceEmbeddings

os.environ["TOKENIZERS_PARALLELISM"] = "false"

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

text_splitter = SemanticChunker(
    embeddings,
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=70
)
print("SemanticChunker configured.")
print("-" * 50)

long_text = (
    "The Wright brothers, Orville and Wilbur, were two American aviation pioneers "
    "generally credited with inventing, building, and flying the world's first successful motor-operated airplane. "
    "They made the first controlled, sustained flight of a powered, heavier-than-air aircraft on December 17, 1903. "
    "In the following years, they continued to develop their aircraft. "
    "Switching topics completely, let's talk about cooking. "
    "A good pizza starts with a perfect dough, which needs yeast, flour, water, and salt. "
    "The sauce is typically tomato-based, seasoned with herbs like oregano and basil. "
    "Toppings can vary from simple mozzarella to a wide range of meats and vegetables. "
    "Finally, let's consider the solar system. "
    "It is a gravitationally bound system of the Sun and the objects that orbit it. "
    "The largest objects are the eight planets, in order from the Sun: Mercury, Venus, Earth, Mars, Jupiter, Saturn, Uranus, and Neptune."
)

docs = text_splitter.create_documents([long_text])
for i, doc in enumerate(docs):
    print(f"--- Chunk {i+1} ---")
    print(doc.page_content)
    print()

Parameter Tuning Guide: The key parameter for SemanticChunker is breakpoint_threshold_amount. A low threshold creates many small, focused chunks, while a high threshold creates fewer, larger chunks, splitting only on major topic shifts.

Late Chunking: Contextual Chunk Embeddings (Long-Context Embedding Models)

Classic chunking embeds each chunk independently. That works, but it often creates ambiguous chunks without surrounding context (pronouns, references like “this approach”, header-dependent content, etc.). Late chunking flips the order:

Encode a larger span (a section or whole document) once with a long-context embedding model.
Pool token embeddings to form each chunk embedding.

Core Idea: Encode first with more context, pool later into chunk vectors.
Use Cases: Long technical docs, specs/policies, and content with lots of cross-references.

# Pseudocode (exact APIs depend on the embedding model/tooling you use).
# Goal: get token-level embeddings for a long span, then pool them into chunk vectors.
#
# token_embs: shape (seq_len, dim)
token_embs = embed_model.encode(long_text, return_token_embeddings=True)

# spans: list[(start_token, end_token)] for each chunk
spans = [(0, 128), (96, 256), (224, 384)]
chunk_embs = [token_embs[s:e].mean(axis=0) for (s, e) in spans]

When late chunking helps most: chunks that rely on surrounding headers or long-range references.
Tradeoffs: needs a long-context embedding model and token-level pooling; for very long documents you still need windowing.

Reference: Late Chunking (arXiv:2409.04701) | Jina AI late-chunking

Cutting-Edge Chunking Techniques

Small-to-Large Chunking (ParentDocumentRetriever)

This strategy gives you the best of both worlds: precision and context.

Core Idea: Create small, precise "child" chunks for retrieval and larger "parent" chunks for context. The retriever finds the best child chunk, then returns its parent chunk to the LLM, providing rich context for generation.
Use Cases: Complex Q&A scenarios that require both high retrieval precision and rich context.

from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain.retrievers import ParentDocumentRetriever
from langchain_chroma import Chroma
from langchain.storage import InMemoryStore

class MockEmbeddings:
    def embed_documents(self, texts):
        return [[0.1] * 128 for _ in texts]
    def embed_query(self, text):
        return [0.1] * 128

embeddings = MockEmbeddings()
docs = [
    Document(page_content="The first law of thermodynamics is the law of conservation of energy. It states that energy cannot be created or destroyed in an isolated system. The total energy of the universe is constant."),
    Document(page_content="The second law introduces the concept of entropy. It states that the entropy of an isolated system always increases over time. This law explains the direction of natural processes, which tend to move towards a state of greater disorder."),
    Document(page_content="The third law of thermodynamics states that the entropy of a system approaches a constant value as the temperature approaches absolute zero. For a perfect crystal at absolute zero, the entropy is exactly zero.")
]

parent_splitter = RecursiveCharacterTextSplitter(chunk_size=1000)
child_splitter = RecursiveCharacterTextSplitter(chunk_size=200)
vectorstore = Chroma(collection_name="full_documents", embedding_function=embeddings)
store = InMemoryStore()

retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=store,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)

retriever.add_documents(docs)
query = "What is entropy?"
retrieved_docs = retriever.get_relevant_documents(query)
print(f"Retrieved {len(retrieved_docs)} parent documents.")
print(retrieved_docs[0].page_content)

Agentic Chunking: Using an LLM to Chunk

This is the cutting edge: using an LLM to chunk text for another LLM.

Core Idea: An LLM agent analyzes text, identifies core concepts, and intelligently extracts and reorganizes sentences into self-contained, logical chunks.
Use Cases: Highly experimental and resource-intensive, but promising for messy, unstructured documents where other methods fail.

import textwrap
from langchain_openai import ChatOpenAI
from langchain.prompts import PromptTemplate
from langchain_core.output_parsers import PydanticOutputParser
from pydantic import BaseModel, Field
from typing import List

class KnowledgeChunk(BaseModel):
    chunk_title: str = Field(description="A concise and clear title for this knowledge chunk.")
    chunk_text: str = Field(description="The self-contained text content, extracted and reorganized from the original text.")
    representative_question: str = Field(description="A typical question that can be directly answered by the content of this chunk.")

class ChunkList(BaseModel):
    chunks: List[KnowledgeChunk]

parser = PydanticOutputParser(pydantic_object=ChunkList)

prompt_template = """
[ROLE]: You are a top-tier scientific document analyst. Your task is to break down complex scientific text paragraphs into a set of core, self-contained "Knowledge Chunks".
[CORE TASK]: Read the text paragraph provided by the user and identify the independent core concepts within it.
[RULES]:
1. **Self-Contained**: Each "Knowledge Chunk" must be self-contained.
2. **Single Concept**: Each "Knowledge Chunk" should revolve around only one core concept.
3. **Extract and Reorganize**: Extract all sentences related to that core concept from the original text and combine them into a smooth, coherent paragraph.
4. **Follow Format**: Strictly adhere to the JSON format instructions below to structure your output.
{format_instructions}
[TEXT TO PROCESS]:
{paragraph_text}
"""

prompt = PromptTemplate(
    template=prompt_template,
    input_variables=["paragraph_text"],
    partial_variables={"format_instructions": parser.get_format_instructions()},
)

def agentic_chunker(paragraph_text: str) -> List[KnowledgeChunk]:
    print("--- Simulating LLM call for Agentic Chunker ---")
    if "evaporation" in paragraph_text:
        return [
            KnowledgeChunk(chunk_title="The Water Cycle: Evaporation and Condensation", chunk_text="The water cycle's first stage is evaporation, where water from oceans and lakes becomes vapor. Transpiration from plants also contributes. The second stage is condensation, where this vapor cools to form clouds.", representative_question="What are the first two stages of the water cycle?"),
            KnowledgeChunk(chunk_title="The Water Cycle: Precipitation and Collection", chunk_text="The third stage is precipitation, when water droplets in clouds grow heavy and fall as rain or snow. The final stage is collection, where water gathers in rivers and oceans or seeps into the ground as groundwater, restarting the cycle.", representative_question="What happens after clouds form in the water cycle?")
        ]
    return []

document = """The water cycle, also known as the hydrologic cycle, describes the continuous movement of water on, above, and below the surface of the Earth. This cycle is vital as it ensures the availability of water for all life forms. The first stage of the cycle is evaporation, the process by which water from surfaces like oceans, lakes, and rivers is converted into water vapor and rises into the atmosphere, with transpiration from plants also contributing. As the warm, moist air rises and cools, the second stage occurs: condensation. In this phase, the water vapor turns back into tiny liquid water droplets, forming clouds. As these droplets collide and grow, they eventually become heavy enough to fall back to Earth as precipitation, the third stage, which can be in the form of rain, snow, sleet, or hail. Finally, once the water reaches the ground, it may move in several ways, constituting the fourth stage: collection. Some water will flow as surface runoff into rivers, lakes, and oceans. Other water will seep into the ground and become groundwater, which may eventually return to the surface or the ocean, thus starting the cycle anew."""

paragraphs = document.strip().split('\n\n')
all_chunks = []
for para in paragraphs:
    chunks_from_para = agentic_chunker(para)
    if chunks_from_para:
        all_chunks.extend(chunks_from_para)

2026 Upgrades (Beyond Splitting): Contextual Retrieval + Cross-Granularity Retrieval

If you already tuned chunk_size / chunk_overlap and still miss answers, the biggest wins in 2026 often come from:

Making each chunk more self-contained before embedding (contextual retrieval)
Avoiding rigid boundaries via flexible granularity (sentence-atomic retrieval + query-time assembly)

Contextual Retrieval: Contextualize Each Chunk Before Embedding

Chunks are often ambiguous when retrieved out of context. A simple fix: prepend a short “where am I?” context string (doc title, heading path, 1-2 sentence summary) before embedding.

Minimal pipeline:

Split documents into chunks using any strategy in this guide.
For each chunk, generate a short context string (deterministic from metadata, or LLM-generated).
Embed context + "\n\n" + chunk, but store the raw chunk as the retrieval payload.

# Pseudocode: contextualize chunks before embedding.
def build_context(doc_title: str, heading_path: str) -> str:
    return f"Title: {doc_title}\nSection: {heading_path}"

context = build_context(doc_title="My Handbook", heading_path="Chapter 3 > Rate Limits")
contextualized = context + "\n\n" + chunk_text
embedding = embed_model.embed(contextualized)

Tips: keep context short (roughly 50-150 tokens), cache it, and add it consistently across your corpus.

Reference: Anthropic: Contextual Retrieval

Cross-Granularity Retrieval: Sentence-Atomic + Query-Time Assembly

Instead of committing to fixed chunk boundaries up front, index smaller atomic units (often sentences), then assemble a good context window at query time by expanding/merging around the top hits. This reduces “wrong boundary” failures on long documents with uneven information density.

High-level approach:

Split each document into sentences, embed each sentence (store doc_id, sentence_index).
Retrieve top-N sentences for a query.
Expand each hit with neighbors or its parent section, dedupe, then pack into the LLM context window.

Reference: FreeChunker: Cross-Granularity Chunking (arXiv:2510.20356)

Hybrid Chunking: Combining Strategies for Best Results

In the real world, one size rarely fits all. A hybrid chunking approach combines multiple strategies to get the best results for complex documents.

Core Idea: Start with a coarse, high-level strategy (like splitting by Markdown headers). Then, iterate through those initial chunks. If any are too large, apply a more fine-grained strategy (like recursive chunking) to break them down further.
Use Cases: Perfect for complex documents with mixed structures and varying content density.

from langchain_text_splitters import MarkdownHeaderTextSplitter, RecursiveCharacterTextSplitter
from langchain_core.documents import Document

markdown_document = """
# Chapter 1: Company Introduction
Our company was founded in 2017, dedicated to promoting innovation and application of artificial intelligence technology. Our mission is to empower various industries and create greater value through advanced AI solutions.
## 1.1 Development History
Since its inception, the company has experienced rapid growth. From an initial team of a few people to its current scale of hundreds of employees, we have always adhered to the principles of being technology-driven and customer-first.
# Chapter 2: Core Technologies
This chapter will detail our core technologies. Our technical framework is based on advanced distributed computing concepts, ensuring high availability and scalability. At the core of the system is a self-developed deep learning engine capable of processing massive data and conducting efficient model training. This engine supports multiple neural network architectures, including Convolutional Neural Networks (CNNs) for image recognition, as well as Recurrent Neural Networks (RNNs) and Transformer models for natural language understanding. We have specifically optimized the Transformer architecture, proposing a new mechanism called "Attention Compression," which significantly reduces computational resource requirements while maintaining model performance.
## 2.1 Technical Principles
Our technical principles integrate knowledge from multiple disciplines, including statistics, machine learning, and operations research.
# Chapter 3: Future Outlook
Looking ahead, we will continue to increase our investment in the field of artificial intelligence and explore the possibilities of Artificial General Intelligence (AGI).
"""

def hybrid_chunking_optimized(
    markdown_document: str,
    coarse_chunk_threshold: int = 400,
    fine_chunk_size: int = 100,
    fine_chunk_overlap: int = 20
) -> list[Document]:
    headers_to_split_on = [("#", "Header 1"), ("##", "Header 2")]
    markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
    coarse_chunks = markdown_splitter.split_text(markdown_document)
    fine_splitter = RecursiveCharacterTextSplitter(
        chunk_size=fine_chunk_size,
        chunk_overlap=fine_chunk_overlap
    )
    final_chunks = []
    for chunk in coarse_chunks:
        if len(chunk.page_content) > coarse_chunk_threshold:
            finer_chunks = fine_splitter.split_documents([chunk])
            final_chunks.extend(finer_chunks)
        else:
            final_chunks.append(chunk)
    return final_chunks

final_chunks = hybrid_chunking_optimized(markdown_document)
for i, chunk in enumerate(final_chunks):
    print(f"--- Final Chunk {i+1} (length: {len(chunk.page_content)}) ---")
    print(f"Metadata: {chunk.metadata}")
    print(chunk.page_content)

How to Choose the Best Chunking Strategy for Your RAG

Follow this strategic, layered approach to find the right chunking strategy for your project. Once you've chosen your chunking method, you'll need to select a RAG framework to implement your complete pipeline. We compare the top 5 frameworks (LangChain, LlamaIndex, Haystack) with benchmarks and code examples to help you choose the best fit for your use case.

If you want to make this decision data-driven, start by picking an evaluation metric (Recall@K, MRR, answer faithfulness) and a representative query set. Here's a practical primer: RAG Evaluation 101.

Step 1: Establish a Baseline

Your Go-To: Always start with RecursiveCharacterTextSplitter. It's the versatile, reliable workhorse of chunking. Use it to get your RAG system up and running and establish a performance baseline.
If You Have Hard Token Limits: Move early to token-based chunking so you can guarantee chunks fit your embedding model and downstream LLM budget.

Step 2: Analyze Document Structure

Your Next Move: If your content has a clear structure (Markdown, HTML), switch to a structure-aware method like MarkdownHeaderTextSplitter. This is often the single biggest and easiest improvement you can make.

Step 3: Improve Retrieval Quality (2026 Upgrades Included)

SemanticChunker: Choose this for thematically cohesive chunks.
Late chunking: Use when chunks are ambiguous without surrounding context (headers, pronouns, cross-references).
Contextual retrieval: Contextualize each chunk before embedding (title/heading path/short summary) to make chunks self-contained.
Cross-granularity retrieval: Index sentence-atomic units, then assemble context at query time to avoid boundary failures.
ParentDocumentRetriever (Small-to-Large): Ideal for complex Q&A needing both pinpoint retrieval and broad context.

Step 4: Implement a Hybrid Approach

The Power-User Move: For documents with mixed formats and densities, a hybrid approach is your best bet. For example, use MarkdownHeaderTextSplitter first, then run RecursiveCharacterTextSplitter on any resulting chunks that are still too big.

For easy reference, this table summarizes the strategies we've discussed.

Strategy	Best For	Potential Downsides
Fixed-Size Chunking	Simplicity and speed on unstructured text.	High risk of breaking sentences and meaning.
Recursive Character Chunking	The best general-purpose starting point.	Can be suboptimal for highly structured data.
Token-Based Chunking	Hard token limits and predictable cost/latency.	Requires the correct tokenizer; can still split awkwardly without structure/sentence awareness.
Sentence-Based Chunking	When sentence integrity is paramount (e.g., legal docs).	Individual sentences can lack context; long sentences are tricky.
Structure-Aware Chunking	Cleanly formatted documents (Markdown, HTML).	Useless for unstructured or messy text.
Semantic Chunking	Achieving high semantic cohesion within chunks.	Computationally intensive; quality depends on the embedding model.
Late Chunking	Long documents with ambiguous chunks and cross-references.	Needs a long-context embedding model + token-level pooling; windowing still needed for very long docs.
Contextual Retrieval (Contextualized Chunks)	Making chunks self-contained and less ambiguous at retrieval time.	Extra preprocessing cost; requires caching and consistent context generation.
Cross-Granularity Retrieval	Flexible query-time assembly on uneven, long documents.	More complex retrieval logic; can increase query-time cost.
Hybrid Chunking	Complex, mixed-format documents.	Requires more complex logic to implement.
Small-to-Large Chunking	Q&A needing both precision and rich context.	More complex pipeline; manages two sets of documents.
Agentic Chunking	Experimental use on highly complex, messy text.	Very slow, expensive, and still in early stages.

Conclusion: Mastering Document Chunking for RAG

Document chunking isn't just a mundane preprocessing step; it's a critical design choice that profoundly impacts your entire RAG system. If you take away anything from this guide, let it be these three principles:

There Is No Silver Bullet: The perfect chunking strategy depends entirely on your data and your goals. Treat it as an iterative engineering problem.
Start Simple, Then Specialize: Always begin with a robust baseline like RecursiveCharacterTextSplitter. From there, layer on more sophisticated strategies only when you have a clear, data-driven need.
Chunking Is Modeling: How you chunk your data is a reflection of how you understand your knowledge base. A well-designed chunk is a carefully modeled unit of meaning.

Ultimately, you cannot have high-quality generation without high-quality retrieval, and you cannot have high-quality retrieval without intelligent chunking. Master this foundational skill, and you are well on your way to building RAG systems that don't just work, but truly excel.

Try It Yourself: RAG Chunk Lab

Ready to put these concepts into practice? Our RAG Chunk Lab provides an interactive environment where you can:

Test different chunking strategies with your own documents
Visualize chunk boundaries and see how they affect retrieval
Compare A/B configurations side-by-side with detailed metrics
Simulate search queries to understand retrieval performance
Export and share your optimal configurations

All processing happens locally in your browser - no data is sent to any server. It's the perfect companion tool to experiment with the strategies discussed in this guide.

Launch RAG Chunk Lab →

Key Takeaways

• Start with recursive chunking (or token-based if you have hard budgets), then add structure-aware splitting when available.
• Semantic chunking helps, but 2026 upgrades like contextual retrieval, late chunking, and cross-granularity often deliver bigger gains than overlap tweaks.
• Use small-to-large and hybrid patterns to balance retrieval precision with generation context.
• Validate with real queries and metrics (e.g., Recall@k, MRR, answer faithfulness), then iterate.

RAG Technology Hub