Stop Using Fixed-Size Chunks for Technical Documentation
Fixed 512-token chunking is the default in every RAG tutorial because it is simple to implement and model-agnostic. It also regularly splits a function signature from its body, a heading from the content it introduces, and a code block mid-example. The embedding model then encodes an incoherent fragment, and retrieval returns it confidently for queries it cannot actually answer.
On this portfolio's corpus, 23% of chunks produced by naive fixed-size splitting contained a code block that had been cut in half. Switching to heading-aware chunking with paragraph fallback dropped that to 2%.
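A figure like that is cheap to measure: a chunk containing an odd number of line-leading fence markers holds a code block that was cut in half. A minimal audit sketch — the helper name and the odd-fence heuristic are mine, not part of the post's pipeline:

```php
<?php
// Count chunks where a fenced code block was split across a chunk boundary.
// Heuristic: an odd number of line-leading fence markers means the chunk
// contains an unclosed (or unopened) code fence.
function count_broken_code_chunks(array $chunks): int
{
    $broken = 0;
    foreach ($chunks as $chunk) {
        // `{3} matches a literal triple-backtick at the start of a line.
        if (preg_match_all('/^`{3}/m', $chunk) % 2 !== 0) {
            $broken++;
        }
    }
    return $broken;
}

$fence = str_repeat('`', 3);
$chunks = [
    "Intro paragraph.\n\n{$fence}php\necho 'cut in half';", // opens a fence, never closes it
    "{$fence}\n\nNext section.",                            // closes a fence it never opened
    "Fine chunk.\n\n{$fence}sql\nSELECT 1;\n{$fence}",      // balanced
];
echo count_broken_code_chunks($chunks), "\n"; // prints 2
```

Both halves of a split block count as broken, which matches the per-chunk framing of the statistic above.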
The strategy: split first on markdown heading boundaries to keep sections intact, then fall back to paragraph breaks for sections that exceed the token ceiling. Add overlap between adjacent chunks so a query that lands on a boundary still retrieves sufficient context.
/**
* Split markdown into semantically coherent chunks for RAG indexing.
*
* @param string $markdown Raw markdown content.
* @param int $maxTokens Token ceiling per chunk (approximated as maxTokens * 4 chars).
* @param int $overlapChars Characters carried over from the previous chunk.
* @return list<string>
*/
function chunk_markdown(string $markdown, int $maxTokens = 400, int $overlapChars = 150): array
{
    $maxChars = $maxTokens * 4;
    $sections = preg_split('/(?=^#{2,4}\s)/m', $markdown, -1, PREG_SPLIT_NO_EMPTY);
    $chunks = [];
    $prev = '';

    foreach ($sections as $section) {
        $section = trim($section);
        if (strlen($section) <= $maxChars) {
            $chunks[] = $prev ? substr($prev, -$overlapChars) . "\n\n" . $section : $section;
            $prev = $section;
            continue;
        }

        // Section exceeds ceiling: split on paragraph boundaries.
        $paragraphs = preg_split('/\n{2,}/', $section, -1, PREG_SPLIT_NO_EMPTY);
        $buffer = '';
        foreach ($paragraphs as $para) {
            if ($buffer !== '' && strlen($buffer) + strlen($para) + 2 > $maxChars) {
                $chunks[] = $prev ? substr($prev, -$overlapChars) . "\n\n" . $buffer : $buffer;
                $prev = $buffer;
                $buffer = $para;
            } else {
                $buffer .= ($buffer ? "\n\n" : '') . $para;
            }
        }
        if ($buffer !== '') {
            $chunks[] = $prev ? substr($prev, -$overlapChars) . "\n\n" . $buffer : $buffer;
            $prev = $buffer;
        }
    }

    return array_values(array_filter($chunks));
}

Three decisions worth noting. The 1 token ≈ 4 chars approximation is rough — actual token counts vary by model and content type. Use it as a ceiling, not a target, and size $maxTokens conservatively relative to your embedding model's context limit. The $overlapChars value of 150 covers roughly two sentences — enough to give a retrieval hit context without doubling index size. And the preg_split lookahead (?=^#{2,4}\s) preserves the heading at the start of each section rather than stripping it, which keeps the section label inside the chunk's embedding and improves retrieval for heading-level queries.
If your corpus mixes markdown with raw PHP or SQL files, apply a separate chunking strategy per content type — paragraph-boundary splitting is meaningless inside a class definition. Split PHP on method boundaries using token_get_all() instead.
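A sketch of that PHP path, using the real tokenizer rather than regexes. Assumptions worth flagging: the function name chunk_php_functions is mine, the splitter only flushes at top-level function declarations (brace depth zero), and top-level closures would be mis-split — a production version would also track T_CLASS to keep methods grouped differently:

```php
<?php
// Split PHP source into one chunk per top-level function using token_get_all().
// Sketch only: assumes well-formed source with no top-level closures; methods
// inside a class stay with their class chunk because depth is nonzero there.
function chunk_php_functions(string $source): array
{
    $tokens = token_get_all($source);
    $chunks = [];
    $buffer = '';
    $depth  = 0;

    foreach ($tokens as $token) {
        $text = is_array($token) ? $token[1] : $token;
        if (is_array($token) && $token[0] === T_FUNCTION && $depth === 0 && trim($buffer) !== '') {
            // A new top-level function starts: flush everything before it.
            $chunks[] = $buffer;
            $buffer = '';
        }
        $buffer .= $text;
        // Braces inside strings or comments arrive as part of longer token
        // texts, so exact matches track real block depth.
        if ($text === '{') { $depth++; }
        if ($text === '}') { $depth--; }
    }
    if (trim($buffer) !== '') {
        $chunks[] = $buffer;
    }
    return $chunks;
}
```

Note that the file preamble (open tag, use statements) lands in its own chunk, which is usually what you want: imports are retrievable context, not part of any one function.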
Al Amin Ahamed
Senior software engineer & AI practitioner. 5+ years shipping Laravel platforms, WordPress plugins, WooCommerce extensions, and AI-driven products.