Stop Using Fixed-Size Chunks for Technical Documentation
Fixed 512-token chunking is the default in every RAG tutorial because it is simple to implement and model-agnostic. It also regularly splits a function signature from its body, a heading from the content it introduces, and a code block mid-example. The embedding model then encodes an incoherent fragment, and retrieval returns it confidently for queries it cannot actually answer.
On this portfolio's corpus, 23% of chunks produced by naive fixed-size splitting contained a code block that had been cut in half. Switching to heading-aware chunking with paragraph fallback dropped that to 2%.
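A figure like that is cheap to measure: a chunk containing an odd number of line-leading fence markers holds a code block that was cut in half. A minimal audit sketch — the helper name and the odd-fence heuristic are mine, not part of the post's pipeline:

```php
<?php
// Count chunks where a fenced code block was split across a chunk boundary.
// Heuristic: an odd number of line-leading fence markers means the chunk
// contains an unclosed (or unopened) code fence.
function count_broken_code_chunks(array $chunks): int
{
    $broken = 0;
    foreach ($chunks as $chunk) {
        // `{3} matches a literal triple-backtick at the start of a line.
        if (preg_match_all('/^`{3}/m', $chunk) % 2 !== 0) {
            $broken++;
        }
    }
    return $broken;
}

$fence = str_repeat('`', 3);
$chunks = [
    "Intro paragraph.\n\n{$fence}php\necho 'cut in half';", // opens a fence, never closes it
    "{$fence}\n\nNext section.",                            // closes a fence it never opened
    "Fine chunk.\n\n{$fence}sql\nSELECT 1;\n{$fence}",      // balanced
];
echo count_broken_code_chunks($chunks), "\n"; // prints 2
```

Both halves of a split block count as broken, which matches the per-chunk framing of the statistic above.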
The strategy: split first on markdown heading boundaries to keep sections intact, then fall back to paragraph breaks for sections that exceed the token ceiling. Add overlap between adjacent chunks so a query that lands on a boundary still retrieves sufficient context.
/**
* Split markdown into semantically coherent chunks for RAG indexing.
*
* @param string $markdown Raw markdown content.
* @param int $maxTokens Token ceiling per chunk (approximated as maxTokens * 4 chars).
* @param int $overlapChars Characters carried over from the previous chunk.
* @return list<string>
*/
function chunk_markdown(string $markdown, int $maxTokens = 400, int $overlapChars = 150): array
{
    $maxChars = $maxTokens * 4;
    $sections = preg_split('/(?=^#{2,4}\s)/m', $markdown, -1, PREG_SPLIT_NO_EMPTY);
    $chunks = [];
    $prev = '';

    foreach ($sections as $section) {
        $section = trim($section);
        if (strlen($section) <= $maxChars) {
            $chunks[] = $prev ? substr($prev, -$overlapChars) . "\n\n" . $section : $section;
            $prev = $section;
            continue;
        }

        // Section exceeds ceiling: split on paragraph boundaries.
        $paragraphs = preg_split('/\n{2,}/', $section, -1, PREG_SPLIT_NO_EMPTY);
        $buffer = '';
        foreach ($paragraphs as $para) {
            if ($buffer !== '' && strlen($buffer) + strlen($para) + 2 > $maxChars) {
                $chunks[] = $prev ? substr($prev, -$overlapChars) . "\n\n" . $buffer : $buffer;
                $prev = $buffer;
                $buffer = $para;
            } else {
                $buffer .= ($buffer ? "\n\n" : '') . $para;
            }
        }
        if ($buffer !== '') {
            $chunks[] = $prev ? substr($prev, -$overlapChars) . "\n\n" . $buffer : $buffer;
            $prev = $buffer;
        }
    }

    return array_values(array_filter($chunks));
}

Three decisions worth noting. The 1 token ≈ 4 chars approximation is rough — actual token counts vary by model and content type. Use it as a ceiling, not a target, and size $maxTokens conservatively relative to your embedding model's context limit. The $overlapChars value of 150 covers roughly two sentences — enough to give a retrieval hit context without doubling index size. And the preg_split lookahead (?=^#{2,4}\s) preserves the heading at the start of each section rather than stripping it, which keeps the section label inside the chunk's embedding and improves retrieval for heading-level queries.
If your corpus mixes markdown with raw PHP or SQL files, apply a separate chunking strategy per content type — paragraph-boundary splitting is meaningless inside a class definition. Split PHP on method boundaries using token_get_all() instead.
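A sketch of that PHP path, using the real tokenizer rather than regexes. Assumptions worth flagging: the function name chunk_php_functions is mine, the splitter only flushes at top-level function declarations (brace depth zero), and top-level closures would be mis-split — a production version would also track T_CLASS to keep methods grouped differently:

```php
<?php
// Split PHP source into one chunk per top-level function using token_get_all().
// Sketch only: assumes well-formed source with no top-level closures; methods
// inside a class stay with their class chunk because depth is nonzero there.
function chunk_php_functions(string $source): array
{
    $tokens = token_get_all($source);
    $chunks = [];
    $buffer = '';
    $depth  = 0;

    foreach ($tokens as $token) {
        $text = is_array($token) ? $token[1] : $token;
        if (is_array($token) && $token[0] === T_FUNCTION && $depth === 0 && trim($buffer) !== '') {
            // A new top-level function starts: flush everything before it.
            $chunks[] = $buffer;
            $buffer = '';
        }
        $buffer .= $text;
        // Braces inside strings or comments arrive as part of longer token
        // texts, so exact matches track real block depth.
        if ($text === '{') { $depth++; }
        if ($text === '}') { $depth--; }
    }
    if (trim($buffer) !== '') {
        $chunks[] = $buffer;
    }
    return $chunks;
}
```

Note that the file preamble (open tag, use statements) lands in its own chunk, which is usually what you want: imports are retrievable context, not part of any one function.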
Al Amin Ahamed
Senior software engineer & AI practitioner. 5+ years shipping Laravel platforms, WordPress plugins, WooCommerce extensions, and AI-driven products.