How I Cut Edit Distance from 168 to 43 in a Legal Document RAG Pipeline

The metric sounds odd at first — edit distance on generated text, not BLEU or ROUGE. But for legal document correction, edit distance between the model's output and the reference correction is exactly the right signal. A paraphrase that changes the meaning is worse than a non-correction; a precise in-place fix with minimal surrounding noise is better than a verbose re-statement of the corrected clause.
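To make the metric concrete, here is a minimal sketch of character-level Levenshtein distance. Whether the take-home scored at character or token level is not stated here, so character level is an assumption, and `levenshtein` is an illustrative name rather than the pipeline's actual `edit_distance` function. The two candidate answers are made up for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


reference = "thirty (30) calendar days"
minimal = "thirty (30) calendar days"
verbose = (
    "The clause is incorrect. The notice period should be "
    "thirty (30) calendar days, not business days."
)

print(levenshtein(minimal, reference))
print(levenshtein(verbose, reference))
```

The verbose answer contains the correct fix, yet every extra character around it counts against the score. That is the failure mode the rest of this post works to remove.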

This was the evaluation criterion for the Idea Builders Studio AI Engineer take-home. The task: build a RAG pipeline that ingests legal documents and answers questions about them, with a specific sub-task of correcting factual errors in contract clauses. The baseline approach — naive chunking, top-k retrieval, prompt-the-model — produced an average edit distance of 168 against the reference answers. The final pipeline brought that to 43.

This post covers what changed and why.

The Baseline and Why It Failed

The baseline was textbook RAG: split the document into fixed 500-token chunks, embed each chunk with sentence-transformers/all-MiniLM-L6-v2, store the vectors in pgvector, retrieve the top 3 by cosine similarity at query time, and pass them to Claude with an "answer based on the following context" prompt.
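The retrieval step can be sketched in plain Python. In the real pipeline pgvector does the similarity search; the function names and the three-dimensional toy vectors below are assumptions for illustration only.

```python
import math


def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_k(query_vec: list[float], chunk_vecs: dict[str, list[float]], k: int = 3) -> list[str]:
    """Return the ids of the k chunks most similar to the query."""
    ranked = sorted(
        chunk_vecs.items(),
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [chunk_id for chunk_id, _ in ranked[:k]]


# Toy index: chunk id -> embedding (real embeddings have hundreds of dimensions).
toy_index = {
    "chunk-0": [0.9, 0.1, 0.0],
    "chunk-1": [0.1, 0.9, 0.0],
    "chunk-2": [0.7, 0.3, 0.1],
    "chunk-3": [0.0, 0.1, 0.9],
}
print(top_k([1.0, 0.0, 0.0], toy_index, k=3))
```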

For general Q&A ("what is the termination notice period?") it worked acceptably. For factual correction ("the clause states 30 days — is this correct?") it failed in a consistent pattern: the model retrieved the right section but produced a verbose re-explanation of the clause rather than a minimal correction. Edit distance was high not because the answer was wrong but because the answer was surrounded by noise.

Two root causes:

1. Chunking ignored clause structure. Legal documents are organised by section and clause. A 500-token fixed-size chunk frequently split a clause mid-sentence or merged two unrelated clauses. Retrieving a partial clause gave the model incomplete context for the correction task.
2. The prompt didn't constrain the output format. "Answer based on the following context" invites explanation. For a correction task, you want a diff, not an essay.
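The first root cause is easy to reproduce. The sketch below is a hypothetical fixed-size chunker that uses whitespace-separated words as a stand-in for model tokens, with a tiny chunk size so the effect shows up on a two-clause sample. The sample contract text is invented.

```python
def fixed_size_chunks(text: str, chunk_size: int) -> list[str]:
    """Cut text into runs of chunk_size tokens, ignoring clause boundaries."""
    tokens = text.split()  # whitespace words approximate model tokens
    return [
        " ".join(tokens[i:i + chunk_size])
        for i in range(0, len(tokens), chunk_size)
    ]


contract = (
    "4.1 Either party may terminate this agreement for convenience. "
    "4.2 Notice of termination must be given thirty (30) calendar days "
    "in advance."
)

chunks = fixed_size_chunks(contract, chunk_size=12)
for chunk in chunks:
    print(repr(chunk))
```

The first chunk merges all of clause 4.1 with the opening words of 4.2, and the second chunk holds the notice period without its clause number. Retrieving either one gives the model a fragment.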

Section-Scoped Chunking

The first fix was structural: parse the document into sections and treat each section as the minimum chunk unit. If a section is under 512 tokens, it's a single chunk. If it exceeds 512 tokens, it's split at sentence boundaries within the section — never across section boundaries.

Legal documents have consistent structural markers: numbered clauses (1., 1.1, (a)), all-caps headings ("REPRESENTATIONS AND WARRANTIES"), and horizontal rule separators. A lightweight parser handles all three patterns and builds a section tree before chunking.
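A minimal sketch of what such a parser might look like, assuming line-oriented input. It produces a flat list of clause-level sections tagged with their nearest heading rather than a full tree, and the regular expressions, the `Section` fields, and the name `split_sections` are all hypothetical. The patterns are deliberately naive: a body line that happens to start with a number would be mistaken for a clause.

```python
import re
from dataclasses import dataclass, field

# Hypothetical patterns for the three marker types.
NUMBERED = re.compile(r"^(\d+(?:\.\d+)*\.?|\([a-z]\))\s+")   # 1.  1.1  (a)
ALL_CAPS = re.compile(r"^[A-Z][A-Z &,\-]{3,}$")              # TERMINATION
RULE = re.compile(r"^\s*(?:-{3,}|_{3,}|\*{3,})\s*$")         # ---  ___  ***


@dataclass
class Section:
    section_id: str  # clause number such as "1.1" or "(a)"; "" if unnumbered
    heading: str     # nearest all-caps heading above the clause
    lines: list[str] = field(default_factory=list)

    @property
    def text(self) -> str:
        return "\n".join(self.lines)


def split_sections(text: str) -> list[Section]:
    sections: list[Section] = []
    heading = ""
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if RULE.match(line):
            current = None  # a horizontal rule closes the open section
            continue
        if ALL_CAPS.match(line):
            heading = line  # applies to the clauses that follow
            current = None
            continue
        numbered = NUMBERED.match(line)
        if numbered or current is None:
            section_id = numbered.group(1).rstrip(".") if numbered else ""
            current = Section(section_id=section_id, heading=heading)
            sections.append(current)
        current.lines.append(line)
    return sections


sample = """REPRESENTATIONS AND WARRANTIES
1. Each party represents that it has authority to sign.
1.1 This representation survives termination
of the agreement.
---
TERMINATION
2. Either party may terminate on notice.
(a) Notice must be in writing.
"""
for section in split_sections(sample):
    print(section.section_id, "|", section.heading)
```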

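```python
# DocumentChunk, parse_section_tree, token_count and sentence_split are
# project helpers defined elsewhere in the pipeline.
MAX_CHUNK_TOKENS = 512


def chunk_legal_document(text: str) -> list[DocumentChunk]:
    """Chunk a document section by section, never across section boundaries."""
    sections = parse_section_tree(text)
    chunks = []
    for section in sections:
        if token_count(section.text) <= MAX_CHUNK_TOKENS:
            # Short section: keep it whole as a single chunk.
            pieces = [section.text]
        else:
            # Long section: split at sentence boundaries inside the section.
            pieces = sentence_split(
                section.text, max_tokens=MAX_CHUNK_TOKENS, overlap=64
            )
        for piece in pieces:
            # Every chunk keeps a pointer back to its parent section.
            chunks.append(DocumentChunk(
                text=piece,
                section_id=section.id,
                section_title=section.title,
            ))
    return chunks
```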
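The `sentence_split` helper is not shown in this post. A hypothetical stand-in with the same call signature might look like the sketch below. It assumes whitespace-separated words approximate tokens and uses a naive regex for sentence boundaries, which will mis-split abbreviations such as "e.g.". A single sentence longer than the budget is kept whole rather than cut.

```python
import re


def sentence_split(text: str, max_tokens: int = 512, overlap: int = 64) -> list[str]:
    """Pack whole sentences into chunks of at most max_tokens tokens.

    Each new chunk starts by repeating trailing sentences from the previous
    chunk, up to `overlap` tokens, so context carries across the cut.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: list[str] = []
    current: list[str] = []
    for sentence in sentences:
        n_new = len(sentence.split())
        size = sum(len(s.split()) for s in current)
        if current and size + n_new > max_tokens:
            chunks.append(" ".join(current))
            # Walk backwards to collect trailing sentences as overlap.
            carried: list[str] = []
            carried_size = 0
            for prev in reversed(current):
                n_prev = len(prev.split())
                if carried_size + n_prev > overlap:
                    break
                carried.insert(0, prev)
                carried_size += n_prev
            # Drop the overlap if it would push the new chunk over budget.
            if carried_size + n_new > max_tokens:
                carried = []
            current = carried
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks


print(sentence_split("A b c. D e f. G h i. J k l.", max_tokens=6, overlap=3))
```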

Each chunk carries section_id and section_title as metadata. At retrieval time, the full text of each retrieved chunk's section is fetched by section_id and appended to the context, so the model always receives the complete clause rather than a fragment of it.
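A sketch of that expansion step, under stated assumptions: `RetrievedChunk`, `build_context`, and the in-memory `sections_by_id` lookup are hypothetical names, and this version substitutes the full section for the fragment and deduplicates by section, which is a simplification of appending both.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetrievedChunk:
    text: str
    section_id: str


def build_context(retrieved: list[RetrievedChunk], sections_by_id: dict[str, str]) -> str:
    """Replace each retrieved fragment with its full section, once per section."""
    seen: set[str] = set()
    parts: list[str] = []
    for chunk in retrieved:
        if chunk.section_id in seen:
            continue  # two fragments of one section yield a single copy
        seen.add(chunk.section_id)
        # Fall back to the fragment itself if the section is not in the lookup.
        parts.append(sections_by_id.get(chunk.section_id, chunk.text))
    return "\n\n".join(parts)


sections_by_id = {"4.2": "FULL TEXT OF 4.2", "7.1": "FULL TEXT OF 7.1"}
retrieved = [
    RetrievedChunk("fragment a", "4.2"),
    RetrievedChunk("fragment b", "4.2"),
    RetrievedChunk("fragment c", "7.1"),
    RetrievedChunk("orphan fragment", "9.9"),
]
print(build_context(retrieved, sections_by_id))
```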

Section-Scoped Exemplar Prompting

The second fix was the prompt structure. For correction tasks, the system prompt includes an exemplar pair: a clause with a known error and the minimal correction. The exemplar teaches the output format by demonstration rather than instruction.

System:
You are a legal document reviewer. When asked to correct a clause,
return ONLY the corrected text with the minimum change needed.
Do not explain the correction. Do not repeat unchanged text unless
it is necessary to show the correction in context.

Example:
  Input clause: "The agreement shall remain in force for a period of
  thirty (30) business days following the termination date."
  Correct answer: "thirty (30) calendar days"

User:
Section 4.2 — Termination Notice:
{full_section_text}

The clause above contains a factual error. Return the corrected text only.

The exemplar is section-scoped: it matches the type of clause being corrected (duration clauses get duration exemplars, liability clauses get liability exemplars). The exemplar library has 12 entries covering the common legal clause types in the test set.
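How the clause type gets matched to an exemplar is not spelled out above, so the following is only a sketch under assumptions: a keyword-count classifier (`classify_clause`, with invented keyword lists) picks the type, and `build_system_prompt` slots the matching exemplar into the system prompt. The duration exemplar is the one quoted in the prompt; the liability entry is a placeholder, not an entry from the real library.

```python
# Exemplar library keyed by clause type: (clause with error, minimal fix).
EXEMPLARS: dict[str, tuple[str, str]] = {
    "duration": (
        "The agreement shall remain in force for a period of thirty (30) "
        "business days following the termination date.",
        "thirty (30) calendar days",
    ),
    "liability": (  # placeholder entry, not real data
        "<liability clause with a known error>",
        "<minimal correction>",
    ),
}

# Hypothetical keyword lists used to guess the clause type.
KEYWORDS: dict[str, tuple[str, ...]] = {
    "duration": ("days", "months", "period", "term of"),
    "liability": ("liability", "liable", "indemnif", "damages"),
}


def classify_clause(section_text: str, default: str = "duration") -> str:
    """Pick the clause type whose keywords occur most often in the text."""
    lowered = section_text.lower()
    scores = {
        clause_type: sum(lowered.count(word) for word in words)
        for clause_type, words in KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else default


def build_system_prompt(section_text: str) -> str:
    """Assemble the system prompt with the exemplar for this clause type."""
    wrong, fixed = EXEMPLARS[classify_clause(section_text)]
    return (
        "You are a legal document reviewer. When asked to correct a clause,\n"
        "return ONLY the corrected text with the minimum change needed.\n"
        "Do not explain the correction. Do not repeat unchanged text unless\n"
        "it is necessary to show the correction in context.\n\n"
        "Example:\n"
        f'  Input clause: "{wrong}"\n'
        f'  Correct answer: "{fixed}"\n'
    )


print(build_system_prompt("Notice must be given within 30 days."))
```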

This change dropped average edit distance from 117 (the figure after the chunking fix) to 71. The model stopped producing verbose explanations and started returning minimal corrections.

Embedding Model Upgrade

all-MiniLM-L6-v2 is a general-purpose model. Legal text is a specific domain — dense with terminology, nominalisations, and passive constructions that the general model embeds poorly. Switching to voyage-law-2 (Voyage AI's legal-domain model) improved retrieval precision on the evaluation set from 0.61 to 0.79 (precision@3).

The precision improvement reduced the cases where the wrong clause was retrieved — which previously caused the model to produce confident corrections of the wrong section.
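For reference, precision@3 is the fraction of the top three retrieved items that are relevant, averaged over queries. A minimal sketch follows; the function names, the use of section ids as the unit of relevance, and the two toy queries are assumptions, since the labelling scheme is not described here.

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int = 3) -> float:
    """Fraction of the top-k retrieved ids that are in the relevant set."""
    hits = sum(1 for item in retrieved[:k] if item in relevant)
    return hits / k


def mean_precision_at_k(runs: list[tuple[list[str], set[str]]], k: int = 3) -> float:
    """Average precision@k over (retrieved, relevant) pairs, one per query."""
    if not runs:
        return 0.0
    return sum(precision_at_k(r, rel, k) for r, rel in runs) / len(runs)


# Two toy queries: retrieved section ids in rank order, and the relevant ids.
runs = [
    (["4.2", "7.1", "9.3"], {"4.2", "9.3"}),
    (["2.1", "2.2", "5.4"], {"2.1", "2.2", "5.4"}),
]
print(mean_precision_at_k(runs, k=3))
```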

The Eval Framework

The evaluation commits before and after each change were the part of the submission that the reviewer highlighted. The eval runs automatically in CI: a pytest suite with 40 query-reference pairs, computing edit distance between the model's output and the reference answer for each pair.

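```python
import pytest

# pipeline, edit_distance, load_eval_cases and EvalCase are project-local;
# their imports are omitted here.
EDIT_DISTANCE_THRESHOLD = 50  # per-case ceiling in the final version


# One test per query-reference pair, so a failure names the offending case.
@pytest.mark.parametrize("case", load_eval_cases())
def test_correction_edit_distance(case: EvalCase):
    result = pipeline.answer(case.query, case.document)
    distance = edit_distance(result, case.reference)
    assert distance <= EDIT_DISTANCE_THRESHOLD, (
        f"Edit distance {distance} exceeds threshold {EDIT_DISTANCE_THRESHOLD}\n"
        f"Query: {case.query}\n"
        f"Result: {result}\n"
        f"Reference: {case.reference}"
    )
```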

EDIT_DISTANCE_THRESHOLD is 50 in the final version. The assertion applies to each case individually, so the number that matters for CI is the worst individual result, 48. The average of 43 sits comfortably below the threshold.

Running the eval after each architectural change made the improvement trajectory visible and reproducible. The committed history shows three distinct drops: section-scoped chunking (168 → 117), exemplar prompting (117 → 71), embedding model upgrade (71 → 43).

What Each Change Contributed

| Change | Avg edit distance | Delta |
|---|---|---|
| Baseline (fixed chunks, MiniLM, no exemplar) | 168 | n/a |
| Section-scoped chunking | 117 | −51 |
| Section-scoped exemplar prompting | 71 | −46 |
| Voyage law-2 embeddings | 43 | −28 |

The chunking and prompting changes contributed roughly equally. The embedding upgrade was smaller in absolute terms but closed the remaining gap by improving the cases where the wrong clause was retrieved — which the prompt changes couldn't fix.
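The split can be checked with a few lines of arithmetic on the averages reported above. The variable names are illustrative, and the "share of total" figure assumes the changes are attributed in the order they were applied.

```python
# Average edit distance after each stage, in the order the changes landed.
stages = [
    ("Baseline", 168),
    ("Section-scoped chunking", 117),
    ("Section-scoped exemplar prompting", 71),
    ("voyage-law-2 embeddings", 43),
]

total_reduction = stages[0][1] - stages[-1][1]

# Pair each stage with the one before it to get the per-change drop.
deltas = [before - after for (_, before), (_, after) in zip(stages, stages[1:])]

for (name, _), delta in zip(stages[1:], deltas):
    share = delta / total_reduction
    print(f"{name}: -{delta} ({share:.0%} of the total reduction)")
```

Chunking and prompting each account for roughly two fifths of the total drop, and the embedding upgrade for a little over one fifth.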

Stack

Python 3.12, LangChain (retrieval orchestration), pgvector (vector storage), PostgreSQL, Anthropic Messages API (Claude Sonnet), Voyage AI embeddings (voyage-law-2), Sentence Transformers (baseline comparison), pytest (eval suite), GitHub Actions (CI).

Source: github.com/mralaminahamed/legal-rag


Al Amin Ahamed

Senior software engineer & AI practitioner. 5+ years shipping Laravel platforms, WordPress plugins, WooCommerce extensions, and AI-driven products.
