Prompt Versioning in Production: Managing LLM Prompts Like Code

LLM prompts are code. They have bugs, regressions, and breaking changes. They need to be reviewed, versioned, tested, and rolled back. Most teams treat them as configuration strings — hard-coded in environment variables, edited directly in production, with no record of what changed or why.

This post covers how EasyCommerce manages prompts across three providers (Claude, OpenAI, and DeepSeek), with version tracking, structured evals, and a rollback path that does not require a deployment.

Why prompts break in production

Prompt failures are subtle in ways that code failures are not. A code bug throws an exception. A prompt regression produces output that is slightly worse than before — shorter descriptions, more generic fraud scores, inventory forecasts that are systematically low by 15%. These do not trigger alerts. They degrade quietly until a merchant notices or you run an eval.

The failure modes we have seen in EasyCommerce's AI layer:

Provider model updates. Claude 3.5 Sonnet behaved differently on the same prompt than Claude Sonnet 4. Not worse, but differently — more verbose, different formatting conventions. Prompts tuned for one model version need adjustment when the model changes.

Regression on specific input shapes. A prompt that works well on typical product data produces garbage on products with very short titles, unicode characters, or prices in non-USD currencies. You only find these when you run evals against a diverse dataset.

Context window pressure. As we added more instructions to the fraud detection prompt, responses started truncating mid-analysis. The prompt grew past the point where the model had enough output tokens to complete its reasoning.

Instruction conflicts. Adding a new formatting requirement contradicted an existing instruction about response length. Claude resolved the conflict by ignoring the length instruction; GPT-4o resolved it by ignoring the formatting requirement.

The prompt registry

Prompts live in a version-controlled registry, not in environment variables or database records that can be edited ad hoc.

from dataclasses import dataclass
from enum import Enum


class PromptStatus(str, Enum):
    ACTIVE = "active"
    SHADOW = "shadow"    # Running alongside active for comparison
    DEPRECATED = "deprecated"
    ARCHIVED = "archived"


@dataclass
class PromptVersion:
    key: str               # e.g. "product_description_v3"
    family: str            # e.g. "product_description"
    version: int
    status: PromptStatus
    system: str
    user_template: str     # Jinja2 template
    provider: str          # "claude" | "openai" | "deepseek"
    model: str
    max_tokens: int
    temperature: float
    changelog: str         # Why this version differs from the previous one


PROMPT_REGISTRY: dict[str, PromptVersion] = {}


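def register_prompt(prompt: PromptVersion) -> PromptVersion:
    """Register a prompt version. Raises if a version conflict exists."""
    key = f"{prompt.family}_v{prompt.version}"
    # The key field duplicates family + version, so reject any mismatch early.
    if prompt.key != key:
        raise ValueError(f"Prompt key {prompt.key!r} does not match expected {key!r}.")
    if key in PROMPT_REGISTRY:
        raise ValueError(f"Prompt {key} is already registered.")
    PROMPT_REGISTRY[key] = prompt
    return prompt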

Each prompt has a family (the feature it serves) and a version number. The active version for each family is the one with status=ACTIVE. There should be exactly one active version per family at any time — this is enforced at startup.
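
The startup check itself is not shown in this post. A minimal sketch of what it could look like follows, assuming it only needs each prompt's family and status. `PromptStub` and `validate_registry` are hypothetical names; because `PromptStatus` subclasses `str`, comparing against the plain string "active" would also work with the real `PromptVersion` objects.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class PromptStub:
    """Hypothetical stand-in carrying only the fields the check needs."""
    family: str
    version: int
    status: str  # "active" | "shadow" | "deprecated" | "archived"


def validate_registry(prompts: list[PromptStub]) -> None:
    """Raise RuntimeError unless every family has exactly one active version."""
    families = {p.family for p in prompts}
    active_counts = Counter(p.family for p in prompts if p.status == "active")
    problems = [
        f"{family}: {active_counts[family]} active"
        for family in sorted(families)
        if active_counts[family] != 1
    ]
    if problems:
        raise RuntimeError("Invalid prompt registry: " + "; ".join(problems))
```

Calling something like this once after all prompt modules are imported reports every misconfigured family in a single error, instead of failing on the first request that touches one.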

Defining prompt versions

Prompts are defined as Python objects, not as strings in a database. This means they live in version control, changes are tracked via git history, and code review applies.

PRODUCT_DESCRIPTION_V2 = register_prompt(PromptVersion(
    key="product_description_v2",
    family="product_description",
    version=2,
    status=PromptStatus.ACTIVE,
    provider="claude",
    model="claude-sonnet-4-20250514",
    max_tokens=600,
    temperature=0.4,
    changelog="Reduced temperature from 0.7 to 0.4 to reduce hallucinated features. "
               "Added explicit instruction to avoid superlatives.",
    system="""You are a product copywriter for an eCommerce store.
Write concise, accurate product descriptions based on the attributes provided.
Rules:
- Use only the information provided. Do not invent features.
- Maximum 120 words.
- No superlatives (best, amazing, incredible, etc.).
- Write in second person ("you", "your").
- Include key specifications in the first sentence.""",
    user_template="""Product name: {{ product.name }}
Category: {{ product.category }}
Price: {{ product.price | currency }}
Attributes:
{% for key, value in product.attributes.items() %}
- {{ key }}: {{ value }}
{% endfor %}

Write a product description.""",
))

The changelog field is required for all versions after v1. It functions like a commit message — it should explain why the prompt changed, not what changed (the diff shows what changed).
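
One way to make that requirement hard to skip is to validate it when the object is constructed. The sketch below is an assumption, not the registry's actual code: `VersionedPrompt` is a hypothetical, trimmed-down stand-in for `PromptVersion`, and the same `__post_init__` check could be added to the real dataclass.

```python
from dataclasses import dataclass


@dataclass
class VersionedPrompt:
    """Hypothetical, trimmed-down version of PromptVersion."""
    family: str
    version: int
    changelog: str = ""

    def __post_init__(self) -> None:
        # v1 has no predecessor, so only later versions need a changelog.
        if self.version > 1 and not self.changelog.strip():
            raise ValueError(
                f"{self.family}_v{self.version} needs a changelog explaining why it changed."
            )
```

With this in place, a v2 defined without a changelog fails at import time, before it can reach review or CI.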

Rendering and calling

from jinja2 import Environment, StrictUndefined
import anthropic
import openai

jinja_env = Environment(undefined=StrictUndefined)
jinja_env.filters["currency"] = lambda v: f"${v:.2f}" if isinstance(v, (int, float)) else str(v)

claude_client = anthropic.AsyncAnthropic()
openai_client = openai.AsyncOpenAI()


async def call_prompt(
    family: str,
    context: dict,
    version: int | None = None,
) -> str:
    """
    Render and call the active (or specified) prompt for a given family.

    Args:
        family:  Prompt family name, e.g. "product_description".
        context: Template variables for the user message.
        version: Pin to a specific version. If None, uses the active version.
    """
    if version is not None:
        key = f"{family}_v{version}"
        prompt = PROMPT_REGISTRY.get(key)
        if prompt is None:
            raise ValueError(f"Prompt version {key} not found.")
    else:
        prompt = _get_active_prompt(family)

    template = jinja_env.from_string(prompt.user_template)
    user_message = template.render(**context)

    if prompt.provider == "claude":
        response = await claude_client.messages.create(
            model=prompt.model,
            max_tokens=prompt.max_tokens,
            temperature=prompt.temperature,
            system=prompt.system,
            messages=[{"role": "user", "content": user_message}],
        )
        return response.content[0].text

    elif prompt.provider == "openai":
        response = await openai_client.chat.completions.create(
            model=prompt.model,
            max_tokens=prompt.max_tokens,
            temperature=prompt.temperature,
            messages=[
                {"role": "system", "content": prompt.system},
                {"role": "user", "content": user_message},
            ],
        )
        # message.content can be None; fall back to an empty string to honor -> str.
        return response.choices[0].message.content or ""

    raise ValueError(f"Unknown provider: {prompt.provider}")


def _get_active_prompt(family: str) -> PromptVersion:
    active = [
        p for p in PROMPT_REGISTRY.values()
        if p.family == family and p.status == PromptStatus.ACTIVE
    ]
    if len(active) != 1:
        raise RuntimeError(
            f"Expected exactly 1 active prompt for family '{family}', found {len(active)}."
        )
    return active[0]

The version pin is how you run shadow comparisons: call the same feature with two versions in parallel, log both outputs, and compare them before promoting the new version to active.
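
A shadow comparison along those lines can be sketched as follows. This is an illustration under assumptions, not the production code: `shadow_compare` is a hypothetical helper, and the `call` argument stands for any coroutine with the same positional signature as `call_prompt(family, context, version)`, which keeps the example runnable without API keys.

```python
import asyncio
from typing import Awaitable, Callable

# Same positional signature as call_prompt(family, context, version).
PromptCaller = Callable[[str, dict, int], Awaitable[str]]


async def shadow_compare(
    call: PromptCaller,
    family: str,
    context: dict,
    active_version: int,
    shadow_version: int,
) -> dict:
    """Run both versions concurrently; return the active output plus the shadow result."""
    active_out, shadow_out = await asyncio.gather(
        call(family, context, active_version),
        call(family, context, shadow_version),
        return_exceptions=True,  # a shadow failure must not break the request
    )
    # A failing active call should still surface to the caller.
    if isinstance(active_out, BaseException):
        raise active_out
    shadow_failed = isinstance(shadow_out, BaseException)
    return {
        "served": active_out,
        "shadow": None if shadow_failed else shadow_out,
        "shadow_error": repr(shadow_out) if shadow_failed else None,
    }
```

Only the `served` value goes back to the user. The shadow output and any shadow error are there to be logged and compared before promotion.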

Structured evals

An eval is a function that takes a prompt output and returns a pass/fail judgment plus a score. Evals run in CI against a fixed dataset before any prompt version can be promoted to active.

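from dataclasses import dataclass

# Placeholder stop-word list so the heuristic below runs; extend it for real use.
COMMON_WORDS = {
    "a", "an", "and", "the", "this", "that", "with", "for",
    "to", "of", "in", "on", "is", "it", "you", "your",
}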


@dataclass
class EvalResult:
    passed: bool
    score: float        # 0.0–1.0
    reason: str


def eval_product_description(output: str, ground_truth: dict) -> EvalResult:
    """
    Eval for product description quality.

    Checks: length, no superlatives, no invented features, second-person voice.
    """
    # Tokenize once and strip punctuation so "best." and "you," match as words.
    words = [w.strip(".,;:!?\"'()") for w in output.lower().split()]
    output_words = set(words)

    # The prompt asks for 120 words; the eval allows slack up to 140.
    word_count = len(words)
    if word_count > 140:
        return EvalResult(False, 0.0, f"Too long: {word_count} words (prompt max 120, eval limit 140)")

    superlatives = ["best", "amazing", "incredible", "outstanding", "superior", "unmatched"]
    # Whole-word match, so "best" does not flag "bestseller".
    found = [s for s in superlatives if s in output_words]
    if found:
        return EvalResult(False, 0.2, f"Contains superlatives: {found}")

    # Check that no features are mentioned that weren't in the input.
    # Attribute values are split into words so multi-word values can match.
    input_features = {
        w.strip(".,;:!?\"'()")
        for v in ground_truth["attributes"].values()
        for w in str(v).lower().split()
    }
    # Rough heuristic; replace with LLM-as-judge for more accuracy
    unexpected = output_words - input_features - COMMON_WORDS
    if len(unexpected) > 30:  # too many words not in input
        return EvalResult(False, 0.4, "Possible hallucinated features")

    if "you" not in output_words and "your" not in output_words:
        return EvalResult(False, 0.6, "Not written in second person")

    return EvalResult(True, 1.0, "All checks passed")

The LLM-as-judge comment is genuine: for nuanced quality criteria, a small eval model (Claude Haiku, GPT-4o Mini) produces more reliable judgments than hand-written heuristics. The heuristics shown here are fast to run and good for catching obvious regressions; they are not a substitute for semantic quality assessment.
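
To gate promotion in CI, the per-output eval has to be run over the whole fixed dataset and reduced to one decision. The following is a minimal sketch of such a runner, assuming the dataset is a list of (output, ground_truth) pairs. `run_eval_suite` and the 0.95 threshold are hypothetical, and `EvalResult` is repeated only to keep the example self-contained.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalResult:  # same shape as the EvalResult defined above
    passed: bool
    score: float
    reason: str


def run_eval_suite(
    cases: list[tuple[str, dict]],
    eval_fn: Callable[[str, dict], EvalResult],
    min_pass_rate: float = 0.95,  # hypothetical promotion threshold
) -> dict:
    """Apply eval_fn to (output, ground_truth) pairs and summarize the results."""
    if not cases:
        raise ValueError("Eval dataset is empty.")
    results = [eval_fn(output, truth) for output, truth in cases]
    pass_rate = sum(r.passed for r in results) / len(results)
    return {
        "pass_rate": pass_rate,
        "mean_score": sum(r.score for r in results) / len(results),
        "failures": [r.reason for r in results if not r.passed],
        "promotable": pass_rate >= min_pass_rate,
    }
```

A CI job could call this with `eval_product_description` and exit non-zero when `promotable` is false, printing `failures` so the reviewer sees why.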

Rolling back without a deployment

The status field enables rollback without touching the codebase. Prompt text stays in code; only each version's status is persisted in the database, and a thin admin endpoint updates it at runtime:

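from fastapi import HTTPException  # assumption: the admin API is a FastAPI app

# `db` below is the application's own database layer (not shown).
# POST /admin/prompts/{family}/rollback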
async def rollback_prompt(family: str, to_version: int) -> dict:
    """
    Roll back a prompt family to a previous version.
    Marks the current active version as deprecated and activates the target.
    """
    current_active = _get_active_prompt(family)
    target_key = f"{family}_v{to_version}"
    target = PROMPT_REGISTRY.get(target_key)

    if target is None:
        raise HTTPException(status_code=404, detail=f"Version {to_version} not found")

    if target.status == PromptStatus.ARCHIVED:
        raise HTTPException(status_code=400, detail="Cannot activate an archived prompt")

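    # Persist status changes to the database.
    # Ideally both updates run in one transaction, so a failure between them
    # cannot leave the family with zero active versions.
    await db.update_prompt_status(current_active.key, PromptStatus.DEPRECATED)
    await db.update_prompt_status(target_key, PromptStatus.ACTIVE)

    # Update this process's in-memory registry to match the database.
    # Other worker processes need their own refresh to pick up the change.
    PROMPT_REGISTRY[current_active.key].status = PromptStatus.DEPRECATED
    PROMPT_REGISTRY[target_key].status = PromptStatus.ACTIVE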

    return {
        "rolled_back_from": current_active.key,
        "rolled_back_to": target_key,
    }

When a bad prompt ships to production and fraud scores start behaving strangely at 3am, this endpoint rolls back to the previous version in under a second — no deployment, no Slack emergency, no waking up a second engineer.

What this does not solve

Version control and rollback are necessary, not sufficient. The harder problem is knowing when a prompt needs to change. The eval suite catches regressions on the test dataset; it does not catch regressions on the long tail of production inputs.

The missing piece: logging prompt inputs and outputs in production, sampling a percentage of them, and running the eval suite against the sampled production distribution on a regular schedule. We have the logging. The scheduled eval run against production samples is on the roadmap but not yet built.

That gap — between knowing a prompt works on your test cases and knowing it works on what your users are actually sending — is where production AI systems fail quietly.

Al Amin Ahamed

Senior software engineer & AI practitioner. 5+ years shipping Laravel platforms, WordPress plugins, WooCommerce extensions, and AI-driven products.
