Prompt Versioning in Production: Managing LLM Prompts Like Code
LLM prompts are code. They have bugs, regressions, and breaking changes. They need to be reviewed, versioned, tested, and rolled back. Most teams treat them as configuration strings — hard-coded in environment variables, edited directly in production, with no record of what changed or why.
This post covers how EasyCommerce manages prompts across three providers (Claude, OpenAI, GPT-4o, DeepSeek), with version tracking, structured evals, and a rollback path that does not require a deployment.
Why prompts break in production
Prompt failures are subtle in ways that code failures are not. A code bug throws an exception. A prompt regression produces output that is slightly worse than before — shorter descriptions, more generic fraud scores, inventory forecasts that are systematically low by 15%. These do not trigger alerts. They degrade quietly until a merchant notices or you run an eval.
The failure modes we have seen in EasyCommerce's AI layer:
Provider model updates. Claude 3.5 Sonnet behaved differently on the same prompt than Claude Sonnet 4. Not worse, but differently — more verbose, different formatting conventions. Prompts tuned for one model version need adjustment when the model changes.
Regression on specific input shapes. A prompt that works well on typical product data produces garbage on products with very short titles, unicode characters, or prices in non-USD currencies. You only find these when you run evals against a diverse dataset.
Context window pressure. As we added more instructions to the fraud detection prompt, responses started truncating mid-analysis. The prompt grew past the point where the model had enough output tokens to complete its reasoning.
Instruction conflicts. Adding a new formatting requirement contradicted an existing instruction about response length. Claude resolved the conflict by ignoring the length instruction; GPT-4o resolved it by ignoring the formatting requirement.
The prompt registry
Prompts live in a version-controlled registry, not in environment variables or database records that can be edited ad hoc.
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable
class PromptStatus(str, Enum):
ACTIVE = "active"
SHADOW = "shadow" # Running alongside active for comparison
DEPRECATED = "deprecated"
ARCHIVED = "archived"
@dataclass
class PromptVersion:
key: str # e.g. "product_description_v3"
family: str # e.g. "product_description"
version: int
status: PromptStatus
system: str
user_template: str # Jinja2 template
provider: str # "claude" | "openai" | "deepseek"
model: str
max_tokens: int
temperature: float
changelog: str # What changed from the previous version
PROMPT_REGISTRY: dict[str, PromptVersion] = {}
def register_prompt(prompt: PromptVersion) -> PromptVersion:
"""Register a prompt version. Raises if a version conflict exists."""
key = f"{prompt.family}_v{prompt.version}"
if key in PROMPT_REGISTRY:
raise ValueError(f"Prompt {key} is already registered.")
PROMPT_REGISTRY[key] = prompt
return promptEach prompt has a family (the feature it serves) and a version number. The active version for each family is the one with status=ACTIVE. There should be exactly one active version per family at any time — this is enforced at startup.
Defining prompt versions
Prompts are defined as Python objects, not as strings in a database. This means they live in version control, changes are tracked via git history, and code review applies.
PRODUCT_DESCRIPTION_V2 = register_prompt(PromptVersion(
key="product_description_v2",
family="product_description",
version=2,
status=PromptStatus.ACTIVE,
provider="claude",
model="claude-sonnet-4-20250514",
max_tokens=600,
temperature=0.4,
changelog="Reduced temperature from 0.7 to 0.4 to reduce hallucinated features. "
"Added explicit instruction to avoid superlatives.",
system="""You are a product copywriter for an eCommerce store.
Write concise, accurate product descriptions based on the attributes provided.
Rules:
- Use only the information provided. Do not invent features.
- Maximum 120 words.
- No superlatives (best, amazing, incredible, etc.).
- Write in second person (\"you\", \"your\").
- Include key specifications in the first sentence.""",
user_template="""Product name: {{ product.name }}
Category: {{ product.category }}
Price: {{ product.price | currency }}
Attributes:
{% for key, value in product.attributes.items() %}
- {{ key }}: {{ value }}
{% endfor %}
Write a product description.""",
))The changelog field is required for all versions after v1. It functions like a commit message — it should explain why the prompt changed, not what changed (the diff shows what changed).
Rendering and calling
from jinja2 import Environment, StrictUndefined
import anthropic
import openai
jinja_env = Environment(undefined=StrictUndefined)
jinja_env.filters["currency"] = lambda v: f"${v:.2f}" if isinstance(v, (int, float)) else str(v)
claude_client = anthropic.AsyncAnthropic()
openai_client = openai.AsyncOpenAI()
async def call_prompt(
family: str,
context: dict,
version: int | None = None,
) -> str:
"""
Render and call the active (or specified) prompt for a given family.
Args:
family: Prompt family name, e.g. "product_description".
context: Template variables for the user message.
version: Pin to a specific version. If None, uses the active version.
"""
if version is not None:
key = f"{family}_v{version}"
prompt = PROMPT_REGISTRY.get(key)
if prompt is None:
raise ValueError(f"Prompt version {key} not found.")
else:
prompt = _get_active_prompt(family)
template = jinja_env.from_string(prompt.user_template)
user_message = template.render(**context)
if prompt.provider == "claude":
response = await claude_client.messages.create(
model=prompt.model,
max_tokens=prompt.max_tokens,
temperature=prompt.temperature,
system=prompt.system,
messages=[{"role": "user", "content": user_message}],
)
return response.content[0].text
elif prompt.provider == "openai":
response = await openai_client.chat.completions.create(
model=prompt.model,
max_tokens=prompt.max_tokens,
temperature=prompt.temperature,
messages=[
{"role": "system", "content": prompt.system},
{"role": "user", "content": user_message},
],
)
return response.choices[0].message.content
raise ValueError(f"Unknown provider: {prompt.provider}")
def _get_active_prompt(family: str) -> PromptVersion:
active = [
p for p in PROMPT_REGISTRY.values()
if p.family == family and p.status == PromptStatus.ACTIVE
]
if len(active) != 1:
raise RuntimeError(
f"Expected exactly 1 active prompt for family '{family}', found {len(active)}."
)
return active[0]The version pin is how you run shadow comparisons: call the same feature with two versions in parallel, log both outputs, and compare them before promoting the new version to active.
Structured evals
An eval is a function that takes a prompt output and returns a pass/fail judgment plus a score. Evals run in CI against a fixed dataset before any prompt version can be promoted to active.
from dataclasses import dataclass
@dataclass
class EvalResult:
passed: bool
score: float # 0.0–1.0
reason: str
def eval_product_description(output: str, ground_truth: dict) -> EvalResult:
"""
Eval for product description quality.
Checks: length, no superlatives, no invented features, second-person voice.
"""
word_count = len(output.split())
if word_count > 140:
return EvalResult(False, 0.0, f"Too long: {word_count} words (max 120)")
superlatives = ["best", "amazing", "incredible", "outstanding", "superior", "unmatched"]
found = [s for s in superlatives if s in output.lower()]
if found:
return EvalResult(False, 0.2, f"Contains superlatives: {found}")
# Check that no features are mentioned that weren't in the input
input_features = set(str(v).lower() for v in ground_truth["attributes"].values())
output_words = set(output.lower().split())
# Rough heuristic — replace with LLM-as-judge for more accuracy
unexpected = output_words - input_features - COMMON_WORDS
if len(unexpected) > 30: # too many words not in input
return EvalResult(False, 0.4, "Possible hallucinated features")
if "you" not in output.lower() and "your" not in output.lower():
return EvalResult(False, 0.6, "Not written in second person")
return EvalResult(True, 1.0, "All checks passed")The LLM-as-judge comment is genuine: for nuanced quality criteria, a small eval model (Claude Haiku, GPT-4o Mini) produces more reliable judgments than hand-written heuristics. The heuristics shown here are fast to run and good for catching obvious regressions; they are not a substitute for semantic quality assessment.
Rolling back without a deployment
The status field enables rollback without touching the codebase. A thin admin endpoint reads the registry from the database at runtime:
# POST /admin/prompts/{family}/rollback
async def rollback_prompt(family: str, to_version: int) -> dict:
"""
Roll back a prompt family to a previous version.
Marks the current active version as deprecated and activates the target.
"""
current_active = _get_active_prompt(family)
target_key = f"{family}_v{to_version}"
target = PROMPT_REGISTRY.get(target_key)
if target is None:
raise HTTPException(status_code=404, detail=f"Version {to_version} not found")
if target.status == PromptStatus.ARCHIVED:
raise HTTPException(status_code=400, detail="Cannot activate an archived prompt")
# Persist status changes to the database
await db.update_prompt_status(current_active.key, PromptStatus.DEPRECATED)
await db.update_prompt_status(target_key, PromptStatus.ACTIVE)
# Invalidate in-memory registry cache
PROMPT_REGISTRY[current_active.key].status = PromptStatus.DEPRECATED
PROMPT_REGISTRY[target_key].status = PromptStatus.ACTIVE
return {
"rolled_back_from": current_active.key,
"rolled_back_to": target_key,
}When a bad prompt ships to production and fraud scores start behaving strangely at 3am, this endpoint rolls back to the previous version in under a second — no deployment, no Slack emergency, no waking up a second engineer.
What this does not solve
Version control and rollback are necessary, not sufficient. The harder problem is knowing when a prompt needs to change. The eval suite catches regressions on the test dataset; it does not catch regressions on the long tail of production inputs.
The missing piece: logging prompt inputs and outputs in production, sampling a percentage of them, and running the eval suite against the sampled production distribution on a regular schedule. We have the logging. The scheduled eval run against production samples is on the roadmap but not yet built.
That gap — between knowing a prompt works on your test cases and knowing it works on what your users are actually sending — is where production AI systems fail quietly.
Al Amin Ahamed
Senior software engineer & AI practitioner. 5+ years shipping Laravel platforms, WordPress plugins, WooCommerce extensions, and AI-driven products.
About me →More from the blog